| 14:00-14:05 |
Welcome
| 14:05-14:50 |
Ioannis Papoutsis, National Technical University of Athens, Greece
Towards Reliable Earth Observation Foundation Models
Earth observation is undergoing a major transition. Instead of training a separate model and collecting new labels for every task, we are moving toward foundation models that learn from vast amounts of unlabelled satellite data. These self-supervised models capture common structures across sensors, geographic regions, and spatial scales, offering a more adaptable and scalable alternative to traditional approaches. Yet this shift raises critical questions: To what extent can a single model truly generalize? How robust is it when deployed in unfamiliar areas or on unseen data types? And crucially, how can we quantify its uncertainty?
In this talk, I will present recent advances in self-supervised training and zero-shot uncertainty estimation for EO foundation models. The focus is on how such models can become dependable tools for scientific discovery and operational applications—while recognizing their current limitations and the challenges that lie ahead.
Bio
Ioannis Papoutsis is an Assistant Professor of Remote Sensing and Artificial Intelligence at the National Technical University of Athens (NTUA), and an Adjunct Researcher at both the National Observatory of Athens and the Archimedes/Athena Research Center. He holds a diploma in Electrical and Computer Engineering and a PhD in Satellite Remote Sensing from NTUA, an MSc in Telecommunications from University College London, and an MBA from Alba Business School. He leads the OrionLab research group, which focuses on big satellite data analytics and machine learning for Earth Observation, with emphasis on natural disaster management and climate change impact monitoring. His research interests include foundation models in remote sensing, particularly self-supervised learning for multi-modal EO data, vision-language models for remote sensing image interpretation, and Earth system deep learning for spatiotemporal forecasting. He coordinates four research projects (ThinkingEarth, MeDiTwin, DeepCube, and SeasFire) which investigate the application of AI in addressing environmental challenges. He has also served as Operations Manager of the Greek node of the European Space Agency (ESA) Hubs for Sentinel data distribution, and as Copernicus Emergency Management Services Manager for Risk and Recovery.
| 14:50-15:05 |
Spatio-Temporal Forecasting of PS-InSAR Displacement with a PointNet-Inspired Deep Learning Model
Takayuki Shinohara (National Institute of Advanced Industrial Science and Technology, Japan)
Persistent Scatterer InSAR (PS-InSAR) yields a genuine three-dimensional point cloud: each scatterer is identified by fixed coordinates (x, y, z) and an accompanying displacement sequence D_{u,1}, ..., D_{u,T}. Most existing forecasting studies treat every series in isolation and, as a result, discard the spatial context that governs tectonic, volcanic, and anthropogenic deformation. We present PointNet-PSI, a spatio-temporal model that couples a PointNet-style point cloud encoder with MOMENT, a recent foundation model for general time-series prediction. The permutation-invariant PointNet front-end ingests the unordered PS-InSAR cloud, compresses local geometry and kinematic similarity into latent descriptors, and concatenates these descriptors with the raw displacement history. The enriched embeddings are passed to MOMENT's transformer backbone, which produces multi-step forecasts for every scatterer. In this hybrid design the network learns "where" through spatial aggregation of neighbouring points and "when" through MOMENT's long-range temporal attention, while retaining the large receptive field and data-efficient pre-training advantages of the base model. We validate the approach on the European Ground Motion Service Basic 2019-2023 vertical-velocity product. We adopt a hindcast protocol: observations from 2019-2020 serve as context, and all 60 samples of 2021 form the strictly held-out forecast horizon. Compared with strong per-point sequence models (LSTM, Temporal Fusion Transformer, and vanilla MOMENT) and a naive PointNet, PointNet-PSI reduces the test RMSE by about 17%.
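As a rough illustration of the hybrid design described above, the sketch below pairs a permutation-invariant PointNet-style encoder with a plain transformer encoder used as a stand-in for MOMENT. All layer sizes, the 48-step context length, and the 60-step horizon are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: per-point spatial descriptors are concatenated with each
# scatterer's displacement history before a temporal backbone forecasts ahead.
import torch
import torch.nn as nn

class PointNetEncoder(nn.Module):
    """Permutation-invariant encoder for an unordered PS-InSAR point cloud."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )

    def forward(self, xyz):                        # xyz: (B, N, 3)
        per_point = self.mlp(xyz)                  # (B, N, F) local descriptors
        global_feat = per_point.max(dim=1).values  # (B, F) max-pooled, order-invariant
        global_feat = global_feat.unsqueeze(1).expand_as(per_point)
        return torch.cat([per_point, global_feat], dim=-1)  # (B, N, 2F)

class PointNetPSISketch(nn.Module):
    """Spatial descriptors + displacement history -> multi-step forecast per scatterer."""
    def __init__(self, hist_len=48, horizon=60, feat_dim=64, d_model=128):
        super().__init__()
        self.spatial = PointNetEncoder(feat_dim)
        self.embed = nn.Linear(hist_len + 2 * feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for MOMENT
        self.head = nn.Linear(d_model, horizon)

    def forward(self, xyz, history):               # history: (B, N, hist_len)
        spatial = self.spatial(xyz)                # (B, N, 2F)
        tokens = self.embed(torch.cat([history, spatial], dim=-1))
        return self.head(self.temporal(tokens))    # (B, N, horizon)

# Toy usage: 2 tiles, 512 scatterers each, 48 past samples -> 60-step forecast.
model = PointNetPSISketch()
pred = model(torch.randn(2, 512, 3), torch.randn(2, 512, 48))
print(pred.shape)  # torch.Size([2, 512, 60])
```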
| 15:05-15:20 |
From Forest to Urban: Data Efficient Tree Segmentation with Self-Supervised Pretraining on Height-Based Voronoi Maps
Jonas Geiselhart (University of Stuttgart, Germany), Luca Reichmann (University of Stuttgart, Germany), Alina Roitberg (University of Hildesheim, Germany)
We propose a self-supervised pretraining framework for tree segmentation in airborne VHR imagery that exploits both color and infrared (RGBI) data and height maps. Our key idea is to pair height maps with Voronoi decomposition to create auto-labels, enabling pretraining without human annotations. The model is fine-tuned on a small, manually annotated urban dataset, with postprocessing refining results across diverse settings. To validate our idea, we introduce a composite dataset consisting of three parts: (1) an auto-labeled forest dataset used for height-driven pretraining, (2) an annotated urban tree dataset used for fine-tuning, and (3) a small test dataset with manually annotated trees for validation. Our approach achieves F1-scores of 0.65 (urban) and 0.60 (suburban), and demonstrates that the proposed height-driven pretraining outperforms conventional training by 0.44 in F1-score in urban environments. In summary, we contribute a fully automatic framework for detecting trees in large and diverse regions using models trained with a simple self-supervised mechanism that exploits height data from forest regions. Additionally, we analyze the transfer capabilities with a small fine-tuning dataset. Code, models, and data are available on GitHub.
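The height-driven auto-labelling idea can be sketched roughly as follows: local height maxima act as tree-top seeds, a nearest-seed (Voronoi) partition splits the canopy into per-tree regions, and a height threshold removes ground. The thresholds, window size, and the helper name voronoi_auto_labels are assumptions for illustration only, not the authors' pipeline.

```python
# Illustrative auto-label generation from a canopy height model (CHM).
import numpy as np
from scipy.ndimage import maximum_filter, distance_transform_edt

def voronoi_auto_labels(height, min_tree_height=3.0, peak_window=11):
    """Return an integer label map (0 = background, k = tree k) from a height map."""
    canopy = height >= min_tree_height                        # drop ground / low vegetation
    local_max = (height == maximum_filter(height, size=peak_window)) & canopy
    seeds = np.argwhere(local_max)                             # (K, 2) tree-top pixels

    seed_map = np.zeros(height.shape, dtype=np.int32)
    seed_map[tuple(seeds.T)] = np.arange(1, len(seeds) + 1)    # unique id per seed

    # Nearest-seed assignment == discrete Voronoi decomposition of the image grid.
    _, (iy, ix) = distance_transform_edt(seed_map == 0, return_indices=True)
    labels = seed_map[iy, ix]
    labels[~canopy] = 0                                        # keep only canopy pixels
    return labels

# Toy usage on a random "height map"; real inputs would be nDSM/CHM rasters.
chm = np.random.rand(256, 256) * 20
print(np.unique(voronoi_auto_labels(chm)).size, "regions (incl. background)")
```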
| 15:20-15:35 |
Distribution Modeling and GenAI-Assisted Projection for SAR Incremental Learning
Heqing Huang (Beihang University, China & University of Stirling, UK), Fei Gao (Beihang University, China), Vahid Akbari (University of Stirling, UK)
In class-incremental learning for synthetic aperture radar (SAR) imagery, models must acquire new categories while retaining knowledge of previous ones. Generative replay can mitigate forgetting by synthesizing old-class samples. However, vanilla generative networks, such as the variational autoencoder (VAE), prioritize pixel-level reconstruction and do not inherently enforce class separability, which may not be optimal for incremental recognition. To address this issue, we model the class-wise latent distributions of the training data via flow-based density estimation, enabling the generation of representative, in-distribution exemplars. Combined with current-task data, these exemplars support a feature projection between the old and new latent spaces, from which a numerically optimized closed-form classifier is reconstructed. This dual use of the learned distributions both constrains generative replay to in-distribution regions and calibrates decision boundaries to reduce drift. Experiments on SAR benchmarks demonstrate that our approach achieves state-of-the-art accuracy while maintaining a superior stability-plasticity trade-off.
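A simplified sketch of the replay-and-reconstruct idea: class-wise latent densities are fitted (with Gaussians standing in for the flow-based estimator described in the abstract), old-class exemplars are sampled from them, and a closed-form ridge classifier is re-estimated from replayed plus current-task features. All names, dimensions, and the ridge formulation are illustrative assumptions, not the authors' method.

```python
# Distribution-based replay with an analytically solved classifier (toy version).
import numpy as np

def fit_class_densities(features, labels):
    """Per-class mean/covariance of latent features; a normalizing flow would replace this."""
    stats = {}
    for c in np.unique(labels):
        f = features[labels == c]
        stats[c] = (f.mean(axis=0), np.cov(f, rowvar=False) + 1e-4 * np.eye(f.shape[1]))
    return stats

def sample_exemplars(stats, n_per_class, rng):
    feats, labs = [], []
    for c, (mu, cov) in stats.items():
        feats.append(rng.multivariate_normal(mu, cov, size=n_per_class))
        labs.append(np.full(n_per_class, c))
    return np.concatenate(feats), np.concatenate(labs)

def closed_form_classifier(features, labels, n_classes, lam=1e-2):
    """Ridge-regression classifier W solved analytically on one-hot targets."""
    Y = np.eye(n_classes)[labels]                              # (N, C) one-hot targets
    X = features                                               # (N, D) latent features
    W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
    return W                                                   # predict: argmax(X @ W)

# Toy incremental step: old classes {0, 1} replayed, new class 2 added.
rng = np.random.default_rng(0)
old_stats = fit_class_densities(rng.normal(size=(200, 16)) + 2, rng.integers(0, 2, 200))
replay_x, replay_y = sample_exemplars(old_stats, 100, rng)
new_x, new_y = rng.normal(size=(100, 16)) - 2, np.full(100, 2)
W = closed_form_classifier(np.vstack([replay_x, new_x]),
                           np.concatenate([replay_y, new_y]).astype(int), n_classes=3)
print(W.shape)  # (16, 3)
```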
| 15:35-15:50 |
Open-Vocabulary Semantic Segmentation in Remote Sensing via Hierarchical Attention Masking and Model Composition
Mohammadreza Heidarianbaei (Leibniz University Hannover, Germany), Mareike Dorozynski (Leibniz University Hannover, Germany), Hubert Kanyamahanga (Leibniz University Hannover, Germany), Max Mehltretter (Leibniz University Hannover, Germany), Franz Rottensteiner (Leibniz University Hannover, Germany)
In this paper, we propose ReSeg-CLIP, a new training-free open-vocabulary semantic segmentation method for remote sensing data. To compensate for the problems of vision-language models such as CLIP in semantic segmentation, caused by inappropriate interactions within the self-attention layers, we introduce a hierarchical scheme that uses masks generated by SAM to constrain these interactions at multiple scales. We also present a model composition approach that averages the parameters of multiple RS-specific CLIP variants, taking advantage of a new weighting scheme that evaluates representational quality using varying text prompts. Our method achieves state-of-the-art results across three RS benchmarks without additional training.
https://github.com/aemrhb/ReSeg-CLIP
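The model-composition step can be illustrated as a weighted average of parameter tensors from architecturally identical CLIP variants. The compose_models helper and the fixed weights below are assumptions for illustration; in the paper the weights come from the prompt-based representational-quality scoring scheme.

```python
# Weighted parameter averaging of architecturally identical models (sketch).
import torch

def compose_models(state_dicts, weights):
    """Weighted average of parameter tensors from models sharing one architecture."""
    weights = torch.tensor(weights, dtype=torch.float32)
    weights = weights / weights.sum()                      # normalise to a convex combination
    composed = {}
    for name in state_dicts[0]:
        stacked = torch.stack([sd[name].float() for sd in state_dicts])
        composed[name] = (weights.view(-1, *[1] * (stacked.dim() - 1)) * stacked).sum(0)
    return composed

# Toy usage with two tiny "models" sharing the same parameter names and shapes.
m1 = torch.nn.Linear(8, 4).state_dict()
m2 = torch.nn.Linear(8, 4).state_dict()
merged = compose_models([m1, m2], weights=[0.7, 0.3])
print({k: v.shape for k, v in merged.items()})
```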
| 15:50-16:15 |
Coffee/Tea Break
| 16:15-17:00 |
Javiera Castillo Navarro, Conservatoire National des Arts et Métiers, France
Vision and language in remote sensing: Learning what images cannot tell us alone
Earth Observation (EO) data analysis plays a major role in how we understand our planet and its dynamics. However, the models we usually rely on for this analysis often use rigid taxonomies and single-modality supervision, far from the flexible way humans perceive and interpret the world. Human understanding is inherently multimodal: we combine vision, language, and contextual knowledge to make sense of complex environments. How can we build models that move in this more “human-like” direction?
Recent advances in vision–language and multimodal learning offer a promising path. By integrating satellite imagery with natural language descriptions, ecological knowledge, or other complementary modalities, these models can capture richer semantics, transfer across datasets, and interact with users through open-ended queries. They enable remote sensing systems to go beyond fixed labels and learn more general, expressive representations of the Earth’s surface.
In this talk, we will explore recent works showing how multimodal integration can reshape remote sensing: from open-vocabulary understanding to weakly supervised ecological grounding, and toward models that capture both shared and complementary cross-modal information. These advances pave the way for more flexible mapping and deeper environmental and scientific insights.
Bio
Javiera Castillo Navarro is an Assistant Professor (maître de conférences) at Cnam, Paris. She previously worked as a postdoctoral researcher at EPFL in the ECEO laboratory, focusing on vision–language models and multimodal learning. She completed her PhD at Université de Bretagne-Sud, where she developed semi-supervised learning methods for semantic segmentation and classification of Earth observation images. Her research centers on computer vision, representation learning, and multimodal learning, with a particular interest in vision and language models.
| 17:00-17:15 |
Enhancing Marine Pollution Detection in Remote Sensing via Self-Supervised Boundary Awareness
Shuaiyu Chen (University of Exeter, UK), Chunbo Luo (University of Exeter, UK), Peng Ren (China University of Petroleum), Zeyu Fu (University of Exeter, UK)
Accurate marine pollution detection (MPD) is challenging due to the vague, irregular, and low-contrast nature of pollutant boundaries. Existing boundary-aware remote sensing segmentation methods often rely on explicit boundary annotations or hand-crafted attention modules, limiting their effectiveness in marine environments where annotations are scarce and structures are complex. In this work, we introduce a fully Self-Supervised Boundary-Awareness (SSBA) block that can be seamlessly integrated into existing segmentation architectures for MPD. Our SSBA block combines a VSS-based global extractor, a boundary-focused local extractor with deformable and frequency features, and an attention-guided fusion module that adaptively combines semantics and edges for boundary-aware prediction. To further enhance spatial sensitivity, we develop a boundary-aware attention module trained via boundary reconstruction, enabling dynamic focus on critical boundary regions. Experimental results on two marine pollution datasets show that our method consistently provides state-of-the-art performance, particularly under weak boundary conditions.
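A very rough sketch of the attention-guided fusion and of a self-supervised boundary signal: plain convolutions stand in for the VSS-based global extractor and the deformable/frequency local extractor, and a Sobel edge map computed from the input serves as an annotation-free reconstruction target. Everything below is an illustrative assumption rather than the authors' architecture.

```python
# Two-branch block with attention-gated fusion and self-supervised edge reconstruction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSBASketch(nn.Module):
    def __init__(self, in_ch=3, ch=32):
        super().__init__()
        self.global_branch = nn.Sequential(nn.Conv2d(in_ch, ch, 7, padding=3), nn.ReLU())
        self.local_branch = nn.Sequential(nn.Conv2d(in_ch, ch, 3, padding=1), nn.ReLU())
        self.gate = nn.Sequential(nn.Conv2d(2 * ch, 1, 1), nn.Sigmoid())  # attention-guided fusion
        self.boundary_head = nn.Conv2d(ch, 1, 1)                          # edge reconstruction head

    def forward(self, x):
        g, l = self.global_branch(x), self.local_branch(x)
        a = self.gate(torch.cat([g, l], dim=1))            # (B, 1, H, W) fusion weights
        fused = a * l + (1 - a) * g                         # lean on the local branch where a is high
        return fused, torch.sigmoid(self.boundary_head(fused))

def sobel_edges(x):
    """Self-supervised boundary target computed from the input image itself."""
    gray = x.mean(dim=1, keepdim=True)
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx, gy = F.conv2d(gray, kx, padding=1), F.conv2d(gray, ky, padding=1)
    mag = torch.sqrt(gx ** 2 + gy ** 2)
    return mag / (mag.amax(dim=(2, 3), keepdim=True) + 1e-6)

# Toy usage: the auxiliary loss trains boundary awareness without annotations.
x = torch.rand(2, 3, 64, 64)
fused, edge_pred = SSBASketch()(x)
loss = F.binary_cross_entropy(edge_pred, sobel_edges(x))
print(fused.shape, loss.item())
```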
| 17:15-17:30 |
SpecBPP: A Self-Supervised Learning Approach for Hyperspectral Representation and Soil Organic Carbon Estimation
Daniel La’ah Ayuba (University of Surrey, UK), Jean-Yves Guillemaut (University of Surrey, UK), Belen Marti-Cardona (University of Surrey, UK), Oscar Mendez Maldonado (University of Surrey, UK)
Self-supervised learning has revolutionized representation learning in vision and language, but remains underexplored for hyperspectral imagery (HSI), where the sequential structure of spectral bands offers unique opportunities. In this work, we propose Spectral Band Permutation Prediction (SpecBPP), a novel self-supervised learning framework that leverages the inherent spectral continuity in HSI. Instead of reconstructing masked bands, SpecBPP challenges a model to recover the correct order of shuffled spectral segments, encouraging global spectral understanding. We implement a curriculum-based training strategy that progressively increases permutation difficulty to manage the factorial complexity of the permutation space. Applied to Soil Organic Carbon (SOC) estimation using EnMAP satellite data, our method achieves state-of-the-art results, outperforming both masked autoencoder (MAE) and joint-embedding predictive architecture (JEPA) baselines. Fine-tuned on limited labeled samples, our model yields an R² of 0.9456, an RMSE of 1.1053%, and an RPD of 4.19, significantly surpassing traditional and self-supervised benchmarks. Our results demonstrate that spectral order prediction is a powerful pretext task for hyperspectral understanding, opening new avenues for scientific representation learning in remote sensing and beyond.
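The permutation-prediction pretext can be sketched as follows: the spectrum is cut into segments, the segments are shuffled, and the model predicts each segment's original slot. Segment count, model sizes, and the per-segment position head are assumptions for illustration; the curriculum described in the abstract would gradually increase the number of segments and thus the permutation difficulty.

```python
# Toy band-permutation pretext task on per-pixel spectra.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpecBPPSketch(nn.Module):
    def __init__(self, bands=224, segments=8, d_model=128):
        super().__init__()
        self.segments, self.seg_len = segments, bands // segments
        self.embed = nn.Linear(self.seg_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.position_head = nn.Linear(d_model, segments)    # which original slot?

    def forward(self, spectra):                               # spectra: (B, bands)
        segs = spectra.view(spectra.size(0), self.segments, self.seg_len)
        return self.position_head(self.encoder(self.embed(segs)))  # (B, S, S) logits

def permute_batch(spectra, segments):
    """Shuffle spectral segments; return shuffled spectra and original positions."""
    b, bands = spectra.shape
    segs = spectra.view(b, segments, bands // segments)
    perm = torch.stack([torch.randperm(segments) for _ in range(b)])
    shuffled = torch.gather(segs, 1, perm.unsqueeze(-1).expand_as(segs))
    return shuffled.reshape(b, bands), perm                   # perm[i, j] = original slot of segment j

# Toy pretext step on random spectra (EnMAP-like band count).
x = torch.rand(4, 224)
shuffled, target = permute_batch(x, segments=8)
logits = SpecBPPSketch()(shuffled)
loss = F.cross_entropy(logits.reshape(-1, 8), target.reshape(-1))
print(loss.item())
```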
| 17:30-17:55 |
Challenge Results
| 17:55-18:00 |
Closing |