MsEdF: A Multi-stream Encoder-decoder Framework for Remote Sensing Image Captioning
Swadhin Das, Raksha Sharma
TL;DR
This work tackles remote sensing image captioning (RSIC) by addressing the limitations of single-stream encoder–decoder models in capturing complex spatial patterns and semantic relationships. It introduces MsEdF, a multi-stream framework that fuses two complementary CNN encoders to enrich visual features and employs a stacked GRU decoder with Local Weighted Stacking to preserve intermediate linguistic cues and improve semantic modeling. The authors provide a rigorous ablation study and show that encoder fusion plus LWS consistently yields state-of-the-art or highly competitive results across SYDNEY, UCM, and RSICD datasets, with notable gains in CIDEr and BLEU metrics. The approach offers a practical path toward more accurate and context-aware RSIC, with potential extensions in retrieval-guided decoding and domain-adaptive training for unseen imagery.
Abstract
Remote sensing images contain complex spatial patterns and semantic structures, which make them difficult for captioning models to describe accurately. Encoder-decoder architectures have become a widely used approach to remote sensing image captioning (RSIC), translating visual content into descriptive text. However, many existing methods rely on a single-stream architecture, which limits the model's ability to describe the image accurately. Such architectures typically struggle to extract diverse spatial features or capture complex semantic relationships, limiting their effectiveness in scenes with high intraclass similarity or contextual ambiguity. In this work, we propose a novel Multi-stream Encoder-decoder Framework (MsEdF), which improves RSIC performance by optimizing both the spatial representation and the language generation of the encoder-decoder architecture. The encoder fuses information from two complementary image encoders, promoting feature diversity through the integration of multiscale and structurally distinct cues. On the decoder side, to produce more context-aware descriptions, we refine the semantic modeling of the input sequence using a stacked GRU architecture with an element-wise aggregation scheme. Experiments on three benchmark RSIC datasets show that MsEdF outperforms several baseline models.
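
As a rough illustration of the two ideas named above (fusing features from two encoders, and aggregating the outputs of stacked decoder layers element-wise), the following minimal sketch uses plain Python lists. It is not the authors' implementation: the abstract does not specify the fusion rule or the aggregation weights, so concatenation, the weight normalization, and the function names `fuse_features` and `weighted_elementwise_sum` are all assumptions made for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def fuse_features(feat_a: Sequence[float], feat_b: Sequence[float]) -> Vector:
    """Fuse two encoder feature vectors.

    Assumption: fusion is plain concatenation; the abstract does not
    state which fusion rule MsEdF actually uses.
    """
    return list(feat_a) + list(feat_b)


def weighted_elementwise_sum(
    layer_outputs: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> Vector:
    """Aggregate the outputs of stacked layers element-wise.

    Assumption: each layer gets one scalar weight, and the weights are
    normalized to sum to 1 before the element-wise weighted sum.
    """
    if not layer_outputs:
        raise ValueError("need at least one layer output")
    if len(layer_outputs) != len(weights):
        raise ValueError("need exactly one weight per layer output")
    dim = len(layer_outputs[0])
    if any(len(out) != dim for out in layer_outputs):
        raise ValueError("all layer outputs must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    normalized = [w / total for w in weights]
    # Position i of the result mixes position i of every layer output.
    return [
        sum(w * out[i] for w, out in zip(normalized, layer_outputs))
        for i in range(dim)
    ]


if __name__ == "__main__":
    # Hypothetical feature vectors from two image encoders.
    fused = fuse_features([0.2, 0.8], [0.5, 0.1, 0.9])
    print(len(fused))

    # Hypothetical hidden states from two stacked decoder layers.
    print(weighted_elementwise_sum([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0]))
```

In a real model the layer outputs would be GRU hidden states and the weights would typically be learned; the sketch only shows how intermediate layer outputs can contribute to the final representation instead of being discarded.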
