A TextGCN-Based Decoding Approach for Improving Remote Sensing Image Captioning

Swadhin Das; Raksha Sharma

TL;DR

This work tackles remote sensing image captioning by integrating TextGCN-derived word embeddings into a multi-layer LSTM encoder-decoder and by generating the final caption with a fairness-aware, comparison-based beam search. TextGCN captures word relationships at the sentence and corpus levels, using a PMI-based adjacency matrix and a two-layer graph convolution with a precomputed, non-trainable embedding matrix. The model uses a ResNet-based image encoder with $2048$-dimensional features, an embedding size of $256$, and LSTM hidden sizes of $256$ and $512$. It achieves superior performance on RSICD across seven metrics: BLEU-$1$ to BLEU-$4$, METEOR, ROUGE-L, and CIDEr. The results indicate that addressing domain-specific vocabulary and search fairness yields clear gains in caption quality, and qualitative examples show improved descriptions of remote sensing (RS) imagery.
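To make the TextGCN step more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a word-word graph built from sliding-window co-occurrence with positive PMI weights and self-loops, symmetric normalization of the adjacency matrix, and a two-layer graph convolution with a ReLU in between. The window size, the hidden width of 64, the toy captions, and the random (untrained) weights are illustrative assumptions; only the output embedding size of 256 comes from the summary above. The full TextGCN graph also contains document nodes, which are omitted here.

```python
import math
from collections import Counter
from itertools import combinations

import numpy as np


def pmi_adjacency(sentences, window=3):
    """Word-word adjacency with positive PMI weights and self-loops."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}

    # Collect sliding windows over every tokenized caption.
    windows = []
    for s in sentences:
        if len(s) <= window:
            windows.append(s)
        else:
            windows.extend(s[i:i + window] for i in range(len(s) - window + 1))

    word_count = Counter()  # number of windows containing a word
    pair_count = Counter()  # number of windows containing both words
    for w in windows:
        unique = sorted(set(w))
        word_count.update(unique)
        pair_count.update(combinations(unique, 2))

    n = len(windows)
    adj = np.eye(len(vocab))  # self-loops on the diagonal
    for (a, b), c in pair_count.items():
        # PMI = log( p(a, b) / (p(a) * p(b)) ) with window-based estimates.
        pmi = math.log(c * n / (word_count[a] * word_count[b]))
        if pmi > 0:  # keep only positively associated pairs
            adj[index[a], index[b]] = adj[index[b], index[a]] = pmi
    return vocab, adj


def normalize(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2}."""
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def two_layer_gcn(adj_norm, x, w0, w1):
    """Two graph convolutions with a ReLU after the first layer."""
    hidden = np.maximum(adj_norm @ x @ w0, 0.0)
    return adj_norm @ hidden @ w1


# Toy captions (hypothetical, for shape illustration only).
captions = [
    ["many", "planes", "are", "parked", "near", "the", "terminal"],
    ["a", "river", "runs", "through", "the", "dense", "forest"],
]
vocab, adj = pmi_adjacency(captions)

rng = np.random.default_rng(0)
x = np.eye(len(vocab))                       # one-hot node features
w0 = rng.normal(size=(len(vocab), 64))       # hidden width 64 is an assumption
w1 = rng.normal(size=(64, 256))              # embedding size 256, as in the summary
embeddings = two_layer_gcn(normalize(adj), x, w0, w1)
```

In a setup like the one summarized above, the weights would be trained first and the resulting matrix then frozen, so that each row serves as a fixed word embedding for the decoder. The random weights here only demonstrate the data flow and shapes.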

Abstract

Remote sensing images are highly valued for their ability to address complex real-world issues such as risk management, security, and meteorology. However, manually captioning these images is challenging and requires specialized knowledge across various domains. This letter presents an approach for automatically describing (captioning) remote sensing images. We propose a novel encoder-decoder setup that deploys a Text Graph Convolutional Network (TextGCN) and multi-layer LSTMs. The embeddings generated by TextGCN enhance the decoder's understanding by capturing the semantic relationships among words at both the sentence and corpus levels. Furthermore, we advance our approach with a comparison-based beam search method to ensure fairness in the search strategy for generating the final caption. We present an extensive evaluation of our approach against various other state-of-the-art encoder-decoder frameworks. We evaluate our method on three datasets using seven metrics: BLEU-1 to BLEU-4, METEOR, ROUGE-L, and CIDEr. The results demonstrate that our approach significantly outperforms other state-of-the-art encoder-decoder methods.

Paper Structure (22 sections, 1 equation, 3 figures, 3 tables)

Introduction
Proposed Method
Encoded Representation of Image
Encoded Representation of the Input Text
Text Graph Convolution Network
Comparison Based Beam Search
Multi-Layer Decoding Strategy
Experiments
Dataset and Performance Metrics
Experimental Setup
Results and Analysis
The Performance of the Proposed Method
The Effect of Pretrained Word Embeddings
Effect of Embedding Vector Size of TextGCN on Our Model
Visual Examples
...and 7 more sections

Figures (3)

Figure 1: Architecture of the Proposed Model
Figure 2: Visual Examples of Proposed RSIC Model
Figure 3: Visual Examples of Proposed RSIC Model
