Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning

Yunbin Tu, Liang Li, Li Su, Chenggang Yan, Qingming Huang

TL;DR

The paper tackles change captioning under distractors such as illumination and viewpoint changes by introducing Distractors-Immune Representation Learning (DIRL), which correlates the corresponding channels of the two image representations and decorrelates the others, yielding stable, discriminative representations. Cross-modal Contrastive Regularization (CCR) is then applied during decoding to align the attended difference features with the generated words through an InfoNCE objective. Together, the two components learn robust difference features and improve cross-modal alignment, and the method outperforms prior state-of-the-art methods on four public datasets. The work also includes ablations, an analysis under varying distractors, and publicly available code.
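To make the InfoNCE objective mentioned above concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the cosine-similarity scoring, the temperature value, and the one-directional form (difference features to words only) are all choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(diff_feats, word_feats, temperature=0.1):
    """Mean InfoNCE loss over a batch (hypothetical helper).

    diff_feats[i] and word_feats[i] are treated as the positive pair;
    every other word_feats[j] in the batch acts as a negative.
    """
    n = len(diff_feats)
    total = 0.0
    for i in range(n):
        logits = [cosine(diff_feats[i], word_feats[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the positive pair.
        total += log_denom - logits[i]
    return total / n


# Toy batch: two attended difference features and two caption embeddings.
diff = [[1.0, 0.0], [0.0, 1.0]]
aligned_words = [[1.0, 0.0], [0.0, 1.0]]
shuffled_words = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(diff, aligned_words)
loss_shuffled = info_nce(diff, shuffled_words)
```

In this toy batch the loss is close to zero when each difference feature matches its own caption embedding and grows large when the pairs are shuffled, which is the behavior a contrastive alignment term relies on.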

Abstract

Change captioning aims to succinctly describe the semantic change between a pair of similar images, while being immune to distractors (illumination and viewpoint changes). Under these distractors, unchanged objects often exhibit pseudo changes in location and scale, and certain objects might overlap others, resulting in perturbed and less discriminative features between the two images. However, most existing methods directly capture the difference between them, which risks obtaining error-prone difference features. In this paper, we propose a distractors-immune representation learning network that correlates the corresponding channels of two image representations and decorrelates different ones in a self-supervised manner, thus attaining a pair of stable image representations under distractors. The model can then better model the interaction between them to capture reliable difference features for caption generation. To yield words based on the most related difference features, we further design a cross-modal contrastive regularization, which regularizes the cross-modal alignment by maximizing the contrastive alignment between the attended difference features and generated words. Extensive experiments show that our method outperforms the state-of-the-art methods on four public datasets. The code is available at https://github.com/tuyunbin/DIRL.
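The idea of correlating corresponding channels and decorrelating different ones can be sketched as a loss on the cross-correlation matrix of the two image representations. The snippet below is a hedged illustration in the style of redundancy-reduction objectives such as Barlow Twins; the abstract does not give the exact formulation, so the function names, the per-channel standardization, and the `off_diag_weight` value are assumptions made for this example.

```python
import math


def standardize_columns(x):
    """Zero-mean, unit-variance normalization of each channel (column)."""
    n, c = len(x), len(x[0])
    out = [[0.0] * c for _ in range(n)]
    for j in range(c):
        col = [row[j] for row in x]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n) + 1e-8
        for i in range(n):
            out[i][j] = (col[i] - mean) / std
    return out


def channel_correlation_loss(feat_a, feat_b, off_diag_weight=0.005):
    """Hypothetical loss: pull same-channel correlation toward 1 and
    cross-channel correlation toward 0.

    feat_a, feat_b: lists of rows, one row per spatial position and one
    column per channel, for the "before" and "after" images.
    """
    za, zb = standardize_columns(feat_a), standardize_columns(feat_b)
    n, c = len(za), len(za[0])
    on_diag, off_diag = 0.0, 0.0
    for i in range(c):
        for j in range(c):
            # Entry (i, j) of the channel cross-correlation matrix.
            corr = sum(za[k][i] * zb[k][j] for k in range(n)) / n
            if i == j:
                on_diag += (1.0 - corr) ** 2
            else:
                off_diag += corr ** 2
    return on_diag + off_diag_weight * off_diag


# Toy features: 4 positions, 2 uncorrelated channels.
before = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
# Same content under a global scale and shift (an illumination-like change).
after_same = [[2.0 * v + 3.0 for v in row] for row in before]
# Channels swapped, so corresponding channels no longer agree.
after_swapped = [[row[1], row[0]] for row in before]

loss_same = channel_correlation_loss(before, after_same)
loss_swapped = channel_correlation_loss(before, after_swapped)
```

Because each channel is standardized before the correlation is computed, a global scale or shift of the features leaves the loss near zero, while mismatched channels are penalized. This is one plausible reading of why such an objective would give representations that are stable under distractors.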

Paper Structure (25 sections, 16 equations, 12 figures, 12 tables)

Figures (12)

  • Figure 1: Examples of change captioning under different scenarios. The first and second cases show an object being moved and dropped, respectively. The third shows a color change under distractors (viewpoint and illumination changes), where the real change is overwhelmed by pseudo changes. The last shows a case with only distractors. Changed objects are marked with red boxes.
  • Figure 2: The framework of our method, whose core blocks are the distractors-immune representation learning network and the cross-modal contrastive regularization. FC and Concat are short for fully-connected layer and concatenation operation, respectively.
  • Figure 3: Visualization of change localization and captioning results of DIRL+CCR and Transformer. GT is short for ground-truth and changed objects are shown in red boxes.
  • Figure 4: Visualization of captioning performance under varied viewpoint changes.
  • Figure 5: The effects of the two trade-off parameters $\lambda_d$ and $\lambda_c$ on CLEVR-DC.
  • ...and 7 more figures