Leveraging Textual Compositional Reasoning for Robust Change Captioning

Kyu Ri Park, Jiyoung Park, Seong Tae Kim, Hong Joo Lee, Jung Uk Kim

TL;DR

The paper addresses the limitation of visual-only change captioning by introducing CORTEX, a plug-and-play framework that injects explicit textual compositional reasoning into change detection. It deploys a Reasoning-aware Text Extraction module to generate relational descriptions via a Vision-Language Model and an Image-Text Dual Alignment module to fuse static intra-scene and dynamic cross-scene cues with visual features. Through the alignment of text and image representations and the combination with image-level change cues, CORTEX achieves consistent improvements on CLEVR-Change, CLEVR-DC, and Spot-the-Diff datasets, demonstrating improved fine-grained relational understanding and robustness to viewpoint changes. The work highlights the value of textual reasoning signals in enhancing change captioning, offering a practical, modular enhancement for existing visual detectors.

Abstract

Change captioning aims to describe changes between a pair of images. However, existing works rely on visual features alone, which often fail to capture subtle but meaningful changes because they lack the ability to represent explicitly structured information such as object relationships and compositional semantics. To alleviate this, we present CORTEX (COmpositional Reasoning-aware TEXt-guided), a novel framework that integrates complementary textual cues to enhance change understanding. In addition to capturing cues from pixel-level differences, CORTEX utilizes scene-level textual knowledge provided by Vision Language Models (VLMs) to extract richer image-text signals that reveal underlying compositional reasoning. CORTEX consists of three key modules: (i) an Image-level Change Detector that identifies low-level visual differences between paired images, (ii) a Reasoning-aware Text Extraction (RTE) module that uses VLMs to generate compositional reasoning descriptions implicit in visual features, and (iii) an Image-Text Dual Alignment (ITDA) module that aligns visual and textual features for fine-grained relational reasoning. This enables CORTEX to reason over visual and textual features and capture changes that are otherwise ambiguous in visual features alone.
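
As a rough illustration of the pairing idea behind the ITDA module (static alignment within a scene, dynamic alignment across scenes, with the results concatenated), the sketch below scores toy feature vectors. It is not the authors' implementation: the function names, the use of cosine similarity as the alignment score, and the 2-D toy features are all assumptions made purely for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def dual_alignment_scores(img_before, img_after, txt_before, txt_after):
    """Hypothetical helper: pair image and text features two ways.

    Static pairs match each image with the text of its own scene;
    dynamic pairs match each image with the text of the other scene.
    Cosine similarity is an assumed stand-in for the learned alignment.
    """
    static = [cosine(img_before, txt_before), cosine(img_after, txt_after)]
    dynamic = [cosine(img_before, txt_after), cosine(img_after, txt_before)]
    return static + dynamic  # concatenate static and dynamic scores

# Toy features (assumed): each scene's text matches its own image exactly.
img_before, img_after = [1.0, 0.0], [0.0, 1.0]
txt_before, txt_after = [1.0, 0.0], [0.0, 1.0]

scores = dual_alignment_scores(img_before, img_after, txt_before, txt_after)
print(scores)  # [1.0, 1.0, 0.0, 0.0]
```

In this toy case the static scores are high and the dynamic scores are low, which is the kind of contrast one would expect when the two scenes differ; a real system would operate on learned feature maps rather than hand-picked vectors.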

Paper Structure

This paper contains 29 sections, 10 equations, 14 figures, 13 tables.

Figures (14)

  • Figure 1: (a) Existing methods struggle to estimate changes because compositional reasoning cues are not explicitly represented in the image (i.e., object relationships (yellow arrows), spatial arrangements (blue circles)). (b) In contrast, our method incorporates explicit textual compositional reasoning cues to enhance scene understanding, thereby enabling more accurate change description.
  • Figure 2: Overview of the proposed Compositional Reasoning-aware Text-guided (CORTEX) framework for change captioning, which combines three modules: (1) an image-level change detector, which captures change cues between the two images; (2) the RTE module, which extracts compositional reasoning sentences for each scene; and (3) the ITDA module, which reinforces same-scene understanding through static alignment and identifies cross-scene changes through dynamic alignment.
  • Figure 3: Overview of the ITDA module. (a) Static alignment matches each image with its corresponding compositional texts extracted by the RTE module. (b) Dynamic alignment matches each image with texts from the cross-scene to highlight the changes. $\bigoplus$ denotes concatenation.
  • Figure 4: Visualization examples in the CLEVR-Change dataset (Blue/red: correct/incorrect compositional reasoning cues).
  • Figure S.1: Qualitative results of the CLEVR-Change dataset. Correct and incorrect predictions are highlighted using blue and red, respectively.
  • ...and 9 more figures