Leveraging Textual Compositional Reasoning for Robust Change Captioning
Kyu Ri Park, Jiyoung Park, Seong Tae Kim, Hong Joo Lee, Jung Uk Kim
TL;DR
The paper addresses a limitation of visual-only change captioning by introducing CORTEX, a plug-and-play framework that injects explicit textual compositional reasoning into change captioning. It uses a Reasoning-aware Text Extraction module to generate relational descriptions with a Vision-Language Model, and an Image-Text Dual Alignment module to fuse static intra-scene and dynamic cross-scene textual cues with visual features. By aligning text and image representations and combining them with image-level change cues, CORTEX achieves consistent improvements on the CLEVR-Change, CLEVR-DC, and Spot-the-Diff datasets, with better fine-grained relational understanding and greater robustness to viewpoint changes. The work highlights the value of textual reasoning signals for change captioning and offers a practical, modular enhancement for existing visual change detectors.
Abstract
Change captioning aims to describe changes between a pair of images. However, existing works rely on visual features alone, which often fail to capture subtle but meaningful changes because they lack the ability to represent explicitly structured information such as object relationships and compositional semantics. To alleviate this, we present CORTEX (COmpositional Reasoning-aware TEXt-guided), a novel framework that integrates complementary textual cues to enhance change understanding. In addition to capturing cues from pixel-level differences, CORTEX utilizes scene-level textual knowledge provided by Vision-Language Models (VLMs) to extract richer image-text signals that reveal underlying compositional reasoning. CORTEX consists of three key modules: (i) an Image-level Change Detector that identifies low-level visual differences between paired images, (ii) a Reasoning-aware Text Extraction (RTE) module that uses VLMs to generate compositional reasoning descriptions implicit in visual features, and (iii) an Image-Text Dual Alignment (ITDA) module that aligns visual and textual features for fine-grained relational reasoning. This enables CORTEX to reason over visual and textual features and capture changes that are otherwise ambiguous in visual features alone.
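To make the three-module data flow concrete, the following is a minimal, illustrative sketch and not the authors' implementation. It assumes that image-level change cues can be approximated by a feature difference, that the RTE output is already available as sentences (no VLM is called here), and that alignment can be stood in for by scaled dot-product cross-attention. All function names (`change_features`, `embed_text`, `cross_attend`, `cortex_sketch`) and the toy text embedding are hypothetical.

```python
import math


def change_features(feat_before, feat_after):
    """Stand-in for the image-level change detector: per-token feature difference."""
    return [[a - b for a, b in zip(row_after, row_before)]
            for row_before, row_after in zip(feat_before, feat_after)]


def embed_text(sentence, dim=4):
    """Toy, deterministic text embedding (assumption): bucket words by length.

    A real system would use a learned text encoder instead.
    """
    vec = [0.0] * dim
    for word in sentence.lower().split():
        vec[len(word) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid division by zero
    return [v / norm for v in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(visual_tokens, text_tokens):
    """Stand-in for image-text alignment: each visual token attends over text tokens.

    The attended text context is added to the visual token as a residual.
    """
    dim = len(text_tokens[0])
    fused = []
    for v in visual_tokens:
        scores = [sum(vi * ti for vi, ti in zip(v, t)) / math.sqrt(dim)
                  for t in text_tokens]
        weights = softmax(scores)
        context = [sum(w * t[k] for w, t in zip(weights, text_tokens))
                   for k in range(dim)]
        fused.append([vi + ci for vi, ci in zip(v, context)])
    return fused


def cortex_sketch(feat_before, feat_after, static_text, dynamic_text):
    """Combine change cues with static (intra-scene) and dynamic (cross-scene) text."""
    diff = change_features(feat_before, feat_after)
    text_tokens = [embed_text(s) for s in static_text + dynamic_text]
    return cross_attend(diff, text_tokens)


if __name__ == "__main__":
    before = [[0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
    after = [[0.0, 0.0, 0.0, 0.0], [1.0, 2.0, 0.0, 0.0]]
    fused = cortex_sketch(
        before, after,
        static_text=["a cube is left of a sphere"],  # intra-scene description
        dynamic_text=["the sphere moved right"],     # cross-scene description
    )
    print(fused)
```

In this sketch the fused tokens would be passed on to a caption decoder, which is omitted. The residual addition keeps the visual change cue intact when the text adds little, which loosely mirrors the plug-and-play framing above.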
