Context-aware Difference Distilling for Multi-change Captioning

Yunbin Tu; Liang Li; Li Su; Zheng-Jun Zha; Chenggang Yan; Qingming Huang

TL;DR

This work tackles multi-change captioning, where multiple simultaneous changes and viewpoint variations make robust description challenging. It introduces Context-Aware Difference Distilling (CARD), which decouples common context features and difference context features from an image pair, then applies a consistency constraint to align the common contexts and a constraint based on the Hilbert-Schmidt Independence Criterion (HSIC) to enforce independence between the difference contexts. Locally unchanged features are mined under common-context guidance and subtracted from the pair to distill locally differing features, which are then augmented by the difference contexts to form an omni-representation of all changes. A transformer decoder translates this representation into captions. Across three public datasets, CARD achieves state-of-the-art results and generalizes well to unaligned pairs and to challenging surveillance and remote-sensing scenarios. The code is released for replication.
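To make the HSIC-based independence constraint concrete, the sketch below computes the standard biased empirical HSIC estimate, tr(KHLH)/(n-1)^2, between two sets of feature vectors. It illustrates the criterion only: the Gaussian kernel, the bandwidth `sigma`, and the function names are assumptions made here, and the way CARD actually turns HSIC into a loss is not specified in this summary.

```python
import math

def gaussian_gram(xs, sigma=1.0):
    """Gram matrix K[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)).
    The Gaussian kernel and sigma are illustrative assumptions."""
    n = len(xs)
    return [
        [
            math.exp(-sum((a - b) ** 2 for a, b in zip(xs[i], xs[j])) / (2 * sigma ** 2))
            for j in range(n)
        ]
        for i in range(n)
    ]

def center(K):
    """Double-center a Gram matrix: H K H with H = I - (1/n) * ones."""
    n = len(K)
    row_mean = [sum(r) / n for r in K]
    col_mean = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [
        [K[i][j] - row_mean[i] - col_mean[j] + total_mean for j in range(n)]
        for i in range(n)
    ]

def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC: tr(K H L H) / (n - 1)^2.
    Values near zero suggest the two feature sets are close to independent."""
    n = len(xs)
    Kc = center(gaussian_gram(xs, sigma))
    L = gaussian_gram(ys, sigma)
    # tr(H K H L) equals the elementwise sum because both matrices are symmetric.
    return sum(Kc[i][j] * L[i][j] for i in range(n) for j in range(n)) / (n - 1) ** 2

# Hypothetical toy features standing in for two sets of context features.
before = [[0.0], [1.0], [2.0], [3.0]]
constant = [[5.0], [5.0], [5.0], [5.0]]
dependent_score = hsic(before, before)
independent_score = hsic(before, constant)
```

Used as a penalty, a lower value would push the two feature sets toward statistical independence; here `independent_score` is zero up to rounding because a constant feature carries no information about `before`.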

Abstract

Multi-change captioning aims to describe complex and coupled changes within an image pair in natural language. Compared with single-change captioning, this task requires the model to have higher-level cognition ability to reason about an arbitrary number of changes. In this paper, we propose a novel context-aware difference distilling (CARD) network to capture all genuine changes for yielding sentences. Given an image pair, CARD first decouples context features that aggregate all similar/dissimilar semantics, termed common/difference context features. Then, the consistency and independence constraints are designed to guarantee the alignment/discrepancy of common/difference context features. Further, the common context features guide the model to mine locally unchanged features, which are subtracted from the pair to distill locally difference features. Next, the difference context features augment the locally difference features to ensure that all changes are distilled. In this way, we obtain an omni-representation of all changes, which is translated into linguistic sentences by a transformer decoder. Extensive experiments on three public datasets show CARD performs favourably against state-of-the-art methods. The code is available at https://github.com/tuyunbin/CARD.
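The distilling steps described above (mine locally unchanged features, subtract them, then augment with the difference context) can be illustrated with a toy sketch. Everything concrete here is an assumption rather than the authors' implementation: the sigmoid gate on the dot product with the common context, the additive augmentation, and the names `distill`, `common_ctx`, and `diff_ctx` are hypothetical stand-ins for the learned modules in the released code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def distill(features, common_ctx, diff_ctx):
    """Toy difference distilling for one image of a pair.

    features:   list of per-location feature vectors
    common_ctx: common context vector (assumed given)
    diff_ctx:   difference context vector (assumed given)
    """
    distilled = []
    for f in features:
        # Assumed gate: how strongly this location agrees with the common context.
        gate = sigmoid(dot(f, common_ctx))
        # Locally unchanged part, mined under common-context guidance.
        unchanged = [gate * a for a in f]
        # Subtract the unchanged part to keep the locally differing part.
        local_diff = [a - u for a, u in zip(f, unchanged)]
        # Augment with the difference context (simple addition, an assumption).
        distilled.append([d + c for d, c in zip(local_diff, diff_ctx)])
    return distilled

# Hypothetical example: location 0 matches the common context, location 1 does not.
features = [[10.0, 0.0], [-10.0, 0.0]]
common_ctx = [1.0, 0.0]
diff_ctx = [0.0, 0.0]
out = distill(features, common_ctx, diff_ctx)
```

In this toy setting the location that agrees with the common context is almost entirely suppressed, while the disagreeing location passes through nearly unchanged, which is the qualitative behaviour the abstract describes.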

Paper Structure (26 sections, 14 equations, 9 figures, 11 tables)

Introduction
Related Work
Methodology
Image Pair Encoding
Context-Aware Difference Distilling
Context Feature Decoupling
Difference Distilling
Caption Generation
Joint Training
Experiments
Datasets
Evaluation Metrics
Implementation Details
Performance Comparison
Ablation Study and Analysis
...and 11 more sections

Figures (9)

Figure 1: Three examples of multi-change captioning. (a) includes certain object changes; (b) consists of object and background changes; (c) shows both object changes and an irrelevant viewpoint change. These changes are shown in colored boxes.
Figure 2: The overall architecture of our method, which consists of (a) Image Pair Encoding, (b) Context-Aware DiffeRence Distilling (CARD), and (c) Caption Generation, each described in the section of the same name. Herein, CARD is the major component, learning robust difference features through context feature decoupling and context-aware difference distilling. $S^*$ stands for the ground-truth sentences.
Figure 3: Visualization of context features on CLEVR-Multi-Change and LEVIR-CC. The red and green colors indicate common context features in "before" and "after" images, while blue and purple colors denote difference context features in "before" and "after" images.
Figure 4: Qualitative examples on the three datasets. For each example, we visualize the captions generated by the SOTA method MCCFormers-D (Qiu et al., 2021) and our CARD, as well as the change localization of CARD. The successful cases of CARD are shown in the green box, while the sub-optimal case is shown in the red box.
Figure 5: Visualization of alignment of common objects (shown in yellow boxes) on the three datasets, where the results are obtained by MCCFormers-D and our CARD.
...and 4 more figures
