Top-Down Framework for Weakly-supervised Grounded Image Captioning

Chen Cai; Suchen Wang; Kim-hui Yap; Yi Wang

TL;DR

The paper tackles weakly-supervised grounded image captioning by eliminating the dependency on region proposals and object detectors. It introduces a one-stage top-down Vision Transformer encoder augmented with a relation semantic [REL] token and a Recurrent Grounding Module that generates Visual Language Attention Maps (VLAMs) for grounding during caption generation. The model jointly optimizes multi-label relation classification and captioning with an L_MLC + L_XE objective, achieving state-of-the-art grounding on Flickr30k-Entities and competitive results on MSCOCO. This detector-free approach enables efficient, interpretable grounding and captioning by leveraging global image representations and relation context, with potential for stronger backbones in future work.

Abstract

Weakly-supervised grounded image captioning (WSGIC) aims to generate the caption and ground (localize) predicted object words in the input image without using bounding box supervision. Recent two-stage solutions mostly apply a bottom-up pipeline: (1) encode the input image into multiple region features using an object detector; (2) leverage region features for captioning and grounding. However, utilizing independent proposals produced by object detectors tends to make the subsequent grounded captioner overfit to finding the correct object words, overlook the relations between objects, and select incompatible proposal regions for grounding. To address these issues, we propose a one-stage weakly-supervised grounded captioner that directly takes the RGB image as input to perform captioning and grounding at the top-down image level. Specifically, we encode the image into visual token representations and propose a Recurrent Grounding Module (RGM) in the decoder to obtain precise Visual Language Attention Maps (VLAMs), which recognize the spatial locations of the objects. In addition, we explicitly inject a relation module into our one-stage framework to encourage relation understanding through multi-label classification. These relation semantics serve as contextual information that facilitates the prediction of relation and object words in the caption. We observe that the relation semantics not only assist the grounded captioner in generating a more accurate caption but also improve the grounding performance. We validate the effectiveness of our proposed method on two challenging datasets (Flickr30k Entities captioning and MSCOCO captioning). The experimental results demonstrate that our method achieves state-of-the-art grounding performance.
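
The joint objective named in the TL;DR (L_MLC + L_XE) can be illustrated with a minimal NumPy sketch. The source only states that a multi-label relation classification loss and a captioning loss are summed; the sigmoid binary cross-entropy form of L_MLC, the per-token averaging, and all function names below are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def multilabel_bce(rel_logits, rel_targets):
    """Assumed form of L_MLC: mean sigmoid binary cross-entropy over relation classes."""
    # Numerically stable BCE-with-logits: max(x, 0) - x * t + log(1 + exp(-|x|))
    loss = (np.maximum(rel_logits, 0.0)
            - rel_logits * rel_targets
            + np.log1p(np.exp(-np.abs(rel_logits))))
    return float(loss.mean())

def caption_xe(word_logits, word_ids):
    """L_XE: mean cross-entropy of the ground-truth caption tokens."""
    # Log-softmax over the vocabulary axis, shifted for numerical stability
    shifted = word_logits - word_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(word_ids)), word_ids].mean())

def joint_loss(rel_logits, rel_targets, word_logits, word_ids):
    """Unweighted sum L_MLC + L_XE, as written in the TL;DR."""
    return multilabel_bce(rel_logits, rel_targets) + caption_xe(word_logits, word_ids)

# Toy usage: 2 images x 3 relation classes, 4 caption tokens over a 5-word vocabulary
rel_logits = np.zeros((2, 3))
rel_targets = np.array([[1.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
word_logits = np.zeros((4, 5))
word_ids = np.array([0, 1, 2, 3])
total = joint_loss(rel_logits, rel_targets, word_logits, word_ids)
```

With all-zero logits, each term reduces to the entropy of a uniform prediction (log 2 for the relation classes, log 5 for the toy vocabulary), which makes the sketch easy to sanity-check.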

Paper Structure (22 sections, 6 equations, 10 figures, 9 tables)

Introduction
Related Works
Grounded image captioning
Visual grounding
Weakly supervised object localization
Methodology
Top-down Image Encoder
Visual relation semantic modeling
Selection of relation classes
Grounded Language Decoder
Recurrent grounding module
Object bounding box generation
Grounding enhanced language module
Training and Objectives
Experiment
...and 7 more sections

Figures (10)

Figure 1: (a) Two-stage pipeline: use object-focused region (bottom-up) features and soft-attention BUTD for captioning and grounding. (b) One-stage framework (ours): use the raw RGB image as input for captioning and grounding. Instead of selecting salient region features, the one-stage method allows us to calculate similarity metrics between words and the entire image for grounding. In addition, we explicitly model the relation semantics (e.g., "looking through") and utilize them as contextual information to assist in predicting the desired groundable words (e.g., "telescope") in the caption.
Figure 2: Overview of the proposed weakly supervised grounded image captioning model. It consists of a top-down image encoder and a grounded language decoder. The image encoder uses the Vision Transformer backbone to encode the raw RGB image into [CLS] and patch token representations. A new [REL] token is concatenated with the frozen patch representations and trained to model the relation semantic information. The decoder takes these visual representations as input to generate the caption $\mathbf{Y}$, and computes the Visual-Language Attention Maps (VLAMs) $\mathbf{M}$ for localization based on the dot-product similarity between the visual representations $\mathbf{V}$ and the word representation $\mathbf{u}_t$. The top-left part of the figure shows the generated caption and the grounded object regions computed using the VLAMs (e.g., $\mathbf{m}_{jacket}$).
Figure 3: Illustration of the Recurrent Grounding Module (RGM) and the grounding process. We enhance the similarity attention metrics $\mathbf{s}_{people}^*$ ($\mathbf{s}_{t}^*$) for the current time step recurrently by conditioning on $\mathbf{s}_{of}^*$ ($\mathbf{s}_{t-1}^*$) from the previous time step. This enables us to compute a more precise $\mathbf{m}_{people}$ for grounding (shown in Figure \ref{fig:fig9}). On the right of the figure, we show the $\mathbf{s}^*$ that is computed concurrently with each generated word. During the testing stage, we reshape and up-sample $\mathbf{s}^*$ to $\mathbf{m}$, and localize the groundable object words using $\mathbf{m}_{people}^*$ and $\mathbf{m}_{beach}^*$.
Figure 4: Statistics of the 72 most frequently appearing relation words in the Flickr30K-Entities dataset.
Figure 5: Statistics of the 62 most frequently appearing relation words in the MSCOCO captioning dataset.
...and 5 more figures
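
The grounding steps described in the captions of Figures 2 and 3 (dot-product similarity between patch tokens and a word representation, then reshape and up-sample to an attention map) can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the min-max normalization, the nearest-neighbour up-sampling, the 0.5 threshold, and the function name `vlam_box` are all hypothetical choices, and the recurrent conditioning on the previous time step performed by the RGM is omitted.

```python
import numpy as np

def vlam_box(V, u_t, grid_hw, image_hw, threshold=0.5):
    """Sketch of one attention map and a box for a single generated word.

    V        : (N, d) patch token representations, N = grid_h * grid_w
    u_t      : (d,) representation of the word generated at step t
    grid_hw  : (grid_h, grid_w) patch grid size
    image_hw : (H, W) image size, assumed divisible by the grid size
    """
    gh, gw = grid_hw
    H, W = image_hw
    s = V @ u_t                                     # dot-product similarity per patch
    s = (s - s.min()) / (s.max() - s.min() + 1e-8)  # assumed min-max scaling to [0, 1]
    grid = s.reshape(gh, gw)                        # back to the 2-D patch layout
    # Nearest-neighbour up-sampling: repeat each patch score over its pixel block
    m = np.kron(grid, np.ones((H // gh, W // gw)))
    ys, xs = np.nonzero(m >= threshold)             # pixels above the assumed threshold
    if xs.size == 0:
        return m, None                              # nothing salient enough to ground
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    return m, box

# Toy usage: a 4x4 patch grid on an 8x8 image, only patch (row 1, col 1) matches the word
V = np.zeros((16, 2))
V[5] = [1.0, 0.0]
u_t = np.array([1.0, 0.0])
m, box = vlam_box(V, u_t, grid_hw=(4, 4), image_hw=(8, 8))
```

Because the similarity is computed against every patch of the whole image rather than against detector proposals, the box falls out of the map itself; in this toy case it covers the 2x2 pixel block of the single matching patch.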
