LocCa: Visual Pretraining with Location-aware Captioners

Bo Wan; Michael Tschannen; Yongqin Xian; Filip Pavetic; Ibrahim Alabdulmohsin; Xiao Wang; André Susano Pinto; Andreas Steiner; Lucas Beyer; Xiaohua Zhai

TL;DR

This paper proposes LocCa, a simple visual pretraining method based on location-aware captioners. LocCa significantly outperforms standard captioners on downstream localization tasks while maintaining comparable performance on holistic tasks.

Abstract

Image captioning has been shown to be an effective pretraining method, similar to contrastive pretraining. However, incorporating location-aware information into visual pretraining remains underexplored. In this paper, we propose a simple visual pretraining method with location-aware captioners (LocCa). LocCa uses a simple image captioner task interface to teach a model to read out rich information, i.e., bounding box coordinates and captions, conditioned on the image pixel input. Thanks to the multitask capabilities of an encoder-decoder architecture, we show that an image captioner can easily handle multiple tasks during pretraining. Our experiments demonstrate that LocCa significantly outperforms standard captioners on downstream localization tasks while maintaining comparable performance on holistic tasks.
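To make the idea of reading out bounding box coordinates through a captioner interface more concrete, here is a minimal sketch of how a box might be serialized into plain text so that it can sit in the same output sequence as a caption. This is not the authors' implementation: the function names, the (ymin, xmin, ymax, xmax) ordering, the choice of 1000 bins, and the way caption and box are joined are all assumptions made for illustration.

```python
from typing import Tuple

# Hypothetical box convention: (ymin, xmin, ymax, xmax) in pixels.
Box = Tuple[float, float, float, float]


def box_to_string(box: Box, height: int, width: int, num_bins: int = 1000) -> str:
    """Quantize a pixel-space box into integer bins and render it as text.

    The number of bins and the coordinate order are illustrative assumptions.
    """
    ymin, xmin, ymax, xmax = box

    def quantize(value: float, size: int) -> int:
        # Normalize to [0, 1], clamp, then map to a bin index in [0, num_bins - 1].
        frac = min(max(value / size, 0.0), 1.0)
        return min(int(frac * num_bins), num_bins - 1)

    coords = [
        quantize(ymin, height),
        quantize(xmin, width),
        quantize(ymax, height),
        quantize(xmax, width),
    ]
    return " ".join(str(c) for c in coords)


def make_text_target(caption: str, box: Box, height: int, width: int) -> str:
    """Join a caption and a serialized box into one decoder target string.

    The "caption : box" layout is a hypothetical format, not taken from the paper.
    """
    return f"{caption} : {box_to_string(box, height, width)}"


if __name__ == "__main__":
    print(make_text_target("a dog on a couch", (0, 0, 112, 224), 224, 224))
```

Because the box is rendered as ordinary text, a single text decoder can in principle emit captions and coordinates without a dedicated detection head, which is the kind of unified interface the abstract describes.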

Paper Structure (46 sections, 6 figures, 11 tables)

Introduction
Related Works
Location-aware Captioner
Pretraining tasks
Model details
Architecture
Autoregressive decoding
Parallel prediction
Objective
Discussion
Experiments
Experimental setup
Pretraining dataset
Baselines
Implementation details
...and 31 more sections

Figures (6)

Figure 1: Overview of LocCa. LocCa consists of a standard vision transformer and a transformer decoder. The vision transformer takes image pixels as input and produces visual tokens, which serve as the cross-attention input to the transformer decoder. The transformer decoder is trained to read out rich information from the visual tokens. We adopt the following three tasks for pretraining: Cap, AREF and GCAP.
Figure 2: Results on COCO detection with a limit of 25 output boxes. For reward-tuned models, we show the results both before (dark blue and orange) and after (light blue and orange) reinforce tuning (Pinto et al., 2023).
Figure 3: Ablation studies on (a) the impact of different pretraining image resolutions when using string tokenization; and (b) string vs. special tokens for box coordinates at a pretraining resolution of 224. The results are the average Acc@0.5 over the val and test splits of RefCOCO/+.
Figure 4: Ablation studies on string vs. special tokenization for box coordinates. The image resolution is 224.
Figure 5: Ablation studies on the impact of different pretraining image resolutions. We use string tokenization for box coordinates.
...and 1 more figures
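The figure captions above report Acc@0.5 on RefCOCO/+. As commonly defined for referring expression comprehension, this is the fraction of predicted boxes whose intersection over union (IoU) with the ground-truth box reaches 0.5. A minimal sketch of that metric follows; the function names and the (ymin, xmin, ymax, xmax) box layout are assumptions for illustration, and whether the threshold is inclusive can vary between evaluation scripts.

```python
from typing import Sequence, Tuple

# Hypothetical box convention: (ymin, xmin, ymax, xmax).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ay0, ax0, ay1, ax1 = a
    by0, bx0, by1, bx1 = b
    # Overlap extent along each axis, clipped at zero when boxes are disjoint.
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter = inter_h * inter_w
    area_a = max(0.0, ay1 - ay0) * max(0.0, ax1 - ax0)
    area_b = max(0.0, by1 - by0) * max(0.0, bx1 - bx0)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions with IoU >= threshold (inclusive by assumption)."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)


if __name__ == "__main__":
    preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
    targets = [(0, 0, 10, 10), (20, 20, 30, 30)]
    print(accuracy_at_iou(preds, targets))
```

Each referring expression has exactly one target box, so the metric reduces to a per-example hit or miss averaged over the split.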
