Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning

Jia Cheng Hu; Roberto Cavicchioli; Alessandro Capotondi

TL;DR

The paper tackles image captioning by removing the constraint that a layer must process its input at a fixed sequence length. Its Expansion mechanism distributes the input into a sequence of a different (typically longer) length during the forward phase and restores the original length with the reverse operation in the backward pass. Built on this mechanism, ExpansionNet v2 is an encoder-decoder architecture with a Swin Transformer backbone that uses Static Expansion in the encoder and Dynamic Expansion in the decoder. It achieves strong results on MS COCO 2014 and competitive performance on nocaps, and it supports faster End to End training. The ablations identify Dynamic Expansion as the main driver of the performance gains. Training and inference costs compare favorably with those of non-generative methods, although a clear gap remains to models that rely on large-scale pretraining. The Expansion mechanism can also be combined with other approaches to improve performance further while preserving efficiency.
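As a rough illustration of the round trip described above, the following pure-Python sketch expands a sequence of length 3 to length 5 and then restores length 3. It is a simplified sketch under stated assumptions, not the authors' implementation: the function names, the randomly initialised expansion queries, and the ReLU-plus-row-normalisation weighting are all chosen here for illustration, and the double operation stream, expansion biases, and gated combination are left out.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def relu_row_normalize(m, eps=1e-9):
    """Clamp negatives to zero, then scale each row to sum to (about) one."""
    out = []
    for row in m:
        positive = [max(v, 0.0) for v in row]
        total = sum(positive) + eps
        out.append([v / total for v in positive])
    return out


def expand_and_restore(x, expansion_queries):
    """Map x (L x d) to an expanded sequence (N_E x d) and back to (L x d).

    Assumption: the length-changing weights come from a scaled dot-product
    similarity between hypothetical expansion queries and the input.
    """
    d = len(x[0])
    sim = [[v / math.sqrt(d) for v in row]
           for row in matmul(expansion_queries, transpose(x))]   # (N_E x L)
    forward_weights = relu_row_normalize(sim)                    # (N_E x L)
    expanded = matmul(forward_weights, x)                        # (N_E x d)
    backward_weights = relu_row_normalize(transpose(sim))        # (L x N_E)
    restored = matmul(backward_weights, expanded)                # (L x d)
    return expanded, restored


random.seed(0)
L, N_E, d = 3, 5, 4  # input length, expansion coefficient, feature size
x = [[random.gauss(0, 1) for _ in range(d)] for _ in range(L)]
queries = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N_E)]
expanded, restored = expand_and_restore(x, queries)
print(len(expanded), len(restored))  # lengths of the expanded and restored sequences
```

The point of the sketch is only the change of shape: the intermediate sequence has a length set by the expansion coefficient, while the output has the same length as the input, so the layer can be dropped into a network without changing the surrounding sequence length.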

Abstract

We introduce a method called the Expansion mechanism that processes the input unconstrained by the number of elements in the sequence. By doing so, the model can learn more effectively compared to traditional attention-based approaches. To support this claim, we design a novel architecture, ExpansionNet v2, that achieves strong results on the MS COCO 2014 Image Captioning challenge and the State of the Art in its respective category, with a score of 143.7 CIDEr-D on the offline test split, 140.8 CIDEr-D on the online evaluation server, and 72.9 AllCIDEr on the nocaps validation set. Additionally, we introduce an End to End training algorithm up to 2.8 times faster than established alternatives. Source code available at: https://github.com/jchenghu/ExpansionNet_v2

Paper Structure (23 sections, 16 equations, 4 figures, 9 tables)

Introduction
Related Works
Method
Static and Dynamic Expansion
Expansion coefficient
Forward Expansion
Backward expansion
Block Static Expansion
Architecture
Training objectives
Results
Experimental Setup
Dataset
Model details
Training algorithm
...and 8 more sections

Figures (4)

Figure 1: The expansion mechanism distributes the input data into another sequence with a different length during the forward phase and performs the reverse operation in the backward pass. In this way, the network can process the sequence unconstrained by the number of elements.
Figure 2: Static Expansion and Auto-regressive Dynamic Expansion scheme and example, assuming an input length of $L=3$. In the Static Expansion setting, an expansion coefficient of $N_{E}=5$ leads to an expanded sequence of length $5$. In contrast, in the Dynamic Expansion setting, an expansion coefficient of $N_{E}=3$ generates an expanded sequence of length $L \cdot N_{E} = 9$. For the sake of simplicity, the double operation stream, the expansion biases and the gated result combination are omitted in the illustration. The difference between the Auto-regressive Dynamic Expansion and the bidirectional one lies in the Masked Matrix Multiplication.
Figure 3: ExpansionNet v2 architecture.
Figure 4: Attention visualization of a single decoder head in ExpansionNet v2.
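
The sequence lengths quoted in the Figure 2 caption can be reproduced with a small helper, sketched below. The second function is one plausible reading of the masking in the auto-regressive case and is an assumption, not the paper's definition: it supposes that the expanded rows are grouped by the input position that generated them and that each row may only see input positions up to its own source. Both function names are hypothetical.

```python
def expanded_length(input_length, expansion_coefficient, dynamic):
    """Length of the expanded sequence as described in the Figure 2 caption.

    Static Expansion: the length equals the expansion coefficient N_E.
    Dynamic Expansion: the length equals L * N_E.
    """
    if dynamic:
        return input_length * expansion_coefficient
    return expansion_coefficient


def autoregressive_expansion_mask(input_length, expansion_coefficient):
    """Boolean mask of shape (L * N_E) x L for a masked matrix multiplication.

    Assumption: rows are grouped by source position, and mask[r][c] is True
    when expanded row r is allowed to use input position c (c <= source).
    """
    mask = []
    for source in range(input_length):
        for _ in range(expansion_coefficient):
            mask.append([col <= source for col in range(input_length)])
    return mask


# Values from the Figure 2 caption: L = 3, N_E = 5 (static) and N_E = 3 (dynamic).
print(expanded_length(3, 5, dynamic=False))
print(expanded_length(3, 3, dynamic=True))
for row in autoregressive_expansion_mask(3, 3):
    print(row)
```

Under this reading, a bidirectional variant would simply use a mask whose entries are all True, which matches the caption's statement that the two variants differ only in the masked matrix multiplication.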
