Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning
Jia Cheng Hu, Roberto Cavicchioli, Alessandro Capotondi
TL;DR
The paper addresses image captioning by removing the fixed-length input bottleneck with an Expansion mechanism: a forward step distributes the input content over a longer sequence, and a backward step then restores the original length. Both steps are part of the model's computation and are unrelated to backpropagation. The paper introduces ExpansionNet v2, an encoder-decoder architecture with a Swin Transformer backbone that uses Static Expansion in the encoder and Dynamic Expansion in the decoder. It achieves strong MS-COCO 2014 results and competitive nocaps performance while enabling faster End to End training. Ablations identify Dynamic Expansion as the primary performance driver, and the reported training and inference costs compare favorably with non-generative methods, although a clear gap remains to models that rely on large-scale pretraining. Overall, the Expansion mechanism offers a principled way to enrich sequence representations in image captioning and can be combined with other approaches to improve performance further while maintaining efficiency.
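The Expansion idea summarized above (spread the input over a longer sequence, then return to the original length) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `expand_and_contract`, the `expansion_queries` parameter, and the use of softmax weighting are all assumptions made for illustration, and the weighting used in the paper may differ.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Normalize each row so that its entries are positive and sum to 1."""
    out = []
    for row in m:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def expand_and_contract(x, expansion_queries):
    """Hypothetical sketch of a forward/backward expansion.

    x:                 seq_len x dim input sequence
    expansion_queries: num_slots x dim vectors (assumed to be learned parameters)
    """
    dim = len(x[0])
    # Similarity between every expanded slot and every input element.
    raw = matmul(expansion_queries, transpose(x))              # num_slots x seq_len
    scores = [[v / math.sqrt(dim) for v in row] for row in raw]

    # Forward step: each slot becomes a weighted mix of the inputs.
    forward_weights = softmax_rows(scores)                     # num_slots x seq_len
    expanded = matmul(forward_weights, x)                      # num_slots x dim

    # Any processing of the longer sequence would happen here.

    # Backward step: each original position gathers from the slots,
    # which brings the sequence back to its original length.
    backward_weights = softmax_rows(transpose(scores))         # seq_len x num_slots
    restored = matmul(backward_weights, expanded)              # seq_len x dim
    return expanded, restored


random.seed(0)
seq_len, dim, num_slots = 3, 4, 6
x = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_slots)]
expanded, restored = expand_and_contract(x, queries)
print(len(expanded), len(restored))  # 6 3
```

In this sketch the number of slots is fixed in advance, which loosely corresponds to a static variant; a dynamic variant would let the number of slots depend on the input length.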
Abstract
We introduce a method called the Expansion mechanism that processes the input unconstrained by the number of elements in the sequence. By doing so, the model can learn more effectively than traditional attention-based approaches. To support this claim, we design a novel architecture, ExpansionNet v2, that achieves strong results on the MS COCO 2014 Image Captioning challenge and the State of the Art in its respective category, with a score of 143.7 CIDEr-D on the offline test split, 140.8 CIDEr-D on the online evaluation server, and 72.9 AllCIDEr on the nocaps validation set. Additionally, we introduce an End to End training algorithm that is up to 2.8 times faster than established alternatives. Source code is available at: https://github.com/jchenghu/ExpansionNet_v2
