Non-autoregressive Sequence-to-Sequence Vision-Language Models
Kunyu Shi, Qi Dong, Luis Goncalves, Zhuowen Tu, Stefano Soatto
TL;DR
The paper addresses the inference latency of autoregressive (AR) vision-language models by introducing NARVL, a non-autoregressive all-in-one model with a parallel decoder based on learnable query tokens and a novel Query-CTC (Q-CTC) loss. By marginalizing over decoder paths, NARVL models the joint distribution of target tokens and generates them in parallel, achieving accuracy on par with state-of-the-art AR models while delivering substantial inference speedups across VQA, visual grounding, visual entailment, and image captioning. Key contributions include the design of a fixed-length query-token decoder, the Q-CTC loss formulation, and knowledge distillation strategies that improve long-sequence outputs. The approach demonstrates that joint, parallel decoding is viable for heterogeneous vision-language tasks and offers practical speed advantages for real-world applications.
Abstract
Sequence-to-sequence vision-language models are showing promise, but their applicability is limited by inference latency caused by the autoregressive way they generate predictions. We propose a parallel-decoding sequence-to-sequence vision-language model, trained with a Query-CTC loss that marginalizes over multiple inference paths in the decoder. This allows us to model the joint distribution of tokens, rather than restricting ourselves to conditional distributions as in an autoregressive model. The resulting model, NARVL, achieves performance on par with its state-of-the-art autoregressive counterpart but is faster at inference time, replacing the linear complexity of sequential token generation with constant-time joint inference.
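To make "marginalizes over multiple inference paths" concrete, the following is a minimal sketch of the standard CTC marginal computed over a fixed number of decoder positions. It is not the paper's Q-CTC implementation, whose details are not given in this section. The assumptions here are that each row of `probs` is the token distribution emitted in parallel at one query position, that index 0 is a blank symbol, and that paths map to outputs through the usual CTC collapse rule; all function names and the toy numbers are hypothetical.

```python
import itertools
import math

BLANK = 0  # assumed index of the blank symbol


def collapse(path):
    """Standard CTC collapse: merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return tuple(out)


def ctc_likelihood_bruteforce(probs, target):
    """Sum the probability of every path that collapses to `target`."""
    num_positions, vocab = len(probs), len(probs[0])
    total = 0.0
    for path in itertools.product(range(vocab), repeat=num_positions):
        if collapse(path) == tuple(target):
            p = 1.0
            for t, s in enumerate(path):
                p *= probs[t][s]
            total += p
    return total


def ctc_likelihood_forward(probs, target):
    """Same marginal via the CTC forward recursion (dynamic programming)."""
    # Interleave blanks around the target symbols: b, y1, b, y2, ..., b.
    ext = [BLANK]
    for s in target:
        ext += [s, BLANK]
    size = len(ext)
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * size
        for i in range(size):
            a = alpha[i]
            if i >= 1:
                a += alpha[i - 1]
            # Skipping a blank is allowed only between two different symbols.
            if i >= 2 and ext[i] != BLANK and ext[i] != ext[i - 2]:
                a += alpha[i - 2]
            new[i] = a * probs[t][ext[i]]
        alpha = new
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


# Toy example: 4 query positions, vocabulary {blank, token 1, token 2}.
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
    [0.5, 0.25, 0.25],
]
target = [1, 2]
loss = -math.log(ctc_likelihood_forward(probs, target))
```

The brute-force version enumerates every path and is only there to check the recursion on tiny inputs. At inference time, under the same assumptions, a greedy decode would take the most likely symbol at every position at once and apply `collapse`, which is what removes the token-by-token loop of an autoregressive decoder.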
