Non-autoregressive Sequence-to-Sequence Vision-Language Models

Kunyu Shi, Qi Dong, Luis Goncalves, Zhuowen Tu, Stefano Soatto

TL;DR

The paper addresses the latency of autoregressive vision-language models by introducing NARVL, a non-autoregressive all-in-one model with a parallel decoder based on Learnable Query Tokens and a novel Query-CTC loss. By marginalizing over decoder paths, NARVL models the joint distribution of target tokens and enables parallel generation, achieving accuracy on par with state-of-the-art AR models while delivering substantial inference speedups across VQA, visual grounding, visual entailment, and image captioning. Key contributions include the design of a fixed-length query-token decoder, the Q-CTC loss formulation, and knowledge distillation strategies to bolster long-sequence outputs. The approach demonstrates that joint, parallel decoding is viable for heterogeneous vision-language tasks and offers practical speed advantages for real-world applications.

Abstract

Sequence-to-sequence vision-language models are showing promise, but their applicability is limited by inference latency, which stems from generating predictions autoregressively. We propose a parallel-decoding sequence-to-sequence vision-language model, trained with a Query-CTC loss, that marginalizes over multiple inference paths in the decoder. This allows us to model the joint distribution of tokens, rather than restricting to the conditional distributions used in an autoregressive model. The resulting model, NARVL, achieves performance on par with its state-of-the-art autoregressive counterpart but is faster at inference time, replacing the linear complexity of sequential token generation with constant-time joint inference.
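
To make "marginalizes over multiple inference paths" concrete, here is a minimal sketch of the standard CTC idea that the Query-CTC loss builds on. It is not the paper's Q-CTC formulation: the blank index, the toy probabilities, and the function names are assumptions made for illustration, and the brute-force enumeration is exponential in the number of output slots (practical CTC losses use dynamic programming instead).

```python
from itertools import product

BLANK = 0  # assumed index of the blank symbol (illustrative choice)

def collapse(path):
    """Merge consecutive repeats, then drop blanks (the standard CTC mapping)."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return tuple(out)

def marginal_prob(probs, target):
    """Sum the probability of every path that collapses to `target`.

    probs[t][v] is the probability of symbol v at output slot t. Slots are
    predicted independently given the input, as in standard CTC.
    """
    vocab = range(len(probs[0]))
    total = 0.0
    for path in product(vocab, repeat=len(probs)):
        if collapse(path) == tuple(target):
            p = 1.0
            for t, s in enumerate(path):
                p *= probs[t][s]
            total += p
    return total

# Toy case: two output slots, vocabulary {blank, token 1}, uniform predictions.
probs = [
    [0.5, 0.5],
    [0.5, 0.5],
]
# Paths (blank, 1), (1, blank) and (1, 1) all collapse to the target [1].
print(marginal_prob(probs, [1]))
```

A training loss in this style would be the negative log of this marginal, so the model is rewarded for any alignment of the target onto the output slots rather than for one fixed alignment.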

Paper Structure

This paper contains 15 sections, 1 equation, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Comparison of inference speed and performance between NARVL (non-autoregressive) and its autoregressive counterpart on four vision-language tasks: Visual Entailment (VE), Visual Grounding (VG), Visual Question Answering (VQA), and Image Captioning (IC). From (a), we see that NARVL speeds up inference over the AR model by a factor between 1.4 and 12.7 while achieving on-par performance.
  • Figure 2: Overview of NARVL. NARVL borrows the encoder from OFA (Wang et al., 2022), in which the text embedding sequence and the image CNN (ResNet) features are concatenated into the input token sequence. Unlike the standard transformer decoder, which generates outputs sequentially, conditioning on the sequence generated so far, our non-autoregressive decoder takes a sequence of tokens that are learnable weights and generates outputs for all tokens in parallel. As the output sequence length is unknown, we set the number of learnable query tokens to a value (a hyperparameter) larger than the largest target sequence length. The loss used, Q-CTC, is defined in the paper's Q-CTC loss equation.
  • Figure 3: Comparison of Transformer decoder designs. (a) The standard autoregressive Transformer decoder; (b) existing non-autoregressive Transformer decoders for audio and language tasks; (c) the proposed non-autoregressive Transformer decoder in NARVL. During training, the AR decoder (a) uses teacher forcing with causal masks, so each token can attend only to previous tokens, whereas in the decoders of (b) and (c) all tokens can attend to each other. The NARVL decoder takes dedicated query tokens as input, rather than the encoder outputs used as decoder input in (b). This design avoids the large decoder latency caused by the long output sequence of the encoder.
  • Figure 4: We test the proposed NARVL on various vision-language tasks, including Visual Question Answering (VQA), Visual Grounding (VG), Image Captioning (IC), and Visual Entailment (VE). The inputs and outputs of each task are illustrated here, and all output types are unified within the sequence formulation.
  • Figure 5: Comparison of accuracy and inference speed between NAR (non-autoregressive) and AR (autoregressive) models at varying model sizes: Tiny, Base, and Huge. Speed is measured in wall-clock time. NAR models significantly outperform their AR counterparts under the same inference-time budget on the RefCOCO validation set.
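
The decoder contrast drawn in the Figure 2 and Figure 3 captions can be sketched in a few lines. This is an illustrative sketch only, not the NARVL implementation: the mask helpers, the blank index, the toy probabilities, and the CTC-style collapse used to turn a fixed number of query slots into a variable-length output are all assumptions made here, since the captions do not spell out the inference rule.

```python
BLANK = 0  # assumed index of the blank symbol (illustrative choice)

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j.

    This is the AR pattern from Figure 3(a): only current and earlier positions.
    """
    return [[j <= i for j in range(n)] for i in range(n)]

def full_mask(n):
    """NAR pattern from Figure 3(b) and (c): every position sees every other."""
    return [[True] * n for _ in range(n)]

def greedy_parallel_decode(slot_probs):
    """Take the argmax of each query slot independently, then collapse.

    Collapsing merges consecutive repeats and drops blanks, so a fixed
    number of slots can yield a shorter, variable-length output.
    """
    best = [max(range(len(p)), key=p.__getitem__) for p in slot_probs]
    out, prev = [], None
    for s in best:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out

# Five query slots (more than the longest expected target), vocabulary of 3.
slot_probs = [
    [0.10, 0.80, 0.10],  # argmax: token 1
    [0.20, 0.70, 0.10],  # argmax: token 1 (repeat, merged)
    [0.90, 0.05, 0.05],  # argmax: blank
    [0.10, 0.20, 0.70],  # argmax: token 2
    [0.80, 0.10, 0.10],  # argmax: blank
]
print(greedy_parallel_decode(slot_probs))  # [1, 2]
```

Because every slot is decoded in one pass, the number of decoder calls does not grow with the output length, which is the source of the speedup the captions describe; the cost is that the slot count must be fixed in advance as a hyperparameter.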