
LIVE: Learnable In-Context Vector for Visual Question Answering

Yingzhe Peng, Chenduo Hao, Xu Yang, Jiawei Peng, Xinting Hu, Xin Geng

TL;DR

This study proposes the Learnable In-Context VEctor (LIVE), which distills essential task information from demonstrations to improve ICL performance in LMMs. Experiments show that LIVE significantly reduces computational cost while improving accuracy on VQA tasks compared with traditional ICL and non-learnable ICV methods.

Abstract

As language models continue to scale, Large Language Models (LLMs) have exhibited emergent In-Context Learning (ICL) capabilities, enabling them to solve language tasks by prefixing a few in-context demonstrations (ICDs) as context. Inspired by these advances, researchers have extended these techniques to develop Large Multimodal Models (LMMs) with ICL capabilities. However, applying ICL usually faces two major challenges: 1) using more ICDs substantially increases inference time, and 2) performance is sensitive to the selection of ICDs. These challenges are further exacerbated in LMMs by the integration of multiple data types and the combinatorial complexity of multimodal ICDs. To address these challenges, some recent NLP studies introduce non-learnable In-Context Vectors (ICVs), which extract useful task information from ICDs into a single vector and then insert it into the LLM to help solve the corresponding task. Although useful in simple NLP tasks, these non-learnable methods fail to handle complex multimodal tasks like Visual Question Answering (VQA). In this study, we propose the Learnable In-Context VEctor (LIVE) to distill essential task information from demonstrations, improving ICL performance in LMMs. Experiments show that LIVE significantly reduces computational cost while improving accuracy on VQA tasks compared with traditional ICL and non-learnable ICV methods. The code is available at https://github.com/ForJadeForest/LIVE-Learnable-In-Context-Vector.

Paper Structure (31 sections, 8 equations, 7 figures, 23 tables)

This paper contains 31 sections, 8 equations, 7 figures, 23 tables.

Figures (7)

  • Figure 1: (a) Conventional ICL is more sensitive to the ICD selection and requires more inference time. (b) LIVE is more robust and reduces inference time by inputting a shift vector.
  • Figure 2: The LIVE training pipeline: (a) The output distribution $\mathcal{P}(\bm{\hat{x}} | \bm{V}, \bm{\alpha}; \mathcal{M})$ of the LMM when using LIVE. (b) Adding LIVE to the representations of the query to simulate the shift effect brought by demonstrations. (c) The output distribution $\mathcal{P}(\bm{\hat{x}} | \bm{X}_D; \mathcal{M})$ of the LMM when using demonstrations.
  • Figure 3: Total FLOPs and actual inference time of ICL, zero-shot, and LIVE for 1000 query samples.
  • Figure 4: Accuracy (%) of LIVE and LoRA with different training set sizes.
  • Figure 5: t-SNE visualization of first-answer-token representations over 200 queries.
  • ...and 2 more figures
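
The shift idea in the Figure 2 caption can be illustrated with a small numerical sketch. It is not the authors' implementation: the array shapes, the per-layer weighting `alpha[l] * V[l]`, the toy output head `W_out`, and the use of a KL divergence to compare the two output distributions are all assumptions made here for illustration, and the "layers" are independent arrays rather than a real LMM forward pass.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def apply_shift(hidden, V, alpha):
    """Add alpha[l] * V[l] to every token representation of layer l.

    hidden: (layers, tokens, dim) query representations (toy stand-in)
    V:      (layers, dim) hypothetical learnable shift vectors
    alpha:  (layers,) hypothetical learnable per-layer weights
    """
    return hidden + alpha[:, None, None] * V[:, None, :]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q), averaged over token positions; p and q: (tokens, vocab).
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

rng = np.random.default_rng(0)
layers, tokens, dim, vocab = 4, 3, 8, 10

hidden = rng.normal(size=(layers, tokens, dim))  # query-only representations
V = rng.normal(size=(layers, dim))               # shift vectors (assumed shape)
alpha = np.full(layers, 0.1)                     # shift weights (assumed init)
W_out = rng.normal(size=(dim, vocab))            # toy output head

# (b) shift the query representations, then (a) read out a distribution.
shifted = apply_shift(hidden, V, alpha)
p_live = softmax(shifted[-1] @ W_out)

# (c) random stand-in for the distribution obtained with demonstrations.
p_icl = softmax(rng.normal(size=(tokens, vocab)))

# Assumed training signal: make the shifted distribution match the ICL one.
loss = kl_divergence(p_icl, p_live)
print(f"alignment loss: {loss:.4f}")
```

In a real setting, `V` and `alpha` would be trained by gradient descent against such an alignment objective, while the model weights stay frozen; the sketch only evaluates the objective once.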