VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding

Kangsan Kim, Geon Park, Youngwan Lee, Woongyeong Yeo, Sung Ju Hwang

TL;DR

VideoICL tackles out-of-distribution (OOD) video understanding without fine-tuning. It combines similarity-based selection of demonstrations with a confidence-based iterative inference framework that expands the effective context while staying within token-length limits. The method ranks candidate demonstrations by a linear combination of cosine similarities over text and video embeddings, keeps the top-k, and then feeds the model successive small bundles of m demonstrations until the confidence of the generated answer, measured as its minimum token probability, reaches a threshold. Across six datasets and four tasks, VideoICL improves substantially over zero-shot inference and several other baselines, including larger models and fine-tuned methods, with an average absolute accuracy gain of 0.256 and a gain of up to 0.143 BLEU-4 in captioning. The work shows that training-free video ICL can outperform larger models on OOD data, offering task-agnostic generalization for video understanding at a practical inference cost.

Abstract

Recent advancements in video large multimodal models (LMMs) have significantly improved their video understanding and reasoning capabilities. However, their performance drops on out-of-distribution (OOD) tasks that are underrepresented in training data. Traditional methods like fine-tuning on OOD datasets are impractical due to high computational costs. While in-context learning (ICL) with demonstration examples has shown promising generalization performance in language tasks and image-language tasks without fine-tuning, applying ICL to video-language tasks faces challenges due to the limited context length of video LMMs, as videos require longer token lengths. To address these issues, we propose VideoICL, a novel video in-context learning framework for OOD tasks that introduces a similarity-based relevant example selection strategy and a confidence-based iterative inference approach. The framework selects the most relevant examples and ranks them by similarity for use at inference. If the generated response has low confidence, the framework selects new examples and performs inference again, iteratively refining the result until a high-confidence response is obtained. This approach improves OOD video understanding performance by extending the effective context length without incurring high costs. Experimental results on multiple benchmarks demonstrate significant performance gains, especially in domain-specific scenarios, laying the groundwork for broader video comprehension applications. Code will be released at https://github.com/KangsanKim07/VideoICL
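The confidence-based iterative inference described in the abstract can be sketched as a short loop. This is a minimal illustration, not the authors' released code: the `generate` callable, the default values of `m`, `threshold`, and `max_iters`, and the fallback to the most confident answer when no iteration reaches the threshold are all assumptions made for the example.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical model interface: takes the query and a bundle of demonstrations,
# returns the generated answer and the log-probability of each generated token.
GenerateFn = Callable[[str, List[str]], Tuple[str, Sequence[float]]]


def min_token_confidence(token_logprobs: Sequence[float]) -> float:
    """Confidence of an answer = smallest probability among its tokens."""
    return min(math.exp(lp) for lp in token_logprobs)


def iterative_inference(
    query: str,
    ranked_examples: List[str],  # demonstrations, most similar first
    generate: GenerateFn,
    m: int = 2,                  # demonstrations per iteration (assumed value)
    threshold: float = 0.5,      # confidence threshold (assumed value)
    max_iters: int = 4,          # iteration budget (assumed value)
) -> Tuple[Optional[str], float]:
    best_answer, best_conf = None, -1.0
    for i in range(max_iters):
        bundle = ranked_examples[i * m:(i + 1) * m]
        if not bundle:
            break  # ran out of retrieved demonstrations
        answer, logprobs = generate(query, bundle)
        conf = min_token_confidence(logprobs)
        if conf > best_conf:
            best_answer, best_conf = answer, conf
        if conf >= threshold:
            break  # confident enough, stop early
    # Assumption: if no iteration reached the threshold, keep the most confident answer.
    return best_answer, best_conf
```

Because each iteration sees only `m` demonstrations, the prompt length stays bounded while the model can still draw on up to `m * max_iters` retrieved examples in total.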

Paper Structure

This paper contains 37 sections, 2 theorems, 8 equations, 10 figures, 5 tables, 1 algorithm.

Key Result

Proposition 1

Let $a(n)$ be the expected accuracy of VideoICL with a maximum of $n$ confidence-based iterations. The proposition characterizes the asymptotic behavior of $a(n)$ in terms of $\mathrm{TPR}$ and $\mathrm{FPR}$, the true positive rate (i.e., recall) and the false positive rate of the confidence estimation method, respectively.
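To build intuition for how $\mathrm{TPR}$ and $\mathrm{FPR}$ can drive accuracy as $n$ grows, the sketch below evaluates a toy model. It is not the paper's formula, which is not reproduced in this summary. The assumptions are ours: every iteration is independently correct with a fixed probability `p`, the confidence check accepts a correct answer with probability `tpr` and an incorrect one with probability `fpr`, inference stops at the first accepted answer, and if nothing is accepted within `n` iterations the returned answer is correct with probability `fallback`.

```python
from typing import Optional


def expected_accuracy(n: int, p: float, tpr: float, fpr: float,
                      fallback: Optional[float] = None) -> float:
    """Expected accuracy of the toy model after at most n iterations."""
    if fallback is None:
        fallback = p  # assumption: an unaccepted run is no better than a single try
    accept_correct = p * tpr        # iteration is correct and accepted
    accept_wrong = (1.0 - p) * fpr  # iteration is wrong but accepted
    keep_going = 1.0 - accept_correct - accept_wrong

    accuracy = 0.0
    still_running = 1.0  # probability that no answer has been accepted yet
    for _ in range(n):
        accuracy += still_running * accept_correct
        still_running *= keep_going
    return accuracy + still_running * fallback


if __name__ == "__main__":
    for n in (0, 1, 2, 4, 8):
        print(n, round(expected_accuracy(n, p=0.5, tpr=0.9, fpr=0.1), 4))
```

Under these assumptions, accuracy rises with `n` whenever accepted answers are correct more often than a single attempt, and it approaches `p * tpr / (p * tpr + (1 - p) * fpr)`, the precision of the confidence check.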

Figures (10)

  • Figure 1: Motivation. Top left: Video LMMs perform poorly on out-of-distribution videos, such as crime videos. Bottom left: In-context learning (ICL), which is usually employed to solve this problem, is infeasible for video tasks because the in-context demonstrations are too long. Right: VideoICL alleviates this problem by selecting the most relevant demonstrations (e.g., 2-shot) through similarity-based example selection, and by iteratively performing inference with a different set of demonstrations at each step (confidence-based iterative inference).
  • Figure 2: Our Methodology. Given a test query $Q_{test}$ consisting of a video and some text, each is embedded into a vector. Similarity-based Example Selection: Based on the cosine similarity between the query vector and the embeddings in the database of pre-encoded examples, we retrieve the top-$k$ most similar examples. This stage incurs negligible time cost, since it only computes features for the test sample and their similarities with the pre-encoded features. Confidence-Based Iterative Inference: Starting from the top of the list, each set of $m$ examples is used as in-context examples for the query $Q_{test}$, until the confidence of the generated answer exceeds the threshold.
  • Figure 3: Qualitative Results. We show three real hand-picked test samples from the main benchmarks. The first and third examples are from the UCF-Crime (Sultani et al., 2019) video classification task, and the second one is from the Sports-QA (Li et al., 2024) open-ended QA task. The leftmost column shows the given question, for which the vanilla (non-ICL) model makes an incorrect prediction. The second and third columns show the first and second confidence-based iterations, with the in-context demonstrations selected at each iteration.
  • Figure 4: Most confident examples. The numbers on each bar represent the number of test samples where the corresponding iteration ended up having the highest confidence score. The x-axis represents the proportion of each iteration.
  • Figure 5: Qualitative result on the Animal Kingdom dataset.
  • ...and 5 more figures
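The similarity-based example selection in the Figure 2 caption can be sketched as follows. This is a minimal illustration in pure Python, assuming the text and video embeddings are already computed as plain lists of floats. The weight `alpha` that mixes the two cosine similarities, and the helper names `select_top_k` and `make_bundles`, are hypothetical choices for the example rather than details taken from the paper.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_top_k(
    query_text: Sequence[float],
    query_video: Sequence[float],
    db_text: List[Sequence[float]],   # pre-encoded text embeddings of the examples
    db_video: List[Sequence[float]],  # pre-encoded video embeddings of the examples
    k: int,
    alpha: float = 0.5,               # assumed weight between text and video similarity
) -> List[int]:
    """Return the indices of the k most similar examples, most similar first."""
    scores = [
        alpha * cosine(query_text, t) + (1.0 - alpha) * cosine(query_video, v)
        for t, v in zip(db_text, db_video)
    ]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def make_bundles(indices: List[int], m: int) -> List[List[int]]:
    """Split the ranked indices into consecutive bundles of m demonstrations."""
    return [indices[i:i + m] for i in range(0, len(indices), m)]
```

Since the database embeddings are computed once in advance, each query costs only one embedding pass plus a similarity scan, which is consistent with the caption's remark that this stage takes negligible time.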

Theorems & Definitions (3)

  • Proposition 1: Asymptotic model accuracy
  • Proposition: Asymptotic model accuracy
  • proof