VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding
Kangsan Kim, Geon Park, Youngwan Lee, Woongyeong Yeo, Sung Ju Hwang
TL;DR
VideoICL addresses out-of-distribution (OOD) video understanding without fine-tuning. It combines similarity-based demonstration selection with a confidence-based iterative inference framework that expands the effective context while staying within token-length limits. The method uses a linear cosine-similarity score over text and video embeddings to pick the top-k demonstrations, then feeds the model successive small bundles of m demonstrations until the response confidence, measured as the minimum token probability, reaches a threshold. Empirical results across six datasets and four tasks show substantial improvements over zero-shot inference and over other baselines, including larger models and fine-tuned methods, with an average gain of 0.256 in absolute accuracy and up to 0.143 BLEU-4 in captioning. The work demonstrates that training-free video ICL can outperform larger models on OOD data, offering scalable, task-agnostic generalization for video understanding with attention to practical inference efficiency.
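The selection step can be illustrated with a minimal sketch. It assumes that the "linear" score is a weighted sum of the text and video cosine similarities; the weight `alpha`, the function names, and the in-memory candidate pool are hypothetical choices for illustration and are not taken from the paper or its released code.

```python
import math
from typing import List, Sequence, Tuple

Embedding = Sequence[float]


def cosine(u: Embedding, v: Embedding) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_demonstrations(
    query_text: Embedding,
    query_video: Embedding,
    pool: List[Tuple[Embedding, Embedding]],
    k: int,
    alpha: float = 0.5,  # assumed weight between text and video similarity
) -> List[int]:
    """Return indices of the k most similar candidates, most similar first.

    Each pool entry is a (text_embedding, video_embedding) pair.
    """
    scored = []
    for idx, (text_emb, video_emb) in enumerate(pool):
        score = (alpha * cosine(query_text, text_emb)
                 + (1.0 - alpha) * cosine(query_video, video_emb))
        scored.append((score, idx))
    # Sort by descending score; break ties by original index for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]
```

The returned ranking can then be consumed in order, m demonstrations at a time, by the iterative inference stage.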
Abstract
Recent advancements in video large multimodal models (LMMs) have significantly improved their video understanding and reasoning capabilities. However, their performance drops on out-of-distribution (OOD) tasks that are underrepresented in training data. Traditional methods like fine-tuning on OOD datasets are impractical due to high computational costs. While in-context learning (ICL) with demonstration examples has shown promising generalization performance in language tasks and image-language tasks without fine-tuning, applying ICL to video-language tasks faces challenges due to the limited context length of video LMMs, as videos require longer token lengths. To address these issues, we propose VideoICL, a novel video in-context learning framework for OOD tasks that introduces a similarity-based relevant example selection strategy and a confidence-based iterative inference approach. The framework selects the most relevant examples and ranks them by similarity for use in inference. If the generated response has low confidence, the framework selects new examples and performs inference again, iterating until a high-confidence response is obtained. This approach improves OOD video understanding performance by extending the effective context length without incurring high costs. Experimental results on multiple benchmarks demonstrate significant performance gains, especially in domain-specific scenarios, laying the groundwork for broader video comprehension applications. Code will be released at https://github.com/KangsanKim07/VideoICL
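The iterative procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the released implementation: `generate` is a hypothetical callable standing in for a video LMM that returns an answer together with its per-token probabilities, confidence is taken as the minimum token probability, and returning the most confident answer when no batch reaches the threshold is an assumed fallback.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical model interface: takes a batch of demonstration ids and
# returns (answer, per-token probabilities of the generated answer).
Generate = Callable[[Sequence[int]], Tuple[str, List[float]]]


def confidence(token_probs: Sequence[float]) -> float:
    """Confidence as the minimum token probability (0.0 if no tokens)."""
    return min(token_probs) if token_probs else 0.0


def iterative_inference(
    ranked_demos: Sequence[int],
    generate: Generate,
    m: int,
    threshold: float,
) -> Tuple[str, float]:
    """Try successive batches of m demonstrations until confidence >= threshold.

    ranked_demos is assumed to be sorted from most to least similar.
    If no batch reaches the threshold, the most confident answer seen so far
    is returned (an assumed fallback). With no demonstrations at all, the
    result is ("", -1.0).
    """
    if m <= 0:
        raise ValueError("m must be a positive integer")
    best_answer, best_conf = "", -1.0
    for start in range(0, len(ranked_demos), m):
        batch = ranked_demos[start:start + m]
        answer, token_probs = generate(batch)
        conf = confidence(token_probs)
        if conf > best_conf:
            best_answer, best_conf = answer, conf
        if conf >= threshold:
            break  # confident enough; stop spending inference calls
    return best_answer, best_conf
```

Because each call sees only m demonstrations, the context length per call stays bounded, while up to k demonstrations in total can influence the final answer.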
