Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition

Yuhao Dong; Shulin Tian; Shuai Liu; Shuangrui Ding; Yuhang Zang; Xiaoyi Dong; Yuhang Cao; Jiaqi Wang; Ziwei Liu

TL;DR

Existing video understanding benchmarks mostly test a model's static, internal knowledge. The paper addresses this gap by introducing Demo-driven Video In-Context Learning, a task in which models learn procedural skills from in-context demonstrations and apply them to new target videos. It proposes Demo-ICL-Bench, a 1200-question benchmark built from instructional HowTo100M videos, with text demonstrations, video demonstrations, and a demonstration-selection setting. It also proposes Demo-ICL, an MLLM trained with a two-stage pipeline: video supervised fine-tuning followed by information-assisted Direct Preference Optimization. Evaluation against proprietary and open-source MLLMs shows that current models struggle with demo-driven video in-context learning, and that the proposed training strategy improves performance on both demo-driven and general video understanding tasks. The authors present this as a scalable path toward human-like learning from demonstrations in video understanding, with potential implications for robotics and real-world task adaptation.

Abstract

Despite the growing video understanding capabilities of recent Multimodal Large Language Models (MLLMs), existing video benchmarks primarily assess understanding based on models' static, internal knowledge, rather than their ability to learn and adapt to dynamic, novel contexts from a few examples. To bridge this gap, we present Demo-driven Video In-Context Learning, a novel task focused on learning from in-context demonstrations to answer questions about target videos. Alongside this, we propose Demo-ICL-Bench, a challenging benchmark designed to evaluate demo-driven video in-context learning capabilities. Demo-ICL-Bench is constructed from 1200 instructional YouTube videos with associated questions, from which two types of demonstrations are derived: (i) text demonstrations, obtained by summarizing video subtitles; and (ii) video demonstrations, drawn from corresponding instructional videos. To effectively tackle this new challenge, we develop Demo-ICL, an MLLM with a two-stage training strategy: video-supervised fine-tuning and information-assisted direct preference optimization, jointly enhancing the model's ability to learn from in-context examples. Extensive experiments with state-of-the-art MLLMs confirm the difficulty of Demo-ICL-Bench, demonstrate the effectiveness of Demo-ICL, and thereby unveil future research directions.
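
The second training stage named above builds on Direct Preference Optimization. As a point of reference, the following is a minimal sketch of the generic DPO loss for a single preference pair. It is not the paper's information-assisted variant, whose details are not given here; the function name `dpo_loss`, its argument names, and the default `beta` are assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO loss for one (chosen, rejected) response pair.

    Inputs are summed log-probabilities of each response under the
    trainable policy and under the frozen reference model.
    All names here are hypothetical, not taken from the paper.
    """
    # Implicit rewards: how much the policy has shifted from the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)

    # Loss is -log(sigmoid(logits)), computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy matches the reference on both responses, the loss equals log 2; it falls below that as the policy raises the chosen response's likelihood relative to the rejected one.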

Paper Structure (33 sections, 2 equations, 4 figures, 8 tables)

Introduction
Related Work
Demo-ICL: Procedural Knowledge Learning from In-Context Demonstrations
Demo-driven Video In-Context Learning
Dataset Construction
Learning from In-Context Demonstrations
Video Supervised Fine-tuning
Information-Assisted Preference Optimization
Experiments
Demo-ICL-Bench
General Video Understanding
Analysis Experiments
Conclusion
Implementation Details
Experiment Details
...and 18 more sections

Figures (4)

Figure 1: Overview of the Demo-driven Video In-Context Learning Task with three distinct settings: (1) Text-demo in-context learning, where text instructions act as the demonstrations; (2) Video-demo in-context learning, where a video demonstration acts as the reference; and (3) Demonstration Selection, which requires identifying the most relevant video demonstrations among the video candidate pool and using them to guide video in-context learning.
Figure 2: Overview of Data Construction and Training Strategy. (i) illustrates our coarse-to-fine dataset collection pipeline (see Dataset Construction); (ii) presents the tailored two-stage training strategy for training the Demo-ICL model (see Learning from In-Context Demonstrations).
Figure 3: Visualization of Text-demo In-Context Learning. This figure provides two examples of the text-demo in-context learning task, where text instructions are provided along with the target video as input.
Figure 4: Visualization of Video-demo In-Context Learning. This figure provides two examples of the video-demo in-context learning task, where a video demonstration is provided together with the target video as input.
