Affordance-First Decomposition for Continual Learning in Video-Language Understanding

Mengzhu Xu, Hanzhi Liu, Ningkang Peng, Qianyu Chen, Canran Xiao

TL;DR

This work addresses continual video–language learning by separating a slowly varying, interaction-centered affordance substrate from a plastic, query-driven LLM scheduler. It introduces Affordance-First Decomposition (AFD), with a stable affordance head that generates tokens and prototypes, and a conflict-aware per-layer router that grows capacity via rank expansion only when needed. Stability is enforced through weak alignment and teacher consistency on the affordance head, while question-only replay distills knowledge to the LLM scheduler, so adaptation does not require replaying past videos. Empirically, AFD sets a new state of the art on domain-incremental and time-incremental VideoQA and on the ViLCo benchmarks with minimal forgetting, and extensive ablations and analyses confirm the benefits of the explicit stability/plasticity split and targeted adaptation.

Abstract

Continual learning for video–language understanding is increasingly important as models face non-stationary data, domains, and query styles, yet prevailing solutions blur what should stay stable versus what should adapt, rely on static routing/capacity, or require replaying past videos. We aim to explicitly specify where stability lives and where plasticity should be focused under realistic memory and privacy constraints. We introduce Affordance-First Decomposition (AFD): videos are mapped to slowly varying affordance tokens that form a shared, time-aligned substrate, while a lightweight, query-routed, conflict-aware scheduler concentrates adaptation and grows capacity only when needed. The substrate is stabilized via weak alignment and teacher consistency, and training uses question-only replay. AFD achieves state-of-the-art results across protocols: 51.6% average accuracy with -1.8% forgetting on domain-incremental VideoQA, ViLCo R@1@0.5 of 29.6% (MQ) and 20.7% (NLQ) with 18.4% stAP@0.25 (VQ), and 39.5% accuracy with -1.6% forgetting on time-incremental iVQA. Overall, AFD offers an explicit, interpretable split between a stable interaction-centered substrate and targeted adaptation.
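
The headline numbers pair an average accuracy with a signed forgetting value. The following is a minimal sketch of how such metrics are commonly computed in continual learning. It assumes (the text above does not state this) that forgetting is the mean, over all but the last task, of final accuracy minus the accuracy measured right after that task was learned, so negative values indicate a drop. The function names and the toy accuracy matrix are hypothetical and are not taken from the paper.

```python
from typing import List


def average_accuracy(acc: List[List[float]]) -> float:
    """Mean accuracy over all tasks, measured after training on the final task.

    acc[i][j] is the accuracy (in %) on task j after training on tasks 0..i.
    Only entries with j <= i are read.
    """
    final_row = acc[-1]
    return sum(final_row) / len(final_row)


def forgetting(acc: List[List[float]]) -> float:
    """Assumed signed forgetting: mean of (final accuracy - accuracy right after learning).

    Averaged over every task except the last one; negative means performance dropped.
    """
    num_tasks = len(acc)
    if num_tasks < 2:
        return 0.0  # nothing can be forgotten with a single task
    diffs = [acc[num_tasks - 1][j] - acc[j][j] for j in range(num_tasks - 1)]
    return sum(diffs) / len(diffs)


# Toy, made-up accuracy matrix for three sequential tasks (not results from the paper).
toy_acc = [
    [60.0, 0.0, 0.0],
    [58.0, 70.0, 0.0],
    [57.0, 69.0, 80.0],
]

print(f"average accuracy: {average_accuracy(toy_acc):.2f}")
print(f"forgetting: {forgetting(toy_acc):.2f}")
```

Under this convention, a method that loses little on earlier tasks reports a forgetting value close to zero, which is how figures such as -1.8% and -1.6% would be read.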

Paper Structure

This paper contains 52 sections, 3 theorems, 48 equations, 12 figures, and 4 tables.

Key Result

Lemma 1

Under (A1)–(A4), for any $k>i$,

Figures (12)

  • Figure 1: Under a stream of video–language reasoning tasks, existing methods rely on prompt/adapter add-ons with static routing and post-hoc stabilization, leaving the stability–plasticity trade-off implicit. AFD instead anchors evidence in a slowly varying affordance substrate and applies query-routed, conflict-triggered adapter updates, explicitly separating stability from plasticity.
  • Figure 2: Overview of the proposed Affordance-First Decomposition (AFD) framework for continual video–language question answering. A stream of video–language tasks arrives over time, each video is encoded and mapped by a shared affordance head into slowly varying affordance tokens and prototypes, while questions are embedded and stored for replay to route per-layer LoRA adapters in the LLM-backbone scheduler. Stability loss $\mathcal{L}_{\text{aff}}$ acts only on the affordance head, whereas task and replay losses $(\mathcal{L}_{\text{task}}+\mathcal{L}_{\text{replay}})$ act only on the routed adapters, explicitly separating a stable affordance substrate from a plastic reasoning module.
  • Figure 3: Affordance stability and coverage. (a) Soft Top-$L$ mixtures increase verb/action coverage without requiring more scheduler capacity. (b) Drift distributions concentrate near zero with small spread across tasks, consistent with a slowly varying shared space.
  • Figure 4: Case study.
  • Figure 5: Sensitivity on the five key hyperparameters. Defaults marked with $^{\star}$. AFD remains stable across broad ranges; extremes show mild degradation.
  • ...and 7 more figures

Theorems & Definitions (6)

  • Lemma 1: Single-step bound
  • proof
  • Theorem 1: Task-wise forgetting
  • proof
  • Proposition 1: Scheduler regret
  • proof