HyperTokens: Controlling Token Dynamics for Continual Video-Language Understanding

Toan Nguyen, Yang Liu, Celso De Melo, Flora D. Salim

TL;DR

HyperTokens is a transformer-based token generator that produces fine-tuning tokens on demand, giving explicit control over prompt updates while keeping memory fixed. The paper also proposes meta-inspired regularisers that look ahead to avoid task-specific sharp directions and anchor the evolving generator to prior tasks.

Abstract

Continual VideoQA with multimodal LLMs is hindered by interference between tasks and the prohibitive cost of storing task-specific prompts. We introduce HyperTokens, a transformer-based token generator that produces fine-tuning tokens on demand, giving explicit control over prompt updates while keeping memory fixed. To suppress forgetting, we propose meta-inspired regularisers that look ahead to avoid task-specific sharp directions and anchor the evolving generator to prior tasks. We further connect our objective to sharpness-aware optimisation, providing insight into why it encourages flatter cross-task minima and improves retention. Beyond regularisation, HyperTokens exploits lightweight auxiliary multimodal supervision through shared generation weights; guided by a causal perspective, we design feasible objectives and surrogate mutual-information losses to regularise anti-causal cross-modal directions. Across two standard continual VideoQA benchmarks, HyperTokens achieves higher average accuracy with substantially lower forgetting. Finally, we introduce a challenging cross-modal ImageQA->VideoQA protocol and show that HyperTokens enables robust continual transfer in this setting.
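To make the "tokens on demand, fixed memory" idea concrete, here is a minimal NumPy sketch. It is not the authors' architecture: the class `TokenGenerator`, the sizes `D_CODE`, `D_MODEL`, `N_TOKENS`, and the single self-attention layer are all assumptions chosen for illustration. The point it shows is that one shared generator maps a small per-task code to a set of prompt tokens, so adding tasks adds only codes, not new prompt tables.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
D_CODE, D_MODEL, N_TOKENS = 16, 32, 4


class TokenGenerator:
    """Toy stand-in for a token generator (assumed design, not the paper's):
    learned query slots are conditioned on a task code and mixed by one
    self-attention layer to produce prompt tokens."""

    def __init__(self):
        s = 0.1
        self.queries = rng.normal(0.0, s, (N_TOKENS, D_MODEL))  # one slot per token
        self.w_code = rng.normal(0.0, s, (D_CODE, D_MODEL))     # task code -> model dim
        self.w_q = rng.normal(0.0, s, (D_MODEL, D_MODEL))
        self.w_k = rng.normal(0.0, s, (D_MODEL, D_MODEL))
        self.w_v = rng.normal(0.0, s, (D_MODEL, D_MODEL))
        self.w_out = rng.normal(0.0, s, (D_MODEL, D_MODEL))

    def num_params(self):
        # Every attribute is a weight array, so this is the full generator size.
        return sum(p.size for p in vars(self).values())

    def __call__(self, z):
        # Condition every query slot on the task code (broadcast over slots).
        x = self.queries + z @ self.w_code
        q, k, v = x @ self.w_q, x @ self.w_k, x @ self.w_v
        scores = q @ k.T / np.sqrt(D_MODEL)
        scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
        attn = np.exp(scores)
        attn = attn / attn.sum(axis=-1, keepdims=True)
        # Residual connection around the attention output.
        return x + (attn @ v) @ self.w_out


gen = TokenGenerator()
params_before = gen.num_params()

# Per-task storage is just one small code vector per task.
codes = {f"task_{i}": rng.normal(size=D_CODE) for i in range(3)}
tokens = {name: gen(z) for name, z in codes.items()}
params_after = gen.num_params()
```

In this sketch the generator's parameter count is the same before and after the three tasks are served, while each task receives its own `(N_TOKENS, D_MODEL)` token block. In a continual setting the shared weights would be trained and regularised across tasks, which the sketch leaves out.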

Paper Structure (73 sections, 1 theorem, 40 equations, 7 figures, 8 tables)

This paper contains 73 sections, 1 theorem, 40 equations, 7 figures, and 8 tables.

Key Result

Theorem 4.1

Let $J_\phi(z) \coloneqq \partial H_\phi(z) / \partial \phi$ denote the Jacobian of $H_\phi$ w.r.t. $\phi$. For the current task $t$, define the gradient $g(\phi) \coloneqq \nabla_\phi \mathcal{L}_{\text{NLL}}^t(\phi)$ and the lookahead displacement ... Define the LA-Reg objective over previous tasks $\tau < t$ with stored codes $z^\tau$ and reference parameters $\phi^\star$ by ... (which matches Eq. ...

Figures (7)

  • Figure 1: HyperTokens overview. (Left) Continual adaptation with HyperTokens for VideoQA and cross-modal transfer VisualQA$\rightarrow$VideoQA. A fixed-size generator synthesises task-specific fine-tuning tokens. (Middle) Task-code learning via a multimodal contrastive objective with a prototype bank. (Right) Transformer-based token-generator architecture.
  • Figure 2: Geometry of LA-Reg in optimisation space. LA-Reg steers optimisation into the shared low-loss region (green), a flatter basin of minima across tasks, by balancing progress along the task-$t$ direction and alignment with the task-$(t\!-\!1)$ anchor direction. Note that we regularise in the output-prompt space (not parameter space).
  • Figure 3: HyperTokens consistently surpasses Bisecle across tasks with higher average accuracy and lower average forgetting. We exclude TP and CW since forgetting happens only from DC onwards.
  • Figure 4: Token analysis. Left: Mean image-video token similarity across adapter layers after Visual7W$\rightarrow$NExT-QA continual training. Middle/Right: t-SNE visualisations of token representations at representative mid and late layers after NExT-QA continual VideoQA training.
  • Figure 5: Qualitative Examples 1--2. HyperTokens predicts the correct answer (green), whereas Bisecle produces an incorrect one (red) in continual VideoQA.
  • ...and 2 more figures

Theorems & Definitions (2)

  • Theorem 4.1: LA-Reg as task-wise sharpness-aware regularisation
  • proof
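
As background for the "sharpness-aware" reading in Theorem 4.1, the standard sharpness-aware minimisation (SAM) objective and its usual first-order approximation are sketched below. This is the generic SAM formulation, not the paper's LA-Reg objective, whose exact form is not reproduced on this page. The perturbation radius $\rho > 0$, the perturbation $\epsilon$, and the generic loss $\mathcal{L}$ are notation assumed here for illustration; $g(\phi)$ is used in the same sense as in the theorem, i.e. the gradient of the loss at $\phi$.

```latex
% Generic SAM (background only; \rho and \epsilon are assumed notation).
\min_{\phi} \; \max_{\|\epsilon\|_2 \le \rho} \; \mathcal{L}(\phi + \epsilon),
\qquad
\hat{\epsilon}(\phi) = \rho \, \frac{g(\phi)}{\|g(\phi)\|_2},
\qquad
\mathcal{L}\bigl(\phi + \hat{\epsilon}(\phi)\bigr) \approx \mathcal{L}(\phi) + \rho \, \|g(\phi)\|_2 .
```

The last expression shows why a lookahead step along the gradient acts as a sharpness penalty: to first order, the lookahead loss equals the current loss plus a term proportional to the gradient norm, so minimising it favours flatter regions.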