CAST: Modeling Visual State Transitions for Consistent Video Retrieval

Yanqing Liu, Yingcheng Liu, Fanghong Dong, Budianto Budianto, Cihang Xie, Yan Jiao

TL;DR

This work proposes CAST (Context-Aware State Transition), a lightweight, plug-and-play adapter compatible with diverse frozen vision-language embedding spaces that improves performance on YouCook2 and CrossTask, remains competitive on COIN, and consistently outperforms zero-shot baselines across diverse foundation backbones.

Abstract

As video content creation shifts toward long-form narratives, composing short clips into coherent storylines becomes increasingly important. However, prevailing retrieval formulations remain context-agnostic at inference time, prioritizing local semantic alignment while neglecting state and identity consistency. To address this structural limitation, we formalize the task of Consistent Video Retrieval (CVR) and introduce a diagnostic benchmark spanning YouCook2, COIN, and CrossTask. We propose CAST (Context-Aware State Transition), a lightweight, plug-and-play adapter compatible with diverse frozen vision-language embedding spaces. By predicting a state-conditioned residual update ($\Delta$) from visual history, CAST introduces an explicit inductive bias for latent state evolution. Extensive experiments show that CAST improves performance on YouCook2 and CrossTask, remains competitive on COIN, and consistently outperforms zero-shot baselines across diverse foundation backbones. Furthermore, CAST provides a useful reranking signal for black-box video generation candidates (e.g., from Veo), promoting more temporally coherent continuations.

Paper Structure (43 sections, 12 equations, 8 figures, 11 tables)

Figures (8)

  • Figure 1: Given a context clip and instruction, standard retrieval often returns semantically relevant but temporally incoherent clips, yielding State Errors or Identity Errors. In contrast, CAST models the state transition ($\Delta$) to retrieve a causally plausible continuation and rerank generation candidates toward more coherent continuations.
  • Figure 2: Illustration of our CVR benchmark protocol. In contrast to standard global retrieval, our benchmark introduces State Negatives (temporally misaligned clips from the same video) and Identity Negatives (appearance-misaligned clips from different videos) to diagnose consistency failures beyond semantic matching.
  • Figure 3: Overview of the CAST adapter. CAST operates as a lightweight adapter that aggregates visual history $\mathcal{H}_t$, anchor state $v_{t-1}$, and instruction $q_t$. Through a dual-path transition predictor, it estimates a residual update $\Delta$ that encourages causally consistent state evolution while retaining identity cues through the residual connection.
  • Figure 4: Qualitative retrieval examples on the CVR benchmark. Given the same procedural context and instruction, context-agnostic retrieval often returns semantically relevant but temporally inconsistent clips, producing either a State Error or an Identity Error. By contrast, CAST retrieves the correct continuation in both cases by modeling state transitions conditioned on visual history.
  • Figure 5: Retrieval quality breakdown. We categorize top-1 retrieval outcomes as Exact Match (green), Identity Consistent but State-misaligned (blue), or Identity Inconsistent (gray). CAST yields identity-consistent outcomes in 81.0% of queries (green+blue), substantially improving over the text-only baseline. This differs from the Ident. Acc. metric in Table \ref{tab:ablation_inf}, which requires the ground-truth clip to rank above all identity negatives.
  • ...and 3 more figures
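
The captions of Figures 3 and 5 describe a residual state update and a three-way breakdown of top-1 outcomes. The following is a minimal sketch of both ideas, not the authors' implementation. Everything concrete in it is an assumption made for illustration: the mean pooling of the history, the single linear map `weight` (a stand-in for the dual-path transition predictor, whose architecture is not given here), the function names, and the use of cosine similarity for ranking. The outcome labels assume that a clip from the same video as the ground truth counts as identity consistent, in line with how Figure 2 defines State Negatives and Identity Negatives.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products act as cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def predict_next_state(history, anchor, instruction, weight):
    """Return normalize(anchor + delta) for one query.

    history:     (T, D) embeddings of earlier clips (H_t in Figure 3)
    anchor:      (D,)   embedding of the most recent clip (v_{t-1})
    instruction: (D,)   text embedding of the instruction (q_t)
    weight:      (3*D, D) hypothetical linear map; a stand-in for the
                 dual-path transition predictor, not the paper's design
    """
    pooled = history.mean(axis=0)  # assumption: mean-pool the visual history
    features = np.concatenate([pooled, anchor, instruction])
    delta = features @ weight  # residual update (Delta)
    # The residual connection keeps the anchor's identity cues in the query.
    return l2_normalize(anchor + delta)


def rank_candidates(query, candidates):
    """Return candidate indices sorted by descending cosine similarity."""
    scores = l2_normalize(candidates) @ query
    return np.argsort(-scores)


def top1_outcome(top1_id, target_id, video_of):
    """Label a top-1 result with one of the three Figure 5 categories.

    video_of maps a clip id to its source video id (hypothetical structure).
    """
    if top1_id == target_id:
        return "exact_match"
    if video_of[top1_id] == video_of[target_id]:
        return "identity_consistent_state_misaligned"
    return "identity_inconsistent"


# Toy usage with random embeddings; real inputs would come from a frozen
# vision-language backbone.
rng = np.random.default_rng(0)
dim = 8
history = rng.normal(size=(3, dim))
anchor = l2_normalize(rng.normal(size=dim))
instruction = l2_normalize(rng.normal(size=dim))
weight = rng.normal(scale=0.1, size=(3 * dim, dim))

query = predict_next_state(history, anchor, instruction, weight)
candidates = rng.normal(size=(5, dim))
ranking = rank_candidates(query, candidates)
```

With `weight` set to zero the update vanishes and the query reduces to the anchor embedding, which makes the role of the residual connection easy to check in isolation.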