Dynamic Reflections: Probing Video Representations with Text Alignment

Tyler Zhu, Tengda Han, Leonidas Guibas, Viorica Pătrăucean, Maks Ovsjanikov

TL;DR

This work extends the Platonic Representation Hypothesis to the temporal domain by systematically probing video-text alignment across 121 encoders. It introduces a test-time framework that scales visual context (frames) and textual context (captions) and measures cross-modal similarity via mutual $k$-NN, optimizing over encoder layers. The authors demonstrate substantial gains in alignment with richer test-time data and provide a saturation-based scaling law that accurately predicts these gains, with VideoMAEv2 often delivering the strongest alignment. Crucially, they show that stronger video-text alignment correlates with downstream performance on semantic and non-semantic tasks, enabling a scalable zero-shot metric for evaluating spatio-temporal representations and highlighting current limitations and avenues for future work in temporal modeling and generative video representations.
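The mutual $k$-NN measure mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes paired feature matrices (row $i$ of each describes the same video/caption pair), cosine similarity for neighbor search, and the hypothetical function name `mutual_knn`. The summary does not specify the similarity function or the value of $k$, so both are assumptions here.

```python
import numpy as np


def mutual_knn(feats_a, feats_b, k=10):
    """Average fraction of shared k-nearest neighbors between two feature spaces.

    feats_a: (n, d_a) array, e.g. video embeddings (assumed layout).
    feats_b: (n, d_b) array, e.g. text embeddings for the same n samples.
    Returns a score in [0, 1]; 1 means identical neighborhoods.
    """
    feats_a = np.asarray(feats_a, dtype=np.float64)
    feats_b = np.asarray(feats_b, dtype=np.float64)
    if feats_a.shape[0] != feats_b.shape[0]:
        raise ValueError("both feature sets must describe the same samples")
    if not 0 < k < feats_a.shape[0]:
        raise ValueError("k must be in [1, n - 1]")

    # Cosine similarity within each modality (assumption: cosine, not Euclidean).
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    sim_a = a @ a.T
    sim_b = b @ b.T

    # A sample is never its own neighbor.
    np.fill_diagonal(sim_a, -np.inf)
    np.fill_diagonal(sim_b, -np.inf)

    # Indices of the k most similar samples for every row.
    nn_a = np.argsort(-sim_a, axis=1)[:, :k]
    nn_b = np.argsort(-sim_b, axis=1)[:, :k]

    # Count neighbors shared by both modalities, then normalize by k.
    overlap = [len(set(row_a) & set(row_b)) for row_a, row_b in zip(nn_a, nn_b)]
    return float(np.mean(overlap)) / k
```

To mirror the "optimizing over encoder layers" step, one could compute this score for every pair of (video layer, text layer) features and keep the maximum; how the paper performs that search is not detailed in this summary.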

Abstract

The alignment of representations from different modalities has recently been shown to provide insights into the structural similarities and downstream capabilities of different encoders across diverse data types. While significant progress has been made in aligning images with text, the temporal nature of video data remains largely unexplored in this context. In this work, we conduct the first comprehensive study of video-text representation alignment, probing the capabilities of modern video and language encoders. Our findings reveal several key insights. First, we demonstrate that cross-modal alignment depends strongly on the richness of both the visual (static images vs. multi-frame videos) and text (single caption vs. a collection) data provided at test time, especially when using state-of-the-art video encoders. We propose parametric test-time scaling laws that capture this behavior and show remarkable predictive power against empirical observations. Second, we investigate the correlation between semantic alignment and performance on both semantic and non-semantic downstream tasks, providing initial evidence that strong alignment against text encoders may be linked to general-purpose video representation and understanding. Finally, we correlate temporal reasoning with cross-modal alignment, providing a challenging test-bed for vision and language models. Overall, our work introduces video-text alignment as an informative zero-shot way to probe the representation power of different encoders for spatio-temporal data. The project page can be found at https://video-prh.github.io/

Paper Structure

This paper contains 17 sections, 2 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: Scaling both the number of video frames and text captions at test time improves alignment. Given a paired video $v_i$ and set of captions $c_i$, leveraging rich multi-frame and multi-caption information at test time leads to improved alignment, measured in terms of mutual $k$-NN.
  • Figure 2: Video representations are strongly aligned with text. We measure alignment between modern vision encoders and Gemma 2 9B-it (subset of all vision models shown for clarity) on the VATEX video dataset using a single caption for each video. In addition to the points, we plot a linear regression of alignment scores to representation strength for three different LLMs. Our takeaways are threefold: (1) The strongest models for both alignment and retrieval are large video models. (2) Averaging over frames is a simple yet effective baseline for image models on video input, while image-only models are limited to scores below 22%, in line with Huh et al. (2024). (3) Recent text models have improved alignment with vision models, shown by the regression lines.
  • Figure 3: Vision-text alignment scales strongly with the amount of visual and textual data available at test time. (left) Providing more frames increases vision-text alignment for both image and video models, with the latter being able to take advantage of more frames more effectively. (right) Providing more captions to the text model also significantly boosts alignment with vision models, across all frame counts.
  • Figure 4: Correlation between video/text alignment and downstream video perception task performance, for SSL methods trained without text supervision.
  • Figure 5: Video and text are eventually aligned, but encode temporal information differently. While video-text alignment is near-perfect for all of the models once $k=3$, they differ heavily at $k=1,2$. Text models often behave like bag-of-words models, while video models vary.
  • ...and 7 more figures
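
Takeaway (2) of Figure 2 describes averaging over frames as a baseline for applying image models to video input. A minimal sketch of that pooling step is given below, assuming per-frame embeddings are already stacked in an array of shape (videos, frames, dim). The function name `average_frame_features` and the array layout are hypothetical, and the summary does not say whether features are normalized before or after averaging, so no normalization is applied here.

```python
import numpy as np


def average_frame_features(frame_feats):
    """Pool per-frame image embeddings into one embedding per video.

    frame_feats: array of shape (num_videos, num_frames, dim), assumed layout.
    Returns an array of shape (num_videos, dim), the mean over the frame axis.
    """
    frame_feats = np.asarray(frame_feats, dtype=np.float64)
    if frame_feats.ndim != 3:
        raise ValueError("expected shape (num_videos, num_frames, dim)")
    return frame_feats.mean(axis=1)
```

The pooled video embeddings can then be compared against text embeddings with any alignment score, such as the mutual $k$-NN measure used in the paper.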