Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model

Keunwoo Peter Yu, Achal Dave, Rares Ambrus, Jean Mercat

TL;DR

Espresso tackles the token-explosion problem of long-form video understanding in vision-language models by introducing a fixed-length projector that explicitly disentangles spatial and temporal information. The architecture uses a frame-wise ViT followed by two poolers: a temporal pooler that aggregates across frames for each patch (yielding spatial features) and a spatial pooler that aggregates across patches for each frame (yielding temporal features). Each pooled stream passes through a Q-Former-based compressor that yields a fixed-length sequence, of size $p$ for the spatial stream and $t$ for the temporal stream; the two sequences are concatenated and projected into the LLM input. The approach also supports splitting the video into $n$ segments, so the total token budget remains fixed at $n(p+t)$. Empirical results show Espresso outperforms pooling-based and Perceiver-style projectors on long-form benchmarks like EgoSchema and NH-EgoSchema, with segment-wise processing providing scalable benefits; a two-stage training strategy (Espressoalign) enhances short-form performance by leveraging Panda-70M data. The findings suggest fixed-length projectors, when architecturally biased to separate spatial and temporal processing and combined with segmentation, can rival pooling methods in efficiency while enabling robust long-form reasoning in streaming or embodied settings.
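To make the fixed budget concrete, here is a small arithmetic sketch. The function names and the sizes used (4 segments, 32 + 32 tokens, 576 patches per frame) are hypothetical values chosen for illustration, not settings reported by the authors; the uncompressed baseline simply assumes every patch token of every frame is passed through.

```python
def espresso_token_budget(n: int, p: int, t: int) -> int:
    """Tokens handed to the LLM: n segments, each with p spatial + t temporal tokens."""
    return n * (p + t)


def uncompressed_tokens(num_frames: int, patches_per_frame: int) -> int:
    """Baseline with no compression: every patch token of every frame is kept."""
    return num_frames * patches_per_frame


# Hypothetical sizes, chosen only for illustration.
n, p, t = 4, 32, 32
for num_frames in (16, 64, 256):
    raw = uncompressed_tokens(num_frames, patches_per_frame=576)
    print(num_frames, raw, espresso_token_budget(n, p, t))
```

The uncompressed count grows linearly with the number of frames, while the projected count depends only on $n$, $p$, and $t$.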

Abstract

Recent advances in vision-language models (VLMs) have shown great promise in connecting images and text, but extending these models to long videos remains challenging due to the rapid growth in token counts. Models that compress videos by local aggregation in time or space have become popular for handling long-form inputs; however, these pooling-based projectors sacrifice the benefits of fixed-length representations that are crucial for streaming and efficient video understanding. We introduce $\texttt{Espresso}$, a new architecture that separately compresses spatial and temporal features into fixed-length sequences. $\texttt{Espresso}$ enables efficient video encoding while maintaining strong long-form reasoning capabilities. Experiments show that fixed-length compression combined with segment-wise processing offers a scalable and competitive alternative to pooling-based approaches. Our results demonstrate that fixed-length projectors, when properly designed and trained, remain a viable foundation for video-language modeling.

Paper Structure

This paper contains 23 sections, 3 figures, 9 tables.

Figures (3)

  • Figure 1: Overview of the canonical video language model framework. Each of the $T$ frames is first independently encoded by a frozen ViT. The encoded frames are fed to a projector that may compress, pool, or simply embed the input into the LLM input space. A pre-trained LLM, either frozen or fine-tuned, uses the projected input and a question about the video to output an answer.
  • Figure 2: Overview of various projector architectures for VLMs.
  • Figure 3: Overview of the Espresso projector. A frozen ViT encodes each frame independently. The temporal pooler aggregates features across frames for each patch (producing spatial features), and the spatial pooler aggregates features across patches for each frame (producing temporal features). These are further compressed into fixed-length token sequences of size $p$ and $t$ respectively, then mapped into the LLM input space via an MLP. The process may optionally be repeated across $n$ segments, resulting in a token sequence of length $n(p + t)$. Note that while $P$ and $T$ are variable, $n$, $p$, and $t$ are fixed.
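
The data flow in the Figure 3 caption can be sketched at the level of shapes. This is a toy illustration under several assumptions, not the authors' implementation: mean pooling stands in for both poolers, a single cross-attention step with fixed query vectors stands in for the Q-Former-based compressors, the final MLP is omitted, segments are contiguous and roughly equal in length, and plain Python lists replace tensors to keep the example dependency-free. All function names are hypothetical.

```python
import math
import random


def mean_pool(vectors):
    """Average equal-length vectors into one (stand-in for a learned pooler)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cross_attend(queries, feats):
    """Single-head cross-attention; feats act as both keys and values."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / scale for f in feats]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * f[j] for w, f in zip(weights, feats)) / total
                    for j in range(len(q))])
    return out


def project_segment(frames, spatial_queries, temporal_queries):
    """frames: [T][P][D] ViT features -> p + t tokens of width D."""
    # Temporal pooler: aggregate across frames for each patch -> P spatial features.
    spatial_feats = [mean_pool(patch_over_time) for patch_over_time in zip(*frames)]
    # Spatial pooler: aggregate across patches for each frame -> T temporal features.
    temporal_feats = [mean_pool(frame) for frame in frames]
    # Compress each stream to a fixed length, then concatenate (MLP omitted).
    return (cross_attend(spatial_queries, spatial_feats)
            + cross_attend(temporal_queries, temporal_feats))


def espresso_project(frames, spatial_queries, temporal_queries, n=1):
    """Split T frames into n contiguous segments; returns n * (p + t) tokens."""
    T = len(frames)
    if T < n:
        raise ValueError("need at least one frame per segment")
    tokens = []
    for i in range(n):
        segment = frames[i * T // n:(i + 1) * T // n]
        tokens += project_segment(segment, spatial_queries, temporal_queries)
    return tokens


def random_features(*shape):
    """Nested lists of Gaussian noise with the given shape (toy data)."""
    if len(shape) == 1:
        return [random.gauss(0, 1) for _ in range(shape[0])]
    return [random_features(*shape[1:]) for _ in range(shape[0])]


random.seed(0)
D, p, t, n = 4, 3, 2, 2  # hypothetical sizes
spatial_q, temporal_q = random_features(p, D), random_features(t, D)
for T in (8, 20):  # vary the number of frames; P = 6 patches per frame
    tokens = espresso_project(random_features(T, 6, D), spatial_q, temporal_q, n=n)
    print(T, len(tokens))
```

Changing the number of frames $T$ (or patches $P$) changes only the inputs to the compressors; the output length stays at $n(p + t)$, which is the property the caption highlights.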