Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model
Keunwoo Peter Yu, Achal Dave, Rares Ambrus, Jean Mercat
TL;DR
Espresso tackles the token-explosion problem of long-form video understanding in vision-language models by introducing a fixed-length projector that explicitly disentangles spatial and temporal information. The architecture encodes each frame with a ViT, then applies a spatial pooler (aggregating across time) and a temporal pooler (aggregating across space). Each pooler is followed by a Q-Former-based compressor, yielding fixed-length sequences of size $p$ and $t$ that are concatenated and projected into the LLM input. The approach also supports segmentation: for $n$ segments, the total token budget remains fixed at $n(p+t)$. Empirically, Espresso outperforms pooling-based and Perceiver-style projectors on long-form benchmarks such as EgoSchema and NH-EgoSchema, and segment-wise processing provides scalable benefits. A two-stage training strategy (Espressoalign) improves short-form performance by leveraging Panda-70M data. The findings suggest that fixed-length projectors, when architecturally biased to separate spatial and temporal processing and combined with segmentation, can rival pooling methods in efficiency while enabling robust long-form reasoning in streaming or embodied settings.
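To make the fixed $n(p+t)$ budget concrete, the following is a minimal NumPy sketch of the shape bookkeeping only, not the authors' implementation. It assumes mean pooling for both poolers and stands in for each Q-Former-based compressor with a single cross-attention step over hypothetical query arrays; the final projection into the LLM input is omitted. The names `cross_attend`, `compress_segment`, and `encode_video` are illustrative and do not come from the paper.

```python
import numpy as np

def cross_attend(queries, features):
    """Single-head cross-attention: (q, d) queries attend over (m, d) features."""
    scores = queries @ features.T / np.sqrt(features.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ features  # (q, d): output length is set by the queries

def compress_segment(frame_feats, spatial_queries, temporal_queries):
    """frame_feats: (frames, patches, d) per-frame ViT features for one segment."""
    # Assumption: mean pooling. Pooling across time keeps one vector per patch.
    spatial = frame_feats.mean(axis=0)   # (patches, d)
    # Pooling across space keeps one vector per frame.
    temporal = frame_feats.mean(axis=1)  # (frames, d)
    # Stand-in for the Q-Former-based compressors: fixed-size query sets.
    s = cross_attend(spatial_queries, spatial)     # (p, d)
    t = cross_attend(temporal_queries, temporal)   # (t, d)
    return np.concatenate([s, t], axis=0)          # (p + t, d)

def encode_video(frame_feats, n_segments, spatial_queries, temporal_queries):
    """Split frames into n segments and compress each to p + t tokens."""
    segments = np.array_split(frame_feats, n_segments, axis=0)
    tokens = [compress_segment(seg, spatial_queries, temporal_queries)
              for seg in segments]
    return np.concatenate(tokens, axis=0)          # (n * (p + t), d)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, p, t, n = 16, 8, 4, 4
    sq, tq = rng.normal(size=(p, d)), rng.normal(size=(t, d))
    for num_frames in (32, 128):
        video = rng.normal(size=(num_frames, 49, d))
        print(num_frames, encode_video(video, n, sq, tq).shape)
```

In this sketch the output length depends only on $n$, $p$, and $t$, so a 32-frame and a 128-frame input produce the same number of tokens, whereas a projector that pools locally emits more tokens as the frame count grows.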
Abstract
Recent advances in vision-language models (VLMs) have shown great promise in connecting images and text, but extending these models to long videos remains challenging due to the rapid growth in token counts. Models that compress videos by local aggregation in time or space have become popular for handling long-form inputs; however, these pooling-based projectors sacrifice the benefits of fixed-length representations that are crucial for streaming and efficient video understanding. We introduce $\texttt{Espresso}$, a new architecture that separately compresses spatial and temporal features into fixed-length sequences. $\texttt{Espresso}$ enables efficient video encoding while maintaining strong long-form reasoning capabilities. Experiments show that fixed-length compression combined with segment-wise processing offers a scalable and competitive alternative to pooling-based approaches. Our results demonstrate that fixed-length projectors, when properly designed and trained, remain a viable foundation for video-language modeling.
