Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy

Yunhang Shen; Chaoyou Fu; Shaoqi Dong; Xiong Wang; Yi-Fan Zhang; Peixian Chen; Mengdan Zhang; Haoyu Cao; Ke Li; Shaohui Lin; Xiawu Zheng; Yan Zhang; Yiyi Zhou; Ran He; Caifeng Shan; Rongrong Ji; Xing Sun

TL;DR

Long-VITA addresses the limited long-context capability of open-source multi-modal models with a four-stage training strategy that scales the context from short sequences to 1M tokens, paired with an architecture that combines a high-resolution vision encoder, a visual projector, and a large language model. It is trained exclusively on open-source data, including the newly released Comic-9K and MovieNet-Summary datasets, and uses context-parallelism inference and a logits-masked LM head to handle infinitely long inputs efficiently. On image and video benchmarks, it is competitive with or matches the state of the art among open models, with strong gains on long-video tasks, and its inference designs yield a 2x prefill speedup and 4x context length extension on a single node with 8 GPUs. The work provides a reproducible baseline for open long-context multi-modal research and points to practical ways of extending long-context capabilities with existing and future data.

Abstract

We introduce Long-VITA, a simple yet effective large multi-modal model for long-context visual-language understanding tasks. It is adept at concurrently processing and analyzing image, video, and text modalities over 4K frames or 1M tokens while delivering advanced performance on short-context multi-modal tasks. We propose an effective multi-modal training schema that starts with large language models and proceeds through vision-language alignment, general knowledge learning, and two sequential stages of long-sequence fine-tuning. We further implement context-parallelism distributed inference and a logits-masked language modeling head to scale Long-VITA to infinitely long inputs of images and texts during model inference. Regarding training data, Long-VITA is built on a mix of 17M samples from public datasets only and demonstrates state-of-the-art performance on various multi-modal benchmarks, compared against recent cutting-edge models that use internal data. Long-VITA is fully open-source and reproducible. By leveraging our inference designs, Long-VITA models achieve a remarkable 2x prefill speedup and 4x context length extension on a single node with 8 GPUs. We hope Long-VITA can serve as a competitive baseline and offer valuable insights for the open-source community in advancing long-context multi-modal understanding.
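
The abstract does not spell out how the logits-masked language modeling head works. One plausible reading, taken here as an assumption, is that during prefill only the final position's hidden state is projected to vocabulary logits, since that is the only row needed to predict the next token. The minimal NumPy sketch below illustrates that idea with toy sizes; the function names (`lm_head_full`, `lm_head_last_only`) and all dimensions are hypothetical and are not taken from the Long-VITA code.

```python
import numpy as np


def lm_head_full(hidden, w_out):
    """Baseline: project every position to vocabulary logits."""
    return hidden @ w_out  # shape: (seq_len, vocab_size)


def lm_head_last_only(hidden, w_out):
    """Assumed masked variant: project only the final position."""
    return hidden[-1:] @ w_out  # shape: (1, vocab_size)


# Toy sizes chosen for illustration only; they are not the model's.
rng = np.random.default_rng(0)
seq_len, d_model, vocab_size = 512, 64, 1000
hidden = rng.standard_normal((seq_len, d_model))
w_out = rng.standard_normal((d_model, vocab_size))

full = lm_head_full(hidden, w_out)
last = lm_head_last_only(hidden, w_out)

# Both variants agree on the next-token prediction.
next_token_full = int(full[-1].argmax())
next_token_last = int(last[0].argmax())

# The logits buffer shrinks by a factor equal to the sequence length.
logit_memory_ratio = full.nbytes // last.nbytes
```

Under this reading, the saving grows linearly with the prompt length, which is why it matters most at the long-context end: the full logits matrix has one row per input token, while the masked variant keeps a single row regardless of how long the prompt is.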
