Make Your Training Flexible: Towards Deployment-Efficient Video Models

Chenting Wang, Kunchang Li, Tianxiang Jiang, Xiangyu Zeng, Yi Wang, Limin Wang

TL;DR

Video models suffer from redundancy and poor budget adaptability due to fixed-grid token sampling. The authors propose Token Optimization with Flux, a framework combining flexible sampling and token selection to maximize information under any budget, enabling deployment-efficient training and inference. Flux-UMT pretraining and FluxViT with Global-Local Positional Embedding and Dual Patch Normalization achieve state-of-the-art results across K400, SSv2, COIN, and retrieval tasks, while dramatically reducing computation by using fewer tokens when needed. The approach demonstrates robustness across single- and multi-modal tasks, including chat-centric benchmarks, highlighting its practical impact for real-world video understanding systems.

Abstract

Popular video training methods mainly operate on a fixed number of tokens sampled from a predetermined spatiotemporal grid, resulting in sub-optimal accuracy-computation trade-offs due to inherent video redundancy. They also lack adaptability to varying computational budgets for downstream tasks, hindering the application of the most competitive models in real-world scenarios. We thus propose a new test setting, Token Optimization, for maximized input information across budgets, which optimizes the size-limited set of input tokens through token selection from more suitably sampled videos. To this end, we propose a novel augmentation tool termed Flux. By making the sampling grid flexible and leveraging token selection, it is easily adopted in most popular video training frameworks, boosting model robustness at nearly no additional cost. We integrate Flux in large-scale video pre-training, and the resulting FluxViT establishes new state-of-the-art results across extensive tasks at standard costs. Notably, with only 1/4 of the tokens, it can still match the performance of previous state-of-the-art models with Token Optimization, yielding nearly 90% savings. All models and data are available at https://github.com/OpenGVLab/FluxViT.
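The idea of drawing a fixed token budget from a flexibly sampled video can be illustrated with a small sketch. This is not the authors' implementation: the function name `select_tokens`, the patch size of 14, and the frame-difference score used to rank tokens are all assumptions made for illustration, standing in for whatever selector the released code actually uses.

```python
import numpy as np


def select_tokens(video, patch=14, budget=2048):
    """Keep the `budget` patch tokens that change most between frames.

    video: float array of shape (T, H, W), a grayscale clip (hypothetical input).
    Returns (tokens, indices): tokens has shape (k, patch * patch) and indices
    are flat positions into the (T * tokens_per_frame) grid, in ascending order.
    """
    T, H, W = video.shape
    if T < 2:
        raise ValueError("need at least two frames to measure change")
    gh, gw = H // patch, W // patch
    # Crop to a whole number of patches, then cut each frame into patch tokens.
    v = video[:, : gh * patch, : gw * patch]
    tokens = (
        v.reshape(T, gh, patch, gw, patch)
        .transpose(0, 1, 3, 2, 4)
        .reshape(T, gh * gw, patch * patch)
    )
    # Assumed score: mean absolute difference to the previous frame.
    # Frame 0 has no predecessor, so it is compared with frame 1 instead.
    prev = np.concatenate([tokens[1:2], tokens[:-1]], axis=0)
    scores = np.abs(tokens - prev).mean(axis=-1).reshape(-1)
    k = min(budget, scores.size)
    # argpartition puts the k highest-scoring tokens first without a full sort.
    keep = np.sort(np.argpartition(-scores, k - 1)[:k])
    return tokens.reshape(-1, patch * patch)[keep], keep


# Flexible sampling: the clip can be denser or larger than the budget allows,
# because selection trims it back to a fixed token count afterwards.
rng = np.random.default_rng(0)
clip = rng.random((16, 224, 224))  # 16 frames at 224x224 gives 4096 tokens
kept, idx = select_tokens(clip, patch=14, budget=2048)
print(kept.shape)
```

Under these assumptions the model always receives the same number of tokens, while the sampling grid that produced them (frame count and resolution) is free to vary per clip.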

Paper Structure

This paper contains 41 sections, 1 equation, 9 figures, 18 tables.

Figures (9)

  • Figure 1: Flux (right) employs flexible sampling and token selection to achieve Token Optimization. Common methods (left) use rigid sampling (and use token reduction for applications directly).
  • Figure 2: Overview of our Flux method. The same-scaled FluxViT and InternVideo2 series models are both pre-trained with the InternVideo2-1b model as the teacher using the same dataset. The "FluxViT+" refers to the results using Token Optimization at test time with the same GFLOPS.
  • Figure 3: Our proposed Flux method with UMT framework. We show that our proposed Flux training is easy to integrate with mainstream video training frameworks.
  • Figure 4: Our proposed essential modules for Flux. From the model side, Flux modules include Group-dynamic token selector, dual patch norm, and Global-Local positional embedding.
  • Figure 5: Comparison between different training methods on K400 using a fixed number of 2048 tokens. Note that the three lines and all the points share similar training and inference costs. The shaded part shows results for the AnyRes Distilled AnyRes Tuned model with spatial resolution in the range (196, 252), while the others use a fixed spatial resolution of 224.
  • ...and 4 more figures
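
The fixed budget of 2048 tokens in the Figure 5 caption can be related to different sampling grids with a little arithmetic. The sketch below assumes a patch size of 14 and no temporal patching (tubelet size 1); neither value is stated in this section, and the helper names `grid_tokens` and `keep_ratio` are hypothetical.

```python
def grid_tokens(frames, resolution, patch=14, tubelet=1):
    """Number of tokens produced by a frames x resolution x resolution grid.

    patch and tubelet are assumed values, not taken from the paper.
    """
    if resolution % patch or frames % tubelet:
        raise ValueError("grid must divide evenly into patches")
    per_side = resolution // patch
    return (frames // tubelet) * per_side * per_side


def keep_ratio(frames, resolution, budget=2048, patch=14, tubelet=1):
    """Fraction of tokens that selection must keep to meet the budget."""
    total = grid_tokens(frames, resolution, patch, tubelet)
    return min(1.0, budget / total)


# Compare a few grids against the same 2048-token budget.
for frames, resolution in [(8, 224), (16, 224), (8, 196), (16, 252)]:
    total = grid_tokens(frames, resolution)
    ratio = keep_ratio(frames, resolution)
    print(f"{frames:>2} frames @ {resolution}: {total} tokens, keep {ratio:.2f}")
```

Under these assumptions, 8 frames at 224 already produce exactly 2048 tokens, so a denser or larger grid reaches the same budget only by discarding part of its tokens through selection.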