Inference Compute-Optimal Video Vision Language Models

Peiqi Wang, ShengYun Peng, Xuewen Zhang, Hanchao Yu, Yibo Yang, Lifu Huang, Fujun Liu, Qifan Wang

TL;DR

This work tackles the problem of allocating inference compute across three scaling factors for video vision-language models, namely LM size $x_N$, frame count $x_T$, and visual tokens per frame $x_V$, under a fixed per-example compute budget $c$ and finetuning data size $n$. It combines large-scale training sweeps with a parametric add-interact model to characterize downstream task performance $f(x,n)$ and derives the compute-optimal frontier $x^*(c;n)$ via discrete optimization. The findings show diminishing returns for both the scaling factors and data size, demonstrate that joint scaling is necessary to reach optimal performance, and reveal task-dependent variations in the frontier, including the elasticity of $x_N$, $x_T$, and $x_V$ to changes in $n$. The results provide actionable guidelines for selecting inference configurations in video VLM deployment and underscore the importance of accounting for vision-encoder compute in overall FLOPs budgeting.

Abstract

This work investigates the optimal allocation of inference compute across three key scaling factors in video vision language models: language model size, frame count, and the number of visual tokens per frame. While prior work typically focuses on optimizing model efficiency or improving performance without considering resource constraints, we instead identify the optimal model configuration under fixed inference compute budgets. We conduct large-scale training sweeps and careful parametric modeling of task performance to identify the inference compute-optimal frontier. Our experiments reveal how task performance depends on scaling factors and finetuning data size, as well as how changes in data size shift the compute-optimal frontier. These findings translate to practical tips for selecting these scaling factors.
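
As a rough illustration of what finding the compute-optimal frontier $x^*(c; n)$ over a discrete domain can look like, the sketch below exhaustively searches a small grid of configurations $x = (x_N, x_T, x_V)$ and keeps the best one whose cost fits within the budget $c$. Everything concrete here is an assumption made for illustration: the grid values, the cost function `cost` (a common "2 FLOPs per parameter per token" approximation plus a made-up per-frame vision-encoder cost), and `toy_performance`, which is a placeholder with diminishing returns and is not the paper's add-interact model.

```python
from itertools import product

# Hypothetical discrete domain; these values are illustrative only.
LM_SIZES = [0.5e9, 1e9, 3e9, 7.5e9]     # x_N: LM parameter count
FRAME_COUNTS = [4, 8, 16, 32]           # x_T: frames per video
TOKENS_PER_FRAME = [16, 49, 100, 196]   # x_V: visual tokens per frame

# Assumed vision-encoder cost per frame (FLOPs); not taken from the paper.
ENCODER_FLOPS_PER_FRAME = 0.1e12


def cost(x):
    """Assumed per-example inference cost c(x) in FLOPs.

    LM cost uses the rough 2 * parameters * tokens rule; the vision
    encoder adds a fixed cost for every frame it processes.
    """
    x_n, x_t, x_v = x
    return 2.0 * x_n * x_t * x_v + x_t * ENCODER_FLOPS_PER_FRAME


def toy_performance(x, n):
    """Placeholder for f(x, n): higher is better, with diminishing returns."""
    x_n, x_t, x_v = x
    loss = (
        2.0 * (x_n / 1e9) ** -0.3
        + 1.5 * x_t ** -0.4
        + 1.5 * x_v ** -0.3
        + 3.0 * (n / 1e3) ** -0.2
    )
    return -loss


def frontier(budget, n):
    """Return argmax of f(x, n) over configurations with c(x) <= budget.

    Returns None when no configuration in the grid fits the budget.
    """
    feasible = [
        x
        for x in product(LM_SIZES, FRAME_COUNTS, TOKENS_PER_FRAME)
        if cost(x) <= budget
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda x: toy_performance(x, n))


if __name__ == "__main__":
    for budget in (2e12, 5e12, 15e12, 30e12):
        print(f"{budget / 1e12:>4.0f} TFLOPs ->", frontier(budget, n=1e5))
```

Exhaustive search is viable here only because the grid is tiny; with a fitted performance model in place of `toy_performance`, the same loop would trace out a predicted frontier as the budget grows.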

Paper Structure

This paper contains 44 sections, 16 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: IsoPerformance Contours. Contours show average task performance as a function of a scaling factor (e.g., $x_N$, $x_T$, or $x_V$) and finetuning data size $n$, derived from the star sweep. As detailed in Section \ref{sec:training_sweeps}, we construct the star sweep by starting with an inference compute-intensive "center" $x^{\bigstar} = (7.5\text{B}, 32, 196)$, varying one factor at a time while keeping the others fixed, and finetuning on different data sizes. For instance, in the left subfigure, each dot represents an LM with $x_N$ parameters finetuned on $n$ examples, with $x_T = 32$ and $x_V = 196$ fixed. Performance improves as scaling factors and $n$ increase, albeit at a diminishing rate. Irregularities in the contour lines, particularly near boundaries, arise from interpolation artifacts (via matplotlib.pyplot.contourf) and variability in benchmark scores across finetuning runs.
  • Figure 2: IsoFLOP Curves and Compute-Optimal Frontier. IsoFLOP curves (dotted lines) show task performance (color-coded) for models with fixed inference compute cost $c(x)$ across four TFLOP budgets: 2, 5, 15, and 30. The compute-optimal frontier (solid line) connects models with the best average task performance. Both are derived from the isoFLOP sweep described in Section \ref{sec:training_sweeps}. The compute-optimal frontier reveals that optimal performance requires scaling both $x_T$ and $x_V$ together. Moreover, at 30 TFLOPs, a model with $x_N = 7.5$B outperforms one with $x_N = 1$B, as smaller LMs cannot effectively make use of higher compute budgets (e.g., increasing from 15 to 30 TFLOPs yields minimal gain), highlighting the bottleneck imposed by LM size. These findings underscore the importance of jointly scaling $x_N$, $x_T$, and $x_V$ to maximize performance.
  • Figure 3: Parametric Fitting of Task Performance. (Left) Box plot of bootstrap-resampled parameter estimates (100 resamples) for the add-interact model (defined in Equation \ref{eq:add_interact_functional_form_multiple_factors}), highlighting the challenge of fitting a model to only $\sim$100 examples. (Center) Scatter plot comparing the predicted average task performance ("Metrics/Avg") with the actual performance for each run in the star and isoFLOP sweeps. The add-interact model achieves a strong fit to the data. (Right) Bar plot illustrating add-interact's extrapolation performance on isoFLOP data after being trained on star data across various video tasks. While it extrapolates well for Metrics/Avg (Avg), it struggles on tasks such as LongVideoBench (LVB) and Next-QA (NQA).
  • Figure 4: Predicted Compute-Optimal Frontier for Video VLMs. The left three subplots show the predicted inference compute-optimal frontier $x^*(c; n)$ for key scaling factors $x$ of video VLMs, across varying finetuning data sizes $n$ (shades of blue). The blue text indicates the increase in $x^*$ as inference compute grows from $2$T to $100$T FLOPs. Task performance $f(x, n)$ is modeled using the bagged add-interact model, which identifies an efficiency frontier requiring joint scaling of $(x_N, x_T, x_V)$ at varying rates. This frontier is non-monotonic due to the discrete domain $\mathcal{X}$ of $x$. The rightmost subplot depicts the elasticity (defined in Equation \ref{eq:elasticity_definition_as_function_of_data_size}) for each factor $k \in \{N, T, V\}$, quantifying the sensitivity of $x_k^*$ to changes in $n$. For instance, as $n$ increases, the frontier $x_T^*(c)$ shifts upward (in darker blue), corresponding to a positive $e_T(n)$ (red curve) in the elasticity plot.
  • Figure 5: Elasticity Across Tasks. Bar plot showing the elasticity (defined in Equation \ref{eq:elasticity_definition_as_function_of_data_size_and_compute_budget}) for scaling factors $k \in \{N, T, V\}$ across video tasks. This measures the sensitivity of optimal scaling factors $x_k^*$ to changes in data size $n$. While there is significant task-specific variation, the general trend suggests decreasing $x_N$ and increasing $x_T, x_V$ as data size $n$ grows.
  • ...and 5 more figures
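
The elasticity plotted in Figures 4 and 5 measures how sensitive an optimal scaling factor $x_k^*$ is to the finetuning data size $n$. The paper's exact definitions are given in equations that are not reproduced here, so the sketch below assumes the usual log-log notion of elasticity, i.e. the slope of $\log x_k^*$ against $\log n$, estimated by ordinary least squares. The function name `log_log_elasticity` and the sample values are hypothetical, and the paper's definition may differ (for example, in how it aggregates over compute budgets).

```python
import math


def log_log_elasticity(n_values, x_star_values):
    """Least-squares slope of log(x*) against log(n).

    A positive slope means the optimal factor grows with data size;
    a negative slope means it shrinks. This is an assumed definition,
    not necessarily the one used in the paper.
    """
    if len(n_values) != len(x_star_values) or len(n_values) < 2:
        raise ValueError("need at least two matching (n, x*) pairs")
    log_n = [math.log(n) for n in n_values]
    log_x = [math.log(x) for x in x_star_values]
    mean_n = sum(log_n) / len(log_n)
    mean_x = sum(log_x) / len(log_x)
    covariance = sum((a - mean_n) * (b - mean_x) for a, b in zip(log_n, log_x))
    variance = sum((a - mean_n) ** 2 for a in log_n)
    if variance == 0.0:
        raise ValueError("n_values must not all be equal")
    return covariance / variance


if __name__ == "__main__":
    # Made-up frontier values at one compute budget, for illustration only.
    data_sizes = [1e4, 1e5, 1e6]
    optimal_frames = [8, 16, 32]      # x_T* rising with n
    optimal_lm_size = [3e9, 3e9, 1e9] # x_N* falling with n
    print("e_T:", log_log_elasticity(data_sizes, optimal_frames))
    print("e_N:", log_log_elasticity(data_sizes, optimal_lm_size))
```

Under this assumed definition, a positive value for $x_T$ and a negative value for $x_N$ would match the general trend described in the Figure 5 caption.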