Efficient Test-Time Scaling for Small Vision-Language Models

Mehmet Onurcan Kaya, Desmond Elliott, Dim P. Papadopoulos

TL;DR

This work addresses the efficiency-performance trade-off in small vision-language models with two inference-time strategies: Test-Time Augmentation (TTAug) and Test-Time Adaptation (TTAdapt). Both operate on internal model representations and require no external supervision. TTAug aggregates token-level predictions across diverse input augmentations; TTAdapt fine-tunes the model during inference on consensus-derived pseudolabels. Across nine benchmarks and multiple architectures, the approach yields consistent improvements at modest computational overhead, and token-level aggregation with greedy decoding proves more practical and effective than traditional temperature-based sampling. The results generalize across model families and provide deployment guidance on augmentation counts, aggregation strategies, and task-dependent adaptation intensity for VLM inference on resource-constrained devices.
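The token-level aggregation with greedy decoding mentioned above can be illustrated with a minimal sketch. It rests on assumptions not confirmed by this summary: the aggregation rule here is a plain average of next-token probabilities, and `aggregate_next_token`, `ttaug_decode`, and `toy_next_logits` are hypothetical names, with the toy function standing in for a real VLM forward pass on one augmented input.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_next_token(per_view_logits):
    """Average next-token probabilities over views, then pick the argmax.

    Assumption: simple probability averaging; the paper may aggregate differently.
    """
    probs = [softmax(logits) for logits in per_view_logits]
    vocab = len(probs[0])
    mean = [sum(p[i] for p in probs) / len(probs) for i in range(vocab)]
    token = max(range(vocab), key=lambda i: mean[i])
    return token, mean

def ttaug_decode(next_logits_fn, views, eos_id, max_new_tokens=8):
    """Greedy decoding in which all augmented views share one generated prefix."""
    prefix = []
    for _ in range(max_new_tokens):
        token, _ = aggregate_next_token([next_logits_fn(v, prefix) for v in views])
        if token == eos_id:
            break
        prefix.append(token)
    return prefix

# Hypothetical stand-in for a VLM: each "view" is a list of per-step logits.
def toy_next_logits(view, prefix):
    return view[len(prefix)]

toy_views = [
    [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]],
    [[1.5, 0.5, 0.0], [0.0, 0.0, 2.0]],
    [[0.0, 2.0, 0.0], [0.0, 1.0, 1.5]],  # this view disagrees at step 0
]

decoded = ttaug_decode(toy_next_logits, toy_views, eos_id=2)
```

In this toy run, two of the three views prefer token 0 at the first step, so the averaged distribution selects token 0 even though one view disagrees. Decoding then stops at the end-of-sequence token. Because a single token is chosen per step, no sampling temperature is involved.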

Abstract

Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models, at the cost of weaker generalization abilities and downstream task performance. These shortcomings could be addressed by test-time scaling techniques, but existing methods are typically computationally demanding, contradicting the resource-efficient design goals of small models. To address these limitations, we propose two novel and efficient test-time scaling strategies that leverage the model-internal features rather than external supervision: (i) Test-Time Augmentation (TTAug), which generates multiple augmented inputs and aggregates outputs at the token level without parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model parameters during inference using consensus-based pseudolabels from TTAug. Through extensive experiments across nine benchmarks, we demonstrate consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments. The generality of our approach is demonstrated both within models at different scales and across different VLMs without additional tuning.
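The adaptation idea in (ii) can be sketched on a toy model. This is an illustration under stated assumptions, not the authors' implementation: the "model" is a single shared bias vector over a three-token vocabulary, each augmented view adds a fixed offset to the logits, and `view_probs`, `consensus_label`, and `adapt_step` are hypothetical helpers. The pseudolabel is the argmax of the averaged view distributions, and one gradient step minimizes the cross-entropy of every view against that label.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def view_probs(bias, view_offsets):
    """Per-view next-token distributions for the toy model (bias + view offset)."""
    return [softmax([b + o for b, o in zip(bias, off)]) for off in view_offsets]

def consensus_label(probs):
    """Pseudolabel: argmax of the distribution averaged over views."""
    vocab = len(probs[0])
    mean = [sum(p[i] for p in probs) / len(probs) for i in range(vocab)]
    return max(range(vocab), key=lambda i: mean[i])

def pseudolabel_loss(probs, label):
    """Mean cross-entropy of all views against the consensus pseudolabel."""
    return -sum(math.log(p[label]) for p in probs) / len(probs)

def adapt_step(bias, view_offsets, lr=0.5):
    """One self-training step: derive the pseudolabel, then take an SGD step."""
    probs = view_probs(bias, view_offsets)
    label = consensus_label(probs)
    loss = pseudolabel_loss(probs, label)
    # Gradient of softmax cross-entropy w.r.t. the shared bias: mean(p - onehot).
    grad = [
        sum(p[i] for p in probs) / len(probs) - (1.0 if i == label else 0.0)
        for i in range(len(bias))
    ]
    new_bias = [b - lr * g for b, g in zip(bias, grad)]
    return new_bias, label, loss

# Hypothetical augmented views of one test input.
offsets = [[2.0, 0.0, 0.0], [1.5, 0.5, 0.0], [0.0, 2.0, 0.0]]
bias = [0.0, 0.0, 0.0]
losses = []
labels = []
for _ in range(3):
    bias, label, loss = adapt_step(bias, offsets)
    labels.append(label)
    losses.append(loss)
```

Each step recomputes the pseudolabel from the current parameters before updating, loosely mirroring a loop that alternates between creating pseudolabels and fine-tuning. In a real VLM the parameters, loss, and optimizer would be those of the model itself, and the learning rate and number of steps would need care to avoid reinforcing wrong pseudolabels.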

Paper Structure

This paper contains 41 sections, 21 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Our framework consists of two main pipelines: (1) Test-Time Augmentation: Given an input image and text prompt, we apply various transformations to create multiple augmented versions. The VLM processes each augmented input to produce next-token probability distributions, which are then aggregated at the token level to generate the final response. (2) Test-Time Adaptation: We create pseudolabels through test-time augmentation, fine-tune the VLM parameters on them, and then repeat the process. Both methods demonstrate effectiveness across nine diverse benchmarks, as shown in (b).
  • Figure 2: Performance scaling as a function of the number of augmentations. Performance gains generally plateau after 16 augmentations.
  • Figure 3: Improvements across different models, demonstrating cross-model generalization.
  • Figure 4: Performance across aggregation layers. Each subplot shows accuracy as a function of the transformer layer where feature aggregation occurs. Different benchmarks exhibit distinct optimal aggregation points: later layers favor language-heavy tasks (ChartQA, TextVQA), while earlier layers benefit visual reasoning tasks (OCRVQA, GQA).
  • Figure 5: Overhead in peak GPU memory usage and runtime for different numbers of augmentations, comparing parallel and sequential implementation strategies.
  • ...and 2 more figures