Efficient Test-Time Scaling for Small Vision-Language Models
Mehmet Onurcan Kaya, Desmond Elliott, Dim P. Papadopoulos
TL;DR
This work addresses the efficiency-performance trade-off in small vision-language models (VLMs) with two inference-time strategies: Test-Time Augmentation (TTAug) and Test-Time Adaptation (TTAdapt). Both operate on internal model representations and require no external supervision: TTAug aggregates token-level predictions across diverse input augmentations, and TTAdapt fine-tunes the model during inference on consensus-derived pseudolabels. Across nine benchmarks and multiple architectures, the approach yields consistent improvements at modest computational overhead, and token-level aggregation with greedy decoding emerges as a practical and effective alternative to traditional temperature-based sampling. The results generalize across model families and come with practical deployment guidance on augmentation counts, aggregation strategies, and task-dependent adaptation intensity, supporting robust VLM inference on resource-constrained devices.
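To make token-level aggregation with greedy decoding concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that aggregation means averaging next-token probabilities across augmented views at every decoding step; the summary above does not specify the exact rule. The names `ttaug_greedy_decode`, `toy_model`, and `views` are hypothetical, and `toy_model` is a three-token stand-in for a real VLM.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ttaug_greedy_decode(next_token_logits, augmented_inputs, eos_id, max_new_tokens=16):
    """Greedy decoding with token-level aggregation over augmented views.

    next_token_logits(view, prefix) must return a list of logits over the vocabulary.
    Averaging probabilities is an assumed aggregation rule, used here for illustration.
    """
    prefix = []
    for _ in range(max_new_tokens):
        # One next-token distribution per augmented view; all views share the same prefix.
        dists = [softmax(next_token_logits(view, prefix)) for view in augmented_inputs]
        vocab_size = len(dists[0])
        # Aggregate at the token level: mean probability for each vocabulary entry.
        mean = [sum(d[i] for d in dists) / len(dists) for i in range(vocab_size)]
        # Greedy choice on the aggregated distribution (no temperature sampling).
        token = max(range(vocab_size), key=mean.__getitem__)
        prefix.append(token)
        if token == eos_id:
            break
    return prefix


def toy_model(view, prefix):
    """Hypothetical stand-in for a VLM: a fixed logit table plus a per-view offset."""
    table = [[2.0, 1.0, 0.0], [0.0, 2.0, 1.0], [0.0, 0.0, 3.0]]
    base = table[min(len(prefix), 2)]
    return [b + v for b, v in zip(base, view)]


# Each "view" is a logit offset that mimics the effect of one input augmentation.
views = [[0.0, 0.0, 0.0], [0.0, 1.5, 0.0], [0.5, 0.0, 0.0]]
decoded = ttaug_greedy_decode(toy_model, views, eos_id=2)
print(decoded)  # [0, 1, 2]
```

In this toy setup the second view on its own would pick token 1 at the first step, while the averaged distribution picks token 0, which illustrates how aggregation can override a single perturbed view without any parameter update.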
Abstract
Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models, at the cost of weaker generalization abilities and downstream task performance. These shortcomings could be addressed by test-time scaling techniques, but existing methods are typically computationally demanding, contradicting the resource-efficient design goals of small models. To address these limitations, we propose two novel and efficient test-time scaling strategies that leverage model-internal features rather than external supervision: (i) Test-Time Augmentation (TTAug), which generates multiple augmented inputs and aggregates outputs at the token level without parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model parameters during inference using consensus-based pseudolabels from TTAug. Through extensive experiments across nine benchmarks, we demonstrate consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments. The generality of our approach is demonstrated both within models at different scales and across different VLMs without additional tuning.
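The adaptation idea in (ii) can be illustrated with a deliberately small sketch. It is not the authors' procedure: it assumes a two-class linear softmax head in place of a VLM, takes the pseudolabel as the argmax of the view-averaged prediction, and applies one gradient step on the cross-entropy against that pseudolabel. The loss, the learning rate, the choice of updated parameters, and all names (`predict`, `consensus`, `adapt_step`, `weights`, `views`) are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, features):
    """Linear head: one weight row per class, followed by a softmax."""
    return softmax([sum(w * x for w, x in zip(row, features)) for row in weights])


def consensus(weights, views):
    """Average the per-view predictions and return (pseudolabel, mean distribution)."""
    dists = [predict(weights, v) for v in views]
    n_classes = len(weights)
    mean = [sum(d[c] for d in dists) / len(dists) for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__), mean


def adapt_step(weights, views, label, lr=0.5):
    """One gradient step on the mean cross-entropy between each view and the pseudolabel."""
    updated = [row[:] for row in weights]
    for v in views:
        probs = predict(weights, v)  # gradients are taken at the original weights
        for c, row in enumerate(updated):
            grad_logit = probs[c] - (1.0 if c == label else 0.0)  # dCE / dlogit_c
            for j, x in enumerate(v):
                row[j] -= lr * grad_logit * x / len(views)
    return updated


# Hypothetical features of three augmented views of the same input.
views = [[1.0, 0.2], [0.8, 0.4], [0.3, 0.6]]
weights = [[1.0, 0.0], [0.0, 1.0]]

label, mean_before = consensus(weights, views)
adapted = adapt_step(weights, views, label)
_, mean_after = consensus(adapted, views)
print(label, mean_before[label] < mean_after[label])  # 0 True
```

The third view initially disagrees with the consensus; after the step, the averaged confidence in the pseudolabel rises, which is the self-supervised signal this kind of adaptation relies on. Because no ground-truth label is involved, a wrong consensus would be reinforced in the same way, so the step size matters.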
