Small Vision-Language Models: A Survey on Compact Architectures and Techniques

Nitesh Patnaik, Navdeep Nayak, Himani Bansal Agrawal, Moinak Chinmoy Khamaru, Gourav Bal, Saishree Smaranika Panda, Rishi Raj, Vishal Meena, Kartheek Vadlamani

TL;DR

This survey addresses the problem of making vision-language models practical on resource-constrained devices by focusing on small, efficient architectures. It introduces a threefold backbone taxonomy—transformer-based, Mamba-based, and hybrid—and surveys enabling techniques such as knowledge distillation, lightweight attention, and modality pre-fusion, illustrated by models like TinyGPT-V, MiniGPT-4, and VL-Mamba. The authors synthesize performance-efficiency trade-offs across a range of models and benchmarks, emphasizing challenges from data biases to generalization in complex tasks. The work underscores the practical impact of sVLMs for edge and embedded systems and outlines concrete avenues for future research, including data efficiency, cross-domain generalization, and hybrid architectures that balance accuracy with computation.

Abstract

The emergence of small vision-language models (sVLMs) marks a critical advancement in multimodal AI, enabling efficient processing of visual and textual data in resource-constrained environments. This survey offers a comprehensive exploration of sVLM development, presenting a taxonomy of architectures (transformer-based, Mamba-based, and hybrid) that highlights innovations in compact design and computational efficiency. Techniques such as knowledge distillation, lightweight attention mechanisms, and modality pre-fusion are discussed as enablers of high performance with reduced resource requirements. Through an in-depth analysis of models like TinyGPT-V, MiniGPT-4, and VL-Mamba, we identify trade-offs between accuracy, efficiency, and scalability. Persistent challenges, including data biases and generalization to complex tasks, are critically examined, with proposed pathways for addressing them. By consolidating advancements in sVLMs, this work underscores their transformative potential for accessible AI, setting a foundation for future research into efficient multimodal systems.

Paper Structure

This paper contains 10 sections, 22 figures, 1 table.

Figures (22)

  • Figure 1: Structure of the Paper
  • Figure 2: Number of Papers Published by Year
  • Figure 3: Evolution of Vision Language Models over time
  • Figure 4: CLIP Architecture
  • Figure 5: BLIP Architecture: A multimodal mixture of encoder-decoder, a unified vision-language model that can operate in one of three functionalities: (1) The unimodal encoder is trained with an image-text contrastive (ITC) loss to align the vision and language representations. (2) The image-grounded text encoder uses additional cross-attention layers to model vision-language interactions, and is trained with an image-text matching (ITM) loss to distinguish between positive and negative image-text pairs. (3) The image-grounded text decoder replaces the bi-directional self-attention layers with causal self-attention layers, and shares the same cross-attention layers and feed-forward networks as the encoder. The decoder is trained with a language modeling (LM) loss to generate captions given images.
  • ...and 17 more figures
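
The image-text contrastive (ITC) objective named in the Figure 5 caption can be illustrated with a minimal sketch. This is a simplified symmetric contrastive loss over a batch of paired embeddings, not BLIP's actual implementation. The function name `itc_loss`, the helper names, and the temperature value 0.07 are assumptions chosen for illustration, and the embeddings are plain Python lists rather than encoder outputs.

```python
import math


def _normalize(vec):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _log_softmax(row):
    # Numerically stable log-softmax over a list of scores.
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k of images and texts is the positive.

    temperature is an assumed illustrative value, not one taken from the survey.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity between image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image-to-text: each image should pick out its own caption.
    i2t = -sum(_log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: each caption should pick out its own image.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(_log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched_texts = [[1.0, 0.0], [0.0, 1.0]]
    shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]
    print("matched:", itc_loss(images, matched_texts))
    print("shuffled:", itc_loss(images, shuffled_texts))
```

In this toy setup the loss is small when each image embedding is closest to its own caption and large when the pairs are shuffled, which is the alignment pressure the caption describes for the unimodal encoders.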