Vision Language Models: A Survey of 26K Papers

Fengming Lin

TL;DR

This survey provides a transparent, reproducible measurement of research trends across 26,104 accepted papers from CVPR, ICLR, and NeurIPS (2023–2025), using a lexicon-based labeling approach to profile tasks, architectures, training regimes, losses, datasets, and modalities. It identifies three macro shifts: a rapid rise of multimodal vision–language models, including large vision–language models (LVLMs); sustained diffusion-based generation, with research focused on controllability and speed; and enduring interest in 3D and video understanding. Within VLM work, the center of gravity moves toward instruction tuning and parameter-efficient adaptation, with lightweight bridges and modular fusion replacing heavy end-to-end encoders in many cases. The work offers practical guidance for researchers and practitioners, and its released lexicon and methodology serve as a reproducible toolkit for auditing and extending trend analyses across venues and years.

Abstract

We present a transparent, reproducible measurement of research trends across 26,104 accepted papers from CVPR, ICLR, and NeurIPS spanning 2023–2025. Titles and abstracts are normalized, phrase-protected, and matched against a hand-crafted lexicon to assign up to 35 topical labels and mine fine-grained cues about tasks, architectures, training regimes, objectives, datasets, and co-mentioned modalities. The analysis quantifies three macro shifts: (1) a sharp rise of multimodal vision-language-LLM work, which increasingly reframes classic perception as instruction following and multi-step reasoning; (2) steady expansion of generative methods, with diffusion research consolidating around controllability, distillation, and speed; and (3) resilient 3D and video activity, with composition moving from NeRFs to Gaussian splatting and a growing emphasis on human- and agent-centric understanding. Within VLMs, parameter-efficient adaptation (prompting, adapters, LoRA) and lightweight vision-language bridges dominate; training practice shifts from building encoders from scratch to instruction tuning and finetuning strong backbones; and contrastive objectives recede relative to cross-entropy/ranking and distillation. Cross-venue comparisons show that CVPR has a stronger 3D footprint and ICLR the highest VLM share, while reliability themes such as efficiency and robustness diffuse across areas. We release the lexicon and methodology to enable auditing and extension. Limitations include lexicon recall and abstract-only scope, but the longitudinal signals are consistent across venues and years.
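The labeling pipeline described in the abstract (normalize, protect phrases, match against a lexicon) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the three-label `LEXICON`, the helper names, and the specific normalization rules are hypothetical and are not taken from the released lexicon, which covers up to 35 labels.

```python
import re

# Hypothetical toy lexicon (label -> trigger phrases); NOT the released lexicon.
LEXICON = {
    "vlm": ["vision language model", "multimodal llm"],
    "diffusion": ["diffusion model"],
    "3d": ["gaussian splatting", "nerf"],
}


def normalize(text):
    """Lowercase and replace every non-alphanumeric run with a single space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def build_index(lexicon):
    """Map each protected token (phrase joined by underscores) to its label."""
    index = {}
    for label, phrases in lexicon.items():
        for phrase in phrases:
            index[normalize(phrase).replace(" ", "_")] = label
    return index


def protect_phrases(text, index):
    """Fuse known multi-word phrases into single tokens, longest first.

    An optional trailing "s" is absorbed so simple plurals still match.
    Once a phrase is fused, its underscores stop shorter phrases from
    matching inside it, because "_" counts as a word character for \\b.
    """
    for token in sorted(index, key=len, reverse=True):
        phrase = token.replace("_", " ")
        text = re.sub(rf"\b{re.escape(phrase)}s?\b", token, text)
    return text


def label_paper(title, abstract, lexicon=LEXICON):
    """Return the sorted list of lexicon labels found in title + abstract."""
    index = build_index(lexicon)
    text = protect_phrases(normalize(f"{title} {abstract}"), index)
    return sorted({index[tok] for tok in text.split() if tok in index})


labels = label_paper(
    "Fast Gaussian Splatting for Vision-Language Models",
    "We distill a diffusion model.",
)
print(labels)
```

Matching whole protected tokens rather than raw substrings is one way to keep a short cue such as "nerf" from firing inside an unrelated word such as "inference"; the actual pipeline may handle this differently.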

Paper Structure

This paper contains 15 sections, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Direction Trajectories across CVPR+ICLR+NeurIPS, all categories (direct labels). Each curve is the yearly aggregated TF-IDF mass for a direction (integer year ticks).
  • Figure 2: Small-multiples view of research-direction trajectories from 2022–2025. Each panel shows one category; the x-axis is year and the y-axis is normalized topic intensity (aggregated TF-IDF score). Vision–Language/Multimodal/LLM, Diffusion & Generative, and NeRF/Neural Rendering trend upward, while some traditional areas (e.g., 2D Object Detection, Self-Supervised/Pretraining) soften; others (e.g., GNN, Bayesian, Optimization) remain flat or decline slightly.
  • Figure 3: Top Rising Directions across CVPR+ICLR+NeurIPS (2022–2025). Bars show the slope of each direction’s aggregated TF-IDF trajectory over years. Larger slope = faster growth.
  • Figure 4: Chronological overview of representative VLM / Multimodal LLM milestones (2020–2025). Four color-coded categories: (A) Dual-encoder / Contrastive, (B) Cross-modal / Encoder-Decoder, (C) Multimodal LLMs. Each arrow is vertically offset within its year and labeled above/below with the model name and citation.
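
The slope statistic mentioned in the Figure 3 caption can be sketched as follows. This assumes an ordinary least-squares fit of yearly aggregated TF-IDF mass against year, which the caption does not specify; the function name `trend_slope` and all numbers in `trajectories` are made up purely for illustration and are not results from the paper.

```python
def trend_slope(series):
    """Least-squares slope of value versus year for a {year: value} mapping."""
    years = sorted(series)
    n = len(years)
    if n < 2:
        raise ValueError("need at least two years to fit a slope")
    mean_x = sum(years) / n
    mean_y = sum(series[y] for y in years) / n
    cov = sum((y - mean_x) * (series[y] - mean_y) for y in years)
    var = sum((y - mean_x) ** 2 for y in years)
    return cov / var


# Illustrative values only (hypothetical, not taken from the paper).
trajectories = {
    "vlm": {2022: 1.0, 2023: 2.0, 2024: 3.0, 2025: 4.0},
    "detection_2d": {2022: 2.0, 2023: 1.5, 2024: 1.0, 2025: 0.5},
}

# Rank directions from fastest-rising to fastest-falling.
ranked = sorted(
    trajectories,
    key=lambda name: trend_slope(trajectories[name]),
    reverse=True,
)
print(ranked)
```

With only three or four yearly points per direction, such a slope is a coarse summary; it is sensitive to a single unusual year and is best read alongside the full trajectories.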