Decoupling Vision and Language: Codebook Anchored Visual Adaptation

Jason Wu; Tianchen Zhao; Chang Liu; Jiarui Cai; Zheng Zhang; Zhuowei Li; Aaditya Singh; Xiang Xu; Mani Srivastava; Jonathan Wu

TL;DR

This work tackles the mismatch between the vision encoders in Large Vision-Language Models (LVLMs) and domain-specific tasks by proposing CRAFT, a discrete, codebook-anchored framework that decouples vision from the language model. CRAFT discretizes visual embeddings into a shared codebook and trains only the vision encoder, using surrogate alignment, commitment, and contrastive losses. The result is robust domain adaptation that transfers across LLMs, and a test-time token pruning scheme makes inference more efficient. Across ten benchmarks and multiple backbones, CRAFT delivers an average gain of 13.51% while preserving instruction-following and explanatory capabilities, and it outperforms continuous-feature and PEFT baselines. The approach suits resource-constrained settings because the adapted vision encoder is portable: it can be paired with diverse LLM backbones without re-alignment or extensive retraining.

Abstract

Large Vision-Language Models (LVLMs) use their vision encoders to translate images into representations for downstream reasoning, but the encoders often underperform in domain-specific visual tasks such as medical image diagnosis or fine-grained classification, where representation errors can cascade through the language model, leading to incorrect responses. Existing adaptation methods modify the continuous feature interface between encoder and language model through projector tuning or other parameter-efficient updates, which still couples the two components and requires re-alignment whenever the encoder changes. We introduce CRAFT (Codebook RegulAted Fine-Tuning), a lightweight method that fine-tunes the encoder using a discrete codebook that anchors visual representations to a stable token space, achieving domain adaptation without modifying other parts of the model. This decoupled design allows the adapted encoder to seamlessly boost the performance of LVLMs with different language architectures, as long as they share the same codebook. Empirically, CRAFT achieves an average gain of 13.51% across 10 domain-specific benchmarks such as VQARAD and PlantVillage, while preserving the LLM's linguistic capabilities and outperforming peer methods that operate on continuous tokens.
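To make the idea of a codebook-anchored interface concrete, the following is a minimal sketch of nearest-neighbor quantization together with a commitment-style penalty. It assumes the standard vector-quantization formulation (squared Euclidean distance to a frozen codebook); the exact losses used by CRAFT are not given in this summary, and the function names, the toy 2-D codebook, and the embeddings are all hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(embeddings: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each embedding to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(e, codebook[k]))
        for e in embeddings
    ]


def commitment_loss(
    embeddings: Sequence[Vector], codebook: Sequence[Vector], ids: Sequence[int]
) -> float:
    """Mean squared distance between each embedding and its assigned code.

    Assumption: the codebook is frozen, so minimizing this term only moves
    the encoder outputs toward the existing codes.
    """
    total = sum(squared_distance(e, codebook[i]) for e, i in zip(embeddings, ids))
    return total / len(embeddings)


# Toy data (hypothetical): three 2-D codes and three encoder outputs.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
embeddings = [(0.1, -0.1), (0.9, 1.2), (3.5, 0.2)]

token_ids = quantize(embeddings, codebook)
loss = commitment_loss(embeddings, codebook, token_ids)
print(token_ids, round(loss, 4))
```

Because the language model only ever sees the discrete token IDs, any encoder that maps images into the same codebook produces inputs in a vocabulary the language model already understands, which is the property the abstract relies on for cross-LLM transfer.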

Paper Structure (25 sections, 12 equations, 7 figures, 7 tables)

Introduction
Related Work
Methodology
Preliminaries
Training Process
Test-Time Vision Token Pruning
Experiments
Experimental Setup
Performance Comparison
Evaluation of Instruction-Following
Decoupling Vision and Language
Runtime and Efficiency
Ablation Studies
Future Work
Conclusion
...and 10 more sections

Figures (7)

Figure 1: Continuous vs. Discrete Adaptation. (a) In conventional continuous-space adaptation, fine-tuning the vision encoder shifts its feature distribution, requiring costly re-alignment with each language model. (b) CRAFT introduces a discrete interface that anchors visual features to a shared codebook, allowing a single adapted encoder to work seamlessly across language models of different architectures without additional re-training or alignment.
Figure 2: Examples from plant pathology [salathe2016plantvillage], medical imaging [lau2018dataset], and abstract diagram understanding [lu2021iconqa] are shown using a general continuous LVLM [lin2023vila], its PEFT-tuned variant, and our CRAFT model built on the discrete LVLM [wu2024vila]. The general LVLM often lacks visual grounding or domain-specific knowledge in under-represented domains (e.g., it misidentifies plant diseases). PEFT improves task accuracy, for example in question answering, but its language output collapses into rigid responses. In contrast, CRAFT captures domain-specific visual cues (e.g., identifying lesions in medical images) while keeping alignment stable through the shared discrete token interface, allowing the model to produce both accurate decisions and coherent explanations. Correct and incorrect answers or explanations are marked in green and red, respectively.
Figure 3: Overview of the CRAFT framework. CRAFT adapts to a target domain by fine-tuning only the vision encoder, and its outputs are quantized into a shared discrete codebook. Training is guided by commitment and contrastive losses, with an additional surrogate LLM providing multimodal supervision. At inference, the adapted encoder can be used with any LLM that shares the same codebook.
Figure 4: Illustration of the Token Pruning Process. Most of the white background patches are mapped to the same codebook entry (ID 11745), whereas semantically meaningful objects such as the Chihuahua are represented by rarer token IDs (e.g., ID 5825). Token IDs that appear very often in the training set usually repeat information already captured by other tokens. To reduce this redundancy, we prune a subset of them at test time. Tokens that appear frequently during training receive lower rarity weights and are pruned more aggressively than those with higher rarity weights.
Figure 5: Accuracy of the CRAFT encoder with the VILA-U-7B backbone on various datasets versus the keep ratio. Each curve represents a different dataset. The keep ratio is the ratio $M/N$ between the target token budget $M$ and the number of image tokens $N$; 1.0 indicates no pruning. Accuracy remains stable when the keep ratio is above 0.6.
...and 2 more figures
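
The rarity-weighted pruning described in Figures 4 and 5 can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: the weight formula `1 / (1 + count)`, the deterministic top-$M$ selection, the rule that unseen IDs count as rarest, and the helper names are all hypothetical. The IDs 11745 and 5825 come from Figure 4; ID 42 and all counts are made up.

```python
import math
from collections import Counter
from typing import Dict, List


def rarity_weights(train_token_ids: List[int]) -> Dict[int, float]:
    """Assign each token ID a weight that shrinks as its training count grows."""
    counts = Counter(train_token_ids)
    return {tid: 1.0 / (1.0 + c) for tid, c in counts.items()}


def prune_tokens(
    token_ids: List[int], weights: Dict[int, float], keep_ratio: float
) -> List[int]:
    """Keep the M = ceil(keep_ratio * N) rarest tokens, preserving their order."""
    n = len(token_ids)
    m = max(1, math.ceil(keep_ratio * n))
    # Rank positions by descending rarity; ties go to the earlier position.
    # Assumption: IDs never seen in training get the maximum weight 1.0.
    ranked = sorted(range(n), key=lambda i: (-weights.get(token_ids[i], 1.0), i))
    kept_positions = sorted(ranked[:m])
    return [token_ids[i] for i in kept_positions]


# Toy training statistics: the background ID dominates, as in Figure 4.
train_ids = [11745] * 8 + [5825] * 2 + [42]
weights = rarity_weights(train_ids)

# One image with N = 6 tokens, pruned to a keep ratio of 0.5 (M = 3).
image_ids = [11745, 11745, 5825, 11745, 42, 11745]
pruned = prune_tokens(image_ids, weights, keep_ratio=0.5)
print(pruned)
```

With a keep ratio of 1.0 the function returns the input unchanged, matching the "no pruning" end of the curves in Figure 5. A stochastic variant that samples tokens in proportion to their rarity weights would fit the caption's wording equally well.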
