Chain-of-Adaptation: Surgical Vision-Language Adaptation with Reinforcement Learning

Jiajie Li, Chenhui Xu, Meihuan Liu, Jinjun Xiong

Abstract

Conventional fine-tuning on domain-specific datasets can inadvertently alter a model's pretrained multimodal priors, leading to reduced generalization. To address this, we propose Chain-of-Adaptation (CoA), an adaptation framework designed to integrate domain knowledge while maintaining the model's inherent reasoning and perceptual capabilities. CoA uses reinforcement learning to instill a structured reasoning format that enhances domain alignment without sacrificing general multimodal competence. Experiments on standard surgical benchmarks, under both in-distribution and out-of-distribution settings, demonstrate that CoA achieves higher accuracy, stronger generalization, and more stable behavior than supervised fine-tuning. Furthermore, ablation studies confirm that CoA effectively preserves the model's core visual-language abilities, providing a reliable pathway for domain specialization in VLMs.

Paper Structure (40 sections, 5 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Chain-of-Adaptation Overview. Left: The CoA training pipeline. It features an evidence-oriented cold start that enriches the model's domain-specific concepts, followed by RL-based training that encourages compliance with the CoA reasoning format and accurate final answers. Right: CoA performs adaptation from general to specialized domains through a four-stage reasoning format. <general description> gives plain descriptions of the kind that language models excel at. <evidence> explicitly collects the mined information and connects it with domain knowledge. <thought> and <answer> further deepen the reasoning and draw the final conclusion.
  • Figure 2: Qwen3-VL-8B-Instruct's response to: "Describe the surgical image in detail."
  • Figure 3: Token length decreases dramatically after small-scale SFT.
  • Figure 4: Model's response after SFT.
  • Figure 5: SFT vs GRPO on CholecT50 and Endovis2018.
  • ...and 1 more figure
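
The Figure 1 caption describes RL training that rewards both compliance with the four-stage reasoning format and a correct final answer. The following is a minimal sketch of what such a check could look like, not the authors' implementation. The tag spellings (including the closing-tag form), the function names `parse_coa` and `coa_reward`, the exact-match answer comparison, and the reward weights are all assumptions made for illustration; this section does not specify the paper's actual reward.

```python
import re

# Stage names follow the Figure 1 caption. The exact tag spelling and the
# closing-tag form (</...>) are assumptions, not taken from the paper.
STAGES = ("general description", "evidence", "thought", "answer")


def _build_pattern(stages):
    """Build a regex requiring every stage exactly once, in the given order."""
    parts = [
        rf"<{re.escape(name)}>(?P<s{i}>.*?)</{re.escape(name)}>"
        for i, name in enumerate(stages)
    ]
    return re.compile(r"\s*" + r"\s*".join(parts) + r"\s*", re.DOTALL)


COA_PATTERN = _build_pattern(STAGES)


def parse_coa(response):
    """Return a dict of stage name -> text, or None if the format is violated."""
    match = COA_PATTERN.fullmatch(response)
    if match is None:
        return None
    return {name: match.group(f"s{i}").strip() for i, name in enumerate(STAGES)}


def coa_reward(response, gold_answer, format_weight=0.5, answer_weight=1.0):
    """Hypothetical reward: format compliance plus exact-match answer accuracy.

    The weights are illustrative placeholders, not values from the paper.
    """
    parsed = parse_coa(response)
    if parsed is None:
        return 0.0  # malformed responses earn nothing in this sketch
    correct = parsed["answer"].lower() == gold_answer.strip().lower()
    return format_weight + (answer_weight if correct else 0.0)
```

In this sketch a response that skips a stage or reorders the stages receives zero reward, so the answer term only counts once the format is respected. A softer variant could grant partial credit per stage; which design the paper uses is not stated in this section.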