Chain-of-Adaptation: Surgical Vision-Language Adaptation with Reinforcement Learning

Jiajie Li; Chenhui Xu; Meihuan Liu; Jinjun Xiong

Abstract

Conventional fine-tuning on domain-specific datasets can inadvertently alter a model's pretrained multimodal priors, leading to reduced generalization. To address this, we propose Chain-of-Adaptation (CoA), an adaptation framework designed to integrate domain knowledge while maintaining the model's inherent reasoning and perceptual capabilities. CoA uses reinforcement learning to instill a structured reasoning format that enhances domain alignment without sacrificing general multimodal competence. Experiments on standard surgical benchmarks, under both in-distribution and out-of-distribution settings, demonstrate that CoA achieves higher accuracy, stronger generalization, and more stable behavior than supervised fine-tuning. Furthermore, ablation studies confirm that CoA effectively preserves the model's core vision-language abilities, providing a reliable pathway for domain specialization in VLMs.

Paper Structure (40 sections, 5 equations, 6 figures, 3 tables)

Introduction
Related Work
General-Purpose Vision-Language Models.
Vision-Language Models with Reinforcement Learning.
Surgical Vision-Language Models.
Existing Surgical Datasets.
Dilemma in Surgical VLM Adaptation
Rethinking SFT in the Surgical Domain
Empirical Observation After SFT
Why Does SFT Fall Short?
CoA: Chain-of-Adaptation
Preliminary: RLVR and GRPO
Structured Reasoning: From CoT to CoA
Rationale.
Two-Stage Surgical VLM Adaptation
...and 25 more sections

Figures (6)

Figure 1: Chain-of-Adaptation Overview. Left: The CoA training pipeline. It features an evidence-oriented cold start that enriches the model's domain-specific concepts, followed by RL-based training that encourages compliance with the CoA reasoning format and accurate final answers. Right: CoA adapts the model from the general to the specialized domain through a four-stage reasoning format. <general description> gives the plain descriptions that language models excel at. <evidence> explicitly collects the mined information and connects it with domain knowledge. <thought> and <answer> further deepen the reasoning and draw the final conclusion.
Figure 2: Qwen3-VL-8B-Instruct's response to: "Describe the surgical image in detail."
Figure 3: Token length drops dramatically after small-scale SFT.
Figure 4: Model's response after SFT.
Figure 5: SFT vs. GRPO on CholecT50 and EndoVis2018.
...and 1 more figures
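
The Figure 1 caption describes RL training that rewards compliance with the four-stage CoA format and a correct final answer. The page does not give the reward itself, so the following is only a minimal sketch of what such a verifiable reward could look like, not the authors' implementation. The literal tag spellings, the function names `parse_coa` and `coa_reward`, the exact-match answer check, and the two weights are all assumptions made for illustration.

```python
from typing import Dict, Optional

# Stage names follow the Figure 1 caption; their exact spelling in the
# authors' prompts is an assumption.
COA_TAGS = ("general description", "evidence", "thought", "answer")


def parse_coa(response: str) -> Optional[Dict[str, str]]:
    """Return the four CoA fields if each appears exactly once, in order; else None."""
    fields: Dict[str, str] = {}
    cursor = 0
    for tag in COA_TAGS:
        open_tag, close_tag = f"<{tag}>", f"</{tag}>"
        # Each stage must be opened and closed exactly once.
        if response.count(open_tag) != 1 or response.count(close_tag) != 1:
            return None
        # Searching from the cursor enforces the stage order.
        start = response.find(open_tag, cursor)
        if start == -1:
            return None
        end = response.find(close_tag, start + len(open_tag))
        if end == -1:
            return None
        fields[tag] = response[start + len(open_tag):end].strip()
        cursor = end + len(close_tag)
    return fields


def coa_reward(response: str, gold_answer: str,
               format_weight: float = 0.5, answer_weight: float = 1.0) -> float:
    """Hypothetical reward: a format bonus plus an exact-match answer bonus."""
    fields = parse_coa(response)
    if fields is None:
        return 0.0  # malformed output earns nothing
    correct = fields["answer"].lower() == gold_answer.strip().lower()
    return format_weight + (answer_weight if correct else 0.0)
```

In this sketch a response that skips or reorders a stage receives zero reward even if its answer is right, which is one simple way to tie answer credit to format compliance; a real setup might score the two terms independently.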
