CATCH: A Modular Cross-domain Adaptive Template with Hook
Xinjin Li, Yulie Lu, Jinghan Cao, Yu Ma, Zhenglin Li, Yeyang Zhou
TL;DR
CATCH tackles cross-domain VQA by decoupling domain adaptation from backbone training through a modular, hook-based framework. A lightweight domain classifier routes each input to a pair of adapters (a Prompt Adapter for language and a Visual Adapter for vision) that are dynamically injected into a frozen backbone to produce answers. Across four domain-specific VQA benchmarks, CATCH shows consistent gains in accuracy and generative metrics (e.g., BLEU and ROUGE) and improves factual grounding, supporting its scalability and practicality for multi-domain deployment. The approach offers a flexible, low-cost way to extend vision-language systems beyond natural-image domains while preserving core model capabilities.
Abstract
Recent advances in Visual Question Answering (VQA) have demonstrated impressive performance in natural image domains, with models like LLaVA leveraging large language models (LLMs) for open-ended reasoning. However, their performance degrades significantly when they are transferred to out-of-domain scenarios such as remote sensing, medical imaging, or math diagrams, due to large distributional shifts and the lack of effective domain adaptation mechanisms. Existing approaches typically rely on per-domain fine-tuning or bespoke pipelines, which are costly, inflexible, and do not scale across diverse tasks. In this paper, we propose CATCH, a plug-and-play framework for cross-domain adaptation that improves the generalization of VQA models while requiring minimal changes to their core architecture. Our key idea is to decouple visual and linguistic adaptation by introducing two lightweight modules: a domain classifier that identifies the input image type, and a dual adapter mechanism comprising a Prompt Adapter for language modulation and a Visual Adapter for vision feature adjustment. Both modules are dynamically injected via a unified hook interface, requiring no retraining of the backbone model. Experimental results across four domain-specific VQA benchmarks show consistent performance gains, including +2.3 BLEU on MathVQA, +2.6 VQA on MedVQA-RAD, and +3.1 ROUGE on ChartQA. These results indicate that CATCH provides a scalable and extensible approach to multi-domain VQA, enabling practical deployment across diverse application domains.
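
The routing idea described in the abstract (classify the domain, attach the matching Prompt Adapter and Visual Adapter through hooks, then run the unchanged backbone) can be illustrated with a minimal, dependency-free Python sketch. This is not the authors' implementation: every name here (`FrozenBackbone`, `AdapterPair`, `classify_domain`, `catch_answer`), the nearest-prototype classifier, and the toy adapters are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]


@dataclass
class AdapterPair:
    """Hypothetical pairing of a language-side and a vision-side adapter."""
    prompt_adapter: Callable[[str], str]
    visual_adapter: Callable[[Vector], Vector]


class FrozenBackbone:
    """Toy stand-in for a frozen VQA model that exposes two hook points."""

    def __init__(self) -> None:
        self.visual_hooks: List[Callable[[Vector], Vector]] = []
        self.prompt_hooks: List[Callable[[str], str]] = []

    def register_hooks(self, pair: AdapterPair) -> None:
        self.visual_hooks.append(pair.visual_adapter)
        self.prompt_hooks.append(pair.prompt_adapter)

    def clear_hooks(self) -> None:
        self.visual_hooks.clear()
        self.prompt_hooks.clear()

    def answer(self, features: Vector, question: str) -> str:
        # The backbone logic never changes; adapters act only via hooks.
        for hook in self.visual_hooks:
            features = hook(features)
        for hook in self.prompt_hooks:
            question = hook(question)
        return f"{question} (feature sum: {sum(features):.2f})"


# Assumed domain prototypes and toy adapters, for illustration only.
PROTOTYPES: Dict[str, Vector] = {"medical": [1.0, 0.0], "chart": [0.0, 1.0]}
ADAPTERS: Dict[str, AdapterPair] = {
    "medical": AdapterPair(lambda q: "[medical] " + q,
                           lambda f: [2.0 * x for x in f]),
    "chart": AdapterPair(lambda q: "[chart] " + q,
                         lambda f: [x + 1.0 for x in f]),
}


def classify_domain(features: Vector) -> str:
    """Nearest-prototype classifier (an assumed stand-in for the real one)."""
    def sq_dist(name: str) -> float:
        return sum((a - b) ** 2 for a, b in zip(features, PROTOTYPES[name]))
    return min(PROTOTYPES, key=sq_dist)


def catch_answer(backbone: FrozenBackbone, features: Vector, question: str) -> str:
    """Route the input to its domain adapters, then detach them again."""
    domain = classify_domain(features)
    backbone.clear_hooks()
    backbone.register_hooks(ADAPTERS[domain])
    try:
        return backbone.answer(features, question)
    finally:
        backbone.clear_hooks()  # leave the backbone in its original state


if __name__ == "__main__":
    model = FrozenBackbone()
    print(catch_answer(model, [0.9, 0.1], "What does the scan show?"))
```

Detaching the hooks after each call keeps the backbone's default behavior intact, which mirrors the stated goal of adapting per domain without modifying or retraining the core model.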
