Reasoning-Driven Multimodal LLM for Domain Generalization

Zhipeng Xu; Zilong Wang; Xinyang Jiang; Dongsheng Li; De Cheng; Nannan Wang

TL;DR

The paper proposes RD-MLDG (Reasoning-Driven Multimodal LLM for Domain Generalization), a framework with two components: MTCT (Multi-Task Cross-Training), which adds a direct classification pathway to guide reasoning supervision; and SARR (Self-Aligned Reasoning Regularization), which preserves the semantic richness of reasoning chains while mitigating reasoning-pattern mismatches via iterative self-labeling.

Abstract

This paper addresses the domain generalization (DG) problem in deep learning. While most DG methods focus on enforcing visual feature invariance, we leverage the reasoning capability of multimodal large language models (MLLMs) and explore the potential of constructing reasoning chains that derive image categories to achieve more robust predictions under domain shift. To this end, we systematically study the role of reasoning in DG using DomainBed-Reasoning, a newly constructed extension of the DomainBed dataset in which each sample is paired with class-relevant reasoning chains. Our analysis reveals two key challenges: (i) fine-tuning MLLMs with reasoning chains for classification is more challenging than direct label supervision, since the model must optimize complex reasoning sequences before label prediction; and (ii) mismatches in reasoning patterns between supervision signals and fine-tuned MLLMs lead to a trade-off between semantic richness (informative but harder to optimize) and optimization efficiency (easier to optimize but less informative). To address these issues, we propose RD-MLDG (Reasoning-Driven Multimodal LLM for Domain Generalization), a framework with two components: (i) MTCT (Multi-Task Cross-Training), which introduces an additional direct classification pathway to guide reasoning supervision; and (ii) SARR (Self-Aligned Reasoning Regularization), which preserves the semantic richness of reasoning chains while mitigating reasoning-pattern mismatches via iterative self-labeling. Experiments on standard DomainBed datasets (PACS, VLCS, OfficeHome, TerraInc) demonstrate that RD-MLDG achieves state-of-the-art performance, highlighting reasoning as a promising complementary signal for robust out-of-domain generalization.
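To make the idea of a direct classification pathway alongside reasoning supervision concrete, the sketch below combines two token-level losses: one over a reasoning-chain target and one over a direct-label target. It is an illustration under stated assumptions, not the paper's objective: the abstract does not give the MTCT loss, so the additive form, the `label_weight` parameter, and the function names `sequence_nll` and `multitask_loss` are all hypothetical.

```python
import math


def sequence_nll(token_probs):
    """Mean negative log-likelihood of the target tokens.

    token_probs: probabilities the model assigns to each ground-truth token.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def multitask_loss(reasoning_token_probs, label_token_probs, label_weight=1.0):
    """Hypothetical two-pathway loss (assumption, not the paper's formula).

    The reasoning pathway is supervised on the full chain, including the
    final label; the direct pathway is supervised on the label tokens only.
    label_weight is an assumed hyperparameter balancing the two terms.
    """
    reasoning_loss = sequence_nll(reasoning_token_probs)
    label_loss = sequence_nll(label_token_probs)
    return reasoning_loss + label_weight * label_loss


# Toy usage: a long reasoning target and a short direct-label target.
reasoning_probs = [0.9, 0.6, 0.7, 0.8, 0.5]
label_probs = [0.95]
loss = multitask_loss(reasoning_probs, label_probs, label_weight=1.0)
```

The direct-label term involves far fewer tokens than the reasoning term, which is one way to picture why label-only supervision is described as easier to optimize than full reasoning sequences.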

Paper Structure (23 sections, 3 equations, 8 figures, 8 tables, 1 algorithm)

Introduction
Related work
DomainBed-Reasoning
Challenges of reasoning chain in DG
Optimization Gap in Reasoning-Chain Supervision
Mismatches in Reasoning Patterns Across Sources
Method
Multi-Task Cross-Training
Self-Aligned Reasoning Regularization
Experiments
Results on Multiple Domain Generalization
Ablation study
Parameter Analysis
Conclusion
Acknowledgments
...and 8 more sections

Figures (8)

Figure 1: Illustrative examples of the printer in OfficeHome (Art Painting, Clipart, Product, Real World). Although the visual appearance differs substantially, the highlighted green segments of the reasoning chains remain highly consistent, capturing class-relevant cues that are likely to generalize well across domains.
Figure 2: Overview of the DomainBed-Reasoning construction pipeline. GPT-4o generates multi-stage reasoning chains (<SUMMARY>, <CAPTION>, <REASONING>, <REFLECTION>, <CONCLUSION>) without access to ground-truth labels. Multiple candidates are sampled and filtered through rejection sampling to obtain coherent reasoning chains, which form the foundation for analyzing reasoning challenges in DG.
Figure 3: Illustration of Challenge 1 (see Sec. "Optimization Gap in Reasoning-Chain Supervision"). (A) Classification token probabilities on target-domain test data under zero-shot and SFT, with and without reasoning; (B) Probability distributions of all tokens on source-domain training data under zero-shot and SFT; (C) Training dynamics: (I–II) classification token probabilities on source-domain training data after SFT; (III) loss curves comparing reasoning-based and no-thinking SFT.
Figure 4: Illustration of Challenge 2 (see Sec. "Mismatches in Reasoning Patterns Across Sources"). (A) Token probability distributions on source-domain training data under zero-shot and fine-tuned settings: GPT-4o reasoning (A.I, A.II) vs InternVL-8B reasoning (A.III, A.IV). (B) Top-15 tokens with the largest entropy reduction during optimization, highlighting differences between GPT-4o (B.I) and InternVL-8B (B.II). (C) Qualitative examples corresponding to (B.I) and (B.II), with tokens receiving the strongest optimization gains highlighted in red.
Figure 5: Analysis of MTCT (see Sec. "Multi-Task Cross-Training"). Probability distributions on TerraInc source-domain training data under direct label prediction, reasoning-only SFT, and MTCT SFT: (A) all tokens and (B) class tokens.
...and 3 more figures
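
The Figure 2 caption describes multi-stage reasoning chains that are sampled several times and filtered by rejection sampling. A minimal sketch of such a filter follows. Several details are assumptions rather than facts from the caption: that each stage is closed by a matching tag such as `</SUMMARY>`, and that a candidate is accepted when all five stages appear in order and the conclusion mentions the ground-truth class name. The names `parse_chain`, `accept`, and `rejection_sample` are hypothetical.

```python
import re

# Stage tags listed in the Figure 2 caption, in their stated order.
STAGES = ("SUMMARY", "CAPTION", "REASONING", "REFLECTION", "CONCLUSION")


def parse_chain(text):
    """Return {stage: content} if all stages appear in order, else None.

    Assumes each stage is wrapped as <STAGE>...</STAGE>; the closing-tag
    convention is an assumption, not stated in the caption.
    """
    stages = {}
    pos = 0
    for stage in STAGES:
        pattern = re.compile(rf"<{stage}>(.*?)</{stage}>", re.DOTALL)
        match = pattern.search(text, pos)
        if match is None:
            return None
        stages[stage] = match.group(1).strip()
        pos = match.end()
    return stages


def accept(chain_text, true_label):
    """Assumed acceptance rule: well-formed chain whose conclusion names the label."""
    stages = parse_chain(chain_text)
    if stages is None:
        return False
    return true_label.lower() in stages["CONCLUSION"].lower()


def rejection_sample(candidates, true_label):
    """Keep only the candidate chains that pass the acceptance rule."""
    return [c for c in candidates if accept(c, true_label)]


def make_chain(conclusion, skip=None):
    """Build a toy chain for demonstration, optionally omitting one stage."""
    parts = []
    for stage in STAGES:
        if stage == skip:
            continue
        body = conclusion if stage == "CONCLUSION" else "placeholder text"
        parts.append(f"<{stage}>{body}</{stage}>")
    return "\n".join(parts)


good = make_chain("The object is a printer.")
wrong_label = make_chain("The object is a monitor.")
malformed = make_chain("The object is a printer.", skip="REFLECTION")
kept = rejection_sample([good, wrong_label, malformed], "printer")
```

In this sketch the label is used only to filter candidates after generation, which is consistent with the caption's statement that the generator itself does not see ground-truth labels.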
