Robust Domain Generalization for Multi-modal Object Recognition

Yuxin Qiao, Keqin Li, Junhong Lin, Rong Wei, Chufeng Jiang, Yang Luo, Haoyu Yang

TL;DR

The paper tackles domain generalization in zero-shot, multi-modal, multi-label recognition by leveraging vision-language pre-training and identifying shortcomings in CLIPood. It calibrates the finetuning loss by deriving the actual objective $\mathcal{L}_{actual}$ and introduces Mixup-CLIPood with a cross-modal mix-up loss $\mathcal{L}_{mix}$, combining them as $\mathcal{L}_{total} = \mathcal{L}_{actual} + \lambda \mathcal{L}_{mix}$ with $\lambda = 0.1$. The approach also broadens evaluation to larger vision-language backbones and emphasizes class-aware visual fusion to improve generalization. Experiments on PACS, VLCS, and Office-Home across ViT backbones demonstrate consistent gains over CLIPood and CLIP, establishing a new benchmark for robust multi-modal domain generalization in object recognition.

Abstract

In multi-label classification, machine learning encounters the challenge of domain generalization when handling tasks with distributions differing from the training data. Existing approaches primarily focus on vision object recognition and neglect the integration of natural language. Recent advancements in vision-language pre-training leverage supervision from extensive visual-language pairs, enabling learning across diverse domains and enhancing recognition in multi-modal scenarios. However, these approaches face limitations in loss function utilization, generality across backbones, and class-aware visual fusion. This paper proposes solutions to these limitations by inferring the actual loss, broadening evaluations to larger vision-language backbones, and introducing Mixup-CLIPood, which incorporates a novel mix-up loss for enhanced class-aware visual fusion. Our method demonstrates superior performance in domain generalization across multiple datasets.
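
The weighted combination of the finetuning loss and the mix-up loss can be illustrated with a small sketch. This summary does not spell out how the mix-up loss is computed, so the block below swaps in the generic mix-up recipe as an assumption: two image feature vectors and their one-hot labels are blended with a Beta-sampled ratio, and the blend is scored against per-class text embeddings. The function names, the `alpha` parameter, and the dot-product logits are hypothetical and are not taken from the paper; only the weight of 0.1 comes from the TL;DR.

```python
import math
import random


def softmax_cross_entropy(logits, target):
    """Cross-entropy between softmax(logits) and a (possibly soft) target."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, target))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mixup_loss(img_a, img_b, label_a, label_b, text_embs, alpha=1.0, rng=random):
    """Generic mix-up (an assumption, not the paper's exact loss)."""
    # Sample the blend ratio from Beta(alpha, alpha), as in standard mix-up.
    lam = rng.betavariate(alpha, alpha)
    # Blend the two image feature vectors.
    mixed = [lam * x + (1.0 - lam) * y for x, y in zip(img_a, img_b)]
    # Blend the two one-hot labels into one soft target.
    target = [0.0] * len(text_embs)
    target[label_a] += lam
    target[label_b] += 1.0 - lam
    # Score the blended image feature against each class text embedding.
    logits = [dot(mixed, t) for t in text_embs]
    return softmax_cross_entropy(logits, target)


def total_loss(actual, mix, weight=0.1):
    """L_total = L_actual + weight * L_mix, with weight = 0.1 per the TL;DR."""
    return actual + weight * mix


if __name__ == "__main__":
    text_embs = [[1.0, 0.0], [0.0, 1.0]]  # toy class text embeddings
    mix = mixup_loss([2.0, 0.0], [0.0, 2.0], 0, 1, text_embs)
    actual = softmax_cross_entropy([2.0, 0.0], [1.0, 0.0])  # stand-in for L_actual
    print(total_loss(actual, mix))
```

With a weight of 0.1 the mix-up term acts as a mild regularizer on top of the main finetuning objective rather than replacing it.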

Paper Structure

This paper contains 12 sections, 8 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Overview of Our Proposed Mix-up Loss.
  • Figure 2: Examples of Accuracy Curves on Target Data.