Tell, Don't Show!: Language Guidance Eases Transfer Across Domains in Images and Videos

Tarun Kalluri; Bodhisattwa Prasad Majumder; Manmohan Chandraker

TL;DR

LaGTran tackles transfer from a labeled source domain to an unlabeled target domain by using text descriptions as supervision. It trains a text classifier on source-domain captions and uses its predictions on target captions to pseudo-label the corresponding target images. This simple cross-modal supervision mechanism outperforms prior unsupervised domain adaptation methods on GeoNet and DomainNet, and extends to video through the Ego2Exo benchmark. The results suggest that language bridges semantic gaps between domains more effectively than pixel-space alignment, including in settings with outliers and in cross-view (ego-exo) transfer, while the method remains efficient at test time. The work offers a practical, data-efficient route to robust transfer in vision tasks and points to future directions in text-based adaptation and broader language supervision.

Abstract

We introduce LaGTran, a novel framework that utilizes text supervision to guide robust transfer of discriminative knowledge from labeled source to unlabeled target data with domain gaps. While unsupervised adaptation methods have been established to address this problem, they show limitations in handling challenging domain shifts because they operate exclusively in pixel space. Motivated by our observation that the semantically richer text modality has more favorable transfer properties, we devise a transfer mechanism that uses a source-trained text classifier to generate predictions on the target text descriptions, and utilizes these predictions as supervision for the corresponding images. Our language-guided approach is surprisingly simple, yet significantly outperforms all prior approaches on challenging datasets like GeoNet and DomainNet, validating its effectiveness. To extend the scope of our study beyond images, we introduce a new benchmark called Ego2Exo to study ego-exo transfer in videos, and find that LaGTran yields significant gains in this highly challenging transfer setting. Code, models, and proposed datasets are publicly available at https://tarun005.github.io/lagtran/.
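The transfer mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a small naive Bayes model stands in for the source-trained text classifier, and the captions, labels, file names, and function names (`train_text_classifier`, `predict`, `pseudo_label_targets`) are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification for this sketch)."""
    return text.lower().split()


def train_text_classifier(captions, labels):
    """Fit a multinomial naive Bayes model on source captions.

    Assumption: this toy model only stands in for the text classifier
    described in the paper.
    """
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    vocab = set()
    for caption, label in zip(captions, labels):
        tokens = tokenize(caption)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"word_counts": word_counts, "class_counts": class_counts, "vocab": vocab}


def predict(model, caption):
    """Return the most likely label for one caption (Laplace smoothing)."""
    total = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n in sorted(model["class_counts"].items()):
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + vocab_size
        score = math.log(n / total)
        for tok in tokenize(caption):
            if tok in model["vocab"]:  # ignore words never seen in the source domain
                score += math.log((counts[tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def pseudo_label_targets(model, target_captions):
    """Predict a pseudo-label for every target caption."""
    return [predict(model, c) for c in target_captions]


# Hypothetical labeled source domain (captions of photos).
source_captions = [
    "a dog running on grass",
    "a small dog in a park",
    "a red car on the road",
    "an old car parked outside",
]
source_labels = ["dog", "dog", "car", "car"]

# Hypothetical unlabeled target domain (sketches), with captions only.
target_images = ["target_0.png", "target_1.png"]
target_captions = ["sketch of a dog", "drawing of a car"]

model = train_text_classifier(source_captions, source_labels)
pseudo_labels = pseudo_label_targets(model, target_captions)

# Each target image inherits the pseudo-label predicted from its caption.
target_supervision = list(zip(target_images, pseudo_labels))
print(target_supervision)
```

In this sketch the target images never pass through the text model. Only their captions do, and the resulting (image, pseudo-label) pairs would then be combined with the labeled source images to train the vision classifier.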

Paper Structure (36 sections, 3 equations, 13 figures, 5 tables)

Introduction
Related Work
Domain robustness in computer vision.
Language supervision in computer vision.
Domain robustness using language supervision.
Method Details
Problem Description and Background
LaGTran for Cross-Domain Transfer
Overview.
Training the text classifier.
Cross-modal supervision transfer.
Extending LaGTran to Handle Outliers
Experiments
LaGTran for Image Classification
Datasets.
...and 21 more sections

Figures (13)

Figure 1: A summary of our insights for LaGTran: In a domain transfer setting with labeled source and unlabeled target domain data, we observe a significantly larger performance drop when transferring an image classifier trained on source images to the target ($17.1\%$) than when transferring a text classifier trained on the corresponding text descriptions of source images ($9.5\%$). We use this insight to build a simple framework called LaGTran that leverages these text descriptions, which are easily available in both domains, to improve transfer in images and videos.
Figure 2: An overview of training using LaGTran: We operate in a setting where the labeled source domain and unlabeled target domain data possess cheaply available or easily generated language descriptions for each image. LaGTran proceeds by first training a BERT classifier $\mathcal{B}$ using source captions and labels (\ref{eq:text_Classify}), and using the trained model to generate pseudo-labels $\hat{y}_t$ for the target captions and corresponding images (\ref{eq:pseudo}). We then use this generated supervision along with source domain data in jointly training a vision classifier $\mathcal{G}$ for image or video classification (\ref{eq:joint_training}).
Figure 3:
Figure 4:
Figure 5:
...and 8 more figures
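
The three equations referenced in the Figure 2 caption are not reproduced on this page. As a hedged reading of that caption, and not the paper's exact formulation, the three steps might be written as follows. The symbols $\mathcal{B}$, $\mathcal{G}$, and $\hat{y}_t$ come from the caption; the source set $\mathcal{D}_s$, target set $\mathcal{D}_t$, images $x$, captions $c$, labels $y$, and the cross-entropy loss $\ell_{\mathrm{CE}}$ are notation assumed here for illustration.

```latex
% Step 1 (assumed form): train the text classifier B on source captions and labels
\mathcal{B}^{*} = \arg\min_{\mathcal{B}} \sum_{(c_s,\, y_s) \in \mathcal{D}_s} \ell_{\mathrm{CE}}\big(\mathcal{B}(c_s),\, y_s\big)

% Step 2 (assumed form): pseudo-label each target sample from its caption
\hat{y}_t = \arg\max_{k} \; \mathcal{B}^{*}(c_t)_k

% Step 3 (assumed form): jointly train the vision classifier G on both domains
\mathcal{G}^{*} = \arg\min_{\mathcal{G}} \sum_{(x_s,\, y_s) \in \mathcal{D}_s} \ell_{\mathrm{CE}}\big(\mathcal{G}(x_s),\, y_s\big) + \sum_{x_t \in \mathcal{D}_t} \ell_{\mathrm{CE}}\big(\mathcal{G}(x_t),\, \hat{y}_t\big)
```

Under this reading, text is needed only to produce $\hat{y}_t$ during training, so $\mathcal{G}$ classifies images or videos at test time without any captions.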
