Semantically Guided Action Anticipation

Anxhelo Diko, Antonino Furnari, Luigi Cinque, Giovanni Maria Farinella

TL;DR

This paper tackles unsupervised domain adaptation by shifting from alignment in absolute coordinates to preserving semantic-geometric relationships across domains, using a language-derived reference structure. It presents LAGUNA, a three-stage framework that (1) constructs a language-based space of reference anchors, (2) trains a language supervisor to map captions to this reference and generate pseudo-labels, and (3) trains a cross-domain visual classifier whose representations are constrained to follow the reference structure via relative encodings and a cross-domain attention mechanism. Key contributions include relative encodings that enforce geometry rather than coordinate overlap, learnable domain-specific anchors that preserve domain peculiarities, and a volume-based regularization that prevents anchor collapse. Experiments on DomainNet, GeoImnet, GeoPlaces, and Ego2Exo show substantial gains over state-of-the-art baselines, while the approach remains significantly more compact than large multilingual-language models. The work highlights the practical value of language-guided, structure-aware alignment for robust cross-domain generalization in vision tasks and lays groundwork for broader multi-modal adaptation.

Abstract

Unsupervised domain adaptation remains a critical challenge in enabling the transfer of model knowledge to unseen domains. Existing methods struggle to balance the need for domain-invariant representations with the preservation of domain-specific features, often because their alignment approaches force samples with similar semantics to be projected close together in the latent space despite drastic domain differences. We introduce a novel approach that shifts the focus from aligning representations in absolute coordinates to aligning the relative positioning of equivalent concepts in latent spaces. Our method defines a domain-agnostic structure based on the semantic/geometric relationships between class labels in language space and uses it to guide adaptation, ensuring that the organization of samples in visual space reflects the reference inter-class relationships while preserving domain-specific characteristics. We empirically demonstrate our method's superiority in domain adaptation tasks across four diverse image and video datasets. Remarkably, we surpass previous works in 18 different adaptation scenarios, with average accuracy improvements of +3.32% on DomainNet, +5.75% on GeoPlaces, and +4.77% on GeoImnet, and a +1.94% mean class accuracy improvement on EgoExo4D.
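
The idea of aligning relative positioning rather than absolute coordinates can be illustrated with a small, self-contained sketch. It assumes a relative encoding is the vector of cosine similarities between a sample and a set of anchors; the function names, the 2-D toy vectors, and the rotation-plus-scaling "target" space are all hypothetical choices made for illustration and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def relative_encoding(x, anchors):
    """Describe x by its similarity to each anchor (assumed form)."""
    return [cosine(x, a) for a in anchors]

def rotate(v, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

# Toy "source" space: three anchors and one sample.
anchors_src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sample_src = [2.0, 1.0]

# Hypothetical "target" space: same geometry, but rotated and rescaled,
# so absolute coordinates no longer match the source.
angle = math.radians(70)
anchors_tgt = [[3.0 * c for c in rotate(a, angle)] for a in anchors_src]
sample_tgt = [3.0 * c for c in rotate(sample_src, angle)]

rel_src = relative_encoding(sample_src, anchors_src)
rel_tgt = relative_encoding(sample_tgt, anchors_tgt)

# Largest difference between the two relative encodings.
max_gap = max(abs(a - b) for a, b in zip(rel_src, rel_tgt))
# Largest difference between the raw (absolute) coordinates.
abs_gap = max(abs(a - b) for a, b in zip(sample_src, sample_tgt))
print("relative gap:", max_gap, "absolute gap:", abs_gap)
```

In this toy setting the two spaces disagree in absolute coordinates yet yield the same relative encoding, which is the property that lets each domain keep its own coordinates while sharing one structure.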

Paper Structure

This paper contains 21 sections, 14 equations, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Left: Existing UDA approaches align source and target spaces in absolute coordinates, potentially overlooking domain-specific characteristics and resulting in partial alignment. Right: LAGUNA aligns spaces in relative terms, preserving distinct absolute coordinates (e.g., circles in source and target) while matching angles $\theta_i^s$, $\theta_i^t$ between data points to a reference structure $\theta_i^l$ ($\theta_i^s \sim \theta_i^l \sim \theta_i^t$), encouraging similar geometric-semantic relations.
  • Figure 2: The 3-stage architecture. (1) We define domain-agnostic semantic anchors $\mathcal{A}$. (2) A language model $\mathcal{G}$ generates pseudo-labels for target data, trained with structural loss $\mathcal{L}_S$ and cross-entropy loss $\mathcal{L}_{CE}$. (3) A visual classifier is trained using an encoder $\mathcal{V}$ to extract features $g_i^s$ and $g_i^t$. We align visual-anchor similarities with text-anchor similarities using $\mathcal{L}_S$, learnable anchors ($\mathcal{A}_s$, $\mathcal{A}_t$), and textual representations ($z_i^t$, $^*A[y_i^S]$). A Cross-Domain Attention layer grounds visual features using $\mathcal{A}_s$, and an MLP classifier is trained with $\mathcal{L}_{CE}$ and regularized by $\mathcal{L}_{Reg}$.
  • Figure 3: (a) Similarity maps of $100$ randomly selected classes from GeoImnet (yellow indicates high similarity), with average accuracies. (b) t-SNE plots of 1000 randomly selected source and target samples from GeoImnet, with the respective MMD scores in relative (right) and absolute (left) spaces.
  • Figure 4: LAGUNA's accuracy with different quantities of pseudo-labeled data (target samples with captions).
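
The Figure 2 caption describes aligning visual-anchor similarities with text-anchor similarities through $\mathcal{L}_S$. The exact form of $\mathcal{L}_S$ is not given in this excerpt, so the sketch below assumes a simple mean squared difference between the two similarity vectors. The function names and toy vectors are hypothetical, and the visual space is deliberately given a different dimensionality from the text space to show that the comparison happens in similarity space rather than in raw coordinates.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarities(x, anchors):
    """Similarity of x to every anchor in its own space."""
    return [cosine(x, a) for a in anchors]

def structural_loss(visual_feat, visual_anchors, text_feat, text_anchors):
    """Assumed stand-in for L_S: mean squared gap between the
    visual-anchor and text-anchor similarity vectors."""
    sim_v = similarities(visual_feat, visual_anchors)
    sim_t = similarities(text_feat, text_anchors)
    return sum((a - b) ** 2 for a, b in zip(sim_v, sim_t)) / len(sim_v)

# Reference structure in a 2-D "language" space (toy values).
text_anchors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
text_feat = [0.9, 0.1]

# Learnable anchors in a 3-D "visual" space (toy values): same angular
# layout as the text anchors, but different coordinates and dimension.
visual_anchors = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, -1.0, 0.0]]

# A visual feature that respects the reference structure ...
visual_good = [0.0, 1.8, 0.2]
# ... and one that sits near the wrong anchor.
visual_bad = [0.0, 0.1, 0.9]

loss_good = structural_loss(visual_good, visual_anchors, text_feat, text_anchors)
loss_bad = structural_loss(visual_bad, visual_anchors, text_feat, text_anchors)
print("loss (structure respected):", loss_good)
print("loss (structure violated): ", loss_bad)
```

Under this assumed loss, a visual feature is penalized only for how its relations to the anchors deviate from the reference relations, not for where it lies in absolute terms.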