Semantically Guided Action Anticipation
Anxhelo Diko, Antonino Furnari, Luigi Cinque, Giovanni Maria Farinella
TL;DR
This paper tackles unsupervised domain adaptation by shifting from aligning absolute coordinates to preserving semantic-geometric relationships across domains, using a language-derived reference structure. It presents LAGUNA, a three-stage framework that (1) constructs a language-based reference anchor space, (2) trains a language supervisor to map captions to this reference and generate pseudo-labels, and (3) trains a cross-domain visual classifier whose representations are constrained to follow the reference structure via relative encodings and a cross-domain attention mechanism. Key contributions include relative encodings that enforce shared geometry rather than coordinate overlap, learnable domain-specific anchors that preserve domain peculiarities, and a volume-based regularization that prevents anchor collapse. Extensive experiments on DomainNet, GeoImnet, GeoPlaces, and Ego2Exo demonstrate substantial gains over state-of-the-art baselines, while the approach remains significantly more compact than large multilingual-language models. The work highlights the practical value of language-guided, structure-aware alignment for robust cross-domain generalization in vision tasks and lays groundwork for broader multi-modal adaptation.
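The volume-based regularization mentioned above can be illustrated with a small sketch. The summary does not give the formula, so the following assumes one common choice: the log-volume spanned by the unit-normalized anchors, computed as half the log-determinant of their Gram matrix. The function name `log_volume` and the `eps` stabilizer are hypothetical and are not taken from the paper.

```python
import numpy as np

def log_volume(anchors, eps=1e-6):
    """Half the log-determinant of the Gram matrix of unit-normalized anchors.

    Assumed form of a volume regularizer (not the paper's exact loss):
    the value is near 0 for mutually orthogonal anchors and becomes
    very negative as the anchors collapse onto one direction.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    # eps * I keeps the determinant positive when anchors are nearly parallel.
    gram = a @ a.T + eps * np.eye(len(a))
    _, logdet = np.linalg.slogdet(gram)
    return 0.5 * logdet

rng = np.random.default_rng(0)
spread = rng.normal(size=(4, 32))                        # well-separated anchors
collapsed = spread[0] + 1e-3 * rng.normal(size=(4, 32))  # nearly identical anchors

vol_spread = log_volume(spread)
vol_collapsed = log_volume(collapsed)
```

Under this assumption, adding the negative log-volume to the training loss would penalize anchor sets that shrink toward a single point, which is the failure mode the regularization is described as preventing.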
Abstract
Unsupervised domain adaptation remains a critical challenge in transferring model knowledge to unseen domains. Existing methods struggle to balance the need for domain-invariant representations with the preservation of domain-specific features, often because their alignment objectives force samples with similar semantics to lie close together in the latent space despite drastic domain differences. We introduce a novel approach that shifts the focus from aligning representations in absolute coordinates to aligning the relative positioning of equivalent concepts in latent spaces. Our method builds a domain-agnostic structure from the semantic/geometric relationships between class labels in language space and uses it to guide adaptation, ensuring that the organization of samples in visual space reflects the reference inter-class relationships while preserving domain-specific characteristics. We empirically demonstrate our method's superiority in domain adaptation tasks across four diverse image and video datasets. Remarkably, we surpass previous works in 18 different adaptation scenarios, with average accuracy improvements of +3.32% on DomainNet, +5.75% on GeoPlaces, and +4.77% on GeoImnet, and a mean class accuracy improvement of +1.94% on EgoExo4D.
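To make the distinction between absolute and relative alignment concrete, here is a minimal sketch. It assumes that a relative encoding is the vector of cosine similarities between a sample and a set of anchors; the function name `relative_encoding` and the toy data are illustrative and not taken from the paper. Two latent spaces that differ by a rotation have different absolute coordinates but identical relative encodings.

```python
import numpy as np

def relative_encoding(x, anchors):
    """Cosine similarity of each sample to each anchor (assumed definition)."""
    x = x / np.linalg.norm(x, axis=-1, keepdims=True)
    a = anchors / np.linalg.norm(anchors, axis=-1, keepdims=True)
    return x @ a.T

rng = np.random.default_rng(0)

# "Source domain": 8 samples and 5 class anchors in a 16-d latent space.
anchors_src = rng.normal(size=(5, 16))
samples_src = rng.normal(size=(8, 16))

# "Target domain": the same geometry under a random orthogonal transform,
# standing in for a domain-specific change of coordinates.
q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
anchors_tgt = anchors_src @ q
samples_tgt = samples_src @ q

rel_src = relative_encoding(samples_src, anchors_src)
rel_tgt = relative_encoding(samples_tgt, anchors_tgt)

# Absolute coordinates differ, relative encodings coincide.
same_absolute = np.allclose(samples_src, samples_tgt)
same_relative = np.allclose(rel_src, rel_tgt)
```

In this toy setting `same_absolute` is false and `same_relative` is true, which is the property that lets each domain keep its own coordinate system while sharing one inter-class structure.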
