Learning Visual-Semantic Subspace Representations

Gabriel Moreira, Manuel Marques, João Paulo Costeira, Alexander Hauptmann

TL;DR

This paper addresses learning image representations that respect semantic partial orders and enable logical reasoning by introducing a nuclear norm-based loss grounded in information-theoretic principles. The core idea is a joint low-rank formulation where $Z=YX$, with a loss $l(X)$ that combines a nuclear-norm term $\|Z\|_*$, a second nuclear-norm term subtracted with weight $\boldsymbol{\alpha}$, and an added $\ell_2$ penalty $\beta\|X\|_2^2$; this yields a spectral embedding of the label Gram matrix $Y^\top Y$ and prevents representation collapse. The learned representations form a Boolean subspace lattice, enabling propositional queries via projection operators and supporting multi-label classification and complex retrieval tasks with logical queries. Empirical results on standard benchmarks and CelebA demonstrate competitive classification performance and effective retrieval with negations, while theoretical results guarantee orthogonalization of minterms and spectral geometry aligned with semantics. Overall, the work provides a principled, interpretable, and modality-agnostic framework for visual-semantic representation learning with strong connections to symbolic reasoning.
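
To make "spectral embedding of the label Gram matrix" concrete, the following is a minimal NumPy sketch, not the paper's procedure. The toy label matrix `Y`, the choice of rank `r`, and the scaling of each eigenvector by the square root of its eigenvalue are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy labels: c = 3 attributes (rows) for n = 5 samples (columns).
Y = np.array([
    [1, 1, 0, 0, 1],
    [0, 1, 1, 0, 1],
    [0, 0, 0, 1, 1],
], dtype=float)

# Label Gram matrix: entry (i, j) counts the attributes shared by samples i and j.
G = Y.T @ Y

# Eigendecomposition of the symmetric PSD Gram matrix, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(G)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the r leading directions. Scaling by sqrt(eigenvalue) is an assumption of this sketch.
r = np.linalg.matrix_rank(Y)
X = (eigvecs[:, :r] * np.sqrt(np.clip(eigvals[:r], 0.0, None))).T  # shape (r, n)

# With r = rank(Y), the embedding reproduces the label Gram matrix.
print(np.allclose(X.T @ X, G))
```

In this toy case, samples that share no attribute (columns 0 and 3 of `Y`) receive orthogonal embeddings, because their Gram entry is zero.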

Abstract

Learning image representations that capture rich semantic relationships remains a significant challenge. Existing approaches are either contrastive, lacking robust theoretical guarantees, or struggle to effectively represent the partial orders inherent to structured visual-semantic data. In this paper, we introduce a nuclear norm-based loss function, grounded in the same information theoretic principles that have proved effective in self-supervised learning. We present a theoretical characterization of this loss, demonstrating that, in addition to promoting class orthogonality, it encodes the spectral geometry of the data within a subspace lattice. This geometric representation allows us to associate logical propositions with subspaces, ensuring that our learned representations adhere to a predefined symbolic structure.
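
As a rough illustration of why nuclear-norm objectives are associated with orthogonality and with avoiding collapse, the sketch below compares the nuclear norm of two sets of unit-norm embeddings. It is not the paper's loss; the matrices `collapsed` and `orthogonal` and the helper `nuclear_norm` are hypothetical names chosen for this example.

```python
import numpy as np

def nuclear_norm(A: np.ndarray) -> float:
    """Sum of the singular values of A."""
    return float(np.linalg.svd(A, compute_uv=False).sum())

n = 4

# All n unit-norm columns are identical (a collapsed representation).
collapsed = np.tile(np.array([[1.0], [0.0], [0.0], [0.0]]), (1, n))

# The n unit-norm columns are mutually orthogonal.
orthogonal = np.eye(n)

print(nuclear_norm(collapsed))   # approximately sqrt(n)
print(nuclear_norm(orthogonal))  # approximately n
```

For `n` unit-norm columns the squared singular values always sum to `n`, so the sum of singular values is largest when they are all equal, which happens when the columns are orthonormal.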

Paper Structure (38 sections, 12 theorems, 58 equations, 6 figures, 6 tables)

Key Result

Lemma 3.1

Let $\mathbf{Y}\in\mathbb{R}^{c\times n}$ and $\mathbf{X}\in\mathbb{R}^{d\times n}$. For any $\mathbf{U}_1 \in O(c), \mathbf{U}_2\in O(d)$ and $\mathbf{V}\in O(n)$
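
Symmetry results over orthogonal groups for nuclear-norm objectives usually rest on the orthogonal invariance of the nuclear norm. The sketch below numerically checks only that standard fact and does not restate the lemma's conclusion. The sizes `c`, `d`, `n`, the random matrices, and the use of the cross-product `Y @ X.T` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
c, d, n = 3, 4, 6  # arbitrary illustrative sizes

def random_orthogonal(k: int) -> np.ndarray:
    """Orthogonal k x k matrix from the QR factorization of a Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.standard_normal((k, k)))
    return Q

def nuclear_norm(A: np.ndarray) -> float:
    """Nuclear norm (sum of singular values) of A."""
    return float(np.linalg.norm(A, ord="nuc"))

Y = rng.standard_normal((c, n))
X = rng.standard_normal((d, n))
U1, U2, V = random_orthogonal(c), random_orthogonal(d), random_orthogonal(n)

# (U1 Y V)(U2 X V)^T = U1 (Y X^T) U2^T, which has the same singular values as Y X^T.
before = nuclear_norm(Y @ X.T)
after = nuclear_norm((U1 @ Y @ V) @ (U2 @ X @ V).T)
print(np.isclose(before, after))
```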

Figures (6)

  • Figure 1: Subspace Boolean lattice. Each axis encodes a minterm of 2 literals: $\mathbf{p}\land\mathbf{q}$, $\neg\mathbf{p}\land\mathbf{q}$ and $\mathbf{p}\land\neg\mathbf{q}$. The propositions $\mathbf{p}$ and $\mathbf{q}$ are represented by two orthogonal 2-d subspaces. The squared norm of the projection of $\mathbf{x}$, with $\|\mathbf{x}\|=1$, over each subspace yields the posterior probability of the corresponding proposition.
  • Figure 2: Gram matrix of $\mathbf{Y} \in \{0,1\}^{6\times 15}$ and of the representations optimized with OLÉ, MMCR and our loss.
  • Figure 3: Top: Inner products between the principal directions of the classes. Bottom: Inner products between the unit $\ell_2$-norm test set embeddings.
  • Figure 4: Ours vs CLIP's top-20 retrieved images from the test set of CelebA using propositional queries and the corresponding natural-language queries.
  • Figure 5: Gram matrix of $\mathbf{Y} \in \{0,1\}^{6\times 15}$ and of the representations optimized with OLÉ, MMCR and our loss, for three synthetic experiments.
  • ...and 1 more figure
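
The geometry described in the Figure 1 caption can be sketched with axis-aligned projectors. This is an illustrative toy, not the paper's implementation: the axis ordering, the vector `x`, and the helper `prob` are hypothetical, and the formulas for conjunction, disjunction, and negation are the standard identities for commuting orthogonal projectors.

```python
import numpy as np

# Axes are the minterms of Figure 1: e0 = p∧q, e1 = ¬p∧q, e2 = p∧¬q.
I = np.eye(3)
P_p = np.diag([1.0, 0.0, 1.0])  # p is the 2-d subspace spanned by p∧q and p∧¬q
P_q = np.diag([1.0, 1.0, 0.0])  # q is the 2-d subspace spanned by p∧q and ¬p∧q

def prob(P: np.ndarray, x: np.ndarray) -> float:
    """Squared norm of the projection of x onto the subspace of projector P."""
    return float(np.linalg.norm(P @ x) ** 2)

# A unit-norm embedding (hypothetical values).
x = np.array([1.0, 2.0, 2.0]) / 3.0

# For commuting projectors: AND is the product, OR is P + Q - PQ, NOT is I - P.
P_and = P_p @ P_q
P_or = P_p + P_q - P_p @ P_q
P_not_p = I - P_p

for name, P in [("p", P_p), ("q", P_q), ("p AND q", P_and),
                ("p OR q", P_or), ("NOT p", P_not_p)]:
    print(name, round(prob(P, x), 4))
```

Because `x` has unit norm and the three minterm axes span the whole space, the squared projection norms onto a proposition and onto its negation sum to one.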

Theorems & Definitions (26)

  • Lemma 3.1: Symmetry
  • Theorem 3.2
  • Theorem 3.3
  • Lemma 3.4: No $\ell_2$-penalty
  • Lemma 3.5
  • Corollary 3.6: Orthogonal disjoint classes
  • Lemma 4.1: Minterm orthogonality
  • Lemma 4.2: Propositions as projections
  • Definition A.1: Spectral Norm
  • Definition A.2: Nuclear Norm
  • ...and 16 more