Learning Structured Representations by Embedding Class Hierarchy with Fast Optimal Transport

Siqi Zeng, Sixian Du, Makoto Yamada, Han Zhao

TL;DR

The paper tackles embedding hierarchical relationships among classes into feature representations for improved generalization, addressing limitations of Euclidean centroid-based CPCC when class-conditional distributions are multi-modal. It generalizes CPCC to an optimal-transport framework (OT-CPCC) by replacing Euclidean distances with Earth Mover's Distance (EMD) between class-conditioned distributions, and introduces a fast linear-time variant, Fast FlowTree (FastFT), that leverages an augmented label tree to reduce OT computation to a 1D greedy flow problem. The authors provide a differentiability analysis (via Danskin's theorem) for OT-CPCC and compare several OT-CPCC methods, showing FastFT achieves strong hierarchical preservation and competitive downstream performance across diverse datasets, with substantial gains in interpretability. Overall, OT-CPCC yields scalable, distribution-aware hierarchical representations that better capture multi-modal class structures, enabling improved fine-level classification and robust hierarchical retrieval in practical settings. The work advances the intersection of hierarchical representation learning and optimal transport, offering a practical, scalable approach with open-source code.

Abstract

To embed structured knowledge within labels into feature representations, prior work [Zeng et al., 2022] proposed to use the Cophenetic Correlation Coefficient (CPCC) as a regularizer during supervised learning. This regularizer calculates pairwise Euclidean distances of class means and aligns them with the corresponding shortest path distances derived from the label hierarchy tree. However, class means may not be good representatives of the class-conditional distributions, especially when those distributions are multimodal. To address this limitation, under the CPCC framework, we propose to use the Earth Mover's Distance (EMD) to measure the pairwise distances among classes in the feature space. We show that our exact EMD method generalizes previous work, and recovers the existing algorithm when class-conditional distributions are Gaussian. To further improve the computational efficiency of our method, we introduce the Optimal Transport-CPCC family by exploring four EMD approximation variants. Our most efficient OT-CPCC variant, the proposed Fast FlowTree algorithm, runs in linear time in the size of the dataset, while maintaining competitive performance across datasets and tasks. The code is available at https://github.com/uiuctml/OTCPCC.
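
To make the CPCC idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes CPCC is computed as the Pearson correlation between tree shortest-path distances and Euclidean distances of class means, as the abstract describes. The toy hierarchy, the class names, and the 2D class means are all hypothetical values chosen for illustration.

```python
import math
from itertools import combinations

# Hypothetical label tree: root -> {animal -> {cat, dog}, vehicle -> {car, bus}}
parent = {
    "cat": "animal", "dog": "animal",
    "car": "vehicle", "bus": "vehicle",
    "animal": "root", "vehicle": "root",
}

def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def tree_distance(u, v):
    """Number of edges on the shortest path between u and v in the tree."""
    steps_from_u = {n: i for i, n in enumerate(path_to_root(u))}
    for j, n in enumerate(path_to_root(v)):
        if n in steps_from_u:  # first shared ancestor is the lowest one
            return steps_from_u[n] + j
    raise ValueError("nodes are not in the same tree")

def pearson(a, b):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical class means in a 2D feature space.
class_means = {"cat": (0.0, 0.0), "dog": (1.0, 0.0),
               "car": (5.0, 5.0), "bus": (6.0, 5.0)}

pairs = list(combinations(sorted(class_means), 2))
tree_d = [tree_distance(u, v) for u, v in pairs]
feat_d = [math.dist(class_means[u], class_means[v]) for u, v in pairs]

# CPCC close to 1 means feature distances mirror the hierarchy.
cpcc = pearson(tree_d, feat_d)
print(f"CPCC = {cpcc:.3f}")
```

In the OT-CPCC setting described above, the entries of `feat_d` would be replaced by EMD values between class-conditional sample sets rather than distances between means.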

Paper Structure (54 sections, 6 theorems, 18 equations, 14 figures, 22 tables, 3 algorithms)

Key Result

Proposition 3.0

$\mathrm{EMD}(\mathcal{N}(\mu_z, \Sigma), \mathcal{N}(\mu_{z'}, \Sigma)) = \|\mu_z - \mu_{z'}\|_2$.
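
The proposition can be sanity-checked numerically on a small empirical analogue. The sketch below is an illustration under stated assumptions, not the paper's proof: a translated copy of a random point cloud stands in for two Gaussians sharing a covariance, and the exact EMD is found by brute force over matchings, which is only feasible for a handful of points. The function name `emd_uniform`, the shift vector, and the sample size are hypothetical.

```python
import itertools
import math
import random

def emd_uniform(xs, ys):
    """Exact EMD between two equal-size point sets with uniform weights.

    With uniform weights and equal sizes, some optimal transport plan is a
    one-to-one matching, so enumerating permutations is exact (tiny n only).
    """
    n = len(xs)
    best = math.inf
    for perm in itertools.permutations(range(n)):
        cost = sum(math.dist(xs[i], ys[j]) for i, j in enumerate(perm)) / n
        best = min(best, cost)
    return best

def mean(points):
    """Coordinate-wise mean of a list of 2D points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(2))

random.seed(0)
shift = (3.0, 4.0)  # hypothetical offset between the two class means
xs = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(6)]
ys = [(x + shift[0], y + shift[1]) for x, y in xs]  # same shape, shifted

emd_value = emd_uniform(xs, ys)
mean_gap = math.dist(mean(xs), mean(ys))
print(f"EMD = {emd_value:.6f}, distance between means = {mean_gap:.6f}")
```

Both quantities agree with the length of the shift vector here because the two point clouds differ only by a translation; for class distributions with different shapes, the EMD is in general larger than the distance between means.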

Figures (14)

  • Figure 1: Comparison between EMD (weighted sum of red lines) and the class mean $\ell_2$ distance (blue line).
  • Figure 1: Time complexity comparison of different CPCC methods. Let $I$ be the number of iterations for iterative methods, $p$ be the number of projections for SWD, and $\Phi$ be the maximum Euclidean distance between any two feature vectors. We use $\tilde{\cdot}$ to represent a factor of $k^2$. For simplicity we assume $m=n$. In the batch learning setting, $d$ can be much larger than $n$, particularly for fine-grained classification problems. See the time-complexity appendix for a detailed analysis.
  • Figure 2: An example of augmented ${\mathcal{T}}$. Leaves become samples of each class, and each sample is assigned some weight (e.g., uniform) within a class. Whenever we call FastFT-CPCC, only the subtree rooted at the lowest common ancestor of a pair of class labels is used. For example, we use the subtree rooted at $F$ to compute the optimal flow between samples with labels $A$ and $B$, and the subtree rooted at $H$ for samples with labels $A$ and $C$.
  • Figure 3: Efficiency and approximation error comparison of OT methods on synthetic datasets.
  • Figure 4: Hierarchical data split. Source and target dataset share the same coarse labels but different fine labels.
  • ...and 9 more figures
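
The TL;DR describes FastFT as reducing the OT computation to a 1D greedy flow problem. As a generic illustration of what a greedy flow in one dimension looks like, and explicitly not the authors' FastFT algorithm, the sketch below computes the EMD between two weighted point sets on a line by sorting both sets and moving mass left to right. The function name `emd_1d_greedy` and the tolerance are hypothetical, and the two weight lists are assumed to sum to the same total.

```python
def emd_1d_greedy(xs, wx, ys, wy, tol=1e-12):
    """EMD between two weighted 1D point sets via a sorted greedy flow.

    Assumes sum(wx) == sum(wy). After sorting, the leftmost remaining mass of
    one set is always shipped to the leftmost remaining mass of the other,
    which is optimal for the cost |x - y| on a line.
    """
    a = sorted(zip(xs, wx))
    b = sorted(zip(ys, wy))
    i = j = 0
    rem_a, rem_b = a[0][1], b[0][1]  # mass still to ship / to receive
    cost = 0.0
    while i < len(a) and j < len(b):
        flow = min(rem_a, rem_b)
        cost += flow * abs(a[i][0] - b[j][0])
        rem_a -= flow
        rem_b -= flow
        if rem_a <= tol:  # source point exhausted, move to the next one
            i += 1
            rem_a = a[i][1] if i < len(a) else 0.0
        if rem_b <= tol:  # target point filled, move to the next one
            j += 1
            rem_b = b[j][1] if j < len(b) else 0.0
    return cost

# Two uniform samples per class: 0 -> 2 and 1 -> 3, each carrying mass 0.5.
cost_uniform = emd_1d_greedy([0.0, 1.0], [0.5, 0.5], [2.0, 3.0], [0.5, 0.5])
# Unequal weights: all mass must reach the single target at 1.0.
cost_weighted = emd_1d_greedy([0.0, 4.0], [0.25, 0.75], [1.0], [1.0])
print(cost_uniform, cost_weighted)
```

Apart from the initial sort, the loop touches each point once, which is the kind of single-pass structure that makes a greedy flow attractive compared with solving a general linear program.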

Theorems & Definitions (9)

  • Proposition 3.0: EMD reduces to $\ell_2$ between means of Gaussian with same covariance
  • Theorem 3.1: Correctness of Fast FlowTree
  • Theorem 3.2
  • Lemma 3.3
  • proof
  • Proposition C.0: EMD reduces to $\ell_2$ between means of Gaussian with same covariance
  • proof
  • Theorem D.1: Correctness of Fast FlowTree
  • proof