Learning Structured Representations by Embedding Class Hierarchy with Fast Optimal Transport

Siqi Zeng; Sixian Du; Makoto Yamada; Han Zhao

TL;DR

The paper tackles the problem of embedding hierarchical relationships among classes into feature representations to improve generalization. It addresses a limitation of the Euclidean, centroid-based CPCC: class means can misrepresent class-conditional distributions that are multi-modal. The paper generalizes CPCC to an optimal-transport framework (OT-CPCC) by replacing Euclidean distances between class means with the Earth Mover's Distance (EMD) between class-conditional distributions, and introduces a linear-time variant, Fast FlowTree (FastFT), which uses an augmented label tree to reduce the OT computation to a 1D greedy flow problem. The authors provide a differentiability analysis of OT-CPCC (via Danskin's theorem) and compare several OT-CPCC methods, showing that FastFT preserves hierarchical structure well and achieves competitive downstream performance across diverse datasets, with gains in interpretability. Overall, OT-CPCC yields scalable, distribution-aware hierarchical representations that better capture multi-modal class structure, supporting fine-level classification and hierarchical retrieval in practical settings. The work sits at the intersection of hierarchical representation learning and optimal transport, and the code is open source.

Abstract

To embed structured knowledge within labels into feature representations, prior work [Zeng et al., 2022] proposed to use the Cophenetic Correlation Coefficient (CPCC) as a regularizer during supervised learning. This regularizer calculates pairwise Euclidean distances of class means and aligns them with the corresponding shortest path distances derived from the label hierarchy tree. However, class means may not be good representatives of the class conditional distributions, especially when they are multi-mode in nature. To address this limitation, under the CPCC framework, we propose to use the Earth Mover's Distance (EMD) to measure the pairwise distances among classes in the feature space. We show that our exact EMD method generalizes previous work, and recovers the existing algorithm when class-conditional distributions are Gaussian. To further improve the computational efficiency of our method, we introduce the Optimal Transport-CPCC family by exploring four EMD approximation variants. Our most efficient OT-CPCC variant, the proposed Fast FlowTree algorithm, runs in linear time in the size of the dataset, while maintaining competitive performance across datasets and tasks. The code is available at https://github.com/uiuctml/OTCPCC.
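To make the regularizer described above concrete, the following minimal sketch computes CPCC as the Pearson correlation between tree distances and Euclidean distances of class means. The toy hierarchy (fine classes A, B, C, D), the `tree_dist` table, and all function names are assumptions made for this illustration; this is not the authors' implementation.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def class_mean(points):
    """Coordinate-wise mean of a list of feature vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def cpcc(features_by_class, tree_dist):
    """Correlate tree distances with class-mean Euclidean distances."""
    labels = sorted(features_by_class)
    means = {c: class_mean(features_by_class[c]) for c in labels}
    t, r = [], []
    for a, b in combinations(labels, 2):
        t.append(tree_dist[(a, b)])               # shortest-path distance in the label tree
        r.append(math.dist(means[a], means[b]))   # Euclidean distance of class means
    return pearson(t, r)


# Hypothetical toy hierarchy: A and B share one coarse parent, C and D another.
tree_dist = {("A", "B"): 2, ("C", "D"): 2,
             ("A", "C"): 4, ("A", "D"): 4, ("B", "C"): 4, ("B", "D"): 4}
features = {
    "A": [(-0.5, 0.0), (0.5, 0.0)],
    "B": [(0.5, 0.0), (1.5, 0.0)],
    "C": [(9.5, 0.0), (10.5, 0.0)],
    "D": [(10.5, 0.0), (11.5, 0.0)],
}
value = cpcc(features, tree_dist)
print(round(value, 3))
```

A value close to 1 indicates that classes that are close in the tree are also close in feature space. In the OT-CPCC setting, `math.dist(means[a], means[b])` would be replaced by a distance between the two sets of samples rather than between their means.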

Paper Structure (54 sections, 6 theorems, 18 equations, 14 figures, 22 tables, 3 algorithms)

Introduction
Preliminaries
Notation and Problem Setup
Cophenetic Correlation Coefficient
Exact Earth Mover's Distance
CPCC with Optimal Transport
OT-CPCC Family
Fast FlowTree
FlowTree
Our Fast FlowTree
Key Insight
Comparison of OT-methods on Synthetic Dataset
Differentiability Analysis of OT-CPCC
Differentiability for Other OT-CPCC Methods
Gradient of $\ell_2$-CPCC
...and 39 more sections

Key Result

Proposition 3.0

$\mathrm{EMD}({\mathcal{N}}(\mu_z, \Sigma), {\mathcal{N}}(\mu_{z'}, \Sigma)) = \|\mu_z - \mu_{z'}\|$.
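The proposition can be sanity-checked numerically in one dimension, where the EMD between two equal-size empirical samples with uniform weights is the mean absolute difference of the sorted values. The sketch below draws samples from two Gaussians with the same standard deviation and compares the empirical EMD with the distance between the means. The sample size, seed, and function name `emd_1d` are assumptions chosen for this illustration.

```python
import random


def emd_1d(xs, ys):
    """Exact 1D EMD between two equal-size samples with uniform weights."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


rng = random.Random(0)   # fixed seed for reproducibility
n = 20000                # illustrative sample size
shift = 3.0              # distance between the two means
sigma = 1.0              # shared standard deviation (same "covariance")

xs = [rng.gauss(0.0, sigma) for _ in range(n)]
ys = [rng.gauss(shift, sigma) for _ in range(n)]

estimate = emd_1d(xs, ys)
print(estimate, shift)   # the two numbers should be close
```

The estimate agrees with the mean distance only up to sampling error, which shrinks as `n` grows. Intuitively, two Gaussians with the same covariance differ by a translation, so moving every unit of mass by the same vector is a feasible transport plan with cost equal to the distance between the means.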

Figures (14)

Figure 1: Comparison between EMD (weighted sum of red lines) and the class mean $\ell_2$ distance (blue line).
Figure 1: Time complexity comparison of different CPCC methods. Let $I$ be the number of iterations for iterative methods, $p$ be the number of projections for SWD, and $\Phi$ be the maximum Euclidean distance between any two feature vectors. We use $\tilde{\cdot}$ to represent a factor of $k^2$. For simplicity we assume $m=n$. In the batch learning setting, $d$ can be much larger than $n$, particularly for fine-grained classification problems. See the appendix on time complexity for a detailed analysis.
Figure 2: An example of the augmented ${\mathcal{T}}$. Leaves become samples of each class, and each sample is assigned some weight (e.g., uniform) within its class. Whenever we call FastFT-CPCC, only the subtree rooted at the lowest common ancestor of a pair of class labels is used. For example, we use the subtree rooted at $F$ to compute the optimal flow between samples with labels $A$ and $B$, and the subtree rooted at $H$ for samples with labels $A$ and $C$.
Figure 3: Efficiency and approximation error comparison of OT methods on synthetic datasets.
Figure 4: Hierarchical data split. Source and target dataset share the same coarse labels but different fine labels.
...and 9 more figures
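The augmented-tree idea in the Figure 2 caption can be sketched as a greedy two-pointer flow. The sketch assumes that every sample of one class is equally far, in the augmented tree, from every sample of the other class, so that any feasible flow has the same tree cost and a greedy one can be built in time linear in the number of samples; the flow is then scored with Euclidean distances, as in FlowTree-style approximations. The function name, the input-order pairing, and the tolerance `eps` are assumptions for illustration, and the authors' Fast FlowTree may differ in its details.

```python
import math


def greedy_flow_cost(xs, ys, a=None, b=None, eps=1e-12):
    """Build a feasible flow between two weighted samples greedily.

    xs, ys: lists of feature vectors for the two classes.
    a, b:   per-sample weights summing to 1 (uniform if omitted).
    Returns (cost, flows) where flows is a list of (i, j, mass).
    """
    m, n = len(xs), len(ys)
    a = list(a) if a is not None else [1.0 / m] * m
    b = list(b) if b is not None else [1.0 / n] * n
    i = j = 0
    cost = 0.0
    flows = []
    while i < m and j < n:
        f = min(a[i], b[j])                      # move as much mass as possible
        cost += f * math.dist(xs[i], ys[j])      # score with Euclidean distance
        flows.append((i, j, f))
        a[i] -= f
        b[j] -= f
        if a[i] <= eps:                          # source sample exhausted
            i += 1
        if b[j] <= eps:                          # target sample filled
            j += 1
    return cost, flows


# Hypothetical samples for two classes, uniform weights.
class_a = [(0.0, 0.0), (1.0, 0.0)]
class_b = [(0.0, 1.0), (1.0, 1.0)]
cost, flows = greedy_flow_cost(class_a, class_b)
total_mass = sum(f for _, _, f in flows)
print(cost, flows)
```

Each loop iteration exhausts at least one sample, so the loop runs at most `m + n - 1` times. The resulting cost is an upper bound on the exact EMD, since the greedy flow is feasible but not necessarily optimal under the Euclidean ground cost.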

Theorems & Definitions (9)

Proposition 3.0: EMD reduces to $\ell_2$ between means of Gaussian with same covariance
Theorem 3.1: Correctness of Fast FlowTree
Theorem 3.2
Lemma 3.3
proof
Proposition C.0: EMD reduces to $\ell_2$ between means of Gaussian with same covariance
proof
Theorem D.1: Correctness of Fast FlowTree
proof
