Cosine-Similarity Methods for Efficient Training and Sampling in High-Dimensional Latent Spaces
Xu Duan, Dongmei Chen
TL;DR
This work tackles the inefficiency of training and sampling in high-dimensional latent spaces by exploiting their semantic geometry through a cosine-similarity framework. It introduces adaptive cosine-based sampling, cosine-guided fine-tuning, and a cosine-cost optimal-transport coupling that align training targets with sampling trajectories, all without changing the model architecture. The methods improve FID at modest compute cost: an 800-epoch RAE improves from 11.99 to 8.60, and a single epoch of fine-tuning from the 20-epoch checkpoint reaches 3.30 FID, matching the 80-epoch baseline. Overall, the paper shows that semantic directional alignment can accelerate diffusion model training and improve sample fidelity in latent spaces.
Abstract
Latent generative models are increasingly shifting from traditional VAEs toward representation autoencoders and semantically aligned latent spaces, which lift images into higher-dimensional feature domains where semantic factors become more separable. Yet these spaces also contain geometric regularities that existing methods do not fully exploit, particularly in the directional relationships between features. We introduce a cosine-similarity-based mechanism that improves both training and sampling by selecting couplings that produce cleaner, less entangled velocity fields. This simple alignment reduces gradient noise, accelerates convergence, and improves sample fidelity. Building on this idea, we develop cosine-similarity-based fine-tuning and time-scheduling strategies that reduce the FID of an 800-epoch RAE from 11.99 to 8.60. Furthermore, by formulating an optimal-transport coupling with a cosine cost, a single epoch of fine-tuning from the 20-epoch checkpoint reaches 3.30 FID, matching the performance of the 80-epoch baseline.
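To make the idea of a cosine-cost coupling concrete, the following is a minimal sketch, not the authors' implementation. It assumes the coupling is a one-to-one assignment between a small batch of noise vectors and a batch of latent vectors, chosen to minimize the total cost 1 - cos(z, x). The function names (`cosine_similarity`, `cosine_cost_matrix`, `cosine_ot_coupling`) are hypothetical, and the brute-force search over permutations is used only to keep the example dependency-free; for realistic batch sizes one would presumably use an assignment solver such as `scipy.optimize.linear_sum_assignment`.

```python
import math
from itertools import permutations


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cosine_cost_matrix(noise, latents):
    """cost[i][j] = 1 - cos(noise[i], latents[j]) (assumed form of the cosine cost)."""
    return [[1.0 - cosine_similarity(z, x) for x in latents] for z in noise]


def cosine_ot_coupling(noise, latents):
    """Pair noise[i] with latents[perm[i]] so that the total cosine cost is minimal.

    Exact but brute force (O(n!)); intended only for tiny illustrative batches.
    """
    if len(noise) != len(latents):
        raise ValueError("noise and latents must have the same batch size")
    cost = cosine_cost_matrix(noise, latents)
    n = len(noise)
    best_perm, best_total = None, math.inf
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total


# Toy batch: each noise vector has one latent pointing in nearly the same direction.
noise = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
latents = [[0.0, 2.0], [-3.0, 0.1], [2.0, 0.1]]
perm, total = cosine_ot_coupling(noise, latents)
print(perm)  # [2, 0, 1]
```

In this toy batch each noise vector is matched to the latent with the closest direction, regardless of the latents' norms. That is the property a cosine cost emphasizes relative to a squared-Euclidean cost, which would also penalize differences in magnitude.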
