Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification

Zinan Lin, Enshu Liu, Xuefei Ning, Junyi Zhu, Wenyu Wang, Sergey Yekhanin

TL;DR

Latent Zoning Network (LZN) introduces a unified latent space shared across data types. The space follows a Gaussian prior $\mathcal{N}(0,\mathbf{I})$ and, for each data type, is partitioned into disjoint latent zones, one per sample. Two atomic operations, latent computation via flow matching and latent alignment across data types, let a single framework perform generative modeling, representation learning, and classification through different encoder–decoder configurations. Empirically, LZN improves CIFAR10 generation when added to an existing generative model (Rectified Flow, RF), outperforms MoCo and SimCLR on unsupervised ImageNet representations, and performs generation and classification jointly on CIFAR10, improving FID and reaching SoTA classification accuracy. The work also provides extensive ablations and efficiency optimizations. The approach suggests a scalable path toward multi-task synergy through a shared, principled latent space, and the code and trained models are open source.

Abstract

Generative modeling, representation learning, and classification are three core problems in machine learning (ML), yet their state-of-the-art (SoTA) solutions remain largely disjoint. In this paper, we ask: Can a unified principle address all three? Such unification could simplify ML pipelines and foster greater synergy across tasks. We introduce Latent Zoning Network (LZN) as a step toward this goal. At its core, LZN creates a shared Gaussian latent space that encodes information across all tasks. Each data type (e.g., images, text, labels) is equipped with an encoder that maps samples to disjoint latent zones, and a decoder that maps latents back to data. ML tasks are expressed as compositions of these encoders and decoders: for example, label-conditional image generation uses a label encoder and image decoder; image embedding uses an image encoder; classification uses an image encoder and label decoder. We demonstrate the promise of LZN in three increasingly complex scenarios: (1) LZN can enhance existing models (image generation): When combined with the SoTA Rectified Flow model, LZN improves FID on CIFAR10 from 2.76 to 2.59, without modifying the training objective. (2) LZN can solve tasks independently (representation learning): LZN can implement unsupervised representation learning without auxiliary loss functions, outperforming the seminal MoCo and SimCLR methods by 9.3% and 0.2%, respectively, on downstream linear classification on ImageNet. (3) LZN can solve multiple tasks simultaneously (joint generation and classification): With image and label encoders/decoders, LZN performs both tasks jointly by design, improving FID and achieving SoTA classification accuracy on CIFAR10. The code and trained models are available at https://github.com/microsoft/latent-zoning-networks. The project website is at https://zinanlin.me/blogs/latent_zoning_networks.html.
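The idea that tasks are compositions of encoders and decoders over one shared latent space can be illustrated with a toy sketch. Everything below is a hypothetical stand-in rather than the released LZN code: the anchors, the linear "image" encoder, and the nearest-anchor label decoder are assumptions chosen only to make the composition pattern concrete. In LZN itself, latent zones are computed with flow matching, not nearest-anchor lookup.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 2

# Hypothetical label anchors: one latent point per class (3 classes).
label_anchors = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 1.5]])

def label_encoder(label: int) -> np.ndarray:
    """Label -> latent. Toy version: return the class anchor."""
    return label_anchors[label]

def label_decoder(z: np.ndarray) -> int:
    """Latent -> label. Toy version: nearest anchor stands in for zone lookup."""
    return int(np.argmin(np.linalg.norm(label_anchors - z, axis=1)))

# Toy "images" are 4-d vectors; the encoder is a fixed random linear map
# and the decoder is its pseudo-inverse (both are assumptions for illustration).
W = rng.normal(size=(LATENT_DIM, 4))

def image_encoder(x: np.ndarray) -> np.ndarray:
    """Image -> latent."""
    return W @ x

def image_decoder(z: np.ndarray) -> np.ndarray:
    """Latent -> image."""
    return np.linalg.pinv(W) @ z

# Tasks are compositions through the shared latent space.
def embed(x: np.ndarray) -> np.ndarray:
    return image_encoder(x)                      # image embedding

def classify(x: np.ndarray) -> int:
    return label_decoder(image_encoder(x))       # classification

def generate_conditional(label: int) -> np.ndarray:
    return image_decoder(label_encoder(label))   # label-conditional generation
```

Because all three functions route through the same latent space, a sample generated for a label and then classified returns that label in this toy setup, which is the kind of cross-task consistency a shared latent space is meant to provide.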

Paper Structure

This paper contains 37 sections, 1 theorem, 8 equations, 26 figures, 8 tables, and 9 algorithms.

Key Result

Theorem 1

Assume that there are $n=2$ samples $x_1,x_2$, with their anchor points $a_1=-1,a_2=1$. We have where $\Phi$ is the CDF function of the standard Gaussian distribution.

Figures (26)

  • Figure 1: Latent Zoning Network (LZN) connects multiple encoders and decoders through a shared latent space, enabling a wide range of ML tasks via different encoder-decoder combinations or standalone encoders/decoders. The figure illustrates eight example tasks, but more could be supported. Only tasks 1-4 are evaluated in this paper, while the rest are for illustration.
  • Figure 2: The latent space of LZN has two key properties: (1) Generative: It follows a simple Gaussian prior, allowing easy sampling for generation tasks. (2) Unified: It serves as a shared representation across all data types (e.g., image, text, label). Each data type induces a distinct partitioning of the latent space into latent zones, where each zone corresponds to a specific sample (e.g., an individual image or label). The latent space is shown as a closed circle for illustration, but it is unbounded in practice.
  • Figure 2: Classification accuracy on ImageNet by training a linear classifier on the unsupervised representations. Methods marked with § are based on contrastive learning. The horizontal line separates baselines that perform worse or better than our LZN. All methods use the ResNet-50 architecture (He et al., 2016) for fair comparison.
  • Figure 3: Training and inference in LZN rely on two atomic operations: (1) Latent computation: Computes latent zones for a data type by encoding samples into anchor points and using flow matching (FM) (Liu et al., 2022; Lipman et al., 2022) to partition the latent space. Conversely, any latent point can be mapped to a sample via the decoder (not shown). (2) Latent alignment: Aligns latent zones across data types by matching their FM processes. This figure also illustrates the approach for LZN in joint conditional generative modeling and classification.
  • Figure 4: LZN for unconditional generative modeling. During training, the LZN latent of each target image is fed as an extra condition to the rectified flow (RF) model (Liu et al., 2022), making the RF learn conditional flows based on LZN latents. The objective remains the standard RF loss, and the LZN encoder is trained end-to-end within it. During generation, we sample LZN latents from a standard Gaussian and use them as the extra condition. We illustrate the approach with RF, but since LZN latents require no supervision and are differentiable, the method could apply to other tasks by adding a condition input for LZN latents to the task network.
  • ...and 21 more figures
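The training setup in the Figure 4 caption, a standard RF loss with the LZN latent added as an extra condition, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: `velocity_fn` is a hypothetical callable standing in for the RF network, `z` stands in for the LZN latent of each target sample, and the straight-line interpolation with velocity target `x1 - x0` is the usual rectified flow formulation.

```python
import numpy as np

def rf_training_pair(x1: np.ndarray, rng: np.random.Generator):
    """Build one rectified-flow training batch from target samples x1.

    Returns the interpolated points x_t, the times t, and the velocity
    targets x1 - x0, where x0 is Gaussian noise.
    """
    x0 = rng.standard_normal(x1.shape)        # noise endpoint
    t = rng.uniform(size=(x1.shape[0], 1))    # one time in [0, 1) per sample
    xt = (1.0 - t) * x0 + t * x1              # straight-line interpolation
    return xt, t, x1 - x0

def conditional_rf_loss(velocity_fn, x1: np.ndarray, z: np.ndarray,
                        rng: np.random.Generator) -> float:
    """Standard RF regression loss; the only change is the extra input z.

    velocity_fn(xt, t, z) is a hypothetical stand-in for the RF network
    that also receives a per-sample latent condition.
    """
    xt, t, target = rf_training_pair(x1, rng)
    pred = velocity_fn(xt, t, z)
    return float(np.mean((pred - target) ** 2))
```

A quick way to see why conditioning can help: if the condition fully identified the target (for instance `z = x1` in this toy setting), the velocity `(z - xt) / (1 - t)` would match the regression target exactly and drive the loss to zero. Real LZN latents are learned and lower-dimensional, so this is only an intuition for the mechanism, not a claim about the paper's results.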

Theorems & Definitions (2)

  • Theorem 1
  • proof