SONA: Learning Conditional, Unconditional, and Mismatching-Aware Discriminator
Yuhta Takida, Satoshi Hayakawa, Takashi Shibuya, Masaaki Imaizumi, Naoki Murata, Bac Nguyen, Toshimitsu Uesaka, Chieh-Hsin Lai, Yuki Mitsufuji
TL;DR
SONA tackles the core challenge of conditional generation in GANs by decoupling authenticity from conditional alignment through a discriminator with separate naturalness and alignment projections. It introduces three synergistic components: unconditional discrimination via a sliced-Wasserstein–based SAN objective, matching-aware discrimination using Bradley–Terry–style mismatched samples, and an adaptive weighting scheme that balances these goals during training. Theoretical results connect SAN to a meaningful distance between data and generator distributions, while BT-based losses yield conditional and mismatching guidance, culminating in a robust overall objective. Empirically, SONA surpasses state-of-the-art discriminators on class-conditional benchmarks and shows strong performance in text-to-image tasks, demonstrating versatility and practical impact for high-fidelity, well-aligned conditional generation.
Abstract
Deep generative models have made significant advances in generating complex content, yet conditional generation remains a fundamental challenge. Existing conditional generative adversarial networks often struggle to balance the dual objectives of assessing authenticity and conditional alignment of input samples within their conditional discriminators. To address this, we propose a novel discriminator design that integrates three key capabilities: unconditional discrimination, matching-aware supervision to enhance alignment sensitivity, and adaptive weighting to dynamically balance all objectives. Specifically, we introduce Sum of Naturalness and Alignment (SONA), which employs separate projections for naturalness (authenticity) and alignment in the final layer with an inductive bias, supported by dedicated objective functions and an adaptive weighting mechanism. Extensive experiments on class-conditional generation tasks show that SONA achieves superior sample quality and conditional alignment compared to state-of-the-art methods. Furthermore, we demonstrate its effectiveness in text-to-image generation, confirming the versatility and robustness of our approach.
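To make the "sum of naturalness and alignment" idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a feature vector h(x) from some backbone, a single hypothetical naturalness direction `w_nat`, and one hypothetical embedding per class in `class_emb`. The discriminator output is taken to be the sum of the two projections, and the matching-aware term is illustrated with a Bradley-Terry-style logistic loss on the gap between matched and mismatched alignment scores. The function names, the plain dot-product form, and the loss shape are all assumptions for illustration. The inductive bias and adaptive weighting mentioned in the abstract are not modeled here.

```python
import math


def dot(u, v):
    """Inner product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def sona_style_score(h, w_nat, class_emb, y):
    """Return (naturalness, alignment, total) for feature h and label y.

    Assumption: both terms are plain linear projections of the same
    last-layer feature h; the paper's exact parameterization may differ.
    """
    naturalness = dot(w_nat, h)        # condition-independent term
    alignment = dot(class_emb[y], h)   # condition-dependent term
    return naturalness, alignment, naturalness + alignment


def bt_matching_loss(align_matched, align_mismatched):
    """Bradley-Terry-style loss: -log sigmoid(matched - mismatched).

    Written in a numerically stable form for large |diff|.
    """
    diff = align_matched - align_mismatched
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Hypothetical toy values: a 2-d feature and two class embeddings.
h = [1.0, 2.0]
w_nat = [0.5, 0.5]
class_emb = [[1.0, 0.0], [0.0, 1.0]]

_, a_match, total = sona_style_score(h, w_nat, class_emb, y=1)
_, a_mismatch, _ = sona_style_score(h, w_nat, class_emb, y=0)
loss = bt_matching_loss(a_match, a_mismatch)
```

In this toy setup the naturalness term does not depend on the label, so swapping in a mismatched label changes only the alignment term. This is the decoupling the abstract describes: the loss falls as the matched alignment score rises above the mismatched one.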
