UniGame: Turning a Unified Multimodal Model Into Its Own Adversary
Zhaolong Su, Wang Lu, Hao Chen, Sharon Li, Jindong Wang
TL;DR
UniGame tackles the inconsistency between understanding and generation in Unified Multimodal Models by introducing a self-adversarial post-training scheme. It adds a decoder-constrained perturber at the shared visual-token interface to create realistic, on-manifold adversarial samples and trains via a minimax objective between the understanding branch and the perturbing generator, augmented by a hard-sample buffer. The approach yields consistent gains in cross-modal coherence, improves robustness to distributional shifts and adversarial inputs, and remains architecture-agnostic with minimal parameter overhead. This adversarial self-play framework offers a general, efficient path to strengthening the coherence and unified competence of future multimodal foundation models.
Abstract
Unified Multimodal Models (UMMs) have shown impressive performance in both understanding and generation with a single architecture. However, UMMs still exhibit a fundamental inconsistency: understanding favors compact embeddings, whereas generation favors reconstruction-rich representations. This structural trade-off produces misaligned decision boundaries, degraded cross-modal coherence, and heightened vulnerability under distributional and adversarial shifts. In this paper, we present UniGame, a self-adversarial post-training framework that directly targets this inconsistency. By applying a lightweight perturber at the shared token interface, UniGame enables the generation branch to actively seek out and challenge fragile understanding, turning the model into its own adversary. Experiments demonstrate that UniGame significantly improves consistency (+4.6%). It also achieves substantial improvements in understanding (+3.6%), generation (+0.02), and out-of-distribution and adversarial robustness (+4.8% and +6.2% on NaturalBench and AdVQA, respectively). The framework is architecture-agnostic, adds fewer than 1% additional parameters, and is complementary to existing post-training methods. These results position adversarial self-play as a general and effective principle for enhancing the coherence, stability, and unified competence of future multimodal foundation models. The official code is available at: https://github.com/AIFrontierLab/UniGame
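
To make the minimax idea concrete, the following is a toy sketch, not the UniGame implementation. It assumes a linear binary classifier as a stand-in for the understanding branch, random vectors as stand-ins for visual tokens, and a per-sample perturbation projected onto an L2 ball as a crude substitute for the decoder-constrained perturber. All names (`train`, `HardSampleBuffer`, `radius`) are hypothetical and do not come from the official code.

```python
import heapq
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce_loss(w, z, y):
    """Per-sample binary cross-entropy of a linear 'understanding' head."""
    p = sigmoid(z @ w)
    eps = 1e-9
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

class HardSampleBuffer:
    """Keeps the `capacity` highest-loss samples seen so far (min-heap on loss)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []
        self._count = 0  # unique tie-breaker so arrays are never compared

    def add(self, loss, z, y):
        item = (float(loss), self._count, z.copy(), float(y))
        self._count += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif item[0] > self._heap[0][0]:
            heapq.heapreplace(self._heap, item)  # evict the easiest sample

    def samples(self):
        zs = np.stack([it[2] for it in self._heap])
        ys = np.array([it[3] for it in self._heap])
        return zs, ys

    def __len__(self):
        return len(self._heap)

def train(steps=200, dim=8, n=64, radius=0.5, lr=0.1, capacity=16, seed=0):
    rng = np.random.default_rng(seed)
    # Toy "visual tokens" with labels from a hidden linear rule (assumption).
    z = rng.normal(size=(n, dim))
    y = (z @ rng.normal(size=dim) > 0).astype(float)
    w = np.zeros(dim)            # minimizer: understanding head
    delta = np.zeros((n, dim))   # maximizer: per-sample perturbation
    buffer = HardSampleBuffer(capacity)
    for _ in range(steps):
        # Maximizer step: ascend the loss w.r.t. delta.
        # For BCE on a linear logit, d(loss_i)/d(delta_i) = (p_i - y_i) * w.
        err = sigmoid((z + delta) @ w) - y
        delta = delta + lr * err[:, None] * w[None, :]
        # Project each row back into the L2 ball (stand-in for the decoder constraint).
        norms = np.linalg.norm(delta, axis=1, keepdims=True)
        delta = delta * np.minimum(1.0, radius / np.maximum(norms, 1e-12))
        # Record the hardest perturbed samples.
        z_adv = z + delta
        for loss, zi, yi in zip(bce_loss(w, z_adv, y), z_adv, y):
            buffer.add(loss, zi, yi)
        # Minimizer step: descend on clean, adversarial, and buffered samples.
        zb, yb = buffer.samples()
        z_all = np.concatenate([z, z_adv, zb])
        y_all = np.concatenate([y, y, yb])
        grad_w = z_all.T @ (sigmoid(z_all @ w) - y_all) / len(y_all)
        w = w - lr * grad_w
    clean_loss = float(bce_loss(w, z, y).mean())
    return w, delta, buffer, clean_loss

if __name__ == "__main__":
    w, delta, buffer, clean_loss = train()
    print("clean loss:", clean_loss)
    print("max perturbation norm:", np.linalg.norm(delta, axis=1).max())
    print("buffered hard samples:", len(buffer))
```

The alternation mirrors the structure described above: one player perturbs the shared representation to raise the understanding loss, the other updates the understanding parameters to lower it, and the buffer replays the hardest cases. In the actual framework the perturber is a learned module constrained by the decoder rather than a norm-bounded offset.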
