UniGame: Turning a Unified Multimodal Model Into Its Own Adversary

Zhaolong Su, Wang Lu, Hao Chen, Sharon Li, Jindong Wang

TL;DR

UniGame tackles the inconsistency between understanding and generation in Unified Multimodal Models by introducing a self-adversarial post-training scheme. It adds a decoder-constrained perturber at the shared visual-token interface to create realistic, on-manifold adversarial samples and trains via a minimax objective between the understanding branch and the perturbing generator, augmented by a hard-sample buffer. The approach yields consistent gains in cross-modal coherence, improves robustness to distributional shifts and adversarial inputs, and remains architecture-agnostic with minimal parameter overhead. This adversarial self-play framework offers a general, efficient path to strengthening the coherence and unified competence of future multimodal foundation models.

Abstract

Unified Multimodal Models (UMMs) have shown impressive performance in both understanding and generation with a single architecture. However, UMMs still exhibit a fundamental inconsistency: understanding favors compact embeddings, whereas generation favors reconstruction-rich representations. This structural trade-off produces misaligned decision boundaries, degraded cross-modal coherence, and heightened vulnerability under distributional and adversarial shifts. In this paper, we present UniGame, a self-adversarial post-training framework that directly targets this inconsistency. By applying a lightweight perturber at the shared token interface, UniGame enables the generation branch to actively seek out and challenge fragile understanding, turning the model into its own adversary. Experiments demonstrate that UniGame improves consistency (+4.6%). It also improves understanding (+3.6%), generation (+0.02), and out-of-distribution and adversarial robustness (+4.8% on NaturalBench and +6.2% on AdVQA). The framework is architecture-agnostic, introduces less than 1% additional parameters, and is complementary to existing post-training methods. These results position adversarial self-play as a general and effective principle for enhancing the coherence, stability, and unified competence of future multimodal foundation models. The official code is available at: https://github.com/AIFrontierLab/UniGame
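
As a rough illustration of the minimax idea described above, a generic adversarial objective of this kind could be written as follows. This is a sketch under assumptions, not the paper's exact formulation: $C$ is the perturber named in Figure 3, while $\theta$ (understanding-branch parameters), $\phi$ (perturber parameters), $z$ (shared visual tokens of an image), $q$ (question), $y$ (answer), $\mathcal{L}_{U}$ (understanding loss), and the budget $\epsilon$ are notation introduced here for illustration only.

$$
\min_{\theta}\; \max_{\phi}\; \mathbb{E}_{(z, q, y)}\Big[\mathcal{L}_{U}\big(f_{\theta}(\tilde{z}, q),\, y\big)\Big], \qquad \tilde{z} = z + C_{\phi}(z), \quad \|C_{\phi}(z)\| \le \epsilon
$$

The inner maximization lets the perturber search for token-level changes that raise the understanding loss, and the outer minimization trains the understanding branch to resist them. The decoder constraint that keeps perturbed samples realistic is not modeled in this simplified form.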

Paper Structure

This paper contains 50 sections, 17 equations, 12 figures, 9 tables, 1 algorithm.

Figures (12)

  • Figure 1: Qualitative and quantitative analyses of UniGame. (a) Performance vs. consistency score for several models, showing that our models improve significantly on both metrics. (b) The manifolds produced by SFT, reconstruction-based post-training, and UniGame. UniGame expands the training distribution toward hard yet realistic neighborhoods.
  • Figure 2: Illustration of four different post-training paradigms.
  • Figure 3: Overview of UniGame. This adversarial self-play improves understanding robustness and understanding-generation consistency. The perturber $C$ is a lightweight (3-layer MLP) module and the hard buffer $\mathcal{B}$ is a filtering mechanism.
  • Figure 4: (a) Robustness evaluation. (b) Over 5K training steps, the hard-sample loss stays above the clean and adversarial losses, suggesting that UniGame keeps generating the samples that are most challenging for the current model state.
  • Figure 5: Qualitative case studies of UniGame understanding and generation.
  • ...and 7 more figures
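
The Figure 3 caption describes the hard buffer $\mathcal{B}$ as a filtering mechanism. A minimal sketch of one way such a buffer could work is given below, assuming that "hard" means high understanding loss and that the buffer has a fixed capacity. The class name `HardSampleBuffer`, the `min_loss` threshold, and the eviction rule are hypothetical choices made for this illustration, not details taken from the paper or its code.

```python
import heapq
import itertools
from typing import Any, List, Tuple


class HardSampleBuffer:
    """Bounded buffer that keeps the highest-loss samples seen so far.

    Hypothetical illustration: the admission threshold and the
    evict-the-easiest policy are assumptions, not the paper's design.
    """

    def __init__(self, capacity: int, min_loss: float = 0.0) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.min_loss = min_loss
        # Min-heap of (loss, insertion_id, sample); the root is the easiest kept sample.
        self._heap: List[Tuple[float, int, Any]] = []
        # Unique ids break ties so that samples themselves are never compared.
        self._ids = itertools.count()

    def add(self, sample: Any, loss: float) -> bool:
        """Try to store a sample; return True if it was kept."""
        if loss < self.min_loss:
            return False  # filtered out: not hard enough
        entry = (loss, next(self._ids), sample)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
            return True
        if loss > self._heap[0][0]:
            # Buffer is full: replace the easiest stored sample.
            heapq.heapreplace(self._heap, entry)
            return True
        return False

    def hardest(self, k: int) -> List[Tuple[Any, float]]:
        """Return up to k (sample, loss) pairs, hardest first."""
        return [(s, l) for l, _, s in heapq.nlargest(k, self._heap)]

    def __len__(self) -> int:
        return len(self._heap)
```

In a training loop, perturbed samples would be offered to `add` together with their current loss, and `hardest` would supply replay candidates for later updates. A min-heap keeps both insertion and eviction at logarithmic cost in the buffer size.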