ChatUMM: Robust Context Tracking for Conversational Interleaved Generation

Wenxun Dai, Zhiyuan Zhao, Yule Zhong, Yiji Cheng, Jianwei Zhang, Linqing Wang, Shiyi Zhang, Yunlong Lin, Runze He, Fellix Song, Wayne Zhuang, Yong Liu, Haoji Zhang, Yansong Tang, Qinglin Lu, Chunyu Wang

TL;DR

ChatUMM addresses the need for persistent, context-aware dialogue in open-source unified multimodal models by introducing an interleaved multi-turn training paradigm and a systematic data-synthesis pipeline. It models serialized text-image streams with a decoder-only Mixture-of-Transformers and uses Generalized Causal Attention to condition on full dialogue history, optimizing with $L_{CE}$ for understanding and $L_{MSE}$ for image generation. The data pipeline converts single-turn datasets into fluid, stateful dialogues via three stages and a four-dimensional data taxonomy, backed by LLM-powered atomic operations. Empirically, ChatUMM achieves state-of-the-art performance among open-source UMMs on visual understanding and instruction-guided editing benchmarks, maintains strong image-generation fidelity, and demonstrates robust, long-horizon conversational dynamics, offering a strong foundation for end-to-end multimodal dialogue systems. Future work includes scaling to close the gap with agent-based systems and developing a unified tokenizer to reduce computational costs while preserving visual detail.

Abstract

Unified multimodal models (UMMs) have achieved remarkable progress yet remain constrained by a single-turn interaction paradigm, effectively functioning as solvers for independent requests rather than assistants in continuous dialogue. To bridge this gap, we present ChatUMM. As a conversational unified model, it excels at robust context tracking to sustain interleaved multimodal generation. ChatUMM derives its capabilities from two key innovations: an interleaved multi-turn training strategy that models serialized text-image streams as a continuous conversational flow, and a systematic conversational data synthesis pipeline. This pipeline transforms a diverse set of standard single-turn datasets into fluid dialogues through three progressive stages: constructing basic stateful dialogues, enforcing long-range dependency resolution via "distractor" turns with history-dependent query rewriting, and synthesizing naturally interleaved multimodal responses. Extensive evaluations demonstrate that ChatUMM achieves state-of-the-art performance among open-source unified models on visual understanding and instruction-guided editing benchmarks, while maintaining competitive fidelity in text-to-image generation. Notably, ChatUMM exhibits superior robustness in complex multi-turn scenarios, ensuring fluid, context-aware dialogues.
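
The three pipeline stages named in the abstract can be illustrated with a small, hypothetical sketch. This is not the authors' implementation: the `Turn` record, the function names, the sample dictionaries, and the string-template `rewrite_history_dependent` (a stand-in for an LLM rewriting call) are all assumptions made for illustration only.

```python
import random
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Turn:
    user: str                 # user query text
    assistant: tuple          # response segments, e.g. ("<image_1>",)
    distractor: bool = False  # True for an unrelated turn inserted in stage 2

def stage1_basic_dialogue(samples):
    """Stage 1: chain related single-turn samples into a basic stateful dialogue."""
    return [Turn(user=s["query"], assistant=(s["output"],)) for s in samples]

def rewrite_history_dependent(query, target_turn):
    """Hypothetical placeholder for an LLM rewrite that makes the query
    name the earlier turn it depends on."""
    return f"{query} (use the image from turn {target_turn})"

def stage2_insert_distractor(dialogue, distractor_sample, rng):
    """Stage 2: insert an unrelated turn before the final turn, then rewrite
    the final query so it explicitly depends on the earlier history."""
    if len(dialogue) < 2:
        raise ValueError("need at least two related turns")
    pos = rng.randint(1, len(dialogue) - 1)  # never first, never after the final turn
    noise = Turn(distractor_sample["query"], (distractor_sample["output"],), distractor=True)
    out = dialogue[:pos] + [noise] + dialogue[pos:]
    # 1-based index of the last related (non-distractor) turn before the final one
    target = max(i for i, t in enumerate(out[:-1]) if not t.distractor) + 1
    out[-1] = replace(out[-1], user=rewrite_history_dependent(out[-1].user, target))
    return out

def stage3_interleave(dialogue, question, answer):
    """Stage 3: append a Q&A pair to the final turn so the assistant
    response becomes an interleaved image-plus-text stream."""
    final = dialogue[-1]
    merged = replace(final,
                     user=f"{final.user} {question}",
                     assistant=final.assistant + (answer,))
    return dialogue[:-1] + [merged]

# Toy data; the queries and image placeholders are invented for this example.
samples = [
    {"query": "Draw a red bicycle.", "output": "<image_1>"},
    {"query": "Make it blue.", "output": "<image_2>"},
]
distractor = {"query": "Draw a mountain lake.", "output": "<image_x>"}

dialogue = stage1_basic_dialogue(samples)
dialogue = stage2_insert_distractor(dialogue, distractor, random.Random(0))
dialogue = stage3_interleave(dialogue, "What color is it now?", "It is blue.")
for i, turn in enumerate(dialogue, start=1):
    print(i, turn)
```

In this toy run the distractor lands between the two related turns, so the rewritten final query has to point past it to turn 1, which is the kind of long-range dependency the second stage is meant to enforce.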

Paper Structure (15 sections, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Examples of ChatUMM demonstrating diverse conversational capabilities.
  • Figure 2: Overview of ChatUMM (Interleaved Multi-turn Training). The model processes a serialized stream of interleaved text and image tokens using special tokens as structural delimiters. |im_s| and |im_e| encapsulate textual segments, while |v_s| and |v_e| enclose visual content. Crucially, specific token transitions serve as explicit intent signals: predicting |v_s| immediately following |im_e| triggers image generation, whereas predicting |im_s| after |v_e| instructs the model to initiate text generation. The |end| token marks the completion of a turn, and |NTP| represents the text tokens generated via next token prediction. For visual generation, the image of the current turn (e.g., Turn-1 and Turn-2) is processed as noised VAE latents (striped blue), supervised by the flow matching loss $\mathcal{L}_{\text{MSE}}$. To retrieve context from the dialogue history (e.g., Turn-1 referenced during Turn-2), the model attends to historical text tokens, clean VAE latents (solid blue), and ViT tokens (yellow). Concurrently, the standard cross-entropy loss ($\mathcal{L}_{\text{CE}}$) is applied to |NTP| and special tokens to supervise text generation and intent prediction.
  • Figure 3: Overview of our data synthesis pipeline. We transform standard single-turn datasets into fluid, stateful dialogues through three progressive stages. (a) Basic multi-turn construction: We leverage atomic LLM-powered operations to convert single-turn samples (e.g., text-to-image, image editing) into basic stateful dialogues. (b) Independent single-turn insertion: To enforce long-range dependency resolution, unrelated "distractor" turns are inserted into the flow. The subsequent query is rewritten to be explicitly history-dependent (<query-dep>), teaching the model to maintain robust context tracking across a noisy history. (c) Interleaved output generation: We evolve the output modality to sustain interleaved multimodal generation. An LLM (e.g., Gemini 2.5 Pro) generates a relevant Q&A pair based on the image of the final turn, transforming the interaction into a continuous text-image stream (i.e., User: <query-final><Q>, Assistant: <image><A>).
  • Figure 4: Visualization of selected atomic operations. Top: caption2query converts image captions into user queries. Bottom: query2dep_q transforms user queries into specific, explicit instructions that clearly identify the target image or subject.
  • Figure 5: Qualitative examples of multimodal understanding capabilities. ChatUMM demonstrates versatility in diverse tasks, including chart interpretation and reasoning, celebrity recognition with knowledge retrieval, explanation of visual humor, and detailed image captioning.
  • ...and 5 more figures
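
The special-token convention described in the Figure 2 caption can be sketched as a small parser. This is an illustrative reading of the caption only, not the model's decoding code: the function names are hypothetical, and the `"<latent>"` string is a stand-in for image content, which the caption describes as VAE latents rather than discrete tokens.

```python
IM_S, IM_E = "|im_s|", "|im_e|"   # delimit textual segments
V_S, V_E = "|v_s|", "|v_e|"       # delimit visual content
END = "|end|"                     # marks the completion of a turn

def next_mode(prev_token, predicted_token):
    """Map a predicted special token to the generation mode it signals."""
    if predicted_token == END:
        return "stop"
    if prev_token == IM_E and predicted_token == V_S:
        return "image"   # |v_s| right after |im_e| triggers image generation
    if prev_token == V_E and predicted_token == IM_S:
        return "text"    # |im_s| after |v_e| initiates text generation
    return None          # not one of the transitions named in the caption

def segment_turn(tokens):
    """Split one serialized turn into (mode, payload) segments."""
    segments, current, mode = [], [], None
    for tok in tokens:
        if tok == IM_S:
            mode, current = "text", []
        elif tok == V_S:
            mode, current = "image", []
        elif tok in (IM_E, V_E):
            segments.append((mode, current))
            mode = None
        elif tok == END:
            break
        elif mode is not None:
            current.append(tok)
    return segments

# Toy serialized turn: text, then an image, then text again.
stream = [IM_S, "Here", "you", "go", IM_E,
          V_S, "<latent>", V_E,
          IM_S, "Done", IM_E, END]
print(segment_turn(stream))
print(next_mode(IM_E, V_S), next_mode(V_E, IM_S), next_mode(IM_E, END))
```

Treating the transitions as a lookup on the previous and predicted token mirrors the caption's point that intent is carried by which delimiter the model predicts next, so no separate routing signal is needed in this sketch.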