
OMG-Agent: Toward Robust Missing Modality Generation with Decoupled Coarse-to-Fine Agentic Workflows

Ruiting Dai, Zheyu Wang, Haoyu Yang, Yihan Liu, Chengzhi Wang, Zekun Zhang, Zishan Huang, Jiaman Cen, Lisi Mo

TL;DR

OMG-Agent is a framework that replaces static mapping with a dynamic coarse-to-fine agentic workflow. It consistently surpasses state-of-the-art methods and remains robust under extreme missingness.

Abstract

Data incompleteness severely impedes the reliability of multimodal systems. Existing reconstruction methods face distinct bottlenecks: conventional parametric/generative models are prone to hallucinations due to over-reliance on internal memory, while retrieval-augmented frameworks struggle with retrieval rigidity. Critically, these end-to-end architectures are fundamentally constrained by Semantic-Detail Entanglement, a structural conflict between logical reasoning and signal synthesis that compromises fidelity. In this paper, we present Omni-Modality Generation Agent (OMG-Agent), a novel framework that shifts the paradigm from static mapping to a dynamic coarse-to-fine Agentic Workflow. By mimicking a deliberate-then-act cognitive process, OMG-Agent explicitly decouples the task into three synergistic stages: (1) an MLLM-driven Semantic Planner that resolves input ambiguity via Progressive Contextual Reasoning, creating a deterministic structured semantic plan; (2) a non-parametric Evidence Retriever that grounds abstract semantics in external knowledge; and (3) a Retrieval-Injected Executor that utilizes retrieved evidence as flexible feature prompts to overcome rigidity and synthesize high-fidelity details. Extensive experiments on multiple benchmarks demonstrate that OMG-Agent consistently surpasses state-of-the-art methods, maintaining robustness under extreme missingness, e.g., a 2.6-point gain on CMU-MOSI at a 70% missing rate.
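The three-stage decoupling described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the `Plan` structure, the token-overlap retrieval, and the feature-averaging "executor" are all hypothetical stand-ins chosen only to show how a planner, a non-parametric retriever, and an evidence-conditioned executor might be chained. In the paper, the planner is an MLLM and the executor is a generative model.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    """Coarse semantic plan for the missing modality (hypothetical structure)."""
    target_modality: str
    description: str


def plan_semantics(observed: dict[str, str], target_modality: str) -> Plan:
    # Stand-in for the MLLM-driven Semantic Planner: a real planner would
    # reason over the observed modalities; here we only join their text.
    description = " ".join(observed[name] for name in sorted(observed))
    return Plan(target_modality, description)


def _jaccard(a: str, b: str) -> float:
    # Token-set overlap, used as a toy similarity score (an assumption,
    # not the retrieval metric used in the paper).
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def retrieve_evidence(
    plan: Plan, memory: list[tuple[str, list[float]]], k: int
) -> list[list[float]]:
    # Stand-in for the non-parametric Evidence Retriever: rank stored
    # (caption, feature) pairs by similarity to the plan and keep the top k.
    ranked = sorted(
        memory, key=lambda item: _jaccard(plan.description, item[0]), reverse=True
    )
    return [features for _, features in ranked[:k]]


def execute(plan: Plan, evidence: list[list[float]]) -> list[float]:
    # Stand-in for the Retrieval-Injected Executor: a real executor would
    # condition a generator on both the plan and the evidence; this toy
    # version ignores the plan and simply averages the retrieved features.
    if not evidence:
        raise ValueError("no evidence retrieved")
    dim = len(evidence[0])
    return [sum(vec[i] for vec in evidence) / len(evidence) for i in range(dim)]


def generate_missing(
    observed: dict[str, str],
    target_modality: str,
    memory: list[tuple[str, list[float]]],
    k: int = 2,
) -> list[float]:
    # Coarse-to-fine chain: plan first, then ground, then synthesize.
    plan = plan_semantics(observed, target_modality)
    evidence = retrieve_evidence(plan, memory, k)
    return execute(plan, evidence)
```

Calling `generate_missing({"text": "the speaker is happy"}, "audio", memory, k=2)` with a small list of `(caption, feature)` pairs runs all three stages in order. The point of the structure is that each stage can be swapped independently, which is the decoupling the abstract argues for.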

Paper Structure (21 sections, 18 equations, 6 figures, 3 tables)


Figures (6)

  • Figure 1: Overview of OMG-Agent: 1) Planner, 2) Retriever and 3) Executor. Three agents synergistically decouple high-level reasoning from low-level synthesis via a coarse-to-fine workflow.
  • Figure 2: OMG-Agent Architecture. The workflow decouples generation into: (1) Semantic Planning: $\pi_{\mathcal{P}}$ reasoning for plan $S$; (2) Evidence Retrieval: $\pi_{\mathcal{R}}$ anchoring $S$ into evidence $E$ from $\mathcal{D}$; (3) Retrieval-Injected Execution: $\pi_{\mathcal{E}}$ synthesizing $\hat{Y}$ via dual-conditioning of $c_S$ and $E$.
  • Figure 3: Performance under random missing rates (0.0 to 0.7) on the test set. (a)-(c): $ACC_2$/F1/$ACC_7$ on CMU-MOSI. (d)-(f): $ACC_2$/F1/$ACC_7$ on CMU-MOSEI.
  • Figure 4: Ablation study of the OMG-Agent architecture on the test set (MR=0.7). We report $ACC_2$/F1/$ACC_7$ scores on CMU-MOSI (a-c) and CMU-MOSEI (d-f).
  • Figure 5: Ablation study of retrieval quantity K. We report ACC/F1 on CMU-MOSI (a) and CMU-MOSEI (b).
  • ...and 1 more figure
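
Read as a composition, the three policies named in the Figure 2 caption can be written out as follows. This is a sketch inferred from the caption alone: the symbol $X_{\mathrm{obs}}$ for the observed modalities is an assumption introduced here, and treating $c_S$ as a conditioning signal derived from the plan $S$ is a reading of the subscript rather than a definition taken from the paper.

```latex
\begin{align}
S       &= \pi_{\mathcal{P}}(X_{\mathrm{obs}})   && \text{(Semantic Planning)} \\
E       &= \pi_{\mathcal{R}}(S, \mathcal{D})     && \text{(Evidence Retrieval)} \\
\hat{Y} &= \pi_{\mathcal{E}}(c_S, E)             && \text{(Retrieval-Injected Execution)}
\end{align}
```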