Generative Distribution Prediction: A Unified Approach to Multimodal Learning
Xinyu Tian, Xiaotong Shen
TL;DR
GDP addresses the challenge of integrating heterogeneous multimodal data by learning the data-generating distribution from conditional synthetic data produced by diffusion-based generators. It combines transfer learning with dual-level embeddings and a unified diffusion backbone to produce both point predictions and predictive distributions across tabular and unstructured modalities, and it provides theoretical guarantees on excess risk via Wasserstein-distance bounds. The framework is validated on four tasks (domain adaptation, image captioning, question answering, and adaptive quantile regression), where it shows competitive or superior predictive accuracy and robust domain transfer. This work offers a scalable, distribution-focused approach to multimodal analytics, with practical impact in settings where heterogeneous data and domain shifts are prevalent.
Abstract
Accurate prediction with multimodal data, encompassing tabular, textual, and visual inputs or outputs, is fundamental to advancing analytics in diverse application domains. Traditional approaches often struggle to integrate heterogeneous data types while maintaining high predictive accuracy. We introduce Generative Distribution Prediction (GDP), a novel framework that leverages multimodal synthetic data generation, such as conditional diffusion models, to enhance predictive performance across structured and unstructured modalities. GDP is model-agnostic, compatible with any high-fidelity generative model, and supports transfer learning for domain adaptation. We establish a rigorous theoretical foundation for GDP, providing statistical guarantees on its predictive accuracy when using diffusion models as the generative backbone. By estimating the data-generating distribution and adapting to various loss functions for risk minimization, GDP enables accurate point predictions across multimodal settings. We empirically validate GDP on four supervised learning tasks (tabular data prediction, question answering, image captioning, and adaptive quantile regression), demonstrating its versatility and effectiveness across diverse domains.
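To make the step from an estimated distribution to a loss-specific point prediction concrete, the following is a minimal sketch and not the authors' implementation. It assumes synthetic responses have already been drawn from a conditional generator; the function `sample_conditional` is a hypothetical Gaussian stand-in for a trained conditional diffusion model, and the grid search in `point_prediction` is an illustrative choice of optimizer. The point prediction is taken as the value that minimizes the empirical risk over the synthetic samples, so changing the loss changes the prediction without retraining the generator.

```python
import numpy as np


def sample_conditional(x, n_samples, rng):
    # Hypothetical stand-in for a trained conditional generator (for
    # example, a conditional diffusion model). Here Y | X = x is Gaussian.
    return rng.normal(loc=2.0 * x, scale=1.0 + abs(x), size=n_samples)


def squared_loss(theta, y):
    # Empirical risk under squared error; minimized by the sample mean.
    return np.mean((y - theta) ** 2)


def make_pinball_loss(tau):
    # Empirical risk under the pinball (quantile) loss at level tau;
    # minimized by an empirical tau-quantile of the samples.
    def pinball_loss(theta, y):
        diff = y - theta
        return np.mean(np.maximum(tau * diff, (tau - 1.0) * diff))

    return pinball_loss


def point_prediction(samples, loss, grid_size=2001):
    # Minimize the empirical risk over a grid spanning the sample range.
    # The grid search is an illustrative assumption, chosen for clarity.
    candidates = np.linspace(samples.min(), samples.max(), grid_size)
    risks = np.array([loss(theta, samples) for theta in candidates])
    return float(candidates[np.argmin(risks)])


rng = np.random.default_rng(0)
synthetic = sample_conditional(x=1.0, n_samples=5000, rng=rng)

mean_like = point_prediction(synthetic, squared_loss)
upper_q = point_prediction(synthetic, make_pinball_loss(0.9))
print(mean_like, upper_q)
```

Under squared loss the minimizer lands near the sample mean, and under the pinball loss at level 0.9 it lands near the empirical 0.9-quantile. This is the sense in which a single estimated conditional distribution can serve several prediction targets, including quantile regression.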
