Omni-Dish: Photorealistic and Faithful Image Generation and Editing for Arbitrary Chinese Dishes
Huijie Liu, Bingcan Wang, Jie Hu, Xiaoming Wei, Guoliang Kang
TL;DR
Omni-Dish tackles the challenge of generating photorealistic and faithful images for arbitrary Chinese dishes from text. It introduces a large-scale, curated 100M dish name–image dataset, a two-stage recaption strategy, and a coarse-to-fine training regime to learn fine-grained culinary details, complemented by a high-quality caption library for inference. Building on the generation model, the authors develop a dish editing framework via Concept-Enhanced P2P and a DiT-based editing model trained with multi-task data, enabling precise addition/removal of ingredients and other edits. Extensive experiments using automated metrics and human evaluations demonstrate state-of-the-art performance in both dish generation fidelity and editing effectiveness, highlighting Omni-Dish’s potential for culinary content creation, e-commerce, and domain-specific image manipulation.
Abstract
Dish images play a crucial role in the digital era, and demand for culturally distinctive dish images keeps growing as the food industry and e-commerce digitize. Existing text-to-image generation models excel at producing high-quality images in general settings; however, they struggle to capture the diverse characteristics and faithful details of specific domains, particularly Chinese dishes. To address this limitation, we propose Omni-Dish, the first text-to-image generation model specifically tailored for Chinese dishes. We develop a comprehensive dish curation pipeline and use it to build the largest dish dataset to date. Additionally, we introduce a recaption strategy and employ a coarse-to-fine training scheme to help the model better learn fine-grained culinary nuances. During inference, we enhance the user's textual input using a pre-constructed high-quality caption library and a large language model, enabling more photorealistic and faithful image generation. Furthermore, to extend our model's capability to dish editing tasks, we propose Concept-Enhanced P2P. Based on this approach, we build a dish editing dataset and train a specialized editing model. Extensive experiments demonstrate the superiority of our methods.
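The inference-time prompt enhancement described above can be illustrated with a small sketch. This is not the authors' implementation: the library contents, the string-similarity retrieval, the instruction wording, and the function names `retrieve_captions` and `build_enhancement_prompt` are all assumptions made for illustration. The sketch only shows the general idea of looking up reference captions for a short dish name and assembling them into an instruction that a large language model could expand into a detailed caption.

```python
from difflib import SequenceMatcher

# Hypothetical caption library (dish name -> detailed caption).
# The real library and its format are not specified in the abstract.
CAPTION_LIBRARY = {
    "mapo tofu": "Soft tofu cubes in a glossy red chili-bean sauce with "
                 "minced meat, topped with chopped scallions.",
    "kung pao chicken": "Diced chicken stir-fried with peanuts and dried "
                        "red chilies in a dark, glossy sauce.",
    "tomato and egg stir-fry": "Fluffy scrambled eggs with soft tomato "
                               "wedges in a light red sauce.",
}


def retrieve_captions(dish_name, library, top_k=2):
    """Return the top_k (name, caption) pairs most similar to dish_name.

    Similarity here is plain character-level matching on names, an
    assumption standing in for whatever retrieval the authors use.
    """
    query = dish_name.lower().strip()
    ranked = sorted(
        library.items(),
        key=lambda item: SequenceMatcher(None, query, item[0]).ratio(),
        reverse=True,
    )
    return ranked[:top_k]


def build_enhancement_prompt(dish_name, library, top_k=2):
    """Assemble an LLM instruction that uses retrieved captions as references."""
    lines = [
        "Write a detailed, photorealistic image caption for the dish below.",
        "Reference captions for similar dishes:",
    ]
    for name, caption in retrieve_captions(dish_name, library, top_k):
        lines.append(f"- {name}: {caption}")
    lines.append(f"Dish: {dish_name}")
    lines.append("Caption:")
    return "\n".join(lines)


if __name__ == "__main__":
    # The resulting string would be sent to a language model of choice;
    # that call is omitted because the abstract does not name a model.
    print(build_enhancement_prompt("Mapo Tofu", CAPTION_LIBRARY))
```

In this sketch the enhanced caption, not the raw dish name, would be passed to the generation model, which matches the role the abstract assigns to the caption library and the large language model.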
