Omni-Dish: Photorealistic and Faithful Image Generation and Editing for Arbitrary Chinese Dishes

Huijie Liu, Bingcan Wang, Jie Hu, Xiaoming Wei, Guoliang Kang

TL;DR

Omni-Dish tackles the challenge of generating photorealistic and faithful images for arbitrary Chinese dishes from text. It introduces a large-scale, curated 100M dish name–image dataset, a two-stage recaption strategy, and a coarse-to-fine training regime to learn fine-grained culinary details, complemented by a high-quality caption library for inference. Building on the generation model, the authors develop a dish editing framework via Concept-Enhanced P2P and a DiT-based editing model trained with multi-task data, enabling precise addition/removal of ingredients and other edits. Extensive experiments using automated metrics and human evaluations demonstrate state-of-the-art performance in both dish generation fidelity and editing effectiveness, highlighting Omni-Dish’s potential for culinary content creation, e-commerce, and domain-specific image manipulation.

Abstract

Dish images play a crucial role in the digital era, with the demand for culturally distinctive dish images continuously increasing due to the digitization of the food industry and e-commerce. In general cases, existing text-to-image generation models excel in producing high-quality images; however, they struggle to capture diverse characteristics and faithful details of specific domains, particularly Chinese dishes. To address this limitation, we propose Omni-Dish, the first text-to-image generation model specifically tailored for Chinese dishes. We develop a comprehensive dish curation pipeline, building the largest dish dataset to date. Additionally, we introduce a recaption strategy and employ a coarse-to-fine training scheme to help the model better learn fine-grained culinary nuances. During inference, we enhance the user's textual input using a pre-constructed high-quality caption library and a large language model, enabling more photorealistic and faithful image generation. Furthermore, to extend our model's capability for dish editing tasks, we propose Concept-Enhanced P2P. Based on this approach, we build a dish editing dataset and train a specialized editing model. Extensive experiments demonstrate the superiority of our methods.

Paper Structure

This paper contains 15 sections, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Existing methods face challenges in generating photorealistic and faithful images of arbitrary Chinese dishes. Row 1 shows reference images that are real photographs.
  • Figure 2: Nuanced descriptions not only help Omni-Dish generate faithful dish images, but also endow it with fine-grained instruction-following capabilities.
  • Figure 3: Overview of our method. In the yellow block, (a) with the dish curation and recaption, the coarse-to-fine strategy is applied to train Omni-Dish; (b) high-quality captions are obtained from a pre-constructed library and rewritten by large language models for inference. In the green block, (c) the Concept-Enhanced P2P approach is introduced to build the dish editing dataset; (d) a dish editing model is trained through a multi-task data mixture.
  • Figure 4: Dish name correction in two steps. For details, refer to Data Correction in the data curation section.
  • Figure 5: Concept-Enhanced P2P can enhance editing effects while maintaining consistency.
  • ...and 4 more figures