Foodfusion: A Novel Approach for Food Image Composition via Diffusion Models

Chaohua Shi, Xuan Wang, Si Shi, Xule Wang, Mingrui Zhu, Nannan Wang, Xinbo Gao

TL;DR

This paper introduces FC22k, a large-scale, high-quality food image composite dataset, and proposes Foodfusion, a novel food image composition method that leverages pre-trained diffusion models and incorporates a Fusion Module for processing and integrating foreground and background information.

Abstract

Food image composition synthesizes a natural-looking new image from existing dish images and background images. Diffusion models have made significant advances in image generation, enabling end-to-end architectures that yield promising results. However, existing diffusion models struggle to process and fuse information from multiple images, and high-quality public datasets are lacking; together, these obstacles have prevented the application of diffusion models to food image composition. In this paper, we introduce a large-scale, high-quality food image composite dataset, FC22k, which comprises 22,000 triplets of foreground, background, and ground truth images. Additionally, we propose a novel food image composition method, Foodfusion, which leverages pre-trained diffusion models and incorporates a Fusion Module for processing and integrating foreground and background information. This fused information aligns the foreground features with the background structure by merging the global structural information at the cross-attention layer of the denoising UNet. To further enhance the content and structure of the background, we also integrate a Content-Structure Control Module. Extensive experiments demonstrate the effectiveness and scalability of our proposed method.
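
The abstract says that fused foreground and background information is merged at the cross-attention layer of the denoising UNet. As a rough illustration of that general mechanism only, the sketch below implements single-head cross-attention in plain Python, with queries taken from latent tokens and keys and values taken from a fused context. Everything specific here is an assumption made for illustration: the token counts, the dimensions, the random projection weights, and especially the use of simple concatenation as a stand-in for the Fusion Module, whose actual design is not described in this section.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(latents, context, w_q, w_k, w_v):
    """Single-head cross-attention: queries from latents, keys/values from context."""
    q = matmul(latents, w_q)
    k = matmul(context, w_k)
    v = matmul(context, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Gaussian random matrix; stands in for learned weights or features."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_latent, d_ctx, d_head = 8, 6, 4              # hypothetical dimensions

latents = random_matrix(16, d_latent, rng)     # 16 hypothetical UNet latent tokens
fg_tokens = random_matrix(5, d_ctx, rng)       # hypothetical foreground features
bg_tokens = random_matrix(5, d_ctx, rng)       # hypothetical background features

# Assumption: concatenation stands in for the paper's Fusion Module output.
fused = fg_tokens + bg_tokens

w_q = random_matrix(d_latent, d_head, rng)
w_k = random_matrix(d_ctx, d_head, rng)
w_v = random_matrix(d_ctx, d_head, rng)

out, attn = cross_attention(latents, fused, w_q, w_k, w_v)
print(len(out), len(out[0]), len(attn[0]))
```

Each latent token receives a weighted mixture of the fused foreground and background tokens, and each row of `attn` sums to 1. In a real diffusion model the projections are learned and the layer is multi-head, so this shows only the data flow.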

Paper Structure

This paper contains 31 sections, 7 equations, 9 figures, and 1 table.

Figures (9)

  • Figure 1: Illustration of the construction process for our dataset (FC22k). Starting with a ground truth image ($I_{gt}$) containing a clear foreground ($I_{f}$) and background ($I_{b}$), our automated process generates multiple data pairs with different foregrounds but the same background, along with their corresponding GT images.
  • Figure 2: Sample images from the FC22k dataset.
  • Figure 3: Overview of our proposed Foodfusion model. Given a foreground food image $I_f$ and a background $I_b$, Foodfusion effectively processes and merges them. By automatically adjusting the foreground's size, angle, and position, it seamlessly integrates $I_f$ with $I_b$ to create a high-quality composite image $I_c$.
  • Figure 4: Illustration of the Fusion Module, which processes and fuses the foreground and background images.
  • Figure 5: Qualitative comparison with other methods. Our method effectively fuses foreground $I_f$ and background $I_b$ information without requiring additional positional data, such as masks, to generate a high-quality composite food image $I_c$.
  • ...and 4 more figures