BIFRÖST: 3D-Aware Image compositing with Language Instructions

Lingxiao Li, Kaixiong Gong, Weihong Li, Xili Dai, Tao Chen, Xiaojun Yuan, Xiangyu Yue

TL;DR

Bifröst, a novel 3D-aware framework built upon diffusion models to perform instruction-based image composition, significantly outperforms existing methods, providing a robust solution for generating realistically composited images in scenarios demanding intricate spatial understanding.

Abstract

This paper introduces Bifröst, a novel 3D-aware framework that is built upon diffusion models to perform instruction-based image composition. Previous methods concentrate on image compositing at the 2D level, which falls short in handling complex spatial relationships ($\textit{e.g.}$, occlusion). Bifröst addresses these issues by training an MLLM as a 2.5D location predictor and integrating depth maps as an extra condition during the generation process to bridge the gap between 2D and 3D, which enhances spatial comprehension and supports sophisticated spatial interactions. Our method begins by fine-tuning an MLLM with a custom counterfactual dataset to predict 2.5D object locations in complex backgrounds from language instructions. Then, the image-compositing model is designed to process multiple types of input features, enabling it to perform high-fidelity image compositions that account for occlusion, depth blur, and image harmonization. Extensive qualitative and quantitative evaluations demonstrate that Bifröst significantly outperforms existing methods, providing a robust solution for generating realistically composited images in scenarios demanding intricate spatial understanding. This work not only pushes the boundaries of generative image compositing but also reduces reliance on expensive annotated datasets by effectively utilizing existing resources in innovative ways.

Paper Structure

This paper contains 25 sections, 9 equations, 21 figures, 5 tables.

Figures (21)

  • Figure 1: Bifröst results on various personalized image compositing tasks. Top: Bifröst is adept at precise, arbitrary object placement and replacement in a background image with a reference object and a language instruction, and achieves 3D-aware high-fidelity harmonized compositing results; Bottom Left: Given a coarse mask, Bifröst can change the pose of the object to follow the shape of the mask; Bottom Right: Our model adapts the identity of the reference image to the target image without changing the pose.
  • Figure 2: Overview of the inference pipeline of Bifröst. Given a background image $\mathbf{I}_{bg}$ and a text instruction $\boldsymbol{c}_{T}$ that indicates where the object should be composited into the background, the MLLM first predicts the 2.5D location, consisting of a bounding box and the depth of the object. Then a pre-trained depth predictor is applied to estimate the depth of the given images. After that, the depth of the reference object is scaled to the depth value predicted by the MLLM and fused into the background depth at the predicted location. Finally, the masked background image, fused depth, and reference object image are fed to the compositing model, which generates an output image $\mathbf{I}_{out}$ that satisfies the spatial relations in the text instruction $\boldsymbol{c}_T$ and appears visually coherent and natural (e.g., with light and shadow that are consistent with the background image).
  • Figure 3: Overview of the 2.5D counterfactual dataset generation for fine-tuning the MLLM. Given a scene image $I$, one object $o$ is randomly selected as the object whose location is to be predicted (e.g., the laptop in this figure). The depth of the object is predicted by a pre-trained depth predictor. The selected object is then removed from the given image using SAM (i.e., masking the object) followed by an SD-based inpainting model (i.e., inpainting the masked hole). The final data pair consists of a text instruction, a counterfactual image, and the 2.5D location of the selected object $o$.
  • Figure 4: Examples from the 2.5D counterfactual dataset for fine-tuning the MLLM.
  • Figure 5: Overview of the training pipeline of Bifröst for the image compositing stage. A segmentation module is first adopted to obtain the masked image and the object without background, followed by an ID extractor to obtain the object's identity information. A high-frequency filter is then applied to extract the details of the object; the result is stitched with the scene at the predicted location, and a detail extractor complements the ID extractor with texture details. We then use a depth predictor to estimate the depth of the image and apply a depth extractor to capture the spatial information of the scene. Finally, the ID tokens, detail maps, and depth maps are integrated into a pre-trained diffusion model, enabling the target object to seamlessly blend with its surroundings while preserving complex spatial relationships.
  • ...and 16 more figures
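
The depth-fusion step described in the Figure 2 caption (scale the reference object's depth to the value predicted by the MLLM, then fuse it into the background depth at the predicted location) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `resize_nearest` and `fuse_depth`, the pixel-coordinate box format `(x0, y0, x1, y1)`, the choice to rescale so that the object's mean depth matches the predicted value, and the overwrite-inside-mask fusion rule are all assumptions made for illustration; the captions do not specify them, nor whether larger depth values mean nearer or farther.

```python
import numpy as np


def resize_nearest(arr, out_h, out_w):
    """Nearest-neighbour resize of a 2D array (stand-in for a real image resize)."""
    in_h, in_w = arr.shape
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return arr[rows[:, None], cols[None, :]]


def fuse_depth(bg_depth, obj_depth, obj_mask, box, target_depth):
    """Paste a rescaled object depth map into a background depth map.

    bg_depth     : (H, W) background depth map
    obj_depth    : (h, w) depth map of the reference object
    obj_mask     : (h, w) boolean mask of the object's pixels
    box          : (x0, y0, x1, y1) predicted box in pixels (assumed format)
    target_depth : predicted object depth, positive, same units as bg_depth
    """
    x0, y0, x1, y1 = box
    box_h, box_w = y1 - y0, x1 - x0

    # Bring the object depth and mask to the size of the predicted box.
    depth = resize_nearest(obj_depth, box_h, box_w)
    mask = resize_nearest(obj_mask.astype(bool), box_h, box_w)

    fused = bg_depth.copy()
    if not mask.any():
        return fused  # nothing to paste

    # Assumed scaling rule: match the object's mean depth to the prediction.
    scaled = depth * (target_depth / depth[mask].mean())

    # Assumed fusion rule: overwrite background depth inside the object mask.
    region = fused[y0:y1, x0:x1]  # a view into `fused`
    region[mask] = scaled[mask]
    return fused


if __name__ == "__main__":
    bg = np.full((8, 8), 0.2)
    obj = np.array([[0.5, 1.0], [1.0, 1.5]])
    fused = fuse_depth(bg, obj, np.ones((2, 2), dtype=bool), (2, 2, 6, 6), 0.6)
    print(fused.round(2))
```

Under these assumptions, the relative depth variation within the object is preserved while its overall depth follows the predicted value, and the input background depth map is left unmodified.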