Generating Fine Details of Entity Interactions

Xinyi Gu, Jiayuan Mao

TL;DR

This work tackles the difficulty of generating faithful images with fine-grained entity interactions by introducing the InterActing dataset of 1000 prompts spanning three interaction categories. It proposes DetailScribe, a generate-then-refine framework that leverages hierarchical concept decomposition via LLMs and vision-language model critique to guide diffusion-based refinement. Through extensive experiments, DetailScribe outperforms strong baselines on both human judgments and automatic metrics across three interaction scenarios, demonstrating improved fidelity and interaction realism. The dataset and code enable future research into interaction-rich image generation and refined inference strategies for complex scenes.

Abstract

Images not only depict objects but also encapsulate rich interactions between them. However, generating faithful, high-fidelity images of multiple entities interacting with each other is a long-standing challenge. While pre-trained text-to-image models are trained on large-scale datasets to follow diverse text instructions, they struggle to generate accurate interactions, likely due to the scarcity of training data for uncommon object interactions. This paper introduces InterActing, an interaction-focused dataset with 1000 fine-grained prompts covering three key scenarios: (1) functional and action-based interactions, (2) compositional spatial relationships, and (3) multi-subject interactions. To address interaction generation challenges, we propose a decomposition-augmented refinement procedure. Our approach, DetailScribe, built on Stable Diffusion 3.5, leverages LLMs to decompose interactions into finer-grained concepts, uses a VLM to critique generated images, and applies targeted interventions within the diffusion process during refinement. Automatic and human evaluations show significantly improved image quality, demonstrating the potential of enhanced inference strategies. Our dataset and code are available at https://concepts-ai.com/p/detailscribe/ to facilitate future exploration of interaction-rich image generation.

Paper Structure

This paper contains 49 sections, 1 equation, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Left: DetailScribe improves the base text-to-image model across three scenarios: functional interaction, complex scene layouts, and multi-subject interactions. Right: A gallery showcasing DetailScribe-generated images with rich entity interactions.
  • Figure 2: The overall pipeline of DetailScribe. DetailScribe takes as input a single natural language instruction. It first prompts a large language model (LLM) to generate a breakdown of the concepts in the image, which guides a vision-language model (VLM) to attend to different regions of a generated image and suggest fixes. It then adds noise back to the generated image and re-runs the diffusion process with the VLM-refined prompt to generate a faithful and high-fidelity image with rich entity interactions.
  • Figure 3: VLM-based critique and prompt refinement. Given the LLM-generated concept decomposition and an image generated using the user input, a vision-language model generates a critique of errors in the image, suggests corrections, and finally refines the prompt. This prompt will be used in a second-round diffusion process to refine the image.
  • Figure 4: Images generated by DetailScribe and baselines on the InterActing dataset. From left to right: 1) Stable Diffusion: generating images using Stable Diffusion (SD) with the prompt directly; 2) SD + GPT: Stable Diffusion with GPT-augmented prompts; 3) DALL·E 3: prompting DALL·E 3 with the original prompts, which are augmented internally within DALL·E; 4) Ours: DetailScribe generating images with decomposed concepts and VLM-generated critiques. DetailScribe consistently provides effective corrections, which help generate images that closely follow the fine details in the prompts.
  • Figure 5: An illustrative example showing the effectiveness of the explicit concept decomposition module. The VLM first critiques the original generation, identifying the features that need to be corrected (red) and the features that need no further modification (green), and then provides the corrected prompt for re-denoising.
  • ...and 5 more figures
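
The generate-then-refine flow described in the Figure 2 and Figure 3 captions can be sketched as a short orchestration function. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function name `generate_then_refine`, the `Critique` container, the five callables, and the `noise_level` parameter are all hypothetical placeholders for the LLM, the VLM, and the diffusion model, and the default value of `noise_level` is arbitrary rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Critique:
    """Hypothetical container for the VLM's output."""
    errors: List[str]       # problems the VLM found in the first image
    refined_prompt: str     # corrected prompt for the second diffusion pass


def generate_then_refine(
    prompt: str,
    decompose: Callable[[str], List[str]],             # LLM: prompt -> concepts
    generate: Callable[[str], Any],                    # diffusion: prompt -> image
    critique: Callable[[Any, List[str], str], Critique],  # VLM critique
    renoise: Callable[[Any, float], Any],              # add noise back to an image
    denoise: Callable[[Any, str], Any],                # re-run diffusion from noisy image
    noise_level: float = 0.5,                          # assumed knob, value is arbitrary
) -> Any:
    # 1. Ask the LLM for a breakdown of the concepts in the instruction.
    concepts = decompose(prompt)
    # 2. Generate a first image directly from the user's prompt.
    first_image = generate(prompt)
    # 3. Let the VLM critique the image, guided by the concept breakdown,
    #    and produce a refined prompt.
    review = critique(first_image, concepts, prompt)
    # 4. Add noise back to the first image, then re-run diffusion
    #    conditioned on the refined prompt.
    noisy_image = renoise(first_image, noise_level)
    return denoise(noisy_image, review.refined_prompt)
```

Passing the models in as callables keeps the sketch independent of any particular LLM, VLM, or diffusion library; each one can be swapped for a real backend or a stub when testing the control flow.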