Generating an Image From 1,000 Words: Enhancing Text-to-Image With Structured Captions

Eyal Gutflaish, Eliran Kachlon, Hezi Zisman, Tal Hacham, Nimrod Sarid, Alexander Visheratin, Saar Huberman, Gal Davidi, Guy Bukchin, Kfir Goldberg, Ron Mokady

TL;DR

The paper addresses the mismatch between sparse user prompts and richly detailed images by proposing long structured captions for training open-source text-to-image systems. It introduces DimFusion, a compute-efficient LLM fusion mechanism, and Text-as-a-Bottleneck Reconstruction (TaBR), an image-grounded evaluation that measures expressiveness and controllability. The authors present FIBO, an 8B-parameter open-source model trained on 120M licensed image-caption pairs, achieving state-of-the-art prompt adherence and native disentanglement for fine-grained attribute control. This work demonstrates that caption scale and structure can significantly improve controllable image generation and reduce reliance on priors, and it provides a practical, license-respecting resource for professionals and researchers alike.

Abstract

Text-to-image models have rapidly evolved from casual creative tools to professional-grade systems, achieving unprecedented levels of image quality and realism. Yet, most models are trained to map short prompts into detailed images, creating a gap between sparse textual input and rich visual outputs. This mismatch reduces controllability, as models often fill in missing details arbitrarily, biasing toward average user preferences and limiting precision for professional use. We address this limitation by training the first open-source text-to-image model on long structured captions, where every training sample is annotated with the same set of fine-grained attributes. This design maximizes expressive coverage and enables disentangled control over visual factors. To process long captions efficiently, we propose DimFusion, a fusion mechanism that integrates intermediate tokens from a lightweight LLM without increasing token length. We also introduce the Text-as-a-Bottleneck Reconstruction (TaBR) evaluation protocol. By assessing how well real images can be reconstructed through a captioning-generation loop, TaBR directly measures controllability and expressiveness, even for very long captions where existing evaluation methods fail. Finally, we demonstrate our contributions by training the large-scale model FIBO, achieving state-of-the-art prompt alignment among open-source models. Model weights are publicly available at https://huggingface.co/briaai/FIBO
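The captioning-generation loop behind TaBR can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `tabr_score`, the toy captioner and generator, and the Jaccard similarity are hypothetical stand-ins, and this section does not specify how the authors actually score a reconstruction. The toy "images" are just sets of attributes, so the example only shows why a longer caption lets more of the original image survive the text bottleneck.

```python
from typing import Callable, FrozenSet, Sequence

# Toy stand-in: an "image" is a set of visual attributes (assumption for illustration).
ToyImage = FrozenSet[str]


def tabr_score(
    images: Sequence[ToyImage],
    captioner: Callable[[ToyImage], str],
    generator: Callable[[str], ToyImage],
    similarity: Callable[[ToyImage, ToyImage], float],
) -> float:
    """Average similarity between each real image and its reconstruction via text."""
    if not images:
        raise ValueError("need at least one image")
    scores = []
    for image in images:
        caption = captioner(image)           # image -> text (the bottleneck)
        reconstruction = generator(caption)  # text -> image
        scores.append(similarity(image, reconstruction))
    return sum(scores) / len(scores)


def make_captioner(max_attributes: int) -> Callable[[ToyImage], str]:
    """Hypothetical captioner that can only mention `max_attributes` attributes."""
    def captioner(image: ToyImage) -> str:
        return "; ".join(sorted(image)[:max_attributes])
    return captioner


def toy_generator(caption: str) -> ToyImage:
    """Hypothetical generator that renders exactly what the caption states."""
    return frozenset(caption.split("; "))


def jaccard(a: ToyImage, b: ToyImage) -> float:
    """Set overlap, used here as a placeholder similarity measure."""
    return len(a & b) / len(a | b)


toy_images = [
    frozenset({"red car", "wet road", "dusk", "shallow focus"}),
    frozenset({"two people", "smiling", "studio light", "deep focus"}),
]

short_score = tabr_score(toy_images, make_captioner(1), toy_generator, jaccard)
long_score = tabr_score(toy_images, make_captioner(4), toy_generator, jaccard)
print(f"short captions: {short_score:.2f}, long captions: {long_score:.2f}")
```

In this toy setting the one-attribute captioner loses most of each image, while the four-attribute captioner reconstructs it fully. A real evaluation would replace the three callables with an actual captioning model, an image generator, and an image-level comparison.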

Paper Structure

This paper contains 21 sections, 12 figures, 10 tables.

Figures (12)

  • Figure 1: Generating an image from a long structured caption. Training with long structured captions substantially improves controllability and expressiveness. Unlike short prompts, each scene element is specified with rich attributes (e.g., position, size, texture, and relations), while global properties (e.g., background, lighting, composition, and photographic characteristics) are explicitly defined. The result is a complete, machine-readable description that enables faithful, controllable generation for professional use.
  • Figure 2: Workflow. A short caption is expanded by a VLM into a detailed JSON used by FIBO to generate an image. The user can then refine the JSON, and FIBO produces a new image that reflects only the requested changes, demonstrating strong disentanglement between modified and preserved elements.
  • Figure 3: Contextual contradiction. We use prompts from ContraBench [huberman2025image] and Whoops [Bitton_Guetta_2023]. FIBO generates semantically consistent images that faithfully represent the correct relations, unlike other models that often default to typical co-occurrences.
  • Figure 4: Controllability examples. Top: Our model (FIBO) enables precise depth-of-field control, from shallow to deep, whereas FLUX [flux2024] does not consistently produce deep depth of field. Bottom: FIBO allows simultaneous control of facial expressions for multiple subjects within the same scene.
  • Figure 5: Training with long structured captions vs. short captions. Top: Qualitative comparison ($256 \times 256$ resolution). Training with long captions produces more coherent and visually detailed images, showing faster convergence and better alignment. Bottom: Quantitative results on 30K COCO-2014 val images [lin2014microsoft]: long captions yield lower (better) FID.
  • ...and 7 more figures
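
The refinement step in the Figure 2 workflow (edit the JSON, regenerate, and expect only the requested change) can be sketched as below. The field names (`objects`, `expression`, `photographic`, `depth_of_field`) and the helpers `refine` and `changed_paths` are hypothetical and are not taken from the FIBO schema or API. The sketch covers only the caption side of the loop and makes no model call.

```python
import copy
import json

# Hypothetical structured caption; the real FIBO schema may differ.
caption = {
    "objects": [
        {"description": "a woman", "expression": "neutral", "position": "left"},
        {"description": "a man", "expression": "neutral", "position": "right"},
    ],
    "lighting": "soft window light",
    "photographic": {"depth_of_field": "shallow"},
}


def refine(caption, path, value):
    """Return a copy of the caption with the single field at `path` set to `value`."""
    edited = copy.deepcopy(caption)  # leave the original caption untouched
    node = edited
    for key in path[:-1]:
        node = node[key]
    node[path[-1]] = value
    return edited


def changed_paths(before, after, prefix=()):
    """List the paths of leaf values that differ between two captions."""
    if isinstance(before, dict) and isinstance(after, dict):
        paths = []
        for key in sorted(set(before) | set(after)):
            paths += changed_paths(before.get(key), after.get(key), prefix + (key,))
        return paths
    if isinstance(before, list) and isinstance(after, list) and len(before) == len(after):
        paths = []
        for index, (b, a) in enumerate(zip(before, after)):
            paths += changed_paths(b, a, prefix + (index,))
        return paths
    return [] if before == after else [prefix]


# Request a deeper depth of field and a different expression for one subject.
deep_focus = refine(caption, ("photographic", "depth_of_field"), "deep")
man_smiles = refine(caption, ("objects", 1, "expression"), "smiling")

print(json.dumps(deep_focus["photographic"]))
print(changed_paths(caption, deep_focus))
print(changed_paths(caption, man_smiles))
```

Because every other field is carried over verbatim, the edited caption differs from the original in exactly one leaf. Whether the regenerated image preserves the unchanged elements is the disentanglement property the figure attributes to the model, which this sketch does not test.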