BBQ-to-Image: Numeric Bounding Box and Qolor Control in Large-Scale Text-to-Image Models

Eliran Kachlon, Alexander Visheratin, Nimrod Sarid, Tal Hacham, Eyal Gutflaish, Saar Huberman, Hezi Zisman, David Ruppin, Ron Mokady

Abstract

Text-to-image models have rapidly advanced in realism and controllability, with recent approaches leveraging long, detailed captions to support fine-grained generation. However, a fundamental parametric gap remains: existing models rely on descriptive language, whereas professional workflows require precise numeric control over object location, size, and color. In this work, we introduce BBQ, a large-scale text-to-image model that directly conditions on numeric bounding boxes and RGB triplets within a unified structured-text framework. We obtain precise spatial and chromatic control by training on captions enriched with parametric annotations, without architectural modifications or inference-time optimization. This also enables intuitive user interfaces such as object dragging and color pickers, replacing ambiguous iterative prompting with precise, familiar controls. Across comprehensive evaluations, BBQ achieves strong box alignment and improves RGB color fidelity over state-of-the-art baselines. More broadly, our results support a new paradigm in which user intent is translated into an intermediate structured language, consumed by a flow-based transformer acting as a renderer and naturally accommodating numeric parameters.

Paper Structure (22 sections, 1 equation, 7 figures, 3 tables)


Figures (7)

  • Figure 1: Bounding-box and RGB-controlled image generation and refinement. BBQ enables precise spatial and color control by conditioning on explicit numeric bounding boxes and RGB values. In the example, the exact locations of the people and the dog are specified via bounding boxes, and the colors of their clothing are defined using RGB triplets. Beyond initial generation, BBQ enables structured refinement by modifying only the numeric parameters in the caption and re-generating the image. Due to the model’s disentangled control over layout and color, updating bounding boxes (e.g., swapping the man and the woman, or moving the dog to the right) or modifying RGB values results in consistent, targeted changes while preserving the rest of the scene.
  • Figure 2: End-to-end parametric workflow. A short prompt is expanded by a VLM into a structured JSON that includes numeric bounding boxes and RGB values (for clarity, we show only the parametric fields for the woman). The JSON is then provided to BBQ to generate an image. Users can edit specific fields (e.g., box coordinates or color values), and BBQ updates the output accordingly while preserving unrelated content, demonstrating native disentanglement. Notably, BBQ receives no image input, and consistency is maintained solely through the disentangled structured conditioning.
  • Figure 3: Disentangled parametric refinement via structured re-generation. Each example starts from an image generated from a structured JSON prompt. We then edit only the relevant JSON fields and re-generate using the same random seed. Although the model does not observe the original image, it produces localized changes that follow the modified parameters while preserving the rest of the scene, demonstrating strong parametric disentanglement. Ground-truth bounding boxes are overlaid for visualization.
  • Figure 4: Text-as-a-Bottleneck Reconstruction (TaBR). Starting from the original image (left), a detailed caption is generated and used as input to each model. The resulting reconstructions are compared against the original. BBQ more faithfully preserves scene layout, object relations, and fine-grained attributes than competing state-of-the-art models, demonstrating improved expressiveness.
  • Figure 5: Color-conditioning accuracy. Each example shows the target color (left) and images generated by different models when conditioned on the same object and exact RGB value. BBQ achieves high chromatic fidelity to the target color and produces competitive results compared to state-of-the-art text-to-image models under identical color-conditioning prompts.
  • ...and 2 more figures
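
The refinement loop described in the captions for Figures 1 to 3 (edit only the numeric fields of the structured caption, then re-generate with the same seed) can be illustrated with a small sketch. The paper's actual JSON schema is not shown here, so the field names (`objects`, `bbox`, `clothing_rgb`), the normalized `[x_min, y_min, x_max, y_max]` box convention, and all helper functions below are hypothetical. No model is called: the sketch covers only the caption-editing step and a check that an edit touched nothing but the intended fields.

```python
import copy
import json

# Hypothetical structured caption. Field names and the normalized
# [x_min, y_min, x_max, y_max] box convention are assumptions, not the
# schema used by the paper.
caption = {
    "scene": "a man and a woman walking a dog in a park",
    "objects": [
        {"name": "woman", "bbox": [0.10, 0.20, 0.40, 0.95], "clothing_rgb": [200, 30, 45]},
        {"name": "man", "bbox": [0.55, 0.15, 0.85, 0.95], "clothing_rgb": [20, 60, 160]},
        {"name": "dog", "bbox": [0.42, 0.70, 0.54, 0.95]},
    ],
}


def find(cap, name):
    """Return the object entry with the given name."""
    for obj in cap["objects"]:
        if obj["name"] == name:
            return obj
    raise KeyError(name)


def swap_boxes(cap, a, b):
    """Return a copy of the caption with the boxes of objects a and b exchanged."""
    edited = copy.deepcopy(cap)
    obj_a, obj_b = find(edited, a), find(edited, b)
    obj_a["bbox"], obj_b["bbox"] = obj_b["bbox"], obj_a["bbox"]
    return edited


def set_rgb(cap, name, rgb):
    """Return a copy of the caption with a new 8-bit RGB triplet for one object."""
    if len(rgb) != 3 or not all(isinstance(c, int) and 0 <= c <= 255 for c in rgb):
        raise ValueError("rgb must be three integers in [0, 255]")
    edited = copy.deepcopy(cap)
    find(edited, name)["clothing_rgb"] = list(rgb)
    return edited


def changed_fields(before, after):
    """List (object name, field) pairs that differ between two captions."""
    diffs = []
    for old, new in zip(before["objects"], after["objects"]):
        for key in sorted(set(old) | set(new)):
            if old.get(key) != new.get(key):
                diffs.append((old["name"], key))
    return diffs


swapped = swap_boxes(caption, "man", "woman")
recolored = set_rgb(caption, "woman", (0, 128, 255))

# The edited caption would be serialized and passed to the generator
# together with the original random seed (the generation call is omitted).
prompt = json.dumps(recolored)
```

Because each helper returns a deep copy, the original caption stays available for comparison, and `changed_fields` gives a quick way to confirm that an edit is as localized in the prompt as the captions report it to be in the image.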