
CyCLeGen: Cycle-Consistent Layout Prediction and Image Generation in Vision Foundation Models

Xiaojun Shan, Haoyu Shen, Yucheng Mao, Xiang Zhang, Abhay Anand, Bingnan Li, Haiyang Xu, Zhuowen Tu

Abstract

We present CyCLeGen, a unified vision-language foundation model capable of both image understanding and image generation within a single autoregressive framework. Unlike existing vision models that depend on separate modules for perception and synthesis, CyCLeGen adopts a fully integrated architecture that enforces cycle-consistent learning through image → layout → image and layout → image → layout generation loops. This unified formulation introduces two key advantages: introspection, enabling the model to reason about its own generations, and data efficiency, allowing self-improvement via synthetic supervision under a reinforcement learning objective guided by cycle consistency. Extensive experiments show that CyCLeGen achieves significant gains across diverse image understanding and generation benchmarks, highlighting the potential of unified vision-language foundation models.

Paper Structure

This paper contains 37 sections, 9 equations, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Overview of the CyCLeGen framework. CyCLeGen employs a unified autoregressive transformer that jointly performs layout understanding (image → layout) and layout-conditioned image generation (layout → image) within a shared token space. Given image tokens, the model predicts layout and text tokens through the understanding branch; conversely, conditioned on layout and text tokens, the generation branch produces image tokens. Training enforces bidirectional cycle consistency across the two directions, forming image → layout → image and layout → image → layout loops. Cycle rewards/losses encourage the generated outputs in one direction to remain structurally and semantically recoverable in the reverse direction, aligning visual understanding with controllable image generation.
  • Figure 2: Overview of CycleGRPO. CycleGRPO trains a unified autoregressive transformer for both layout understanding and layout-conditioned generation through two complementary directions. The Und → Gen direction (blue arrows) starts from an input image, where the model autoregressively predicts layouts via trajectory rollouts and decodes them into bounding boxes. The Gen → Und direction (purple arrows) starts from a layout prompt, where the model generates images and enforces that the generated images can recover the original layout through understanding. Multiple trajectories are sampled in both directions and optimized with GRPO, providing cycle-consistent reinforcement signals that align visual understanding with image generation.
  • Figure 3: Qualitative comparison of layout-to-image generation. Given the same input layouts (Ground Truth), our CyCLeGen model produces images that more faithfully preserve object geometry, spatial configuration, and fine-grained structure than PlanGen and the diffusion-based baselines (InstanceDiff [wang2024instancediffusion], HiCo [cheng2024hico], CreatiLayout [zhang2024creatilayout]). Across diverse scenes, including people, architectural facades, cable cars, and airplanes, CyCLeGen adheres more closely to layout constraints and generates visually coherent, detail-rich images.
  • Figure 4: Qualitative comparison of layout understanding results. We compare CyCLeGen (top) with PlanGen (middle) and GroundingDINO (bottom) across diverse scenes. CyCLeGen produces more complete and precise detections, with fewer missed objects, cleaner bounding boxes, and more semantically coherent structured descriptions. In contrast, PlanGen often yields fragmented boxes, duplicated predictions, and under-detection in complex scenes. These results demonstrate that our cycle-consistent training substantially boosts structural grounding quality.
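
The captions for Figures 1 and 2 describe a layout → image → layout loop whose reward depends on how well the original layout can be recovered, with several rollouts optimized by GRPO. The sketch below illustrates that idea under assumptions the text above does not confirm. It assumes the cycle reward is the mean best-match IoU between original and recovered boxes, and that advantages are rewards standardized within a rollout group, as is common in GRPO. The names `iou`, `cycle_reward`, and `group_relative_advantages` are hypothetical, and the paper's actual reward may also account for labels and text.

```python
from statistics import mean, pstdev
from typing import List, Tuple

# Assumption: a layout is a list of axis-aligned boxes (x1, y1, x2, y2).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cycle_reward(original: List[Box], recovered: List[Box]) -> float:
    """Hypothetical cycle reward: mean best-match IoU per original box.

    Matching is greedy and not one-to-one; object labels are ignored.
    """
    if not original:
        return 0.0
    best = [max((iou(o, r) for r in recovered), default=0.0) for o in original]
    return mean(best)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one rollout group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One layout prompt, three sampled rollouts; each rollout's generated image
# is assumed to have been parsed back into a recovered layout.
layout = [(0.0, 0.0, 10.0, 10.0)]
recovered_per_rollout = [
    [(0.0, 0.0, 10.0, 10.0)],    # exact recovery
    [(5.0, 0.0, 15.0, 10.0)],    # half-shifted box
    [(20.0, 20.0, 30.0, 30.0)],  # no overlap
]
rewards = [cycle_reward(layout, rec) for rec in recovered_per_rollout]
advantages = group_relative_advantages(rewards)
```

In this toy group, the rollout that recovers the layout exactly receives the largest advantage and the non-overlapping one the smallest. Only this relative ordering within the group drives the policy update in a GRPO-style objective.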