ControlVP: Interactive Geometric Refinement of AI-Generated Images with Consistent Vanishing Points

Ryota Okumura, Kaede Shiohara, Toshihiko Yamasaki

TL;DR

ControlVP addresses vanishing point (VP) inconsistencies in AI-generated architectural images by enabling user-guided refinements. It extends a pre-trained diffusion model with ControlNet-like conditioning on user-drawn building outlines and introduces a Vanishing Point Loss to enforce edge alignment with perspective cues. The approach relies on an inpainting-based VP correction process and a dedicated dataset of VP inconsistencies, and extensive ablations show improved VP accuracy with perceptual quality preserved. The work demonstrates practical potential for geometry-aware editing and downstream tasks such as image-to-3D reconstruction, with an accessible GUI and publicly released code.

Abstract

Recent text-to-image models, such as Stable Diffusion, have achieved impressive visual quality, yet they often suffer from geometric inconsistencies that undermine the structural realism of generated scenes. One prominent issue is vanishing point inconsistency, where projections of parallel lines fail to converge correctly in 2D space. This leads to structurally implausible geometry that degrades spatial realism, especially in architectural scenes. We propose ControlVP, a user-guided framework for correcting vanishing point inconsistencies in generated images. Our approach extends a pre-trained diffusion model by incorporating structural guidance derived from building contours. We also introduce geometric constraints that explicitly encourage alignment between image edges and perspective cues. Our method enhances global geometric consistency while maintaining visual fidelity comparable to the baselines. This capability is particularly valuable for applications that require accurate spatial structure, such as image-to-3D reconstruction. The dataset and source code are available at https://github.com/RyotaOkumura/ControlVP.
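
To make the notion of vanishing point inconsistency concrete, the following is a minimal sketch of how one might test whether several image edges share a single VP. It uses standard homogeneous-coordinate geometry (a line through two points and the intersection of two lines are both cross products). The function names and the sample coordinates are illustrative assumptions, not part of the ControlVP code base.

```python
import numpy as np

def line_through(p, q):
    """Homogeneous line through two image points given as (x, y)."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def intersection(l1, l2):
    """Intersection of two homogeneous lines, or None if they are parallel in the image."""
    x = np.cross(l1, l2)
    if abs(x[2]) < 1e-12:
        return None
    return x[:2] / x[2]

def distance_to_line(pt, l):
    """Perpendicular distance from an image point to a homogeneous line."""
    return abs(l[0] * pt[0] + l[1] * pt[1] + l[2]) / np.hypot(l[0], l[1])

# Two edges that are parallel in 3D define a candidate VP (hypothetical coordinates).
l_a = line_through((0.0, 1.0), (5.0, 0.5))
l_b = line_through((0.0, -2.0), (5.0, -1.0))
vp = intersection(l_a, l_b)

# A third edge is consistent if it passes (almost) through the same VP.
l_good = line_through((0.0, 3.0), (5.0, 1.5))
l_bad = line_through((0.0, 3.0), (5.0, 2.5))
print(vp, distance_to_line(vp, l_good), distance_to_line(vp, l_bad))
```

A large distance for an edge that should belong to the same family of parallel lines signals the kind of inconsistency described above. In practice the edges would come from a line detector, and a tolerance in pixels would be needed.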

Paper Structure

This paper contains 35 sections, 6 equations, 11 figures, 3 tables, 1 algorithm.

Figures (11)

  • Figure 1: Our method, ControlVP, corrects vanishing point (VP) inconsistencies in generated images. In the left image, the green lines converge at a single VP while the red one does not, indicating a geometric inconsistency in the image. By incorporating user-provided building contours as conditions, our method transforms the inconsistent image into a geometrically coherent one where parallel lines properly converge at their respective VPs. Users specify the original and desired building outlines through an interactive interface, allowing for precise geometric refinement. A mask is automatically generated from the areas between the original and desired outlines, limiting modifications to only the necessary regions. The initial image was adapted from Sora [videoworldsimulators2024], a video generation model developed by OpenAI.
  • Figure 2: Examples of VP inconsistencies across different image generation models. From left to right: Stable Diffusion v2.1 [Rombach_2022_CVPR], DALLE-3 [dalle3], and Midjourney v6.1 [midjourney]. A set of parallel lines in 3D space (red and green) should form a single VP, but here they converge at different points.
  • Figure 3: Overview of ControlVP. The upper part shows the basic structure of the latent diffusion model (LDM), where the VAE encoder compresses images into latent space, the U-Net performs noise prediction, and the VAE decoder reconstructs the image. The lower part illustrates the extended ControlNet architecture, which processes condition images (such as building outlines) and provides additional features to the U-Net, enabling geometrically consistent image generation.
  • Figure 4: Classifier-free guidance (CFG) in different image translation tasks. Images are generated under three different ControlNet conditions: (a) edge maps, (b) depth maps, and (c) contours towards the VP. The semitransparent yellow lines overlaid on the generated images of (c) indicate the given contour conditions. CFG harms the texture in conventional ControlNet tasks such as (a) and (b). In contrast, CFG improves the conditioning fidelity while preserving the texture quality in our contour-to-image task (c).
  • Figure 5: Inpainting process for VP correction. The process involves (a) mapping the input image to latent space, (b) adding noise to the latent representation, (c) denoising with the predicted noise in the masked region while using the true noise elsewhere, and (d) decoding the corrected latent to generate an image with consistent VPs while preserving the unmasked regions.
  • ...and 6 more figures
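
Step (c) of the inpainting process in Figure 5 can be illustrated with a small numerical sketch. It assumes the standard DDPM forward process x_t = sqrt(a) * x0 + sqrt(1 - a) * eps and a single-step clean-latent estimate. The array shapes, the value of `alpha_bar`, and the random stand-in for the U-Net prediction are all assumptions made for illustration, and the paper's actual sampler may differ.

```python
import numpy as np

def mix_noise(eps_pred, eps_true, mask):
    """Predicted noise inside the mask (mask == 1), true noise elsewhere."""
    return mask * eps_pred + (1.0 - mask) * eps_true

def estimate_clean_latent(x_t, eps, alpha_bar):
    """Invert x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps for x0."""
    return (x_t - np.sqrt(1.0 - alpha_bar) * eps) / np.sqrt(alpha_bar)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8, 8))            # stand-in for the VAE-encoded latent (a)
mask = np.zeros((1, 8, 8))
mask[:, 2:6, 2:6] = 1.0                    # region to be corrected

alpha_bar = 0.5                            # hypothetical noise level
eps_true = rng.normal(size=x0.shape)
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps_true  # (b)

eps_pred = rng.normal(size=x0.shape)       # placeholder for the U-Net output
eps_mixed = mix_noise(eps_pred, eps_true, mask)                      # (c)
x0_hat = estimate_clean_latent(x_t, eps_mixed, alpha_bar)

keep = np.broadcast_to(mask == 0, x0.shape)
print(np.allclose(x0_hat[keep], x0[keep]))
```

Because the true noise is reused outside the mask, the estimate there reproduces the original latent, which is why the unmasked regions are preserved. Only the masked region depends on the model's prediction.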