
Beyond Geometry: Artistic Disparity Synthesis for Immersive 2D-to-3D

Ping Chen, Zezhou Chen, Xingpeng Zhang, Yanlin Qian, Huan Hu, Xiang Liu, Zipeng Wang, Xin Wang, Zhaoxiang Liu, Kai Wang, Shiguo Lian

TL;DR

This paper proposes Art3D, a preliminary framework exploring a new paradigm: Artistic Disparity Synthesis, shifting the goal from physically accurate disparity estimation to artistically coherent disparity synthesis, and introduces a preliminary evaluation method to quantify cinematic alignment.

Abstract

Current 2D-to-3D conversion methods achieve geometric accuracy but are artistically deficient, failing to replicate the immersive and emotionally resonant experience of professional 3D cinema. This is because geometric reconstruction paradigms mistake deliberate artistic intent, such as strategic zero-plane shifts for pop-out effects and local depth sculpting, for data noise or ambiguity. This paper argues for a new paradigm: Artistic Disparity Synthesis, shifting the goal from physically accurate disparity estimation to artistically coherent disparity synthesis. We propose Art3D, a preliminary framework exploring this paradigm. Art3D uses a dual-path architecture to decouple global depth parameters (macro-intent) from local artistic effects (visual brushstrokes) and learns from professional 3D film data via indirect supervision. We also introduce a preliminary evaluation method to quantify cinematic alignment. Experiments show that our approach has potential for replicating key local out-of-screen effects and aligning with the global depth styles of cinematic 3D content, laying the groundwork for a new class of artistically driven conversion tools.

Paper Structure

This paper contains 16 sections, 9 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Artistic ambiguity in 3D film production. The same scene can exhibit distinct 3D perceptions under different artistic directions. To illustrate this, we simulate various 3D production choices and visualize their effects. From left to right, the three columns show the left view, right view, and the resulting red–cyan anaglyph (best viewed when zoomed in). Comparing the first and second rows, variations in the stereo camera's baseline $b$ and focal length $f$ lead to different disparities, reflecting differences in the Mastery of Global Depth. The second and third rows show zero-plane ambiguity, where the perceived depth reference shifts from the foreground cow to the distant forest, often due to creative camera setup or post-production adjustment. The fourth row illustrates over-simple depth layering, where disparity remains nearly uniform, likely reflecting cost efficiency rather than artistic intent.
  • Figure 2: Art3D Framework. Our pipeline takes a 2D Left View as input. 1. Geometric Feature Extraction (Frozen): 'DepthNet' extracts geometric features and the inverse depth map, 'geometric canvas ($iz$)'. 2. Artistic Blueprint & Masks (Frozen, Data Construction Stage): During the data construction stage (no training or inference required), 'StereoNet' estimates the target artistic blueprint ($d^L$) and provides a valid pixel mask ($M_{valid}$) via its left-right consistency check. Simultaneously, 'Lang-SAM' analyzes the Left View to generate the local effects mask ($M_{local}$). The final global style mask ($M_{global}$) is then derived from valid regions that are not part of the local effects (i.e., $M_{valid} \cdot (1-M_{local})$). 3. CameraNet (Trainable): This network takes features and $iz$ to synthesize the virtual camera parameters ($vs, vt$), which are pixel-level tensors, and a preliminary virtual right disparity map ($\hat{d}^R$). The virtual left disparity map ($\hat{d}^L$) is then constructed as $iz \times vs + vt$. 4. Dual-Path Supervision: Our core '$\mathcal{L}_{Art}$' loss guides the learning of $vs, vt$ by comparing $\hat{d}^L$ with $d^L$, weighted by $M_{global}$ (for global style) and $M_{local}$ (for local effects). An additional left-right consistency check loss further refines $\hat{d}^R$. 5. Virtual Right View Synthesis: The final $\hat{d}^R$, combined with the original Left View, can be used to generate the Virtual Right View via standard warping techniques. For more details on the loss functions, please refer to the Method section.
  • Figure 3: Qualitative analysis of Sculpting of Local Effects on 2D inputs. Row 1: Input image. Row 2: Ours (w/o $\mathcal{L}_{path}(M_{local})$), which fails to sculpt local pop-out effects. Row 3: Owl3D, which produces partial effects but lacks artistic consistency. Row 4: Art3D (Full Model), which successfully sculpts strong and coherent pop-out effects. Best viewed zoomed in and with red–cyan anaglyph filters.
  • Figure 4: Analysis of Geometric Consistency via DDC-IoU. Each row presents three independent samples. For each sample, the Anaglyph 3D view is shown on top. Below it, the corresponding right disparity map ($d^R$ or $\hat{d}^R$) is displayed alongside the right geometric canvas ($iz_R$) derived from that disparity: for ground truth, $iz_R$ is obtained by feeding the true right image into DepthNet; for our estimate, $iz_R$ is computed from the synthesized right view generated by warping the left image using $\hat{d}^R$. Top Row (Original 3D Films): Visualizes the inconsistent quality of source data. The leftmost sample shows poor structural alignment ($\text{DDC-IoU}=0$). The middle and rightmost samples are acceptable ($\text{DDC-IoU} \geq 0.8$). Bottom Row (Our Art3D Output): Shows our model's synthesized blueprints ($\hat{d}^R$) for the same three scenes shown above. Our outputs consistently achieve high DDC-IoU scores (e.g., 0.85, 0.83, 0.89), demonstrating that artistic style is learned without corrupting the underlying geometry. Visually, the synthesized disparity maps are coherent and structurally sound.
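
The mask construction and disparity composition described in the Figure 2 caption can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the paper's implementation: the caption gives $M_{global} = M_{valid} \cdot (1-M_{local})$ and $\hat{d}^L = iz \times vs + vt$, but it does not spell out the form of $\mathcal{L}_{Art}$. The masked L1 error, the weights `w_global` and `w_local`, and all function names below are hypothetical stand-ins.

```python
import numpy as np


def build_global_mask(m_valid, m_local):
    """M_global = M_valid * (1 - M_local), as stated in the Figure 2 caption."""
    return m_valid * (1.0 - m_local)


def left_disparity(iz, vs, vt):
    """Virtual left disparity d_hat_L = iz * vs + vt, with pixel-level vs and vt."""
    return iz * vs + vt


def masked_l1(pred, target, mask):
    """Mean absolute error over pixels where mask is non-zero (assumed error form)."""
    denom = mask.sum()
    if denom == 0:
        return 0.0
    return float((np.abs(pred - target) * mask).sum() / denom)


def dual_path_loss(pred, target, m_global, m_local, w_global=1.0, w_local=1.0):
    """Hypothetical stand-in for L_Art: one masked term per supervision path."""
    global_term = masked_l1(pred, target, m_global)  # global style path
    local_term = masked_l1(pred, target, m_local)    # local effects path
    return w_global * global_term + w_local * local_term


# Toy 2x2 example with made-up values.
iz = np.array([[0.25, 0.5], [0.75, 1.0]])       # geometric canvas (inverse depth)
vs = np.full_like(iz, 8.0)                      # per-pixel virtual scale
vt = np.full_like(iz, -3.0)                     # per-pixel virtual shift
d_hat_left = left_disparity(iz, vs, vt)         # [[-1, 1], [3, 5]]

m_valid = np.array([[1.0, 1.0], [1.0, 0.0]])    # bottom-right pixel fails the check
m_local = np.array([[1.0, 0.0], [0.0, 0.0]])    # top-left pixel is a local effect
m_global = build_global_mask(m_valid, m_local)  # [[0, 1], [1, 0]]

d_left = np.array([[1.0, 1.0], [4.0, 9.0]])     # target artistic blueprint
loss = dual_path_loss(d_hat_left, d_left, m_global, m_local)
print(loss)
```

In this toy case the invalid bottom-right pixel contributes nothing to either term, even though its error is the largest. That is the purpose of routing supervision through $M_{valid}$.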