ArtiFixer: Enhancing and Extending 3D Reconstruction with Auto-Regressive Diffusion Models

Riccardo de Lutio, Tobias Fischer, Yen-Yu Chang, Yuxuan Zhang, Jay Zhangjie Wu, Xuanchi Ren, Tianchang Shen, Katarina Tothova, Zan Gojcic, Haithem Turki

TL;DR

This work trains a powerful bidirectional generative model with a novel opacity mixing strategy that encourages consistency with existing observations while retaining the model's ability to extrapolate novel content in unseen areas, and distills it into a causal auto-regressive model that generates hundreds of frames in a single pass.

Abstract

Per-scene optimization methods such as 3D Gaussian Splatting provide state-of-the-art novel view synthesis quality but extrapolate poorly to under-observed areas. Methods that leverage generative priors to correct artifacts in these areas hold promise but currently suffer from two shortcomings. The first is scalability, as existing methods use image diffusion models or bidirectional video models that are limited in the number of views they can generate in a single pass (and thus require a costly iterative distillation process for consistency). The second is quality itself, as generators used in prior work tend to produce outputs that are inconsistent with existing scene content and fail entirely in completely unobserved regions. To address both, we propose a two-stage pipeline that leverages two key insights. First, we train a powerful bidirectional generative model with a novel opacity mixing strategy that encourages consistency with existing observations while retaining the model's ability to extrapolate novel content in unseen areas. Second, we distill it into a causal auto-regressive model that generates hundreds of frames in a single pass. This model can directly produce novel views or serve as pseudo-supervision to improve the underlying 3D representation in a simple and highly efficient manner. We evaluate our method extensively and demonstrate that it can generate plausible reconstructions in scenarios where existing approaches fail completely. When measured on commonly benchmarked datasets, we outperform all existing baselines by a wide margin, exceeding prior state-of-the-art methods by 1-3 dB PSNR.

Paper Structure (41 sections, 3 equations, 9 figures, 4 tables, 1 algorithm)

Figures (9)

  • Figure 1: Method overview. We first train a bidirectional flow matching model that transports degraded RGB renderings into clean outputs. We encode the input RGB into latent space and mix it with Gaussian noise using the rendered opacity maps to avoid mode collapse in unseen regions. We inject fine-grained opacity information and camera control along with clean reference views and an optional text prompt. In the second phase of our pipeline, we distill the teacher into an auto-regressive causal model via Self Forcing-style DMD distillation [huang2025selfforcing]. The distilled model can be used directly to render novel views or as pseudo-supervision to distill back into the underlying 3D representation.
  • Figure 2: Transformer block. We start from a pretrained text-to-video model [wan2025] and inject camera and opacity information into each transformer block via linear layers after applying self-attention and layer normalization. We patchify reference views into visual tokens, apply relative camera conditioning via PRoPe [li2025cameras], and add $K_n$ and $V_n$ projections to the cross-attention operation. We zero-initialize $f_c$, $f_o$, and $V_n$ to ensure compatibility with the pretrained initialization.
  • Figure 3: Opacity mixing. Given a degraded rendering, a set of reference views, and an optional text prompt (left), we predict an artifact-free rendering at a target viewpoint. Starting from Gaussian noise and channel-concatenating the degraded rendering, as in prior work [Wu2025GenFusion, yin2025gsfixerimproving3dgaussian], produces renderings that are semantically similar to the reference views but with notable inconsistencies (such as the table in the top row). Starting directly from the degraded rendering instead of Gaussian noise improves consistency but noticeably degrades quality when extrapolating to areas outside those covered by the degraded renderings (bottom row). Instead, we mix Gaussian noise into the rendering based on its opacity map. The resulting input retains the consistency benefits of the original while enabling a strong generative capability in entirely novel regions.
  • Figure 4: Reference views. Without the initial rendering condition, ArtiFixer can generate predictions from the reference views. Although fidelity drops somewhat, the high-level structure of the scene remains intact.
  • Figure 5: Text-to-video generation. To illustrate our model's generative ability, we generate videos from text prompts alone. With opacity mixing, it retains similar quality to its base model [wan2025].
  • ...and 4 more figures
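
The opacity mixing idea described in Figure 3 can be illustrated with a small sketch. The captions do not give the exact mixing formula, so the per-pixel linear blend below is an assumption, as are the function name `opacity_mix`, the `(C, H, W)` latent layout, and the premise that the opacity map has already been resized to the latent resolution. The actual model operates within a flow matching formulation, and its blend may differ.

```python
import numpy as np

def opacity_mix(latent, opacity, rng):
    """Blend an encoded rendering with Gaussian noise using per-pixel opacity.

    latent:  (C, H, W) latent of the degraded rendering (hypothetical layout).
    opacity: (H, W) rendered opacity in [0, 1] at latent resolution.
    """
    alpha = np.clip(opacity, 0.0, 1.0)[None, :, :]  # broadcast over channels
    noise = rng.standard_normal(latent.shape)
    # Assumed linear blend: fully observed pixels keep the rendering,
    # fully unobserved pixels (opacity 0) start from pure noise.
    return alpha * latent + (1.0 - alpha) * noise

rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 8, 8))
opacity = np.zeros((8, 8))
opacity[:, :4] = 1.0  # left half observed, right half unseen
mixed = opacity_mix(latent, opacity, rng)
```

Under this assumed blend, the observed half of `mixed` equals the rendered latent exactly, while the unseen half is pure noise. That matches the caption's intent: consistency where the scene was observed and free generation where it was not.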