TOSS: High-quality Text-guided Novel View Synthesis from a Single Image

Yukai Shi; Jianan Wang; He Cao; Boshi Tang; Xianbiao Qi; Tianyu Yang; Yukun Huang; Shilong Liu; Lei Zhang; Heung-Yeung Shum

TL;DR

This work tackles the under-constrained problem of novel view synthesis (NVS) from a single image by introducing TOSS, a text-guided diffusion framework that uses semantic text to constrain the NVS solution space. The approach fuses text, the input view, and relative camera pose through a dense cross-attention architecture built on Stable Diffusion, and it uses dedicated expert denoisers to sharpen pose accuracy and detail preservation. Empirical results show that TOSS yields more plausible, controllable, and multiview-consistent novel-view generations and improved 3D reconstruction compared with prior methods such as Zero-1-to-3, with further gains achievable by leveraging stronger text-to-image models and textual inversion. The method offers a flexible, orthogonal augmentation to foundation diffusion models, enabling better NVS and downstream 3D applications while remaining compatible with future model advances.

Abstract

In this paper, we present TOSS, which introduces text to the task of novel view synthesis (NVS) from just a single RGB image. While Zero-1-to-3 has demonstrated impressive zero-shot open-set NVS capability, it treats NVS as a pure image-to-image translation problem. This approach suffers from the severely under-constrained nature of single-view NVS: the process lacks a means of explicit user control and often produces implausible novel views. To address this limitation, TOSS uses text as high-level semantic information to constrain the NVS solution space. TOSS fine-tunes the text-to-image Stable Diffusion model, pre-trained on large-scale text-image pairs, and introduces modules specifically tailored to image and camera pose conditioning, as well as dedicated training for pose correctness and preservation of fine details. Comprehensive experiments show that TOSS outperforms Zero-1-to-3, producing more plausible, controllable, and multiview-consistent NVS results. We further support these results with comprehensive ablations that underscore the effectiveness and potential of the introduced semantic guidance and architecture design.
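
To make "camera pose conditioning" concrete, the following is a minimal sketch of how a relative camera pose between the input view and the target view might be encoded. It assumes a Zero-1-to-3-style spherical parameterization (polar angle, azimuth, radius); the abstract does not state which encoding TOSS uses, and the function name `relative_pose` is hypothetical.

```python
import math


def relative_pose(src, tgt):
    """Encode the pose of a target camera relative to a source camera.

    Assumption (not from the paper's abstract): each camera is given as
    (polar, azimuth, radius) with angles in radians, and the relative pose
    is the 4-vector (d_polar, sin(d_azimuth), cos(d_azimuth), d_radius).
    """
    d_polar = tgt[0] - src[0]
    # Wrap the azimuth difference into [0, 2*pi) so 350 deg -> 10 deg is +20 deg.
    d_azimuth = (tgt[1] - src[1]) % (2 * math.pi)
    d_radius = tgt[2] - src[2]
    # sin/cos keep the azimuth encoding continuous across the wrap-around.
    return (d_polar, math.sin(d_azimuth), math.cos(d_azimuth), d_radius)


# Example: rotate a quarter turn around the object and raise the camera.
pose = relative_pose((math.pi / 2, 0.0, 1.5), (math.pi / 3, math.pi / 2, 1.5))
```

In a conditioning pipeline, a vector like this would typically be embedded and supplied to the denoiser together with the text and input-view features; how TOSS does so is described in the Method section of the paper, not here.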

Paper Structure (32 sections, 3 equations, 16 figures, 4 tables)

Introduction
Related Work
Method
Preliminary
TOSS: Text-guided Novel View Synthesis
Model Formulation
Adapting a Text-to-Image Model for NVS
Enabling Text-to-Image Models to Condition on Image
Enabling Text-to-Image Models to Condition on Relative Camera Pose
More Accurate Pose via Expert Denoisers
Experiments
Experimental Settings
Novel View Synthesis from a Single Image
3D Reconstruction
Ablation Study
...and 17 more sections

Figures (16)

Figure 1: TOSS significantly boosts novel view plausibility with additional textual guidance (a) and grants users better controllability over concealed parts of an object (b). We demonstrate higher novel view generation quality and multiview consistency with look-around views (c) and random views (d).
Figure 2: The pipeline of TOSS (Left) and our conditioning mechanisms (Right).
Figure 3: Comparing previous image conditioning mechanisms (a-b) and TOSS (c).
Figure 4: Qualitative comparison of single-view NVS on GSO (Left) and RTMV (Right).
Figure 5: Qualitative comparison of 3D reconstruction on GSO and RTMV.
...and 11 more figures
