Aligning Text, Images, and 3D Structure Token-by-Token

Aadarsh Sahoo, Vansh Tibrewal, Georgia Gkioxari

TL;DR

Kyvo is a decoder-only transformer that unifies text, images, and a structured 3D scene modality, in which a scene is tokenized as a list of objects, each described by its shape, type, location, pose, and size. Object shapes are compressed by a Trellis-based 3D VQ-VAE into 512 tokens per object and merged with image and text tokens into a single autoregressive vocabulary. This enables tasks such as 3D reconstruction from a single image, rendering images from structured 3D scenes, real-world 3D object recognition, instruction-following scene editing, and question answering. The authors provide a data- and sequence-design cookbook showing that a hybrid number encoding, image-before-3D input ordering, and center-token reordering with a weighted loss on the first tokens are crucial for robust generation. They report strong performance across CLEVR, ObjaWorld, Objectron, and ARKitScenes, including single-image 3D reconstruction and competitive real-world recognition. The main limitations are scarce cross-domain 3D data and limited generalization across domains, which points to mixed-domain training as future work to broaden Kyvo’s applicability while keeping its object-centric, end-to-end autoregressive design.

Abstract

Creating machines capable of understanding the world in 3D is essential for assisting designers who build and edit 3D environments and robots that navigate and interact within three-dimensional space. Inspired by advances in language and image modeling, we investigate the potential of autoregressive models for a new modality: structured 3D scenes. To this end, we propose a unified LLM framework that aligns language, images, and 3D scenes, and we provide a detailed "cookbook" outlining critical design choices for achieving optimal training and performance, addressing key questions related to data representation, modality-specific objectives, and more. We show how to tokenize complex 3D objects to incorporate them into our structured 3D scene modality. We evaluate performance across four core 3D tasks (rendering, recognition, instruction-following, and question-answering) and four 3D datasets, synthetic and real-world. We show our model's effectiveness on reconstructing complete 3D scenes consisting of complex objects from a single image and on real-world 3D object recognition tasks. Project webpage: https://glab-caltech.github.io/kyvo/
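To make the idea of a structured 3D scene modality concrete, the following is a minimal sketch of how an object list might be flattened into a single token sequence. It is only an illustration under stated assumptions: the `SceneObject` class, the `serialize_scene` function, the tag names such as `<scene>` and `<object>`, and the two-decimal number formatting are all hypothetical and are not taken from the paper or its code. Pose and the learned 3D shape tokens are omitted for brevity.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SceneObject:
    # Hypothetical fields mirroring a subset of the attributes named in the summary.
    shape: str
    size: str
    location: Tuple[float, float, float]


def serialize_scene(objects: List[SceneObject]) -> List[str]:
    """Flatten an object list into one token sequence (illustrative tag names)."""
    tokens = ["<scene>"]
    for obj in objects:
        tokens.append("<object>")
        tokens += ["<shape>", obj.shape, "<size>", obj.size, "<location>"]
        # Each coordinate becomes its own token, formatted to two decimals.
        tokens += [f"{c:.2f}" for c in obj.location]
        tokens.append("</object>")
    tokens.append("</scene>")
    return tokens


scene = [
    SceneObject("cube", "large", (-0.55, 0.05, 0.70)),
    SceneObject("sphere", "small", (1.25, 2.50, 0.35)),
]
tokens = serialize_scene(scene)
print(" ".join(tokens))
```

A sequence like this could, in principle, be concatenated with image and text tokens so that one autoregressive model reads or writes all three modalities; how Kyvo actually orders and encodes these tokens is specified in the paper, not here.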

Paper Structure

This paper contains 33 sections, 3 equations, 27 figures, 9 tables.

Figures (27)

  • Figure 1: Kyvo: a decoder-only transformer aligns a structured 3D modality with language and vision. This 3D modality represents scenes as lists of objects, each defined by its 3D shape, type, 3D position, pose and size parameters. Kyvo unifies the token space of images, text, and 3D to enable a variety of complex visual 3D tasks.
  • Figure 2: 3D task examples with Kyvo's unified autoregressive framework using a structured 3D modality. (1) 3D shape and scene reconstruction: From a single input image, Kyvo reconstructs individual objects with accurate geometry and spatial relationships. (2) 3D object recognition: Given an input image, Kyvo identifies objects and predicts their 3D positions in real-world scenes. (3) Shape and scene rendering: Kyvo generates semantically consistent images from structured 3D scene inputs. (4) Instruction-following: Given an image, a 3D scene, and a text instruction, Kyvo produces coherent modifications to both the image and the 3D representation.
  • Figure 3: 3D VQ-VAE training involves the standard VQ-VAE losses (including reconstruction loss) applied in latent space as well as an auxiliary reconstruction loss applied in decoded pixel space.
  • Figure 4: 3D Tokenization findings and comparisons. (a) Effect of auxiliary reconstruction loss. An auxiliary pixel-space reconstruction loss on decoded renders from multiple views of the 3D object leads to much better reconstructions. (b) 3D tokenizer comparison. Reconstructions from our Trellis-based VQ-VAE exceed the quality of SAR3D reconstructions with fewer tokens; improved textures stem from the Trellis slat representation rather than triplanes. (c) Learned 3D shape encodings are effective during decoding. The 3D tokens used in Kyvo are sufficient for both reconstruction and rendering using Llama 3.2 as decoder.
  • Figure 5: Effect of Granularity. A $0.05$ granularity more accurately captures object locations and shapes.
  • ...and 22 more figures
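
The granularity mentioned in the Figure 5 caption can be read as the step size used when continuous coordinates are snapped to a discrete grid before tokenization. The sketch below illustrates that reading only; the `quantize` helper and the sample coordinates are hypothetical, and the coarser value of 0.5 is chosen purely for contrast rather than taken from the paper.

```python
def quantize(value: float, granularity: float) -> float:
    """Snap a coordinate to the nearest multiple of `granularity`."""
    steps = round(value / granularity)
    # Round the product to suppress floating-point noise such as 0.15000000000000002.
    return round(steps * granularity, 6)


coords = [0.12, -1.37, 2.49]  # made-up object coordinates
for g in (0.5, 0.05):
    snapped = [quantize(c, g) for c in coords]
    worst = max(abs(c - s) for c, s in zip(coords, snapped))
    print(f"granularity={g}: {snapped} (max error {worst:.3f})")
```

Apart from floating-point rounding, the snapping error is at most half the granularity, so a finer grid localizes objects more precisely at the cost of a larger set of distinct number tokens.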