Reconstructing Animals and the Wild

Peter Kulits, Michael J. Black, Silvia Zuffi

TL;DR

The paper addresses the reconstruction of natural outdoor scenes, containing both animals and their habitats, from a single image. It introduces RAW, a framework in which an autoregressive LLM decodes a CLIP embedding of the image into structured graphics code representing both the animals and the wild. The model is trained exclusively on a synthetic dataset of one million images built with Infinigen. A key innovation is representing assets by continuous CLIP embeddings rather than discrete names, which scales to a large asset library and yields better semantic alignment; the representation also includes scene-level parameters and 9-parameter rotations. The approach generalizes to real-world images and produces editable, animatable 3D reconstructions, paving the way for computational ethology and richer analyses of animal behavior in context.
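To make the "9-parameter rotations" mentioned above concrete, the following is a minimal sketch of one common way to turn nine unconstrained numbers into a valid rotation matrix: reshape them into a 3x3 matrix and project it onto the nearest orthogonal matrix. Whether RAW uses this particular projection is an assumption; this page does not say. The function names (`cofactor3`, `det3`, `nearest_rotation`) are hypothetical, and the projection is computed with a Newton iteration for the polar factor rather than an SVD so that the example needs only the standard library.

```python
import math


def cofactor3(m):
    """Cofactor matrix of a 3x3 matrix (signs included via cyclic indexing)."""
    return [
        [
            m[(i + 1) % 3][(j + 1) % 3] * m[(i + 2) % 3][(j + 2) % 3]
            - m[(i + 1) % 3][(j + 2) % 3] * m[(i + 2) % 3][(j + 1) % 3]
            for j in range(3)
        ]
        for i in range(3)
    ]


def det3(m):
    """Determinant by cofactor expansion along the first row."""
    c = cofactor3(m)
    return sum(m[0][j] * c[0][j] for j in range(3))


def nearest_rotation(params, iters=50, tol=1e-13):
    """Project 9 raw parameters onto a rotation matrix (assumed scheme).

    Uses the Newton iteration X <- (X + X^{-T}) / 2, which converges to the
    orthogonal polar factor of a nonsingular matrix. Note X^{-T} = cof(X) / det(X).
    """
    if len(params) != 9:
        raise ValueError("expected exactly 9 parameters")
    x = [list(params[3 * i:3 * i + 3]) for i in range(3)]
    if det3(x) <= 0:
        # A negative determinant would converge to a reflection; this sketch
        # simply rejects that case instead of handling it.
        raise ValueError("matrix must have positive determinant")
    for _ in range(iters):
        c = cofactor3(x)
        d = sum(x[0][j] * c[0][j] for j in range(3))
        nxt = [[0.5 * (x[i][j] + c[i][j] / d) for j in range(3)] for i in range(3)]
        delta = max(abs(nxt[i][j] - x[i][j]) for i in range(3) for j in range(3))
        x = nxt
        if delta < tol:
            break
    return x


# Usage: a rotation about the z-axis, perturbed as a network output might be.
c, s = math.cos(0.3), math.sin(0.3)
noisy = [c + 0.05, -s, 0.02, s, c - 0.03, 0.0, 0.01, 0.0, 1.1]
R = nearest_rotation(noisy)
```

A 9-parameter output of this kind is continuous and has no angle wrap-around, which is the usual motivation for preferring it over Euler angles when regressing rotations.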

Abstract

The idea of 3D reconstruction as scene understanding is foundational in computer vision. Reconstructing 3D scenes from 2D visual observations requires strong priors to disambiguate structure. Much work has been focused on the anthropocentric, which, characterized by smooth surfaces, coherent normals, and regular edges, allows for the integration of strong geometric inductive biases. Here, we consider a more challenging problem where such assumptions do not hold: the reconstruction of natural scenes containing trees, bushes, boulders, and animals. While numerous works have attempted to tackle the problem of reconstructing animals in the wild, they have focused solely on the animal, neglecting environmental context. This limits their usefulness for analysis tasks, as animals exist inherently within the 3D world, and information is lost when environmental factors are disregarded. We propose a method to reconstruct natural scenes from single images. We base our approach on recent advances leveraging the strong world priors ingrained in Large Language Models and train an autoregressive model to decode a CLIP embedding into a structured compositional scene representation, encompassing both animals and the wild (RAW). To enable this, we propose a synthetic dataset comprising one million images and thousands of assets. Our approach, having been trained solely on synthetic data, generalizes to the task of reconstructing animals and their environments in real-world images. We will release our dataset and code to encourage future research at https://raw.is.tue.mpg.de/

Paper Structure

This paper contains 18 sections, 1 equation, 6 figures, 2 tables.

Figures (6)

  • Figure 1: We train an LLM to decode a frozen CLIP embedding of a natural image into a structured compositional scene representation encompassing both animals and their habitats.
  • Figure 2: Dataset Samples. Training samples from our synthesized dataset. See the Supp. Mat. for additional visualizations.
  • Figure 3: CLIP Head. Rather than teaching the LLM to generate asset names as discrete tokens without a semantically meaningful distance metric, we train the LLM to produce a special token to signal when the LLM hidden state should be projected into a continuous CLIP embedding.
  • Figure 4: Ablation Visualization. We observe that, while both the discrete-name IG-LLM baseline and the CLIP-estimation variant well-capture the layout of the in-distribution testing scenes, the discrete variant makes non-interpretable asset-selection errors. Rather than consistently matching a tiger with another estimated tiger asset, the model confuses it with a bush. Similarly, a bird is mistaken for a boulder. Rather than the errors being semantically meaningful misinterpretations, the discrete supervision leads to mistakes that do not make sense. In contrast, the CLIP-estimation variant consistently identifies objects with aligned interpretations.
  • Figure 5: Additional Reconstructions. Additional real-world-generalization samples. Note how we can reconstruct scenes where the animal is very far or very close to the camera, with severe occlusion, and in different lighting conditions.
  • ...and 1 more figure
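
The captions for Figures 3 and 4 contrast discrete asset names with a continuous CLIP embedding. The sketch below illustrates why a continuous target can make errors more interpretable, assuming that a predicted embedding is matched to an asset library by cosine similarity. The retrieval rule, the function names (`cosine_similarity`, `retrieve_asset`), and the toy 3-dimensional vectors are all illustrative assumptions; real CLIP embeddings have hundreds of dimensions, and none of these values come from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


def retrieve_asset(predicted, asset_bank):
    """Return the name of the asset whose embedding is closest to `predicted`."""
    return max(asset_bank, key=lambda name: cosine_similarity(predicted, asset_bank[name]))


# Hypothetical asset library: toy stand-ins for per-asset CLIP embeddings.
asset_bank = {
    "tiger": [0.9, 0.1, 0.0],
    "leopard": [0.8, 0.3, 0.1],
    "bush": [0.0, 0.2, 0.9],
    "boulder": [0.1, 0.9, 0.2],
}

# A slightly-off prediction for a big cat lands on a nearby big-cat asset,
# not on an unrelated one such as a bush.
predicted = [0.85, 0.25, 0.05]
best = retrieve_asset(predicted, asset_bank)
```

With discrete name tokens there is no comparable notion of "nearby", so a wrong token can be arbitrarily far from the right one, which is consistent with the tiger-to-bush confusion described in the Figure 4 caption.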