
Inverse Neural Rendering for Explainable Multi-Object Tracking

Julian Ost, Tanushree Banerjee, Mario Bijelic, Felix Heide

TL;DR

The paper tackles monocular 3D multi-object tracking by reframing it as an inverse rendering problem, optimizing over latent object representations within a differentiable rendering pipeline to fit 2D observations while recovering 3D pose, shape, and appearance. It uses a GET3D-based prior with disentangled shape and texture latents $z_S$ and $z_T$, and minimizes the loss $\mathcal{L}_{IR} = \mathcal{L}_{RGB} + \lambda \mathcal{L}_{perceptual}$, where $\mathcal{L}_{RGB} = \| (I_c - \hat{I}_c) \circ \hat{M}_{I_c} \|_2$ and $\mathcal{L}_{perceptual}$ is LPIPS-based. Tracking across frames is achieved by initializing from 2D detections, refining per-object latents and 3D state with a Kalman-filter-based prediction, and solving data association via the Hungarian algorithm, enabling robust generalization to unseen datasets (nuScenes and Waymo) without fine-tuning. The approach offers interpretability by exposing the recovered latent parameters and rendered objects, providing a rich, explainable 3D scene representation suitable for downstream planning and analysis.
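
The loss above can be made concrete with a short NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the array layout (`H x W x 3` images, `H x W` mask), and the default weight `lam` are hypothetical. The perceptual term is passed in as a callable, because a real LPIPS term needs a pretrained network; the stand-in shown here is explicitly not LPIPS.

```python
import numpy as np


def masked_rgb_loss(observed, rendered, mask):
    """L_RGB = || (I_c - I_hat_c) o M ||_2 over all masked pixels and channels."""
    # Broadcast the H x W mask over the colour channels of the H x W x 3 images.
    residual = (observed - rendered) * mask[..., None]
    # With no `ord`, np.linalg.norm returns the 2-norm of the flattened array.
    return float(np.linalg.norm(residual))


def stand_in_perceptual(observed, rendered, mask):
    """Placeholder only (NOT LPIPS): mean absolute masked difference."""
    return float(np.mean(np.abs(observed - rendered) * mask[..., None]))


def inverse_rendering_loss(observed, rendered, mask, perceptual_fn, lam=1.0):
    """L_IR = L_RGB + lam * L_perceptual; `lam` is an assumed default."""
    rgb_term = masked_rgb_loss(observed, rendered, mask)
    return rgb_term + lam * perceptual_fn(observed, rendered, mask)
```

In this sketch, pixels outside the rendered object's mask contribute nothing to the RGB term, so the background cannot pull the object's latents or pose.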

Abstract

Today, most methods for image understanding tasks rely on feed-forward neural networks. While this approach has allowed for empirical accuracy, efficiency, and task adaptation via fine-tuning, it also comes with fundamental disadvantages. Existing networks often struggle to generalize across different datasets, even on the same task. By design, these networks ultimately reason about high-dimensional scene features, which are challenging to analyze. This is true especially when attempting to predict 3D information based on 2D images. We propose to recast 3D multi-object tracking from RGB cameras as an Inverse Rendering (IR) problem: we optimize via a differentiable rendering pipeline over the latent space of pre-trained 3D object representations and retrieve the latents that best represent object instances in a given input image. To this end, we optimize an image loss over generative latent spaces that inherently disentangle shape and appearance properties. Beyond offering an alternate take on tracking, our method also enables examining the generated objects, reasoning about failure situations, and resolving ambiguous cases. We validate the generalization and scaling capabilities of our method by learning the generative prior exclusively from synthetic data and assessing camera-based 3D tracking on the nuScenes and Waymo datasets. Both datasets are completely unseen by our method, which requires no fine-tuning on them. Videos and code are available at https://light.princeton.edu/inverse-rendering-tracking/.
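
The core idea of optimizing a latent code through a frozen, differentiable generator can be illustrated with a toy example. Everything here is a deliberate simplification and an assumption for illustration: a fixed random linear map `W` stands in for the pre-trained generator and renderer, the loss is a plain squared error, and the names `render` and `recover_latent` are hypothetical. The paper's actual pipeline uses a learned 3D generative prior and also optimizes pose and size.

```python
import numpy as np


def render(W, z):
    """Toy 'renderer': a frozen linear map from latent code to a flat image."""
    return W @ z


def recover_latent(W, observed, steps=500):
    """Gradient descent on 0.5 * ||render(W, z) - observed||^2 with W frozen."""
    z = np.zeros(W.shape[1])
    # Step size 1 / L, where L = sigma_max(W)^2 is the gradient's Lipschitz constant.
    step = 1.0 / np.linalg.norm(W, 2) ** 2
    for _ in range(steps):
        residual = render(W, z) - observed
        z -= step * (W.T @ residual)  # only the latent is updated
    return z


rng = np.random.default_rng(0)
W = rng.normal(size=(64, 4))       # frozen stand-in generator
z_true = rng.normal(size=4)        # latent that produced the observation
z_hat = recover_latent(W, render(W, z_true))
```

Only `z` changes during the loop, mirroring the setting in which the generative prior stays fixed and the per-object parameters are fitted to the observation at test time.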

Paper Structure (13 sections, 15 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Inverse Rendering for Monocular Multi-Object Tracking. For each detection, we initialize the embedding codes of an object generator $\mathbf{z}_S$ for shape and $\mathbf{z}_T$ for texture. The generative object prior model is frozen and only embedding codes, pose, and size of each object instance are optimized through inverse rendering to best fit the image observation. Inverse-rendered texture and shape embeddings and refined object locations are provided to the matching stage to match predicted states of tracked objects of the past and new observations. Matched and new tracklets are updated, and unmatched tracklets are ultimately discarded before predicting states in the next step.
  • Figure 2: Tracking via Inverse Neural Rendering on nuScenes (Caesar et al., 2020). From left to right, we show (i) observed images from diverse scenes at timestep $k=0$; (ii) an overlay of the optimized generated object and its 3D bounding boxes at timesteps $k = 0, 1, 2,$ and $3$. The color of the bounding boxes for each object corresponds to the predicted tracklet ID. We see that even in such diverse scenarios, our method does not lose any tracks and performs robustly across all scenarios, although the dataset is unseen.
  • Figure 3: Without changing the model or training on the dataset, our proposed method can generalize well to the Waymo Open Driving Dataset (Sun et al., 2020). Similar to Figure 2, from left to right, we show (i) observed images from diverse scenes from the dataset at timestep $k=0$; (ii) an overlay of the closest generated object and predicted 3D bounding boxes at timesteps $k = 0, 1, 2,$ and $3$. The color of the bounding boxes for each object corresponds to the predicted tracklet ID. Our method does not lose any tracks even on a different unseen dataset in diverse scenes, validating that the approach generalizes.
  • Figure 4: Optimization Process. From left to right, we show (i) the observed image; (ii) the rendering predicted by the initial latent embeddings; (iii) the rendered objects after the texture code is optimized; (iv) the rendered objects after the translation, scale, and rotation are optimized; and (v) the rendered objects after the shape latent code is optimized. The ground truth images are faded to show our rendered objects clearly. Our method is capable of refining the predicted texture, pose, and shape over several optimization steps, even if initialized with poses or appearance far from the target, with all of these corrected through inverse rendering.
  • Figure 5: Layout Generation Through Inverse Rendering. From left to right, we show (i) observed image from a single camera for two scenes, (ii) test-time optimized inverse rendered (IR) objects of class "car", and (iii) Bird's Eye View (BEV) layout of the scene. In the BEV layout, black boxes represent ground truth, and the colored boxes represent predicted BEV boxes. The bottom shows a zoomed-in region at a 60 m distance (see BEV layout). Even in this setting, our method recovers the coarse appearance, shape of the objects, pose, and size,