Structured Generative Models for Scene Understanding
Christopher K. I. Williams
TL;DR
This paper advocates structured generative models (SGMs) as a principled framework for static scene understanding, emphasizing object-centric representations and scene-level relations that enable coherent, editable 3D reconstructions from images. It surveys object models (things vs. stuff), scene models (autoregressive, energy-based, hierarchical grammars), and inference strategies (MCMC, variational methods, differentiable rendering), highlighting the benefits of compositionality, interpretability, and multi-task transfer. The author discusses the strengths and limitations of SGMs, open-world challenges, and the need for datasets and benchmarks to study 3D scene inference, editing, and completion. The paper concludes with a roadmap for advancing the field, including richer object and scene models, scalable inference, and standardized evaluation toward end-to-end SGM-enabled scene understanding.
Abstract
This position paper argues for the use of \emph{structured generative models} (SGMs) for the understanding of static scenes. This requires the reconstruction of a 3D scene from an input image (or a set of multi-view images), whereby the contents of the image(s) are causally explained in terms of models of instantiated objects, each with its own type, shape, appearance and pose, along with global variables such as scene lighting and camera parameters. This approach also requires scene models that account for the co-occurrences and inter-relationships of objects in a scene. The SGM approach is compositional and generative, which leads to interpretability and editability.

To pursue the SGM agenda, we need models for objects and scenes, and approaches to carry out inference. We first review models for objects, which include ``things'' (object categories that have a well-defined shape) and ``stuff'' (categories that have amorphous spatial extent). We then review \emph{scene models}, which describe the inter-relationships of objects. Perhaps the most challenging problem for SGMs is \emph{inference} of the objects, lighting and camera parameters, and scene inter-relationships from an input of one or more images. We conclude with a discussion of issues that need addressing to advance the SGM agenda.
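To make the object-centric representation described in the abstract more concrete, the following is a minimal Python sketch of the latent variables an SGM might infer: a set of object instances, each with a type, shape, appearance and pose, plus global lighting and camera variables. All class names, field names and parameterisations here (`ObjectInstance`, `Scene`, `move_object`, the latent-code tuples, the single yaw angle) are illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass, replace
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class ObjectInstance:
    """One instantiated object (hypothetical parameterisation)."""
    category: str                  # object type, e.g. "chair"
    shape: Tuple[float, ...]       # latent shape code (assumed form)
    appearance: Tuple[float, ...]  # latent appearance code (assumed form)
    position: Vec3                 # pose: translation in world coordinates
    yaw: float = 0.0               # pose: rotation about the vertical axis


@dataclass(frozen=True)
class Scene:
    """Objects plus global variables (lighting and camera)."""
    objects: Tuple[ObjectInstance, ...]
    light_direction: Vec3 = (0.0, -1.0, 0.0)
    camera_position: Vec3 = (0.0, 0.0, 0.0)
    focal_length: float = 1.0


def move_object(scene: Scene, index: int, delta: Vec3) -> Scene:
    """Return a new scene in which one object has been translated.

    Illustrates editability: only the chosen object's pose changes;
    every other latent variable is left untouched.
    """
    if not 0 <= index < len(scene.objects):
        raise IndexError("object index out of range")
    old = scene.objects[index]
    new_position = tuple(p + d for p, d in zip(old.position, delta))
    moved = replace(old, position=new_position)
    objects = scene.objects[:index] + (moved,) + scene.objects[index + 1:]
    return replace(scene, objects=objects)


# Usage: build a two-object scene and shift the chair along the x-axis.
chair = ObjectInstance("chair", (0.1,), (0.5,), (1.0, 0.0, 2.0))
table = ObjectInstance("table", (0.3,), (0.2,), (0.0, 0.0, 2.0))
scene = Scene(objects=(chair, table))
edited = move_object(scene, 0, (0.5, 0.0, 0.0))
```

Because each object has its own explicit variables, an edit such as moving the chair touches one field of one instance. This is the sense in which a compositional representation is editable. A full SGM would additionally need a renderer mapping such a scene to an image and a prior over object co-occurrences, neither of which is sketched here.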
