Believing is Seeing: Unobserved Object Detection using Generative Models

Subhransu S. Bhattacharjee, Dylan Campbell, Rahul Shome

TL;DR

This work introduces unobserved object detection: locating objects that are not visible in a camera frame, because they are occluded or lie outside it, by modeling spatio-semantic distributions over extended 2D and 3D domains. It develops three pipelines (3D diffusion with forward models, 2D diffusion with outpainting, and vision-language model querying) that estimate these distributions conditioned on a single RGB image, and proposes a standardized metric suite to evaluate them. On indoor scenes from RealEstate10k and NYU Depth V2, the 3D diffusion approach excels at occluded and out-of-frame detection, while 2D diffusion and VLMs show both strengths and limitations in region-wise reasoning and speed. The results underscore the potential of generative priors for perception under partial observability, and highlight practical bottlenecks such as compute time and dependence on prompts or pretraining data.

Abstract

Can objects that are not visible in an image, but are in the vicinity of the camera, be detected? This study introduces the novel tasks of 2D, 2.5D and 3D unobserved object detection for predicting the location of nearby objects that are occluded or lie outside the image frame. We adapt several state-of-the-art pre-trained generative models to address this task, including 2D and 3D diffusion models and vision-language models, and show that they can be used to infer the presence of objects that are not directly observed. To benchmark this task, we propose a suite of metrics that capture different aspects of performance. Our empirical evaluation on indoor scenes from the RealEstate10k and NYU Depth V2 datasets demonstrates results that motivate the use of generative models for the unobserved object detection task.

Paper Structure

This paper contains 49 sections, 4 equations, 13 figures, 17 tables.

Figures (13)

  • Figure 1: Unobserved object detection aims to infer the location of objects that were not directly observed in an image. Consider this toy example of a dining table and chairs. Here we visualize a top-down view (slice) of the predicted discrete distributions $\mathcal{D}_{\mathcal{I}o}^{\text{2D}}$ and $\mathcal{D}_{\mathcal{I}o}^{\text{3D}}$ for the object label $o$ of "chair," conditioned on the image $\mathcal{I}$, where darker is more probable. The presence of an occluded chair (A) is predicted as relatively likely, as is the presence of an out-of-frame chair (B). Crucially, the domain $\mathbb{V}$ of the predicted 3D distribution exceeds the camera frustum (drawn in black), and the domain $\mathbb{I}$ of the 2D distribution extends beyond the image plane $\mathcal{I}$.
  • Figure 2: The 3D diffusion-based pipeline.
  • Figure 3: The 2D diffusion-based pipeline.
  • Figure 4: The VLM-based pipeline.
  • Figure 5: Qualitative results. Each row shows the predicted 2D and top-down 3D spatial distributions generated by each method for various object categories: TV (first row), refrigerator (second row), sink (third row), laptop (fourth row), and sink (fifth row). Notably, in the bottom row, the DFM-based model infers the likely presence of a sink, occluded by the refrigerator, albeit not with a high likelihood. A white triangle marks the camera position, while dashed and dot-dashed lines depict the camera frustums for $\mathcal{I}$ and $\mathbb{I}$. The white star indicates the ground-truth position of the object, when visible in 2D. Heatmap colors indicate object likelihood, with warmer tones representing higher probabilities. Since these are spatially-normalized distributions, we use a log-scale for visualization.
  • ...and 8 more figures
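
The captions for Figures 1 and 5 describe discrete, spatially-normalized distributions that are shown on a log scale. As a rough illustration of those two steps only, the sketch below normalizes a small grid of non-negative scores so that it sums to one and then log-transforms it for display. This is not the authors' implementation: the function names, the 3x4 grid, its values, and the `eps` floor are all hypothetical choices made for the example.

```python
import math


def normalize_spatial(scores):
    """Scale a 2D grid of non-negative scores so that all cells sum to 1."""
    total = sum(sum(row) for row in scores)
    if total <= 0:
        raise ValueError("scores must contain at least one positive value")
    return [[value / total for value in row] for row in scores]


def log_scale(dist, eps=1e-12):
    """Log-transform a distribution for display.

    eps is an assumed floor (not from the paper) that avoids log(0)
    for cells with zero probability.
    """
    return [[math.log(max(value, eps)) for value in row] for row in dist]


# Hypothetical 3x4 top-down grid of per-cell scores for one object label.
scores = [
    [0.0, 1.0, 2.0, 0.0],
    [0.0, 4.0, 8.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]

dist = normalize_spatial(scores)   # cells now sum to 1
log_dist = log_scale(dist)         # values suitable for a log-scale heatmap
```

After normalization, most of the probability mass sits in a few cells, so the remaining cells are nearly indistinguishable on a linear color scale. Taking the log spreads those small values out, which is the usual reason for visualizing a normalized spatial distribution this way.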