Seeing the Unseen: Visual Common Sense for Semantic Placement

Ram Ramrakhya, Aniruddha Kembhavi, Dhruv Batra, Zsolt Kira, Kuo-Hao Zeng, Luca Weihs

TL;DR

The paper tackles Semantic Placement (SP): predicting the regions of an image where an object that is not present could plausibly be placed. It introduces a scalable, inpainting-based data-generation pipeline that uses web images with Detic and SAM to create the LAION-SP dataset, along with a high-quality synthetic finetuning set (HSSD). The CLIP-UNet model, which pairs a frozen CLIP backbone with a language-conditioned decoder, is pretrained on LAION-SP and finetuned on HSSD. It localizes SP regions more accurately than the baselines, is preferred in human evaluations, and enables Embodied Semantic Placement (eSP) in a photorealistic simulator. The work reports generalization to both real and synthetic images and lays groundwork for SP-enabled assistive robots and AR rendering, while acknowledging limitations from foundation-model artifacts and embodiment-related challenges.

Abstract

Computer vision tasks typically involve describing what is present in an image (e.g. classification, detection, segmentation, and captioning). We study a visual common sense task that requires understanding what is not present. Specifically, given an image (e.g. of a living room) and the name of an object ("cushion"), a vision system is asked to predict semantically meaningful regions (masks or bounding boxes) in the image where that object could be placed or is likely to be placed by humans (e.g. on the sofa). We call this task Semantic Placement (SP) and believe that such common-sense visual understanding is critical for assistive robots (tidying a house) and AR devices (automatically rendering an object in the user's space). Studying the invisible is hard. Datasets for image description are typically constructed by curating relevant images and asking humans to annotate the contents of the image; neither of those two steps is straightforward for objects not present in the image. We overcome this challenge by operating in the opposite direction: we start with an image of an object in context from the web, and then remove that object from the image via inpainting. This automated pipeline converts unstructured web data into a dataset comprising pairs of images with/without the object. Using this, we collect a novel dataset, with ${\sim}1.3$M images across $9$ object categories, and train an SP prediction model called CLIP-UNet. CLIP-UNet outperforms existing VLMs and baselines that combine semantic priors with object detectors on real-world and simulated images. In our user studies, we find that the SP masks predicted by CLIP-UNet are favored $43.7\%$ and $31.3\%$ of the time when compared against the $4$ SP baselines on real and simulated images, respectively. In addition, we demonstrate that leveraging SP mask predictions from CLIP-UNet enables downstream applications like building tidying robots in indoor environments.

Paper Structure (25 sections, 14 figures, 7 tables)

Figures (14)

  • Figure 1: Semantic Placement. Consider asking an agent to place cushions in a living room. In (a), the couch on the right is already full of cushions, and a natural human preference would be to place the cushion against the backrest of the armchair. In (b), a natural placement preference would be the center of the couch. We propose the problem of Semantic Placement (SP): given an image and the name of an object, a vision system must predict a semantic mask indicating a valid placement for the object in the image. For both (a) and (b), GPT4V gives meaningful natural language responses but, as we show, struggles to localize regions precisely in pixel space. (c) Our SP predictions enable a Stretch robot Kemp2022StretchRobot from Hello Robot to perform the Embodied Semantic Placement (eSP) task within a photorealistic simulated environment.
  • Figure 2: Automatic Training Dataset Generation Pipeline Utilizing Foundation Models and Web Data. Our pipeline consists of five steps. (A) Query Images: we collect raw images from LAION schuhmann2022laion using sample text queries such as 'living room' shown in the leftmost panel. (B) Find Objects of Interest: we employ Detic zhou2022detecting and SAM kirillov2023segany to identify the segmentation masks of objects of interest. (C) Inpaint Objects of Interest: we use inpainting models to remove the objects of interest from the images. (D) Filter: we discard images where inpainting failed by attempting to detect the inpainted objects. (E) Enhance Image Quality: we leverage Stable Diffusion img2img rombach2021highresolution and SDEdit meng2022sdedit to enhance the quality of the generated images, which is crucial for training our Semantic Placement model.
  • Figure 3: Qualitative Examples of Generated Images. We present three examples, for Cushion, Laptop, and Potted Plants, which include raw images queried from LAION (left), identified objects of interest and their segmentation masks obtained from SAM (middle), and the resulting images after the Inpainting, Filtering, and Quality Enhancement steps (right). For clarity, we have magnified the inpainted regions, highlighted in green dotted boxes.
  • Figure 4: CLIP-UNet for the SP task. Inspired by CLIPort shridhar2021cliport, we first encode the input image $I$ into a feature tensor $f$, and encode the target object category $q$ into an embedding $e$. Further downsampling and tiling ensure that the target embedding matches the dimension of the feature tensors $f^{(\ell)}$ at the first three decoder layers. We then use an element-wise product to combine the target embedding $e^{(\ell)}$ and the feature tensor $f^{(\ell)}$ to achieve semantic conditioning. Similar to LingUNet lingunet, we add skip-connections for these three layers. Finally, CLIP-UNet outputs a mask prediction on the image, indicating the optimal region to place the given target object.
  • Figure 5: IoU vs. IoP. Top left: a hypothetical ground-truth (GT) SP region for objects of type "book". Top right & bottom left: two possible SP predictions. Both predicted regions are high-quality and should be considered true positives. The IoU for these predictions is, however, $<0.5$ because the IoU normalizes by the large GT region. The IoP, in contrast, normalizes only by the predicted mask's size and is thus equal to 1 for both predicted regions.
  • ...and 9 more figures
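
The IoU vs. IoP contrast described in the Figure 5 caption can be illustrated with a small worked example. The sketch below is not the authors' evaluation code: it assumes IoP means intersection over the predicted mask's area (consistent with the caption's statement that IoP normalizes only by the predicted mask's size), and the grid size, rectangle coordinates, and helper names (`rect_mask`, `area`, `intersection_area`, `iou`, `iop`) are all hypothetical choices made for illustration.

```python
from typing import List

Mask = List[List[bool]]  # binary mask as a row-major grid of booleans


def rect_mask(height: int, width: int,
              top: int, left: int, bottom: int, right: int) -> Mask:
    """Mask that is True for rows [top, bottom) and columns [left, right)."""
    return [[top <= r < bottom and left <= c < right for c in range(width)]
            for r in range(height)]


def area(mask: Mask) -> int:
    """Number of True pixels in the mask."""
    return sum(cell for row in mask for cell in row)


def intersection_area(a: Mask, b: Mask) -> int:
    """Number of pixels that are True in both masks."""
    return sum(x and y for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))


def iou(pred: Mask, gt: Mask) -> float:
    """Intersection over union of prediction and ground truth."""
    inter = intersection_area(pred, gt)
    union = area(pred) + area(gt) - inter
    return inter / union if union else 0.0


def iop(pred: Mask, gt: Mask) -> float:
    """Intersection over the predicted mask's area (assumed IoP definition)."""
    pred_area = area(pred)
    return intersection_area(pred, gt) / pred_area if pred_area else 0.0


# Hypothetical 10x10 image: a large GT region and two small predictions inside it.
gt = rect_mask(10, 10, top=2, left=0, bottom=8, right=10)     # 60 pixels
pred_a = rect_mask(10, 10, top=3, left=1, bottom=6, right=4)  # 9 pixels
pred_b = rect_mask(10, 10, top=4, left=6, bottom=7, right=9)  # 9 pixels

for name, pred in (("pred_a", pred_a), ("pred_b", pred_b)):
    print(f"{name}: IoU={iou(pred, gt):.2f}  IoP={iop(pred, gt):.2f}")
```

In this toy setup both predictions lie entirely inside the ground-truth region, so the IoU is 9/60 = 0.15 (below the 0.5 threshold mentioned in the caption) while the IoP is 1, mirroring the situation the figure describes.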