BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation

Yunhao Ge, Yihe Tang, Jiashu Xu, Cem Gokmen, Chengshu Li, Wensi Ai, Benjamin Jose Martinez, Arman Aydin, Mona Anvari, Ayush K Chakravarthy, Hong-Xing Yu, Josiah Wong, Sanjana Srivastava, Sharon Lee, Shengxin Zha, Laurent Itti, Yunzhu Li, Roberto Martín-Martín, Miao Liu, Pengchuan Zhang, Ruohan Zhang, Li Fei-Fei, Jiajun Wu

TL;DR

The BEHAVIOR Vision Suite introduces a photorealistic, physically plausible synthetic data generator built atop an extended BEHAVIOR-1K asset base and OmniGibson rendering. By offering fine-grained control over scene, object, and camera parameters, BVS enables systematic evaluation of computer vision models under diverse domain shifts and tasks. The authors demonstrate three applications—parametric robustness analysis, holistic multi-task benchmarking, and sim2real transfer for object states and relations—showing that synthetic data can reveal robustness gaps and facilitate real-world transfer. Overall, BVS provides a versatile framework to generate high-quality, customizable datasets that support rigorous CV research and multi-task learning, addressing key limitations of real data and existing synthetic tools.

Abstract

The systematic evaluation and understanding of computer vision models under varying conditions require large amounts of data with comprehensive and customized labels, which real-world vision datasets rarely satisfy. While current synthetic data generators offer a promising alternative, particularly for embodied AI tasks, they often fall short for computer vision tasks due to low asset and rendering quality, limited diversity, and unrealistic physical properties. We introduce the BEHAVIOR Vision Suite (BVS), a set of tools and assets to generate fully customized synthetic data for systematic evaluation of computer vision models, based on the newly developed embodied AI benchmark, BEHAVIOR-1K. BVS supports a large number of adjustable parameters at the scene level (e.g., lighting, object placement), the object level (e.g., joint configuration, attributes such as "filled" and "folded"), and the camera level (e.g., field of view, focal length). Researchers can arbitrarily vary these parameters during data generation to perform controlled experiments. We showcase three example application scenarios: systematically evaluating the robustness of models across different continuous axes of domain shift, evaluating scene understanding models on the same set of images, and training and evaluating simulation-to-real transfer for a novel vision task: unary and binary state prediction. Project website: https://behavior-vision-suite.github.io/
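
The controlled experiments described in the abstract amount to holding every parameter fixed while varying one. A minimal sketch of that idea follows. The `RenderParams` class, its field names, and the `sweep` helper are hypothetical illustrations chosen for this example; they are not part of the BVS or OmniGibson API.

```python
from dataclasses import dataclass, replace
from typing import Iterator


# Hypothetical parameter container: one example field per level named in the
# abstract. These names are illustrative and do not come from BVS.
@dataclass(frozen=True)
class RenderParams:
    light_intensity: float = 1.0      # scene level
    joint_openness: float = 0.0       # object level (0 = closed, 1 = fully open)
    field_of_view_deg: float = 60.0   # camera level


def sweep(base: RenderParams, axis: str, start: float, stop: float,
          steps: int) -> Iterator[RenderParams]:
    """Yield copies of `base` with a single field varied linearly."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    for i in range(steps):
        value = start + (stop - start) * i / (steps - 1)
        # dataclasses.replace copies `base` and overrides only the named field.
        yield replace(base, **{axis: value})


# Five configurations that differ only in how far the joint is opened.
configs = list(sweep(RenderParams(), "joint_openness", 0.0, 1.0, 5))
```

Each configuration would then be handed to whatever renderer produces the images, so any change in model performance can be attributed to the one axis that varied.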

Paper Structure

This paper contains 27 sections, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Overview of BEHAVIOR Vision Suite (BVS), our proposed toolkit for computer vision research. BVS builds upon the extended object assets and scene instances from BEHAVIOR-1K (li2023behavior1k), and provides a customizable data generator that allows users to generate photorealistic, physically plausible labeled data in a controlled manner. We demonstrate BVS with three representative applications.
  • Figure 2: Overview of extended BEHAVIOR-1K assets: Covering a wide range of object categories and scene types, our 3D assets have high visual and physical fidelity and rich annotations of semantic properties, allowing us to generate 1,000+ realistic scene configurations.
  • Figure 3: Parametric evaluation of object detection models on five example video clips. Selected frames from these clips are shown on the left, with the target object highlighted in magenta. Average Precisions (APs) for our baseline models in \ref{sec:App-HSU} are plotted on the right. Since BVS allows for full customization of scene layout and camera viewpoints, we can systematically evaluate model robustness against variations in object articulation, lighting conditions, visibility, zoom (object proximity), and pitch (object pose). As illustrated, current SOTA models exhibit limited robustness to these axes of variation.
  • Figure 4: Mean performance of open-vocab object detection and segmentation models across five axes. The larger a model's colored envelope, the more robust it is. Through BVS, new vision models can be systematically tested for their robustness along these five dimensions and beyond: our users can easily add new axes of domain shift with just a few lines of code.
  • Figure 5: Holistic Scene Understanding Dataset. We generated extensive traversal videos across representative scenes, each with 10+ camera trajectories. For each image, BVS generates various labels (e.g., scene graphs, segmentation masks, depth) as shown on the right.
  • ...and 6 more figures
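
The per-axis summary behind Figure 4 can be thought of as averaging a model's score over the values sampled along each axis. The sketch below assumes a plain arithmetic mean; the paper's exact aggregation is not given in this excerpt. The function name `mean_per_axis` is hypothetical, and the numbers are placeholders for illustration only, not results from the paper.

```python
from statistics import mean
from typing import Dict, List


def mean_per_axis(results: Dict[str, List[float]]) -> Dict[str, float]:
    """Average the scores recorded at each sampled value of every axis."""
    return {axis: mean(values) for axis, values in results.items()}


# Placeholder scores, NOT taken from the paper. Keys are two of the five
# axes named in the Figure 3 caption.
placeholder_scores = {
    "articulation": [0.5, 0.25],
    "lighting": [1.0, 0.5, 0.0],
}

summary = mean_per_axis(placeholder_scores)
```

Adding a new axis of domain shift to such a summary only requires one more key in the dictionary, with the scores measured along that axis.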