TAO-Amodal: A Benchmark for Tracking Any Object Amodally

Cheng-Yen Hsieh, Kaihua Chen, Achal Dave, Tarasha Khurana, Deva Ramanan

TL;DR

The paper addresses the lack of large-scale amodal perception benchmarks for tracking under occlusion by introducing TAO-Amodal, a real-world, high-diversity dataset with amodal and modal bounding boxes across 833 categories. It proposes an Amodal Expander plug-in to adapt existing modal trackers to produce amodal predictions and a Paste-and-Occlude data augmentation method to simulate occlusions. Through extensive benchmarks, the authors show that standard modal trackers falter under heavy and out-of-frame occlusion, while fine-tuning with the expander and PnO yields meaningful gains, establishing a practical path toward robust amodal tracking. This work provides a foundation for amodal perception in real-world, large-vocabulary settings and offers concrete guidance for improving occlusion handling in tracking systems.

Abstract

Amodal perception, the ability to comprehend complete object structures from partial visibility, is a fundamental skill, even for infants. Its significance extends to applications like autonomous driving, where a clear understanding of heavily occluded objects is essential. However, modern detection and tracking algorithms often overlook this critical capability, perhaps due to the prevalence of modal annotations in most benchmarks. To address the scarcity of amodal benchmarks, we introduce TAO-Amodal, featuring 833 diverse categories in thousands of video sequences. Our dataset includes amodal and modal bounding boxes for visible and partially or fully occluded objects, including those that are partially out of the camera frame. We investigate the current lay of the land in both amodal tracking and detection by benchmarking state-of-the-art modal trackers and amodal segmentation methods. We find that existing methods, even when adapted for amodal tracking, struggle to detect and track objects under heavy occlusion. To mitigate this, we explore simple finetuning schemes that can increase the amodal tracking and detection metrics of occluded objects by 2.1% and 3.3%.
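
To make the out-of-frame case concrete, the following minimal sketch measures how much of an amodal box lies inside the camera frame. The helper names (`area`, `clip_to_frame`, `in_frame_fraction`) and the example box are hypothetical and chosen only for illustration; this is plain box geometry, not the dataset's own visibility annotation.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def clip_to_frame(box: Box, width: float, height: float) -> Box:
    """Clip a box to the image rectangle [0, width] x [0, height]."""
    x1, y1, x2, y2 = box
    return (max(0.0, x1), max(0.0, y1), min(width, x2), min(height, y2))


def in_frame_fraction(amodal: Box, width: float, height: float) -> float:
    """Fraction of the amodal box area that falls inside the frame."""
    total = area(amodal)
    if total == 0.0:
        return 0.0
    return area(clip_to_frame(amodal, width, height)) / total


# Hypothetical object whose left half lies outside a 640x480 frame.
amodal_box = (-50.0, 100.0, 50.0, 300.0)
fraction = in_frame_fraction(amodal_box, 640.0, 480.0)  # 0.5
```

An amodal box may have negative coordinates or extend past the image size, which is why the clipping step is kept separate from the annotation itself.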

Paper Structure

This paper contains 31 sections, 1 equation, 9 figures, and 14 tables.

Figures (9)

  • Figure 1: TAO-Amodal. We present TAO-Amodal, a dataset of amodal (bounding box) annotations for fully occluded and partially occluded (both within the image frame and out-of-frame) objects in videos from the TAO dataset (Dave et al., 2020). Our dataset consists of 332k boxes that cover multiple occlusion scenarios across 2,907 videos with annotations for 833 object categories. TAO-Amodal aims to assess the occlusion reasoning capabilities of current trackers for amodal tracking of any object.
  • Figure 2: Traditional modal perception (top) vs. amodal perception (bottom). Given a sequence of images, traditional detection and tracking algorithms concentrate on identifying the visible segments of multiple objects within the scene. Consequently, they produce peculiar outputs under occlusion, such as vanishing bounding boxes or tiny box sizes. Amodal perception advances beyond conventional approaches by inferring complete object boundaries, thereby predicting bounding boxes that extend to the full object extent, even when certain portions are occluded.
  • Figure 3: Class distribution. We present instance counts for the 8 most frequent categories and for other categories, using a logarithmic scale.
  • Figure 4: Object occlusion distribution. We plot the distribution in visibility bins of 10% width.
  • Figure 5: ROI Head (Girshick, 2015) with Amodal Expander. The Amodal Expander serves as a plug-in fine-tuning scheme to "amodalize" existing detectors or trackers with limited (amodal) training data. It takes as input region proposal features and modal box predictions (often represented as a residual delta with respect to the region proposal) and generates amodal box outputs (again represented as residual deltas). We freeze all modules except the Amodal Expander during fine-tuning.
  • ...and 4 more figures
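
The data flow described in the Figure 5 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `ToyAmodalExpander` is a hypothetical single linear layer, its zero initialisation (so that an untrained expander copies the modal deltas) is an assumption, and `apply_deltas` uses the standard R-CNN (dx, dy, dw, dh) parameterisation, which the caption does not spell out.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def apply_deltas(proposal: Box, deltas: Sequence[float]) -> Box:
    """Decode (dx, dy, dw, dh) deltas relative to a region proposal."""
    x1, y1, x2, y2 = proposal
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    dx, dy, dw, dh = deltas
    new_cx, new_cy = cx + dx * w, cy + dy * h
    new_w, new_h = w * math.exp(dw), h * math.exp(dh)
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)


class ToyAmodalExpander:
    """Hypothetical stand-in: one linear layer over [features, modal deltas]."""

    def __init__(self, feat_dim: int) -> None:
        in_dim = feat_dim + 4
        # Assumption: zero weights, so the untrained expander copies modal deltas.
        self.weight: List[List[float]] = [[0.0] * in_dim for _ in range(4)]
        self.bias: List[float] = [0.0] * 4

    def __call__(self, features: Sequence[float],
                 modal_deltas: Sequence[float]) -> List[float]:
        x = list(features) + list(modal_deltas)
        correction = [sum(w * v for w, v in zip(row, x)) + b
                      for row, b in zip(self.weight, self.bias)]
        # Amodal deltas = modal deltas plus a correction (learned in practice).
        return [m + c for m, c in zip(modal_deltas, correction)]


proposal = (10.0, 20.0, 50.0, 60.0)
expander = ToyAmodalExpander(feat_dim=3)
# Pretend fine-tuning learned to double the box height for this input.
expander.bias[3] = math.log(2.0)
amodal_deltas = expander([0.1, 0.2, 0.3], [0.0, 0.0, 0.0, 0.0])
amodal_box = apply_deltas(proposal, amodal_deltas)  # approx. (10, 0, 50, 80)
```

In a real fine-tuning setup only the expander's parameters would receive gradient updates, matching the caption's statement that all other modules stay frozen.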