Extreme Amodal Face Detection

Changlin Song, Yunzhong Hou, Michael Randall Barnes, Rahul Shome, Dylan Campbell

TL;DR

This work introduces extreme amodal detection for faces, defining the challenge of localizing objects that lie outside or are truncated by the image frame in a single image. It proposes a heatmap-based extreme amodal detector with a novel selective coarse-to-fine transformer decoder to efficiently infer unseen regions, avoiding costly generative pipelines. To support evaluation, the EXAFace dataset (derived from COCO) provides structured cases for inside, truncated, and outside faces with and without direct evidence. Empirical results show strong performance and notable efficiency advantages over generative baselines, with ablations clarifying the contributions of the multi-scale, token-selective design and highlighting practical limitations and societal considerations.

Abstract

Extreme amodal detection is the task of inferring the 2D location of objects that are not fully visible in the input image but are visible within an expanded field-of-view. This differs from amodal detection, where the object is partially visible within the input image, but is occluded. In this paper, we consider the sub-problem of face detection, since this class provides motivating applications involving safety and privacy, but do not tailor our method specifically to this class. Existing approaches either rely on image sequences, so that missing detections can be interpolated from surrounding frames, or use generative models to sample possible completions. In contrast, we consider the single-image task and propose a more efficient, sample-free approach that uses contextual cues from the image to infer the presence of unseen faces. We design a heatmap-based extreme amodal object detector that addresses the problem of efficiently predicting a lot (the out-of-frame region) from a little (the image) with a selective coarse-to-fine decoder. Our method establishes strong results for this new task, even outperforming less efficient generative approaches. Code, data, and models are available at https://charliesong1999.github.io/exaft_web/.

Paper Structure

This paper contains 25 sections, 5 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Overview of our extreme amodal detector. (a) Flowchart of our approach. Given an input image, a feature map is extracted, from which a dedicated in-image detection head infers object boxes and a face probability heatmap. Separately, a transformer encoder-decoder shares information from the image to the extended area around the image. We propose an efficient selective coarse-to-fine decoder that starts with low-resolution out-of-image positional encodings as the input tokens, then refines a selected subset of these tokens at higher resolutions. A second detection head uses these tokens to infer the out-of-image object boxes and heatmap. (b) Illustration of our selective coarse-to-fine mechanism. We first query the low-resolution regions, then use a scoring network to rank these regions and select the top-$\mu$% to be refined at a higher resolution, repeating until the tokens reach the resolution of the input image feature map.
  • Figure 2: Qualitative results. The final row shows samples from the ground-truth conditional distributions. Our model effectively leverages contextual cues—such as nearby people (example 1), objects like a skateboard (example 2), or partial body evidence (example 4)—to infer completely unseen faces. In example 1, the model correctly extends predictions to the left, where a partial person is visible, but not to the right, demonstrating awareness of scene context and typical human height. Example 3 highlights the model's generalization to real-world scenarios. Unlike other examples where inputs are synthetically cropped from complete images, this example is naturally truncated (i.e., the faces were never captured in the original photo). Our model successfully generates plausible faces despite the lack of ground truth, demonstrating its practical utility for real-world photo expansion. Compared to our model, Pix2Gestalt struggles without large visible body parts, while the outpainting pipeline can infer outside faces but yields noisier and less consistent results.
  • Figure 3: Sensitivity analysis of the percentage of retained tokens $\mu$ at scale $\mathcal{S} = (2)$. The metrics are relatively insensitive to $\mu$, so we select $\mu=25\%$, which is computationally efficient without sacrificing performance. The original data is shown in the appendix (\ref{tab:ana_topu2}).
  • Figure 4: Analysis of multi-scale settings. We evaluate three scales $s=1,2,4$ and their combinations $\mathcal{S} = (4,2)$, $(2,1)$, $(4,2,1)$. The results show that $\mathcal{S}=(2,1)$ yields the highest AP$_\text{t}$ and AR$_\text{o}$, and is therefore adopted as our default setting. The original data is shown in the appendix (\ref{tab:ana_multiscale}).
  • Figure 5: Failure cases. Our model struggles to predict outside faces when contextual cues are weak. In the first and second examples, strong appearance evidence is present but location cues are limited. In the third and fourth examples, no appearance evidence is available, making the presence and location of an outside face ambiguous—even for human observers.
  • ...and 2 more figures
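
The selective coarse-to-fine mechanism described in the Figure 1(b) caption can be illustrated with a minimal sketch. This is not the authors' implementation: the real decoder operates on transformer tokens and ranks regions with a learned scoring network, whereas here `score_fn` is a hypothetical stand-in, cells are plain `(row, col, scale)` tuples on a toy grid, and unselected cells are simply dropped (how the actual model handles them is not stated in this section). All function names are assumptions introduced for illustration.

```python
import math


def select_top_mu(scores, mu):
    """Return indices of the top-mu fraction of entries, highest score first."""
    if not 0.0 < mu <= 1.0:
        raise ValueError("mu must be in (0, 1]")
    k = max(1, math.ceil(mu * len(scores)))  # always keep at least one entry
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def split_cell(cell):
    """Split a (row, col, scale) cell into its four children at half the scale."""
    row, col, scale = cell
    half = scale // 2
    return [(row + dr, col + dc, half) for dr in (0, half) for dc in (0, half)]


def coarse_to_fine(cells, score_fn, mu, final_scale=1):
    """Keep the top-mu cells and refine them, until final_scale is reached.

    Assumes all cells share the same power-of-two scale, with coordinates
    measured in units of the finest grid. score_fn is a hypothetical
    placeholder for a learned scoring network.
    """
    while cells and cells[0][2] > final_scale:
        scores = [score_fn(c) for c in cells]
        kept = [cells[i] for i in select_top_mu(scores, mu)]
        cells = [child for c in kept for child in split_cell(c)]
    return cells


if __name__ == "__main__":
    # Toy 4x4 fine grid covered by four coarse cells at scale 2.
    coarse = [(r, c, 2) for r in (0, 2) for c in (0, 2)]
    # Hypothetical score: prefer cells near the bottom-right corner.
    score = lambda cell: -(abs(cell[0] - 2) + abs(cell[1] - 2))
    print(coarse_to_fine(coarse, score, mu=0.25))
```

With `mu=0.25` and scales going from 2 to 1, only one of the four coarse cells is refined, so the fine-resolution stage processes 4 cells instead of 16. This is the kind of saving the selective design targets when the out-of-image area is much larger than the image.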