Zoom Better to See Clearer: Human and Object Parsing with Hierarchical Auto-Zoom Net

Fangting Xia, Peng Wang, Liang-Chieh Chen, Alan L. Yuille

TL;DR

The paper tackles the challenge of parsing articulated objects in natural images amid large scale and pose variations. It introduces the Hierarchical Auto-Zoom Net (HAZN), a two-stage, scale-adaptive framework consisting of object-scale and part-scale Auto-Zoom Nets that predict ROIs, zoom regions to canonical sizes, and iteratively refine per-pixel part scores through shared FCNs. Empirical results on the PASCAL-Person-Part and Horse-Cow datasets show that HAZN outperforms strong baselines, particularly for small parts and across varying object sizes, with ablations confirming the importance of both object- and part-scale AZNs. The approach offers a practical, computationally efficient way to handle scale variability by focusing processing on relevant image regions rather than the entire image at multiple fixed scales, and suggests extensions to finer-grained parts and related tasks.
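The "predict ROIs, zoom regions to canonical sizes, refine part scores" loop can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `resize_nearest` and `zoom_and_refine`, the `canonical` target size, the nearest-neighbour resizing, and the choice to overwrite (rather than fuse) the scores inside the ROI are all assumptions made for illustration, and `score_fn` stands in for an FCN.

```python
import numpy as np

def resize_nearest(arr, out_h, out_w):
    """Nearest-neighbour resize of an (H, W) or (H, W, C) array."""
    in_h, in_w = arr.shape[:2]
    rows = np.minimum((np.arange(out_h) * in_h / out_h).astype(int), in_h - 1)
    cols = np.minimum((np.arange(out_w) * in_w / out_w).astype(int), in_w - 1)
    return arr[rows[:, None], cols[None, :]]

def zoom_and_refine(image, scores, roi, score_fn, canonical=64):
    """Hypothetical single auto-zoom step.

    roi      : (top, left, bottom, right), bottom/right exclusive.
    score_fn : stand-in for an FCN; maps an (h, w, 3) crop to
               (h, w, num_classes) part scores.
    canonical: assumed target length of the ROI's longer side after zooming.
    """
    top, left, bottom, right = roi
    h, w = bottom - top, right - left

    # Zoom: rescale the crop so its longer side has the canonical length.
    factor = canonical / max(h, w)
    zoom_h = max(1, round(h * factor))
    zoom_w = max(1, round(w * factor))
    zoomed = resize_nearest(image[top:bottom, left:right], zoom_h, zoom_w)

    # Re-estimate part scores at the zoomed scale.
    refined = score_fn(zoomed)

    # Map the refined scores back to the original resolution. Overwriting the
    # ROI is an assumption; a real system may fuse old and new scores instead.
    out = scores.copy()
    out[top:bottom, left:right] = resize_nearest(refined, h, w)
    return out
```

Applying such a step once per predicted object ROI, and then once per predicted part ROI, mirrors the two-stage structure described above: small instances are enlarged before scoring, so only the relevant regions are rescaled rather than the whole image.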

Abstract

Parsing articulated objects, e.g. humans and animals, into semantic parts (e.g. body, head, and arms) from natural images is a challenging and fundamental problem in computer vision. A major difficulty is the large variability in scale and location of objects and their corresponding parts. Even small mistakes in estimating scale and location will degrade the parsing output and cause errors in boundary details. To tackle these difficulties, we propose a "Hierarchical Auto-Zoom Net" (HAZN) for object part parsing which adapts to the local scales of objects and parts. HAZN is a sequence of two "Auto-Zoom Nets" (AZNs), each employing fully convolutional networks that perform two tasks: (1) predict the locations and scales of object instances (the first AZN) or their parts (the second AZN); (2) estimate the part scores for predicted object instance or part regions. Our model can adaptively "zoom" (resize) predicted image regions into their proper scales to refine the parsing. We conduct extensive experiments over the PASCAL part datasets on humans, horses, and cows. For humans, our approach significantly outperforms the state of the art by 5% mIOU and is especially better at segmenting small instances and small parts. We obtain similar improvements for parsing cows and horses over alternative methods. In summary, our strategy of first zooming into objects and then zooming into parts is very effective. It also enables us to process different regions of the image at different scales adaptively so that, for example, we do not need to waste computational resources scaling the entire image.
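The reported gain is stated in mIOU, i.e. mean intersection-over-union across part classes. The sketch below shows how such a score can be computed for one pair of label maps. The function name `mean_iou` and the rule of skipping classes absent from both maps are assumptions; benchmark protocols usually accumulate intersections and unions over the whole test set rather than per image, so this is illustrative only.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union between two integer label maps.

    Classes that appear in neither map are skipped (an assumed convention).
    """
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        gt_c = gt == c
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # class absent from both prediction and ground truth
        intersection = np.logical_and(pred_c, gt_c).sum()
        ious.append(intersection / union)
    # No class present at all: the score is undefined.
    return float(np.mean(ious)) if ious else float("nan")
```

For example, with `pred = [[0, 0], [1, 1]]` and `gt = [[0, 1], [1, 1]]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.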

Paper Structure

This paper contains 29 sections, 3 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Intuition of our Hierarchical Auto-Zoom model (HAZN). (a) The scale and location of an object and its parts (the red dashed boxes) can be estimated from the observed field of view (the black solid box) of a neural network. (b) Part parsing can be more accurate when proper object and part scales are used. In the top row, we show our estimated object and part scales. In the bottom row, our part parsing results gradually improve as the estimated object and part scales are increasingly utilized.
  • Figure 2: Testing framework of Hierarchical Auto-Zoom Net (HAZN). We address object part parsing in a wild scene, adapting to the size of objects (object-scale AZN) and parts (part-scale AZN). The part scores are predicted and refined by three FCNs, over three levels of granularity, i.e. image-level, object-level, and part-level. At each level, the FCN outputs the part score map for the current level and estimates the locations and scales for the next level. The details of parts are gradually discovered and improved along the proposed auto-zoom process (i.e. location/scale estimation, region zooming, and part score re-estimation).
  • Figure 3: Object-scale Auto-Zoom model from a probabilistic view, which predicts the ROI $N(k)$ at object scale, and then refines part scores based on the properly zoomed region $N(k)$. Details are in the subsection on the Auto-Zoom Net (AZN).
  • Figure 4: Ground truth regression target for training the scale estimation network (SEN) in the image-level FCN. Details are in the subsection on training and testing.
  • Figure 5: Qualitative comparison on the PASCAL-Person-Part dataset. We compare with DeepLab-LargeFOV-CRF (Chen et al., 2014) and HAZN (no part scale). Our proposed HAZN models (the 3rd and 4th columns) attain better visual parsing results, especially for small-scale human instances and small parts such as legs and arms.
  • ...and 3 more figures