Waterfall Transformer for Multi-person Pose Estimation
Navin Ranjan, Bruno Artacho, Andreas Savakis
TL;DR
The paper addresses accurate multi-person 2D pose estimation with WTPose, a single-pass, end-to-end framework built on a modified Swin Transformer backbone. Its Waterfall Transformer Module (WTM) fuses features from multiple backbone stages and expands the receptive field through a cascade of dilated attention, capturing both local and global context before the network produces joint heatmaps. On COCO, WTPose outperforms comparable transformer-based pose methods, and ablations show additional gains from Stem/ResNet bottlenecks and specific dilation configurations. The results demonstrate the value of multi-scale, waterfall-style attention in vision transformers while keeping parameter use efficient.
Abstract
We propose the Waterfall Transformer architecture for Pose estimation (WTPose), a single-pass, end-to-end trainable framework designed for multi-person pose estimation. Our framework leverages a transformer-based waterfall module that generates multi-scale feature maps from various backbone stages. The module performs filtering in a cascade architecture to expand the receptive fields and to capture local and global context, thereby increasing the overall feature representation capability of the network. Our experiments on the COCO dataset demonstrate that the proposed WTPose architecture, with a modified Swin backbone and transformer-based waterfall module, outperforms other transformer architectures for multi-person pose estimation.
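To illustrate why filtering in a cascade expands the receptive field, the sketch below compares stacking stride-1 dilated operators in sequence against applying the same operators as independent parallel branches. It is a simplified illustration, not the WTPose implementation: the dilation rates `[1, 2, 4, 8]`, the window size of 3, and both function names are assumptions chosen for the example, and each stage is modelled as a generic local operator with a dilated neighbourhood rather than the paper's attention blocks.

```python
from typing import List


def cascade_receptive_fields(dilations: List[int], kernel_size: int = 3) -> List[int]:
    """Receptive field (per axis) after each stage of a stride-1 cascade.

    Each stage sees the output of the previous one, so its span of
    (kernel_size - 1) * dilation positions is added to the running total.
    """
    fields = []
    rf = 1  # a single input position before any filtering
    for d in dilations:
        rf += (kernel_size - 1) * d
        fields.append(rf)
    return fields


def parallel_receptive_fields(dilations: List[int], kernel_size: int = 3) -> List[int]:
    """Receptive field (per axis) of independent branches that all read the input."""
    return [1 + (kernel_size - 1) * d for d in dilations]


if __name__ == "__main__":
    rates = [1, 2, 4, 8]  # hypothetical dilation rates, not taken from the paper
    print("cascade :", cascade_receptive_fields(rates))
    print("parallel:", parallel_receptive_fields(rates))
```

Under these assumptions, every cascade stage after the first covers a wider input region than the parallel branch with the same dilation. Tapping the output of each cascade stage therefore yields a set of feature maps with progressively larger context, which is the multi-scale behaviour the abstract attributes to the waterfall module.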
