DMAT: A Dynamic Mask-Aware Transformer for Human De-occlusion
Guoqiang Liang, Jiahao Hu, Qingyue Wang, Shizhou Zhang
TL;DR
This work tackles human de-occlusion with DMAT, a Dynamic Mask-Aware Transformer that combines expanded local context with global modeling while dynamically focusing on visible human regions. It comprises three parts: an Expanded Convolution Head that extracts richer local features, a transformer body with Dynamic Human-Mask Guided Attention that keeps attention from drifting toward occluders, and a Region Upsampling decoder that preserves boundary quality. The model is trained within a GAN framework with an amodal loss that constrains recovery to the human region. DMAT achieves state-of-the-art results on the AHP dataset, improving both HFID and FID and recovering human appearance well under heavy occlusion. The masking strategy and loss formulation keep de-occlusion focused on the human region, which is relevant to downstream tasks that depend on recovered human appearance, such as person re-identification and intention inference.
Abstract
Human de-occlusion, which aims to infer the appearance of invisible human parts from an occluded image, has great value in many human-related tasks, such as person re-identification and intention inference. To address this task, this paper proposes a dynamic mask-aware transformer (DMAT), which dynamically strengthens information from human regions and suppresses information from occluders. First, to enhance token representation, we design an expanded convolution head with enlarged kernels, which captures more valid local context and mitigates the influence of surrounding occlusion. Second, to concentrate on the visible human parts, we propose a novel dynamic multi-head human-mask guided attention mechanism that integrates multiple masks and prevents the de-occluded regions from being assimilated into the background. Third, a region upsampling strategy is used to alleviate the impact of occlusion on interpolated images. During training, an amodal loss is introduced to further emphasize recovery of the human regions, which also improves the model's convergence. Extensive experiments on the AHP dataset demonstrate that DMAT outperforms recent state-of-the-art methods.
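To make the idea of mask-guided attention concrete, the following is a minimal single-head sketch in NumPy. It is not the paper's implementation: the function name `mask_guided_attention`, the use of one binary mask, and the additive large-negative bias on masked keys are all assumptions chosen for illustration. The dynamic, multi-head, multi-mask design described above is not reproduced here.

```python
import numpy as np

def mask_guided_attention(q, k, v, key_mask):
    """Illustrative single-head attention restricted to visible-human keys.

    q, k, v  : arrays of shape (N, D), one row per image token.
    key_mask : array of shape (N,), 1 for visible-human tokens, 0 otherwise.
    Returns the attended features (N, D) and the attention weights (N, N).
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)

    # Assumed masking rule: keys outside the human mask receive a large
    # negative bias, so they get (numerically) zero attention weight.
    bias = np.where(key_mask.astype(bool), 0.0, -1e9)
    logits = logits + bias[None, :]

    # Numerically stable softmax over the key dimension.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Toy usage: 4 tokens, the last two belong to an occluder or background.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
mask = np.array([1, 1, 0, 0])
out, attn = mask_guided_attention(tokens, tokens, tokens, mask)
```

In this sketch every query, including those at occluded positions, can only aggregate features from tokens marked as visible human, which is one simple way to keep recovered regions from drifting toward occluder or background content.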
