DMAT: A Dynamic Mask-Aware Transformer for Human De-occlusion

Guoqiang Liang; Jiahao Hu; Qingyue Wang; Shizhou Zhang

TL;DR

This work tackles human de-occlusion by introducing DMAT, a Dynamic Mask-Aware Transformer that fuses expanded local context with global modeling while dynamically focusing on visible human regions. It comprises an Expanded Convolution Head for richer local features, a Dynamic Human-Mask Guided Attention transformer body to prevent attention drift toward occluders, and a Region Upsampling decoder to preserve boundary quality, all trained within a GAN framework using an amodal loss to constrain recovery to the human region. The approach achieves state-of-the-art results on the AHP dataset, with clear improvements in HFID and FID metrics and strong qualitative recovery of human appearance under heavy occlusion. The proposed masking strategy and loss formulation offer robust, region-focused de-occlusion capabilities with practical implications for related vision tasks requiring human appearance recovery.
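The idea of constraining recovery to the human region can be illustrated with a mask-weighted reconstruction error. The sketch below is an assumption for illustration only: this summary does not give the amodal loss formula, so the function name `masked_l1`, the single-channel 2-D list inputs, and the choice of an L1 error are all hypothetical.

```python
def masked_l1(pred, target, amodal_mask, eps=1e-8):
    """Mean absolute error over pixels where amodal_mask is 1.

    pred, target, amodal_mask: 2-D lists of equal shape (hypothetical
    single-channel images). The mask is 1 inside the full human region,
    visible or occluded, and 0 elsewhere.
    """
    total = 0.0
    count = 0.0
    for p_row, t_row, m_row in zip(pred, target, amodal_mask):
        for p, t, m in zip(p_row, t_row, m_row):
            total += m * abs(p - t)  # errors outside the mask contribute 0
            count += m
    # eps guards against division by zero when the mask is empty
    return total / (count + eps)
```

Pixels outside the mask do not affect the value, so a large error on the background leaves this term unchanged.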

Abstract

Human de-occlusion, which aims to infer the appearance of invisible human parts from an occluded image, is valuable for many human-related tasks, such as person re-identification and intention inference. To address this task, this paper proposes a dynamic mask-aware transformer (DMAT), which dynamically strengthens information from human regions and weakens information from occluders. First, to enhance token representation, we design an expanded convolution head with enlarged kernels, which captures more valid local context and mitigates the influence of surrounding occlusion. Second, to concentrate on the visible human parts, we propose a novel dynamic multi-head human-mask guided attention mechanism that integrates multiple masks and prevents the de-occluded regions from being assimilated into the background. Third, a region upsampling strategy is used to alleviate the impact of occlusion on interpolated images. During training, an amodal loss further emphasizes recovery of the human regions and also improves the model's convergence. Extensive experiments on the AHP dataset demonstrate that DMAT outperforms recent state-of-the-art methods.
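To make the notion of mask-guided attention concrete, the following is a minimal sketch of one way a human mask could steer attention weights toward visible human tokens. It is not the paper's DHMGA: the abstract gives no formula, so the additive `penalty` on non-human tokens, the single-query setting, and the function name are assumptions. The mask update strategy named in the outline is not modeled here.

```python
import math


def mask_guided_attention(query, keys, values, human_mask, penalty=4.0):
    """Single-query attention with logits lowered for non-human tokens.

    human_mask[i] is 1 if token i lies on visible human pixels, else 0.
    `penalty` is a hypothetical fixed constant used only for illustration.
    """
    scale = 1.0 / math.sqrt(len(query))
    logits = []
    for key, m in zip(keys, human_mask):
        score = sum(q * k for q, k in zip(query, key)) * scale
        logits.append(score - penalty * (1 - m))  # down-weight occluder tokens
    # Numerically stable softmax over the biased logits
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights
```

With two identical keys, the token marked as human receives the larger weight, while an all-ones mask reduces this to ordinary scaled dot-product attention.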

Paper Structure (30 sections, 8 equations, 6 figures, 4 tables)

Introduction
Related Work
Image Amodal Completion
Human De-occlusion.
Image Inpainting
CNN-based Inpainting.
Transformer-based Inpainting.
Method
Overview
Expanded Convolution Head
Dynamic Human-Mask Guided Attention
Updating Strategy for Masks.
Region Upsampling Decoder
Amodal Loss
Adversarial Loss.
...and 15 more sections

Figures (6)

Figure 1: The proposed dynamic mask-aware transformer (DMAT) for human de-occlusion, which consists of an Expanded Convolution Head, a Transformer Body and a Region Upsampling Decoder. The visible mask and the amodal mask are delivered to the mask resizing module to guide the Transformer Body and the Region Upsampling Decoder.
Figure 2: Illustration of self-attention in shifted window partitioning. 'A, B, C, D' represent different regions.
Figure 3: Token representation. (a) Patch to tokens. (b) Restrictive Receptive Field (RF) feature to tokens. (c) Our Expanded RF feature to tokens. Our tokens have a large RF and use a stacked $(\times3)$ CNN embedding. These tokens, with richer local features, promote global context modelling in the Swin Transformer body.
Figure 4: Structure of a single transformer stage. "TB" refers to a transformer block, whose core module is the proposed DHMGA, which aggregates tokens from different regions with different weights. See text for more details.
Figure 5: Comparison of visual examples. DMAT generates more reasonable appearances and human structures. Please zoom in to see details.
...and 1 more figures
