SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining

Chull Hwan Song; Taebaek Hwang; Jooyoung Yoon; Shunghyun Choi; Yeong Hyeon Gu

TL;DR

This work tackles the image-text misalignment problem in fashion vision-language pretraining by introducing SyncMask, which uses cross-attention from a momentum model to generate synchronized masks that focus on co-occurring image patches and text tokens. It also refines grouped batch sampling with semi-hard negatives to reduce false negatives in Image-Text Contrastive (ITC) learning and Image-Text Matching (ITM), addressing fashion-domain data scarcity and distribution bias. The method integrates synchronized masking into the Masked Language Modeling (MLM) and Masked Image Modeling (MIM) losses and combines them with the ITC and ITM objectives, achieving state-of-the-art results on FashionGen and FashionIQ downstream tasks, including cross-modal retrieval and text-guided image retrieval (TGIR). Overall, SyncMask advances fine-grained cross-modal understanding in fashion vision-language models and improves retrieval and recognition in data-scarce domains.

Abstract

Vision-language models (VLMs) have made significant strides in cross-modal understanding through large-scale paired datasets. However, in the fashion domain, datasets often exhibit a disparity between the information conveyed in the image and in the text. This issue stems from datasets containing multiple images of a single fashion item, all paired with one text, leading to cases where some textual details are not visible in individual images. This mismatch, particularly when non-co-occurring elements are masked, undermines the training of conventional VLM objectives such as Masked Language Modeling and Masked Image Modeling, thereby hindering the model's ability to accurately align fine-grained visual and textual features. To address this problem, we propose Synchronized attentional Masking (SyncMask), which generates masks that pinpoint the image patches and word tokens where the information co-occurs in both image and text. This synchronization is accomplished by harnessing cross-attentional features obtained from a momentum model, ensuring a precise alignment between the two modalities. Additionally, we enhance grouped batch sampling with semi-hard negatives, effectively mitigating false negative issues in the Image-Text Matching and Image-Text Contrastive learning objectives within fashion datasets. Our experiments demonstrate the effectiveness of the proposed approach, which outperforms existing methods in three downstream tasks.
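
To make the idea of synchronized masking concrete, the following is a minimal sketch, not the authors' implementation. It assumes the teacher (momentum) model exposes a single cross-attention map from text tokens to image patches, scores each token and each patch by its strongest cross-modal link, and masks the top-scoring ones. The function name `synchronized_masks`, the max-based scoring rule, and the masking ratios are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def synchronized_masks(
    cross_attn: List[List[float]],
    text_ratio: float = 0.15,   # assumed ratio, not taken from the paper
    image_ratio: float = 0.25,  # assumed ratio, not taken from the paper
) -> Tuple[List[int], List[int]]:
    """Pick token and patch indices to mask from one cross-attention map.

    cross_attn[t][p] is a hypothetical teacher (momentum model) attention
    weight from text token t to image patch p.
    """
    num_tokens = len(cross_attn)
    num_patches = len(cross_attn[0])

    # Score each token by its strongest link to any patch, and each patch
    # by its strongest link to any token (an assumed scoring rule).
    token_scores = [max(row) for row in cross_attn]
    patch_scores = [
        max(cross_attn[t][p] for t in range(num_tokens))
        for p in range(num_patches)
    ]

    # Mask at least one element per modality.
    k_tokens = max(1, round(text_ratio * num_tokens))
    k_patches = max(1, round(image_ratio * num_patches))

    top_tokens = sorted(
        range(num_tokens), key=lambda t: token_scores[t], reverse=True
    )[:k_tokens]
    top_patches = sorted(
        range(num_patches), key=lambda p: patch_scores[p], reverse=True
    )[:k_patches]
    return sorted(top_tokens), sorted(top_patches)


# Toy attention map: 3 text tokens, 4 image patches.
attn = [
    [0.05, 0.05, 0.80, 0.10],  # token 0: strongly grounded in patch 2
    [0.25, 0.25, 0.25, 0.25],  # token 1: diffuse attention, weak grounding
    [0.10, 0.60, 0.20, 0.10],  # token 2: grounded in patch 1
]
masked_tokens, masked_patches = synchronized_masks(
    attn, text_ratio=0.34, image_ratio=0.5
)
print(masked_tokens, masked_patches)  # prints [0] [1, 2]
```

In this toy example the token with diffuse attention (token 1) is never selected, which mirrors the motivation above: a word that is not visibly grounded in the paired image is a poor masking target for MLM or MIM.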

Paper Structure (33 sections, 21 equations, 7 figures, 6 tables)

This paper contains 33 sections, 21 equations, 7 figures, 6 tables.

Introduction
Related Works
Vision and Language (VL) Model
FashionVL Model
Attention-guided Masked Modeling
Methods
Preliminaries
Image-Text Contrastive Learning (ITC)
Image-Text Matching (ITM)
Synchronized Attentional Masked Modeling
Vision-Language Synchronized Attentional Masking
Synchronized Attentional Masked Language Modeling
Synchronized Attentional Masked Image Modeling
Grouped Batch with Semi-hard Negatives
Experiments
...and 18 more sections

Figures (7)

Figure 1: Example of misaligned masks in the MLM task.
Figure 2: Overview of masking strategies using a teacher-student distillation framework. 1) Uni-modal models: (a) random masking, (b) teacher-guided attentional masking. 2) Multi-modal models: (c) random text masking, (d) random image/text masking, (e) teacher-guided cross-attentional masking (Ours).
Figure 3: A schematic overview of the SyncMask process: cross-attention features from the teacher (momentum) model are used to generate informative masks for both the MIM and MLM tasks. Note that the input for MLM consists of an unmasked image paired with masked text.
Figure 4: Selection phase of the SyncMask
Figure 5: The top-10 TGIR results of the SyncMask model on the FashionIQ dataset. On the left, the reference images paired with their guided descriptions are shown, while the right side presents the model's predicted images ranked by descending scores. Ground truth images are distinctly outlined with a green bounding box. It is worth mentioning that the set of predictions includes other images that also qualify as suitable matches.
...and 2 more figures
