Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment

Hua Ye; Hang Ding; Siyuan Chen; Yiyang Jiang; Changyuan Zhang; Xuan Zhang

TL;DR

This work tackles robust cross-modal alignment by exploiting ambiguous negatives that lie near the decision boundary. It introduces BACL, a boundary-aware curriculum consisting of a learnable Boundary-aware Negative Sampler and a Contrastive Local Attention loss that emphasizes token-level misalignment cues. The approach comes with a fast $\tilde{O}(1/n)$ generalisation rate and achieves state-of-the-art retrieval and fine-grained reasoning results on four large multimodal datasets, without additional labels. Empirically, BACL gains up to +32% R@1 over CLIP and outperforms other baselines, while the theoretical results support improved sample efficiency and margin contraction under a progressive curriculum. Overall, BACL demonstrates that dynamically exploiting ambiguous ("half-true") negatives and local attention signals can significantly strengthen multimodal alignment on noisy, web-scale data.

Abstract

Most multimodal models treat every negative pair alike, ignoring the ambiguous negatives that differ from the positive by only a small detail. We propose Boundary-Aware Curriculum with Local Attention (BACL), a lightweight add-on that turns these borderline cases into a curriculum signal. A Boundary-aware Negative Sampler gradually raises difficulty, while a Contrastive Local Attention loss highlights where the mismatch occurs. The two modules are fully differentiable and work with any off-the-shelf dual encoder. Theory predicts a fast $\tilde{O}(1/n)$ error rate; practice shows up to +32% R@1 over CLIP and new SOTA on four large-scale benchmarks, all without extra labels.
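
To make the idea of "gradually raising difficulty" concrete, the following is a minimal sketch of a curriculum over negatives, not the paper's learnable sampler. It assumes a hand-set schedule: the function names, the `progress` parameter, and the `temperature` value are all hypothetical and chosen only for illustration. Early in training (`progress = 0`) negatives are drawn uniformly; later (`progress = 1`) the sampling weight concentrates on negatives whose similarity score is closest to the positive's, i.e. the borderline cases.

```python
import math
import random


def curriculum_negative_weights(sims, pos_index, progress, temperature=0.1):
    """Sampling weights over negatives for one anchor (illustrative only).

    sims:      similarity scores between the anchor and every candidate.
    pos_index: index of the true match; it always receives weight 0.
    progress:  training progress in [0, 1] (hypothetical schedule knob).
               0 gives uniform negatives, 1 favours borderline negatives.
    """
    if len(sims) < 2:
        raise ValueError("need at least one negative candidate")
    pos = sims[pos_index]
    logits = []
    for j, s in enumerate(sims):
        if j == pos_index:
            logits.append(-math.inf)  # never sample the positive itself
        else:
            # A small gap to the positive score means an ambiguous negative.
            logits.append(-progress * abs(pos - s) / temperature)
    # Numerically stable softmax over the logits.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_negative(sims, pos_index, progress, rng=random):
    """Draw one negative index according to the curriculum weights."""
    weights = curriculum_negative_weights(sims, pos_index, progress)
    return rng.choices(range(len(sims)), weights=weights, k=1)[0]
```

With scores such as `[0.9, 0.85, 0.1, 0.2]` and the positive at index 0, the candidate scoring 0.85 is the borderline negative: it is sampled about as often as the others at `progress = 0` and almost always at `progress = 1`. In BACL the sampler is described as learnable and differentiable, which this fixed softmax schedule does not capture.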
