Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment

Hua Ye, Hang Ding, Siyuan Chen, Yiyang Jiang, Changyuan Zhang, Xuan Zhang

TL;DR

This work tackles robust cross-modal alignment by exploiting ambiguous negatives that lie near the decision boundary. It introduces BACL, a boundary-aware curriculum consisting of a learnable Boundary-aware Negative Sampler and a Contrastive Local Attention loss that emphasizes token-level misalignment cues. The approach yields a fast $\tilde{O}(1/n)$ generalisation rate and achieves state-of-the-art retrieval and fine-grained reasoning across four large multimodal datasets, without additional labels. Empirically, BACL delivers substantial gains over CLIP and other baselines, while theoretical results validate improved sample efficiency and margin contraction under a progressive curriculum. Overall, BACL demonstrates that dynamically exploiting half-true negatives and local attention signals can significantly strengthen multimodal alignment in noisy, web-scale data.

Abstract

Most multimodal models treat every negative pair alike, ignoring the ambiguous negatives that differ from the positive by only a small detail. We propose Boundary-Aware Curriculum with Local Attention (BACL), a lightweight add-on that turns these borderline cases into a curriculum signal. A Boundary-aware Negative Sampler gradually raises difficulty, while a Contrastive Local Attention loss highlights where the mismatch occurs. The two modules are fully differentiable and work with any off-the-shelf dual encoder. Theory predicts a fast O(1/n) error rate; practice shows up to +32% R@1 over CLIP and new SOTA on four large-scale benchmarks, all without extra labels.
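
The idea of a sampler that "gradually raises difficulty" can be illustrated with a small sketch. This is not the paper's Boundary-aware Negative Sampler, which is learnable; it swaps in a fixed temperature schedule. The function name `sample_boundary_negatives`, the `progress` parameter, and the schedule constants are all hypothetical, and similarities are assumed to be cosine scores in [-1, 1].

```python
import numpy as np

def sample_boundary_negatives(sim, pos_idx, progress, k=4, rng=None):
    """Draw k negatives per anchor, favouring candidates scored close to the positive.

    sim      : (B, N) similarity of each anchor to N candidates (assumed in [-1, 1])
    pos_idx  : (B,) index of the positive candidate for each anchor
    progress : float in [0, 1], curriculum stage (hypothetical fixed schedule)
    """
    rng = np.random.default_rng() if rng is None else rng
    B, N = sim.shape
    rows = np.arange(B)
    # Distance between each candidate's score and the positive's score.
    gap = np.abs(sim[rows, pos_idx][:, None] - sim)
    # Temperature shrinks with progress: early draws are spread out,
    # late draws concentrate on candidates near the positive's score.
    temperature = 1.0 - 0.95 * progress
    logits = -gap / temperature
    logits[rows, pos_idx] = -np.inf  # never sample the positive as a negative
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return np.stack(
        [rng.choice(N, size=k, replace=False, p=probs[i]) for i in range(B)]
    )

# Usage: the same batch sampled at the start and at the end of the curriculum.
rng = np.random.default_rng(0)
sim = rng.uniform(-1, 1, size=(8, 32))
pos = rng.integers(0, 32, size=8)
easy = sample_boundary_negatives(sim, pos, progress=0.0, rng=rng)
hard = sample_boundary_negatives(sim, pos, progress=1.0, rng=rng)
```

Late in the schedule the selected negatives score almost as high as the positive, which is the "borderline" regime the abstract describes. A learnable sampler would replace the fixed temperature with trained parameters.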

Paper Structure

This paper contains 57 sections, 3 theorems, 30 equations, 6 figures, 8 tables, 1 algorithm.

Key Result

Theorem 4.1

Assume Assumptions A1 and A2 (asm:a1, asm:a2) hold. Fix $\delta \in (0,1)$ and a margin $m > \varepsilon$, and let $d_{\text{eff}}$ be the effective (pseudo-)dimension of $\Phi$. If […], then with probability at least $1-\delta$ […], where the additional $L^{2}/n$ term refines the classical $\tilde{\mathcal{O}}(1/\sqrt{n})$ rate to a fast rate whenever $m-\varepsilon=\Theta(1)$.
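
To give a feel for the difference between a classical $1/\sqrt{n}$ rate and a fast $1/n$ rate, the sketch below compares the two scalings directly. It ignores constants, logarithmic factors, and every other term of the theorem's bound, so it shows orders of magnitude only. The helper names `slow_rate` and `fast_rate` are illustrative.

```python
import math

def slow_rate(n: int) -> float:
    """Classical scaling 1/sqrt(n), constants and log factors dropped."""
    return 1.0 / math.sqrt(n)

def fast_rate(n: int) -> float:
    """Fast scaling 1/n, constants and log factors dropped."""
    return 1.0 / n

for n in (10**2, 10**4, 10**6):
    ratio = slow_rate(n) / fast_rate(n)  # equals sqrt(n)
    print(f"n={n:>8}  1/sqrt(n)={slow_rate(n):.1e}  1/n={fast_rate(n):.1e}  ratio={ratio:.0f}")
```

The gap between the two scalings grows as $\sqrt{n}$. Put differently, reaching a target error of 0.01 takes on the order of $10^{4}$ samples under the slow scaling but only about $10^{2}$ under the fast one.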

Figures (6)

  • Figure 1: Comparison between the BLIP pipeline and our proposed BACL pipeline. Methods like BLIP eliminate ambiguous negatives through threshold filtering without explicitly leveraging their intrinsic value. In contrast, BACL employs a curriculum learning strategy to progressively introduce more challenging ambiguous negative samples, explicitly revealing the sources of confusion. This approach enhances discriminative capability by jointly optimizing the global contrastive loss and the Contrastive Local Attention (CLA) loss.
  • Figure 2: Ablation study on (a) LAION-400M and (b) WebVid-10M. Each bar group shows the effect of enabling BNS, CLA, or both (full BACL).
  • Figure 3:
  • Figure 4: Cross-attention visualisation for a randomly selected image–text pair. Left: attention of the positive pair. Middle: attention of the hardest negative (selected by BNS). Right: element-wise difference $\Delta A$ with the ten largest discrepancies boxed in red—the regions CLA focuses on.
  • Figure 5: Hard-negative mining on LAION-400M. Left: False Positive Rate decreases monotonically as the ambiguity margin $\varepsilon$ shrinks. Right: Recall@10 improves simultaneously. Curves compare different candidate-pool sizes $k$ (nearest neighbours).
  • ...and 1 more figure
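
The difference map described in the Figure 4 caption can be sketched as follows. It is a minimal illustration, assuming the two attention maps are plain 2-D arrays of the same shape and that "largest discrepancies" means largest absolute difference; the caption does not say whether the ranking is signed. The function name `top_attention_discrepancies` is hypothetical.

```python
import numpy as np

def top_attention_discrepancies(attn_pos, attn_neg, k=10):
    """Return the difference map and the k cells where the two maps differ most.

    attn_pos, attn_neg : 2-D arrays of identical shape (positive pair, hard negative)
    k                  : number of cells to report (ten in the Figure 4 caption)
    """
    delta = attn_pos - attn_neg              # element-wise difference, Delta A
    flat = np.abs(delta).ravel()             # assumption: rank by absolute value
    k = min(k, flat.size)
    top = np.argpartition(flat, -k)[-k:]     # indices of the k largest, unordered
    top = top[np.argsort(flat[top])[::-1]]   # order them from largest to smallest
    rows, cols = np.unravel_index(top, delta.shape)
    return delta, list(zip(rows.tolist(), cols.tolist()))

# Usage: two toy 3x4 maps that disagree in two cells.
a_pos = np.zeros((3, 4))
a_neg = np.zeros((3, 4))
a_pos[1, 2] = 0.9
a_neg[0, 0] = 0.5
delta, cells = top_attention_discrepancies(a_pos, a_neg, k=2)
```

The returned coordinates are the cells that would be boxed in a visualisation of this kind. How the CLA loss weights those regions is defined in the paper and is not reproduced here.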

Theorems & Definitions (6)

  • Theorem 4.1: Fast–rate Generalisation of BACL
  • Theorem 4.2: Minimax Lower Bound for Uniform Samplers
  • Proposition 4.1: Exponential Contraction of Alignment Margin
  • proof
  • proof
  • proof