DAC: 2D-3D Retrieval with Noisy Labels via Divide-and-Conquer Alignment and Correction

Chaofan Gan, Yuanpeng Tu, Yuxi Li, Weiyao Lin

TL;DR

DAC addresses 2D-3D cross-modal retrieval under noisy labels by introducing Multimodal Dynamic Division (MDD) to estimate per-sample credibility from a fused multimodal loss distribution, then Adaptive Alignment and Correction (AAC) to treat clean and noisy subsets with semantic/instance alignment and a self-correction mechanism. The framework is complemented by a new Objaverse-N200 benchmark that provides realistic, large-scale noisy data for evaluation. Empirically, DAC yields substantial gains over state-of-the-art methods on ModelNet10/40 and Objaverse-N200, demonstrating robustness to both synthetic and realistic noise and showing that MDD’s dynamic credibility modeling generalizes across methods. The results highlight the practical impact of dividing samples by learned credibility and using adaptive supervision to improve cross-modal semantic compactness while mitigating label noise.

Abstract

With the recent burst of 2D and 3D data, cross-modal retrieval has attracted increasing attention. However, manual labeling by non-experts inevitably introduces corrupted annotations given ambiguous 2D/3D content. Though previous works have addressed this issue with a naive division strategy based on hand-crafted thresholds, their performance is generally highly sensitive to the threshold value. Besides, they fail to fully utilize the valuable supervisory signals within each divided subset. To tackle this problem, we propose a Divide-and-conquer 2D-3D cross-modal Alignment and Correction framework (DAC), which comprises Multimodal Dynamic Division (MDD) and Adaptive Alignment and Correction (AAC). Specifically, the former performs accurate sample division by adaptive credibility modeling for each sample based on the compensation information within the multimodal loss distribution. Then, in AAC, samples in distinct subsets are exploited with different alignment strategies to fully enhance semantic compactness while alleviating over-fitting to noisy labels, and a self-correction strategy is introduced to improve the quality of the representation. Moreover, to evaluate effectiveness in real-world scenarios, we introduce a challenging noisy benchmark, namely Objaverse-N200, which comprises 200k-level samples annotated with 1156 realistic noisy labels. Extensive experiments on both traditional and the newly proposed benchmarks demonstrate the generality and superiority of our DAC, which outperforms state-of-the-art models by a large margin (i.e., with a +5.9% gain on ModelNet40 and +5.8% on Objaverse-N200).
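
The abstract does not spell out how MDD turns the multimodal loss distribution into a per-sample credibility. One common approach in the noisy-label literature is to fit a two-component mixture to the per-sample losses and read the posterior of the low-loss component as the probability that a label is clean. The sketch below illustrates that idea only and is not the authors' implementation. The equal-weight fusion of image and point-cloud losses, the 0.5 division threshold, and the function names `fit_two_gaussians` and `credibility_from_losses` are all assumptions made for illustration.

```python
import numpy as np


def fit_two_gaussians(x, n_iter=50, eps=1e-8):
    """Fit a 1-D two-component Gaussian mixture to x with plain EM."""
    # Start one component at the smallest loss and one at the largest.
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.full(2, x.var() + eps)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = dens / (dens.sum(axis=1, keepdims=True) + eps)
        # M-step: re-estimate mixture weights, means, and variances.
        nk = resp.sum(axis=0) + eps
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + eps
    return pi, mu, var


def credibility_from_losses(loss_img, loss_pc):
    """Posterior probability that each sample belongs to the low-loss component."""
    # Assumed fusion rule: equal-weight average of the two modality losses.
    fused = 0.5 * (np.asarray(loss_img, dtype=float) + np.asarray(loss_pc, dtype=float))
    # Min-max normalise so the mixture fit does not depend on the loss scale.
    fused = (fused - fused.min()) / (fused.max() - fused.min() + 1e-8)
    pi, mu, var = fit_two_gaussians(fused)
    dens = pi * np.exp(-0.5 * (fused[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    post = dens / (dens.sum(axis=1, keepdims=True) + 1e-8)
    # The component with the smaller mean is treated as the "clean" one.
    return post[:, np.argmin(mu)]


# Synthetic demo: 60% low-loss (clean) samples, 40% high-loss (noisy) samples.
rng = np.random.default_rng(0)
is_clean = np.arange(1000) < 600
loss_img = np.where(is_clean, rng.normal(0.3, 0.1, 1000), rng.normal(2.5, 0.4, 1000))
loss_pc = np.where(is_clean, rng.normal(0.3, 0.1, 1000), rng.normal(2.5, 0.4, 1000))
credibility = credibility_from_losses(loss_img, loss_pc)
clean_mask = credibility > 0.5  # assumed threshold for the clean/noisy split
print("division accuracy:", (clean_mask == is_clean).mean())
```

Because the split comes from a distribution fitted to the current losses rather than from a fixed loss cutoff, the boundary between the clean and noisy subsets moves as training progresses, which is the property the abstract contrasts with hand-crafted thresholds.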

Paper Structure (14 sections, 12 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Comparison between previous methods and our DAC. Our DAC employs a divide-and-conquer scheme to adaptively mine the discriminative semantics in distinct subsets, guided by the dynamically estimated credibility of each sample.
  • Figure 2: (a) and (b) show the loss distribution of the noisy dataset after model convergence without and with sample division, respectively. Division Acc denotes the proportion of true- and false-labeled samples that are correctly identified by the sample division strategy. (c) shows the Division Acc of hand-crafted and dynamic sample division strategies under different symmetric noise ratios. (d) shows the Division Acc of image-based, point cloud-based, and multimodal-based dynamic sample division strategies during training. The experiments are conducted on ModelNet40 under 40% symmetric noise.
  • Figure 3: An overview of our method DAC. (a) MDD: Multimodal Dynamic Division strategy (Sec. \ref{sec:MDD}), (b) AAC: Adaptive Alignment and Correction strategy (Sec. \ref{sec:AAC}). Our model performs Divide-and-Conquer alignment for different noisy samples based on the credibility of each sample. Specifically, MDD dynamically models the credibility of each sample based on the multimodal loss distribution of the dataset and divides the noisy samples into clean and noisy sets based on credibility. Then, AAC conquers different samples with adaptive alignment strategies and adopts a self-correction strategy to refurbish the corrupted labels of samples.
  • Figure 4: Investigation of the division accuracy of MDD and the correction accuracy of the corrected labels generated by the self-correction strategy on ModelNet10 and ModelNet40 under 40% symmetric noise (Sym-40%) and 80% symmetric noise (Sym-80%).
  • Figure 5: Investigation of the multimodal loss distribution and division threshold $\alpha$ on ModelNet40 under 40% symmetric noise.
  • ...and 1 more figures