Adaptive Confidence Multi-View Hashing for Multimedia Retrieval

Jian Zhu; Yu Cui; Zhangmin Huang; Xingyu Li; Lei Liu; Lingfang Zeng; Li-Rong Dai

TL;DR

The paper addresses noisy, unreliable fusion in multi-view multimedia retrieval with Adaptive Confidence Multi-View Hashing (ACMVH). ACMVH combines per-view confidence networks, an adaptive confidence fusion mechanism, and a dilation network to produce robust $k$-bit hash codes. It is trained on the joint objective $L_{total} = L_{sim} + \mu L_{clf}$, which adds a classification term, scaled by $\mu$, to a retrieval similarity term. On MIR-Flickr25K and NUS-WIDE, ACMVH improves mean Average Precision by up to $3.24\%$ over state-of-the-art baselines, and ablation studies confirm that the confidence and fusion modules are critical to this result. The convergence analysis shows stable training and generalization. Future work aims to sustain these gains with longer hash codes and to further improve cross-view representation learning within the confidence-driven fusion framework.
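To make the role of the $k$-bit hash codes concrete, the following is a minimal sketch of how binary codes are typically used at retrieval time: real-valued outputs are binarized with the sign function and database items are ranked by Hamming distance to the query. The function names (`binarize`, `hamming_distance`, `rank_by_hamming`) and the toy values are illustrative assumptions, not taken from the ACMVH source code.

```python
import numpy as np


def binarize(features):
    """Map real-valued features to codes in {-1, +1} using the sign function."""
    return np.where(features >= 0, 1, -1).astype(np.int8)


def hamming_distance(query_code, database_codes):
    """Hamming distance between one k-bit code and every row of a code matrix."""
    k = query_code.shape[0]
    # For codes in {-1, +1}, distance = (k - inner product) / 2.
    inner = database_codes.astype(np.int32) @ query_code.astype(np.int32)
    return (k - inner) // 2


def rank_by_hamming(query_code, database_codes):
    """Return database indices sorted from nearest to farthest, plus distances."""
    distances = hamming_distance(query_code, database_codes)
    return np.argsort(distances, kind="stable"), distances


# Toy example with k = 4 bits (hypothetical values).
database = binarize(np.array([[0.9, -0.2, 0.4, -0.7],
                              [-0.5, 0.3, -0.1, 0.8],
                              [0.6, -0.9, -0.3, -0.2]]))
query = binarize(np.array([0.7, -0.1, 0.2, -0.4]))
order, distances = rank_by_hamming(query, database)
print(order, distances)
```

Ranking by Hamming distance is what makes hashing attractive for large collections: comparisons reduce to cheap integer operations on short codes.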

Abstract

Multi-view hashing converts heterogeneous data from multiple views into binary hash codes and is one of the critical technologies in multimedia retrieval. However, current methods mainly explore the complementarity among multiple views and lack confidence learning and fusion. Moreover, in practical application scenarios, single-view data contain redundant noise. To perform confidence learning and eliminate unnecessary noise, we propose a novel Adaptive Confidence Multi-View Hashing (ACMVH) method. First, a confidence network is developed to extract useful information from the various single-view features and remove noise. Furthermore, an adaptive confidence multi-view network is employed to measure the confidence of each view and then fuse the multi-view features through a weighted summation. Lastly, a dilation network is designed to further enhance the representation of the fused features. To the best of our knowledge, we pioneer the application of confidence learning to the field of multimedia retrieval. Extensive experiments on two public datasets show that the proposed ACMVH performs better than state-of-the-art methods (maximum increase of 3.24%). The source code is available at https://github.com/HackerHyper/ACMVH.
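The weighted-summation fusion described above can be illustrated with a short sketch. It assumes that each view already has a scalar confidence score per sample and that the scores are normalized with a softmax before the weighted sum. The softmax normalization, the function names, and the toy values are assumptions made for illustration; the abstract only states that views are fused through a weighted summation based on their confidence.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def fuse_views(view_features, confidence_scores):
    """Fuse per-view features with confidence weights.

    view_features:     array of shape (batch, views, dim)
    confidence_scores: array of shape (batch, views), unnormalized
    Returns the fused features (batch, dim) and the weights (batch, views).
    """
    weights = softmax(confidence_scores)  # assumed normalization
    fused = (weights[..., None] * view_features).sum(axis=1)
    return fused, weights


# Toy example: two samples, two views (image and text), two feature dims.
image_feat = np.array([[1.0, 0.0], [0.0, 2.0]])
text_feat = np.array([[0.0, 1.0], [2.0, 0.0]])
views = np.stack([image_feat, text_feat], axis=1)

# Sample 0 trusts both views equally; sample 1 trusts the image view 3:1.
scores = np.array([[0.0, 0.0], [np.log(3.0), 0.0]])
fused, weights = fuse_views(views, scores)
print(weights)
print(fused)
```

Because the weights are computed per sample, a view that is noisy for one item can be down-weighted there without being discounted for the whole dataset.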

Paper Structure (17 sections, 14 equations, 3 figures, 3 tables)

Introduction
The Proposed Methodology
Deep Multi-view Hashing Network
Backbones
Confidence Networks
Adaptive Confidence Multi-View Network
Dilation Network
Hash Layer
Loss Functions
Experiments
Evaluation Datasets and Metrics
Baseline
Analysis of Experimental Results
Ablation Studies
Convergence Analysis
...and 2 more sections

Figures (3)

Figure 1: Adaptive Confidence Multi-View Learning. First, a confidence network is applied to the individual view features to extract useful features and suppress redundant ones. Second, the confidence value of each single-view feature is learned automatically, and the features are fused by a weighted summation. Finally, a dilation network is applied to the fused feature to generate a global representation.
Figure 2: Flow chart of the ACMVH method. The vision and text features are extracted by their respective backbones. Each single-view feature is passed through the confidence network to mine useful information. Then, view-level adaptive confidence learning is performed, and the multiple view features are adaptively fused. Subsequently, the dilation network is applied to the fused features to enhance the semantic representation. Finally, the hash layer outputs the binary hash codes based on the enhanced semantic representation.
Figure 3: Training loss and test mAP curves on the MIR-Flickr25K dataset.
