Saliency-Aware Automatic Buddhas Statue Recognition
Yong Qi, Fanghan Zhao
TL;DR
The paper tackles Buddha statue recognition, a task hampered by cultural diversity and class imbalance. It introduces Grid-wise Local Self-Attention (GLSA), a saliency-map sampling module that yields grid-based salient features, and a dual-branch architecture that fuses local salient and global image representations for classification. On an expert-curated Buddha statue dataset, GLSA-enabled models outperform baselines by about 4.63 percentage points in Top-1 accuracy while maintaining comparable computational costs, underscoring robustness to imbalance. By reducing reliance on costly context graphs and enhancing salient-region processing, the approach offers practical gains for artwork recognition and can extend to related cultural heritage tasks.
Abstract
Buddha statues, as symbols in many religions, carry significant cultural meaning and are crucial for understanding the culture and history of different regions. Recognizing Buddha statues is therefore a pivotal link in the field of Buddha studies. However, Buddha statue recognition requires extensive time and effort from knowledgeable professionals, which makes it a costly task. Convolutional neural networks (CNNs) are inherently efficient at processing visual information, but CNNs alone are likely to make inaccurate classification decisions when faced with class imbalance. This paper therefore proposes an end-to-end automatic Buddha statue recognition model based on saliency map sampling. The proposed Grid-Wise Local Self-Attention Module (GLSA) provides additional salient features that enrich the dataset and let CNNs observe the images more comprehensively. Finally, our model is evaluated on a Buddha dataset collected with the aid of Buddha experts, where it outperforms state-of-the-art networks in Top-1 accuracy by 4.63% on average while only marginally increasing the number of multiply-add (MUL-ADD) operations.
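To make the idea of grid-wise local self-attention and two-branch fusion concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes that attention is computed independently inside each non-overlapping grid cell, that queries, keys, and values are the raw features (no learned projections), and that the two branches are fused by global average pooling followed by concatenation. The function names `grid_local_self_attention` and `fuse` and the `grid` parameter are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grid_local_self_attention(feat, grid):
    """Self-attention restricted to each grid cell (illustrative assumption).

    feat: array of shape (H, W, C); grid: cell side length dividing H and W.
    """
    H, W, C = feat.shape
    assert H % grid == 0 and W % grid == 0, "grid must divide H and W"
    gh, gw = H // grid, W // grid
    # Split into cells: (gh, gw, grid*grid, C), one token per pixel in a cell.
    cells = feat.reshape(gh, grid, gw, grid, C).transpose(0, 2, 1, 3, 4)
    cells = cells.reshape(gh, gw, grid * grid, C)
    # Scaled dot-product attention within each cell (Q = K = V = cells).
    scores = cells @ cells.transpose(0, 1, 3, 2) / np.sqrt(C)
    out = softmax(scores, axis=-1) @ cells
    # Restore the original (H, W, C) layout.
    out = out.reshape(gh, gw, grid, grid, C).transpose(0, 2, 1, 3, 4)
    return out.reshape(H, W, C)

def fuse(local_feat, global_feat):
    # Assumed fusion: average-pool each branch, then concatenate the vectors.
    return np.concatenate([local_feat.mean(axis=(0, 1)),
                           global_feat.mean(axis=(0, 1))])

rng = np.random.default_rng(0)
global_feat = rng.normal(size=(8, 8, 4))            # stand-in for a CNN feature map
local_feat = grid_local_self_attention(global_feat, grid=4)
fused = fuse(local_feat, global_feat)               # vector fed to a classifier head
```

Restricting attention to grid cells keeps the cost per cell quadratic only in the number of positions inside the cell, which is one plausible way to add salient-region processing without a large increase in multiply-add operations.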
