Leveraging Model Soups to Classify Intangible Cultural Heritage Images from the Mekong Delta

Quoc-Khang Tran; Minh-Thien Nguyen; Nguyen-Khang Pham

TL;DR

This work combines the hybrid CoAtNet architecture with model soups, a lightweight weight-space ensembling technique that averages checkpoints from a single training trajectory, and shows that diversity-aware checkpoint averaging offers a principled and efficient way to reduce variance and improve generalization in culturally rich, data-scarce classification tasks.

Abstract

The classification of Intangible Cultural Heritage (ICH) images in the Mekong Delta poses unique challenges due to limited annotated data, high visual similarity among classes, and domain heterogeneity. In such low-resource settings, conventional deep learning models often suffer from high variance or overfit to spurious correlations, leading to poor generalization. To address these limitations, we propose a robust framework that integrates the hybrid CoAtNet architecture with model soups, a lightweight weight-space ensembling technique that averages checkpoints from a single training trajectory without increasing inference cost. CoAtNet captures both local and global patterns through stage-wise fusion of convolution and self-attention. We apply two ensembling strategies - greedy and uniform soup - to selectively combine diverse checkpoints into a final model. Beyond performance improvements, we analyze the ensembling effect through the lens of bias-variance decomposition. Our findings show that model soups reduces variance by stabilizing predictions across diverse model snapshots, while introducing minimal additional bias. Furthermore, using cross-entropy-based distance metrics and Multidimensional Scaling (MDS), we show that model soups selects geometrically diverse checkpoints, unlike Soft Voting, which blends redundant models centered in output space. Evaluated on the ICH-17 dataset (7,406 images across 17 classes), our approach achieves state-of-the-art results with 72.36% top-1 accuracy and 69.28% macro F1-score, outperforming strong baselines including ResNet-50, DenseNet-121, and ViT. These results underscore that diversity-aware checkpoint averaging provides a principled and efficient way to reduce variance and enhance generalization in culturally rich, data-scarce classification tasks.
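The two soup strategies named in the abstract can be illustrated with a minimal sketch. It assumes the general model-soup recipe (uniform: average all checkpoints; greedy: add a checkpoint only if the averaged model does not lose held-out accuracy) and is not the authors' exact implementation. The `Weights` representation (parameter name mapped to a flat list of floats), the function names, and the `evaluate` callback are all hypothetical, chosen for illustration.

```python
from typing import Callable, Dict, List, Sequence

# Assumption: a checkpoint is a mapping from parameter name to flattened values.
Weights = Dict[str, List[float]]


def average_weights(checkpoints: Sequence[Weights]) -> Weights:
    """Element-wise mean of checkpoints sharing the same parameter names and sizes."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in checkpoints[0]
    }


def uniform_soup(checkpoints: Sequence[Weights]) -> Weights:
    """Uniform soup: average every checkpoint, with no selection."""
    return average_weights(checkpoints)


def greedy_soup(
    checkpoints: Sequence[Weights],
    evaluate: Callable[[Weights], float],
) -> Weights:
    """Greedy soup: visit checkpoints from best to worst held-out score and keep
    a candidate only if the averaged soup scores at least as well as before."""
    ranked = sorted(checkpoints, key=evaluate, reverse=True)
    ingredients = [ranked[0]]
    best = evaluate(ranked[0])
    for candidate in ranked[1:]:
        score = evaluate(average_weights(ingredients + [candidate]))
        if score >= best:
            ingredients.append(candidate)
            best = score
    return average_weights(ingredients)
```

In both cases the result is a single set of weights, which is why inference cost matches that of one model. In practice `evaluate` would load the weights into the network and return validation accuracy.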

Paper Structure (20 sections, 5 equations, 4 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Research Methodology
Classification of Intangible Cultural Heritage Images
CoAtNet: Hybrid Convolution-Attention Design
Weight-Space Ensembling via Model Soups
Experiments
Dataset
Implementation Details
Rationale for $k=8$
Experimental Setup
Evaluation Metrics
Experiment Results
Main Results and Analysis
Comparison between Uniform and Greedy Soup
...and 5 more sections

Figures (4)

Figure 1: An illustrative example of the challenging nature of this task: class 4 (left) and class 8 (right) share highly similar visual contexts, making them difficult to distinguish.
Figure 2: Overview of CoAtNet architecture. The model comprises five stages, gradually transitioning from convolutional blocks (MBConv) to Transformer blocks (self-attention). Each stage reduces spatial resolution while increasing the number of channels. Image credit: dai2021coatnetmarryingconvolutionattention.
Figure 3: Comparison of validation and test accuracy (%) between individual CoAtNet-2 models and their combinations via model soups techniques. The ensemble models -- greedy soup and uniform soup -- outperform all individual models, demonstrating the effectiveness of model soups in enhancing generalization.
Figure 4: MDS visualization of model predictions on the validation set. Each point represents a model projected via MDS using cross-entropy-based pairwise distances. The ingredient models (blue) are broadly distributed, while ensemble models -- Greedy (red), Uniform (green), and Soft Voting (orange) -- cluster near the center. This spatial structure confirms that model soups leverages predictive diversity more effectively than Soft Voting. The value above each point represents the accuracy on the validation set for the corresponding model.
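The pairwise distances behind a plot like Figure 4 could be computed along the following lines. This is a sketch under stated assumptions: the page does not give the exact form of the cross-entropy-based distance, so the symmetrized mean cross-entropy used here, the zero diagonal, and the names `mean_cross_entropy` and `pairwise_distances` are illustrative choices rather than the authors' definition.

```python
import math
from typing import List, Sequence

# Assumption: each model is represented by its per-sample class probabilities
# on the validation set (one row per image, one column per class).
Probs = Sequence[Sequence[float]]


def mean_cross_entropy(p: Probs, q: Probs, eps: float = 1e-12) -> float:
    """Average over samples of H(p, q) = -sum_c p_c * log(q_c)."""
    total = 0.0
    for p_row, q_row in zip(p, q):
        # Clamp q to avoid log(0) when a model assigns zero probability.
        total -= sum(pc * math.log(max(qc, eps)) for pc, qc in zip(p_row, q_row))
    return total / len(p)


def pairwise_distances(models: Sequence[Probs]) -> List[List[float]]:
    """Symmetric dissimilarity matrix; the diagonal is set to zero by convention."""
    n = len(models)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 0.5 * (
                mean_cross_entropy(models[i], models[j])
                + mean_cross_entropy(models[j], models[i])
            )
            dist[i][j] = dist[j][i] = d
    return dist
```

The resulting matrix can be passed as a precomputed dissimilarity to any MDS implementation to obtain 2-D coordinates for each model.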
