Multimodal Approach for Harmonized System Code Prediction
Otmane Amel, Sedrick Stassin, Sidi Ahmed Mahmoudi, Xavier Siebert
TL;DR
This work tackles HS6 code prediction for customs with a multimodal approach that integrates textual data from customs declarations and e-commerce contexts with product images. It compares multiple early fusion strategies and introduces MultConcat, a fusion method that combines concatenation with an element-wise product to capture cross-modal interactions. The best configuration, ResNet50 with MultConcat fusion, achieves top-3 and top-5 accuracies of 93.5% and 98.2%, respectively, outperforming unimodal baselines and other fusion schemes and demonstrating the value of multimodal cues in HS code classification. The study also analyzes the impact of different image encoders and textual modalities, and outlines future work on explainability and on handling missing modalities to support real-world deployment in customs workflows.
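As a rough illustration of the fusion idea summarized above, the sketch below joins a text feature vector and an image feature vector with their element-wise product. The exact formulation of MultConcat is not given in this summary, so the output layout [text, image, text * image], the requirement that both vectors share one dimension, and the function name `mult_concat` are all assumptions made for illustration only.

```python
def mult_concat(text_feat, image_feat):
    """Hypothetical fusion: concatenate both vectors and their element-wise product.

    Assumes both feature vectors were already projected to the same length,
    which the element-wise product requires.
    """
    if len(text_feat) != len(image_feat):
        raise ValueError("text and image features must have the same length")
    # Element-wise product models multiplicative cross-modal interactions.
    product = [t * v for t, v in zip(text_feat, image_feat)]
    # Concatenation keeps the original unimodal features alongside the product.
    return list(text_feat) + list(image_feat) + product


# Toy 2-dimensional features (made-up values, not from the paper).
fused = mult_concat([1.0, 2.0], [3.0, 0.5])
print(len(fused))  # fused vector is three times the input dimension
```

In this reading, plain concatenation alone would leave it to the downstream classifier to learn any interaction between modalities, while the product term supplies one explicitly.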
Abstract
The rapid growth of e-commerce has placed considerable pressure on customs representatives, creating a need for more advanced methods. To address this, artificial intelligence (AI) systems have emerged as a promising approach to minimize the risks faced. Given that the Harmonized System (HS) code is a crucial element of an accurate customs declaration, we propose a novel multimodal HS code prediction approach using deep learning models that exploit both image and text features obtained from the customs declaration combined with e-commerce platform information. We evaluated two early fusion methods and introduced our MultConcat fusion method. To the best of our knowledge, few studies in the state of the art analyze the feature-level combination of text and image for HS code prediction, which heightens interest in our paper and its findings. The experimental results demonstrate the effectiveness of our approach and fusion method, with top-3 and top-5 accuracies of 93.5% and 98.2%, respectively.
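For readers unfamiliar with the reported metric, top-k accuracy counts a prediction as correct when the true HS code is among the k highest-scoring classes. The following is a minimal sketch of that computation; the function name `top_k_accuracy` and the toy scores and labels are hypothetical and are not taken from the paper's experiments.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if not labels:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the top k.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy example: 3 samples, 3 classes (made-up scores).
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
labels = [1, 1, 0]
top1 = top_k_accuracy(scores, labels, 1)
top2 = top_k_accuracy(scores, labels, 2)
```

Top-k accuracy can only stay equal or grow as k increases, which is consistent with the top-5 figure being higher than the top-3 figure.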
