Alibaba International E-commerce Product Search Competition DcuRAGONs Team Technical Report
Thang-Long Nguyen-Ho, Minh-Khoi Pham, Hoang-Bao Le
TL;DR
The paper tackles multilingual e-commerce relevance through two tasks, Query-Category relevance and Query-Item relevance, using real-world Alibaba logs. It proposes a data-centric pipeline built on transformer-based multilingual LLMs, translation augmentation to English, and a two-stage training regimen with Task-Adaptive Pre-Training (TAPT), combined with a category-aware cross-validation scheme to prevent leakage. Across experiments, larger multilingual models (notably Gemma-3-12B) outperform smaller baselines, and TAPT yields consistent, though modest, gains, culminating in the top score on the private leaderboard. The work demonstrates that careful data handling, domain-focused pretraining, and scalable architectures can deliver robust multilingual e-commerce search capabilities, with clear avenues for extending to multitask learning and taxonomy-aware modeling to further improve generalization and efficiency.
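To make the leakage concern concrete, the following is a minimal sketch of one way a category-aware split could work. It assumes "category-aware" means that all samples sharing a category land in the same fold, so no category appears in both training and validation data. The report's actual scheme may differ. The function names `category_aware_folds` and `train_val_splits` and the toy data are hypothetical and written only for illustration.

```python
from collections import defaultdict


def category_aware_folds(categories, n_folds=5):
    """Assign sample indices to folds so that no category spans two folds.

    Assumption: grouping by category is what prevents leakage; this is an
    illustrative interpretation, not the report's confirmed procedure.
    """
    # Collect the sample indices that belong to each category.
    by_category = defaultdict(list)
    for idx, cat in enumerate(categories):
        by_category[cat].append(idx)

    # Greedy balancing: place the largest categories first, each into the
    # fold that currently holds the fewest samples.
    folds = [[] for _ in range(n_folds)]
    ordered = sorted(by_category, key=lambda c: (-len(by_category[c]), str(c)))
    for cat in ordered:
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_category[cat])
    return folds


def train_val_splits(categories, n_folds=5):
    """Yield (train_indices, val_indices) pairs, one per fold."""
    folds = category_aware_folds(categories, n_folds)
    for k in range(n_folds):
        val = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, val


# Hypothetical toy data: one category label per query sample.
toy_categories = ["shoes", "shoes", "phones", "phones", "phones",
                  "toys", "bags", "bags"]
splits = list(train_val_splits(toy_categories, n_folds=3))

for train, val in splits:
    held_out = sorted({toy_categories[i] for i in val})
    print(f"validation categories: {held_out}, train size: {len(train)}")
```

With a split of this kind, validation scores reflect performance on categories the model did not see during training, which is a stricter estimate than a random row-level split would give.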
Abstract
This report details the methodology and results we developed for the Multilingual E-commerce Search Competition. The task is to determine the relevance between user queries and product items in a multilingual context, with the goal of improving recommendation performance on e-commerce platforms. By leveraging Large Language Models (LLMs) and the capabilities they have shown on other tasks, our data-centric method achieved the highest score among all solutions in the competition. The final leaderboard is published at https://alibaba-international-cikm2025.github.io. The source code for our project is published at https://github.com/nhtlongcs/e-commerce-product-search.
