Hierarchical Multi-field Representations for Two-Stage E-commerce Retrieval
Niklas Freymuth, Dong Liu, Thomas Ricatte, Saab Mansour
TL;DR
CHARM tackles e-commerce retrieval by exploiting structured, multi-field product data through cascading hierarchical representations produced by block-triangular attention. It yields field-level vectors $h_{p,f}$ and an aggregated vector $h_p$ for each product. Together these enable two-stage retrieval: the aggregated vectors produce a fast initial shortlist, which is then refined using the field-level representations. The model is trained with a combination of InfoNCE losses that align query representations with both aggregated and field-level product representations, plus a Max loss that promotes diversity across fields. Empirical results on multilingual subsets of a large Amazon-like dataset show CHARM matching or beating strong baselines while significantly reducing computation thanks to the two-stage scheme. Ablations and analyses support the benefits of hierarchy, diversity, and explainability. The work advances practical, scalable, and interpretable dense retrieval for structured product catalogs and lays groundwork for extended attention graphs and field-aware retrieval budgeting.
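To make the block-triangular attention pattern concrete, here is a minimal NumPy sketch of such a mask. It is an illustration under stated assumptions, not the authors' implementation: the function name block_triangular_mask, the example field order (brand, title, description), and the choice that tokens attend fully within their own field and to all earlier fields in the hierarchy are assumptions made for this example.

```python
import numpy as np

def block_triangular_mask(field_lengths):
    """Boolean attention mask over the concatenated tokens of all fields.

    Assumption for this sketch: fields are listed in hierarchy order, and a
    token may attend to every token in its own field and in earlier fields.
    """
    # Field index of every token, e.g. lengths [1, 2, 3] -> [0, 1, 1, 2, 2, 2]
    field_ids = np.repeat(np.arange(len(field_lengths)), field_lengths)
    # mask[i, j] is True iff token i may attend to token j
    return field_ids[None, :] <= field_ids[:, None]

# Hypothetical product with 1 brand token, 2 title tokens, 3 description tokens
mask = block_triangular_mask([1, 2, 3])
print(mask.astype(int))
```

Under this assumption, the brand token sees only itself, title tokens see brand and title, and description tokens see everything, so each later field's representation can build on the coarser ones before it.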
Abstract
Dense retrieval methods typically target unstructured text data represented as flat strings. However, e-commerce catalogs often include structured information across multiple fields, such as brand, title, and description, which hold valuable information for retrieval systems. We present the Cascading Hierarchical Attention Retrieval Model (CHARM), a novel framework designed to encode structured product data into hierarchical field-level representations with progressively finer detail. Using a novel block-triangular attention mechanism, our method captures the interdependencies between product fields in a specified hierarchy, yielding field-level representations and aggregated vectors suitable for fast and efficient retrieval. Combining both representations enables a two-stage retrieval pipeline, in which the aggregated vectors support initial candidate selection, while the more expressive field-level representations allow precise refinement of the candidates for downstream ranking. Experiments on publicly available large-scale e-commerce datasets demonstrate that CHARM matches or outperforms state-of-the-art baselines. Our analysis highlights the framework's ability to align different queries with appropriate product fields, enhancing retrieval accuracy and explainability.
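The two-stage pipeline described above can be sketched as follows. This is a hedged illustration rather than the paper's exact procedure: the function two_stage_retrieve, the dot-product similarity, the toy vectors, and the second-stage rule (taking the maximum query-field score per product) are all assumptions introduced for this example.

```python
import numpy as np

def two_stage_retrieve(q, h_agg, h_field, shortlist_size, top_k):
    """Shortlist with aggregated vectors, then rescore with field vectors.

    q:       (D,)      query embedding
    h_agg:   (P, D)    one aggregated vector h_p per product
    h_field: (P, F, D) field-level vectors h_{p,f} per product
    """
    # Stage 1: cheap scoring of all products against the aggregated vectors
    coarse = h_agg @ q
    shortlist = np.argsort(-coarse)[:shortlist_size]
    # Stage 2 (assumed rule): best-matching field score for shortlisted products
    fine = (h_field[shortlist] @ q).max(axis=1)
    order = np.argsort(-fine)[:top_k]
    return shortlist[order], fine[order]

# Toy data: 4 products, 2 fields, 2-dimensional embeddings (hypothetical values)
q = np.array([1.0, 0.0])
h_agg = np.array([[0.9, 0.0], [0.8, 0.0], [0.1, 0.0], [0.5, 0.0]])
h_field = np.array([
    [[0.2, 0.0], [0.3, 0.0]],
    [[0.95, 0.0], [0.1, 0.0]],
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.6, 0.0], [0.1, 0.0]],
])
ids, scores = two_stage_retrieve(q, h_agg, h_field, shortlist_size=3, top_k=2)
print(ids, scores)
```

In this toy setup the second stage only touches the three shortlisted products, which is where the computational saving comes from. Product 2 has the strongest field-level match but is never rescored because its aggregated vector ranks last, which shows why the quality of the aggregated vectors bounds what the refinement stage can recover.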
