Hierarchical Multi-field Representations for Two-Stage E-commerce Retrieval

Niklas Freymuth, Dong Liu, Thomas Ricatte, Saab Mansour

TL;DR

CHARM tackles e-commerce retrieval by exploiting structured, multi-field product data through cascading hierarchical representations produced by block-triangular attention. It yields field-level vectors $h_{p,f}$ and an aggregated vector $h_p$. Together these enable two-stage retrieval: the aggregated vector produces a fast initial shortlist, which is then refined by re-scoring candidates against their field-specific representations. The model is trained with a combination of InfoNCE losses that align query representations with both aggregated and field-level product representations, plus a Max loss that promotes diversity across fields. Empirical results on multilingual subsets of a large Amazon-like dataset show CHARM matching or beating strong baselines while the two-stage scheme significantly reduces computation. Ablations and analyses confirm the benefits of hierarchy, diversity, and explainability. The work advances practical, scalable, and interpretable dense retrieval for structured product catalogs and lays groundwork for extended attention graphs and field-aware retrieval budgeting.

Abstract

Dense retrieval methods typically target unstructured text data represented as flat strings. However, e-commerce catalogs often include structured information across multiple fields, such as brand, title, and description, which contain information that is potentially important for retrieval systems. We present the Cascading Hierarchical Attention Retrieval Model (CHARM), a novel framework designed to encode structured product data into hierarchical field-level representations with progressively finer detail. Utilizing a novel block-triangular attention mechanism, our method captures the interdependencies between product fields in a specified hierarchy, yielding field-level representations and aggregated vectors suitable for fast and efficient retrieval. Combining both representations enables a two-stage retrieval pipeline, in which the aggregated vectors support initial candidate selection, while more expressive field-level representations facilitate precise fine-tuning for downstream ranking. Experiments on publicly available large-scale e-commerce datasets demonstrate that CHARM matches or outperforms state-of-the-art baselines. Our analysis highlights the framework's ability to align different queries with appropriate product fields, enhancing retrieval accuracy and explainability.
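The two-stage pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `two_stage_retrieve`, the parameters `shortlist_size` and `top_k`, and the use of exhaustive dot-product search in the first stage are assumptions made for illustration. Stage one ranks all products by similarity between the query and the aggregated vectors; stage two re-scores only the shortlisted products by their best-matching field-level vector.

```python
import numpy as np

def two_stage_retrieve(query, agg_vecs, field_vecs, shortlist_size=100, top_k=10):
    """Hypothetical two-stage retrieval sketch.

    query:      (d,)      query representation
    agg_vecs:   (P, d)    one aggregated vector per product
    field_vecs: (P, F, d) one vector per product field
    Returns (product ids, best-matching field index, score), best first.
    """
    # Stage 1: shortlist products by dot product with the aggregated vectors.
    stage1_scores = agg_vecs @ query                      # shape (P,)
    n = min(shortlist_size, len(stage1_scores))
    shortlist = np.argsort(-stage1_scores)[:n]

    # Stage 2: re-score each shortlisted product by its closest field vector.
    field_scores = field_vecs[shortlist] @ query          # shape (n, F)
    best_field = field_scores.argmax(axis=1)
    best_score = field_scores.max(axis=1)

    # Keep the top_k products after re-scoring.
    order = np.argsort(-best_score)[:top_k]
    return shortlist[order], best_field[order], best_score[order]
```

Because the second stage only touches the shortlist, the per-field comparison cost scales with `shortlist_size` rather than with the catalog size. The returned field index indicates which field matched the query best, which is one way to read the explainability claim.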

Paper Structure

This paper contains 24 sections, 10 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Retrieval with CHARM. Given a product with multiple fields (colored blocks), CHARM produces a cascade of field-level representations (diamonds), with each representation containing information from its own and all previous fields. These representations are combined ($\bigoplus$) into an aggregated representation with learned weights, which is used to match query representations (circles) to the product in a first retrieval stage. Queries matched to the aggregated representation are re-evaluated by their distance to the closest field-level representation, allowing for queries with different levels of detail to match the same product.
  • Figure 2: Schematic overview of CHARM. Our model receives a tokenized product, including a special token per product field. The encoder's attention mask is set to a block-triangular structure that allows tokens to attend to their own field and all previous fields. By arranging fields hierarchically, the field-wise special tokens can capture increasingly detailed product information, enabling diverse product representations in a single model forward pass. We train the encoded special tokens and an aggregated representation of them to match the query representation.
  • Figure 3: Average length of queries matching a product field by closest dot-product similarity. Product fields that are on a higher hierarchy level generally match longer queries.
  • Figure 4: Weight of the "Description" field in the aggregated representation.
  • Figure 5: Frequency of product fields appearing as top $10$ matches for any query.
  • ...and 4 more figures
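
The block-triangular attention mask described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the helper name `block_triangular_mask` is hypothetical, fields are assumed to be laid out contiguously in hierarchy order, and each field's length is assumed to already include its special token.

```python
import numpy as np

def block_triangular_mask(field_lengths):
    """Build a boolean (T, T) attention mask for T = sum(field_lengths) tokens.

    Entry [i, j] is True when token i may attend to token j, i.e. when
    token j belongs to the same field as token i or to an earlier field.
    """
    # Assign each token the index of the field it belongs to.
    field_ids = np.repeat(np.arange(len(field_lengths)), field_lengths)
    # A token sees every token whose field index is not larger than its own.
    return field_ids[:, None] >= field_ids[None, :]
```

For example, `block_triangular_mask([2, 1, 3])` gives a 6 x 6 mask in which the two tokens of the first field see only each other, while the three tokens of the last field see all six tokens. Unlike a causal mask, tokens within the same field attend to each other in both directions.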