Efficient and Interpretable Information Retrieval for Product Question Answering with Heterogeneous Data

Biplob Biswas, Rajiv Ramnath

TL;DR

The paper tackles vocabulary mismatch in information retrieval (IR) for product question answering. It proposes a hybrid ranker whose dual encoders jointly learn dense semantic representations and expansion-enhanced sparse lexical representations through contrastive learning. Semantic and lexical matching signals are blended with a tunable balance, which enables single-stage ranking and makes the ranking easier to interpret through the learned term expansions. Evaluated on hetPQA, the method improves MRR@5 by 10.95% over an independently trained sparse retriever and by 2.7% over an independently trained dense retriever. It performs comparably to cross-encoders at lower latency and fewer FLOPs, and it supports answer generation with a Fusion-in-Decoder generator. These properties make the approach practical for the diverse and noisy content found on real-world product pages.
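To make the "tunable balance" concrete, the following is a minimal sketch of how a semantic score and a lexical score could be blended for single-stage ranking. It is an illustration under assumptions, not the paper's formulation: the convex combination with weight `alpha`, the dot-product scoring, and the names `hybrid_score`, `rank_candidates`, and `top_terms` are all hypothetical, as are the toy vectors and vocabulary.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def hybrid_score(q_dense, d_dense, q_sparse, d_sparse, alpha: float = 0.5) -> float:
    """Blend a semantic (dense) and a lexical (sparse) matching score.

    ASSUMPTION: a convex combination; alpha=1.0 is purely semantic,
    alpha=0.0 is purely lexical. The paper may combine scores differently.
    """
    semantic = dot(q_dense, d_dense)
    lexical = dot(q_sparse, d_sparse)  # sparse vectors are indexed by vocabulary term
    return alpha * semantic + (1.0 - alpha) * lexical


def rank_candidates(q_dense, q_sparse, cand_dense, cand_sparse,
                    alpha: float = 0.5) -> Tuple[List[int], List[float]]:
    """Score every candidate once and return indices sorted best-first."""
    scores = [hybrid_score(q_dense, d, q_sparse, s, alpha)
              for d, s in zip(cand_dense, cand_sparse)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


def top_terms(sparse_vec, vocab, k: int = 3) -> List[str]:
    """List the k highest-weighted vocabulary terms with non-zero weight."""
    weighted = [(w, t) for w, t in zip(sparse_vec, vocab) if w > 0]
    return [t for w, t in sorted(weighted, reverse=True)[:k]]


# Toy data (hypothetical): 2-d dense vectors, 4-term vocabulary.
vocab = ["battery", "charger", "warranty", "adapter"]
q_dense, q_sparse = [1.0, 0.0], [0.0, 1.0, 0.0, 0.5]
cand_dense = [[0.9, 0.1], [0.2, 0.8]]
cand_sparse = [[0.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 1.0]]

order_semantic, _ = rank_candidates(q_dense, q_sparse, cand_dense, cand_sparse, alpha=1.0)
order_lexical, _ = rank_candidates(q_dense, q_sparse, cand_dense, cand_sparse, alpha=0.0)
print(order_semantic, order_lexical, top_terms(q_sparse, vocab))
```

In this toy setup the two extremes of `alpha` rank the candidates in opposite orders, which is the kind of trade-off a tunable balance is meant to expose. The non-zero sparse weights (here including "adapter", standing in for an expansion term) are what make such a ranker inspectable.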

Abstract

Expansion-enhanced sparse lexical representation improves information retrieval (IR) by minimizing vocabulary mismatch problems during lexical matching. In this paper, we explore the potential of jointly learning dense semantic representation and combining it with the lexical one for ranking candidate information. We present a hybrid information retrieval mechanism that maximizes lexical and semantic matching while minimizing their shortcomings. Our architecture consists of dual hybrid encoders that independently encode queries and information elements. Each encoder jointly learns a dense semantic representation and a sparse lexical representation augmented by a learnable term expansion of the corresponding text through contrastive learning. We demonstrate the efficacy of our model in single-stage ranking of a benchmark product question-answering dataset containing the typical heterogeneous information available on online product pages. Our evaluation demonstrates that our hybrid approach outperforms independently trained retrievers by 10.95% (sparse) and 2.7% (dense) in MRR@5 score. Moreover, our model offers better interpretability and performs comparably to state-of-the-art cross encoders while reducing response time by 30% (latency) and cutting computational load by approximately 38% (FLOPs).
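For readers unfamiliar with the metric quoted above, MRR@5 averages, over all queries, the reciprocal rank of the first relevant item found within the top 5 results (counting 0 when none appears). The sketch below uses the standard definition; the function name `mrr_at_k` and the toy relevance flags are illustrative and not taken from the paper's evaluation code.

```python
from typing import List, Sequence


def mrr_at_k(ranked_relevance: Sequence[Sequence[int]], k: int = 5) -> float:
    """Mean reciprocal rank, truncated at rank k.

    ranked_relevance holds one list per query; each list contains 0/1
    relevance flags in the order the ranker returned the candidates.
    """
    if not ranked_relevance:
        return 0.0
    total = 0.0
    for flags in ranked_relevance:
        for rank, relevant in enumerate(flags[:k], start=1):
            if relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_relevance)


# Toy example (hypothetical flags): first hit at rank 2, at rank 1,
# and at rank 6 (outside the top 5, so it contributes 0).
example = [[0, 1, 0], [1, 0], [0, 0, 0, 0, 0, 1]]
score = mrr_at_k(example, k=5)
print(score)
```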

Paper Structure

This paper contains 19 sections, 5 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Existing neural rankers with different interaction schemes.
  • Figure 2: The proposed hybrid information ranker.
  • Figure 3: Ranking results of our hybrid ranker on heterogeneous evidence sources.
  • Figure 4: MRR@5 with regular (dashed) and source-scaled (solid) interaction scores at different semantic and lexical matching combinations.
  • Figure 5: Results of answer generation.