MFBE: Leveraging Multi-Field Information of FAQs for Efficient Dense Retrieval

Debopriyo Banerjee, Mausam Jain, Ashish Kulkarni

TL;DR

This paper addresses FAQ retrieval in NLP, where lexical gaps and limited labeled data hinder effective matching between user queries and FAQs. It proposes MFBE, a multi-field bi-encoder that exploits various FAQ fields (e.g., question, answer, categories) and uses contrastive learning with extended positive/negative pairs, plus in-batch negatives, to learn a robust semantic space. Through LaBSE-based dual-branch encoding and field-specific representations, MFBE pre-computes FAQ embeddings for fast inference and selects the best field representation for retrieval. Empirical evaluation on internal and open datasets shows substantial top-1 accuracy gains over baselines, including strong cross-domain and multi-domain performance, indicating MFBE’s practicality for low-resource and cold-start FAQ systems. The approach offers a scalable, multilingual solution with improved retrieval quality and reduced latency for real-world customer support deployments.
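
The in-batch negatives idea mentioned above can be illustrated with a small NumPy sketch. This is not the authors' training code: the function name `in_batch_contrastive_loss`, the cosine-similarity scoring, and the `temperature` value are assumptions chosen for illustration. It shows the common formulation in which, for each query in a batch, the paired FAQ is the positive and every other FAQ in the same batch acts as a negative.

```python
import numpy as np

def in_batch_contrastive_loss(query_emb, faq_emb, temperature=0.05):
    """Softmax cross-entropy with in-batch negatives (illustrative sketch).

    query_emb, faq_emb: arrays of shape (batch, dim); row i of faq_emb is the
    positive FAQ for row i of query_emb, and all other rows are negatives.
    The temperature value is an assumption, not taken from the paper.
    """
    # L2-normalise so that the dot product is cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    f = faq_emb / np.linalg.norm(faq_emb, axis=1, keepdims=True)

    # (batch, batch) similarity matrix; the diagonal holds the positive pairs.
    logits = q @ f.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Average negative log-likelihood of the positives.
    return float(-np.mean(np.diag(log_probs)))
```

The loss is small when each query is closest to its own FAQ and grows when a query is closer to another FAQ in the batch, which is what pushes matching pairs together in the shared embedding space.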

Abstract

In the domain of question answering in NLP, the retrieval of Frequently Asked Questions (FAQs) is an important, well-researched sub-area that has been studied for many languages. In response to a user query, a retrieval system typically returns the relevant FAQs from a knowledge base. The efficacy of such a system depends on its ability to establish a semantic match between the query and the FAQs in real time. The task is challenging because of the inherent lexical gap between queries and FAQs, the lack of sufficient context in FAQ titles, the scarcity of labeled data, and high retrieval latency. In this work, we propose a bi-encoder-based query-FAQ matching model that leverages multiple combinations of FAQ fields (such as question, answer, and category) during both model training and inference. Our proposed Multi-Field Bi-Encoder (MFBE) model benefits from the additional context provided by multiple FAQ fields and performs well even with minimal labeled data. We support this claim empirically through experiments on proprietary and open-source public datasets in both unsupervised and supervised settings. Our model achieves around 27% and 20% better top-1 accuracy on the FAQ retrieval task on internal and open datasets, respectively, over the best-performing baseline.
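
As a rough illustration of using multiple FAQ field combinations at inference time, the sketch below pre-computes one embedding per FAQ per field combination and scores a query by its best-matching combination. Everything here is an assumption made for illustration: `toy_encode` is a hashed bag-of-words stand-in (the paper uses a LaBSE-based encoder), `FIELD_COMBOS` is a hypothetical choice of combinations, and taking the maximum over combinations is only one plausible reading of how the best field representation is selected.

```python
import zlib
import numpy as np

DIM = 4096  # size of the toy embedding space (assumption)

# Hypothetical field combinations; the paper's exact set may differ.
FIELD_COMBOS = [("question",), ("question", "answer"), ("question", "category")]

def toy_encode(text):
    """Stand-in encoder: hashed bag-of-words, L2-normalised. Not LaBSE."""
    vec = np.zeros(DIM)
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def precompute(faqs):
    """Offline step: embeddings of shape (num_faqs, num_combos, DIM)."""
    return np.stack([
        np.stack([
            toy_encode(" ".join(faq[field] for field in combo))
            for combo in FIELD_COMBOS
        ])
        for faq in faqs
    ])

def retrieve(query, faq_embs, top_k=1):
    """Online step: encode the query once, then rank FAQs by their
    best-scoring field combination (cosine similarity)."""
    q = toy_encode(query)
    scores = faq_embs @ q          # (num_faqs, num_combos)
    best = scores.max(axis=1)      # best field combination per FAQ
    order = np.argsort(-best)[:top_k]
    return [(int(i), float(best[i])) for i in order]

faqs = [
    {"question": "reset my password",
     "answer": "open settings and choose reset password",
     "category": "account"},
    {"question": "track my order",
     "answer": "open orders and choose track shipment",
     "category": "shipping"},
]
faq_embs = precompute(faqs)
results = retrieve("how do i reset password", faq_embs)
```

Because the FAQ side is embedded ahead of time, only the query needs to be encoded per request, which is the source of the latency advantage of a bi-encoder over models that must process each query-FAQ pair jointly.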

Paper Structure

This paper contains 5 sections, 7 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Example of an FAQ in the Japanese language (JA).
  • Figure 2: Illustration of how the MFBE model works across its stages: training, pre-computation, and inference.
  • Figure 3: Ablation experiments with the MFBE$_{sup^*}$ model: (a) variation across multi-field combinations on internal datasets; (b) variation across multi-field combinations on open datasets; (c) variation in the number of training query-FAQ pairs.