HEISIR: Hierarchical Expansion of Inverted Semantic Indexing for Training-free Retrieval of Conversational Data using LLMs

Sangyeop Kim, Hangyeul Lee, Yohan Lee

TL;DR

HEISIR tackles retrieval from conversational data without labeled training data by shifting semantic indexing to the data ingestion phase and constructing hierarchical Subject-Verb-Object-Adjunct (SVOA) quadruplets. Its two-step index construction, Hierarchical Triplets Formulation followed by Adjunct Augmentation, yields compact, semantically rich inverted indices that enable efficient retrieval with a multi-component scoring scheme. The final score combines conversation-level similarity with component-wise matches (conversation, message, SV, SVO, SVOA) via $S_{HEISIR} = S_{conv} + \sum_{c \in C} S_c$. With this score, HEISIR achieves competitive or superior results across diverse embeddings and LLMs, including GPT-3.5-turbo with OpenAI-large. The approach offers practical benefits in latency, cost, and interpretability, and supports intent and topic analysis without fine-tuning, which makes it suitable for production dialogue systems. The paper also notes limitations in extending the method to multilingual and document-domain data. Overall, HEISIR demonstrates that optimized data ingestion and structured semantic indices can deliver high-performance, training-free retrieval for conversational data.
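To make the additive score $S_{HEISIR} = S_{conv} + \sum_{c \in C} S_c$ concrete, here is a minimal Python sketch. It is not the authors' implementation. Several choices are assumptions made only for illustration: cosine similarity as the similarity measure, taking the best-matching index entry as each component score $S_c$, and the hypothetical helper names `cosine`, `component_score`, and `heisir_score`. The summary does not state how each $S_c$ is computed or exactly which components belong to $C$, so the component set is passed in by the caller.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def component_score(query: Vector, entries: List[Vector]) -> float:
    # Assumption: S_c is the similarity of the best-matching index entry.
    # The source does not specify the aggregation, so this is illustrative.
    return max((cosine(query, e) for e in entries), default=0.0)


def heisir_score(
    query: Vector,
    conversation: Vector,
    components: Dict[str, List[Vector]],
) -> float:
    """S_HEISIR = S_conv + sum of S_c over the supplied components."""
    s_conv = cosine(query, conversation)
    return s_conv + sum(
        component_score(query, entries) for entries in components.values()
    )


if __name__ == "__main__":
    # Toy 2-D embeddings, invented for this example.
    score = heisir_score(
        query=[1.0, 0.0],
        conversation=[1.0, 0.0],
        components={"SV": [[1.0, 0.0], [0.0, 1.0]], "SVO": [[0.0, 1.0]], "SVOA": []},
    )
    print(score)
```

Because the score is a plain sum, each term can be inspected on its own, which is one way to read the interpretability benefit mentioned above.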

Abstract

The growth of conversational AI services has increased demand for effective information retrieval from dialogue data. However, existing methods often face challenges in capturing semantic intent or require extensive labeling and fine-tuning. This paper introduces HEISIR (Hierarchical Expansion of Inverted Semantic Indexing for Retrieval), a novel framework that enhances semantic understanding in conversational data retrieval through optimized data ingestion, eliminating the need for resource-intensive labeling or model adaptation. HEISIR implements a two-step process: (1) Hierarchical Triplets Formulation and (2) Adjunct Augmentation, creating semantic indices consisting of Subject-Verb-Object-Adjunct (SVOA) quadruplets. This structured representation effectively captures the underlying semantic information from dialogue content. HEISIR achieves high retrieval performance while maintaining low latency during the actual retrieval process. Our experimental results demonstrate that HEISIR outperforms fine-tuned models across various embedding types and language models. Beyond improving retrieval capabilities, HEISIR also offers opportunities for intent and topic analysis in conversational data, providing a versatile solution for dialogue systems.
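As a rough illustration of what an inverted semantic index over SVOA quadruplets could look like, the sketch below maps each hierarchy level (SV, SVO, SVOA) to the messages that contain it. Everything here is a hypothetical simplification: the `Quadruplet` class, the `build_inverted_index` function, the plain-string keys, and the example sentences are invented for this example. The LLM-based extraction steps (Hierarchical Triplets Formulation and Adjunct Augmentation) are not shown, and the quadruplets are assumed to have been extracted already.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Quadruplet:
    subject: str
    verb: str
    obj: str
    adjunct: str

    def levels(self) -> Dict[str, str]:
        """Return the hierarchical expansions SV, SVO, and SVOA as strings."""
        sv = f"{self.subject} {self.verb}"
        svo = f"{sv} {self.obj}"
        svoa = f"{svo} {self.adjunct}"
        return {"SV": sv, "SVO": svo, "SVOA": svoa}


def build_inverted_index(
    extracted: Dict[str, List[Quadruplet]],
) -> Dict[str, Dict[str, Set[str]]]:
    """Map level -> expansion text -> set of message ids containing it."""
    index: Dict[str, Dict[str, Set[str]]] = {
        "SV": defaultdict(set),
        "SVO": defaultdict(set),
        "SVOA": defaultdict(set),
    }
    for message_id, quadruplets in extracted.items():
        for quad in quadruplets:
            for level, text in quad.levels().items():
                index[level][text].add(message_id)
    return index


if __name__ == "__main__":
    # Quadruplets assumed to come from an upstream LLM extraction step.
    extracted = {
        "m1": [Quadruplet("user", "wants", "refund", "for a damaged item")],
        "m2": [Quadruplet("user", "wants", "replacement", "by Friday")],
    }
    index = build_inverted_index(extracted)
    print(sorted(index["SV"]["user wants"]))
    print(sorted(index["SVO"]["user wants refund"]))
```

In this toy index the coarse SV key groups both messages, while the SVO and SVOA keys separate them. A real system would presumably embed these expansions rather than match exact strings.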

Paper Structure

This paper contains 34 sections, 3 equations, 4 figures, 20 tables.

Figures (4)

  • Figure 1: Architecture of HEISIR framework: Data Ingestion Phase
  • Figure 2: Phrase Structure Grammar and Constituents
  • Figure 3: 2-Step Expansion Process of HEISIR
  • Figure 4: Marginal Performance of Components