A General Framework for Producing Interpretable Semantic Text Embeddings

Yiqun Sun; Qiang Huang; Yixuan Tang; Anthony K. H. Tung; Jun Yu

TL;DR

This work introduces CQG-MBQA (Contrastive Question Generation - Multi-task Binary Question Answering), a general framework for producing interpretable semantic text embeddings across diverse tasks, and demonstrates that it delivers embedding quality comparable to many advanced black-box models while remaining inherently interpretable.

Abstract

Semantic text embedding is essential to many tasks in Natural Language Processing (NLP). While black-box models can generate high-quality embeddings, their lack of interpretability limits their use in tasks that demand transparency. Recent approaches have improved interpretability by leveraging domain-expert-crafted or LLM-generated questions, but these methods rely heavily on expert input or careful prompt design, which restricts their generalizability and their ability to generate discriminative questions across a wide range of tasks. To address these challenges, we introduce CQG-MBQA (Contrastive Question Generation - Multi-task Binary Question Answering), a general framework for producing interpretable semantic text embeddings across diverse tasks. Our framework systematically generates highly discriminative, low-cognitive-load yes/no questions through the CQG method and answers them efficiently with the MBQA model, yielding interpretable embeddings in a cost-effective manner. We validate the effectiveness and interpretability of CQG-MBQA through extensive experiments and ablation studies, demonstrating that it delivers embedding quality comparable to many advanced black-box models while remaining inherently interpretable. Additionally, CQG-MBQA outperforms other interpretable text embedding methods across various downstream tasks.
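The core idea of the abstract, an embedding whose dimensions are answers to yes/no questions, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the three questions, the probability values, the default threshold `tau = 0.5`, and the helper names `binarize`, `explain`, and `cosine` are hypothetical and are not taken from the paper. In the actual framework the questions would come from CQG and the answers from the MBQA model.

```python
from typing import List, Sequence

# Hypothetical yes/no questions; in the framework these would be produced by CQG.
QUESTIONS = [
    "Is the text about sports?",
    "Does the text mention a financial transaction?",
    "Does the text express a negative opinion?",
]


def binarize(probabilities: Sequence[float], tau: float = 0.5) -> List[int]:
    """Turn per-question 'yes' probabilities into a 0/1 embedding.

    The threshold tau and the >= comparison are assumptions of this sketch.
    """
    return [1 if p >= tau else 0 for p in probabilities]


def explain(embedding: Sequence[int], questions: Sequence[str]) -> List[str]:
    """Return the questions answered 'yes', i.e. the active dimensions."""
    return [q for q, bit in zip(questions, embedding) if bit == 1]


def cosine(u: Sequence[int], v: Sequence[int]) -> float:
    """Cosine similarity between two embeddings (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up probabilities standing in for a question-answering model's output.
probs_a = [0.91, 0.12, 0.67]
probs_b = [0.85, 0.40, 0.20]

emb_a = binarize(probs_a)
emb_b = binarize(probs_b)

print(emb_a, explain(emb_a, QUESTIONS))
print(emb_b, explain(emb_b, QUESTIONS))
print(round(cosine(emb_a, emb_b), 4))
```

Because each dimension corresponds to a readable question, the similarity between two texts can be traced back to the questions both answer "yes", which is the kind of transparency a black-box embedding does not offer.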

Paper Structure (46 sections, 4 equations, 5 figures, 9 tables)

Introduction
Related Work
Black-box Embedding
Interpretable Embedding
Interpretable Text Embedding Framework
Question Generation
Motivations
Contrastive Question Generation (CQG)
Post-Processing
Question Answering
Motivations
Multi-task Binary Question Answering (MBQA) Model
Remarks
Experiments
Metrics
...and 31 more sections

Figures (5)

Figure 1: An overview of the CQG-MBQA framework.
Figure 2: Illustration of the CQG method.
Figure 3: Case study.
Figure 4: Spearman correlation and cognitive load vs. the number of dimensions $m$. Higher Spearman correlation signals better embedding quality; lower cognitive load implies greater interpretability.
Figure 5: Spearman correlation and cognitive load vs. the binary classification threshold $\tau$.

Theorems & Definitions (1)

Example 1
