An Information-Theoretic Framework for Comparing Voice and Text Explainability

Mona Rajhans; Vishal Khawarey

TL;DR

The paper addresses how explanation modality (voice vs text) shapes user understanding and trust by modeling explanation delivery as an information‑transmission channel. It defines $I_M$ (information retention), $CE$ (comprehension efficiency), and $TCE$ (trust calibration error), and introduces a composite score $\Phi$ to balance accuracy and trust. Using a Python-based simulation with synthetic SHAP attributions across explanation styles, the authors show text explanations maximize information retention and efficiency, while voice explanations improve trust calibration, with analogy-based delivery offering a practical middle ground. The framework is reproducible and extensible to real SHAP/LIME outputs on open datasets, providing a theoretical and practical basis for designing multimodal XAI systems.

Abstract

Explainable Artificial Intelligence (XAI) aims to make machine learning models transparent and trustworthy, yet most current approaches communicate explanations visually or through text. This paper introduces an information-theoretic framework for analyzing how explanation modality, specifically voice versus text, affects user comprehension and trust calibration in AI systems. The proposed model treats explanation delivery as a communication channel between model and user, characterized by metrics for information retention ($I_M$), comprehension efficiency ($CE$), and trust calibration error ($TCE$). A simulation framework implemented in Python was developed to evaluate these metrics using synthetic SHAP-based feature attributions across multiple modality-style configurations (brief, detailed, and analogy-based). Results demonstrate that text explanations achieve higher comprehension efficiency, while voice explanations yield improved trust calibration, with analogy-based delivery achieving the best overall trade-off. This framework provides a reproducible foundation for designing and benchmarking multimodal explainability systems and can be extended to empirical studies using real SHAP or LIME outputs on open datasets such as the UCI Credit Approval or Kaggle Financial Transactions datasets.
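
To make the trade-off between comprehension and trust concrete, the following is a minimal Python sketch of how a composite score could combine $CE$ and $TCE$. It is not the authors' implementation: the function names, the mean-absolute-gap form of $TCE$, the linear form of $\Phi$, the weight `lam1`, and the input numbers are all assumptions made for illustration. The summary above only establishes that $\Phi$ balances the two metrics and that a weight $\lambda_2$ controls the emphasis on trust calibration.

```python
from statistics import mean


def trust_calibration_error(user_trust, model_correct):
    """Mean absolute gap between stated trust (0..1) and model correctness (0 or 1).

    Assumed definition for illustration; the paper's exact TCE formula
    is not reproduced here.
    """
    if len(user_trust) != len(model_correct):
        raise ValueError("inputs must have the same length")
    return mean(abs(t - c) for t, c in zip(user_trust, model_correct))


def composite_score(ce, tce, lam1=1.0, lam2=1.0):
    """Assumed linear trade-off: reward comprehension efficiency (ce),
    penalize trust calibration error (tce) with weight lam2."""
    return lam1 * ce - lam2 * tce


# Made-up inputs, not results from the paper.
tce = trust_calibration_error([0.9, 0.4, 0.7], [1, 0, 1])
phi = composite_score(ce=0.6, tce=tce, lam2=0.5)
```

Under this assumed form, raising `lam2` lowers the score of any configuration with a large calibration error, which is the kind of sensitivity the framework examines when varying $\lambda_2$.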

Paper Structure (20 sections, 6 equations, 7 figures)

Introduction
Methods
Overview
Information-Retention Model
Cognitive-Load Function
Comprehension Efficiency
Trust Calibration
Overall Evaluation Function
Simulation Protocol
Implementation and Evaluation Environment
Results and Discussion
Trade-off between Comprehension and Trust
Distribution of All Samples
Composite Quality Metric
Interpretation and Implications
...and 5 more sections

Figures (7)

Figure 1: Conceptual architecture of the proposed modality-aware explainability framework. Model outputs are first processed through SHAP or LIME to obtain attribution vectors ($A$), which are encoded into modality-specific explanations via the explanation encoder $E(A,M,S)$. The resulting message, delivered as text or synthesized voice, forms a communication channel with the user, whose understanding ($U$) is analyzed using information-theoretic metrics ($I_M$, $CE$, $TCE$, and $\Phi$).
Figure 2: Mean trade-off between Comprehension Efficiency (CE) and Trust Calibration Error (TCE) for text (blue) and voice (red) explanations. Labels denote explanation style (Brief, Detailed, Analogy). Text yields higher comprehension efficiency; voice yields lower calibration error.
Figure 3: Distribution of simulated Comprehension Efficiency (CE) vs. Trust Calibration Error (TCE) across all samples. Semi-transparent points show individual simulations; stars mark mean values for each modality-style pair.
Figure 4: Composite Explanation Quality $\Phi$ by Modality and Style. Higher values indicate better trade-off between comprehension efficiency ($CE$) and trust calibration ($TCE$). Text-Detailed explanations maximize $CE$, while Voice-Analogy explanations yield the most balanced performance.
Figure 5: Sensitivity of composite score $\Phi(M,S)$ to trust weighting parameter $\lambda_2$. Higher $\lambda_2$ increases the emphasis on trust calibration. Voice-Analogy explanations remain robust across trust priorities, while Text-Detailed explanations dominate when comprehension efficiency is emphasized.
...and 2 more figures
