Interpretable Cross-Examination Technique (ICE-T): Using highly informative features to boost LLM performance

Goran Muric; Ben Delay; Steven Minton

TL;DR

ICE-T addresses the need for interpretable yet high-performing classification with Large Language Models in high-stakes domains like medicine and law. It leverages a primary question plus four secondary yes/no questions per document, converts the LLM responses into a low-dimensional feature vector, and trains a traditional classifier on these features. Across 17 diverse datasets, ICE-T consistently outperforms zero-shot baselines and enables smaller models to match or exceed the performance of larger zero-shot systems. This approach enhances transparency and reproducibility, offering a practical pathway to deploy AI in regulated environments while providing actionable reasoning traces for auditing and validation.

Abstract

In this paper, we introduce the Interpretable Cross-Examination Technique (ICE-T), a novel approach that leverages structured multi-prompt techniques with Large Language Models (LLMs) to improve classification performance over zero-shot and few-shot methods. In domains where interpretability is crucial, such as medicine and law, standard models often fall short due to their "black-box" nature. ICE-T addresses these limitations by using a series of generated prompts that allow an LLM to approach the problem from multiple directions. The responses from the LLM are then converted into numerical feature vectors and processed by a traditional classifier. This method not only maintains high interpretability but also allows smaller, less capable models to match or exceed the performance of larger, more advanced models under zero-shot conditions. We demonstrate the effectiveness of ICE-T across a diverse set of data sources, including medical records and legal documents, consistently surpassing the zero-shot baseline on classification metrics such as F1 score. Our results indicate that ICE-T can improve both the performance and the transparency of AI applications in complex decision-making environments.
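The pipeline described above (yes/no answers, numerical feature vectors, a traditional classifier) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the answer-to-number mapping in `ANSWER_VALUES`, the `verbalize` heuristic, the example answers, and the labels are all hypothetical, and a tiny nearest-centroid classifier stands in for whichever traditional classifier is actually trained, so that the sketch needs only the Python standard library.

```python
from statistics import mean

# Assumed numeric encoding of verbalized answers (illustrative values only).
ANSWER_VALUES = {"yes": 1.0, "no": 0.0, "unknown": 0.5}


def verbalize(raw_answer):
    """Reduce a free-text LLM reply to 'yes', 'no', or 'unknown' (simple heuristic)."""
    words = raw_answer.lower().replace(",", " ").replace(".", " ").split()
    first = words[0] if words else ""
    return first if first in ("yes", "no") else "unknown"


def to_features(raw_answers):
    """Turn the answers to the primary and secondary questions into one feature vector."""
    return [ANSWER_VALUES[verbalize(answer)] for answer in raw_answers]


class NearestCentroid:
    """Stand-in classifier: predicts the label whose mean feature vector is closest."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, lab in zip(X, y) if lab == label]
            self.centroids[label] = [mean(col) for col in zip(*rows)]
        return self

    def predict(self, X):
        def sq_dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            min(self.centroids, key=lambda lab: sq_dist(x, self.centroids[lab]))
            for x in X
        ]


# Hypothetical LLM replies: one primary and four secondary questions per document.
train_answers = [
    ["Yes.", "Yes", "No", "Yes, clearly", "yes"],
    ["No", "no", "No.", "I cannot tell", "no"],
    ["Yes", "No", "Yes", "Yes", "Yes"],
    ["No", "No", "Yes", "No", "No"],
]
train_labels = [1, 0, 1, 0]

# Training: answers -> feature vectors -> classifier.
clf = NearestCentroid().fit([to_features(a) for a in train_answers], train_labels)

# Inference: the same questions are asked of a new document.
new_answers = ["Yes", "Unsure", "No", "Yes", "No"]
prediction = clf.predict([to_features(new_answers)])
print(prediction)
```

Because each feature corresponds to one named question, a prediction can be traced back to the answers that produced it, which is the interpretability property the abstract emphasizes.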

Paper Structure (25 sections, 3 equations, 4 figures, 4 tables, 1 algorithm)

Introduction
Motivation
Related Work
Prompting techniques
In-context learning
Model interpretability
Method
Generating questions
Prompting LLM
Verbalizing the answers
Training a classifier
Data
Clinical trials
Catalonia Independence Corpus
Climate Detection Corpus
...and 10 more sections

Figures (4)

Figure 1: Illustration of training and inference process in ICE-T. In the training phase, the process begins by generating questions to prompt an LLM, which then provides yes/no answers. These answers are verbalized and converted into numerical feature vectors. A classifier is trained using these vectors along with their respective labels. During inference, the LLM is prompted with the same questions, and the answers are similarly processed to predict outcomes using the trained classifier.
Figure 2: Comparative performance of ICE-T-enhanced GPT-3.5 versus zero-shot GPT-4. The figure illustrates the $\mu F1$ achieved by GPT-3.5 utilizing the ICE-T technique and GPT-4 in a zero-shot setting across multiple datasets.
Figure 3: Sensitivity Analysis of Feature Count on $\mu F1$ Score. The figure illustrates the effect of increasing the number of features (secondary questions) on the $\mu F1$ score across 17 datasets. The solid orange line represents the average $\mu F1$ score, and the shaded area indicates one standard deviation around the mean across 100 repetitions. The graph demonstrates a consistent improvement in $\mu F1$ as more features are added, with key points of increase highlighted at specific feature counts.
Figure 4: Task-Specific Sensitivity Analysis of Feature Count on $\mu F1$ Score. Detailed view of the changes in the $\mu F1$ score for individual tasks as the number of secondary questions increases. Each plot represents one of the 17 datasets analyzed, showing how the micro F1 score varies with the addition of features. The data underscores the variability in performance improvements across different tasks when using the Random Forest classifier.
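The $\mu F1$ metric and the mean and standard-deviation band mentioned in the captions can be computed along the following lines. This is a sketch under assumptions: it uses the standard definition of micro-averaged F1 pooled over all classes (the source does not state which classes are pooled), the sample standard deviation, and made-up labels and per-repetition scores chosen purely for illustration.

```python
from statistics import mean, stdev


def micro_f1(y_true, y_pred, labels):
    """Micro-averaged F1: pool true/false positives and false negatives over all labels."""
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical gold labels and predictions for a binary task.
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
score = micro_f1(y_true, y_pred, labels=[0, 1])

# Hypothetical scores from repeated runs at one feature count:
# the plotted line would be the mean, the shaded band mean +/- one standard deviation.
repetition_scores = [0.70, 0.75, 0.80]
center = mean(repetition_scores)
spread = stdev(repetition_scores)
band = (center - spread, center + spread)
print(score, center, band)
```

When every document receives exactly one label and all classes are pooled, micro-averaged F1 coincides with accuracy, which is why the example score equals the fraction of correct predictions.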
