Evalet: Evaluating Large Language Models by Fragmenting Outputs into Functions

Tae Soo Kim; Heechan Lee; Yoonjoo Lee; Joseph Seering; Juho Kim

TL;DR

This paper introduces functional fragmentation, a method to evaluate LLM outputs by decomposing them into criterion-relevant fragments and interpreting each fragment's function. It is instantiated in Evalet, an interactive system with an Information Panel and Map Visualization that supports inspecting, rating, and comparing fragment-level functions across outputs, enabling finer-grained analysis than holistic scores. The authors demonstrate, through a technical evaluation and a within-subjects user study (N=10), that fragment-level analysis improves misalignment detection and trust calibration, while still enabling holistic overviews. They also provide case studies and discuss integration guidelines, limitations, and the potential for handling longer outputs, positioning the approach for qualitative, interactive AI evaluation at scale.

Abstract

Practitioners increasingly rely on Large Language Models (LLMs) to evaluate generative AI outputs through "LLM-as-a-Judge" approaches. However, these methods produce holistic scores that obscure which specific elements influenced the assessments. We propose functional fragmentation, a method that dissects each output into key fragments and interprets the rhetorical functions that each fragment serves relative to evaluation criteria, surfacing the elements of interest and revealing how they fulfill or hinder user goals. We instantiate this approach in Evalet, an interactive system that visualizes fragment-level functions across many outputs to support inspection, rating, and comparison of evaluations. A user study (N=10) found that, while practitioners struggled to validate holistic scores, our approach helped them identify 48% more evaluation misalignments. This helped them calibrate trust in LLM evaluations and rely on them to find more actionable issues in model outputs. Our work shifts LLM evaluation from quantitative scores toward qualitative, fine-grained analysis of model behavior.
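
To make the idea of fragment-level functions concrete, the following is a minimal sketch of how such evaluations might be represented and summarized across outputs. It is not the authors' implementation: the `Fragment` schema, the function labels, the example data, and the `holistic_score` rule (share of positively rated fragments) are all hypothetical assumptions chosen for illustration, and the step where an LLM extracts and labels fragments is omitted.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """One criterion-relevant span of a model output (hypothetical schema)."""
    output_id: str
    text: str        # span extracted from the output
    function: str    # short label for what the span does relative to the criterion
    criterion: str
    positive: bool   # True if the function fulfills the criterion, False if it hinders it


def summarize_by_function(fragments):
    """Group fragments by (criterion, function) and count positive/negative ratings."""
    summary = defaultdict(lambda: {"positive": 0, "negative": 0, "outputs": set()})
    for frag in fragments:
        entry = summary[(frag.criterion, frag.function)]
        entry["positive" if frag.positive else "negative"] += 1
        entry["outputs"].add(frag.output_id)
    return dict(summary)


def holistic_score(fragments, output_id):
    """Assumed aggregation: share of an output's fragments rated positive."""
    own = [f for f in fragments if f.output_id == output_id]
    if not own:
        return None  # no fragments were extracted for this output
    return sum(f.positive for f in own) / len(own)


# Hand-written example data; in practice an LLM evaluator would produce these.
fragments = [
    Fragment("out-1", "Try restarting the router first.",
             "gives a concrete next step", "Helpfulness", True),
    Fragment("out-1", "It could be many things.",
             "offers a vague diagnosis", "Helpfulness", False),
    Fragment("out-2", "Open Settings and reset the network adapter.",
             "gives a concrete next step", "Helpfulness", True),
]

summary = summarize_by_function(fragments)
score_out1 = holistic_score(fragments, "out-1")
```

Grouping by function rather than by output is what lets a reviewer compare how the same behavior is rated across many outputs, while the per-output score remains available as an overview.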
