Evalet: Evaluating Large Language Models by Fragmenting Outputs into Functions

Tae Soo Kim, Heechan Lee, Yoonjoo Lee, Joseph Seering, Juho Kim

TL;DR

This paper introduces functional fragmentation, a method to evaluate LLM outputs by decomposing them into criterion-relevant fragments and interpreting each fragment's function. It is instantiated in Evalet, an interactive system with an Information Panel and Map Visualization that supports inspecting, rating, and comparing fragment-level functions across outputs, enabling finer-grained analysis than holistic scores. The authors demonstrate, through a technical evaluation and a within-subjects user study (N=10), that fragment-level analysis improves misalignment detection and trust calibration, while still enabling holistic overviews. They also provide case studies and a detailed discussion of integration guidelines, limitations, and potential for handling longer outputs, suggesting significant practical impact for qualitative, interactive AI evaluation at scale.

Abstract

Practitioners increasingly rely on Large Language Models (LLMs) to evaluate generative AI outputs through "LLM-as-a-Judge" approaches. However, these methods produce holistic scores that obscure which specific elements influenced the assessments. We propose functional fragmentation, a method that dissects each output into key fragments and interprets the rhetorical functions that each fragment serves relative to evaluation criteria, surfacing the elements of interest and revealing how they fulfill or hinder user goals. We instantiate this approach in Evalet, an interactive system that visualizes fragment-level functions across many outputs to support inspection, rating, and comparison of evaluations. A user study (N=10) found that, while practitioners struggled to validate holistic scores, our approach helped them identify 48% more evaluation misalignments. This helped them calibrate trust in LLM evaluations and rely on them to find more actionable issues in model outputs. Our work shifts LLM evaluation from quantitative scores toward qualitative, fine-grained analysis of model behavior.

Paper Structure

This paper contains 79 sections, 5 equations, 16 figures, 2 tables.

Figures (16)

  • Figure 1: Evalet consists of two main components: (A) Information Panel and (B) Map Visualization. In the Information Panel, users can use the Tab Navigator (C) to switch between managing their input-output dataset, defining their criteria set, and viewing evaluation details. Users can initiate evaluations by clicking on Run Evaluation (D). The Map Visualization helps users explore all fragment-level functions across all outputs, where they can toggle what information is displayed using the Map Controls (E). Each fragment-level function is shown as a dot if rated positive or a cross if negative, and users can hover over these to see the function description (F).
  • Figure 2: In the Database Tab, users can view their dataset of input-output pairs. Each item consists of the input, the output, and an evaluation summary. This summary presents the output's holistic score on each criterion (A) and its list of fragment-level functions (B). Users can see more details by clicking on View Details (C). On the details page, the user selects a criterion to view the relevant evaluations (D). Assessed fragments from the output are highlighted in green if positive and orange if negative (E). The bottom of the interface displays the holistic score and justification provided by the LLM (F). By clicking on each fragment, users can view the corresponding function description (G) and the evaluator's reasoning in detail (H).
  • Figure 3: Users can explore the clusters and fragment-level functions through both the Map Visualization (A) and Explore Tab (B). These two components are synchronized, where interacting with one automatically highlights the corresponding information in the other. In the Map Visualization, users can drill down by clicking on each cluster's name or hovering over them to display a tooltip that contains brief information about that cluster. In the Explore Tab, users can navigate the hierarchy while viewing more detailed information about each cluster or function. Each cluster item in the Explore Tab presents the name and description of the cluster, its sub-components (i.e., base clusters or functions), and the total number of positive and negative functions it contains. Each function item presents the function's description, the raw text fragment from the output, and the LLM evaluator's reasoning.
  • Figure 4: Users can view only the selected fragment-level functions in the Selected Entries mode (A). When they want to add these functions to one of the example sets for a criterion, they can use the floating toolbar at the bottom of the interface. Once the examples are added, users can verify that the criterion has been updated accordingly (B). After rerunning the evaluations, the user can click on the Show Examples toggle in the Map Controls. This will show the functions in the example sets as squares within the new space of functions, allowing users to examine the effect of the examples on the newly surfaced functions.
  • Figure 5: Comparisons of the main interface components across the study conditions. (A) The Fragmented condition's Details Tab displays the list of fragment-level functions for each output, while the Holistic condition shows a label that summarizes the holistic justification for that output. (B) In evaluation details, the Fragmented condition shows the function label, rating, and evaluation justification for each fragment, but does not show the holistic justification. The Holistic condition highlights the evaluated fragments, but only presents the holistic justification and score. (C) Both conditions feature the Map Visualization. In the Holistic condition, however, each point represents a whole output based on the embedding of the holistic evaluation label.
  • ...and 11 more figures
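
The captions above imply a simple record for each fragment-level function: the raw fragment, a function description, a positive or negative rating, and the evaluator's reasoning, with clusters reporting positive and negative totals. The following is a minimal sketch of that shape, not Evalet's actual implementation. The names `FragmentFunction`, `map_marker`, `highlight_colour`, and `cluster_counts`, as well as the sample data, are hypothetical and introduced only for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Literal

Rating = Literal["positive", "negative"]


@dataclass(frozen=True)
class FragmentFunction:
    """One fragment-level function (hypothetical record, fields taken from the captions)."""
    criterion: str    # evaluation criterion the fragment was assessed against
    fragment: str     # raw text fragment from the output
    description: str  # what the fragment does relative to the criterion
    rating: Rating    # positive or negative
    reasoning: str    # the LLM evaluator's justification


def map_marker(fn: FragmentFunction) -> str:
    """Marker shape per the Figure 1 caption: dot if positive, cross if negative."""
    return "dot" if fn.rating == "positive" else "cross"


def highlight_colour(fn: FragmentFunction) -> str:
    """Highlight colour per the Figure 2 caption: green if positive, orange if negative."""
    return "green" if fn.rating == "positive" else "orange"


def cluster_counts(functions: list[FragmentFunction]) -> dict[str, int]:
    """Totals of positive and negative functions, as shown for each cluster item."""
    counts = Counter(fn.rating for fn in functions)
    return {"positive": counts["positive"], "negative": counts["negative"]}


# Illustrative data only; not taken from the paper.
example = [
    FragmentFunction(
        criterion="Clarity",
        fragment="In short, restart the service.",
        description="Summarises the fix in one sentence",
        rating="positive",
        reasoning="Gives the reader a direct takeaway.",
    ),
    FragmentFunction(
        criterion="Clarity",
        fragment="As is well known to experts...",
        description="Assumes unstated background knowledge",
        rating="negative",
        reasoning="Leaves novice readers without context.",
    ),
]
```

Keeping the rating on each fragment rather than on the whole output is what lets counts and markers be derived per cluster, which is the contrast the Figure 5 caption draws with the Holistic condition.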