iScore: Visual Analytics for Interpreting How Language Models Automatically Score Summaries

Adam Coscia; Langdon Holmes; Wesley Morris; Joon Suh Choi; Scott Crossley; Alex Endert

TL;DR

iScore is an interactive visual analytics tool that lets learning engineers upload, score, and compare multiple summaries simultaneously. Qualitative interviews with the learning engineers revealed how iScore enabled them to understand, evaluate, and build trust in their LLMs during deployment.

Abstract

The recent explosion in popularity of large language models (LLMs) has inspired learning engineers to incorporate them into adaptive educational tools that automatically score summary writing. Understanding and evaluating LLMs is vital before deploying them in critical learning environments, yet their unprecedented size and expanding number of parameters inhibit transparency and impede trust when they underperform. Through a collaborative user-centered design process with several learning engineers building and deploying summary scoring LLMs, we characterized fundamental design challenges and goals around interpreting their models, including aggregating large text inputs, tracking score provenance, and scaling LLM interpretability methods. To address their concerns, we developed iScore, an interactive visual analytics tool for learning engineers to upload, score, and compare multiple summaries simultaneously. Tightly integrated views allow users to iteratively revise the language in summaries, track changes in the resulting LLM scores, and visualize model weights at multiple levels of abstraction. To validate our approach, we deployed iScore with three learning engineers over the course of a month. We present a case study where interacting with iScore led a learning engineer to improve their LLM's score accuracy by three percentage points. Finally, we conducted qualitative interviews with the learning engineers that revealed how iScore enabled them to understand, evaluate, and build trust in their LLMs during deployment.

Paper Structure (27 sections, 5 figures, 1 table)

Introduction
Related Work
Automatically Scoring Summary Writing
Modeling Language With Transformers
Interpreting ML Using Visual Analytics
Design Process
Background: Summary Scoring LLMs
iTELL: Textbooks That Score Summaries
Sources, Summaries and Scores
Design Challenges and User Tasks
The iScore System
Assignments Panel
Scores Dashboard
Model Analysis View
Input Perturbation
...and 12 more sections

Figures (5)

Figure 1: An example of a textbook source, learner summaries and expert scores used to train the LLMs visualized in iScore. Content and Wording LLMs each assign a continuous score that represents components of an analytic rubric. Learning engineers seek to characterize how changes in scores relate to differences in summaries via comparison. iScore provides inputs for multiple summaries per source and visualizes their predicted scores simultaneously in context of the "ground truth" training data.
Figure 2: iScore visualizes multiple LLM-scored writing samples to help learning engineers interpret model performance. Above, a learning engineer interprets how two plagiarized summaries are scored across two runs (Sect. \ref{sec:usage}). Users can upload, score and manually revise and re-score multiple source/summary pairs simultaneously in the Assignments Panel (A), visually track how scores change across revisions in the context of expert-scored LLM training data in the Scores Dashboard (B), and compare model weights between words across model layers/heads, as well as differences in scores between automatically revised summary perturbations, using two model interpretability methods in the Model Analysis View (C).
Figure 3: Breakdown of the Scores Dashboard. The table and scatter plot help users compare variations on summaries by tracking how scores change across manual revisions.
Figure 4: Breakdown of the Input Perturbation visualization. Multiple perturbation methods help users test hundreds of different kinds of revisions at scale by automatically applying and re-scoring summaries for them.
Figure 5: Breakdown of the Token Attention visualization. The combination of heat maps, rug plot and text underlining helps users make sense of complex model behaviors by keeping them in the loop at multiple levels of abstraction.
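
The perturb-and-re-score idea described in the Figure 4 caption can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `drop_sentence` is a hypothetical perturbation, `toy_score` is a simple word-overlap stand-in rather than the Content or Wording LLM, and `perturb_and_rescore` is not iScore's actual implementation. The sketch only shows the general pattern of applying automatic revisions to a summary and recording how the score changes relative to the original.

```python
from typing import Callable, Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter on periods (assumption: good enough for a demo)."""
    return [s.strip() for s in text.split(".") if s.strip()]


def drop_sentence(summary: str, index: int) -> str:
    """Hypothetical perturbation: remove the sentence at position `index`."""
    kept = [s for i, s in enumerate(split_sentences(summary)) if i != index]
    return (". ".join(kept) + ".") if kept else ""


def toy_score(source: str, summary: str) -> float:
    """Stand-in scorer (NOT an LLM): share of unique source words found in the summary."""
    src = set(source.lower().replace(".", "").split())
    summ = set(summary.lower().replace(".", "").split())
    return len(src & summ) / len(src) if src else 0.0


def perturb_and_rescore(
    source: str,
    summary: str,
    score: Callable[[str, str], float],
) -> List[Dict[str, float]]:
    """Drop each sentence in turn, re-score, and report the change from the baseline."""
    baseline = score(source, summary)
    rows = []
    for i in range(len(split_sentences(summary))):
        revised = drop_sentence(summary, i)
        new_score = score(source, revised)
        rows.append({"dropped": i, "score": new_score, "delta": new_score - baseline})
    return rows


if __name__ == "__main__":
    source = "Cells divide by mitosis. Mitosis has four phases."
    summary = "Cells divide by mitosis. I like pizza."
    for row in perturb_and_rescore(source, summary, toy_score):
        print(row)
```

In this toy setup, dropping the on-topic sentence lowers the score while dropping the off-topic one leaves it unchanged. A score delta per revision is the kind of signal a perturbation view would need in order to compare many revisions at once.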
