Supporting Sensemaking of Large Language Model Outputs at Scale

Katy Ilonka Gero, Chelse Swoopes, Ziwei Gu, Jonathan K. Kummerfeld, Elena L. Glassman

TL;DR

This work tackles how to make sense of many LLM outputs at meso-scale by introducing five text-analysis and rendering features (including a novel Positional Diction Clustering). It grounds design in Variation Theory and Analogical Learning Theory, and validates the approach through a controlled study (n=24) plus eight case studies, showing that the features support diverse sensemaking tasks and reveal insights unreachable by traditional single-output UIs. Key contributions include a practical interface with exact matches, unique words, PDC-based grids and interleaved renderings, plus explicit design guidelines for future LLM inspectors. The findings suggest that preserving full-text outputs, avoiding predefined lenses, and pre-computing cross-document relationships enable richer, more scalable analysis of LLM-generated content, with broad implications for end-user workflows, model auditing, and prompt engineering.

Abstract

Large language models (LLMs) are capable of generating multiple responses to a single prompt, yet little effort has been expended to help end-users or system designers make use of this capability. In this paper, we explore how to present many LLM responses at once. We design five features, which include both pre-existing and novel methods for computing similarities and differences across textual documents, as well as how to render their outputs. We report on a controlled user study (n=24) and eight case studies evaluating these features and how they support users in different tasks. We find that the features support a wide variety of sensemaking tasks and even make tractable some tasks that our participants previously considered too difficult. Finally, we present design guidelines to inform future explorations of new LLM interfaces.

Paper Structure (87 sections, 12 figures, 2 tables)

This paper contains 87 sections, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Example of the 'exact matches' feature for the prompt "Who invented the {object}?" where the objects are 'pencil' and 'telescope' and each prompt had $n = 3$ generations. Exact matches makes it easy to identify portions of responses that are matching across multiple responses.
  • Figure 2: Example of the 'unique words' feature for the prompt "Write a short paragraph about the sea in the style of {style}." where the styles are 'a horror novel' and 'a romance novel' and each prompt had $n = 3$ generations. Unique words makes it easy to see how word choice is influenced by the style.
  • Figure 3: Example of the PDC feature in the grid layout for the prompt "Explain how a lightbulb works to a 12 year old." for GPT4 temperature=1 and GPT4 temperature=1.3. In the grid view, structurally and semantically similar sentences are highlighted in the same color; notice that the sentences highlighted in yellow are both about how gas supports filament longevity.
  • Figure 4: Example of the PDC feature in the interleaved layout for the same prompt, model, and temperature settings as in Figure 3, i.e., "Explain how a lightbulb works to a 12 year old." for GPT4 temperature=1 and GPT4 temperature=1.3. In the interleaved view, structurally and semantically similar sentences are grouped, with the color patch to the left indicating which model version produced them; notice that all the opening 'topic' sentences are shown together with redundant text grayed out.
  • Figure 5: Study Process: each participant performs two email rewriting tasks with different UIs and two model comparison tasks with different UIs. Interface conditions are counterbalanced.
  • ...and 7 more figures
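
The 'exact matches' and 'unique words' features described in the captions of Figures 1 and 2 can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that an exact match is a word n-gram shared by at least two responses, and that a unique word is one that appears in one group of responses but in none of another group. The function names (`tokenize`, `exact_matches`, `unique_words`), the parameters `n` and `min_docs`, and the tokenization rule are all hypothetical choices made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; punctuation is dropped (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def exact_matches(responses, n=3, min_docs=2):
    """Return word n-grams that occur in at least `min_docs` responses.

    Assumption: 'exact match' is approximated by shared word n-grams.
    """
    doc_counts = Counter()
    for text in responses:
        tokens = tokenize(text)
        # Use a set so each response counts at most once per n-gram.
        grams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        doc_counts.update(grams)
    return {" ".join(g) for g, c in doc_counts.items() if c >= min_docs}


def unique_words(group_a, group_b):
    """Return words used in group_a's responses but in none of group_b's.

    Assumption: 'unique' is defined relative to another group of responses.
    """
    vocab_a = {w for text in group_a for w in tokenize(text)}
    vocab_b = {w for text in group_b for w in tokenize(text)}
    return vocab_a - vocab_b


if __name__ == "__main__":
    # Made-up strings for illustration only; these are not model outputs.
    responses = [
        "The widget was invented by Ada.",
        "Many say the widget was invented by Ada.",
        "Nobody knows.",
    ]
    print(sorted(exact_matches(responses)))
    print(sorted(unique_words(["The dark sea howled."], ["The gentle sea sighed."])))
```

A set-based comparison like this ignores where a phrase occurs within a response. A rendering layer would also need character offsets in order to highlight the matching spans in place.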