Enhancing Text Corpus Exploration with Post Hoc Explanations and Comparative Design

Michael Gleicher; Keaton Leppenan; Yunyu Bai

TL;DR

This paper addresses the limitations of existing text corpus exploration (TCE) tools, which struggle to support the full range of exploratory and learning tasks across multiple scales. It introduces salience functions as post hoc explanations for similarity, recommendations, and spatial layouts, enabling exemplar- and feature-based interpretations that can accompany a variety of underlying algorithms. The authors couple these explanations with multiscale, comparative views and demonstrate the approach in an open-source prototype, AbstractsViewer, applied to scientific abstracts and newspaper leads. A user study indicates that researchers can flexibly perform a broad set of discovery and learning tasks using the proposed explanations and coordinated views.

Abstract

Text corpus exploration (TCE) spans the range of exploratory search tasks: it goes beyond simple retrieval to include item discovery and learning about the corpus and topic. Systems support TCE with tools such as similarity-based recommendations and embedding-based spatial maps. However, these tools address specific tasks; current systems lack the flexibility to support the range of tasks encountered in practice and the iterative, multiscale workflows users employ. In this paper, we provide methods that enhance TCE tools with post hoc explanations and multiscale, comparative designs to provide flexible support for user needs. We introduce salience functions as a mechanism to provide post hoc explanations of similarity, recommendations, and spatial placement. This post hoc strategy allows our approach to complement a variety of underlying algorithms; the salience functions provide both exemplar- and feature-based explanations at scales ranging from individual documents to the entire corpus. These explanations are incorporated into a set of views that operate at multiple scales. The views use design elements that explicitly support comparison to enable flexible integration. Together, these form a flexible toolset that can address a range of tasks. We demonstrate our approach in a prototype system that enables the exploration of corpora of paper abstracts and newspaper archives. Examples illustrate how our approach enables the system to flexibly support a wide range of tasks and workflows that emerge in user scenarios. A user study confirms that researchers are able to use our system to achieve a variety of tasks.
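
To make the idea of a post hoc salience function more concrete, the following is a minimal sketch, not the paper's actual method. It assumes a hypothetical salience score defined as the difference between a term's document frequency in a selected group and its document frequency in the whole corpus. It then derives a feature-based explanation (top terms) and an exemplar-based explanation (the selected document that carries the most salient terms). All function names and the toy data are illustrative assumptions.

```python
from collections import Counter


def doc_frequency(docs):
    """Fraction of documents in `docs` that contain each term."""
    counts = Counter()
    for terms in docs:
        counts.update(set(terms))
    return {term: count / len(docs) for term, count in counts.items()}


def term_salience(selected, corpus):
    """Hypothetical salience: document frequency in the selection
    minus document frequency in the whole corpus."""
    selected_df = doc_frequency(selected)
    corpus_df = doc_frequency(corpus)
    return {t: selected_df[t] - corpus_df.get(t, 0.0) for t in selected_df}


def top_terms(selected, corpus, k=3):
    """Feature-based explanation: the k most salient terms
    (ties broken alphabetically)."""
    scores = term_salience(selected, corpus)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def exemplars(selected, corpus, k=1):
    """Exemplar-based explanation: indices of the k selected documents
    whose terms have the highest total salience."""
    scores = term_salience(selected, corpus)

    def doc_score(terms):
        return sum(scores.get(t, 0.0) for t in set(terms))

    ranked = sorted(range(len(selected)), key=lambda i: -doc_score(selected[i]))
    return ranked[:k]


# Toy corpus of tokenized documents (illustrative data only).
corpus = [
    ["text", "topic", "document"],
    ["text", "corpus", "topic", "document"],
    ["graph", "layout", "node"],
    ["image", "collection", "layout"],
]
region = corpus[:2]  # stands in for a region selected on a corpus map

print(top_terms(region, corpus))
print(exemplars(region, corpus))
```

Because the score is computed after the fact from the selection alone, the same sketch could in principle explain a group produced by any underlying similarity or layout algorithm, which is the property the abstract attributes to the post hoc strategy.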

Paper Structure (4 sections, 4 figures)

This paper contains 4 sections, 4 figures.

Introduction
Overview and Example
Related Work
Text Corpora Exploration

Figures (4)

Figure 1: Our approach enhances standard text corpus exploration views with post hoc explanations and support for comparison. (A) An embedding-based corpus map is shown as a gridded heatmap with circle overlays for search results. This view is enhanced with explanations of region contents (either by hovering over a heatmap square or selecting an arbitrary region shown in yellow), the ability to compare two searches (green and gray circles), and two selected documents (pink and yellow stars) allowing their neighbors to be compared (pink and yellow circles). (B) A term-document matrix view is enhanced with salience functions that reorder it to emphasize subsets that explain selected groups. Comparative features highlight differences between sets of documents. (C) A text view is enhanced with comparison features to show two selected documents. Each document view can highlight explanations for why the document is in its map regions (blue) and why the documents may be considered similar (yellow). Each document provides its most similar neighbors in two vector spaces, with colored symbols to enable comparison between lists.
Figure 2: Screenshot of AbstractsViewer showing its views: (A) Search Tools Panel including the Search List, (B) Corpus Map, (C) Region Scatter Plot View, (D) Region Matrix View, (E) Region List, (F) Neighborhood Matrix View, (G) Document View, (H) Neighbor List View, and (I) Radial Neighborhood View. Two of G, H, and I are shown, one for each selection.
Figure 3: This example illustrates four exemplary workflows in the context of the present paper. The objectives are to discover related papers (to provide context for our work and generate ideas for improvement), to learn more about the corpus, and to find commonly used terms. Workflow A. (A1) We search for the relevant term "text". This returns too many documents (292) to examine individually. (A2) However, we can use the Corpus Map to see how the documents are distributed and examine particularly dense regions. (A3) Selecting a region (yellow rectangle) enables explanation of the region: a term-based explanation of salient words or an exemplar-based (item-based) explanation of representative documents. Here the terms "text", "document", "collect", "topic", and "word" are salient, and the most representative documents include other text exploration systems or have terms that suggest similar topics, such as "topic", "theme", and "citation". We identify this region as one focusing on Text Corpus Exploration, the "TCE region"; saving the representative document list allows for systematic exploration. Workflow B. (B1) Selecting another dense region shows a different explanation. (B2) The Region Matrix View reveals that the terms "event" and "challenge" are very salient. Sorting documents by relevant terms shows that many of these papers are VAST Challenge solutions, which often involve text analysis but are less relevant. Workflow C. (C1) We select a small, dense outlier region. (C2) "text" is not a salient term, but "labeling" is. (C3) Examining the selected "text" papers from the Region List, we see that many refer to text labeling, but there are text analysis systems which use labeling. While the region is generally not relevant, the specific papers can seed a similarity-based search to discover more papers about using labeling in exploration. Workflow D. We want to determine which term, "corpus" or "collection", more accurately describes our work. (D1) We search both terms and use comparison features to show both distributions on the Corpus Map. This allows us to examine differences in how these terms are used. "Corpus" is localized in a few clumps (green), while "collection" is more scattered (gray). (D2) Examining the clumps for "corpus" shows that one is in a region explained by language-related terms such as "linguist" and "language", while the other is the identified TCE region. In contrast, the scattered points for "collection" suggest its use is broader. Examining dense regions, we see that "collection" is frequently used to describe things other than text, such as images, graphs, and ensembles. The term "corpus" is more aligned with our usage. Other uses of "collection" point to similar problems that may offer inspiration.
Figure 4: AbstractsViewer supports Text Corpus Exploration through a variety of flexible views which operate on multiple scales. Standard designs are enhanced with explanations and comparative features.
