Crystallizing Schemas with Teleoscope: Thematic Curation of Large Text Corpora

Paul Bucci, Leo Foord-Kelcey, Patrick Yung Kang Lee, Alamjeet Singh, Ivan Beschastnikh

TL;DR

Teleoscope tackles the challenge of interpreting large text corpora for qualitative researchers by combining an auditable data-curation workflow with NLP-assisted guidance. It introduces schema crystallization as a method to externalize and iteratively refine researchers' cognitive schemas, supported by a provenance-focused visualization and an example-based NLP approach (BGE-M3, UMAP, HDBSCAN). Through three field deployments, on Reddit data and in a nursing context, the authors demonstrate how collaborative, live data exploration can yield richer information power and reproducible themes. The system is open-source and cloud-native, designed to scale to 100K–1M documents while maintaining rigor and interpretability. The practical impact is a rigorous, collaborative toolkit for theme discovery and justification in qualitative research at scale.

Abstract

Making sense of large text corpora is difficult when scales reach thousands or millions of documents. With the advent of LLMs, the potential for large-scale sense-making is being realized. However, this presents a need for rigour in the data curation stage of thematic analysis: selecting the right documents to achieve appropriate information power (saturation) requires an auditable trace of researchers' thought processes. In this paper, we present methodological and design findings from a three-year design process where we worked with qualitative researchers to create an open-source platform called Teleoscope designed to rigorously curate documents at scale. By implementing the qualitative research values common to thematic analysis during the curation stage (which we call thematic curation), we found researchers could come to a shared understanding of a large corpus and feel confident in their curation decisions (which we call schema crystallization).

Paper Structure (36 sections, 11 figures)

Figures (11)

  • Figure 1: An image of the Teleoscope workspace. (1) Users start by performing a keyword search to explore documents; (2) Documents are dragged onto the workspace; (3) Documents can be put into groups for organization; (4) Rank nodes can use documents, notes or groups as control inputs; (5) Projections create clusters using groups as control input; (6) Notes can contain arbitrary text which is also vectorized and can be used as a control input to a Rank; (7) the sidebar has a quick viewer for documents, saved items, bookmarks, and settings. Keyboard navigation is used for quick exploration and group creation.
  • Figure 2: Large corpora in the thousands to millions of documents are difficult to make sense of, but LLMs are making it technically feasible to try. How do you find which documents are important to you in a rigorous, repeatable, sharable manner?
  • Figure 3: During nucleation, ideas about the corpus are just starting to unfold and develop. Quick interaction is key to keeping an open mind while exploring. Crucially, each interaction both expands possibilities about the corpus through discovering new ideas and risks creating premature closure, opening up the possibility of confirmation bias. Keeping a visual trace of explored avenues allows for the necessary systematic challenge to one's biases. This image includes screenshots and descriptions of our real process of nucleation. For each nucleation, we started from a vague conceptual keyword such as "privacy" and discovered more concrete keywords such as "passcode."
  • Figure 4: Ideas start off ambiguous. As we wonder about our corpus, hunches, notions, and predictions emerge that we can test against the corpus. Our pre-existing schemas need to be developed by externalization (e.g., writing on a whiteboard). For this interface, externalization starts with guessing keywords to discover relevant documents.
  • Figure 5: When developing a theme, some documents may be more or less illustrative of that theme and therefore more or less relevant to analysis for that theme. When organizing themes, documents must eventually be included in or excluded from analysis; however, getting a sense for which documents are in or out takes iteration.
  • ...and 6 more figures
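
The Figure 1 caption describes Rank nodes that take documents, notes, or groups as control inputs, with notes being vectorized like documents. A minimal sketch of what such example-based ranking could look like follows. It is an illustration under stated assumptions, not the Teleoscope implementation: the three-dimensional vectors are toy stand-ins for real embeddings, averaging the control vectors into a single target is an assumed way of combining inputs, and the names `cosine`, `mean_vector`, and `rank` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def rank(corpus, controls):
    """Order documents by similarity to the averaged control vectors.

    ASSUMPTION: averaging the control inputs is one simple way to combine
    them; the source does not specify how Teleoscope combines inputs.
    """
    target = mean_vector(controls)
    scored = [(doc_id, cosine(vec, target)) for doc_id, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (hypothetical); a real system would use model-produced vectors.
corpus = {
    "doc_passcode": [0.9, 0.1, 0.0],
    "doc_recipes": [0.0, 0.2, 0.9],
    "doc_privacy": [0.8, 0.3, 0.1],
}

# Control inputs: e.g. one selected document and one vectorized note.
controls = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]

ranking = rank(corpus, controls)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

In this toy setup the documents closest to the control examples rise to the top, which mirrors the caption's idea that a researcher steers the ranking by choosing examples rather than by writing queries.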