Crystallizing Schemas with Teleoscope: Thematic Curation of Large Text Corpora
Paul Bucci, Leo Foord-Kelcey, Patrick Yung Kang Lee, Alamjeet Singh, Ivan Beschastnikh
TL;DR
Teleoscope helps qualitative researchers interpret large text corpora by combining an auditable data-curation workflow with NLP-assisted guidance. It introduces schema crystallization, a method for externalizing and iteratively refining researchers' cognitive schemas, supported by a provenance-focused visualization and an example-based NLP approach (BGE-M3 embeddings, UMAP, HDBSCAN). Through three field deployments, on Reddit data and in a nursing context, the authors show how collaborative, live data exploration can yield greater information power and reproducible themes. The system is open-source and cloud-native, designed to scale to 100K–1M documents while maintaining rigor and interpretability. In practice, it gives qualitative researchers a collaborative toolkit for discovering and justifying themes at scale.
Abstract
Making sense of large text corpora is difficult when scales reach thousands or millions of documents. With the advent of LLMs, the potential for large-scale sense-making is being realized. However, this creates a need for rigour in the data curation stage of thematic analysis: selecting the right documents to achieve appropriate information power (saturation) requires an auditable trace of researchers' thought processes. In this paper, we present methodological and design findings from a three-year design process in which we worked with qualitative researchers to create Teleoscope, an open-source platform designed for rigorous document curation at scale. By implementing the qualitative research values common to thematic analysis during the curation stage (which we call thematic curation), we found that researchers could come to a shared understanding of a large corpus and feel confident in their curation decisions (which we call schema crystallization).
