Edu-ConvoKit: An Open-Source Library for Education Conversation Data

Rose E. Wang, Dorottya Demszky

TL;DR

Edu-ConvoKit addresses the lack of centralized tools for analyzing education conversation data by providing an open-source, modular pipeline for pre-processing, annotation, and analysis. The library de-identifies data, supports seven annotation types (including talk moves and student reasoning), and enables qualitative, quantitative, lexical, temporal, and GPT-assisted analyses through a consistent, dataframe-based interface. It is demonstrated in Colab notebooks on three education datasets and is accompanied by tutorials and a paper repository intended to foster reproducibility and collaboration. The authors note limitations in transcription support, links to metadata, generalization across domains, and a de-identification step that relies on known speaker names, so results should be interpreted with care. Overall, Edu-ConvoKit offers a practical, community-driven platform for education-focused NLP research aimed at improving teaching and learning.

Abstract

We introduce Edu-ConvoKit, an open-source library designed to handle pre-processing, annotation, and analysis of conversation data in education. Resources for analyzing education conversation data are scarce, making the research challenging to perform and therefore hard to access. We address these challenges with Edu-ConvoKit. Edu-ConvoKit is open-source (https://github.com/stanfordnlp/edu-convokit), pip-installable (https://pypi.org/project/edu-convokit/), and comprehensively documented (https://edu-convokit.readthedocs.io/en/latest/). Our demo video (https://youtu.be/zdcI839vAko?si=h9qlnl76ucSuXb8-) is available on YouTube. Additional resources, such as Colab applications of Edu-ConvoKit to three diverse education datasets and a repository of Edu-ConvoKit-related papers, can be found in our GitHub repository.

Paper Structure (27 sections, 15 figures)

Figures (15)

  • Figure 1: Overview of Edu-ConvoKit. Edu-ConvoKit is designed to facilitate the study of conversation data in education. It is a modular, end-to-end pipeline for (A) pre-processing, (B) annotating, and (C) analyzing education conversation data. As additional resources, the toolkit includes Colab notebooks applying Edu-ConvoKit to three existing, large education datasets and a centralized database of Edu-ConvoKit papers. This toolkit aims to enhance the accessibility and reproducibility of NLP and education research.
  • Figure 2: Example for text de-identification. PreProcessor accounts for multiple names (e.g., "John Paul" matches to "John"), and handles word boundaries (e.g., "John" does not match to "Johnson").
  • Figure 3: Example for Annotator.
  • ...and 12 more figures
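
The de-identification behavior described in the Figure 2 caption (a multi-part name such as "John Paul" also matching "John", while "John" does not match inside "Johnson") can be illustrated with a small standalone sketch. This is not Edu-ConvoKit's implementation or API: the function names `build_name_pattern` and `deidentify`, the `[STUDENT]` placeholder, and the regex-based approach are all assumptions chosen for illustration, using only Python's standard `re` module.

```python
import re


def build_name_pattern(full_name):
    """Compile a regex matching a full name or any of its parts as whole words.

    Hypothetical helper for illustration; not part of Edu-ConvoKit.
    """
    parts = full_name.split()
    # Try the longest candidate first so "John Paul" wins over "John".
    candidates = sorted(set([full_name] + parts), key=len, reverse=True)
    alternation = "|".join(re.escape(c) for c in candidates)
    # \b enforces word boundaries, so "John" does not match inside "Johnson".
    return re.compile(r"\b(?:" + alternation + r")\b")


def deidentify(text, name_to_placeholder):
    """Replace each known speaker name (and its parts) with a placeholder."""
    for full_name, placeholder in name_to_placeholder.items():
        if not full_name.strip():
            continue  # skip empty names, which would match at every boundary
        pattern = build_name_pattern(full_name)
        # A function replacement avoids backslash processing in the placeholder.
        text = pattern.sub(lambda _match: placeholder, text)
    return text


if __name__ == "__main__":
    known_names = {"John Paul": "[STUDENT]"}  # hypothetical speaker mapping
    print(deidentify("John asked Johnson a question.", known_names))
```

Like the limitation noted in the TL;DR, this sketch only works when speaker names are known in advance; it makes no attempt to detect names that are missing from the mapping.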