Edu-ConvoKit: An Open-Source Library for Education Conversation Data
Rose E. Wang, Dorottya Demszky
TL;DR
Edu-ConvoKit addresses the lack of centralized tools for analyzing education conversation data by providing an open-source, modular pipeline for pre-processing, annotation, and analysis. The framework de-identifies data, supports seven annotation types (including talk moves and student reasoning), and enables qualitative, quantitative, lexical, temporal, and GPT-assisted analyses through a consistent, dataframe-based interface. It is demonstrated through Colab notebooks on three education datasets and is supported by tutorials and a paper repository to foster reproducibility and collaboration. The authors note limitations in transcription support, integration with metadata, generalization across domains, and a de-identification step that relies on speaker names being known in advance, so results should be interpreted with care and the library will need continued development. Overall, Edu-ConvoKit provides a practical, community-driven platform for accelerating education-focused NLP research aimed at improving teaching and learning.
Abstract
We introduce Edu-ConvoKit, an open-source library designed to handle pre-processing, annotation, and analysis of conversation data in education. Resources for analyzing education conversation data are scarce, which makes this research difficult to carry out and therefore hard to access. We address these challenges with Edu-ConvoKit. Edu-ConvoKit is open-source (https://github.com/stanfordnlp/edu-convokit), pip-installable (https://pypi.org/project/edu-convokit/), and comes with comprehensive documentation (https://edu-convokit.readthedocs.io/en/latest/). Our demo video is available at https://youtu.be/zdcI839vAko?si=h9qlnl76ucSuXb8-. Additional resources, such as Colab applications of Edu-ConvoKit to three diverse education datasets and a repository of Edu-ConvoKit-related papers, can be found in our GitHub repository.
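To make the three stages named above concrete, the following is a minimal sketch of a pre-process, annotate, analyze flow over a tabular transcript. It is not Edu-ConvoKit's API: the function names (`deidentify`, `annotate_talk_time`, `talk_time_by_speaker`), the column names, the placeholder labels, and the toy transcript are all hypothetical, and plain Python lists of dictionaries stand in for the dataframe the library operates on. Like the library's de-identification, the sketch assumes the names to be replaced are known in advance.

```python
import re
from collections import Counter


def deidentify(rows, known_names, text_key="text", speaker_key="speaker"):
    """Pre-processing: replace each known name with its placeholder
    in both the speaker column and the utterance text."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        for name, placeholder in known_names.items():
            pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
            new_row[text_key] = pattern.sub(placeholder, new_row[text_key])
            new_row[speaker_key] = pattern.sub(placeholder, new_row[speaker_key])
        cleaned.append(new_row)
    return cleaned


def annotate_talk_time(rows, text_key="text", output_key="talk_time"):
    """Annotation: add a column holding the whitespace-delimited word count
    of each utterance (a simple stand-in for richer annotations)."""
    return [{**row, output_key: len(row[text_key].split())} for row in rows]


def talk_time_by_speaker(rows, speaker_key="speaker", talk_key="talk_time"):
    """Analysis: total the annotated word counts per speaker."""
    totals = Counter()
    for row in rows:
        totals[row[speaker_key]] += row[talk_key]
    return dict(totals)


# Hypothetical toy transcript; one row per utterance.
transcript = [
    {"speaker": "Ms. Rivera", "text": "Alex, how did you get twelve?"},
    {"speaker": "Alex", "text": "I multiplied three by four."},
    {"speaker": "Ms. Rivera", "text": "Good. Can someone build on what Alex said?"},
]
# Names must be supplied up front; unknown names would pass through unchanged.
known_names = {"Ms. Rivera": "[TEACHER]", "Alex": "[STUDENT_1]"}

clean = deidentify(transcript, known_names)
annotated = annotate_talk_time(clean)
totals = talk_time_by_speaker(annotated)
print(totals)
```

Each stage reads named columns and writes its result to a new column, so stages can be chained or swapped independently; this mirrors, in spirit only, the consistent tabular interface the library describes.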
