ReflectSumm: A Benchmark for Course Reflection Summarization
Yang Zhong, Mohamed Elaraby, Diane Litman, Ahmed Ashraf Butt, Muhsin Menekse
TL;DR
ReflectSumm introduces a large, metadata-rich dataset for course reflection summarization, addressing the need for real-world, low-resource benchmarks in education. The dataset contains 17,512 reflections across 782 lectures in 24 STEM courses and supports three summarization formats (extractive, extractive-phrase, abstractive) with additional specificity and demographic metadata. The authors benchmark a range of baselines, from extractive methods (LexRank, MatchSum) to fine-tuned BART and GPT-based prompts, using cross-validation and multiple evaluation metrics including ROUGE, BERTScore, and SummaC. Key findings show that in-domain fine-tuning and specificity cues improve abstractive results, while LLM prompts excel on standard lexical metrics but may struggle with extractiveness and factuality; the dataset also enables fairness analyses and educational applications. The work provides a solid foundation for future research into fine-grained, domain-aware summarization of student reflections and invites extension to other domains.
Abstract
This paper introduces ReflectSumm, a novel summarization dataset specifically designed for summarizing students' reflective writing. The goal of ReflectSumm is to facilitate developing and evaluating novel summarization techniques tailored to real-world scenarios with little training data, with potential implications in the opinion summarization domain in general and the educational domain in particular. The dataset encompasses a diverse range of summarization tasks and includes comprehensive metadata, enabling the exploration of various research questions and supporting different applications. To showcase its utility, we conducted extensive evaluations using multiple state-of-the-art baselines. The results provide benchmarks for facilitating further research in this area.
