Text2Stories: Evaluating the Alignment Between Stakeholder Interviews and Generated User Stories

Francesco Dente, Fabiano Dalpiaz, Paolo Papotti

TL;DR

Text2Stories addresses the challenge of evaluating whether automatically generated or human-derived software requirements faithfully reflect elicitation transcripts. It defines the Text-to-Story Alignment (T2SA) task and two source-grounded metrics, Correctness ($Corr$) and Completeness ($Comp$), and implements a scalable two-stage pipeline with optional embedding-based blocking to align transcript chunks $C$ with stories $S$. Experiments over four datasets show that LLM-based judges reach a macro-F1 of about $0.86$ on held-out annotations. Further results indicate that larger generator models improve completeness while maintaining grounding, illustrating the framework’s utility for comparing human- vs. AI-generated stories. The approach provides a practical, source-faithful complement to existing quality criteria, with explicit trade-offs between chunking, blocking efficiency, and model choice, and it emphasizes human-in-the-loop validation and domain considerations.
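To make the blocking stage more concrete, here is a minimal sketch under stated assumptions: for each story, only the $K$ most similar chunks (by cosine similarity between embeddings) are kept as candidate pairs for the more expensive judge. The function names, the per-story top-$K$ rule, and the toy two-dimensional vectors are illustrative and not taken from the paper; a real setup would use vectors from an embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def block_top_k(chunk_vecs, story_vecs, k):
    """Hypothetical blocking rule: keep, per story, the k most similar chunks.

    Returns a set of (chunk_id, story_id) candidate pairs; only these would be
    passed on to the second-stage judge.
    """
    candidates = set()
    for s_id, s_vec in story_vecs.items():
        ranked = sorted(
            chunk_vecs,
            key=lambda c_id: cosine(chunk_vecs[c_id], s_vec),
            reverse=True,
        )
        candidates.update((c_id, s_id) for c_id in ranked[:k])
    return candidates

# Toy embeddings (assumed, for illustration only).
chunk_vecs = {"c1": [1.0, 0.0], "c2": [0.0, 1.0], "c3": [0.7, 0.7]}
story_vecs = {"s1": [0.9, 0.1], "s2": [0.1, 0.9]}

pairs = block_top_k(chunk_vecs, story_vecs, k=2)
all_pairs = len(chunk_vecs) * len(story_vecs)
print(sorted(pairs), f"{len(pairs)}/{all_pairs} pairs kept")
```

With $K=2$, the judge would see 4 of the 6 possible chunk-story pairs in this toy example. Smaller $K$ saves more judge calls but risks dropping true matches, which is the efficiency/recall trade-off the summary refers to.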

Abstract

Large language models (LLMs) can be employed for automating the generation of software requirements from natural language inputs such as the transcripts of elicitation interviews. However, evaluating whether those derived requirements faithfully reflect the stakeholders' needs remains a largely manual task. We introduce Text2Stories, a task and metrics for text-to-story alignment that allow quantifying the extent to which requirements (in the form of user stories) match the actual needs expressed by the elicitation session participants. Given an interview transcript and a set of user stories, our metric quantifies (i) correctness: the proportion of stories supported by the transcript, and (ii) completeness: the proportion of transcript supported by at least one story. We segment the transcript into text chunks and instantiate the alignment as a matching problem between chunks and stories. Experiments over four datasets show that an LLM-based matcher achieves 0.86 macro-F1 on held-out annotations, while embedding models alone remain behind but enable effective blocking. Finally, we show how our metrics enable the comparison across sets of stories (e.g., human vs. generated), positioning Text2Stories as a scalable, source-faithful complement to existing user-story quality criteria.
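The two metrics defined in the abstract can be sketched from a set of chunk-story links. This is a minimal illustration, assuming that alignment has already produced `(chunk_id, story_id)` pairs and that completeness is counted per chunk; the paper may restrict or weight chunks differently (for example by relevance or token count). All identifiers below are hypothetical.

```python
def correctness(stories, links):
    """Share of stories supported by at least one transcript chunk."""
    if not stories:
        return 0.0
    supported = {story_id for _, story_id in links}
    return len(supported & set(stories)) / len(stories)

def completeness(chunks, links):
    """Share of transcript chunks supported by at least one story.

    Assumption: every chunk counts equally.
    """
    if not chunks:
        return 0.0
    covered = {chunk_id for chunk_id, _ in links}
    return len(covered & set(chunks)) / len(chunks)

# Toy example: 4 chunks, 3 stories, 3 alignment links.
chunks = ["c1", "c2", "c3", "c4"]
stories = ["s1", "s2", "s3"]
links = {("c1", "s1"), ("c2", "s1"), ("c3", "s2")}

corr = correctness(stories, links)   # s3 has no supporting chunk
comp = completeness(chunks, links)   # c4 is not covered by any story
print(f"Correctness = {corr:.2f}, Completeness = {comp:.2f}")
```

In this toy case two of three stories are grounded in the transcript and three of four chunks are covered. A story set with invented requirements would lower correctness, while one that misses parts of the conversation would lower completeness.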

Paper Structure

This paper contains 27 sections, 3 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Overview of the workflow. Elicitation from stakeholders produces an interview transcript. From this transcript, human analysts (or LLMs) craft user stories. Text2Stories computes two metrics: Correctness (are stories supported by the interview?) and Completeness (is the whole conversation covered by the stories?).
  • Figure 2: Example of Text-to-Story Alignment (T2SA). A snippet from the elicitation interview (left) is aligned against two candidate user stories by a human analyst (right). One is a valid match (approval of new teams is evidenced in the interview), the other is not a match (allocation of match officials is not supported in this excerpt). T2SA produces traceable chunk-story links used by our Correctness and Completeness metrics.
  • Figure 3: Average correctness of user stories generated by Qwen models of increasing size. Error bars denote $\pm$1 standard deviation across datasets (n=15).
  • Figure 4: Average completeness of user stories generated by Qwen models of increasing size. Error bars denote $\pm$1 standard deviation across datasets (n=15).
  • Figure 5: Recall of positive pairs vs. percentage of tokens retrieved by blocking operator ($B_K$) across datasets. A lower percentage of retrieved tokens indicates higher efficiency.
  • ...and 1 more figure