Hints Help Finding and Fixing Bugs Differently in Python and Text-based Program Representations

Ruchit Rawal, Victor-Alexandru Pădurean, Sven Apel, Adish Singla, Mariya Toneva

TL;DR

This study investigates how hints influence bug finding and fixing when algorithms are represented in Python code versus natural-language text, across users who differ in their initial understanding of the task. In a large crowd-sourced experiment (N=753) spanning eight conditions (two program representations × four hint conditions: no hint, test cases, conceptual hints, and detailed fixes), with two tasks per participant, the authors measure accuracy on bug-related questions and response time. Without hints, participants with a clear understanding of the task are more accurate on text-based representations than on Python. Hints significantly improve accuracy on Python programs and can bridge accuracy gaps between representations and between understanding levels. Detailed fixes are generally the most helpful hint type, while conceptual hints are particularly useful for confused participants working with Python. The work offers practical guidance for designing adaptive programming tools that tailor representation and hints to user skill, and the data and scripts are publicly available to enable replication and further study.

Abstract

With the recent advances in AI programming assistants such as GitHub Copilot, programming is not limited to classical programming languages anymore: programming tasks can also be expressed and solved by end-users in natural text. Despite the availability of this new programming modality, users still face difficulties with algorithmic understanding and program debugging. One promising approach to support end-users is to provide hints to help them find and fix bugs while forming and improving their programming capabilities. While it is plausible that hints can help, it is unclear which type of hint is helpful and how this depends on program representations (classic source code or a textual representation) and the user's capability of understanding the algorithmic task. To understand the role of hints in this space, we conduct a large-scale crowd-sourced study involving 753 participants investigating the effect of three types of hints (test cases, conceptual, and detailed), across two program representations (Python and text-based), and two groups of users (with clear understanding or confusion about the algorithmic task). We find that the program representation (Python vs. text) has a significant influence on the users' accuracy at finding and fixing bugs. Surprisingly, users are more accurate at finding and fixing bugs when they see the program in natural text. Hints are generally helpful in improving accuracy, but different hints help differently depending on the program representation and the user's understanding of the algorithmic task. These findings have implications for designing next-generation programming tools that provide personalized support to users, for example, by adapting the programming modality and providing hints with respect to the user's skill level and understanding.

Paper Structure

This paper contains 27 sections, 6 figures.

Figures (6)

  • Figure 1: An illustrative example from the study showcasing an algorithmic task. After showing a task, the user is asked to answer a question related to understanding of the task (Q1). Afterward, the user is shown a buggy program (in Python or text-based representation), possibly along with a hint. Then, the user is asked to answer questions related to bug understanding (Q2), bug finding (Q3), and bug fixing (Q4). These questions are posed as multiple-choice questions; the answer options are not shown in the figure for brevity.
  • Figure 2: Visual summary of the main findings for our research questions. Circular nodes represent the main factors of variation (Program Representation, User Group, Hint Presence, and Hint Type). Rectangular blocks contain key takeaways, color-coded by research question (RQ1: purple, RQ2: red, RQ3: green). Connecting lines illustrate how factors combine to address different research questions.
  • Figure 3: Accuracy of participants when presented with text-based vs. Python-based program representations and no hints. The bar plot represents the mean Q2–Q4 average accuracy, with the vertical lines indicating the standard error of the mean. The red dotted line represents chance accuracy, and significant differences between program representations are indicated with an asterisk ($*$). Surprisingly, clear participants perform significantly better when presented with text-based representations than with Python-based representations.
  • Figure 4: Accuracy of participants when presented with hints and no hints, across different program representations. The bar plot represents the mean Q2–Q4 average accuracy, with the vertical lines indicating the standard error of the mean. The red dotted line represents chance accuracy, and significant differences between the no hint and with hint conditions are indicated with an asterisk ($*$). Hints significantly improved accuracy for both confused and clear participants with the Python program representation. Hints also bridged accuracy gaps between representations (for clear participants) and between understanding levels (for the Python representation).
  • Figure 5: Accuracy of participants when presented with different hints or no hint, shown separately for each program representation and each level of participant understanding. The bar plot represents the mean Q2–Q4 average accuracy, with the vertical lines indicating the standard error of the mean. The red dotted line represents chance accuracy, and significant differences between the no hint and different hint type conditions are indicated with an asterisk ($*$). Detailed fixes are generally most helpful, while conceptual hints are particularly useful for participants with confused understanding in the Python representation condition.
  • ...and 1 more figure
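
The captions for Figures 3 to 5 report a mean Q2–Q4 average accuracy with the standard error of the mean as error bars. As a rough illustration of that statistic, the sketch below computes it for one condition. The function names and the response data are hypothetical, and the assumption that each participant's score is the plain average of three right/wrong answers is ours; the authors' published scripts may aggregate differently (for example, across the two tasks each participant completes).

```python
import math
from statistics import mean, stdev


def participant_accuracy(q2: bool, q3: bool, q4: bool) -> float:
    """Average correctness over the three bug-related questions (Q2-Q4).

    Assumption: each question is scored as simply right or wrong.
    """
    return (int(q2) + int(q3) + int(q4)) / 3


def mean_and_sem(scores: list[float]) -> tuple[float, float]:
    """Return the mean and the standard error of the mean.

    The SEM is the sample standard deviation divided by sqrt(n).
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate the SEM")
    return mean(scores), stdev(scores) / math.sqrt(len(scores))


# Hypothetical (Q2, Q3, Q4) outcomes for four participants in one condition.
responses = [
    (True, True, False),
    (True, False, False),
    (True, True, True),
    (False, False, True),
]

scores = [participant_accuracy(*r) for r in responses]
group_mean, group_sem = mean_and_sem(scores)
print(f"mean accuracy = {group_mean:.3f}, SEM = {group_sem:.3f}")
```

Running the same computation once per condition (representation × hint condition × understanding group) would give the bar heights and error bars that the captions describe.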