An Exploratory Study of ML Sketches and Visual Code Assistants

Luís F. Gomes, Vincent J. Hellendoorn, Jonathan Aldrich, Rui Abreu

TL;DR

This work introduces an in-IDE Visual Code Assistant that translates ML data-science sketches into executable Jupyter Notebooks using a vision-enabled LLM. Through a study with 19 data scientists, the authors analyze sketch patterns and evaluate code generation via a GPT-4o-based judge, finding that outlines can reach 70-80% accuracy while detailed instantiations lag at 25-40%. Longer sketch times correlate with higher outline quality, and participants view visual assistants as valuable for education, prototyping, and collaborative contexts. The study demonstrates the feasibility and potential impact of integrating visual inputs into code generation tools, guiding future development of interactive, explainable, and iterative Visual Code Assistants in IDEs.

Abstract

This paper explores the integration of Visual Code Assistants in Integrated Development Environments (IDEs). In Software Engineering, whiteboard sketching is often the initial step before coding, serving as a crucial collaboration tool for developers. Previous studies have investigated patterns in SE sketches and how they are used in practice, yet methods for directly using these sketches for code generation remain limited. The emergence of visually equipped large language models presents an opportunity to bridge this gap, which is the focus of our research. In this paper, we build a first prototype of a Visual Code Assistant to get user feedback regarding in-IDE sketch-to-code tools. We conduct an experiment with 19 data scientists, most of whom regularly sketch as part of their job. We investigate developers' mental models by analyzing patterns commonly observed in their sketches when developing an ML workflow. Our analysis indicates that diagrams were the preferred organizational component (52.6%), often accompanied by lists (42.1%) and numbered points (36.8%). Our tool converts their sketches into a Python notebook by querying an LLM. We use an LLM-as-judge setup to score the quality of the generated code, finding that even brief sketching can effectively generate useful code outlines. We also find a positive correlation between sketch time and the quality of the generated code. We conclude the study by conducting extensive interviews to assess the tool's usefulness, explore potential use cases, and understand developers' needs. As noted by participants, promising applications for these assistants include education, prototyping, and collaborative settings. Our findings signal promise for the next generation of Code Assistants to integrate visual information, both to improve code generation and to better leverage developers' existing sketching practices.
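
As a quick sanity check on the reported shares, the short sketch below recovers the participant counts they imply. It assumes that each percentage is taken over all 19 participants and rounded to one decimal place; the abstract does not state the denominator explicitly, and the helper name `implied_count` is hypothetical.

```python
# Assumption: each reported share is a fraction of all 19 participants,
# rounded to one decimal place. The abstract does not state this explicitly.
PARTICIPANTS = 19

# Percentages as reported in the abstract.
reported = {"diagrams": 52.6, "lists": 42.1, "numbered points": 36.8}


def implied_count(percent, total=PARTICIPANTS):
    """Return the count k for which 100 * k / total rounds to `percent`, or None."""
    for k in range(total + 1):
        if round(100 * k / total, 1) == percent:
            return k
    return None


counts = {name: implied_count(p) for name, p in reported.items()}
for name, k in counts.items():
    print(f"{name}: {k} of {PARTICIPANTS} participants")
```

Under that assumption the counts sum to more than 19, which is consistent with the abstract's statement that diagrams were often accompanied by lists and numbered points, so the categories overlap.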

Paper Structure

This paper contains 25 sections, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Example of usage of our prototype. Left editor tab: The whiteboard sketch uploaded by the participant to be used by the code assistant to generate code. Right editor tab: Generated Jupyter Notebook with code and markdown explanations.
  • Figure 2: Overview of our benchmarking using LLM-as-a-Judge. To evaluate a Jupyter Notebook generated from a sketch (1), the Judge has access to knowledge elements related to the original drawing: a Task description that relates to users' coding intentions depicted on the drawing (2), and the grading criteria for the Outline and Instantiation metrics (3 and 4).
  • Figure 3: Performance comparison of top visual LLMs on the 19 notebooks. No statistically significant difference was detected in either dimension in our study.