
Xavier: Toward Better Coding Assistance in Authoring Tabular Data Wrangling Scripts

Yunfan Zhou, Xiwen Cai, Qiming Shi, Yanwei Huang, Haotian Li, Huamin Qu, Di Weng, Yingcai Wu

TL;DR

This work tackles the misalignment between data contexts and AI-driven code completions in data wrangling tasks. It introduces Xavier, a computational notebook extension that couples the code context with data contexts at three levels (tables, columns, rows) to deliver data-context-aware code suggestions, automatic data highlighting, and real-time transformation previews. A preliminary study informs the design requirements, and a user study with 16 analysts indicates that Xavier reduces context switches and errors during scripting, with positive user feedback on its transparency and verification aids. The findings suggest that integrating data contexts into coding assistance can improve the efficiency, accuracy, and trustworthiness of data wrangling workflows, with potential for broader adoption across data tools and languages.

Abstract

Data analysts frequently use code completion tools when writing custom scripts for complex tabular data wrangling tasks. However, existing tools do not sufficiently link data contexts, such as schemas and values, with the code being edited. This leads not only to poor code suggestions but also to frequent interruptions in the coding process, as users must write additional code to locate and understand the relevant data. We introduce Xavier, a tool designed to enhance the authoring of data wrangling scripts in computational notebooks. Xavier maintains users' awareness of data contexts while providing data-aware code suggestions. It automatically highlights the most relevant data based on the user's code, integrates both code and data contexts for more accurate suggestions, and instantly previews data transformation results for easy verification. To evaluate the effectiveness and usability of Xavier, we conducted a user study with 16 data analysts, which showed its potential to streamline the authoring of data wrangling scripts.
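To make the idea of linking data contexts with code concrete, the following is a minimal sketch, not Xavier's actual implementation. It assumes a table is a plain list of dictionaries standing in for a DataFrame. The names `summarize_data_context` and `build_prompt` and the prompt layout are hypothetical. It summarizes a table at the three levels named above (table, columns, rows) and prepends that summary to the code being edited.

```python
from collections import Counter


def summarize_data_context(name, rows, sample_size=2):
    """Summarize a table at three levels: table, columns, and rows.

    Hypothetical helper: `rows` is a list of dicts standing in for a DataFrame.
    Cell values are assumed to be hashable.
    """
    columns = list(rows[0].keys()) if rows else []
    column_info = {}
    for col in columns:
        values = [row[col] for row in rows]
        # The most common Python type name serves as a crude column type.
        type_name = Counter(type(v).__name__ for v in values).most_common(1)[0][0]
        column_info[col] = {"type": type_name, "distinct": len(set(values))}
    return {
        "table": {"name": name, "n_rows": len(rows), "n_columns": len(columns)},
        "columns": column_info,
        "rows": rows[:sample_size],  # a few sample rows, not the whole table
    }


def build_prompt(code_so_far, context):
    """Prepend the data context, as comments, to the code being edited."""
    table = context["table"]
    lines = [
        f"# Table {table['name']}: {table['n_rows']} rows x {table['n_columns']} columns"
    ]
    for col, info in context["columns"].items():
        lines.append(f"# Column {col} ({info['type']}), {info['distinct']} distinct values")
    for row in context["rows"]:
        lines.append(f"# Sample row: {row}")
    lines.append(code_so_far)
    return "\n".join(lines)


# Illustrative data only; not taken from the paper.
movies = [
    {"title": "A", "country": "US", "votes": 120},
    {"title": "B", "country": "FR", "votes": 95},
    {"title": "C", "country": "US", "votes": 120},
]
context = summarize_data_context("movies", movies)
prompt = build_prompt("movies.sort_values(", context)
```

A completion model given `prompt` would see the column names, rough types, and sample values alongside the partial code, which is the kind of coupling the abstract describes.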

Paper Structure

This paper contains 35 sections and 7 figures.

Figures (7)

  • Figure 1: The timelines of observed activities of each participant during the code authoring experiment. DI and CA refer to Data Inspection and Code Authoring, respectively. Users may profile data (DI_1), verify results (DI_2), create new transformations (CA_1) or modify the written code (CA_2) during scripting. To facilitate comparison, we scaled the duration of activities by normalizing each participant's total time spent on code authoring.
  • Figure 2: The workflow of Xavier. The input of Xavier consists of code in the editor (A1) and DataFrames in the notebook kernel (A2). The code context manager (B1) divides the code into a complete part and an incomplete part, and parses the incomplete part further. The data context manager (B2) pre-calculates data contexts for each DataFrame as of the last run of code. The complete code, the parsing result, and the data contexts are passed to the completion generator (B3) to produce data-context-aware code suggestions (C1). Meanwhile, Xavier highlights the most relevant data in the data view based on the user's code and the completion suggestions, and previews transformation results to facilitate code verification (C2).
  • Figure 3: The usage scenario of automatic data context highlighting. A) Xavier detected the existing DataFrame "joined" and showed the corresponding schema. B) When Sarah was selecting the suggested column names for the partial code (B1), Xavier displayed sample rows of the DataFrame "joined" and highlighted relevant columns based on Sarah's code and the selected suggestion (B2). C) Finally, Sarah selected three columns which were highlighted by Xavier.
  • Figure 4: The usage scenario of real-time transformation preview. When Sarah switched to a completion item about column format transformation (A), Xavier automatically computed the transformation result and added a preview column (B2) to the right of the original column (B1), with bold text in changed table cells.
  • Figure 5: Three preview forms of Xavier. A) For a column format transformation, a new column is created to the right of the original column. B) For a table filtering transformation, the rows to be deleted are highlighted. C) For transformations that generate a new table or change the whole table (e.g., "Sort movies by total votes. For movies having equal total votes, sort them by country names"), both the original table and the result table are displayed.
  • ...and 2 more figures
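
The three preview forms described in the Figure 5 caption can be sketched as follows. This is an illustrative approximation under stated assumptions, not Xavier's code. Tables are plain lists of dictionaries standing in for DataFrames. The function names and the `_changed` and `_to_delete` marker fields are hypothetical. The descending sort order in the last example is also an assumption, since the caption does not state a direction.

```python
def preview_column_transform(rows, column, func):
    """Form A: place a preview column right after the original column.

    `_changed` flags cells whose value differs, which a UI could render in bold.
    """
    preview = []
    for row in rows:
        new_value = func(row[column])
        out = {}
        for key, value in row.items():
            out[key] = value
            if key == column:
                out[f"{column}_preview"] = new_value
        out["_changed"] = new_value != row[column]
        preview.append(out)
    return preview


def preview_filter(rows, keep):
    """Form B: mark the rows a filter would delete instead of dropping them."""
    return [dict(row, _to_delete=not keep(row)) for row in rows]


def preview_new_table(rows, transform):
    """Form C: return the original table and the result table together."""
    return {"original": list(rows), "result": transform(rows)}


# Illustrative data only; not taken from the paper.
movies = [
    {"title": "Alpha", "country": "US", "votes": 120},
    {"title": "beta", "country": "FR", "votes": 95},
    {"title": "Gamma", "country": "DE", "votes": 120},
]

form_a = preview_column_transform(movies, "title", str.title)
form_b = preview_filter(movies, lambda row: row["votes"] >= 100)
# Sort by votes (descending, an assumption), breaking ties by country name.
form_c = preview_new_table(
    movies,
    lambda rows: sorted(rows, key=lambda row: (-row["votes"], row["country"])),
)
```

None of the three functions modifies the input table, so a preview can be discarded without side effects if the user picks a different suggestion.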