StructVizor: Interactive Profiling of Semi-Structured Textual Data

Yanwei Huang, Yan Miao, Di Weng, Adam Perer, Yingcai Wu

TL;DR

StructVizor tackles the challenge of profiling and wrangling semi-structured textual data by combining automatic structure mining with interactive visual profiling. The system parses raw text into records and fields, clusters patterns, and visualizes structure in three coordinated views to support sensemaking and in-situ data transformations, including table construction. A user study with 12 participants shows that StructVizor enables faster data wrangling with lower workload than Wrangler and facilitates exploratory analysis via expressive profiles, though the study also identifies integration and onboarding as areas for refinement. Overall, the work contributes a novel visual profiling paradigm that blends automatic structure discovery with interactive data wrangling, with potential practical impact for log analysis, data cleaning, and social media analytics.

Abstract

Data profiling plays a critical role in understanding the structure of complex datasets and supporting numerous downstream tasks, such as social media analytics and financial fraud detection. While existing research predominantly focuses on structured data formats, a substantial portion of semi-structured textual data still requires ad-hoc and arduous manual profiling to extract and comprehend its internal structures. In this work, we propose StructVizor, an interactive profiling system that facilitates sensemaking and transformation of semi-structured textual data. Our tool mainly addresses two challenges: a) extracting and visualizing the diverse structural patterns within data, such as how information is organized or related, and b) enabling users to efficiently perform various wrangling operations on textual data. Through automatic data parsing and structure mining, StructVizor enables visual analytics of structural patterns, while incorporating novel interactions to enable profile-based data wrangling. A comparative user study involving 12 participants demonstrates the system's usability and its effectiveness in supporting exploratory data analysis and transformation tasks.
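
To make the idea of "structural patterns" concrete, the following is a minimal sketch of grouping raw text records by a coarse structural signature. It is not StructVizor's parsing or structure-mining algorithm, which this summary does not specify; the function names `signature` and `group_by_structure`, the `<NUM>`/`<WORD>` placeholders, and the sample records are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical tokenizer: runs of digits or runs of ASCII letters.
TOKEN = re.compile(r"\d+|[A-Za-z]+")

def signature(record: str) -> str:
    """Replace each digit run with <NUM> and each letter run with <WORD>,
    keeping punctuation and whitespace as the structural skeleton."""
    return TOKEN.sub(
        lambda m: "<NUM>" if m.group().isdigit() else "<WORD>",
        record,
    )

def group_by_structure(lines):
    """Group records that share the same structural signature."""
    groups = defaultdict(list)
    for line in lines:
        groups[signature(line)].append(line)
    return dict(groups)

# Illustrative sample records (invented for this sketch).
sample = [
    "2024-01-05 ERROR disk full",
    "2024-01-06 WARN cpu high",
    "user=alice id=42",
    "user=bob id=7",
]
groups = group_by_structure(sample)
for sig, records in groups.items():
    print(sig, len(records))
```

Exact signature matching is a deliberate simplification here: records whose free-text fields differ in word count would fall into separate groups, which is the kind of case a similarity-based clustering step would need to handle.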

Paper Structure

This paper contains 48 sections, 4 equations, 7 figures, 3 algorithms.

Figures (7)

  • Figure 1: The StructVizor system. (A) The structure view visualizes the structural patterns present in the dataset. The relationship view (A1) illustrates the similarity between different data fields. The tabular view (A2) depicts the dataset's structural distribution, where rows represent data records and columns represent data fields. Users can click on cells (A3) to view the value distributions of records in the selected field. Various interactions are supported for in-situ data wrangling, such as splitting fields into subfields (A4), applying filters (A5), and performing transformations on cells (A6). (B) The data view displays the annotated dataset, with parsed data records separated into different lines and clustered (B1). An overview of the dataset is provided through the thumbnail view (B2). (C) The Wrangler view empowers users to construct relational tables based on the dataset profiles. Users can navigate to specific records by clicking on the table cells (C1).
  • Figure 2: Scenes for the usage scenario. (A) The data view after importing the dataset. (B) Cells in the structure view are used for field analysis and data filtering. (C) The updated structure view for detailed analysis of lengthy unstructured strings. (D) The heatmap showing the similarity between fields. (E) The relational table in the Wrangler view, constructed by the user for further analysis.
  • Figure 3: (A) The panel for importing data. The automatically sampled dataset is initially shown. Users may edit and refine it before entering the visualization interface. (B) The panel for editing the fields of a record, where users can put the updated fields in separate lines.
  • Figure 4: Average running time of StructVizor's data processing pipeline with respect to the dataset size and complexity.
  • Figure 5: Failure cases of GPT-4o in data parsing.
  • ...and 2 more figures