Automating Date Format Detection for Data Visualization

Zixuan Liang

TL;DR

The paper tackles the bottleneck of date parsing in data preparation for visualization by introducing two automated format-detection methods: a minimum description length (MDL) based approach and a natural language processing (NLP) based method. The MDL method emphasizes speed and compact representations, while the NLP approach leverages probabilistic grammars to handle ambiguous and irregular date formats. Evaluations on a large open-data corpus show high accuracy and good cross-method agreement, with MDL favored for production due to performance, and NLP offering flexibility for complex patterns. The work demonstrates practical integration potential with visualization tools and databases, advancing automated, scalable data cleaning and preparation pipelines.

Abstract

Data preparation, specifically date parsing, is a significant bottleneck in analytic workflows. To address this, we present two algorithms, one based on minimum entropy and the other on natural language modeling, that automatically derive date formats from string data. These algorithms achieve over 90% accuracy on a large corpus of data columns, streamlining the data preparation process within visualization environments. The minimum entropy approach is particularly fast, providing interactive feedback. Our methods simplify date format extraction, making them suitable for integration into data visualization tools and databases.
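
To make the task concrete, the following is a minimal sketch of what "deriving a date format from string data" means in practice. It is not the paper's minimum entropy or natural language algorithm: it swaps in a much simpler brute-force strategy that scores a fixed list of candidate formats by how many column values each one parses. The candidate list and the names `CANDIDATE_FORMATS`, `parse_rate`, and `detect_format` are hypothetical and chosen only for illustration.

```python
from datetime import datetime

# Hypothetical candidate list (an assumption for this sketch). The paper's
# algorithms derive formats from the data rather than enumerating them.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y", "%d/%m/%Y"]


def parse_rate(values, fmt):
    """Return the fraction of strings in `values` that parse under `fmt`."""
    if not values:
        return 0.0
    parsed = 0
    for value in values:
        try:
            datetime.strptime(value.strip(), fmt)
            parsed += 1
        except ValueError:
            # The string does not match this format; count it as a failure.
            pass
    return parsed / len(values)


def detect_format(values, candidates=CANDIDATE_FORMATS):
    """Pick the candidate format that parses the largest share of the column.

    Ties go to the candidate listed first. Returns (None, 0.0) when no
    candidate parses any value.
    """
    scored = [(parse_rate(values, fmt), fmt) for fmt in candidates]
    best_rate, best_fmt = max(scored, key=lambda pair: pair[0])
    if best_rate == 0.0:
        return None, 0.0
    return best_fmt, best_rate


column = ["03/25/2021", "12/31/2020", "01/02/2019"]
print(detect_format(column))
```

In this example the value `01/02/2019` is ambiguous on its own, but the other two values only parse as month/day/year, so the column as a whole resolves the ambiguity. Scoring whole columns rather than single strings is the general idea this sketch shares with the approaches summarized above; everything else about it is a simplification.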

Paper Structure

This paper contains 46 sections, 1 equation, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Categorical Date Scalars.
  • Figure 2: Quantitative Date Scalars.
  • Figure 3: MDL Error Rate.
  • Figure 4: MDL Output.
  • Figure 5: Most common date formats identified by the NLP algorithm.