Dataset Discovery via Line Charts
Daomin Ji, Hui Luo, Zhifeng Bao, J. Shane Culpepper
TL;DR
The paper addresses dataset discovery with a line chart as the query: given only the chart, without its underlying data, the goal is to locate datasets capable of generating similar charts. It handles the missing data by learning a cross-modal relevance function Rel'(V,T). It introduces the Fine-grained Cross-modal Relevance Learning Model (FCM), which uses segment-level encoders and a hierarchical cross-modal matcher to align line charts with datasets, and extends the model to handle data-aggregation-based queries via transformation, multi-scale representation, and mixture-of-experts (MoE) layers. On a Plotly-based benchmark, FCM outperforms strong baselines by substantial margins (e.g., prec@50 up to 0.454 and ndcg@50 up to 0.347), and a hybrid indexing strategy that combines an interval tree with locality-sensitive hashing (LSH) yields notable efficiency gains. The work provides a new cross-modal framework for chart-driven data discovery with practical implications for journalism, business, and clinical research, and outlines directions for supporting broader chart types and data transformations.
Abstract
Line charts are a valuable tool for data analysis and exploration, distilling essential insights from a dataset. However, the dataset underlying a line chart is rarely available. In this paper, we explore a novel dataset discovery problem, dataset discovery via line charts, in which line charts serve as queries to discover datasets within a large data repository that are capable of generating similar line charts. To solve this problem, we propose a novel approach called the Fine-grained Cross-modal Relevance Learning Model (FCM), which estimates the relevance between a line chart and a candidate dataset. FCM first employs a visual element extractor to extract informative visual elements, i.e., lines and y-ticks, from a line chart. Then, two novel segment-level encoders learn representations for a line chart and a dataset while preserving fine-grained information, and a cross-modal matcher matches the learned representations in a fine-grained way. Furthermore, we extend FCM to support line chart queries generated based on data aggregation. Finally, we propose a benchmark tailored for this problem, since no such dataset exists. Extensive evaluation on the new benchmark verifies the effectiveness of our proposed method. Specifically, our approach surpasses the best baseline by 30.1% and 41.0% in terms of prec@50 and ndcg@50, respectively.
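For readers unfamiliar with the two reported metrics, the following minimal Python sketch shows one common way to compute prec@k and ndcg@k for a single query. It is an illustration under stated assumptions, not the paper's evaluation code: the function names are hypothetical, relevance labels are assumed to be given in ranked order, and the ideal ranking for NDCG is assumed to be derived from the same returned list. The benchmark's exact definitions may differ.

```python
import math
from typing import Sequence


def precision_at_k(relevance: Sequence[float], k: int) -> float:
    """Fraction of the top-k ranked results that are relevant.

    `relevance` holds one label per returned dataset, in ranked order;
    any label greater than 0 counts as relevant (an assumption).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for r in relevance[:k] if r > 0) / k


def dcg_at_k(relevance: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with the usual log2 position discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(relevance[:k]))


def ndcg_at_k(relevance: Sequence[float], k: int) -> float:
    """DCG normalised by the DCG of the ideal ordering of the same labels.

    Assumption: the ideal ranking is built from the returned list only,
    so relevant datasets that were never retrieved are not counted.
    """
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0


# Hypothetical ranked labels for one line chart query (1 = relevant dataset).
ranked_labels = [1, 0, 1, 1, 0]
p_at_5 = precision_at_k(ranked_labels, 5)
n_at_5 = ndcg_at_k(ranked_labels, 5)
```

In practice these per-query scores would be averaged over all queries in the benchmark, and the relative improvements quoted above would be computed from those averages.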
