CoddLLM: Empowering Large Language Models for Data Analytics

Jiani Zhang, Hengrui Zhang, Rishav Chakravarti, Yiqun Hu, Patrick Ng, Asterios Katsifodimos, Huzefa Rangwala, George Karypis, Alon Halevy

TL;DR

CoddLLM presents a data-centric approach to building analytics-focused foundation models by post-training a 12B decoder-only LLM (based on Mistral-NeMo-Instruct) with a scalable, reference-grounded data recipe. The training corpus is organized into three chapters: analytics knowledge, table-text alignment (Text-to-Schema, Row-to-Text), and analytics tasks (Table Selection, Text-to-SQL). It is accompanied by the newly released benchmarks AnalyticsMMLU and WikiPage-TS. Across eight evaluation datasets, CoddLLM achieves the highest overall score (0.697) and outperforms baselines including GPT-3.5-Turbo and GPT-4o on several tasks, notably Table Selection (12.1% ahead of GPT-4o) and Text-to-SQL (24.9% average gain over the base model). The work demonstrates that instruction-tuned, reference-grounded, multi-chapter training can substantially improve data discovery, schema interpretation, and cross-modal reasoning in data analytics. Future directions include integrating retrieval-augmented generation and analytics tooling to further support practical data analytics tasks.

Abstract

Large Language Models (LLMs) have the potential to revolutionize data analytics by simplifying tasks such as data discovery and SQL query synthesis through natural language interactions. This work serves as a pivotal first step toward the development of foundation models explicitly designed for data analytics applications. To propel this vision forward, we unveil a new data recipe for post-training LLMs, enhancing their comprehension of data management and empowering them to tackle complex real-world analytics tasks. Specifically, our innovative approach includes a scalable synthetic data generation method that enables the creation of a broad spectrum of topics centered on data representation and manipulation. Furthermore, we introduce two new tasks that seamlessly bridge tables and text. We show that such tasks can enhance models' understanding of schema creation and the nuanced translation between natural language and tabular data. Leveraging this data recipe, we post-train a new foundation model, named CoddLLM, based on Mistral-NeMo-12B. To assess the language understanding and reasoning capabilities of LLMs in the realm of data analytics, we contribute AnalyticsMMLU, a benchmark containing thousands of multiple-choice questions on databases, data analysis, and machine learning. Our focus on data discovery has resulted in the contribution of three comprehensive benchmarks that address both database and data lake scenarios. CoddLLM not only excels in performance but also sets a new standard, achieving the highest average accuracy across eight datasets. It outperforms GPT-3.5-Turbo on AnalyticsMMLU, exceeds GPT-4o by 12.1% in table selection, and shows an average improvement of 24.9% in Text-to-SQL compared to the base model.

Paper Structure

This paper contains 29 sections, 1 equation, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Building Chapter 1 data. Step 1 (the top and middle figures): We first train a document classifier using annotations from an LLM. After training, we use the classifier to filter documents related to data analytics from the FineWeb-Edu dataset. Step 2 and Step 3 (the bottom figure): The filtered documents are then converted into question-answer pairs using a rule-based extractor or an LLM-based synthesizer. Finally, we adopt LLM-as-a-judge to eliminate low-quality examples.
  • Figure 2: A question and wikipage data sample from WikiPage-TS. In this example, we need to understand that Heat_2_1, Heat_3_1, and Heat_4_1 occurred in chronological order. In Heat_2_1, Marlene ran 11.5 seconds, which matched the Olympic record. In Heat_3_1, Betty ran 11.4 seconds, breaking the Olympic record. By Heat_4_1, although Heather ran 11.5 seconds, the Olympic record had already been lowered to 11.4 seconds, so Heather did not match the new record. The answer is 2 athletes. For the Table Selection task, we regard all referenced tables as ground truth tables.
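The reasoning in the Figure 2 example can be sketched as a running-record count. This is only an illustration of the caption's logic, not the paper's evaluation code: the function name `count_record_performances` is hypothetical, and the starting record of 11.5 seconds is an assumption inferred from the caption's statement that Marlene's 11.5 "matched" the record.

```python
from typing import List, Tuple


def count_record_performances(
    heats: List[Tuple[str, str, float]], initial_record: float
) -> int:
    """Count runs that match or beat the record standing at the time of the run.

    `heats` must be in chronological order; each entry is
    (table_name, athlete, time_in_seconds).
    """
    record = initial_record
    count = 0
    for _table, _athlete, time_s in heats:
        if time_s <= record:
            count += 1
            # A faster run lowers the record for all later heats.
            record = min(record, time_s)
    return count


# Values taken from the Figure 2 caption; the initial record is an assumption.
heats = [
    ("Heat_2_1", "Marlene", 11.5),  # matches the record
    ("Heat_3_1", "Betty", 11.4),    # breaks the record
    ("Heat_4_1", "Heather", 11.5),  # slower than the new 11.4 record
]

print(count_record_performances(heats, initial_record=11.5))  # 2
```

The order of the heats matters: the same three times processed in a different order can give a different count, which is why the question requires reading all three referenced tables together.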