Towards Human-Guided, Data-Centric LLM Co-Pilots

Evgeny Saveliev, Jiashuo Liu, Nabeel Seedat, Anders Boyd, Mihaela van der Schaar

TL;DR

CliMB-DC addresses a critical gap in LLM co-pilots by embedding data-centric AI tools within a human-in-the-loop, multi-agent framework. The paper formalizes a data-centric taxonomy, introduces a coordinator-worker architecture, and demonstrates robust handling of healthcare data challenges like label leakage and aggregation through real-world case studies. Empirical results show CliMB-DC outperforms model-centric baselines while maintaining domain alignment and interpretability, aided by its open-source toolkit design. These contributions pave the way for domain experts to actively participate in ML workflows, enhancing reliability and safety in high-stakes settings.

Abstract

Machine learning (ML) has the potential to revolutionize various domains, but its adoption is often hindered by the gap between the needs of domain experts and the translation of those needs into robust and valid ML tools. Despite recent advances in LLM-based co-pilots to democratize ML for non-technical domain experts, these systems remain predominantly focused on model-centric aspects while overlooking critical data-centric challenges. This limitation is problematic in real-world settings where raw data often contains complex issues, such as missing values, label noise, and domain-specific nuances requiring tailored handling. To address this, we introduce CliMB-DC, a human-guided, data-centric framework for LLM co-pilots that combines advanced data-centric tools with LLM-driven reasoning to enable robust, context-aware data processing. At its core, CliMB-DC introduces a novel, multi-agent reasoning system that combines a strategic coordinator for dynamic planning and adaptation with a specialized worker agent for precise execution. Domain expertise is then systematically incorporated to guide the reasoning process using a human-in-the-loop approach. To guide development, we formalize a taxonomy of key data-centric challenges that co-pilots must address. Thereafter, to address the dimensions of the taxonomy, we integrate state-of-the-art data-centric tools into an extensible, open-source architecture, facilitating the addition of new tools from the research community. Empirically, using real-world healthcare datasets, we demonstrate CliMB-DC's ability to transform uncurated datasets into ML-ready formats, significantly outperforming existing co-pilot baselines for handling data-centric challenges. CliMB-DC promises to empower domain experts from diverse domains -- healthcare, finance, social sciences and more -- to actively participate in driving real-world impact using ML.

Paper Structure (74 sections, 4 equations, 10 figures, 9 tables, 1 algorithm)

This paper contains 74 sections, 4 equations, 10 figures, 9 tables, 1 algorithm.

Figures (10)

  • Figure 1: Illustrative examples of potential data issues in real-world healthcare scenarios, highlighting challenges at various levels and demonstrating how the current LLM co-pilot struggles to address these issues.
  • Figure 2: Addressing real data challenges is complex and requires multi-step reasoning.
  • Figure 3: The overall architecture and workflow of CliMB-DC, which primarily consists of three entities that interact with the evolving state bank: (i) a coordinator agent responsible for reasoning and planning, (ii) a worker agent for code writing and execution, and (iii) the user or human experts.
  • Figure 4: Challenges of Monte Carlo Tree Search (MCTS). We highlight two key drawbacks of MCTS. First, prediction performance cannot serve as a reliable reward, as it may favor data issues such as label leakage or meaningless problem setups (middle). Second, MCTS suffers from low efficiency, requiring experts to endure long waiting times and evaluate a large number of trials (right). In contrast, CliMB-DC's proposed reasoning approach enables immediate backtracking and replanning, significantly enhancing efficiency.
  • Figure 5: The framework of the coordinator agent in CliMB-DC, encompassing three parts named State Observation (SO), Backtracking Assessment (BA), and Lookahead Planning (LP).
  • ...and 5 more figures
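
The Figure 4 caption notes that prediction performance is not a reliable reward because it can favor data issues such as label leakage. A minimal sketch of that failure mode follows, using only the Python standard library. The dataset, the feature names (`weak`, `leaked`), and the helper functions are hypothetical illustrations and are not taken from the paper or from the CliMB-DC toolkit.

```python
import random


def make_dataset(n=200, seed=0):
    """Build synthetic rows with one noisy feature and one leaked feature."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        rows.append({
            # Legitimate but weak signal: the label plus heavy Gaussian noise.
            "weak": label + rng.gauss(0, 2.0),
            # Leaked feature: a direct copy of the label, standing in for a
            # field that is only recorded after the outcome is known.
            "leaked": label,
            "label": label,
        })
    return rows


def threshold_accuracy(rows, feature, threshold=0.5):
    """Accuracy of the rule 'predict 1 when feature > threshold'."""
    correct = sum((r[feature] > threshold) == bool(r["label"]) for r in rows)
    return correct / len(rows)


def flag_suspected_leakage(rows, features, max_plausible_acc=0.99):
    """Return features whose single-feature accuracy is implausibly high."""
    return [f for f in features
            if threshold_accuracy(rows, f) >= max_plausible_acc]


rows = make_dataset()
print(threshold_accuracy(rows, "weak"))    # modest accuracy
print(threshold_accuracy(rows, "leaked"))  # perfect accuracy, but meaningless
print(flag_suspected_leakage(rows, ["weak", "leaked"]))
```

A search procedure that maximizes accuracy alone would prefer the pipeline that keeps `leaked`, which is why a score-driven reward needs a plausibility check or expert review before it is trusted.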

Theorems & Definitions (1)

  • Example 1: Demonstration of the reasoning process.
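
The captions for Figures 3 and 5 describe a coordinator that performs State Observation, Backtracking Assessment, and Lookahead Planning, a worker that executes steps, and a human expert, all interacting through a shared state bank. The toy loop below is one possible reading of that workflow, not the authors' implementation. Every name here (`StateBank`, `worker_execute`, `observe`, `should_backtrack`, `replan`, `run`) is an assumption, and the scores are placeholder values rather than results from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class StateBank:
    """Shared record of accepted steps and flagged issues (hypothetical)."""
    history: list = field(default_factory=list)  # (step, result) pairs
    issues: list = field(default_factory=list)   # (step, issue) pairs


def worker_execute(step, bank):
    """Stand-in for the worker agent, which would write and run code."""
    if step == "train_model":
        cleaned = any(s == "drop_suspect_feature" for s, _ in bank.history)
        # Placeholder scores: implausibly perfect until the feature is dropped.
        return {"score": 0.81 if cleaned else 1.0}
    return {"ok": True}


def observe(result):
    """State Observation: summarize the result and flag anomalies."""
    return "suspicious_score" if result.get("score", 0) >= 0.99 else None


def should_backtrack(issue, ask_expert):
    """Backtracking Assessment: defer to the human expert on flagged issues."""
    return issue is not None and ask_expert(issue)


def replan(remaining, failed_step):
    """Lookahead Planning: insert a corrective step, then retry."""
    return ["drop_suspect_feature", failed_step] + remaining


def run(plan, ask_expert, max_steps=10):
    bank = StateBank()
    queue = list(plan)
    for _ in range(max_steps):
        if not queue:
            break
        step = queue.pop(0)
        result = worker_execute(step, bank)
        issue = observe(result)
        if should_backtrack(issue, ask_expert):
            bank.issues.append((step, issue))
            queue = replan(queue, step)
            continue  # discard the rejected result and replan immediately
        bank.history.append((step, result))
    return bank


bank = run(["load_data", "train_model"], ask_expert=lambda issue: True)
print([step for step, _ in bank.history])
```

In this sketch the expert is consulted only when an observation is flagged, so a rejected step triggers replanning right away instead of after a full search over alternative pipelines.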