InData: Towards Secure Multi-Step, Tool-Based Data Analysis

Karthikeyan K, Raghuveer Thirukovalluru, Bhuwan Dhingra, David Edwin Carlson

TL;DR

This work addresses privacy-sensitive data analysis by replacing direct code generation with a secure, tool-based interface. It introduces InData, a dataset designed to evaluate multi-step, tool-based reasoning under strict data-access constraints, using a fixed set of vetted tools to interact with tabular data. Benchmarking 15 open-source LLMs reveals that models excel at simple tool use and code generation but struggle with complex, multi-step tool reasoning, with performance strongly dependent on model size and access to the full toolset. The authors analyze factors that influence performance, such as hints and tool sufficiency, and propose future directions (planning stages, supervision, and on-prem deployment) to advance robust tool-based reasoning in LLMs.

Abstract

Large language model agents for data analysis typically generate and execute code directly on databases. However, when applied to sensitive data, this approach poses significant security risks. To address this issue, we propose a security-motivated alternative: bar LLMs from generating code or accessing data directly, and require them to interact with data exclusively through a predefined set of secure, verified tools. Although recent tool-use benchmarks exist, they primarily target tool selection and simple execution rather than the compositional, multi-step reasoning needed for complex data analysis. To narrow this gap, we introduce Indirect Data Engagement (InData), a dataset designed to assess LLMs' multi-step tool-based reasoning ability. InData includes data analysis questions at three difficulty levels (Easy, Medium, and Hard) that capture increasing reasoning complexity. We benchmark 15 open-source LLMs on InData and find that while large models (e.g., gpt-oss-120b) achieve high accuracy on Easy tasks (97.3%), performance drops sharply on Hard tasks (69.6%). These results show that current LLMs still lack robust multi-step tool-based reasoning ability. With InData, we take a step toward enabling the development and evaluation of LLMs with stronger multi-step tool-use capabilities. We will publicly release the dataset and code.
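
To make the tool-based setting concrete, the following is a minimal Python sketch of how a fixed tool registry might sit between a model and a sensitive table. It is an illustration under stated assumptions, not the InData toolset: the tool names (filter_rows, column_mean, count_rows), the plan format, the run_plan helper, and the sample rows are all hypothetical.

```python
from statistics import mean

# Hypothetical sensitive table. In the setting sketched here, the model
# never sees these rows; it only sees the final aggregate result.
_TABLE = [
    {"age": 34, "dept": "cardiology", "cost": 1200.0},
    {"age": 51, "dept": "oncology", "cost": 5400.0},
    {"age": 47, "dept": "cardiology", "cost": 800.0},
]


def filter_rows(rows, column, value):
    """Keep rows where `column` equals `value`; the result stays server-side."""
    return [r for r in rows if r[column] == value]


def column_mean(rows, column):
    """Return the mean of a numeric column over the current rows."""
    return mean(r[column] for r in rows)


def count_rows(rows):
    """Return the number of rows in the current intermediate result."""
    return len(rows)


# The fixed, vetted toolset (names are assumptions for this sketch).
TOOLS = {
    "filter_rows": filter_rows,
    "column_mean": column_mean,
    "count_rows": count_rows,
}


def run_plan(plan):
    """Run a multi-step plan: each tool call consumes the previous result."""
    state = _TABLE
    for step in plan:
        name = step["tool"]
        if name not in TOOLS:
            # Anything outside the registry, including arbitrary code, is refused.
            raise PermissionError(f"tool not allowed: {name}")
        state = TOOLS[name](state, **step.get("args", {}))
    if isinstance(state, list):
        # Raw rows must not leave the barrier; a plan has to end in an aggregate.
        raise PermissionError("plan must end with an aggregate tool")
    return state


# A two-step question: "What is the mean cost in cardiology?"
plan = [
    {"tool": "filter_rows", "args": {"column": "dept", "value": "cardiology"}},
    {"tool": "column_mean", "args": {"column": "cost"}},
]
print(run_plan(plan))
```

In this sketch the model's only output is the plan, a list of tool names and arguments. Composing several such calls in the right order is the kind of multi-step reasoning the abstract describes, here reduced to two steps.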

Paper Structure

This paper contains 34 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Predefined tools act as a secure barrier between the LLM and sensitive data.
  • Figure 2: Average number of turns per question with and without tool calls.