DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval

Maojun Sun; Yue Wu; Yifei Xie; Ruijian Han; Binyan Jiang; Defeng Sun; Yancheng Yuan; Jian Huang

TL;DR

This work proposes DARE (Distribution-Aware Retrieval Embedding), a lightweight, plug-and-play retrieval model that incorporates data distribution information into function representations for R package retrieval and helps narrow the gap between LLM automation and the mature R statistical ecosystem.

Abstract

Large Language Model (LLM) agents can automate data-science workflows, but many rigorous statistical methods implemented in R remain underused because LLMs struggle with statistical knowledge and tool retrieval. Existing retrieval-augmented approaches focus on function-level semantics and ignore the distribution of the data being analyzed, which produces suboptimal matches. We propose DARE (Distribution-Aware Retrieval Embedding), a lightweight, plug-and-play retrieval model that incorporates data distribution information into function representations for R package retrieval. Our main contributions are: (i) RPKB, a curated R Package Knowledge Base derived from 8,191 high-quality CRAN packages; (ii) DARE, an embedding model that fuses distributional features with function metadata to improve retrieval relevance; and (iii) RCodingAgent, an R-oriented LLM agent for reliable R code generation, together with a suite of statistical analysis tasks for systematically evaluating LLM agents in realistic analytical scenarios. Empirically, DARE achieves an NDCG@10 of 93.47%, outperforming state-of-the-art open-source embedding models by up to 17% on package retrieval while using substantially fewer parameters. Integrating DARE into RCodingAgent yields significant gains on downstream analysis tasks. This work helps narrow the gap between LLM automation and the mature R statistical ecosystem.
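For readers unfamiliar with the retrieval metric quoted above, the following is a minimal sketch of the standard NDCG@k definition. It is not the authors' evaluation code: the function names `dcg_at_k` and `ndcg_at_k`, the graded-relevance dictionary, and the example query results are illustrative assumptions, and the paper's exact evaluation protocol is not described in this section.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (rank 1 has discount log2(2) = 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k for one query.

    ranked_ids: retrieved function identifiers, best first.
    relevance:  dict mapping relevant identifiers to a positive gain
                (hypothetical labels; unlisted identifiers count as 0).
    """
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    if ideal == 0:
        return 0.0  # no relevant items for this query
    return dcg_at_k(gains, k) / ideal


# Illustrative query: the single relevant function is retrieved at rank 2.
ranked = ["stats::t.test", "stats::wilcox.test", "stats::lm"]
labels = {"stats::wilcox.test": 1.0}
print(round(ndcg_at_k(ranked, labels, k=10), 4))  # 0.6309
```

A reported score such as 93.47% would typically be this per-query value averaged over all test queries, so a higher score means relevant functions tend to appear nearer the top of the ranked list.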

Paper Structure (28 sections, 9 equations, 7 figures, 2 tables)

Introduction
Related Works
Methodology
Database Construction
Problem Formulation
DARE Modeling
RCodingAgent: DARE-Augmented Agentic Data Analysis
Evaluating LLM Agents for Statistical Programming in R
Experiments
Experimental Setup
Evaluation Metrics
Experimental Results
Performance on Retrieval
Inference Efficiency Analysis
Impact on Agentic Data Analysis
...and 13 more sections

Figures (7)

Figure 1: Comparison of traditional semantic function search methods and the Distribution-Aware Retrieval Embedding (DARE) method.
Figure 2: The overall framework of the DARE training process.
Figure 3: An example of RCodingAgent for realistic statistical analysis.
Figure 4: Upper panel: Pipeline for constructing R-based statistical evaluation tasks. Lower panel: Overview of selected domains and R packages covered in the benchmark.
Figure 5: Results of QPS and Latency.
...and 2 more figures
