LLM-Aided Customizable Profiling of Code Data Based On Programming Language Concepts

Pankaj Thorat, Adnan Qidwai, Adrija Dhar, Aishwariya Chakraborty, Anand Eswaran, Hima Patel, Praveen Jayachandran

TL;DR

The paper addresses the challenge of profiling code datasets for code-focused Large Language Models by introducing a hybrid offline-online framework that converts diverse multilingual code into a language-agnostic Uniform Base Syntactic Representation (UBSR). The offline phase uses LLMs to learn deterministic rules for base syntactic and semantic concepts, while the online phase applies these rules deterministically for real-time profiling, balancing accuracy and cost. Key contributions include the UBSR framework, a tunable higher-order and semantic concept profiler, and a cost-sensitive design that scales across more than 200 languages. Empirical results include a mean syntactic rule accuracy of 90.33% and semantic classification accuracies averaging 80% across languages and 77% across semantic concepts, along with significant token reduction, supporting the approach for data valuation, curation, and multilingual code data management.

Abstract

Data profiling is critical in machine learning for generating descriptive statistics, supporting both deeper understanding and downstream tasks like data valuation and curation. This work addresses profiling specifically in the context of code datasets for Large Language Models (code-LLMs), where data quality directly influences tasks such as code generation and summarization. Characterizing code datasets in terms of programming language concepts enables better insights and targeted data curation. Our proposed methodology decomposes code data profiling into two phases: (1) an offline phase where LLMs are leveraged to derive and learn rules for extracting syntactic and semantic concepts across various programming languages, including previously unseen or low-resource languages, and (2) an online deterministic phase applying these derived rules for efficient real-time analysis. This hybrid approach is customizable, extensible to new syntactic and semantic constructs, and scalable to multiple languages. Experimentally, our LLM-aided method achieves a mean accuracy of 90.33% for syntactic extraction rules and semantic classification accuracies averaging 80% and 77% across languages and semantic concepts, respectively.
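
The two-phase split described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `RULES` table stands in for whatever the offline LLM phase would emit, the node-type strings and concept names are hypothetical, and `profile` is a hypothetical helper. The point is only that the online phase reduces to a deterministic lookup with no LLM call.

```python
from collections import Counter

# Hypothetical rule table standing in for the output of the offline LLM phase.
# Keys are (language, parser node type); values are language-agnostic concept
# names. All entries are illustrative assumptions, not the paper's rules.
RULES = {
    ("python", "import_statement"): "import",
    ("python", "function_definition"): "function_declaration",
    ("java", "import_declaration"): "import",
    ("java", "method_declaration"): "function_declaration",
}


def profile(language, node_types, rules=RULES):
    """Online phase: count concepts by deterministic rule lookup (no LLM call)."""
    counts = Counter()
    for node_type in node_types:
        concept = rules.get((language, node_type))
        if concept is not None:  # node types without a rule are skipped
            counts[concept] += 1
    return dict(counts)


if __name__ == "__main__":
    # Two files in different languages map onto the same concept vocabulary.
    print(profile("python", ["import_statement", "function_definition", "comment"]))
    print(profile("java", ["import_declaration", "method_declaration"]))
```

In this sketch, supporting a new language or concept means adding entries to the rule table, which is one way to read the abstract's claim that the approach is customizable and extensible.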

Paper Structure

This paper contains 27 sections, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: Proposed Code Profiler Step that is Useful Across Data and Model Lifecycle.
  • Figure 2: High Level System Design.
  • Figure 3: AST-based Syntactic Variants that Represent the Same Concept.
  • Figure 4: End-to-end Workflow of the Proposed Code Data Profiling Framework.
  • Figure 5: Rule Generation Accuracy with Prompt and Test Examples from (a) Same Paradigm (b) Cross Paradigm.
  • ...and 4 more figures