ConCodeEval: Evaluating Large Language Models for Code Constraints in Domain-Specific Languages

Mehant Kammakomati, Sameer Pimparkhede, Srikanth Tamilselvam, Prince Kumar, Pushpak Bhattacharyya

TL;DR

ConCodeEval systematically evaluates large language models on code constraints within domain-specific languages used in system-level programming. The benchmark comprises two tasks (Data as Code Generation and DSL Validation) across $5$ representations and $3$ output formats, totaling $15$ use-case combinations, with a dataset of $602$ schemas per representation. Key findings show NL representations excel in generation while JSON and YAML robustly support constraint adherence; model performance does not strictly track pre-training data presence, and constraint locality significantly affects adherence, with end-of-schema constraints being most reliable. The work highlights practical implications for prompt design and schema representation in enterprise settings and motivates future improvements in constrained decoding and tokenizer/representation strategies to enhance LLM controllability over code constraints.

Abstract

Recent work shows that Large Language Models (LLMs) struggle to understand natural language constraints for various text generation tasks in zero- and few-shot settings. In the code domain, meanwhile, constraints expressed in code format are widely used to maintain the integrity of code written in Domain-Specific Languages (DSLs) such as JSON and YAML, which are common in system-level programming tasks in enterprises. Given that LLMs are increasingly used for system-level code tasks, evaluating whether they can comprehend these code constraints is crucial. However, no work has been done to evaluate their controllability over code constraints. Hence, we introduce ConCodeEval, a first-of-its-kind benchmark with two novel tasks for code constraints across five representations. Our findings suggest that language models struggle with code constraints. Code languages that perform excellently for normal code tasks do not perform well when the same languages represent fine-grained constraints.
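
To make "code constraints" concrete, the following is a minimal sketch of a schema written in JSON Schema style, plus a check of two data samples against it. The field names (`name`, `replicas`), the limits, and the `violations` helper are hypothetical illustrations and are not taken from the benchmark; the helper handles only a small subset of keywords and is not a full validator.

```python
import json

# Hypothetical schema in JSON Schema style; fields and limits are illustrative only.
SCHEMA = json.loads("""
{
  "type": "object",
  "required": ["name", "replicas"],
  "properties": {
    "name": {"type": "string", "maxLength": 8},
    "replicas": {"type": "integer", "minimum": 1, "maximum": 5}
  }
}
""")


def violations(sample, schema):
    """Return violated constraints as 'field: keyword' strings.

    Covers only required, maxLength, minimum and maximum.
    """
    found = []
    for key in schema.get("required", []):
        if key not in sample:
            found.append(f"{key}: required")
    for key, rules in schema.get("properties", {}).items():
        if key not in sample:
            continue
        value = sample[key]
        if isinstance(value, str):
            if "maxLength" in rules and len(value) > rules["maxLength"]:
                found.append(f"{key}: maxLength")
        if isinstance(value, (int, float)):
            if "minimum" in rules and value < rules["minimum"]:
                found.append(f"{key}: minimum")
            if "maximum" in rules and value > rules["maximum"]:
                found.append(f"{key}: maximum")
    return found


valid_sample = {"name": "web", "replicas": 3}
invalid_sample = {"name": "frontend-service", "replicas": 0}

print(violations(valid_sample, SCHEMA))
print(violations(invalid_sample, SCHEMA))
```

A generation task in this spirit would ask a model to produce a sample like `valid_sample` from the schema, while a validation task would ask whether a sample such as `invalid_sample` satisfies it.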

Paper Structure (55 sections, 2 figures, 6 tables)

Figures (2)

  • Figure 1: Uniform trend of steep decline in performance across models for constraints positioned in the middle and beginning of the JSON schema context and output in the data as code generation experiment. We divide the schema into $3$ portions (Begin, Middle, and End) and place each violated constraint into one of these three buckets based on its locality.
  • Figure 2: Attention maps for the Llama 3 8B and 70B models in the data validation experiment. The more intense the color, the more attention the model gives to that part of the input.
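
The Begin/Middle/End bucketing described in the Figure 1 caption can be sketched as follows. This is an illustration under stated assumptions: the source does not say whether the schema is split by lines, tokens, or characters, so the sketch splits by line count into equal thirds, and the function names `locality_bucket` and `count_by_bucket` are hypothetical.

```python
from collections import Counter


def locality_bucket(line_index, total_lines):
    """Map a 0-based line index to Begin, Middle or End (equal thirds by line count)."""
    if total_lines <= 0 or not 0 <= line_index < total_lines:
        raise ValueError("line_index must lie inside the schema")
    third = total_lines / 3
    if line_index < third:
        return "Begin"
    if line_index < 2 * third:
        return "Middle"
    return "End"


def count_by_bucket(schema_text, violated_lines):
    """Count violated constraints per bucket, given the schema lines they sit on."""
    total = len(schema_text.splitlines())
    counts = Counter({"Begin": 0, "Middle": 0, "End": 0})
    for idx in violated_lines:
        counts[locality_bucket(idx, total)] += 1
    return dict(counts)


# A 9-line placeholder schema; violated constraints sit on lines 1, 4, 5 and 8.
schema_text = "\n".join(f"line {i}" for i in range(9))
print(count_by_bucket(schema_text, [1, 4, 5, 8]))
```

Aggregating such counts over many schemas gives a per-bucket violation profile of the kind the figure plots.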