Programming Languages
Language design, compilers, program analysis, and software engineering
2601.04124
Issue Tracking Systems (ITSs) enable software developers and managers to collect and resolve issues collaboratively. While researchers have extensively analysed ITS data to automate or assist specific activities such as issue assignments, duplicate detection, or priority prediction, developer studies on ITSs remain rare. Particularly, little is known about the challenges Software Engineering (SE) teams encounter in ITSs and when certain practices and workarounds (such as leaving issue fields like "priority" empty) are considered problematic. To fill this gap, we conducted an in-depth interview study with 26 experienced SE practitioners from different organisations and industries. We asked them about general problems encountered, as well as the relevance of 31 ITS smells (aka potentially problematic practices) discussed in the literature. By applying Thematic Analysis to the interview notes, we identified 14 common problems including issue findability, zombie issues, workflow bloat, and lack of workflow enforcement. Participants also stated that many of the ITS smells do not occur or are not problematic. Our results suggest that ITS problems and smells are highly dependent on context factors such as ITS configuration, workflow stage, and team size. We also discuss potential tooling solutions to configure, monitor, and visualise ITS smells to cope with these challenges.
Existing code similarity metrics, such as BLEU, CodeBLEU, and TSED, largely rely on surface-level string overlap or abstract syntax tree structures, and often fail to capture deeper semantic relationships between programs. We propose CSSG (Code Similarity using Semantic Graphs), a novel metric that leverages program dependence graphs to explicitly model control dependencies and variable interactions, providing a semantics-aware representation of code. Experiments on the CodeContests+ dataset show that CSSG consistently outperforms existing metrics in distinguishing more similar code from less similar code under both monolingual and cross-lingual settings, demonstrating that dependency-aware graph representations offer a more effective alternative to surface-level or syntax-based similarity measures.
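The abstract does not spell out how CSSG scores a pair of dependence graphs; purely as an illustration of the general idea, the sketch below compares two programs' dependence graphs via a Jaccard overlap of labelled dependence edges. The edge representation and the scoring are assumptions, not CSSG's actual formulation.

```python
# Illustrative sketch: comparing two program dependence graphs by the overlap
# of their labelled dependence edges. Node and edge labels are hypothetical.

def pdg_edge_set(edges):
    """Normalise a dependence graph into a set of (source, target, kind) triples."""
    return {(src, dst, kind) for src, dst, kind in edges}

def dependence_similarity(pdg_a, pdg_b):
    """Jaccard overlap of labelled dependence edges (1.0 = identical edge sets)."""
    a, b = pdg_edge_set(pdg_a), pdg_edge_set(pdg_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Toy example: same control dependency, different data flow.
pdg_1 = [("if_x", "assign_y", "control"), ("read_x", "assign_y", "data")]
pdg_2 = [("if_x", "assign_y", "control"), ("read_z", "assign_y", "data")]
print(dependence_similarity(pdg_1, pdg_2))  # 0.333...
```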
In operational technology (OT) contexts, containerised applications often require elevated privileges to access low-level network interfaces or perform administrative tasks such as application monitoring. These privileges reduce the default isolation provided by containers and introduce significant security risks. Security risk identification for OT container deployments is challenged by hybrid IT/OT architectures, fragmented stakeholder knowledge, and continuous system changes. Existing approaches lack reproducibility, interpretability across contexts, and technical integration with deployment artefacts. We propose a model-based approach, implemented as the Container Security Risk Ontology (CSRO), which integrates five key domains: adversarial behaviour, contextual assumptions, attack scenarios, risk assessment rules, and container security artefacts. Our evaluation of CSRO in a case study demonstrates that the end-to-end formalisation of risk calculation, from artefact to risk level, enables automated and reproducible risk identification. While CSRO currently focuses on technical, container-level treatment measures, its modular and flexible design provides a solid foundation for extending the approach to host-level and organisational risk factors.
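As a rough illustration of the "artefact to risk level" formalisation, the sketch below maps a hypothetical container deployment spec to a coarse risk level via hand-written rules. The attribute names, weights, and thresholds are invented for illustration and are not CSRO's actual vocabulary or rule base.

```python
# Hypothetical rule-based mapping from container deployment artefacts to a
# coarse risk level, in the spirit of an automated, reproducible assessment.

def assess_container_risk(spec: dict) -> str:
    """Return a coarse risk level for a container deployment spec."""
    score = 0
    if spec.get("privileged"):                    # broad host access
        score += 3
    if spec.get("host_network"):                  # low-level network interfaces
        score += 2
    if "SYS_ADMIN" in spec.get("capabilities", []):
        score += 2
    if spec.get("runs_as_root", True):
        score += 1
    return "high" if score >= 4 else "medium" if score >= 2 else "low"

ot_monitor = {"privileged": False, "host_network": True,
              "capabilities": ["NET_RAW"], "runs_as_root": True}
print(assess_container_risk(ot_monitor))  # "medium" (score 3)
```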
2601.03988
Background: Extracting the stages that structure Machine Learning (ML) pipelines from source code is key for gaining a deeper understanding of data science practices. However, the diversity caused by the constant evolution of the ML ecosystem (e.g., algorithms, libraries, datasets) makes this task challenging. Existing approaches either depend on non-scalable, manual labeling, or on ML classifiers that do not properly support the diversity of the domain. These limitations highlight the need for more flexible and reliable solutions. Objective: We evaluate whether Small Language Models (SLMs) can leverage their code understanding and classification abilities to address these limitations, and subsequently how they can advance our understanding of data science practices. Method: We conduct a confirmatory study based on two reference works selected for their relevance regarding current state-of-the-art's limitations. First, we compare several SLMs using Cochran's Q test. The best-performing model is then evaluated against the reference studies using two distinct McNemar's tests. We further analyze how variations in taxonomy definitions affect performance through an additional Cochran's Q test. Finally, a goodness-of-fit analysis is conducted using Pearson's chi-squared tests to compare our insights on data science practices with those from prior studies.
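For readers unfamiliar with the statistical machinery, the sketch below shows a continuity-corrected McNemar's test over paired per-item correctness, the kind of comparison described between the best SLM and a reference classifier. The data is made up, and the actual study may use a different variant (e.g., the exact test).

```python
# Sketch of a continuity-corrected McNemar's test on paired correctness vectors.
from scipy.stats import chi2

def mcnemar_test(model_a_correct, model_b_correct):
    """Compare two classifiers on the same items via their disagreements."""
    b = sum(1 for x, y in zip(model_a_correct, model_b_correct) if x and not y)
    c = sum(1 for x, y in zip(model_a_correct, model_b_correct) if not x and y)
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return stat, chi2.sf(stat, df=1)  # p-value from chi-squared with 1 df

slm = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]   # made-up per-snippet correctness
ref = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
print(mcnemar_test(slm, ref))           # disagreements: b=3, c=0
```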
We present an approach to implement binary search trees in the rule-based graph programming language GP 2. Our implementation uses GP 2's rooted graph transformation rules for efficiency and supports insertion, deletion and query operations. We argue that the worst-case runtime for each of the operations is O(n) for a tree with n nodes. In addition, we expect that, on average, the operations run in time O(log(n)). Hence the implementation would match the time complexity of binary search tree implementations in imperative languages.
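The GP 2 rules themselves are not shown in the abstract; for reference, here is a plain Python binary search tree with the same three operations and the complexity profile being matched: O(n) in the worst case (a degenerate, list-shaped tree) and O(log n) on average.

```python
# Reference binary search tree with insert, query, and delete.

class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def query(root, key):
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root is not None

def delete(root, key):
    if root is None:
        return None
    if key < root.key:
        root.left = delete(root.left, key)
    elif key > root.key:
        root.right = delete(root.right, key)
    else:
        if root.left is None:
            return root.right
        if root.right is None:
            return root.left
        succ = root.right                 # in-order successor
        while succ.left is not None:
            succ = succ.left
        root.key = succ.key
        root.right = delete(root.right, succ.key)
    return root
```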
Large Language Models (LLMs) are increasingly integrated into software development workflows, yet their behavior in structured, specification-driven processes remains poorly understood. This paper presents an empirical study design using CURRANTE, a Visual Studio Code extension that enables a human-in-the-loop workflow for LLM-assisted code generation. The tool guides developers through three sequential stages--Specification, Tests, and Function--allowing them to define requirements, generate and refine test suites, and produce functions that satisfy those tests. Participants will solve medium-difficulty problems from the LiveCodeBench dataset, while the tool records fine-grained interaction logs, effectiveness metrics (e.g., pass rate, all-pass completion), efficiency indicators (e.g., time-to-pass), and iteration behaviors. The study aims to analyze how human intervention in specification and test refinement influences the quality and dynamics of LLM-generated code. The results will provide empirical insights into the design of next-generation development environments that align human reasoning with model-driven code generation.
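As a sketch of how such metrics could be derived, the snippet below computes pass rate, all-pass completion, and time-to-pass from a hypothetical per-session event log. CURRANTE's actual log schema is not described in the abstract, so the field names here are assumptions.

```python
# Hypothetical derivation of effectiveness and efficiency metrics from a
# chronological interaction log; field names are illustrative only.

def summarise_session(events):
    """events: chronological list of {'t': seconds, 'passed': int, 'total': int}."""
    final = events[-1]
    pass_rate = final["passed"] / final["total"]
    all_pass = final["passed"] == final["total"]
    # time-to-pass: first moment at which all generated tests pass, if any
    time_to_pass = next((e["t"] for e in events if e["passed"] == e["total"]), None)
    return {"pass_rate": pass_rate, "all_pass": all_pass,
            "time_to_pass": time_to_pass, "iterations": len(events)}

log = [{"t": 120, "passed": 2, "total": 5},
       {"t": 340, "passed": 4, "total": 5},
       {"t": 610, "passed": 5, "total": 5}]
print(summarise_session(log))
```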
2601.03857
LLMs are increasingly used to boost productivity and support software engineering tasks. However, when applied to socially sensitive decisions such as team composition and task allocation, they raise concerns of fairness. Prior studies have revealed that LLMs may reproduce stereotypes; however, these analyses remain exploratory and examine sensitive attributes in isolation. This study investigates whether LLMs exhibit bias in team composition and task assignment by analyzing the combined effects of candidates' country and pronouns. Using three LLMs and 3,000 simulated decisions, we find systematic disparities: demographic attributes significantly shaped both selection likelihood and task allocation, even when accounting for expertise-related factors. Task distributions further reflected stereotypes, with technical and leadership roles unevenly assigned across groups. Our findings indicate that LLMs exacerbate demographic inequities in software engineering contexts, underscoring the need for fairness-aware assessment.
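A minimal sketch of the kind of analysis implied above: selection rates broken down by combined country and pronoun groups over simulated decisions. The data, group labels, and the simple max-minus-min disparity measure are illustrative assumptions, not the paper's protocol.

```python
# Group-wise selection rates over simulated LLM decisions (toy data).
from collections import defaultdict

def selection_rates(decisions):
    """decisions: list of {'country': ..., 'pronouns': ..., 'selected': bool}."""
    counts = defaultdict(lambda: [0, 0])          # group -> [selected, total]
    for d in decisions:
        group = (d["country"], d["pronouns"])
        counts[group][0] += int(d["selected"])
        counts[group][1] += 1
    return {g: sel / tot for g, (sel, tot) in counts.items()}

sim = [{"country": "A", "pronouns": "she/her", "selected": True},
       {"country": "A", "pronouns": "he/him", "selected": True},
       {"country": "B", "pronouns": "she/her", "selected": False}]
rates = selection_rates(sim)
disparity = max(rates.values()) - min(rates.values())  # crude gap measure
print(rates, disparity)
```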
We present a framework for synthesising formulas in first-order logic (FOL) from examples, which unifies and advances state-of-the-art approaches for inference of transition system invariants. To do so, we study and categorise the existing methodologies, encoding their formula-synthesis techniques via answer set programming (ASP). Based on the derived categorisation, we propose orthogonal slices, a new technique for formula enumeration that partitions the search space into manageable chunks, enabling two approaches for incremental candidate pruning. Using a combination of existing techniques for first-order (FO) invariant synthesis and the orthogonal slices implemented in our framework FORCE, we significantly accelerate a state-of-the-art algorithm for distributed system invariant inference. We also show that our approach facilitates composition of different invariant inference frameworks, allowing for novel optimisations.
2601.03836
Logic programming languages present clear advantages in terms of declarativeness and conciseness. However, the ideas of logic programming have been met with resistance in other programming communities, and have not generally been adopted by other paradigms and languages. This paper proposes a novel way to incorporate logic programming in an existing codebase in a typed functional programming language. Our approach integrates with the host language without sacrificing static typing, and leverages strengths of typed functional programming such as polymorphism and higher-order functions. We do so by combining three ideas. First, we use the extensible types technique to allow values of the host language to contain logic variables. Second, we implement a unification algorithm that works for any data structure that supports certain operations. Third, we introduce a domain-specific language to define and query predicates. We demonstrate our proposal via a series of examples, and provide aids to make the notation convenient for users, showing that the proposed approach is not just technically possible but also practical. Our ideas have been implemented in the language Haskell with very good results.
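The paper's implementation is in Haskell using extensible types; as a language-agnostic illustration of the second ingredient, here is a minimal unification sketch in Python, with logic variables as a small class and terms as nested tuples (occurs check omitted for brevity).

```python
# Minimal first-order unification over tuples and logic variables.

class Var:
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return f"?{self.name}"

def walk(term, subst):
    """Follow variable bindings until a non-variable or unbound variable."""
    while isinstance(term, Var) and term in subst:
        term = subst[term]
    return term

def unify(a, b, subst=None):
    """Return an extended substitution dict, or None if a and b do not unify."""
    subst = {} if subst is None else subst
    a, b = walk(a, subst), walk(b, subst)
    if a is b or a == b:
        return subst
    if isinstance(a, Var):
        return {**subst, a: b}
    if isinstance(b, Var):
        return {**subst, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            subst = unify(x, y, subst)
            if subst is None:
                return None
        return subst
    return None

X, Y = Var("x"), Var("y")
print(unify(("parent", X, "bob"), ("parent", "alice", Y)))  # {?x: 'alice', ?y: 'bob'}
```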
Large Language Models (LLMs) such as GPT-4, Claude and LLaMA have shown impressive performance in code generation, typically evaluated using benchmarks (e.g., HumanEval). However, effective code generation requires models to understand and apply a wide range of language concepts. If the concepts exercised in benchmarks are not representative of those used in real-world projects, evaluations may yield an incomplete picture. Despite this concern, the representativeness of code concepts in benchmarks has not been systematically examined. To address this gap, we present the first empirical study that analyzes the representativeness of code generation benchmarks through the lens of Knowledge Units (KUs) - cohesive sets of programming language capabilities provided by language constructs and APIs. We analyze KU coverage in two widely used Python benchmarks, HumanEval and MBPP, and compare them with 30 real-world Python projects. Our results show that each benchmark covers only half of the 20 identified KUs, whereas projects exercise all KUs with relatively balanced distributions. In contrast, benchmark tasks exhibit highly skewed KU distributions. To mitigate this misalignment, we propose a prompt-based LLM framework that synthesizes KU-based tasks to rebalance benchmark KU distributions and better align them with real-world usage. Using this framework, we generate 440 new tasks and augment existing benchmarks. The augmented benchmarks substantially improve KU coverage and achieve over a 60% improvement in distributional alignment. Evaluations of state-of-the-art LLMs on these augmented benchmarks reveal consistent and statistically significant performance drops (12.54-44.82%), indicating that existing benchmarks overestimate LLM performance due to their limited KU coverage. Our findings provide actionable guidance for building more realistic evaluations of LLM code-generation capabilities.
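To make the two quantities concrete, the sketch below computes KU coverage (the fraction of KUs exercised at least once) and a distributional alignment score between a benchmark's KU distribution and the one observed in projects. The alignment measure used here (1 minus the Jensen-Shannon distance) is an assumption, since the paper's exact metric is not given in the abstract.

```python
# KU coverage and an assumed distributional alignment score (toy counts,
# 10 KUs for brevity rather than the paper's 20).
import numpy as np
from scipy.spatial.distance import jensenshannon

def ku_coverage(ku_counts):
    """Fraction of KUs that appear at least once."""
    return sum(1 for c in ku_counts if c > 0) / len(ku_counts)

def ku_alignment(benchmark_counts, project_counts):
    """1 - Jensen-Shannon distance between normalised KU distributions."""
    p = np.asarray(benchmark_counts, dtype=float); p /= p.sum()
    q = np.asarray(project_counts, dtype=float); q /= q.sum()
    return 1.0 - jensenshannon(p, q)

bench = [40, 35, 10, 5, 0, 0, 0, 0, 0, 0]      # skewed, half the KUs unused
projects = [12, 10, 11, 9, 10, 8, 11, 9, 10, 10]
print(ku_coverage(bench), ku_alignment(bench, projects))
```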
2601.03768
Proof engineering is notoriously labor-intensive: proofs that are straightforward on paper often require lengthy scripts in theorem provers. Recent advances in large language models (LLMs) create new opportunities for proof automation: modern LLMs not only generate proof scripts, but also support agentic behavior, exploring codebases and iteratively refining their outputs against prover feedback. These advances enable an emerging scheme where LLM-based agents undertake most proof engineering under human guidance. Humans provide mathematical insight (definitions, theorems, proof strategies); agents handle the mechanical work of proof development. We call this scheme agentic proof automation. We present this scheme through a case study: mechanizing the semantic type soundness of a sophisticated formal system, System Capless, in Lean 4, comprising over 14,000 lines of code. Using off-the-shelf LLM agents with a single lightweight proof-checking tool, the agents completed 189 proof engineering tasks with an 87% success rate, only 16% requiring human intervention. The case study demonstrates that agents are capable proof engineers that substantially boost productivity, though they fall short in creative reasoning and still require human guidance in certain cases. We release an interactive explorer where readers can examine all agent interactions; the mechanization is open-sourced for experiments and extensions.
As large language models (LLMs) evolve into autonomous agents, evaluating repository-level reasoning, the ability to maintain logical consistency across massive, real-world, interdependent file systems, has become critical. Current benchmarks typically fluctuate between isolated code snippets and black-box evaluations. We present RepoReason, a white-box diagnostic benchmark centered on abductive assertion verification. To eliminate memorization while preserving authentic logical depth, we implement an execution-driven mutation framework that utilizes the environment as a semantic oracle to regenerate ground-truth states. Furthermore, we establish a fine-grained diagnostic system using dynamic program slicing, quantifying reasoning via three orthogonal metrics: $ESV$ (reading load), $MCL$ (simulation depth), and $DFI$ (integration width). Comprehensive evaluations of frontier models (e.g., Claude-4.5-Sonnet, DeepSeek-v3.1-Terminus) reveal a prevalent aggregation deficit, where integration width serves as the primary cognitive bottleneck. Our findings provide granular white-box insights for optimizing the next generation of agentic software engineering.
Large language models (LLMs) have achieved strong performance on code completion tasks in general-purpose programming languages. However, existing repository-level code completion benchmarks focus almost exclusively on software code and largely overlook hardware description languages. In this work, we present MHRC-Bench, consisting of MHRC-Bench-Train and MHRC-Bench-Eval, the first benchmark designed for multilingual hardware code completion at the repository level. Our benchmark targets completion tasks and covers three major hardware design coding styles. Each completion target is annotated with code-structure-level and hardware-oriented semantic labels derived from concrete syntax tree analysis. We conduct a comprehensive evaluation of models on MHRC-Bench-Eval; the results and analysis demonstrate the effectiveness of MHRC-Bench.
Many real-world software tasks require exact transcription of provided data into code, such as cryptographic constants, protocol test vectors, allowlists, and calibration tables. These tasks are operationally sensitive because small omissions or alterations can remain silent while producing syntactically valid programs. This paper introduces a deliberately minimal transcription-to-code benchmark to isolate this reliability concern in LLM-based code generation. Given a list of high-precision decimal constants, a model must generate Python code that embeds the constants verbatim and performs a simple aggregate computation. We describe the prompting variants, evaluation protocol based on exact-string inclusion, and analysis framework used to characterize state-tracking and long-horizon generation failures. The benchmark is intended as a compact stress test that complements existing code-generation evaluations by focusing on data integrity rather than algorithmic reasoning.
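The core check is easy to state precisely. Below is a minimal sketch of the exact-string inclusion criterion described above; the names are illustrative and not the benchmark's actual harness.

```python
# Exact-string inclusion check: every provided constant must appear verbatim
# in the generated program.

def missing_constants(generated_code: str, constants: list[str]) -> list[str]:
    """Return the constants that do not appear verbatim in the generated code."""
    return [c for c in constants if c not in generated_code]

constants = ["3.14159265358979323846", "2.71828182845904523536"]
code = "values = [3.14159265358979323846, 2.718281828459045]\nprint(sum(values))"
print(missing_constants(code, constants))  # the silently truncated constant is flagged
```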
Machine learning (ML) algorithms are increasingly deployed to make critical decisions in socioeconomic applications such as finance, criminal justice, and autonomous driving. However, due to their data-driven and pattern-seeking nature, ML algorithms may develop decision logic that disproportionately distributes opportunities, benefits, resources, or information among different population groups, potentially harming marginalized communities. In response to such fairness concerns, the software engineering and ML communities have made significant efforts to establish the best practices for creating fair ML software. These include fairness interventions for training ML models, such as including sensitive features, selecting non-sensitive attributes, and applying bias mitigators. But how reliably can software professionals tasked with developing data-driven systems depend on these recommendations? And how well do these practices generalize in the presence of faulty labels, missing data, or distribution shifts? These questions form the core theme of this paper.
DevOps automation can accelerate software delivery, yet many organizations still struggle to justify and prioritize automation work in terms of strategic project-management outcomes such as waste reduction, delivery predictability, cross-team coordination, and customer-facing quality. This paper presents VSM-GQM-DevOps, a unified, traceable framework that integrates (i) Value Stream Mapping (VSM) to visualize the end-to-end delivery system and quantify delays, rework, and handoffs, (ii) the Goal-Question-Metric (GQM) paradigm to translate stakeholder objectives into a minimal, decision-relevant measurement model (combining DORA with project and team outcomes), and (iii) maturity-aligned DevOps automation to remediate empirically observed bottlenecks through small, reversible interventions. The framework operationalizes traceability from observed waste to goal-aligned questions, metrics, and automation candidates, and provides a defensible prioritization approach that balances expected impact, confidence, and cost. We also define a multi-site, longitudinal mixed-method validation protocol that combines telemetry-based quasi-experimental analysis (interrupted time series and, where feasible, controlled rollouts) with qualitative triangulation from interviews and retrospectives. The expected contribution is a validated pathway and a set of practical instruments that enable organizations to select automation investments that demonstrably improve both delivery performance and project-management outcomes.
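The framework's scoring formula is not given in the abstract; as one hypothetical instantiation of "balancing expected impact, confidence, and cost", the sketch below ranks automation candidates by confidence-discounted impact per unit of cost.

```python
# Hypothetical prioritization score for automation candidates; the names,
# scales, and formula are assumptions for illustration only.

def priority_score(expected_impact, confidence, cost):
    """Higher is better: confidence-discounted impact per unit of cost."""
    return (expected_impact * confidence) / max(cost, 1e-9)

candidates = {
    "automate flaky-test quarantine": priority_score(8, 0.7, 3),
    "self-service ephemeral environments": priority_score(9, 0.5, 8),
    "pipeline caching for builds": priority_score(5, 0.9, 2),
}
for name, score in sorted(candidates.items(), key=lambda kv: -kv[1]):
    print(f"{score:5.2f}  {name}")
```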
2601.03556
Testing is a critical practice for ensuring software correctness and long-term maintainability. As agentic coding tools increasingly submit pull requests (PRs), it becomes essential to understand how testing appears in these agent-driven workflows. Using the AIDev dataset, we present an empirical study of test inclusion in agentic pull requests. We examine how often tests are included, when they are introduced during the PR lifecycle, and how test-containing PRs differ from non-test PRs in terms of size, turnaround time, and merge outcomes. Across agents, test-containing PRs are more common over time and tend to be larger and take longer to complete, while merge rates remain largely similar. We also observe variation across agents in both test adoption and the balance between test and production code within test PRs. Our findings provide a descriptive view of testing behavior in agentic pull requests and offer empirical grounding for future studies of autonomous software development.
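A sketch of the kind of heuristic such a study might use to label a PR as test-containing and to measure its test/production balance; the actual detection rules used with the AIDev dataset are not described in the abstract, so the path patterns below are assumptions.

```python
# Heuristic labelling of test-containing PRs from changed file paths.
import re

TEST_PATTERN = re.compile(
    r"(^|/)(tests?|__tests__)/|(_test|\.test|\.spec)\.\w+$|(^|/)test_[^/]+$"
)

def is_test_file(path: str) -> bool:
    return bool(TEST_PATTERN.search(path))

def pr_test_profile(changed_files):
    """changed_files: list of (path, lines_changed) tuples for one PR."""
    test = sum(n for p, n in changed_files if is_test_file(p))
    prod = sum(n for p, n in changed_files if not is_test_file(p))
    return {"has_tests": test > 0,
            "test_share": test / (test + prod) if test + prod else 0.0}

pr = [("src/parser.py", 120), ("tests/test_parser.py", 80)]
print(pr_test_profile(pr))  # {'has_tests': True, 'test_share': 0.4}
```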
Open-source scientific software is abundant, yet most tools remain difficult to compile, configure, and reuse, sustaining a small-workshop mode of scientific computing. This deployment bottleneck limits reproducibility, large-scale evaluation, and the practical integration of scientific tools into modern AI-for-Science (AI4S) and agentic workflows. We present Deploy-Master, a one-stop agentic workflow for large-scale tool discovery, build specification inference, execution-based validation, and publication. Guided by a taxonomy spanning 90+ scientific and engineering domains, our discovery stage starts from a recall-oriented pool of over 500,000 public repositories and progressively filters it to 52,550 executable tool candidates under license- and quality-aware criteria. Deploy-Master transforms heterogeneous open-source repositories into runnable, containerized capabilities grounded in execution rather than documentation claims. In a single day, we performed 52,550 build attempts and constructed reproducible runtime environments for 50,112 scientific tools. Each successful tool is validated by a minimal executable command and registered in SciencePedia for search and reuse, enabling direct human use and optional agent-based invocation. Beyond delivering runnable tools, we report a deployment trace at the scale of 50,000 tools, characterizing throughput, cost profiles, failure surfaces, and specification uncertainty that become visible only at scale. These results explain why scientific software remains difficult to operationalize and motivate shared, observable execution substrates as a foundation for scalable AI4S and agentic science.
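As a rough sketch of the execution-based validation step (running each tool's minimal executable command inside its container image and recording the outcome), under the assumption of a Docker-style runtime; the image names and commands are placeholders, and Deploy-Master's real harness is considerably more involved.

```python
# Minimal execution-based validation of a containerized tool.
import subprocess

def validate_tool(image: str, command: list[str], timeout_s: int = 300) -> bool:
    """Return True if the minimal command exits 0 inside the container."""
    try:
        result = subprocess.run(
            ["docker", "run", "--rm", image, *command],
            capture_output=True, timeout=timeout_s,
        )
        return result.returncode == 0
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False

# Placeholder example: validate_tool("example/toolx:latest", ["toolx", "--version"])
```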
Code translation across multiple programming languages is essential yet challenging due to two major obstacles: scarcity of parallel data paired with executable test oracles, and optimization imbalance when handling diverse language pairs. We propose BootTrans, a bootstrapping method that resolves both obstacles. Its key idea is to leverage the functional invariance and cross-lingual portability of test suites, adapting abundant pivot-language unit tests to serve as universal verification oracles for multilingual RL training. Our method introduces a dual-pool architecture with seed and exploration pools to progressively expand training data via execution-guided experience collection. Furthermore, we design a language-aware weighting mechanism that dynamically prioritizes harder translation directions based on relative performance across sibling languages, mitigating optimization imbalance. Extensive experiments on the HumanEval-X and TransCoder-Test benchmarks demonstrate substantial improvements over baseline LLMs across all translation directions, with ablations validating the effectiveness of both bootstrapping and weighting components.
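The exact weighting formula is not given in the abstract; one plausible shape of a "language-aware" scheme is sketched below, where translation directions whose pass rate lags behind their sibling languages receive proportionally more weight via a softmax over the pass-rate deficit. The softmax form and temperature are assumptions, not BootTrans's actual mechanism.

```python
# Assumed language-aware weighting: harder directions get larger weights.
import math

def direction_weights(pass_rates, temperature=0.2):
    """pass_rates: dict mapping translation direction -> current pass rate in [0, 1]."""
    deficits = {d: 1.0 - r for d, r in pass_rates.items()}
    exps = {d: math.exp(v / temperature) for d, v in deficits.items()}
    total = sum(exps.values())
    return {d: v / total for d, v in exps.items()}

rates = {"py->cpp": 0.62, "py->java": 0.71, "py->rust": 0.43}
print(direction_weights(rates))  # py->rust receives the largest weight
```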
Large Language Models (LLMs) are predominantly assessed based on their common sense reasoning, language comprehension, and logical reasoning abilities. While models trained in specialized domains like mathematics or coding have demonstrated remarkable advancements in logical reasoning, there remains a significant gap in evaluating their code generation capabilities. Existing benchmark datasets fall short in pinpointing specific strengths and weaknesses, impeding targeted enhancements in models' reasoning abilities to synthesize code. To bridge this gap, our paper introduces an innovative, pedagogical benchmarking method that mirrors the evaluation processes encountered in academic programming courses. We introduce CodeEval, a multi-dimensional benchmark dataset designed to rigorously evaluate LLMs across 24 distinct aspects of Python programming. The dataset covers three proficiency levels - beginner, intermediate, and advanced - and includes both class-based and function-based problem types with detailed problem specifications and comprehensive test suites. To facilitate widespread adoption, we also developed RunCodeEval, an open-source execution framework that provides researchers with a ready-to-use evaluation pipeline for CodeEval. RunCodeEval handles test execution, context setup, and metrics generation, enabling researchers to quickly obtain detailed insights into model strengths and weaknesses across complexity levels, problem types, and programming categories. This combination enables targeted evaluation and guides improvements in LLMs' programming proficiencies.
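As a generic sketch of the kind of pipeline RunCodeEval provides (not its actual interface), the snippet below executes a generated function against a small test suite and aggregates pass counts; running untrusted model output this way should only ever be done in a sandbox.

```python
# Generic execution of a generated function against (function, args, expected)
# test cases; names and the test-case format are assumptions.

def run_tests(generated_code: str, test_cases):
    """test_cases: list of (function_name, args, expected) triples."""
    namespace = {}
    try:
        exec(generated_code, namespace)          # define the candidate function
    except Exception:
        return {"passed": 0, "total": len(test_cases)}
    passed = 0
    for fn_name, args, expected in test_cases:
        try:
            if namespace[fn_name](*args) == expected:
                passed += 1
        except Exception:
            pass                                  # runtime error counts as a failure
    return {"passed": passed, "total": len(test_cases)}

code = "def add(a, b):\n    return a + b"
print(run_tests(code, [("add", (2, 3), 5), ("add", (-1, 1), 0)]))
```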