Language design, compilers, program analysis, and software engineering
Natural Language Processing (NLP) tools support requirements engineering (RE) tasks like requirements elicitation, classification, and validation. However, they are often developed from scratch despite functional overlaps, and abandoned after publication. This lack of interoperability and maintenance incurs unnecessary development effort, impedes tool comparison and benchmarking, complicates documentation, and diminishes the long-term sustainability of NLP4RE tools. To address these issues, we postulate a vision to transition from monolithic NLP4RE tools to an ecosystem of reusable, interoperable modules. We outline a research roadmap towards a software reference architecture (SRA) to realize this vision, elaborated following a standard methodological framework for SRA development. As an initial step, we conducted a stakeholder-driven focus group session to elicit generic system requirements for NLP4RE tools. This activity resulted in 36 key system requirements, further motivating the need for a dedicated SRA. Overall, the proposed vision, roadmap, and initial contribution pave the way towards improved development, reuse, and long-term maintenance of NLP4RE tools.
Self-adaptive systems increasingly operate in close interaction with humans, often sharing the same physical or virtual environments and making decisions with ethical implications at runtime. Current approaches typically encode ethics as fixed, rule-based constraints or as a single chosen ethical theory embedded at design time. This overlooks a fundamental property of human-system interaction settings: ethical preferences vary across individuals and groups, evolve with context, and may conflict, while still needing to remain within a legally and regulatorily defined hard-ethics envelope (e.g., safety and compliance constraints). This paper advocates a shift from static ethical rules to runtime ethical reasoning for self-adaptive systems, where ethical preferences are treated as runtime requirements that must be elicited, represented, and continuously revised as stakeholders and situations change. We argue that satisfying such requirements demands explicit ethics-based negotiation to manage ethical trade-offs among multiple humans who interact with, are represented by, or are affected by a system. We identify key challenges, ethical uncertainty, conflicts among ethical values (including human, societal, and environmental drivers), and multi-dimensional/multi-party/multi-driver negotiation, and outline research directions and questions toward ethically self-adaptive systems.
Agents operating in complex software environments benefit from reasoning about the consequences of their actions, as even a single incorrect user interface (UI) operation can derail long, artifact-preserving workflows. This challenge is particularly acute for computer-using scenarios, where real execution does not support counterfactual exploration, making large-scale trial-and-error learning and planning impractical despite the environment being fully digital and deterministic. We introduce the Computer-Using World Model (CUWM), a world model for desktop software that predicts the next UI state given the current state and a candidate action. CUWM adopts a two-stage factorization of UI dynamics: it first predicts a textual description of agent-relevant state changes, and then realizes these changes visually to synthesize the next screenshot. CUWM is trained on offline UI transitions collected from agents interacting with real Microsoft Office applications, and further refined with a lightweight reinforcement learning stage that aligns textual transition predictions with the structural requirements of computer-using environments. We evaluate CUWM via test-time action search, where a frozen agent uses the world model to simulate and compare candidate actions before execution. Across a range of Office tasks, world-model-guided test-time scaling improves decision quality and execution robustness.
Quantum computing has gained significant attention due to its potential to solve computational problems beyond the capabilities of classical computers. With major corporations and academic institutions investing in quantum hardware and software, there has been a rise in the development of quantum-enabled systems, particularly within open-source communities. However, despite the promising nature of quantum technologies, these communities face critical socio-technical challenges, including the emergence of socio-technical anti-patterns known as community smells. These anti-patterns, prevalent in open-source environments, have the potential to negatively impact both product quality and community health by introducing technical debt and amplifying architectural and code smells. Despite the importance of these socio-technical factors, there remains a scarcity of research investigating their influence within quantum open-source communities. This work aims to address this gap by providing a first step in analyzing the socio-technical well-being of quantum communities through a cross-sectional study. By understanding the socio-technical dynamics at play, it is expected that foundational knowledge can be established to mitigate the risks associated with community smells and ensure the long-term sustainability of open-source quantum initiatives.
We introduce a compositional approach to model-based test generation in Behavior-Driven Development (BDD). BDD is an agile methodology in which system behavior is specified through textual scenarios that, in our approach, are translated into transition systems used for model-based testing. This paper formally defines disjunction composition, to combine BDD transition systems that represent alternative system behaviors. Disjunction composition allows for modeling and testing the integrated behavior while ensuring that the testing power of the original set of scenarios is preserved. This is proved using a symbolic semantics for BDD transition systems, with the property that the symbolic equivalence of two BDD transition systems guarantees that they fail the same test cases. Also, we demonstrate the potential of disjunction composition by applying the composition in an industrial case study.
2602.17193Since its introduction in the early 90s, the web has become the largest application platform available globally. HyperText Markup Language (HTML) has been an essential part of the web since the beginning, as it allows defining webpages in a tree-like manner, including semantics and content. Although the web was never meant to be an application platform, it evolved as such, especially since the early 2000s, as web application frameworks became available. While the emergence of frameworks made it easier than ever to develop complex applications, it also put HTML on the back burner. As web standards caught up, especially with milestones such as HTML5, the gap between the web platform and frameworks was reduced. HTML First development emphasizes this shift and puts focus on literally using HTML first when possible, while encouraging minimalism familiar from the early days of the web. It seems HTML-oriented web development can provide clear benefits to developers, especially when it is combined with comple- mentary approaches, such as embracing hypermedia and moving a large part of application logic to the server side. In the context of the htmx project, it was observed that moving towards HTML can reduce the size of a codebase greatly while leading to maintenance and development benefits due to the increased conceptual simplicity. Holotype-based comparisons for content-oriented websites show performance benefits, and the same observation was confirmed by a small case study where the Yle website was converted to follow HTML First principles. In short, the HTML First approach seems to have clear advantages for web developers, while there are open questions related to the magnitude of the benefits and the alignment with the recent trend of AI-driven web development.
Large language models (LLMs) increasingly assist software engineering tasks that require reasoning over long code contexts, yet their robustness under varying input conditions remains unclear. We conduct a systematic study of long-context code question answering using controlled ablations that test sensitivity to answer format, distractors, and context scale. Extending LongCodeBench Python dataset with new COBOL and Java question-answer sets, we evaluate state-of-the-art models under three settings: (i) shuffled multiple-choice options, (ii) open-ended questions and (iii) needle-in-a-haystack contexts containing relevant and adversarially irrelevant information. Results show substantial performance drops in both shuffled multiple-choice options and open-ended questions, and brittle behavior in the presence of irrelevant cues. Our findings highlight limitations of current long-context evaluations and provide a broader benchmark for assessing code reasoning in both legacy and modern systems.
Throughout the history of software, evolution has occurred in cycles of rise and fall driven by competition, and open-source software (OSS) is no exception. This cycle is accelerating, particularly in rapidly evolving domains such as web development and deep learning. However, the impact of competitive relationships among OSS projects on their survival remains unclear, and there are risks of losing a competitive edge to rivals. To address this, this study proposes a new automated method called ``Mutual Impact Analysis of OSS (MIAO)'' to quantify these competitive relationships. The proposed method employs a structural vector autoregressive model and impulse response functions, normally used in macroeconomic analysis, to analyze the interactions among OSS projects. In an empirical analysis involving mining and analyzing 187 OSS project groups, MIAO identified projects that were forced to cease development owing to competitive influences with up to 81\% accuracy, and the resulting features supported predictive experiments that anticipate cessation one year ahead with up to 77\% accuracy. This suggests that MIAO could be a valuable tool for OSS project maintainers to understand the dynamics of OSS ecosystems and predict the rise and fall of OSS projects.
Many OSS projects join foundations such as Apache, Eclipse, and OSGeo, to aid their immediate plans and improve long-term prospects by getting governance advice, incubation support, and community-building mechanisms. But foundations differ in their policies, funding models, and support strategies. Moreover, since projects joining these foundations are diverse, coming at different lifecycle stages and having different needs, it can be challenging to decide on the appropriate project-foundation match and on the project-specific plan for sustainability. Here, we present an empirical study and quantitative analysis of the sustainability of incubator projects in the Apache, Eclipse, and OSGeo foundations, and, additionally, of OSS projects from GitHub outside of foundations. We develop foundation-specific sustainability models and a project triage, based on projects' sociotechnical trace profiles, and demonstrate their effectiveness across the foundations. Our results show that our models with triage can effectively forecast sustainability outcomes not only within but across foundations. In addition, the generalizability of the framework allows us to apply the approach to GitHub projects outside the foundations. We complement our findings with actionable recovery strategies from previous work and apply them to case studies of failed incubator projects. Our study highlights the value of sociotechnical frameworks in characterizing and addressing software project sustainability issues.
Agentic Coding, powered by autonomous agents such as GitHub Copilot and Cursor, enables developers to generate code, tests, and pull requests from natural language instructions alone. While this accelerates implementation, it produces larger volumes of code per pull request, shifting the burden from implementers to reviewers. In practice, a notable portion of AI-generated code is eventually deleted during review, yet reviewers must still examine such code before deciding to remove it. No prior work has explored methods to help reviewers efficiently identify code that will be removed.In this paper, we propose a prediction model that identifies functions likely to be deleted during PR review. Our results show that functions deleted for different reasons exhibit distinct characteristics, and our model achieves an AUC of 87.1%. These findings suggest that predictive approaches can help reviewers prioritize their efforts on essential code.
Autonomous coding agents, powered by large language models (LLMs), are increasingly being adopted in the software industry to automate complex engineering tasks. However, these agents are prone to a wide range of misbehaviors, such as deviating from the user's instructions, getting stuck in repetitive loops, or failing to use tools correctly. These failures disrupt the development workflow and often require resource-intensive manual intervention. In this paper, we present a system for automatically recovering from agentic misbehaviors at scale. We first introduce a taxonomy of misbehaviors grounded in an analysis of production traffic, identifying three primary categories: Specification Drift, Reasoning Problems, and Tool Call Failures, which we find occur in about 30% of all agent trajectories. To address these issues, we developed a lightweight, asynchronous self-intervention system named Wink. Wink observes agent trajectories and provides targeted course-correction guidance to nudge the agent back to a productive path. We evaluated our system on over 10,000 real world agent trajectories and found that it successfully resolves 90% of the misbehaviors that require a single intervention. Furthermore, a live A/B test in our production environment demonstrated that our system leads to a statistically significant reduction in Tool Call Failures, Tokens per Session and Engineer Interventions per Session. We present our experience designing and deploying this system, offering insights into the challenges of building resilient agentic systems at scale.
The adoption of third-party libraries has become integral to modern software development, leading to large ecosystems such as PyPI, NPM, and Maven, where contributors typically share the technical expertise to sustain extensions. In communities that are not exclusively composed of developers, however, maintaining plugin ecosystems can present different challenges. In this early results paper, we study Obsidian, a knowledge--centric platform whose community is focused on writing, organization, and creativity--has built a substantial plugin ecosystem despite not being developer--centric. We investigate what kinds of plugins exist within this hybrid ecosystem and establish a foundation for understanding how they are maintained. Using repository mining and LLM-based topic modeling on a representative sample of 396 plugins, we identify six topics related to knowledge management and tooling, which is (i) dynamic editing and organization, (ii) interface and layouts, (iii) creative writing and productivity, (iv) knowledge sync solutions, (v) linking and script tools, and (vi) workflow enhancements tools. Furthermore, analysis of the Pull Requests from these plugins show that much software evolution has been performed on these ecosystem. These findings suggest that even in mixed communities, plugin ecosystems can develop recognizable engineering structures, motivating future work that highlight three different research directions with six research questions related to the health and sustainability of these non-developer ecosystems.
User stories are one of the most widely used artifacts in the software industry to define functional requirements. In parallel, the use of high-fidelity mockups facilitates end-user participation in defining their needs. In this work, we explore how combining these techniques with large language models (LLMs) enables agile and automated generation of user stories from mockups. To this end, we present a case study that analyzes the ability of LLMs to extract user stories from high-fidelity mockups, both with and without the inclusion of a glossary of the Language Extended Lexicon (LEL) in the prompts. Our results demonstrate that incorporating the LEL significantly enhances the accuracy and suitability of the generated user stories. This approach represents a step forward in the integration of AI into requirements engineering, with the potential to improve communication between users and developers.
Object-oriented programs tend to be written using many common coding idioms, such as those captured by design patterns. While design patterns are useful, implementing them is often tedious and repetitive, requiring boilerplate code that distracts the programmer from more essential details. In this paper, we introduce Mason, a tool that synthesizes object-oriented programs from partial program pieces, and we apply it to automatically insert design patterns into programs. At the core of Mason is a novel technique we call type- and name-guided synthesis, in which an enumerative solver traverses a partial program to generate typing constraints; discharges constraints via program transformations guided by the names of constrained types and members; and backtracks when a constraint is violated or a candidate program fails unit tests. We also introduce two extensions to Mason: a non-local backtracking heuristic that uses execution traces, and a language of patterns that impose syntactic restrictions on missing names. We evaluate Mason on a suite of benchmarks to which Mason must add various well-known design patterns implemented as a library of program pieces. We find that Mason performs well when very few candidate programs satisfy its typing constraints and that our extensions can improve Mason's performance significantly when this is not the case. We believe that Mason takes an important step forward in synthesizing multi-class object-oriented programs using design patterns.
Janus is a paradigmatic example of reversible programming language. Indeed, Janus programs can be executed backwards as well as forwards. However, its small-step semantics (useful, e.g., for debugging or as a basis for extensions with concurrency primitives) is not reversible, since it loses information while computing forwards. E.g., it does not satisfy the Loop Lemma, stating that any reduction has an inverse, a main property of reversibility in process calculi, where small-step semantics is commonly used. We present here a novel small-step semantics which is actually reversible, while remaining equivalent to the previous one. It involves the non-trivial challenge of defining a semantics based on a "program counter" for a high-level programming language.
When assessing the quality of coding agents, predominant benchmarks focus on solving single issues on GitHub, such as SWE-Bench. In contrast, in real use, these agents solve more various and complex tasks that involve other skills such as exploring codebases, testing software, and designing architecture. In this paper, we first characterize some transferable skills that are shared across diverse tasks by decomposing trajectories into fine-grained components, and derive a set of principles for designing auxiliary training tasks to teach language models these skills. Guided by these principles, we propose a training environment, Hybrid-Gym, consisting of a set of scalable synthetic tasks, such as function localization and dependency search. Experiments show that agents trained on our synthetic tasks effectively generalize to diverse real-world tasks that are not present in training, improving a base model by 25.4% absolute gain on SWE-Bench Verified, 7.9% on SWT-Bench Verified, and 5.1% on Commit-0 Lite. Hybrid-Gym also complements datasets built for the downstream tasks (e.g., improving SWE-Play by 4.9% on SWT-Bench Verified). Code available at: https://github.com/yiqingxyq/Hybrid-Gym.
2602.16809Since its birth as a new scientific body of knowledge in the late 1950s, computer programming has become a fundamental skill needed in many other disciplines. However, programming is not easy, it is prone to errors and code re-use is key for productivity. This calls for high-quality documentation in software libraries, which is quite often not the case. Taking a few Haskell functions available from the Hackage repository as case-studies, and comparing their descriptions with similar functions in other languages, this paper shows how clarity and good conceptual design can be achieved by following a so-called easy-hard-split formal strategy that is quite general and productive, even if used informally. This strategy is easy to use in functional programming and can be applied to both program analysis and synthesis.
Recent algorithmic advances have made equality saturation an appealing approach to program optimization because it avoids the phase-ordering problem. Existing work uses external equality saturation libraries, or custom implementations that are deeply tied to the specific application. However, these works only apply equality saturation at a single level of abstraction, or discard the discovered equalities when code is transformed by other compiler passes. We propose an alternative approach that represents an e-graph natively in the compiler's intermediate representation, facilitating the application of constructive compiler passes that maintain the e-graph state throughout the compilation flow. We build on a Python-based MLIR framework, xDSL, and introduce a new MLIR dialect, eqsat, that represents e-graphs in MLIR code. We show that this representation expands the scope of equality saturation in the compiler, allowing us to interleave pattern rewriting with other compiler transformations. The eqsat dialect provides a unified abstraction for compilers to utilize equality saturation across various levels of intermediate representations concurrently within the same MLIR flow.
Automated unit test generation for C remains a formidable challenge due to the semantic gap between high-level program intent and the rigid syntactic constraints of pointer arithmetic and manual memory management. While Large Language Models (LLMs) exhibit strong generative capabilities, direct intent-to-code synthesis frequently suffers from the leap-to-code failure mode, where models prematurely emit code without grounding in program structure, constraints, and semantics. This will result in non-compilable tests, hallucinated function signatures, low branch coverage, and semantically irrelevant assertions that cannot properly capture bugs. We introduce SPARC, a neuro-symbolic, scenario-based framework that bridges this gap through four stages: (1) Control Flow Graph (CFG) analysis, (2) an Operation Map that grounds LLM reasoning in validated utility helpers, (3) Path-targeted test synthesis, and (4) an iterative, self-correction validation loop using compiler and runtime feedback. We evaluate SPARC on 59 real-world and algorithmic subjects, where it outperforms the vanilla prompt generation baseline by 31.36% in line coverage, 26.01% in branch coverage, and 20.78% in mutation score, matching or exceeding the symbolic execution tool KLEE on complex subjects. SPARC retains 94.3% of tests through iterative repair and produces code with significantly higher developer-rated readability and maintainability. By aligning LLM reasoning with program structure, SPARC provides a scalable path for industrial-grade testing of legacy C codebases.
2602.16499The Asset Administration Shell (AAS) is an emerging technology for the implementation of digital twins in the field of manufacturing. Software is becoming increasingly important, not only in general but specifically in relation to manufacturing, especially with regard to digital manufacturing and a shift towards the usage of artificial intelligence. This increases the need not only to model software, but also to integrate services directly into the AAS. The existing literature contains individual solutions to implement such software-heavy AAS. However, there is no systematic analysis of software architectures that integrate software services directly into the AAS. This paper aims to fill this research gap and differentiate architectures based on software quality criteria as well as typical manufacturing use cases. This work may be considered as an interpretation guideline for software-heavy AAS, both in academia and for practitioners.