Programming Languages
Language design, compilers, program analysis, and software engineering
2601.04124
Issue Tracking Systems (ITSs) enable software developers and managers to collect and resolve issues collaboratively. While researchers have extensively analysed ITS data to automate or assist specific activities such as issue assignments, duplicate detection, or priority prediction, developer studies on ITSs remain rare. Particularly, little is known about the challenges Software Engineering (SE) teams encounter in ITSs and when certain practices and workarounds (such as leaving issue fields like "priority" empty) are considered problematic. To fill this gap, we conducted an in-depth interview study with 26 experienced SE practitioners from different organisations and industries. We asked them about general problems encountered, as well as the relevance of 31 ITS smells (aka potentially problematic practices) discussed in the literature. By applying Thematic Analysis to the interview notes, we identified 14 common problems including issue findability, zombie issues, workflow bloat, and lack of workflow enforcement. Participants also stated that many of the ITS smells do not occur or are not problematic. Our results suggest that ITS problems and smells are highly dependent on context factors such as ITS configuration, workflow stage, and team size. We also discuss potential tooling solutions to configure, monitor, and visualise ITS smells to cope with these challenges.
Existing code similarity metrics, such as BLEU, CodeBLEU, and TSED, largely rely on surface-level string overlap or abstract syntax tree structures, and often fail to capture deeper semantic relationships between programs. We propose CSSG (Code Similarity using Semantic Graphs), a novel metric that leverages program dependence graphs to explicitly model control dependencies and variable interactions, providing a semantics-aware representation of code. Experiments on the CodeContests+ dataset show that CSSG consistently outperforms existing metrics in distinguishing more similar code from less similar code under both monolingual and cross-lingual settings, demonstrating that dependency-aware graph representations offer a more effective alternative to surface-level or syntax-based similarity measures.
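The abstract does not spell out how CSSG scores a pair of dependence graphs; purely as an illustration of the general idea, the sketch below compares two programs' dependence graphs via a Jaccard overlap of labelled dependence edges. The edge representation and the scoring are assumptions, not CSSG's actual formulation.

```python
# Illustrative sketch: comparing two program dependence graphs by the overlap
# of their labelled dependence edges. Node and edge labels are hypothetical.

def pdg_edge_set(edges):
    """Normalise a dependence graph into a set of (source, target, kind) triples."""
    return {(src, dst, kind) for src, dst, kind in edges}

def dependence_similarity(pdg_a, pdg_b):
    """Jaccard overlap of labelled dependence edges (1.0 = identical edge sets)."""
    a, b = pdg_edge_set(pdg_a), pdg_edge_set(pdg_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Toy example: same control dependency, different data flow.
pdg_1 = [("if_x", "assign_y", "control"), ("read_x", "assign_y", "data")]
pdg_2 = [("if_x", "assign_y", "control"), ("read_z", "assign_y", "data")]
print(dependence_similarity(pdg_1, pdg_2))  # 0.333...
```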
In operational technology (OT) contexts, containerised applications often require elevated privileges to access low-level network interfaces or perform administrative tasks such as application monitoring. These privileges reduce the default isolation provided by containers and introduce significant security risks. Security risk identification for OT container deployments is challenged by hybrid IT/OT architectures, fragmented stakeholder knowledge, and continuous system changes. Existing approaches lack reproducibility, interpretability across contexts, and technical integration with deployment artefacts. We propose a model-based approach, implemented as the Container Security Risk Ontology (CSRO), which integrates five key domains: adversarial behaviour, contextual assumptions, attack scenarios, risk assessment rules, and container security artefacts. Our evaluation of CSRO in a case study demonstrates that the end-to-end formalisation of risk calculation, from artefact to risk level, enables automated and reproducible risk identification. While CSRO currently focuses on technical, container-level treatment measures, its modular and flexible design provides a solid foundation for extending the approach to host-level and organisational risk factors.
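As a rough illustration of the "artefact to risk level" formalisation, the sketch below maps a hypothetical container deployment spec to a coarse risk level via hand-written rules. The attribute names, weights, and thresholds are invented for illustration and are not CSRO's actual vocabulary or rule base.

```python
# Hypothetical rule-based mapping from container deployment artefacts to a
# coarse risk level, in the spirit of an automated, reproducible assessment.

def assess_container_risk(spec: dict) -> str:
    """Return a coarse risk level for a container deployment spec."""
    score = 0
    if spec.get("privileged"):                    # broad host access
        score += 3
    if spec.get("host_network"):                  # low-level network interfaces
        score += 2
    if "SYS_ADMIN" in spec.get("capabilities", []):
        score += 2
    if spec.get("runs_as_root", True):
        score += 1
    return "high" if score >= 4 else "medium" if score >= 2 else "low"

ot_monitor = {"privileged": False, "host_network": True,
              "capabilities": ["NET_RAW"], "runs_as_root": True}
print(assess_container_risk(ot_monitor))  # "medium" (score 3)
```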
2601.03988
Background: Extracting the stages that structure Machine Learning (ML) pipelines from source code is key for gaining a deeper understanding of data science practices. However, the diversity caused by the constant evolution of the ML ecosystem (e.g., algorithms, libraries, datasets) makes this task challenging. Existing approaches either depend on non-scalable, manual labeling, or on ML classifiers that do not properly support the diversity of the domain. These limitations highlight the need for more flexible and reliable solutions. Objective: We evaluate whether Small Language Models (SLMs) can leverage their code understanding and classification abilities to address these limitations, and subsequently how they can advance our understanding of data science practices. Method: We conduct a confirmatory study based on two reference works selected for their relevance regarding current state-of-the-art's limitations. First, we compare several SLMs using Cochran's Q test. The best-performing model is then evaluated against the reference studies using two distinct McNemar's tests. We further analyze how variations in taxonomy definitions affect performance through an additional Cochran's Q test. Finally, a goodness-of-fit analysis is conducted using Pearson's chi-squared tests to compare our insights on data science practices with those from prior studies.
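For readers unfamiliar with the statistical machinery, the sketch below shows a continuity-corrected McNemar's test over paired per-item correctness, the kind of comparison described between the best SLM and a reference classifier. The data is made up, and the actual study may use a different variant (e.g., the exact test).

```python
# Sketch of a continuity-corrected McNemar's test on paired correctness vectors.
from scipy.stats import chi2

def mcnemar_test(model_a_correct, model_b_correct):
    """Compare two classifiers on the same items via their disagreements."""
    b = sum(1 for x, y in zip(model_a_correct, model_b_correct) if x and not y)
    c = sum(1 for x, y in zip(model_a_correct, model_b_correct) if not x and y)
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return stat, chi2.sf(stat, df=1)  # p-value from chi-squared with 1 df

slm = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]   # made-up per-snippet correctness
ref = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
print(mcnemar_test(slm, ref))           # disagreements: b=3, c=0
```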
We present an approach to implement binary search trees in the rule-based graph programming language GP 2. Our implementation uses GP 2's rooted graph transformation rules for efficiency and supports insertion, deletion and query operations. We argue that the worst-case runtime for each of the operations is O(n) for a tree with n nodes. In addition, we expect that, on average, the operations run in time O(log(n)). Hence the implementation would match the time complexity of binary search tree implementations in imperative languages.
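The GP 2 rules themselves are not shown in the abstract; for reference, here is a plain Python binary search tree with the same three operations and the complexity profile being matched: O(n) in the worst case (a degenerate, list-shaped tree) and O(log n) on average.

```python
# Reference binary search tree with insert, query, and delete.

class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def query(root, key):
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root is not None

def delete(root, key):
    if root is None:
        return None
    if key < root.key:
        root.left = delete(root.left, key)
    elif key > root.key:
        root.right = delete(root.right, key)
    else:
        if root.left is None:
            return root.right
        if root.right is None:
            return root.left
        succ = root.right                 # in-order successor
        while succ.left is not None:
            succ = succ.left
        root.key = succ.key
        root.right = delete(root.right, succ.key)
    return root
```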
Large Language Models (LLMs) are increasingly integrated into software development workflows, yet their behavior in structured, specification-driven processes remains poorly understood. This paper presents an empirical study design using CURRANTE, a Visual Studio Code extension that enables a human-in-the-loop workflow for LLM-assisted code generation. The tool guides developers through three sequential stages--Specification, Tests, and Function--allowing them to define requirements, generate and refine test suites, and produce functions that satisfy those tests. Participants will solve medium-difficulty problems from the LiveCodeBench dataset, while the tool records fine-grained interaction logs, effectiveness metrics (e.g., pass rate, all-pass completion), efficiency indicators (e.g., time-to-pass), and iteration behaviors. The study aims to analyze how human intervention in specification and test refinement influences the quality and dynamics of LLM-generated code. The results will provide empirical insights into the design of next-generation development environments that align human reasoning with model-driven code generation.
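As a sketch of how such metrics could be derived, the snippet below computes pass rate, all-pass completion, and time-to-pass from a hypothetical per-session event log. CURRANTE's actual log schema is not described in the abstract, so the field names here are assumptions.

```python
# Hypothetical derivation of effectiveness and efficiency metrics from a
# chronological interaction log; field names are illustrative only.

def summarise_session(events):
    """events: chronological list of {'t': seconds, 'passed': int, 'total': int}."""
    final = events[-1]
    pass_rate = final["passed"] / final["total"]
    all_pass = final["passed"] == final["total"]
    # time-to-pass: first moment at which all generated tests pass, if any
    time_to_pass = next((e["t"] for e in events if e["passed"] == e["total"]), None)
    return {"pass_rate": pass_rate, "all_pass": all_pass,
            "time_to_pass": time_to_pass, "iterations": len(events)}

log = [{"t": 120, "passed": 2, "total": 5},
       {"t": 340, "passed": 4, "total": 5},
       {"t": 610, "passed": 5, "total": 5}]
print(summarise_session(log))
```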
2601.03857
LLMs are increasingly used to boost productivity and support software engineering tasks. However, when applied to socially sensitive decisions such as team composition and task allocation, they raise concerns of fairness. Prior studies have revealed that LLMs may reproduce stereotypes; however, these analyses remain exploratory and examine sensitive attributes in isolation. This study investigates whether LLMs exhibit bias in team composition and task assignment by analyzing the combined effects of candidates' country and pronouns. Using three LLMs and 3,000 simulated decisions, we find systematic disparities: demographic attributes significantly shaped both selection likelihood and task allocation, even when accounting for expertise-related factors. Task distributions further reflected stereotypes, with technical and leadership roles unevenly assigned across groups. Our findings indicate that LLMs exacerbate demographic inequities in software engineering contexts, underscoring the need for fairness-aware assessment.
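A minimal sketch of the kind of analysis implied above: selection rates broken down by combined country and pronoun groups over simulated decisions. The data, group labels, and the simple max-minus-min disparity measure are illustrative assumptions, not the paper's protocol.

```python
# Group-wise selection rates over simulated LLM decisions (toy data).
from collections import defaultdict

def selection_rates(decisions):
    """decisions: list of {'country': ..., 'pronouns': ..., 'selected': bool}."""
    counts = defaultdict(lambda: [0, 0])          # group -> [selected, total]
    for d in decisions:
        group = (d["country"], d["pronouns"])
        counts[group][0] += int(d["selected"])
        counts[group][1] += 1
    return {g: sel / tot for g, (sel, tot) in counts.items()}

sim = [{"country": "A", "pronouns": "she/her", "selected": True},
       {"country": "A", "pronouns": "he/him", "selected": True},
       {"country": "B", "pronouns": "she/her", "selected": False}]
rates = selection_rates(sim)
disparity = max(rates.values()) - min(rates.values())  # crude gap measure
print(rates, disparity)
```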
We present a framework for synthesising formulas in first-order logic (FOL) from examples, which unifies and advances state-of-the-art approaches for inference of transition system invariants. To do so, we study and categorise the existing methodologies, encoding their formula-synthesis techniques via answer set programming (ASP). Based on the derived categorisation, we propose orthogonal slices, a new technique for formula enumeration that partitions the search space into manageable chunks, enabling two approaches for incremental candidate pruning. Using a combination of existing techniques for first-order (FO) invariant synthesis and the orthogonal slices implemented in our framework FORCE, we significantly accelerate a state-of-the-art algorithm for distributed system invariant inference. We also show that our approach facilitates composition of different invariant inference frameworks, allowing for novel optimisations.
2601.03836
Logic programming languages present clear advantages in terms of declarativeness and conciseness. However, the ideas of logic programming have been met with resistance in other programming communities, and have not generally been adopted by other paradigms and languages. This paper proposes a novel way to incorporate logic programming in an existing codebase in a typed functional programming language. Our approach integrates with the host language without sacrificing static typing, and leverages strengths of typed functional programming such as polymorphism and higher-order functions. We do so by combining three ideas. First, we use the extensible types technique to allow values of the host language to contain logic variables. Second, we implement a unification algorithm that works for any data structure that supports certain operations. Third, we introduce a domain-specific language to define and query predicates. We demonstrate our proposal via a series of examples, and provide aids to make the notation convenient for users, showing that the proposed approach is not just technically possible but also practical. Our ideas have been implemented in the language Haskell with very good results.
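The paper's implementation is in Haskell using extensible types; as a language-agnostic illustration of the second ingredient, here is a minimal unification sketch in Python, with logic variables as a small class and terms as nested tuples (occurs check omitted for brevity).

```python
# Minimal first-order unification over tuples and logic variables.

class Var:
    def __init__(self, name):
        self.name = name
    def __repr__(self):
        return f"?{self.name}"

def walk(term, subst):
    """Follow variable bindings until a non-variable or unbound variable."""
    while isinstance(term, Var) and term in subst:
        term = subst[term]
    return term

def unify(a, b, subst=None):
    """Return an extended substitution dict, or None if a and b do not unify."""
    subst = {} if subst is None else subst
    a, b = walk(a, subst), walk(b, subst)
    if a is b or a == b:
        return subst
    if isinstance(a, Var):
        return {**subst, a: b}
    if isinstance(b, Var):
        return {**subst, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            subst = unify(x, y, subst)
            if subst is None:
                return None
        return subst
    return None

X, Y = Var("x"), Var("y")
print(unify(("parent", X, "bob"), ("parent", "alice", Y)))  # {?x: 'alice', ?y: 'bob'}
```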
Large Language Models (LLMs) such as GPT-4, Claude and LLaMA have shown impressive performance in code generation, typically evaluated using benchmarks (e.g., HumanEval). However, effective code generation requires models to understand and apply a wide range of language concepts. If the concepts exercised in benchmarks are not representative of those used in real-world projects, evaluations may yield an incomplete picture. Despite this concern, the representativeness of code concepts in benchmarks has not been systematically examined. To address this gap, we present the first empirical study that analyzes the representativeness of code generation benchmarks through the lens of Knowledge Units (KUs) - cohesive sets of programming language capabilities provided by language constructs and APIs. We analyze KU coverage in two widely used Python benchmarks, HumanEval and MBPP, and compare them with 30 real-world Python projects. Our results show that each benchmark covers only half of the 20 identified KUs, whereas projects exercise all KUs with relatively balanced distributions. In contrast, benchmark tasks exhibit highly skewed KU distributions. To mitigate this misalignment, we propose a prompt-based LLM framework that synthesizes KU-based tasks to rebalance benchmark KU distributions and better align them with real-world usage. Using this framework, we generate 440 new tasks and augment existing benchmarks. The augmented benchmarks substantially improve KU coverage and achieve over a 60% improvement in distributional alignment. Evaluations of state-of-the-art LLMs on these augmented benchmarks reveal consistent and statistically significant performance drops (12.54-44.82%), indicating that existing benchmarks overestimate LLM performance due to their limited KU coverage. Our findings provide actionable guidance for building more realistic evaluations of LLM code-generation capabilities.
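To make the two quantities concrete, the sketch below computes KU coverage (the fraction of KUs exercised at least once) and a distributional alignment score between a benchmark's KU distribution and the one observed in projects. The alignment measure used here (1 minus the Jensen-Shannon distance) is an assumption, since the paper's exact metric is not given in the abstract.

```python
# KU coverage and an assumed distributional alignment score (toy counts,
# 10 KUs for brevity rather than the paper's 20).
import numpy as np
from scipy.spatial.distance import jensenshannon

def ku_coverage(ku_counts):
    """Fraction of KUs that appear at least once."""
    return sum(1 for c in ku_counts if c > 0) / len(ku_counts)

def ku_alignment(benchmark_counts, project_counts):
    """1 - Jensen-Shannon distance between normalised KU distributions."""
    p = np.asarray(benchmark_counts, dtype=float); p /= p.sum()
    q = np.asarray(project_counts, dtype=float); q /= q.sum()
    return 1.0 - jensenshannon(p, q)

bench = [40, 35, 10, 5, 0, 0, 0, 0, 0, 0]      # skewed, half the KUs unused
projects = [12, 10, 11, 9, 10, 8, 11, 9, 10, 10]
print(ku_coverage(bench), ku_alignment(bench, projects))
```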
2601.03768
Proof engineering is notoriously labor-intensive: proofs that are straightforward on paper often require lengthy scripts in theorem provers. Recent advances in large language models (LLMs) create new opportunities for proof automation: modern LLMs not only generate proof scripts, but also support agentic behavior, exploring codebases and iteratively refining their outputs against prover feedback. These advances enable an emerging scheme where LLM-based agents undertake most proof engineering under human guidance. Humans provide mathematical insight (definitions, theorems, proof strategies); agents handle the mechanical work of proof development. We call this scheme agentic proof automation. We present this scheme through a case study: mechanizing the semantic type soundness of a sophisticated formal system, System Capless, in Lean 4, comprising over 14,000 lines of code. Using off-the-shelf LLM agents with a single lightweight proof-checking tool, the agents completed 189 proof engineering tasks with an 87% success rate, only 16% requiring human intervention. The case study demonstrates that agents are capable proof engineers that substantially boost productivity, though they fall short in creative reasoning and still require human guidance in certain cases. We release an interactive explorer where readers can examine all agent interactions; the mechanization is open-sourced for experiments and extensions.
As large language models (LLMs) evolve into autonomous agents, evaluating repository-level reasoning, the ability to maintain logical consistency across massive, real-world, interdependent file systems, has become critical. Current benchmarks typically fluctuate between isolated code snippets and black-box evaluations. We present RepoReason, a white-box diagnostic benchmark centered on abductive assertion verification. To eliminate memorization while preserving authentic logical depth, we implement an execution-driven mutation framework that utilizes the environment as a semantic oracle to regenerate ground-truth states. Furthermore, we establish a fine-grained diagnostic system using dynamic program slicing, quantifying reasoning via three orthogonal metrics: $ESV$ (reading load), $MCL$ (simulation depth), and $DFI$ (integration width). Comprehensive evaluations of frontier models (e.g., Claude-4.5-Sonnet, DeepSeek-v3.1-Terminus) reveal a prevalent aggregation deficit, where integration width serves as the primary cognitive bottleneck. Our findings provide granular white-box insights for optimizing the next generation of agentic software engineering.
Large language models (LLMs) have achieved strong performance on code completion tasks in general-purpose programming languages. However, existing repository-level code completion benchmarks focus almost exclusively on software code and largely overlook hardware description languages. In this work, we present MHRC-Bench, consisting of MHRC-Bench-Train and MHRC-Bench-Eval, the first benchmark designed for multilingual hardware code completion at the repository level. Our benchmark targets completion tasks and covers three major hardware design coding styles. Each completion target is annotated with code-structure-level and hardware-oriented semantic labels derived from concrete syntax tree analysis. We conduct a comprehensive evaluation of models on MHRC-Bench-Eval; the results and analysis demonstrate the effectiveness of MHRC-Bench.
Many real-world software tasks require exact transcription of provided data into code, such as cryptographic constants, protocol test vectors, allowlists, and calibration tables. These tasks are operationally sensitive because small omissions or alterations can remain silent while producing syntactically valid programs. This paper introduces a deliberately minimal transcription-to-code benchmark to isolate this reliability concern in LLM-based code generation. Given a list of high-precision decimal constants, a model must generate Python code that embeds the constants verbatim and performs a simple aggregate computation. We describe the prompting variants, evaluation protocol based on exact-string inclusion, and analysis framework used to characterize state-tracking and long-horizon generation failures. The benchmark is intended as a compact stress test that complements existing code-generation evaluations by focusing on data integrity rather than algorithmic reasoning.
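The core check is easy to state precisely. Below is a minimal sketch of the exact-string inclusion criterion described above; the names are illustrative and not the benchmark's actual harness.

```python
# Exact-string inclusion check: every provided constant must appear verbatim
# in the generated program.

def missing_constants(generated_code: str, constants: list[str]) -> list[str]:
    """Return the constants that do not appear verbatim in the generated code."""
    return [c for c in constants if c not in generated_code]

constants = ["3.14159265358979323846", "2.71828182845904523536"]
code = "values = [3.14159265358979323846, 2.718281828459045]\nprint(sum(values))"
print(missing_constants(code, constants))  # the silently truncated constant is flagged
```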
Machine learning (ML) algorithms are increasingly deployed to make critical decisions in socioeconomic applications such as finance, criminal justice, and autonomous driving. However, due to their data-driven and pattern-seeking nature, ML algorithms may develop decision logic that disproportionately distributes opportunities, benefits, resources, or information among different population groups, potentially harming marginalized communities. In response to such fairness concerns, the software engineering and ML communities have made significant efforts to establish the best practices for creating fair ML software. These include fairness interventions for training ML models, such as including sensitive features, selecting non-sensitive attributes, and applying bias mitigators. But how reliably can software professionals tasked with developing data-driven systems depend on these recommendations? And how well do these practices generalize in the presence of faulty labels, missing data, or distribution shifts? These questions form the core theme of this paper.
DevOps automation can accelerate software delivery, yet many organizations still struggle to justify and prioritize automation work in terms of strategic project-management outcomes such as waste reduction, delivery predictability, cross-team coordination, and customer-facing quality. This paper presents VSM-GQM-DevOps, a unified, traceable framework that integrates (i) Value Stream Mapping (VSM) to visualize the end-to-end delivery system and quantify delays, rework, and handoffs, (ii) the Goal-Question-Metric (GQM) paradigm to translate stakeholder objectives into a minimal, decision-relevant measurement model (combining DORA with project and team outcomes), and (iii) maturity-aligned DevOps automation to remediate empirically observed bottlenecks through small, reversible interventions. The framework operationalizes traceability from observed waste to goal-aligned questions, metrics, and automation candidates, and provides a defensible prioritization approach that balances expected impact, confidence, and cost. We also define a multi-site, longitudinal mixed-method validation protocol that combines telemetry-based quasi-experimental analysis (interrupted time series and, where feasible, controlled rollouts) with qualitative triangulation from interviews and retrospectives. The expected contribution is a validated pathway and a set of practical instruments that enable organizations to select automation investments that demonstrably improve both delivery performance and project-management outcomes.
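The framework's scoring formula is not given in the abstract; as one hypothetical instantiation of "balancing expected impact, confidence, and cost", the sketch below ranks automation candidates by confidence-discounted impact per unit of cost.

```python
# Hypothetical prioritization score for automation candidates; the names,
# scales, and formula are assumptions for illustration only.

def priority_score(expected_impact, confidence, cost):
    """Higher is better: confidence-discounted impact per unit of cost."""
    return (expected_impact * confidence) / max(cost, 1e-9)

candidates = {
    "automate flaky-test quarantine": priority_score(8, 0.7, 3),
    "self-service ephemeral environments": priority_score(9, 0.5, 8),
    "pipeline caching for builds": priority_score(5, 0.9, 2),
}
for name, score in sorted(candidates.items(), key=lambda kv: -kv[1]):
    print(f"{score:5.2f}  {name}")
```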
2601.03556
Testing is a critical practice for ensuring software correctness and long-term maintainability. As agentic coding tools increasingly submit pull requests (PRs), it becomes essential to understand how testing appears in these agent-driven workflows. Using the AIDev dataset, we present an empirical study of test inclusion in agentic pull requests. We examine how often tests are included, when they are introduced during the PR lifecycle, and how test-containing PRs differ from non-test PRs in terms of size, turnaround time, and merge outcomes. Across agents, test-containing PRs are more common over time and tend to be larger and take longer to complete, while merge rates remain largely similar. We also observe variation across agents in both test adoption and the balance between test and production code within test PRs. Our findings provide a descriptive view of testing behavior in agentic pull requests and offer empirical grounding for future studies of autonomous software development.
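A sketch of the kind of heuristic such a study might use to label a PR as test-containing and to measure its test/production balance; the actual detection rules used with the AIDev dataset are not described in the abstract, so the path patterns below are assumptions.

```python
# Heuristic labelling of test-containing PRs from changed file paths.
import re

TEST_PATTERN = re.compile(
    r"(^|/)(tests?|__tests__)/|(_test|\.test|\.spec)\.\w+$|(^|/)test_[^/]+$"
)

def is_test_file(path: str) -> bool:
    return bool(TEST_PATTERN.search(path))

def pr_test_profile(changed_files):
    """changed_files: list of (path, lines_changed) tuples for one PR."""
    test = sum(n for p, n in changed_files if is_test_file(p))
    prod = sum(n for p, n in changed_files if not is_test_file(p))
    return {"has_tests": test > 0,
            "test_share": test / (test + prod) if test + prod else 0.0}

pr = [("src/parser.py", 120), ("tests/test_parser.py", 80)]
print(pr_test_profile(pr))  # {'has_tests': True, 'test_share': 0.4}
```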
Open-source scientific software is abundant, yet most tools remain difficult to compile, configure, and reuse, sustaining a small-workshop mode of scientific computing. This deployment bottleneck limits reproducibility, large-scale evaluation, and the practical integration of scientific tools into modern AI-for-Science (AI4S) and agentic workflows. We present Deploy-Master, a one-stop agentic workflow for large-scale tool discovery, build specification inference, execution-based validation, and publication. Guided by a taxonomy spanning 90+ scientific and engineering domains, our discovery stage starts from a recall-oriented pool of over 500,000 public repositories and progressively filters it to 52,550 executable tool candidates under license- and quality-aware criteria. Deploy-Master transforms heterogeneous open-source repositories into runnable, containerized capabilities grounded in execution rather than documentation claims. In a single day, we performed 52,550 build attempts and constructed reproducible runtime environments for 50,112 scientific tools. Each successful tool is validated by a minimal executable command and registered in SciencePedia for search and reuse, enabling direct human use and optional agent-based invocation. Beyond delivering runnable tools, we report a deployment trace at the scale of 50,000 tools, characterizing throughput, cost profiles, failure surfaces, and specification uncertainty that become visible only at scale. These results explain why scientific software remains difficult to operationalize and motivate shared, observable execution substrates as a foundation for scalable AI4S and agentic science.
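As a rough sketch of the execution-based validation step (running each tool's minimal executable command inside its container image and recording the outcome), under the assumption of a Docker-style runtime; the image names and commands are placeholders, and Deploy-Master's real harness is considerably more involved.

```python
# Minimal execution-based validation of a containerized tool.
import subprocess

def validate_tool(image: str, command: list[str], timeout_s: int = 300) -> bool:
    """Return True if the minimal command exits 0 inside the container."""
    try:
        result = subprocess.run(
            ["docker", "run", "--rm", image, *command],
            capture_output=True, timeout=timeout_s,
        )
        return result.returncode == 0
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False

# Placeholder example: validate_tool("example/toolx:latest", ["toolx", "--version"])
```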
Code translation across multiple programming languages is essential yet challenging due to two major obstacles: scarcity of parallel data paired with executable test oracles, and optimization imbalance when handling diverse language pairs. We propose BootTrans, a bootstrapping method that resolves both obstacles. Its key idea is to leverage the functional invariance and cross-lingual portability of test suites, adapting abundant pivot-language unit tests to serve as universal verification oracles for multilingual RL training. Our method introduces a dual-pool architecture with seed and exploration pools to progressively expand training data via execution-guided experience collection. Furthermore, we design a language-aware weighting mechanism that dynamically prioritizes harder translation directions based on relative performance across sibling languages, mitigating optimization imbalance. Extensive experiments on the HumanEval-X and TransCoder-Test benchmarks demonstrate substantial improvements over baseline LLMs across all translation directions, with ablations validating the effectiveness of both bootstrapping and weighting components.
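The exact weighting formula is not given in the abstract; one plausible shape of a "language-aware" scheme is sketched below, where translation directions whose pass rate lags behind their sibling languages receive proportionally more weight via a softmax over the pass-rate deficit. The softmax form and temperature are assumptions, not BootTrans's actual mechanism.

```python
# Assumed language-aware weighting: harder directions get larger weights.
import math

def direction_weights(pass_rates, temperature=0.2):
    """pass_rates: dict mapping translation direction -> current pass rate in [0, 1]."""
    deficits = {d: 1.0 - r for d, r in pass_rates.items()}
    exps = {d: math.exp(v / temperature) for d, v in deficits.items()}
    total = sum(exps.values())
    return {d: v / total for d, v in exps.items()}

rates = {"py->cpp": 0.62, "py->java": 0.71, "py->rust": 0.43}
print(direction_weights(rates))  # py->rust receives the largest weight
```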
Large Language Models (LLMs) are predominantly assessed based on their common sense reasoning, language comprehension, and logical reasoning abilities. While models trained in specialized domains like mathematics or coding have demonstrated remarkable advancements in logical reasoning, there remains a significant gap in evaluating their code generation capabilities. Existing benchmark datasets fall short in pinpointing specific strengths and weaknesses, impeding targeted enhancements in models' reasoning abilities to synthesize code. To bridge this gap, our paper introduces an innovative, pedagogical benchmarking method that mirrors the evaluation processes encountered in academic programming courses. We introduce CodeEval, a multi-dimensional benchmark dataset designed to rigorously evaluate LLMs across 24 distinct aspects of Python programming. The dataset covers three proficiency levels - beginner, intermediate, and advanced - and includes both class-based and function-based problem types with detailed problem specifications and comprehensive test suites. To facilitate widespread adoption, we also developed RunCodeEval, an open-source execution framework that provides researchers with a ready-to-use evaluation pipeline for CodeEval. RunCodeEval handles test execution, context setup, and metrics generation, enabling researchers to quickly obtain detailed insights into model strengths and weaknesses across complexity levels, problem types, and programming categories. This combination enables targeted evaluation and guides improvements in LLMs' programming proficiencies.
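As a generic sketch of the kind of pipeline RunCodeEval provides (not its actual interface), the snippet below executes a generated function against a small test suite and aggregates pass counts; running untrusted model output this way should only ever be done in a sandbox.

```python
# Generic execution of a generated function against (function, args, expected)
# test cases; names and the test-case format are assumptions.

def run_tests(generated_code: str, test_cases):
    """test_cases: list of (function_name, args, expected) triples."""
    namespace = {}
    try:
        exec(generated_code, namespace)          # define the candidate function
    except Exception:
        return {"passed": 0, "total": len(test_cases)}
    passed = 0
    for fn_name, args, expected in test_cases:
        try:
            if namespace[fn_name](*args) == expected:
                passed += 1
        except Exception:
            pass                                  # runtime error counts as a failure
    return {"passed": passed, "total": len(test_cases)}

code = "def add(a, b):\n    return a + b"
print(run_tests(code, [("add", (2, 3), 5), ("add", (-1, 1), 0)]))
```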