publications | Yuntong Zhang

2025

ICSE25

SpecRover: Code Intent Extraction via LLMs

Haifeng Ruan*, Yuntong Zhang*, and Abhik Roychoudhury (*Joint first authors, ordered alphabetically.)

In 47th International Conference on Software Engineering (ICSE), Ottawa, Canada, 2025

Autonomous program improvement typically involves automatically producing bug fixes and feature additions. Such program improvement can be accomplished by a combination of large language model (LLM) and program analysis capabilities, in the form of an LLM agent. Since program repair or program improvement typically requires a specification of intended behavior, specification inference can be useful for producing high-quality program patches. In this work, we examine efficient and low-cost workflows for iterative specification inference within an LLM agent. Given a GitHub issue to be resolved in a software project, our goal is to conduct iterative code search accompanied by specification inference, thereby inferring intent from both the project structure and behavior. The intent thus captured is examined by a reviewer agent with the goal of vetting the patches as well as providing a measure of confidence in the vetted patches. Our approach SpecRover (AutoCodeRover-v2) is built on the open-source LLM agent AutoCodeRover. In an evaluation on the full SWE-Bench consisting of 2294 GitHub issues, it shows more than 50% improvement in efficacy over AutoCodeRover. Compared to the open-source agents available, our work shows modest cost ($0.65 per issue) in resolving an average GitHub issue in SWE-Bench lite. The explanations produced by SpecRover give the developer a better "signal" on when the suggested patches can be accepted with confidence. SpecRover also seeks to demonstrate the continued importance of specification inference in automated program repair, even as program repair technologies enter the LLM era.

TOSEM25

EffFix: Efficient and Effective Repair of Pointer Manipulating Programs

Yuntong Zhang, Andreea Costea, Ridwan Shariffdeen, Davin McCall, and Abhik Roychoudhury

ACM Transactions on Software Engineering and Methodology (TOSEM), 2025

This work introduces EffFix, a tool that applies a novel static analysis-driven Automated Program Repair (APR) technique for fixing memory errors. APR tools typically rely on a given test suite to guide the repair process. Apart from the need to provide test oracles, this reliance is also one of the main contributors to the over-fitting problem. Static analysis-based APR techniques bypass these issues only to introduce new ones, such as soundness, scalability, and generalizability. This work demonstrates how we can overcome these challenges and achieve sound memory bug repair at scale by leveraging static analysis (specifically Incorrectness Separation Logic, ISL) to guide repair. This is the first repair approach to use ISL. Our key insight is that the abstract domain used by static analysis to detect the bugs also contains key information to derive correct patches. Our proposed approach learns what a desirable patch is by inspecting how close a patch is to fixing the bug, based on the feedback from ISL-based static analysis (specifically the Pulse analyzer), and turning this information into a distribution of probabilities over context-free grammars. This approach to repair is generic in that its learning strategy allows for finding patches without relying on the commonly used patch templates. Furthermore, to achieve efficient program repair, instead of focusing on heuristics for reducing the search space of patches, we make repair scalable by creating classes of equivalent patches according to the effect they have on the symbolic heap. We then conduct candidate patch validation only once per patch equivalence class. This allows EffFix to efficiently discover quality repairs even in the presence of a large pool of patch candidates. Experimental evaluation on fixing real-world memory errors in medium- to large-scale subjects such as OpenSSL, the Linux kernel, and swoole shows the efficiency and effectiveness of EffFix in terms of automatically producing repairs from large search spaces. In particular, EffFix has a fix ratio of 66% for memory leak bugs and 83% for null pointer dereferences on the considered dataset.
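The idea of validating only once per patch equivalence class can be illustrated with a minimal sketch. This is not EffFix's implementation: the function name `validate_once_per_class`, the `effect_key` callback (a stand-in for a summary of a patch's effect on the symbolic heap), and the `validate` callback (a stand-in for an expensive analyzer run) are all hypothetical names introduced here for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List


def validate_once_per_class(
    patches: Iterable[str],
    effect_key: Callable[[str], Hashable],
    validate: Callable[[str], bool],
) -> Dict[str, bool]:
    """Validate one representative per equivalence class of patches.

    effect_key is an assumed abstraction: two patches with the same key
    are treated as equivalent. validate is an assumed expensive check.
    """
    # Group candidates whose (hypothetical) effect summaries are identical.
    classes: Dict[Hashable, List[str]] = defaultdict(list)
    for patch in patches:
        classes[effect_key(patch)].append(patch)

    verdicts: Dict[str, bool] = {}
    for members in classes.values():
        # Run the expensive validator on one representative only,
        # then share its verdict with every member of the class.
        verdict = validate(members[0])
        for patch in members:
            verdicts[patch] = verdict
    return verdicts
```

With a toy key that ignores whitespace, the candidates `free(p);` and `free( p );` fall into one class, so a pool of three candidates that includes `p = NULL;` needs only two validator calls. The saving grows with the number of candidates that share an effect.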

2024

ISSTA24

AutoCodeRover: Autonomous Program Improvement

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury

In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), Vienna, Austria, 2024

Researchers have made significant progress in automating the software development process in the past decades. Automated techniques for issue summarization, bug reproduction, fault localization, and program repair have been built to ease the workload of developers. Recent progress in Large Language Models (LLMs) has significantly impacted the development process, where developers can use LLM-based programming assistants to achieve automated coding. Nevertheless, software engineering involves the process of program improvement apart from coding, specifically to enable software maintenance (e.g. program repair to fix bugs) and software evolution (e.g. feature additions). In this paper, we propose an automated approach for solving GitHub issues to autonomously achieve program improvement. In our approach called AutoCodeRover, LLMs are combined with sophisticated code search capabilities, ultimately leading to a program modification or patch. In contrast to recent LLM agent approaches from AI researchers and practitioners, our outlook is more software engineering oriented. We work on a program representation (abstract syntax tree) as opposed to viewing a software project as a mere collection of files. Our code search exploits the program structure in the form of classes/methods to enhance the LLM’s understanding of the issue’s root cause, and to effectively retrieve a context via iterative search. Spectrum-based fault localization using tests further sharpens the context, as long as a test suite is available. Experiments on the recently proposed SWE-bench-lite (300 real-life GitHub issues) show increased efficacy in solving GitHub issues (19% on SWE-bench-lite), which is higher than the efficacy of the recently reported SWE-agent. Interestingly, our approach resolved 57 GitHub issues in about 4 minutes each (pass@1), whereas developers spent more than 2.68 days on average. In addition, AutoCodeRover achieved this efficacy at significantly lower cost (on average, $0.43) compared to other baselines. We posit that our workflow enables autonomous software engineering, where, in future, auto-generated code from LLMs can be autonomously improved.
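As a rough illustration of spectrum-based fault localization, the sketch below ranks program elements by how strongly their execution correlates with failing tests. The abstract does not name a scoring formula, so the Ochiai metric used here is an assumption, and the function name `ochiai` and its input layout are hypothetical.

```python
import math
from typing import Dict, Set


def ochiai(coverage: Dict[str, Set[str]], passed: Dict[str, bool]) -> Dict[str, float]:
    """Score each program element with the Ochiai suspiciousness metric.

    coverage maps a test name to the set of elements it executed;
    passed maps a test name to True if that test passed.
    Ochiai is an assumed choice of formula, used for illustration only.
    """
    total_failed = sum(1 for ok in passed.values() if not ok)
    elements: Set[str] = set().union(*coverage.values()) if coverage else set()

    scores: Dict[str, float] = {}
    for element in elements:
        # Count failing and passing tests that executed this element.
        failed_cov = sum(1 for t, cov in coverage.items() if element in cov and not passed[t])
        passed_cov = sum(1 for t, cov in coverage.items() if element in cov and passed[t])
        denominator = math.sqrt(total_failed * (failed_cov + passed_cov))
        scores[element] = failed_cov / denominator if denominator else 0.0
    return scores
```

An element executed only by failing tests receives the maximum score of 1.0, while one that is also executed by many passing tests scores lower. The highest-ranked methods could then be handed to the code search as additional context.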

2023

arXiv

Program Repair by Fuzzing over Patch and Input Space

Yuntong Zhang, Ridwan Shariffdeen, Gregory J Duck, Jiaqi Tan, and Abhik Roychoudhury

arXiv preprint arXiv:2308.00666, 2023

Fuzz testing (fuzzing) is a well-known method for exposing bugs/vulnerabilities in software systems. Popular fuzzers, such as AFL, use a biased random search over the domain of program inputs, where 100s or 1000s of inputs (test cases) are executed per second in order to expose bugs. If a bug is discovered, it can either be fixed manually by the developer or fixed automatically using an Automated Program Repair (APR) tool. Like fuzzing, many existing APR tools are search-based, but over the domain of patches rather than inputs. In this paper, we propose search-based program repair as patch-level fuzzing. The basic idea is to adapt a fuzzer (AFL) to fuzz over the patch space rather than the input space. Thus we use a patch-space fuzzer to explore a patch space, while using a traditional input-level fuzzer to rule out patch candidates and help in patch selection. To improve the throughput, we propose a compilation-free patch validation methodology, where we execute the original (unpatched) program natively, then selectively interpret only the specific patched statements and expressions. Since this avoids (re)compilation, we show that compilation-free patch validation can achieve a throughput similar to that of input-level fuzzing (100s or 1000s of execs/sec). We show that patch-level fuzzing and input-level fuzzing can be combined, for a co-exploration of both spaces in order to find better quality patches. Such a collaboration between input-level fuzzing and patch-level fuzzing is then employed to search over candidate fix locations, as well as patch candidates in each fix location. Our results show that our tool FuzzRepair is more effective in patching security vulnerabilities than the well-known existing repair tools GenProg/Darjeeling, Prophet and Concolic Program Repair (CPR). Moreover, our approach produces other artifacts such as fix locations and crashing tests (which provide evidence for why patch candidates are ruled out). Thus our approach provides a pragmatic solution to enhance automation in program vulnerability repair, thereby reducing exposure of critical software systems to possible attacks.

2022

ISSTA22

Program Vulnerability Repair via Inductive Inference

Yuntong Zhang, Xiang Gao, Gregory J. Duck, and Abhik Roychoudhury

In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), Virtual, South Korea, 2022

Program vulnerabilities, even when detected and reported, are not fixed immediately. The time lag between the reporting and fixing of a vulnerability causes open-source software systems to suffer from significant exposure to possible attacks. In this paper, we propose a counter-example guided inductive inference procedure over program states to define likely invariants at possible fix locations. The likely invariants are constructed via mutation over states at the fix location, which turns out to be more effective for inductive property inference than the usual greybox fuzzing over program inputs. Once such likely invariants, which we call patch invariants, are identified, we can use them to construct patches via simple patch templates. Our work assumes that only one failing input (representing the exploit) is available to start the repair process. Experiments on the VulnLoc dataset of 39 vulnerabilities, which has been curated in previous works on vulnerability repair, show the effectiveness of our repair procedure. Compared to proposed approaches for vulnerability repair such as CPR or SenX, which are based on concolic and symbolic execution respectively, we can repair significantly more vulnerabilities. Our results show the potential for program repair via inductive constraint inference, as opposed to generating repair constraints via deductive/symbolic analysis of a given test suite.

EuroSys22

Hardening Binaries against More Memory Errors

Gregory J. Duck, Yuntong Zhang, and Roland H. C. Yap

In Proceedings of the Seventeenth European Conference on Computer Systems (EuroSys), Rennes, France, 2022

Memory errors, such as buffer overflows and use-after-free, remain the root cause of many security vulnerabilities in modern software. The use of closed source software further exacerbates the problem, as source-based memory error mitigation cannot be applied. While many memory error detection tools exist, most are based on a single error detection methodology with resulting known limitations, such as incomplete memory error detection (redzones) or false error detections (low-fat pointers). In this paper we introduce RedFat, a memory error hardening tool for stripped binaries that is fast, practical and scalable. The core idea behind RedFat is to combine complementary error detection methodologies (redzones and low-fat pointers) in order to detect more memory errors than can be detected by each individual methodology alone. However, complementary error detection also inherits the limitations of each approach, such as false error detections from low-fat pointers. To mitigate this, we introduce a profile-based analysis that automatically determines the strongest memory error protection possible without negative side effects. We implement RedFat on top of a scalable binary rewriting framework, and demonstrate low overheads compared to the current state-of-the-art. We show that RedFat is language agnostic on C/C++/Fortran binaries with minimal requirements, and that it works with stripped binaries for both position independent/dependent code. We also show that the RedFat instrumentation can scale to very large/complex binaries, such as Google Chrome.