publications
2024
- AutoCodeRover: Autonomous Program ImprovementIn Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA) 2024
Researchers have made significant progress in automating the software development process in the past decades. Automated techniques for issue summarization, bug reproduction, fault localization, and program repair have been built to ease the workload of developers. Recent progress in Large Language Models (LLMs) has significantly impacted the development process, where developers can use LLM-based programming assistants to achieve automated coding. Nevertheless, software engineering involves the process of program improvement apart from coding, specifically to enable software maintenance (e.g. program repair to fix bugs) and software evolution (e.g. feature additions). In this paper, we propose an automated approach for solving Github issues to autonomously achieve program improvement. In our approach called AutoCodeRover, LLMs are combined with sophisticated code search capabilities, ultimately leading to a program modification or patch. In contrast to recent LLM agent approaches from AI researchers and practitioners, our outlook is more software engineering oriented. We work on a program representation (abstract syntax tree) as opposed to viewing a software project as a mere collection of files. Our code search exploits the program structure in the form of classes/methods to enhance LLM’s understanding of the issue’s root cause, and effectively retrieve a context via iterative search. The use of spectrum-based fault localization using tests, further sharpens the context, as long as a test-suite is available. Experiments on the recently proposed SWE-bench-lite (300 real-life Github issues) show increased efficacy in solving Github issues (19% on SWE-bench-lite), which is higher than the efficacy of the recently reported Swe-agent. Interestingly, our approach resolved 57 GitHub issues in about 4 minutes each (pass@1), whereas developers spent more than 2.68 days on average. In addition, AutoCodeRover achieved this efficacy with significantly lower cost (on average, $0.43 USD), compared to other baselines. We posit that our workflow enables autonomous software engineering, where, in future, auto-generated code from LLMs can be autonomously improved.
2023
- Patch Space Exploration using Static Analysis FeedbackarXiv preprint arXiv:2308.00294 2023
Automated Program Repair (APR) techniques typically rely on a given test-suite to guide the repair process. Apart from the need to provide test oracles, this makes the produced patches prone to test data over-fitting. In this work, instead of relying on test cases, we show how to automatically repair memory safety issues, by leveraging static analysis (specifically Incorrectness Separation Logic) to guide repair. Our proposed approach learns what a desirable patch is by inspecting how close a patch is to fixing the bug based on the feedback from incorrectness separation logic based static analysis (specifically the Pulse analyser), and turning this information into a distribution of probabilities over context free grammars. Furthermore, instead of focusing on heuristics for reducing the search space of patches, we make repair scalable by creating classes of equivalent patches according to the effect they have on the symbolic heap, and then invoking the validation oracle only once per class of patch equivalence. This allows us to efficiently discover repairs even in the presence of a large pool of patch candidates offered by our generic patch synthesis mechanism. Experimental evaluation of our approach was conducted by repairing real world memory errors in OpenSSL, swoole and other subjects. The evaluation results show the scalability and efficacy of our approach in automatically producing high quality patches.
- Program Repair by Fuzzing over Patch and Input SpacearXiv preprint arXiv:2308.00666 2023
Fuzz testing (fuzzing) is a well-known method for exposing bugs/vulnerabilities in software systems. Popular fuzzers, such as AFL, use a biased random search over the domain of program inputs, where 100s or 1000s of inputs (test cases) are executed per second in order to expose bugs. If a bug is discovered, it can either be fixed manually by the developer or fixed automatically using an Automated Program Repair (APR) tool. Like fuzzing, many existing APR tools are search-based, but over the domain of patches rather than inputs. In this paper, we propose search-based program repair as patch-level fuzzing. The basic idea is to adapt a fuzzer (AFL) to fuzz over the patch space rather than the input space. Thus we use a patch-space fuzzer to explore a patch space, while using a traditional input level fuzzer to rule out patch candidates and help in patch selection. To improve the throughput, we propose a compilation-free patch validation methodology, where we execute the original (unpatched) program natively, then selectively interpret only the specific patched statements and expressions. Since this avoids (re)compilation, we show that compilation-free patch validation can achieve a similar throughput as input-level fuzzing (100s or 1000s of execs/sec). We show that patch-level fuzzing and input-level fuzzing can be combined, for a co-exploration of both spaces in order to find better quality patches. Such a collaboration between input-level fuzzing and patch-level fuzzing is then employed to search over candidate fix locations, as well as patch candidates in each fix location. Our results show that our tool FuzzRepair is more effective in patching security vulnerabilities than well-known existing repair tools GenProg/Darjeeling, Prophet and Concolic Program Repair (CPR). Moreover, our approach produces other artifacts such as fix locations, and crashing tests (which show the evidence why patch candidates are ruled out). Thus our approach provides a pragmatic solution to enhance automation in program vulnerability repair, thereby reducing exposure of critical software systems to possible attacks.
2022
- Program Vulnerability Repair via Inductive InferenceIn Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA) 2022
Program vulnerabilities, even when detected and reported, are not fixed immediately. The time lag between the reporting and fixing of a vulnerability causes open-source software systems to suffer from significant exposure to possible attacks. In this paper, we propose a counter-example guided inductive inference procedure over program states to define likely invariants at possible fix locations. The likely invariants are constructed via mutation over states at the fix location, which turns out to be more effective for inductive property inference, as compared to the usual greybox fuzzing over program inputs. Once such likely invariants, which we call patch invariants, are identified, we can use them to construct patches via simple patch templates. Our work assumes that only one failing input (representing the exploit) is available to start the repair process. Experiments on the VulnLoc data-set of 39 vulnerabilities, which has been curated in previous works on vulnerability repair, show the effectiveness of our repair procedure. As compared to proposed approaches for vulnerability repair such as CPR or SenX which are based on concolic and symbolic execution respectively, we can repair significantly more vulnerabilities. Our results show the potential for program repair via inductive constraint inference, as opposed to generating repair constraints via deductive/symbolic analysis of a given test-suite.
- Hardening Binaries against More Memory ErrorsGregory J. Duck, Yuntong Zhang, and Roland H. C. YapIn Proceedings of the Seventeenth European Conference on Computer Systems (EuroSys) 2022
Memory errors, such as buffer overflows and use-after-free, remain the root cause of many security vulnerabilities in modern software. The use of closed source software further exacerbates the problem, as source-based memory error mitigation cannot be applied. While many memory error detection tools exist, most are based on a single error detection methodology with resulting known limitations, such as incomplete memory error detection (redzones) or false error detections (low-fat pointers). In this paper we introduce RedFat, a memory error hardening tool for stripped binaries that is fast, practical and scalable. The core idea behind RedFat is to combine complementary error detection methodologies—redzones and low-fat pointers—in order to detect more memory errors that can be detected by each individual methodology alone. However, complementary error detection also inherits the limitations of each approach, such as false error detections from low-fat pointers. To mitigate this, we introduce a profile-based analysis that automatically determines the strongest memory error protection possible without negative side effects.We implement RedFat on top of a scalable binary rewriting framework, and demonstrate low overheads compared to the current state-of-the-art. We show RedFat to be language agnostic on C/C++/Fortran binaries with minimal requirements, and works with stripped binaries for both position independent/dependent code. We also show that the RedFat instrumentation can scale to very large/complex binaries, such as Google Chrome.