What makes it so time-consuming to build out effective pull request review? How might other developers build their own version of a pull request review tool? Below we’ll reveal some of the challenges that we faced while creating GitClear’s automated pull request review.
The precursor to all subsequent prompting challenges is to create code interpretation infrastructure that can offer the LLM sufficient context to explain the code it is given. Information like “which function/method did the change occur within?” and “what issue tracker ticket was this change in response to?” is prerequisite to a quality summarization.
The first challenge one faces in attempting to extract useful feedback from an LLM: how to represent a changed code snippet with enough information to respond intelligently, but not so much that useless details are regurgitated back to the system by an over-informed prompt?
The most obvious answer is to use the existing precedent for representing code changes: the venerable git hunk. The problem with this option is that it communicates only code bodies, and only two operations: “add” or “delete.”
As we often point out [todo: link], when developers read code, only about 50% of code changes would be considered a true “addition” or “deletion.” The other 50% of changes are more precisely described by humans as “update,” “find/replace,” or “copy/paste.”
Git hunks have other shortcomings. They don’t communicate the function/class context that a line resides within. Nor do they indicate which lines were subsequently churned. Since around 30% of newly added lines are revised before they settle into their final form, labeling churn explicitly saves reviewers from commenting on lines that have already changed. Finally, git hunks say nothing about the history of a line, or more specifically, whether a line has been the source of past bugs and so warrants additional consideration. Our pull request input includes these factors, among others, as a JSON array.
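To make that concrete, here is a hypothetical shape for one entry in such a JSON array. The field names and values are illustrative guesses, not GitClear’s actual schema:

```python
import json

# Hypothetical shape of one entry in the JSON array fed to the LLM.
# Field names are illustrative, not GitClear's actual schema.
changed_line = {
    "content": "return user.sessions.last",
    "operation": "update",            # add / delete / update / find-replace / copy-paste
    "enclosing_function": "current_session",
    "enclosing_class": "User",
    "churned_later": False,           # was this line revised in a later commit?
    "past_bug_fixes": 2,              # times this line appeared in bug-fix commits
    "ticket": "PROJ-1234",            # issue tracker reference, when resolvable
}

print(json.dumps([changed_line], indent=2))
```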
When a human reviewer summarizes a change to code, their attention might settle on a myriad of different aspects. Was the file newly added? Was the change required to address a bug? Were the lines deleted in an effort to DRY redundant methods? An effective, human-like response must consider tens of attributes of the code in order to possess the superset of potentially relevant details.
The counterbalance to feeding the LLM so much information is that the more it knows, the more probable it becomes that irrelevant details will be regurgitated back to the inquirer. Our efforts to build code summarization with ChatGPT suggest that, no matter how sternly our prompt admonished it, the model remained prone to responding conversationally. For example: “In path/to/code/file.ts, the low-effort additions to the file included…”
We went through several iterations of regular expressions designed to strip responses of their preamble babble. If you would like to use our “sans small talk” text translation, you can find it here [todo: paste code]. It has proven reasonably effective at snipping the babble from responses, but there are still cases where it leaves behind preamble of unfamiliar form, and cases where it discards useful details nested among the noise. Eventually we had to acknowledge that fine-tuning was requisite if we hoped to supply an LLM with enough information to render human-like responses. It is just too tempting for the LLM to take your details and babble them back at you sans relevance. Not so unlike a new developer who has been trained to understand code, but is not yet experienced enough to separate the wheat from the chaff.
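As an illustration of the approach (not our production “sans small talk” regex), a minimal preamble filter might look like the following. The opener list and pattern are deliberately simplified:

```python
import re

# Illustrative only: drops a few common conversational openers from
# each line of an LLM response. Not GitClear's production regex.
PREAMBLE = re.compile(
    r"^(?:sure|certainly|of course|here(?:'s| is)|in [\w/.\-]+)[!,.:]?\s*",
    re.IGNORECASE,
)

def strip_small_talk(response: str) -> str:
    """Remove leading chit-chat from each line; drop lines left empty."""
    cleaned = (PREAMBLE.sub("", line) for line in response.splitlines())
    return "\n".join(line for line in cleaned if line.strip())
```

Note that a filter like this exhibits exactly the failure modes described above: a substantive sentence that happens to open with “here is” gets truncated, and unfamiliar preamble slips through untouched.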
Since nobody has done this before, there is no roadmap for where the UI should begin and end. In GitHub’s prototype of Next Gen Pull Requests [link], they show a tool that mostly sidesteps interactivity. Their prospective tool is activated through a text incantation [true?], and displays its output as a stream of comments separate from the usual “Code” tab. In doing so, they pass on the opportunity to accelerate comprehension of what changed.
Our research suggests that people want a review assistant that helps them comprehend what changed, why it changed, and how the end goal might be alternately achieved. Each of these objectives probably calls for a different touch point within the UI.
Zooming out from the tentative conclusions GitHub has made about how this should work, these are the general questions facing git providers:
At what point during the PR process should the tool intervene? Before the PR is posted? After it is posted, but before human feedback? Whenever the author requests it?
What scope of feedback should be provided? Should it focus on ensuring that the PR abides by project conventions, or should it do more? If it is just enforcing conventions, why not just use a linter? If it does more than enforce conventions, should it suggest changes to the scope of the feature? Should it suggest changes in variable naming? Should it point out code that might lead to bugs? How can it be prompted to make such suggestions?
Where should the automated feedback be shown? Should it be shown as comments alongside the human comments? What about feedback on the general structure of the PR or the scope of files/methods? In these cases there might be no specific line to comment on - should this be part of the tool? If so, where should it be presented?
Every reviewer approaches a pull request with different concerns and areas of experience. Every author brings a different level of expertise within the repo where their code was authored. To create value-adding comment suggestions, the knowledge of both the reviewer and the author must be taken into account.
Beyond suggesting comment content that simulates a particular reviewer, the ideal AI-powered pull request tool should make it possible to simulate what unavailable reviewers might have said about a given pull request. All of this implies the need to accumulate a wealth of training data to fine-tune comment suggestions in the style of various reviewers.
To build the “wealth of training data” needed, we must approximate the inputs that human reviewers considered in order to render their pull request feedback.
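Concretely, each training example can pair the diff context a reviewer saw with the comment they actually left. The sketch below uses the chat-message record shape common to OpenAI-style fine-tuning; the system prompt wording and field contents are hypothetical:

```python
import json

def build_training_example(diff_context: dict, reviewer: str, comment: str) -> dict:
    """One fine-tuning record: the inputs a reviewer saw, paired with
    the comment they left. Record shape follows the chat-message format
    used by OpenAI-style fine-tuning; contents here are hypothetical."""
    return {
        "messages": [
            {"role": "system",
             "content": f"You review pull requests in the style of {reviewer}."},
            {"role": "user", "content": json.dumps(diff_context)},
            {"role": "assistant", "content": comment},
        ]
    }
```

Accumulating thousands of such records per reviewer is what would let a model approximate that reviewer’s voice when they are unavailable.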