Coding on Copilot: 2023 Data Shows Downward Pressure on Code Quality, Plus Projections for 2024

Final version: Available on Google Docs




Abstract

2023 was a very good year for GitHub Copilot. In little more than a year, it grew to be used by millions of developers and tens of thousands of businesses. Copilot's popularity suggests that 2023 marked the beginning of a new era in how code is authored.


Less understood is what impact all these AI-generated lines might have on code quality and maintainability. In this paper, GitClear looks at more than 100 million lines of code to decipher the patterns in how writing code has changed since Copilot's debut.


We find several concerning trends for Lead Developers hoping to maintain a pliable code base in the long term. Code churn -- the percentage of lines that are changed or reverted less than two weeks after being authored -- is projected to double in 2024 relative to its 2021 level. We also find that the percentage of "code added" and "code copy/pasted," especially by junior developers, is increasing at a much higher rate than code that is updated, deleted, or moved to a consolidated location (aka "DRY code").


We conclude with suggestions for managers who seek to maintain high code quality in spite of the forces increasingly driving against that end.


GitHub: "55% faster coding. 46% more code written. $1.5 trillion added to global GDP"

With numbers like these, it's little wonder that GitHub's own CEO, Thomas Dohmke, would carve time out of his schedule to write about the AI revolution, in a blog post (and research paper) he published on GitHub in 2023. Surely, there were less busy people who could have written about the magnitude of Copilot's impact. But, considering this is the most popular product GitHub has launched since its inception, you can understand why Dohmke would choose basking in the glow of his hit product over the usual day-to-day rigors of running a GitHub-sized company.


From Dohmke's 2023 blog post, "The economic impact of the AI-powered developer lifecycle and lessons from GitHub Copilot"

When we saw these numbers, it confirmed our intuition that Copilot is already making a measurable impact on the development ecosystem. In the same blog post, Dohmke asserts that more than 20,000 organizations are already using GitHub Copilot for Business. This is in addition to the "more than one million people" that GitHub stated were already using Copilot on a Personal license as of February 2023, when Copilot for Business was released.


At this point, it's not hard to imagine that at least a third of all global developers have access to either GitHub Copilot or an AI coding assistant like it. In 2024 and beyond, the proliferation of AI-assisted code seems likely to continue, or even accelerate.


The Problem with AI-generated Code

Developers wouldn't be adopting Copilot if they didn't believe that it accelerated their ability to produce code. GitHub's "55% faster coding" measurement attests that Copilot certainly succeeds on this count. The question is, "what cost are teams paying for this convenience?"


Developer researchers are concerned by the impact of AI-assisted programming

GitHub claims that code is written "55% faster" with Copilot. But what about code that shouldn't be written in the first place? That, too, is written 55% faster.


That is the first of several challenges facing developers who use an AI assistant. Others include:

Being inundated with suggestions for added code, but never presented with suggestions for updating, moving, or deleting code

Time required to evaluate code suggestions can become costly, especially when the developer works in an environment with competing auto-suggest mechanisms

Code suggestions are generally optimized for the likelihood that the developer will accept them, not for whether the code is correct, or whether it will even run

These drawbacks presumably account for a portion of the difference in Suggestion Acceptance Rate between Junior and Senior Developers:


GitHub's own data suggests that Junior Developers accept Copilot suggestions around 20% more often than experienced developers


Code Change Definitions

GitClear classifies code changes (operations) into seven categories, six of which are analyzed in this research:

Added code. Newly committed lines of code that are distinct, excluding lines that incrementally change an existing line (labeled "Updates"). "Added code" also does not include lines that are added, removed, and then re-added (these lines are labeled as "Updated" and "Churned")

Deleted code. Lines of code that are removed, committed, and not subsequently re-added for at least the next two weeks.

Moved code. A line of code that is cut and pasted to a new file, or a new function within the same file. By definition, the content of a "Moved" operation doesn't change within a commit, except for (potentially) the white space that precedes the content.

Updated code. A committed line of code, based on an existing line of code, that modifies the existing line by approximately three words or fewer.

Find/replaced code. A pattern of code change where the same string is removed from 3+ locations and substituted with consistent replacement content.

Copy/pasted code. Line content, excluding programming language keywords (e.g., end, }, ]), that is committed to multiple files or functions within a day.

No-op code. Trivial code changes, such as changes to white space, or changes in line number within the same code block. No-op code is excluded from this research.

Specific examples of GitClear's code operations can be found in the Diff Delta documentation. GitClear has been classifying git repos by these operations since 2020. As of January 2024, GitClear has analyzed and classified around a billion lines of code over four years.
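
As a rough illustration of how some of these operations can be distinguished, the sketch below (all names hypothetical, and a heavy simplification of GitClear's actual rules) labels an added line "Moved" when identical content was removed elsewhere in the same commit, and "Copy/pasted" when the same non-keyword content is added in multiple places:

```ruby
# Hypothetical, simplified classifier for a few of the operations above.
# `removed` and `added` are the stripped line contents deleted/added in one commit.
KEYWORDS = %w(end } ]).freeze # language keywords are excluded from copy/paste detection

def classify_added_line(line, removed:, added:)
  content = line.strip
  return :no_op if content.empty?                  # whitespace-only change
  if removed.include?(content)
    :moved                                         # cut in one place, pasted in another
  elsif added.count(content) > 1 && !KEYWORDS.include?(content)
    :copy_pasted                                   # same content committed to multiple places
  else
    :added
  end
end
```

For example, a line that also appears in the commit's removals classifies as `:moved`, while a repeated non-keyword addition classifies as `:copy_pasted`; a repeated `end` does not.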


For this research, we are also exploring the change in "Churned code." This is not treated as a code operation, because a churned line can simultaneously be an "Added," "Deleted," or "Updated" operation. For a line to qualify as "churned," it must have been authored, pushed to the main branch, and then revised within the subsequent two weeks. Churn is best understood as "changes that were either incomplete or erroneous when the author initially accepted, committed, and pushed them."
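
Under that definition, churn can be computed from per-line timestamps. A minimal sketch, assuming a hypothetical per-line record with authored and first-revised dates:

```ruby
require "date"

TWO_WEEKS = 14 # days; the paper's churn window

# Hypothetical per-line record: when the line was authored vs. first revised.
Line = Struct.new(:authored_on, :revised_on)

# A line is "churned" if it was revised within two weeks of being authored.
def churned?(line)
  return false if line.revised_on.nil?
  (line.revised_on - line.authored_on).to_i <= TWO_WEEKS
end

lines = [
  Line.new(Date.new(2023, 5, 1), Date.new(2023, 5, 9)),  # revised after 8 days  -> churned
  Line.new(Date.new(2023, 5, 1), Date.new(2023, 7, 1)),  # revised after 61 days -> not churned
  Line.new(Date.new(2023, 5, 1), nil)                    # never revised         -> not churned
]

churn_rate = lines.count { |l| churned?(l) }.fdiv(lines.size)  # 1 of 3 lines
```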


Trends in Commit Line Operations

As a first approximation of how Copilot has changed development, we analyzed the number of different line operations that GitClear has observed, segmented by the year in which the code was authored (using the authored_at date within the git commit header). The raw numbers for this analysis are included in the Appendix. Here are the percentages by year:



| Year | Added | Deleted | Updated | Moved | Copy/pasted | Find/replaced | Churn |
|------|-------|---------|---------|-------|-------------|---------------|-------|
| 2020 | 39.18% | 19.47% | 5.19% | 24.99% | 8.26% | 2.92% | 3.32% |
| 2021 | 39.49% | 19.03% | 4.99% | 24.69% | 8.43% | 3.37% | 3.63% |
| 2022 | 41.05% | 20.15% | 5.22% | 20.46% | 9.43% | 3.68% | 3.97% |
| 2023 | 42.34% | 21.12% | 5.50% | 16.92% | 10.49% | 3.63% | 5.53% |
| 2024 (projected) | 43.63% | 22.09% | 5.78% | 13.38% | 11.55% | 3.58% | 7.09% |


Here is how these look in graph form, where the left axis illustrates the prevalence of code change operations (which, as percentages, sum to 100%). The right axis tracks the change in "Churn" code:



The projections for 2024 utilize OpenAI's gpt-4-1106-preview Assistant to run a quadratic regression on how each operation's percentage changed through 2023. The full output of the OpenAI Assistant is provided in the Appendix. Given the exponential growth of Copilot reported by GitHub, it seems likely that 2024's numbers will continue the trends that began to take form in 2022.
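
For intuition, a quadratic extrapolation of this kind can be sketched directly: fit y = a·t² + b·t + c through the last three observed years and evaluate one year out. This is an illustration, not a reproduction of the OpenAI Assistant's output; run on the "Moved" percentages, it lands near (though not exactly at) the projection above.

```ruby
require "matrix"

# Fit an exact quadratic y = a*t^2 + b*t + c through the last three observed
# years of "Moved" percentages, then extrapolate one year forward.
years = [2021, 2022, 2023]
moved = [24.69, 20.46, 16.92]

t = years.map { |y| y - 2021 }                      # 0, 1, 2, for readability
vandermonde = Matrix[*t.map { |x| [x * x, x, 1] }]
a, b, c = (vandermonde.inverse * Vector[*moved]).to_a

projection_2024 = a * 3**2 + b * 3 + c              # t = 3 -> about 14.1%
```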


Looking only at the differences in operation frequency between 2022 and 2023, we find:


| Operation | YoY change |
|-----------|------------|
| Added | +3.1% |
| Deleted | +4.8% |
| Updated | +5.2% |
| Moved | -17.3% |
| Copy/pasted | +11.3% |
| Find/replaced | -1.3% |
| Churn | +39.2% |
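
These deltas are relative (percent-of-percent) changes computed from the two years' operation shares. A sketch of that arithmetic, using the rounded percentages from the earlier table (because the inputs are rounded, a result can differ from the table by roughly 0.1):

```ruby
# Relative change in each operation's share of all changes, 2022 -> 2023.
pct_2022 = { added: 41.05, deleted: 20.15, updated: 5.22, moved: 20.46,
             copy_pasted: 9.43, find_replaced: 3.68, churn: 3.97 }
pct_2023 = { added: 42.34, deleted: 21.12, updated: 5.50, moved: 16.92,
             copy_pasted: 10.49, find_replaced: 3.63, churn: 5.53 }

yoy = pct_2022.to_h { |op, v| [op, ((pct_2023[op] - v) / v * 100).round(1)] }
# yoy[:moved] -> -17.3
```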


Interpreting Significant Code Operation Changes

The most significant changes observed to correlate with the proliferation of Copilot are "Churn," "Moved," and "Copy/pasted." The implications of each change are reviewed in turn.


Burgeoning Churn

Recall that "Churn" is the percentage of code that was pushed to the repo, then subsequently removed or updated within 2 weeks. This was a relatively infrequent outcome when developers authored all their own code -- only 3-4% of code was churned prior to 2023.


Coinciding with the growth of Copilot, there is a surge in how often "mistake code" is being pushed to the repo, running the risk of being deployed to production. If the current pattern continues into 2024, more than 7% of all code changes will be reverted within two weeks, double the rate of 2021. The implications for growth in Google DORA's "Change Failure Rate" are likely to manifest when the 2024 State of Devops report is released later in the year.


Less Moved Code Implies Less Refactoring

Moved code is typically observed when refactoring an existing code system. Refactored systems in general, and moved code in particular, are key to enabling code reuse. As a product grows in scope, developers traditionally rearrange existing code into new modules that can be reused by newly added features. The benefits of code reuse are familiar to experienced developers. Compared with newly added code, reused code has already been tested & proven stable in production. Often, reused code has been touched by multiple developers, which increases the likelihood that the code comes with documentation. This accelerates the interpretation of the module by developers who are new to it.


Combined with the growth in code labeled "Copy/Pasted," it seems clear that the current implementation of AI Assistants discourages code reuse. Instead of refactoring and working to DRY ("Don't Repeat Yourself") code, these Assistants suggest authoring new code that repeats existing code.


More Copy/Pasted Code Implies Future Headaches

There is perhaps no greater scourge to long-term code maintainability than copy/pasted code. In effect, when a non-keyword line of code is repeated, the code author is admitting "I didn't have the time or inclination to evaluate the previous implementation." By re-adding code instead of reusing it, the chore is left to future maintainers to figure out how to consolidate parallel code paths that implement some repeatedly-needed functionality.


Since most developers derive much greater satisfaction from "implementing new features" than they do "interpreting potentially reusable legacy code," copy/pasted code often persists long past its expiration date. Especially on less experienced teams, there may be no code maintainer with the moral authority to mandate code reuse. Even when there are Senior Developers possessing such authority, the willpower cost of understanding code well enough to consolidate it is hard to overstate.


If there isn't a CTO or VP of Engineering who actively schedules time to reduce tech debt, you can add "executive-driven time pressures" to the long list of reasons that the copy/pasted code will never be consolidated into the component libraries that underpin long-term development velocity.


Trends in Revised Code Age

Another way to assess how Copilot is influencing code quality is to extract the data from GitClear's Code Provenance derivation. This provides an independent, secondary check on whether the patterns observed in the Code Operation analysis hold up.



| Year | Less than 2 weeks | Less than one month | Less than one year | 1-2 years |
|------|-------------------|---------------------|--------------------|-----------|
| 2020 | 65.9% | 8.7% | 21.8% | 3.6% |
| 2021 | 66.7% | 9.0% | 20.5% | 3.8% |
| 2022 | 64.7% | 9.9% | 21.1% | 4.4% |
| 2023 | 71.3% | 9.3% | 16.4% | 3.0% |
| 2024 (projected) | 74.4% | 9.1% | 14.1% | 2.4% |


In its visualized graph form:



Interpreting Code Age Trends

The trend in this data corroborates the patterns observed in the previous Code Operation analysis. When code gets updated, the code being revised is, year over year, increasingly young. Since the 2020 epoch of our data set, the steepest drop is in code that gets revised more than one month, but less than 12 months, after it was initially authored.


The trend suggests that, absent AI Assistants, developers were more likely to find recently authored code in their repo to target for refinement and reuse. Around 70% of products built in the early 2020s use the Agile Methodology, per a Techreport survey [5]. In Agile, features are typically planned and executed per Sprint, and a typical Sprint lasts 2-3 weeks. It aligns with the data to surmise that teams circa 2020 were more likely to convene post-Sprint to discuss what was recently implemented and how to build upon it in a subsequent Sprint. Judging by the data, that seems to be happening much less often lately.


Moving in the opposite direction is code that gets revised less than two weeks after being initially authored. After hovering between 65% and 67% through the period before widespread AI Assistants, in 2023 it popped up to 71.3%, a 6.6 percentage point increase over 2022 that came at the expense of refactoring more seasoned code.


Questions for Follow-Up Research

Can incentives be created to counteract the "add it and forget it" tendency that today's AI systems promote? It is hard to imagine what sort of developer experience could be implemented that would guide a developer toward preferring reuse over reinventing the wheel. It's conceivable that AI could be trained to identify opportunities where similar code could be consolidated, and offer the developer tools to consolidate the copy/pasted mess that is currently propagating throughout repos. But even if this hypothetical "consolidation AI" were built, when would it be invoked? The same pressures that generally prevent teams from scheduling time to reduce tech debt would also prevent them from stopping the feature pipeline for cleanup.


Another salient question in light of this data: at what rate does development progress become inhibited by additional code? Especially when it comes to copy/pasted code, which tends to seed indecision about which utility method to use among multiple similar choices, there is almost certainly an inverse correlation between "the number of lines of code in a repo" and "the velocity at which developers can modify those lines." The current uncertainty is "when is the accumulated cruft too great to be tolerated?" Knowing the rate at which slowdown takes hold would allow future tools to highlight when a manager should consider cutting back time on new features.


Conclusion: Developers Are Right to be Wary

By all measures we evaluate, AI tooling exerted negative pressure on code quality throughout 2023.


Developer assessments, like GitHub's 2023 survey with Wakefield Research, suggest that developers already perceive the decrease in code quality. When asked "What metrics should you be evaluated on, absent AI?" their top response was "Collaboration and Communication," with "Code Quality" in second place. When the question switched to "What metrics should you be evaluated on, when actively using AI?" their responses shifted, with "Code Quality" becoming the top concern and "Number of production incidents" rising to the #3 concern:



While individual developers lack the data to substantiate why "code quality" and "production incidents" become critical concerns when using AI, our data suggests a possible backstory. When developers are inundated with quick and easy suggestions that will probably work in the short term, it becomes a constant temptation to add more lines of code without checking whether an existing system could be refined for reuse.


To the extent that inexperienced developers continue to be offered easy copy/paste suggestions, the fix for this situation won't be easy. In the age of Copilot, it is incumbent on engineering leaders to monitor incoming data and consider its implications for future product maintenance. There are a growing number of tools, including GitClear, that offer Developer Analytics. When evaluating these, we recommend managers consider adopting tools that can help detect when problematic code is festering.


When it comes to building a product, there's no question that AI assistance leads to more lines of code being added. The better question for 2024: who's on the hook to clean up the mess?




Citations


Appendix

Data used to build this research is included below.


Raw data for changed line counts


| Year | Added | Deleted | Updated | Moved | Copy/pasted | Find/replaced | Lines changed | Churn |
|------|-------|---------|---------|-------|-------------|---------------|---------------|-------|
| 2020 | 9,071,731 | 4,508,098 | 1,202,480 | 5,786,718 | 1,911,855 | 676,000 | 23,156,882 | 769,493 |
| 2021 | 14,464,864 | 6,969,778 | 1,826,579 | 9,043,649 | 3,087,530 | 1,234,213 | 36,626,613 | 1,331,278 |
| 2022 | 16,868,378 | 8,280,031 | 2,146,768 | 8,407,677 | 3,873,240 | 1,512,708 | 41,088,802 | 1,630,703 |
| 2023 | 22,626,714 | 11,288,962 | 2,938,800 | 9,040,659 | 5,607,373 | 1,942,194 | 53,444,702 | 2,952,912 |
| 2024 (projected) | 28,708,803 | 14,535,353 | 3,803,275 | 8,804,121 | 7,599,970 | 2,355,662 | 65,800,602 | 4,665,263 |


Here are some secondary characteristics of the analyzed data set, to aid in evaluating its validity and applicability relative to existing data sets the reader may possess:

| Year | Commit count | Committer count | Repos analyzed | Code files changed |
|------|--------------|-----------------|----------------|--------------------|
| 2020 | 381,347 | 12,761 | 497 | 1,368,549 |
| 2021 | 623,264 | 17,577 | 643 | 2,207,498 |
| 2022 | 723,823 | 18,446 | 993 | 2,616,263 |
| 2023 | 1,019,680 | 21,700 | 1,294 | 3,414,136 |


In CSV pasteable form, for your reanalysis convenience (2024 omitted since it is a projection you can replace with your own):

Year,Added,Deleted,Updated,Moved,Copy/pasted,Find/replaced,Lines changed,Churn
2020,9071731,4508098,1202480,5786718,1911855,676000,23156882,769493
2021,14464864,6969778,1826579,9043649,3087530,1234213,36626613,1331278
2022,16868378,8280031,2146768,8407677,3873240,1512708,41088802,1630703
2023,22626714,11288962,2938800,9040659,5607373,1942194,53444702,2952912
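
The percentage table in "Trends in Commit Line Operations" derives from these counts. A sketch of that derivation, with two of the years inlined (the other rows work the same way):

```ruby
require "csv"

# Derive each operation's share of "Lines changed" from the raw counts.
RAW = <<~CSV
  Year,Added,Deleted,Updated,Moved,Copy/pasted,Find/replaced,Lines changed,Churn
  2020,9071731,4508098,1202480,5786718,1911855,676000,23156882,769493
  2023,22626714,11288962,2938800,9040659,5607373,1942194,53444702,2952912
CSV

OPS = ["Added", "Deleted", "Updated", "Moved", "Copy/pasted", "Find/replaced", "Churn"].freeze

shares = CSV.parse(RAW, headers: true).map do |row|
  total = row["Lines changed"].to_f
  [row["Year"], OPS.to_h { |op| [op, (row[op].to_f / total * 100).round(2)] }]
end.to_h
# shares["2023"]["Added"] -> 42.34, matching the percentage table
```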


Queries used to produce data

The data was stored in a Postgres database and was queried via Ruby on Rails' ActiveRecord.

# Operation by year
2020.upto(2023).map { |year| CodeLine.where(authored_at: Time.new(year, 1, 1)..Time.new(year + 1, 1, 1), commit_impacting: true).group(:operation_em).count }
# Commits by year
Commit.impacting.where(authored_at: Time.local(2020,1,1)..Time.local(2024,1,1)).group("EXTRACT (year from authored_at)").count
# Committers committing by year
2020.upto(2023).map { |year| Committer.joins(:commits).merge(Commit.impacting).where(commits: { authored_at: Time.local(year,1,1)..Time.local(year+1,1,1) }).group(:id).count.size }
# Repos changed by year
2020.upto(2023).map { |year| Repo.joins(:commits).merge(Commit.impacting).where(commits: { authored_at: Time.local(year,1,1)..Time.local(year+1,1,1) }).group(:id).count.size }
# Files by year
2020.upto(2023).map { |year| CommitCodeFile.impacting.joins(:commit).where(commits: { authored_at: Time.local(year,1,1)..Time.local(year+1,1,1) }).group(:id).count.size }


Raw data for revised line counts

| Year | Less than 2 weeks | Less than one month | Less than one year | 1-2 years |
|------|-------------------|---------------------|--------------------|-----------|
| 2020 | 550,362 | 72,471 | 182,420 | 30,074 |
| 2021 | 891,008 | 120,029 | 274,125 | 50,825 |
| 2022 | 1,136,604 | 173,370 | 369,925 | 77,463 |
| 2023 | 1,941,351 | 254,082 | 445,869 | 82,405 |


In CSV pasteable form, for your own reanalysis convenience:

Year,Less than 2 weeks,Less than one month,Less than one year,1-2 years,Sum
2020,550362,72471,182420,30074,835327
2021,891008,120029,274125,50825,1335987
2022,1136604,173370,369925,77463,1757362
2023,1941351,254082,445869,82405,2723707
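
As a sanity check, the "71.3% revised within two weeks" figure cited in "Interpreting Code Age Trends" falls out of these counts directly:

```ruby
# Share of revised lines that were less than two weeks old when revised.
revised = {
  2022 => { under_two_weeks: 1_136_604, total: 1_757_362 },
  2023 => { under_two_weeks: 1_941_351, total: 2_723_707 }
}

under_2w_share = revised.transform_values do |r|
  (r[:under_two_weeks].fdiv(r[:total]) * 100).round(1)
end
# under_2w_share[2023] -> 71.3
```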


Queries used to produce data

The data was stored in a Postgres database and was queried via Ruby on Rails' ActiveRecord.

# Revision (provenance) query
2020.upto(2023).map { |year| CodeLine.where(authored_at: Time.new(year, 1, 1)..Time.new(year + 1, 1, 1), commit_impacting: true).group(:provenance_em).count }


OpenAI Assistant Data Projection
