Coding on Copilot: 2023 Data Shows Downward Pressure on Code Quality, Plus Projections for 2024

Final version: Available on Google Docs




Abstract

2023 was a very good year for GitHub Copilot. In little more than a year, it grew to be used by millions of developers and tens of thousands of businesses. Copilot's popularity suggests that 2023 marked the beginning of a new era in how code is authored.


Less understood is what impact all these AI-generated lines might have on code quality and maintainability. In this paper, GitClear looks at more than 100 million lines of code to decipher the patterns in how writing code has changed since Copilot's debut.


We find several concerning trends for Lead Developers hoping to maintain a pliable code base in the long term. Code churn -- the percentage of lines that are changed or reverted less than two weeks after being authored -- is projected to double in 2024 relative to its 2021 level. We also find that the percentage of "code added" and "code copy/pasted," especially by junior developers, is increasing at a much higher rate than code that is updated, deleted, or moved to a consolidated location (aka "DRY code").


We conclude with suggestions for managers who seek to maintain high code quality in spite of the forces increasingly driving against that end.


GitHub: "55% faster coding. 46% more code written. $1.5 trillion added to global GDP"

With numbers like these, it's little wonder that GitHub's own CEO, Thomas Dohmke, would carve time out of his schedule to write about the AI revolution, in a blog post (and research paper) he published on GitHub in 2023. Surely, there were less busy people who could have written about the magnitude of Copilot's impact. But, considering this is the most popular product GitHub has launched since its inception, you can understand why Dohmke would choose basking in the glow of his hit product over the usual day-to-day rigors of running a GitHub-sized company.


From Dohmke's 2023 blog post, "The economic impact of the AI-powered developer lifecycle and lessons from GitHub Copilot"

When we saw these numbers, it confirmed our intuition that Copilot is already making a measurable impact on the development ecosystem. In the same blog post, Dohmke asserts that more than 20,000 organizations are already using GitHub Copilot for Business. This is in addition to the "more than one million people" that GitHub stated were already using Copilot on a Personal license as of February 2023, when Copilot for Business was released.


At this point, it's not hard to imagine that at least a third of all global developers have access to either GitHub Copilot or an AI coding assistant like it. In 2024 and beyond, the proliferation of AI-assisted code seems likely to continue, or even accelerate.


The Problem with AI-generated Code

Developers wouldn't be adopting Copilot if they didn't believe that it accelerated their ability to produce code. GitHub's "55% faster coding" measurement attests that Copilot certainly succeeds on this count. The question is, "what cost are teams paying for this convenience?"


Developer researchers are concerned by the impact of AI-assisted programming

GitHub claims that code is written "55% faster" with Copilot. But what about code that shouldn't be written in the first place? That, too, is written 55% faster.


That is the first of several challenges facing developers who use an AI assistant. Others include:

Being inundated with suggestions for added code, but never presented with suggestions for updating, moving, or deleting code

Time required to evaluate code suggestions can become costly, especially when the developer works in an environment with competing auto-suggest mechanisms

Code suggestions are generally optimized for the likelihood that the developer will accept them, not for whether the code is correct, or whether it will even run

These drawbacks presumably account for a portion of the difference in Suggestion Acceptance Rate between Junior and Senior Developers:


GitHub's own data suggests that Junior Developers accept Copilot suggestions around 20% more often than experienced developers


Code Change Definitions

GitClear classifies code changes (operations) into seven categories, six of which are analyzed in this research:

Added code. Newly committed lines of code that are distinct, excluding lines that incrementally change an existing line (labeled "Updates"). "Added code" also does not include lines that are added, removed, and then re-added (these lines are labeled as "Updated" and "Churned")

Deleted code. Lines of code that are removed, committed, and not subsequently re-added for at least the next two weeks.

Moved code. A line of code that is cut and pasted to a new file, or a new function within the same file. By definition, the content of a "Moved" operation doesn't change within a commit, except for (potentially) the white space that precedes the content.

Updated code. A committed line of code, based on an existing line of code, that modifies the existing line by approximately three words or fewer.

Find/replaced code. A pattern of code change where the same string is removed from 3+ locations and substituted with consistent replacement content.

Copy/pasted code. Line content, excluding programming language keywords (e.g., end, }, ]), that is committed to multiple files or functions within a day.

No-op code. Trivial code changes, such as changes to white space, or changes in line number within the same code block. No-op code is excluded from this research.

Specific examples of GitClear's code operations can be found in the Diff Delta documentation. GitClear has been classifying git repos by these operations since 2020. As of January 2024, GitClear has analyzed and classified around a billion lines of code over four years.
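
As a rough illustration of how some of these operations can be distinguished, the sketch below (all names hypothetical, and a heavy simplification of GitClear's actual rules) labels an added line "Moved" when identical content was removed elsewhere in the same commit, and "Copy/pasted" when the same non-keyword content is added in multiple places:

```ruby
# Hypothetical, simplified classifier for a few of the operations above.
# `removed` and `added` are the stripped line contents deleted/added in one commit.
KEYWORDS = %w(end } ]).freeze # language keywords are excluded from copy/paste detection

def classify_added_line(line, removed:, added:)
  content = line.strip
  return :no_op if content.empty?                  # whitespace-only change
  if removed.include?(content)
    :moved                                         # cut in one place, pasted in another
  elsif added.count(content) > 1 && !KEYWORDS.include?(content)
    :copy_pasted                                   # same content committed to multiple places
  else
    :added
  end
end
```

For example, a line that also appears in the commit's removals classifies as `:moved`, while a repeated non-keyword addition classifies as `:copy_pasted`; a repeated `end` does not.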


For this research, we are also exploring the change in "Churned code." This is not treated as a code operation, because a churned line can simultaneously be an "Added," "Deleted," or "Updated" operation. For a line to qualify as "churned," it must have been authored, pushed to the main branch, and then revised within the subsequent two weeks. Churn is best understood as "changes that were either incomplete or erroneous when the author initially accepted, committed, and pushed them."
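
Under that definition, churn can be computed from per-line timestamps. A minimal sketch, assuming a hypothetical per-line record with authored and first-revised dates:

```ruby
require "date"

TWO_WEEKS = 14 # days; the paper's churn window

# Hypothetical per-line record: when the line was authored vs. first revised.
Line = Struct.new(:authored_on, :revised_on)

# A line is "churned" if it was revised within two weeks of being authored.
def churned?(line)
  return false if line.revised_on.nil?
  (line.revised_on - line.authored_on).to_i <= TWO_WEEKS
end

lines = [
  Line.new(Date.new(2023, 5, 1), Date.new(2023, 5, 9)),  # revised after 8 days  -> churned
  Line.new(Date.new(2023, 5, 1), Date.new(2023, 7, 1)),  # revised after 61 days -> not churned
  Line.new(Date.new(2023, 5, 1), nil)                    # never revised         -> not churned
]

churn_rate = lines.count { |l| churned?(l) }.fdiv(lines.size)  # 1 of 3 lines
```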


Trends in Commit Line Operations

As a first approximation of how Copilot has changed development, we analyzed the number of different line operations that GitClear has observed, segmented by the year in which the code was authored (using the authored_at date within the git commit header). The raw numbers for this analysis are included in the Appendix. Here are the percentages by year:



| Year | Added | Deleted | Updated | Moved | Copy/pasted | Find/replaced | Churn |
|------|-------|---------|---------|-------|-------------|---------------|-------|
| 2020 | 39.18% | 19.47% | 5.19% | 24.99% | 8.26% | 2.92% | 3.32% |
| 2021 | 39.49% | 19.03% | 4.99% | 24.69% | 8.43% | 3.37% | 3.63% |
| 2022 | 41.05% | 20.15% | 5.22% | 20.46% | 9.43% | 3.68% | 3.97% |
| 2023 | 42.34% | 21.12% | 5.50% | 16.92% | 10.49% | 3.63% | 5.53% |
| 2024 (projected) | 43.63% | 22.09% | 5.78% | 13.38% | 11.55% | 3.58% | 7.09% |


Here is how these look in graph form, where the left axis illustrates the prevalence of code change operations (which, as percentages, sum to 100%). The right axis tracks the change in "Churn" code:



The projections for 2024 utilize OpenAI's gpt-4-1106-preview Assistant to run a quadratic regression on how each operation's percentage changed through 2023. The full output of the OpenAI Assistant is provided in the Appendix. Given the exponential growth of Copilot reported by GitHub, it seems likely that 2024's numbers will continue the trends that began to take form in 2022.
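
For intuition, a quadratic extrapolation of this kind can be sketched directly: fit y = a·t² + b·t + c through the last three observed years and evaluate one year out. This is an illustration, not a reproduction of the OpenAI Assistant's output; run on the "Moved" percentages, it lands near (though not exactly at) the projection above.

```ruby
require "matrix"

# Fit an exact quadratic y = a*t^2 + b*t + c through the last three observed
# years of "Moved" percentages, then extrapolate one year forward.
years = [2021, 2022, 2023]
moved = [24.69, 20.46, 16.92]

t = years.map { |y| y - 2021 }                      # 0, 1, 2, for readability
vandermonde = Matrix[*t.map { |x| [x * x, x, 1] }]
a, b, c = (vandermonde.inverse * Vector[*moved]).to_a

projection_2024 = a * 3**2 + b * 3 + c              # t = 3 -> about 14.1%
```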


Looking only at the differences in operation frequency between 2022 and 2023, we find:


| Operation | YoY change |
|-----------|------------|
| Added | +3.1% |
| Deleted | +4.8% |
| Updated | +5.2% |
| Moved | -17.3% |
| Copy/pasted | +11.3% |
| Find/replaced | -1.3% |
| Churn | +39.2% |
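
These deltas are relative (percent-of-percent) changes computed from the two years' operation shares. A sketch of that arithmetic, using the rounded percentages from the earlier table (because the inputs are rounded, a result can differ from the table by roughly 0.1):

```ruby
# Relative change in each operation's share of all changes, 2022 -> 2023.
pct_2022 = { added: 41.05, deleted: 20.15, updated: 5.22, moved: 20.46,
             copy_pasted: 9.43, find_replaced: 3.68, churn: 3.97 }
pct_2023 = { added: 42.34, deleted: 21.12, updated: 5.50, moved: 16.92,
             copy_pasted: 10.49, find_replaced: 3.63, churn: 5.53 }

yoy = pct_2022.to_h { |op, v| [op, ((pct_2023[op] - v) / v * 100).round(1)] }
# yoy[:moved] -> -17.3
```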


Interpreting Significant Code Operation Changes

The most significant changes observed to correlate with the proliferation of Copilot are "Churn," "Moved," and "Copy/pasted." The implications of each change are reviewed in turn.


Burgeoning Churn

Recall that "Churn" is the percentage of code that was pushed to the repo, then subsequently removed or updated within 2 weeks. This was a relatively infrequent outcome when developers authored all their own code -- only 3-4% of code was churned prior to 2023.


Coinciding with the growth of Copilot, there is a surge in how often "mistake code" is being pushed to the repo, running the risk of being deployed to production. If the current pattern continues into 2024, more than 7% of all code changes will be reverted within two weeks, double the rate of 2021. The implications for growth in Google DORA's "Change Failure Rate" are likely to manifest when the 2024 State of Devops report is released later in the year.


Less Moved Code Implies Less Refactoring

Moved code is typically observed when refactoring an existing code system. Refactored systems in general, and moved code in particular, are key to enabling code reuse. As a product grows in scope, developers traditionally rearrange existing code into new modules that can be reused by newly added features. The benefits of code reuse are familiar to experienced developers. Compared with newly added code, reused code has already been tested & proven stable in production. Often, reused code has been touched by multiple developers, which increases the likelihood that the code comes with documentation. This accelerates the interpretation of the module by developers who are new to it.


Combined with the growth in code labeled "Copy/Pasted," it seems clear that the current implementation of AI Assistants discourages code reuse. Instead of refactoring and working to DRY ("Don't Repeat Yourself") code, these Assistants suggest authoring new code that repeats existing code.


More Copy/Pasted Code Implies Future Headaches

There is perhaps no greater scourge to long-term code maintainability than copy/pasted code. In effect, when a non-keyword line of code is repeated, the code author is admitting "I didn't have the time or inclination to evaluate the previous implementation." By re-adding code instead of reusing it, the chore is left to future maintainers to figure out how to consolidate parallel code paths that implement some repeatedly-needed functionality.


Since most developers derive much greater satisfaction from "implementing new features" than they do "interpreting potentially reusable legacy code," copy/pasted code often persists long past its expiration date. Especially on less experienced teams, there may be no code maintainer with the moral authority to mandate code reuse. Even when there are Senior Developers possessing such authority, the willpower cost of understanding code well enough to consolidate it is hard to overstate.


If there isn't a CTO or VP of Engineering who actively schedules time to reduce tech debt, you can add "executive-driven time pressures" to the long list of reasons that the copy/pasted code will never be consolidated into the component libraries that underpin long-term development velocity.


Trends in Revised Code Age

Another way to assess how Copilot is influencing code quality is to extract the data from GitClear's Code Provenance derivation. This provides an independent, secondary check on whether the patterns observed in the Code Operation analysis hold up.



| Year | Less than 2 weeks | Less than one month | Less than one year | 1-2 years |
|------|-------------------|---------------------|--------------------|-----------|
| 2020 | 65.9% | 8.7% | 21.8% | 3.6% |
| 2021 | 66.7% | 9.0% | 20.5% | 3.8% |
| 2022 | 64.7% | 9.9% | 21.1% | 4.4% |
| 2023 | 71.3% | 9.3% | 16.4% | 3.0% |
| 2024 (projected) | 74.4% | 9.1% | 14.1% | 2.4% |


In its visualized graph form:



Interpreting Code Age Trends

The trend in this data corroborates the patterns observed in the previous Code Operation analysis. When code gets updated, the code being revised is, year over year, increasingly young. Since the 2020 epoch of our data set, the steepest drop is in code that gets revised more than one month, but less than 12 months, after it was initially authored.


The trend suggests that, absent AI Assistants, developers were more likely to find recently authored code in their repo to target for refinement and reuse. Around 70% of products built in the early 2020s use the Agile Methodology, per a Techreport survey [5]. In Agile, features are typically planned and executed per Sprint, and a typical Sprint lasts 2-3 weeks. It aligns with the data to surmise that teams circa 2020 were more likely to convene post-Sprint to discuss what was recently implemented and how to build upon it in a subsequent Sprint. Judging by the data, that seems to be happening much less often lately.


Moving in the opposite direction is code that gets revised less than two weeks after being initially authored. After hovering between 65% and 67% through the period before widespread AI Assistants, in 2023 it popped up to 71.3%, a 6.6 percentage point increase over 2022 that came at the expense of refactoring more seasoned code.


Questions for Follow-Up Research

Can incentives be created to counteract the "add it and forget it" tendency that today's AI systems promote? It is hard to imagine what sort of developer experience could be implemented that would guide a developer toward preferring reuse over reinventing the wheel. It's conceivable that AI could be trained to identify opportunities where similar code could be consolidated, and offer the developer tools to consolidate the copy/pasted mess that is currently propagating throughout repos. But even if this hypothetical "consolidation AI" were built, when would it be invoked? The same pressures that generally prevent teams from scheduling time to reduce tech debt would also prevent them from stopping the feature pipeline for cleanup.


Another salient question in light of this data: at what rate does development progress become inhibited by additional code? Especially when it comes to copy/pasted code, which tends to seed indecision about which utility method to use among multiple similar choices, there is almost certainly an inverse correlation between "the number of lines of code in a repo" and "the velocity at which developers can modify those lines." The current uncertainty is "when is the accumulated cruft too great to be tolerated?" Knowing the rate at which slowdown takes hold would allow future tools to highlight when a manager should consider cutting back time on new features.


Conclusion: Developers Are Right to be Wary

By all measures we evaluate, AI tooling exerted negative pressure on code quality throughout 2023.


Developer assessments, like GitHub's 2023 survey with Wakefield Research, suggest that developers already perceive the decrease in code quality. When asked "What metrics should you be evaluated on, absent AI?" their top response was "Collaboration and Communication," with "Code Quality" in second place. When the question switched to "What metrics should you be evaluated on, when actively using AI?" their responses shifted, with "Code Quality" becoming the top concern and "Number of production incidents" rising to the #3 concern:



While individual developers lack the data to substantiate why "code quality" and "production incidents" become critical concerns when using AI, our data suggests a possible backstory. When developers are inundated with quick and easy suggestions that will probably work in the short term, it becomes a constant temptation to add more lines of code without checking whether an existing system could be refined for reuse.


To the extent that inexperienced developers continue to be offered easy copy/paste suggestions, the fix for this situation won't be easy. In the age of Copilot, it is incumbent on engineering leaders to monitor incoming data and consider its implications for future product maintenance. There are a growing number of tools, including GitClear, that offer Developer Analytics. When evaluating these, we recommend managers consider adopting tools that can help detect when problematic code is festering.


When it comes to building a product, there's no question that AI assistance leads to more lines of code being added. The better question for 2024: who's on the hook to clean up the mess?




Citations


Appendix

Data used to build this research is included below.


Raw data for changed line counts


| Year | Added | Deleted | Updated | Moved | Copy/pasted | Find/replaced | Lines changed | Churn |
|------|-------|---------|---------|-------|-------------|---------------|---------------|-------|
| 2020 | 9,071,731 | 4,508,098 | 1,202,480 | 5,786,718 | 1,911,855 | 676,000 | 23,156,882 | 769,493 |
| 2021 | 14,464,864 | 6,969,778 | 1,826,579 | 9,043,649 | 3,087,530 | 1,234,213 | 36,626,613 | 1,331,278 |
| 2022 | 16,868,378 | 8,280,031 | 2,146,768 | 8,407,677 | 3,873,240 | 1,512,708 | 41,088,802 | 1,630,703 |
| 2023 | 22,626,714 | 11,288,962 | 2,938,800 | 9,040,659 | 5,607,373 | 1,942,194 | 53,444,702 | 2,952,912 |
| 2024 (projected) | 28,708,803 | 14,535,353 | 3,803,275 | 8,804,121 | 7,599,970 | 2,355,662 | 65,800,602 | 4,665,263 |


Here are some secondary characteristics of the analyzed data set, to aid in evaluating its validity and applicability relative to existing data sets the reader may possess:

| Year | Commit count | Committer count | Repos analyzed | Code files changed |
|------|--------------|-----------------|----------------|--------------------|
| 2020 | 381,347 | 12,761 | 497 | 1,368,549 |
| 2021 | 623,264 | 17,577 | 643 | 2,207,498 |
| 2022 | 723,823 | 18,446 | 993 | 2,616,263 |
| 2023 | 1,019,680 | 21,700 | 1,294 | 3,414,136 |


In CSV pasteable form, for your reanalysis convenience (2024 omitted since it is a projection you can replace with your own):

Year,Added,Deleted,Updated,Moved,Copy/pasted,Find/replaced,Lines changed,Churn
2020,9071731,4508098,1202480,5786718,1911855,676000,23156882,769493
2021,14464864,6969778,1826579,9043649,3087530,1234213,36626613,1331278
2022,16868378,8280031,2146768,8407677,3873240,1512708,41088802,1630703
2023,22626714,11288962,2938800,9040659,5607373,1942194,53444702,2952912
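
The percentage table in "Trends in Commit Line Operations" derives from these counts. A sketch of that derivation, with two of the years inlined (the other rows work the same way):

```ruby
require "csv"

# Derive each operation's share of "Lines changed" from the raw counts.
RAW = <<~CSV
  Year,Added,Deleted,Updated,Moved,Copy/pasted,Find/replaced,Lines changed,Churn
  2020,9071731,4508098,1202480,5786718,1911855,676000,23156882,769493
  2023,22626714,11288962,2938800,9040659,5607373,1942194,53444702,2952912
CSV

OPS = ["Added", "Deleted", "Updated", "Moved", "Copy/pasted", "Find/replaced", "Churn"].freeze

shares = CSV.parse(RAW, headers: true).map do |row|
  total = row["Lines changed"].to_f
  [row["Year"], OPS.to_h { |op| [op, (row[op].to_f / total * 100).round(2)] }]
end.to_h
# shares["2023"]["Added"] -> 42.34, matching the percentage table
```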


Queries used to produce data

The data was stored in a Postgres database and was queried via Ruby on Rails' ActiveRecord.

# Operation by year
2020.upto(2023).map { |year| CodeLine.where(authored_at: Time.new(year, 1, 1)..Time.new(year + 1, 1, 1), commit_impacting: true).group(:operation_em).count }
# Commits by year
Commit.impacting.where(authored_at: Time.local(2020,1,1)..Time.local(2024,1,1)).group("EXTRACT (year from authored_at)").count
# Committers committing by year
2020.upto(2023).map { |year| Committer.joins(:commits).merge(Commit.impacting).where(commits: { authored_at: Time.local(year,1,1)..Time.local(year+1,1,1) }).group(:id).count.size }
# Repos changed by year
2020.upto(2023).map { |year| Repo.joins(:commits).merge(Commit.impacting).where(commits: { authored_at: Time.local(year,1,1)..Time.local(year+1,1,1) }).group(:id).count.size }
# Files by year
2020.upto(2023).map { |year| CommitCodeFile.impacting.joins(:commit).where(commits: { authored_at: Time.local(year,1,1)..Time.local(year+1,1,1) }).group(:id).count.size }


Raw data for revised line counts

| Year | Less than 2 weeks | Less than one month | Less than one year | 1-2 years |
|------|-------------------|---------------------|--------------------|-----------|
| 2020 | 550,362 | 72,471 | 182,420 | 30,074 |
| 2021 | 891,008 | 120,029 | 274,125 | 50,825 |
| 2022 | 1,136,604 | 173,370 | 369,925 | 77,463 |
| 2023 | 1,941,351 | 254,082 | 445,869 | 82,405 |


In CSV pasteable form, for your own reanalysis convenience:

Year,Less than 2 weeks,Less than one month,Less than one year,1-2 years,Sum
2020,550362,72471,182420,30074,835327
2021,891008,120029,274125,50825,1335987
2022,1136604,173370,369925,77463,1757362
2023,1941351,254082,445869,82405,2723707
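
As a sanity check, the "71.3% revised within two weeks" figure cited in "Interpreting Code Age Trends" falls out of these counts directly:

```ruby
# Share of revised lines that were less than two weeks old when revised.
revised = {
  2022 => { under_two_weeks: 1_136_604, total: 1_757_362 },
  2023 => { under_two_weeks: 1_941_351, total: 2_723_707 }
}

under_2w_share = revised.transform_values do |r|
  (r[:under_two_weeks].fdiv(r[:total]) * 100).round(1)
end
# under_2w_share[2023] -> 71.3
```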


Queries used to produce data

The data was stored in a Postgres database and was queried via Ruby on Rails' ActiveRecord.

# Revision (provenance) query
2020.upto(2023).map { |year| CodeLine.where(authored_at: Time.new(year, 1, 1)..Time.new(year + 1, 1, 1), commit_impacting: true).group(:provenance_em).count }


OpenAI Assistant Data Projection
