What is "code provenance," and why does it matter?

"Code Provenance" is GitClear's term for our interpretation of where each changed line of code came from. Its graph is one of those that can be expanded to full size from the "Historical" page:


Navigating to full-sized "code provenance" graph within Historical stats


Example graphs

Here are a couple of graphs of annual provenance, to help you get a sense of what "normal" values might look like.


Alloy.dev

This graph sums the Diff Delta accumulated across our company's products (Bonanza.com, GitClear, Amplenote, and NoteApps.info):


Example graph showing Code Provenance over time

These projects range from 3 to 10 years old, but they still average about 70% new code, with the remaining 30% spread across all of the older-code provenances.


Open source repos

This is the aggregate of all open source projects we calculate stats for, including React Native, TensorFlow, Microsoft Visual Studio Code, Rails, and others. It is a broad sample set:

Example graph showing code provenance across an assortment of open source repos

The aggregated open source repos average about 80% brand new code. The runner-up provenance has tended to be churned code: in the sample screenshot, 6% of Diff Delta went to rewriting code less than two weeks old.


Provenance designations glossary

The Provenance Report groups all code into the following buckets:


✨ Brand new code: For most projects, this tends to run from 50-80% of all Diff Delta. GitClear recommends alternating phases of adding new code with phases of paring back code and rewriting legacy code that has become tech debt.


🏗️ "Less than two weeks" and "Less than one month": Combined, these tend to run from 2-10%. This approximates the amount of churn, or rework, the team is doing. While "churn" can sometimes imply unclear (or misinterpreted) project requirements, it just as often means that the developer is discovering cleaner ways to implement their original prototype solution.


😨 Less Than One Year: This provenance should be less than 5% on most healthy teams. When a team dedicates a lot of energy to rewriting code that was authored more than a month ago but less than a year ago, it suggests medium-term uncertainty or rapid evolution in what is to be implemented.


🧓 One to Two Years: One to two years is when code can begin to accumulate tech debt, especially in a rapidly evolving project. Healthy levels of "One to Two Years" can range from 2-15%. You don't want too few revisions of this sort, since that suggests recent years' features are being ignored.


🛁 More than Two Years: This is the ideal type of Diff Delta. It generally indicates that the team is continuing to modernize legacy code and pay down tech debt. You don't see much of this provenance on most projects; anywhere from 1-5% is the usual range. If you are trying to improve a legacy repo, much higher values would be desirable.
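
To make the buckets above concrete, here is a minimal Python sketch of how a changed line could be assigned a provenance label from its age, and how per-bucket percentages could be computed. It is an illustration only, not GitClear's implementation: the day cutoffs (14, 30, 365, 730), the function names, and the `(authored_on, changed_on, diff_delta)` input shape are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical age cutoffs in days. The article names the buckets but does
# not state exact boundaries, so these values are assumptions.
AGE_BUCKETS = [
    (14, "Less than two weeks"),
    (30, "Less than one month"),
    (365, "Less than one year"),
    (730, "One to two years"),
]


def provenance_bucket(authored_on, changed_on):
    """Return a provenance label for one changed line.

    authored_on is None when the line did not exist before this change.
    """
    if authored_on is None:
        return "Brand new code"
    age_days = (changed_on - authored_on).days
    for limit, label in AGE_BUCKETS:
        if age_days < limit:
            return label
    return "More than two years"


def provenance_shares(changes):
    """Sum Diff Delta per bucket and return each bucket's percentage.

    changes is an iterable of (authored_on, changed_on, diff_delta) tuples,
    a hypothetical input shape chosen for this sketch.
    """
    totals = defaultdict(float)
    for authored_on, changed_on, diff_delta in changes:
        totals[provenance_bucket(authored_on, changed_on)] += diff_delta
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {label: 100.0 * value / grand_total for label, value in totals.items()}


# Made-up sample data, not taken from any real repo.
example_changes = [
    (None, date(2024, 6, 1), 70.0),               # line did not exist before
    (date(2024, 5, 25), date(2024, 6, 1), 10.0),  # 7 days old
    (date(2023, 1, 1), date(2024, 6, 1), 15.0),   # 517 days old
    (date(2020, 1, 1), date(2024, 6, 1), 5.0),    # over four years old
]
example_shares = provenance_shares(example_changes)
```

With this sample data, `example_shares` maps "Brand new code" to 70%, "Less than two weeks" to 10%, "One to two years" to 15%, and "More than two years" to 5%, which can then be compared against the typical ranges in the glossary.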