Method for extracting signal-rich commit count data

This document describes the method used in a side project undertaken out of sheer curiosity about one question: how many commits per year does the average developer make?

Limitations of "Commit Count" as a productivity metric (it isn't)

This author has spent no shortage of past words chronicling the shortcomings of "commit count" (henceforth, CC) as a benchmark. The reason commit data holds intrigue in spite of those shortcomings is that the further one zooms out, the less intrinsically bad CC becomes. This explains in part why our Diff Delta correlation research showed that CC correlated with developer effort at 0.5 (out of a maximum 1.0) within the large repo we analyzed that used the most effective Story Point estimation techniques.

Data collection method

To gather the CC data presented, we created a small web scraper and sent it out to crawl GitHub, collecting the last three years of commit history for every developer it could locate. As of November 2021, the crawler has processed data from 1.4m unique GitHub accounts and 1.9m developer-years. Of those 1.9m records, 1.1m recorded zero commits in the year.

Scraper v1

After we collected 90,000 developer-year data points, we put together a first attempt to visualize the annual commit count of all developers. The data was good for identifying the average number of commits per year, but it wasn't useful for identifying how many commits were happening among developers at the 99th percentile and above, because the data set was polluted by numerous committers whose commit history indicated they had authored hundreds of commits per day.

Scraper v2

To combat the prevalence of developers who make automated commits, we re-populated our developer commit counts, this time capturing a few new metrics:

Max commits made on a single date over the year analyzed

90th percentile of commits made during the year analyzed

Median number of commits made per active day during the year analyzed

We gathered this data by taking an array of days from the year analyzed, sorting them, and then plucking out the desired records to produce the quantities listed.
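The sort-and-pluck approach described above might look something like the following. This is a minimal sketch, not the scraper's actual code: the function name `daily_commit_metrics`, the input shape (one commit count per day of the year, zeros included), and the nearest-rank definition of the 90th percentile are all assumptions made for illustration.

```python
from statistics import median
from typing import Dict, List


def daily_commit_metrics(daily_counts: List[int]) -> Dict[str, float]:
    """Summarize one developer-year from its per-day commit counts.

    Assumption: daily_counts holds one integer per day of the year,
    with 0 for days on which no commits were made.
    """
    if not daily_counts:
        return {"max_daily": 0, "p90_daily": 0, "median_active_day": 0}

    ordered = sorted(daily_counts)
    n = len(ordered)

    # Nearest-rank percentile: rank = ceil(0.9 * n), done in integer math
    # to avoid floating-point rounding surprises.
    p90_index = (9 * n + 9) // 10 - 1

    # "Active" days are assumed to be days with at least one commit.
    active = [count for count in ordered if count > 0]

    return {
        "max_daily": ordered[-1],
        "p90_daily": ordered[p90_index],
        "median_active_day": median(active) if active else 0,
    }


example = daily_commit_metrics([0, 0, 1, 2, 3, 4, 5, 6, 7, 10])
print(example)
```

Whether the 90th percentile should be taken over all days or only over active days is not stated in the method, so the choice here (all days) is a guess that is easy to swap.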

High-level goals suggest data filtering to apply

To yield the most interesting, informative results, we wanted to impose two constraints on the data that would be included in our analysis. We wanted only to analyze years in which developers were working full-time and were authoring their commits manually.

The ✨✨✨ coolest thing ✨✨✨ about this data: most developers' commit counts do include commits made to private repos

All previous git research we have seen analyzes only commits made to open source repos, since those are the repos where commit data is publicly available to researchers. But, with our goal being to get a sense for how many commits the average developer makes per weekday, it was insufficient to consider only open source developers, whose circumstances often differ substantively from those of a developer working full-time at a conventional business. For this data to be juiciest, it needs to include all the work from everybody who uses GitHub. And so it does. If you, dear reader, are a GitHub developer, it's very likely that the work you've done at "your day job" is contributing to the aggregate visualizations we're sharing.

Specific restrictions to capture non-automated commits made by full-time developers

To exclude commits made automatically, and to avoid data pollution from contributors who only make commits as a side project, we applied two constraints to determine which of the 1.9m developer-years could be factored into the final analysis.

The developer-year must have 3 <= max daily commits <= 75. Why 75?

The developer-year must have either 100 <= total annual commits <= 25,000 (why 100 as a lower limit and 25k as an upper limit?) or 1,000 <= total annual commits <= 25,000 (why 1,000 as a lower limit?)

Applying these constraints reduced the 1.9m available developer-years to 295k developer-years that met both criteria (using the minimum of 100 commits per year).
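As a rough illustration, the two constraints can be expressed as a single predicate over a developer-year record. The `DeveloperYear` dataclass, its field names, and the `passes_filters` function are hypothetical names introduced for this sketch. Only the thresholds (3, 75, 100 or 1,000, and 25,000) come from the text above.

```python
from dataclasses import dataclass


@dataclass
class DeveloperYear:
    """Hypothetical record shape for one developer's year of commits."""
    total_commits: int
    max_daily_commits: int


def passes_filters(
    dev_year: DeveloperYear,
    min_annual: int = 100,      # use 1_000 for the stricter variant
    max_annual: int = 25_000,
    min_max_daily: int = 3,
    max_max_daily: int = 75,
) -> bool:
    """Return True when the developer-year satisfies both constraints."""
    daily_ok = min_max_daily <= dev_year.max_daily_commits <= max_max_daily
    annual_ok = min_annual <= dev_year.total_commits <= max_annual
    return daily_ok and annual_ok


records = [
    DeveloperYear(total_commits=500, max_daily_commits=10),     # kept
    DeveloperYear(total_commits=50, max_daily_commits=10),      # too few commits
    DeveloperYear(total_commits=9_000, max_daily_commits=400),  # likely automated
]
kept = [record for record in records if passes_filters(record)]
print(len(kept))
```

Keeping the thresholds as keyword arguments makes it straightforward to rerun the same filter with the 1,000-commit lower limit.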

Commit Count Data Analysis Results

To see what we learned from this data set, check out What we learned from evaluating 295,000 developer-years of commit history.