Shrewd observers of GitClear’s trajectory could be forgiven for asking: “If this company 'started' in early 2016, why wasn’t it available to purchase until more than two years later?” Yes, well, about that.


This article will review what made it so insanely difficult to get our data processing reliable enough to be leveraged for all the features we launched in 2019. Of the many challenges we faced during these first few years of data processing, the most time-consuming was the challenge of de-duplication.


If “removing duplicates” sounds boring to you, then you’ve just discovered the first reason that so few of our competitors dedicate sufficient resources to this problem. It’s boring as hell to work on, and a "success result" amounts to an unsatisfying “things are less bad.” Adding a flashy feature like Open Repos is what gets developers excited to engage. Removing duplicate committers... not so much.


A Very Brief Primer on Git Commit Data

The second reason that this is a seldom-solved problem lies in the design of git. Technically, to make a commit, one need provide only:




In order of their appearance in the git commit screenshot above:


0) A pointer to the repo's current content (tree 5b85...)

1) Shas of parents (parent b1a70...)

2) The commit's author (name, email, time of authorship)

3) The commit's committer (name, email, time of commit)

4) Checkin note (Added some text)


No branch data. No diff information. No enduring sense of committer identity. No pointers to other commits with known matching content. Anyone who has had the misfortune of dealing with a large data set of semi-duplicated data should already see where this is going. A git commit holds precious little structure to which a de-duplication effort can attach.
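To make concrete just how little there is to work with, here is a minimal sketch that parses the raw text of a commit object (the format `git cat-file -p <sha>` prints) into its handful of fields. The sample commit below is invented for illustration; the parser itself is a simplification that ignores multi-line headers like GPG signatures.

```python
def parse_commit(raw: str) -> dict:
    """Split a raw git commit object into its headers and message."""
    headers, _, message = raw.partition("\n\n")
    commit = {"parents": [], "message": message.strip()}
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)
        else:
            commit[key] = value
    return commit

# Invented sample matching the field list above
raw = (
    "tree 5b85d43f6a21a3c4f9c1e2b7d8a9f0c1d2e3f4a5\n"
    "parent b1a70fd24cb1f539b7f6e8d9c0a1b2c3d4e5f6a7\n"
    "author Bill Harding <bill@example.com> 1577836800 -0800\n"
    "committer Bill Harding <bill@example.com> 1577836800 -0800\n"
    "\n"
    "Added some text"
)

commit = parse_commit(raw)
# Note what's absent from the parsed result: no branch, no diff,
# no durable committer identity -- just five loosely-typed fields.
```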


De-duplicating Committers


There may be fewer distinct individuals than it appears

Getting a consistent interpretation of the committers in a repo is the first battlefront in the war to turn messy git data into reliable git data.


There are probably some who wonder why this is so hard. It's not like people change their name or email address that often, right? Assuming that a developer keeps the same name and email address, there is no ambiguity about the identity of a committer. So far so good.


Where the situation starts to get sticky is when the developer sets up a new workstation. How often does this happen? Well, most developers work from at least two computers: a desktop and a laptop. Each of those workstations probably gets upgraded every 2-4 years. Add in some occasional OS reinstalls, and ballpark math says that the average developer works on about 13 workstations every 10 years. That's 13 separate opportunities for them to create their .gitconfig file with a slightly (or totally) different name, or with an updated email address (how many email addresses have you used in the last 10 years?).
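For the skeptical, the ballpark math above can be spelled out. All inputs here are the article's own assumptions (two machines, upgrades every 2-4 years, a few OS reinstalls per decade), not measured data:

```python
machines = 2                 # a desktop and a laptop
replacement_interval = 3     # upgraded every 2-4 years; ~3 on average
years = 10

# Initial setup of each machine, plus its replacements over the decade
fresh_configs = machines * (1 + years / replacement_interval)  # ~8.7
os_reinstalls = 4            # "occasional" reinstalls, guessed at ~4/decade

workstations_per_decade = fresh_configs + os_reinstalls        # ~12.7
```

Rounding up lands at roughly 13 workstations per decade, i.e. the 1.3-per-developer-per-year figure used later in this article.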


If committers sharing a name make commits from different email addresses, each email address is technically a different committer. If they use the same email address but one workstation knows them as "Bill Harding," another as "William Harding," and a third as "wbh," that's three different committers. Even a change in the casing of the name may or may not produce a different committer, depending on which database is interpreting it.


The only failsafe way to eliminate duplicate committers would be letting a manager dictate which committers get merged. This functionality is offered by all the major git data processors, and it seems fine, until you start running the numbers. If you administrate a team of 50 developers, the math above implies you'll be looking at about 50 * 1.3 = 65 new workstations per year on your team. That's 65 situations every year where developers are going to be asked to enter their name and email anew, and there's a good chance that half of those will be at least slightly different than the last time they were asked.


Our recipe for de-duplicating committers (dear competitors: please stop reading) is to err on the side of over-merging, and make it easy for the user to revert the committer merge if we guess wrong (which has turned out to be mercifully rare). The key to allowing managers to revert a committer merge is keeping copies of each committer identity that gets accumulated over time. By keeping commits associated with the committer identity known at the time, it's trivial to ungroup committers if it's determined that they're not the same person.
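A rough sketch of the data structure this implies (class and method names are invented, not GitClear's actual schema): commits stay attached to the committer identity recorded at commit time, while a separate mapping groups identities together. Undoing a bad merge then means moving one identity out of its group; no commit data is rewritten.

```python
class CommitterGroups:
    """Group committer identities without destroying the originals."""

    def __init__(self):
        self.group_of = {}   # identity -> group id
        self.members = {}    # group id -> set of identities
        self.next_id = 0

    def add(self, identity):
        if identity not in self.group_of:
            gid, self.next_id = self.next_id, self.next_id + 1
            self.group_of[identity] = gid
            self.members[gid] = {identity}

    def merge(self, a, b):
        """Fold b's group into a's group (the over-merge step)."""
        ga, gb = self.group_of[a], self.group_of[b]
        if ga == gb:
            return
        for identity in self.members.pop(gb):
            self.group_of[identity] = ga
            self.members[ga].add(identity)

    def ungroup(self, identity):
        """Revert a mistaken merge: give the identity its own group."""
        self.members[self.group_of[identity]].discard(identity)
        gid, self.next_id = self.next_id, self.next_id + 1
        self.group_of[identity] = gid
        self.members[gid] = {identity}


bill = ("Bill Harding", "bill@example.com")
william = ("William Harding", "bill@example.com")
groups = CommitterGroups()
groups.add(bill)
groups.add(william)
groups.merge(bill, william)   # err on the side of over-merging
merged = groups.group_of[bill] == groups.group_of[william]
groups.ungroup(william)       # trivially reversible if the guess was wrong
```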


There's a significant implementation cost GitClear pays for the flexibility of allowing committers to be easily grouped and ungrouped: Every single query from every single report needs to know that a committer can never be identified by their ID alone. Instead, the usable notion of "a committer" is in fact a family of committers. Enforcing this knowledge across every database query suggests why de-duplicating committers has been maddeningly difficult for competitors in the dev metric space to get right.
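The query-side cost can be sketched with an invented two-table schema (not GitClear's actual database): commits reference a committer identity, and identities map to a group. "A committer's commits" is always a join through the group, never a lookup by a single committer id.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE committers (id INTEGER PRIMARY KEY, name TEXT, group_id INTEGER);
    CREATE TABLE commits (sha TEXT, committer_id INTEGER);
    -- Two identities, one person (group 10); shas are invented
    INSERT INTO committers VALUES (1, 'Bill Harding', 10), (2, 'wbh', 10);
    INSERT INTO commits VALUES ('b1a70', 1), ('5b85d', 2);
""")

# Wrong: WHERE committer_id = 1 would silently drop half the commits.
# Every query must expand the id to its whole family of identities:
rows = db.execute("""
    SELECT c.sha FROM commits c
    JOIN committers m ON m.id = c.committer_id
    WHERE m.group_id = (SELECT group_id FROM committers WHERE id = 1)
""").fetchall()
```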


De-duplicating Commits

Are you feeling bored and exhausted yet? That's understandable. De-duplication is exhausting, but we've only scratched the surface of it so far. The hotbed of git duplication isn't committers; it's commits. Here's a breakdown of what sort of commit activity we saw across almost half a million commits processed in 2019:




Valid and meaningful commits do narrowly edge out the alternative in aggregate, but that still leaves almost 45% of all commit shas containing no unique work. Here is a non-exhaustive list of the paths by which the same content gets committed repeatedly, or ignore-worthy work enters the history:


The list could go on, but the point feels adequately made that git never truly commits to that which bears the name “commit.” Commits are a fluid concept that constantly mutate and frequently re-appear in slightly different circumstances.
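One way to catch re-committed content, in the spirit of git's own `git patch-id`, is to fingerprint the diff itself rather than the commit sha. Stripping position-dependent noise (hunk headers, index lines) lets the same change carry the same fingerprint after a rebase or cherry-pick. This is a simplified sketch, and the sample diffs are invented:

```python
import hashlib

def diff_fingerprint(diff: str) -> str:
    """Hash a diff with its position-dependent lines removed."""
    kept = [
        line for line in diff.splitlines()
        # "index" lines and "@@" hunk headers change across rebases
        if not line.startswith(("index ", "@@"))
    ]
    return hashlib.sha256("\n".join(kept).encode()).hexdigest()

# The same one-line change, landed twice at different positions
original = "index 1a2b3c4..5d6e7f8\n@@ -1,2 +1,3 @@\n+Added some text"
cherry_picked = "index 9f8e7d6..0c1b2a3\n@@ -7,2 +7,3 @@\n+Added some text"
# Different shas, different line numbers, one underlying change
```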


De-duplication: Not fun. Not sexy. Just essential for reliable insight

The work required to ensure reliable, consistent, and distinct committers and commits is very boring. That you've read this far into our article can only mean you've personally dealt with the consequences of this very boring problem undermining your ability to rely on your data.


Git was not built with data visualization in mind. The limited signal git provides is a significant hindrance when extracting the truth about what work was distinct and meaningful. Unfortunately, if one hopes to rely on an engineering intelligence provider to inform real world decisions, tackling this difficult challenge is essential. With relentless attention to detail, it's possible to translate murky inputs into consistent, reliable outputs. GitClear has already devoted a year of our company's life toward building an infrastructure we believe provides best-in-class data consistency and reliability.


