Shrewd observers of GitClear’s trajectory could be forgiven for asking: “If this company 'started' in early 2016, why wasn’t it available to purchase until more than two years later?” Yes, well, about that.

This article will review what made it so insanely difficult to get our data processing reliable enough to be leveraged for all the features we launched in 2019. Of the many challenges we faced during these first few years of data processing, the most time-consuming was the challenge of de-duplication.

If “removing duplicates” sounds boring to you, then you’ve just discovered the first reason that so few of our competitors dedicate sufficient resources to this problem. It’s boring as hell to work on, and a "success result" amounts to an unsatisfying “things are less bad.” Adding a flashy feature like Open Repos is what gets developers excited to engage. Removing duplicate committers... not so much.

A Very Brief Primer on Git Commit Data

The second reason that this is a seldom-solved problem lies in the design of git. Technically, to make a commit, one needs to provide only:

In order of their appearance in the git commit screenshot above:

0) A pointer to the repo's current content (227tree 5b85...)

1) Shas of parents (parent b1a70...)

2) The commit's author (name, email, time of authorship)

3) The commit's committer (name, email, time of commit)

4) Checkin note (Added some text)

No branch data. No diff information. No enduring sense of committer identity. No pointers to other commits with known matching content. Anyone who has had the misfortune of dealing with a large data set of semi-duplicated data should already see where this is going. A git commit holds precious little structure to which a de-duplication effort can attach.
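To make the sparseness concrete, here is a minimal sketch that parses the text form of a commit object, as printed by `git cat-file -p <sha>`, into the five pieces listed above. The `RawCommit` and `parse_commit` names are hypothetical, and the sample shas, name, and email are placeholders rather than values from a real repository.

```python
from dataclasses import dataclass, field


@dataclass
class RawCommit:
    tree: str = ""
    parents: list = field(default_factory=list)
    author: str = ""
    committer: str = ""
    message: str = ""


def parse_commit(raw: str) -> RawCommit:
    """Parse the text of a commit object into its handful of fields."""
    # Headers and message are separated by the first blank line.
    header, _, message = raw.partition("\n\n")
    commit = RawCommit(message=message.strip())
    for line in header.splitlines():
        key, _, value = line.partition(" ")
        if key == "tree":
            commit.tree = value
        elif key == "parent":
            # A merge commit has more than one parent line.
            commit.parents.append(value)
        elif key == "author":
            commit.author = value
        elif key == "committer":
            commit.committer = value
        # Any other header (for example a signature) is ignored here.
    return commit


# Placeholder values for illustration only.
SAMPLE = (
    "tree 1111111111111111111111111111111111111111\n"
    "parent 2222222222222222222222222222222222222222\n"
    "author Bill Harding <bill@example.com> 1546300800 -0800\n"
    "committer Bill Harding <bill@example.com> 1546300800 -0800\n"
    "\n"
    "Added some text\n"
)

commit = parse_commit(SAMPLE)
print(commit)
```

Everything a de-duplication effort can work with directly is in that one small record. Branches, diffs, and identity all have to be inferred by comparing it against other commits.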

De-duplicating Committers

There may be fewer distinct individuals than it appears

Getting a consistent interpretation of the committers in a repo is the first battlefront in the war to turn messy git data into reliable git data.

There are probably some who wonder why this is so hard. It's not like people change their name or email address that often, right? Assuming that a developer keeps the same name and email address, there is no ambiguity about the identity of a committer. So far so good.

Where the situation starts to get sticky is when the developer sets up a new workstation. How often does this happen? Well, most developers work from at least two computers: a desktop and a laptop. Each of those workstations probably gets upgraded every 2-4 years. Add in some occasional OS reinstalls, and ballpark math says that the average developer works on about 13 workstations every 10 years. That's 13 separate opportunities for them to create their .gitconfig file with a slightly (or totally) different name, or with an updated email address (how many email addresses have you used in the last 10 years?).

If committers sharing a name make commits from different email addresses, that's technically a different committer. If they use the same email address but one workstation knows them as "Bill Harding," another knows them as "William Harding," and a third knows "wbh," that's three different committers. Even if the casing of the name changes, that may or may not be a different committer, depending on what database is interpreting it.
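A small sketch of how quickly one person fans out into several "committers". The signatures are made-up examples in the spirit of the ones above, and `exact_identity` and `folded_identity` are hypothetical helpers that stand in for a case-sensitive and a case-insensitive database comparison.

```python
def exact_identity(name, email):
    # Case-sensitive comparison: every spelling is its own committer.
    return (name, email)


def folded_identity(name, email):
    # Case-insensitive comparison: only differences in casing collapse.
    return (name.strip().casefold(), email.strip().casefold())


# One hypothetical person, as recorded by five different .gitconfig files.
signatures = [
    ("Bill Harding", "bill@example.com"),
    ("William Harding", "bill@example.com"),
    ("wbh", "bill@example.com"),
    ("bill harding", "bill@example.com"),
    ("Bill Harding", "bill@old-employer.example"),
]

exact = {exact_identity(name, email) for name, email in signatures}
folded = {folded_identity(name, email) for name, email in signatures}

print(len(exact), len(folded))  # prints "5 4"
```

Folding case removes only one of the five duplicates. The rest need a judgment about whether two different names or two different emails belong to the same person.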

The only failsafe way to eliminate duplicate committers would be letting a manager dictate which committers get merged. This functionality is offered by all the major git data processors, and it seems fine, until you start running the numbers. If you administrate a team of 50 developers, the math above (13 workstations per 10 years, or about 1.3 per developer per year) implies you'll be looking at about 50 * 1.3 = 65 new workstations per year on your team. That’s 65 situations every year where developers are going to be asked to enter their name and email anew, and there’s a good chance that half of those will be at least slightly different than the last time they were asked.

Our recipe for de-duplicating committers (dear competitors: please stop reading) is to err on the side of over-merging, and make it easy for the user to revert the committer merge if we guess wrong (which has turned out to be mercifully rare). The key to allowing managers to revert a committer merge is keeping copies of each committer identity that gets accumulated over time. By keeping commits associated with the committer identity known at the time, it's trivial to ungroup committers if it's determined that they're not the same person.

There's a significant implementation cost GitClear pays for the flexibility of allowing committers to be easily grouped and ungrouped: Every single query from every single report needs to know that a committer can never be identified by their ID alone. Instead, the usable notion of "a committer" is in fact a family of committers. Enforcing this knowledge across every database query suggests why de-duplicating committers has been maddeningly difficult for competitors in the dev metric space to get right.
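The "family of committers" idea can be sketched roughly as follows. This is an illustration under assumptions, not GitClear's actual schema: `CommitterFamilies`, `commits_for`, and the identity labels are all hypothetical. The point it shows is that commits stay attached to the identity recorded at the time, so a merge only changes family membership and can be undone without touching any commit.

```python
import itertools


class CommitterFamilies:
    """Maps each stored committer identity to a family that can be regrouped."""

    def __init__(self):
        self._family_of = {}  # identity id -> family id
        self._next_family = itertools.count(1)

    def add_identity(self, identity_id):
        # Every new identity starts out in a family of its own.
        if identity_id not in self._family_of:
            self._family_of[identity_id] = next(self._next_family)

    def merge(self, identity_id, into_identity_id):
        # Move the whole family of identity_id into the other family.
        source = self._family_of[identity_id]
        target = self._family_of[into_identity_id]
        for member, family in list(self._family_of.items()):
            if family == source:
                self._family_of[member] = target

    def split(self, identity_id):
        # Revert a wrong guess: the identity gets a fresh family of its own.
        self._family_of[identity_id] = next(self._next_family)

    def members(self, identity_id):
        family = self._family_of[identity_id]
        return sorted(m for m, f in self._family_of.items() if f == family)


def commits_for(families, commits, identity_id):
    # A query must expand one identity into its whole family first.
    family = set(families.members(identity_id))
    return [sha for sha, author_identity in commits if author_identity in family]


families = CommitterFamilies()
for identity in ("bill@laptop", "william@desktop", "someone-else"):
    families.add_identity(identity)

# Each commit keeps the identity that was known when it was processed.
commits = [("c1", "bill@laptop"), ("c2", "william@desktop"), ("c3", "someone-else")]

families.merge("william@desktop", into_identity_id="bill@laptop")
merged_view = commits_for(families, commits, "bill@laptop")

families.split("william@desktop")
reverted_view = commits_for(families, commits, "bill@laptop")

print(merged_view, reverted_view)  # prints "['c1', 'c2'] ['c1']"
```

The cost described above shows up in `commits_for`: any report that filters on a bare identity instead of expanding the family first will silently undercount.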

De-duplicating Commits

Are you feeling bored and exhausted yet? That's understandable. De-duplication is exhausting, but we've only scratched the surface of it so far. The hotbed of git duplication isn't committers--it's commits. Here's a breakdown of what sort of commit activity we saw across almost half a million commits processed in 2019:

Valid and meaningful commits do narrowly edge out the alternative in aggregate, but that still leaves almost 45% of all commit shas containing no unique work. Here is a non-exhaustive list of different paths by which the same content is committed repeatedly, or ignore-worthy work is committed:

  • Rebasing. If the developer rebases, every rebased commit is technically a new commit, with a new sha and a new committed_at timestamp. Unless the data model can recognize commits that are pushed repeatedly with new commit times, teams that rebase will appear to have twice as much commit activity.

  • Squash strategy. If the team adopts best practices as espoused by esteemed developers like Andrew Clark, they might choose to make a series of N commits (which are just save points, after all), and when the work is ready to move to master, it is squashed into one commit. From the standpoint of repo history, this makes a lot of sense. But if a data processor hopes to show day-to-day work before it reaches master, something’s gotta give. Either you choose to ignore the in-progress commits, or you ignore the squash commit, or you double count work that was done.

  • Merge commits. There is no inherent notion of a “diff” or “what changed in this commit” to git. The diff you see presented by GitClear et al. is simply the data processor interpreting what changed between content that was pushed in commit A and commit A'. Thus, while a merge commit will often contain no difference between the lead commit and its parent on the feature branch, it will contain an entire PR’s worth of difference between itself and its parent on the default/master branch. Complicating matters further, a merge commit might itself contain novel changes, though it ought not. Should merge commits count toward the commit count when they include no novel content?

  • Force pushes. The most common way of developing a feature in git is to work on a separate branch that can be re-written in whole at any time. "Re-writing the branch" (aka "force pushing commits") is necessary if the developer has rebased the branch's commits onto master since they were last pushed (very common). Every time the branch is force pushed, a new set of commit shas, with new commit times, will enter into the repo. Every force push is an opportunity for a naive commit counting algorithm to drive up its interpretation of how many distinct commits have been made.

  • Patch commits. Because git is git, things would just be too easy if the concept known as "commit" required the developer to, you know, actually commit to it in an unchangeable way. Rather, a commit is never a commitment, thanks to git's faculty to amend a previously authored commit, essentially re-writing previous work with a new commit time. If the commit wasn't pushed before being amended, it will be interpreted as one sha. But if the user amends a previously pushed commit, the revised commit content will split into two shas with different commit times. The committer and commit message may or may not match, depending on the workstation used. Should these be interpreted as duplicate commits? (Our opinion: yes, the work is considered one commit, but commits aren’t a metric that really matters, so the end result is negligible.)

  • Subrepos. God help you if you have a developer who needs to work in a library repo that propagates to other repos. The best process our team has discovered to make this possible is git-subrepo, an alternative to git submodule. For better and worse, we have been dogfooding this arrangement over the last three years via the release of our to-do list and note taking app, Amplenote. Work on the React editor generally happens in its dedicated repo, but sometimes takes place in a wrapper app that treats the main repo as a secondary destination to send commits. If you have developers that work on libraries used by multiple projects, there’s a high likelihood that their commits will be considered new and important by however many projects use the code.

The list could go on, but the point feels adequately made that git never truly commits to that which bears the name “commit.” Commits are a fluid concept that constantly mutate and frequently re-appear in slightly different circumstances.
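One way to make the rebase and force push cases tractable is to identify a commit by who authored it and what it changed, rather than by its sha. The sketch below is a minimal illustration and not a description of GitClear's implementation. The `Commit` fields, `fingerprint`, and `distinct_commits` are hypothetical names. It assumes the author timestamp survives a rebase (git's default behavior) while the sha and commit time change. It would not catch amended commits whose content changed, squash commits, or subrepo copies.

```python
import hashlib
from typing import NamedTuple


class Commit(NamedTuple):
    sha: str
    author_email: str
    authored_at: int   # author timestamp, assumed to survive a rebase
    committed_at: int  # commit timestamp, rewritten by a rebase
    diff: str          # unified diff against the first parent


def fingerprint(commit):
    # Keep only added and removed lines. Hunk headers and context lines
    # shift when a commit is replayed onto a newer base, so they are dropped,
    # along with the "+++" and "---" file header lines.
    changed = [
        line.rstrip()
        for line in commit.diff.splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]
    payload = "\n".join(
        [commit.author_email.casefold(), str(commit.authored_at), *changed]
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def distinct_commits(commits):
    # For each fingerprint, keep the copy with the earliest commit time.
    earliest = {}
    for commit in sorted(commits, key=lambda c: c.committed_at):
        earliest.setdefault(fingerprint(commit), commit)
    return list(earliest.values())


original = Commit(
    "aaa111", "bill@example.com", 1000, 1000,
    "--- a/notes.txt\n+++ b/notes.txt\n@@ -1,2 +1,3 @@\n intro\n+Added some text\n",
)
# Same work after a rebase: new sha, new commit time, shifted hunk header.
rebased = Commit(
    "bbb222", "bill@example.com", 1000, 5000,
    "--- a/notes.txt\n+++ b/notes.txt\n@@ -10,2 +10,3 @@\n intro\n+Added some text\n",
)
other = Commit(
    "ccc333", "bill@example.com", 2000, 2000,
    "--- a/notes.txt\n+++ b/notes.txt\n@@ -3,1 +3,1 @@\n-old line\n+new line\n",
)

unique = distinct_commits([rebased, other, original])
print([c.sha for c in unique])  # prints "['aaa111', 'ccc333']"
```

A naive counter would report three commits here. Git itself ships a related tool, `git patch-id`, which computes a stable id for a patch while ignoring line numbers and whitespace.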

De-duplication: Not fun. Not sexy. Just essential for reliable insight

The work required to ensure reliable, consistent, and distinct committers/commits is very boring. The fact that you’ve read this far into our article can only be a testament to the fact that you’ve personally dealt with the consequences of this very boring problem hindering your ability to rely upon your data.

Git was not built with data visualization in mind. The limited signal git provides is a significant hindrance when extracting the truth about what work was distinct and meaningful. Unfortunately, if one hopes to rely on an engineering intelligence provider to inform real world decisions, tackling this difficult challenge is essential. With relentless attention to detail, it's possible to translate murky inputs into consistent, reliable outputs. GitClear has already devoted a year of our company's life toward building an infrastructure we believe provides best-in-class data consistency and reliability.
