What makes GitClear's commit interpretation engine hard to copy

This note provides a partial list of the challenges that have vexed us while building GitClear over the past four years. Hopefully they will challenge our competitors as well, if any other company is foolish enough to try to crack the nut of quantifying code activity.


Commit interpretation

Interpreting what unique work occurred in a commit seems like a trivial problem at first approximation: it's the diff, dummy! Um, no. That's like removing the backgrounds from pictures by flood filling all the colors adjacent to the edge. It works for up to half of the pictures you process; the other half... not so much. Let's get more specific about the challenges of interpreting commits.


Does this branch matter?

When should programming work "count"? Intuitively, one's first instinct might be "when it gets pushed." But that answer already runs counter to the prior art of Github et al., which show in the commits list only the work that has been merged to master. Work in branches is sequestered behind tabs. Why doesn't the main list of activity include all commits as soon as they're pushed? Why, in fact, does Github not even provide an option to view commit activity across all branches?


I don't work at Github, so I can't definitively state why this option doesn't exist. But I can make an educated guess: their strategy hugely simplifies the work of figuring out which commits to show. Outside the master branch (which isn't the only name for git's default branch, because that would be too easy), a repo quickly becomes the wild west. 🤠 Commits often happen in a work-in-progress branch that gets discarded. But branches aren't always discarded, and when they are discarded, they may be discarded only in part. Often, five commits mutate into one. Other times, parts of the same commit are written to multiple repos (e.g., with subtrees). In all these cases, one needs to identify when unique work happened well enough to decide when it should count toward the ledger. For Github to implement the "show commits as soon as they get pushed" strategy, they'd be forced to create an algorithm that could sift through these (and several other) scenarios to identify which commits are meaningful and which are noise.


Of course, there is no rule that a commit exists on only a single branch. More often, commits exist simultaneously on many branches: some ignored, some discarded, and some that actually matter.
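One way to recognize that a rebased, cherry-picked, or squashed commit represents the same underlying work is to hash a normalized form of its diff, similar in spirit to what `git patch-id` does. This is only an illustrative sketch of that idea, not GitClear's actual algorithm:

```python
import hashlib

def normalized_patch_id(diff_text: str) -> str:
    """Hash only a diff's added/removed lines, ignoring hunk headers and
    surrounding whitespace, so a rebased or cherry-picked copy of the same
    change produces the same id (similar in spirit to `git patch-id`)."""
    meaningful = [
        line.strip()
        for line in diff_text.splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]
    return hashlib.sha1("\n".join(meaningful).encode()).hexdigest()

# The same logical change, seen on two branches with different hunk offsets:
diff_a = "@@ -1,3 +1,4 @@\n context\n+added line\n-removed line"
diff_b = "@@ -10,3 +10,4 @@\n context\n+added line\n-removed line"
assert normalized_patch_id(diff_a) == normalized_patch_id(diff_b)
```

Matching ids across branches let you count the work once, no matter how many branch heads carry a copy of it.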


Who committed this?

At face value, it's a trivial question. "Every commit has an author, just use that." <-- Idiot past self


No. Every commit has a committer and an author. Every commit can be committed on multiple branches, and by multiple committers. Often, a developer works from more than one system -- if not daily, then certainly over the longer term of the repo's history. To attribute commit value, we need a way to roll up many committer identities into the human conception of what a "committer" represents (i.e., a single human).


Also, we need to ignore the bot committers, and we need to ensure that cached stats tie back to every one of the individual committer identities that accumulated over the years.
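A minimal sketch of that rollup, assuming a crude heuristic (join on the email's local part, filter names that look like bots); a real system would layer manual overrides and better matching on top:

```python
from collections import defaultdict

def unify_committers(identities):
    """Group (name, email) committer identities into people, using the
    email's local part as a crude join key and filtering obvious bots.
    Both the join key and the bot markers are illustrative guesses."""
    BOT_MARKERS = ("[bot]", "dependabot", "ci@")
    people = defaultdict(set)
    for name, email in identities:
        if any(m in name.lower() or m in email.lower() for m in BOT_MARKERS):
            continue  # bot committers shouldn't accrue stats
        key = email.split("@")[0].lower()  # "alice@work.com" and "alice@home.net" merge
        people[key].add((name, email))
    return dict(people)

identities = [
    ("Alice Smith", "alice@work.com"),
    ("alice", "alice@home.net"),
    ("dependabot[bot]", "49699333+dependabot[bot]@users.noreply.github.com"),
]
groups = unify_committers(identities)
assert len(groups["alice"]) == 2              # two identities, one human
assert not any("dependabot" in k for k in groups)
```

Note how quickly this breaks down in practice: two different people named "alice" at different companies would be wrongly merged, which is why heuristics alone aren't enough.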


What's a file?

Another question that seems like it should be easy is "what files and directories are being contributed to?". Except that files come in and out of existence: they get renamed, deleted, or ambiguously renamed (e.g., when 90% of the lines in a newly added file match lines that had existed in a recently deleted file). Keeping a stable sense of a file's lineage is essential to tracking how code lines evolve over time.
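The "ambiguous rename" case above can be scored with a line-overlap ratio, roughly how `git diff -M` decides similarity. A toy version, purely for illustration:

```python
def rename_confidence(deleted_lines, added_lines):
    """Fraction of a newly added file's lines that also appeared in a
    recently deleted file -- a crude stand-in for rename detection.
    The 90% case from the text would score 0.9 here."""
    deleted = set(deleted_lines)
    if not added_lines:
        return 0.0
    matches = sum(1 for line in added_lines if line in deleted)
    return matches / len(added_lines)

old = ["def f():", "    return 1", "# helper"]
new = ["def f():", "    return 1", "# helper", "def g(): pass"]
score = rename_confidence(old, new)
assert 0.74 < score < 0.76  # 3 of 4 lines match: likely a rename, not a brand-new file
```

Above some threshold you treat the pair as one file with continuous lineage; below it, as a delete plus an add. Picking that threshold, and handling near-ties, is where the ambiguity lives.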


What is "real work"?

This is the first question that sounds kind of hard at first blush. And, not to disappoint, it is a lot harder than it sounds. 😁 Assuming you have successfully de-duplicated the committer, the commit, and the file, your next task is to evaluate which files contain unique work and which are... something else.


What might "something else" be?

Autogenerated/compiled files

Test fixtures

Third-party libraries

Modified third-party libraries

Template code / template code that is slightly changed

Copy/pasted code

Moved code

Language keywords

To name a few.


How long did it take?

This one also sounds like, and is, excruciatingly difficult. "How long a commit took to author" requires perfectly assessing all of the areas touched on above: what's a commit, who committed it, is it real work, what files were affected, and does the branch even matter? On top of that, it requires estimating when a programmer was programming. Passionate developers are notorious for keeping oddball hours (guilty as charged).


The simple beautiful case is that the developer made commits within the same day. If they are an individual contributor (not a manager, which gets more complicated) and they make their first commit at 9am followed by a second at 12pm, that seems great. In fact, this gets you about 50% of the way toward being able to effectively estimate time per commit.


As one might imagine, the rest of the commit scenarios become exponentially more difficult to estimate. The starting point is trying to estimate how much time was used to author the first commit in a day. How would you try to estimate that? What about commits that fall outside typical working hours? How would you even know what "typical working hours" are?
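The "commits within the same day" case can be sketched as session-based estimation: charge each commit the gap since the previous commit, unless the gap is so large that it must open a new session, in which case fall back to a flat default. The gap threshold and the default are illustrative guesses, and estimating that first-commit default well is exactly the hard problem described above:

```python
from datetime import datetime, timedelta

def estimate_commit_hours(timestamps, session_gap_hours=4.0,
                          first_commit_default=1.0):
    """Estimate hours per commit from commit timestamps. Commits within
    `session_gap_hours` of the previous one are charged the actual gap;
    a session's first commit gets a flat default, since the time spent
    before it is unobservable. Both constants are illustrative guesses."""
    hours = []
    prev = None
    for ts in sorted(timestamps):
        if prev is None or (ts - prev) > timedelta(hours=session_gap_hours):
            hours.append(first_commit_default)  # new session: no prior signal
        else:
            hours.append((ts - prev).total_seconds() / 3600)
        prev = ts
    return hours

day = [datetime(2023, 5, 1, 9), datetime(2023, 5, 1, 12),
       datetime(2023, 5, 1, 12, 30), datetime(2023, 5, 2, 10)]
est = estimate_commit_hours(day)
assert est == [1.0, 3.0, 0.5, 1.0]  # the 9am-to-12pm gap is observable; 9am itself is a guess
```

Notice that the easy signal (the 12pm and 12:30pm commits) falls right out, while the first commit of each day is pure guesswork, which is why this approach only gets you partway there.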


Conclusion

The further one travels down the git rabbit hole, the more one realizes there is a constellation of ill-defined, interrelated concepts to untangle. Get even one piece of the puzzle wrong, and the accuracy of the entire system is jeopardized. Every time the system fails, customer trust is reduced. The value of GitClear is predicated on being consistently accurate, such that business-critical decisions can be made using the data presented. That's why we didn't even bother releasing GitClear to the public until more than two years after we'd begun dogfooding it internally.