Along with calculating Diff Delta, the most multifaceted calculation undertaken by GitClear is the derivation of how much time was taken for each commit authored. In this article, we'll describe the basic mechanism by which GitClear estimates how much time was used per commit, and some of the curveballs that we handle to maximize the quality of our estimates.
Before we dive into the weeds of how minutes are estimated, please allow us a moment to dispatch the quixotic hope that GitClear can measure time per commit with anything approaching 100% accuracy. There is so little information from which to form an estimate that GitClear is the only developer analytics company to even attempt it. We do so in the spirit of accepting imperfection in exchange for opportunity. The payoff for making a data-backed guess at how long each commit took is that developers can home in on which directories harbor the most tech debt. So we're willing to accept moderate imprecision on short time scales. Assume that commit time estimates could be off by 20% or more in either direction, unless a team member cares enough to set more accurate values per commit.
GitClear employs heuristics to determine which among 10 different time estimation techniques is most likely to accurately describe the time that a developer consumed to finish the work in a commit.
The most basic form of minutes estimation is "how much time passed since the last commit was authored?" For example, if an author makes a commit at 3pm and then a second commit at 4pm, we could estimate that the commit at 4pm took 60 minutes.
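The 3pm/4pm example can be written out as a small sketch. The function name and the dates are hypothetical, chosen only for illustration; this is not GitClear's implementation.

```python
from datetime import datetime


def minutes_since_previous_commit(previous: datetime, current: datetime) -> float:
    """Elapsed minutes between two commits by the same author."""
    if current < previous:
        raise ValueError("current commit must not precede the previous commit")
    return (current - previous).total_seconds() / 60


# Hypothetical commit timestamps matching the example in the text.
first_commit = datetime(2024, 1, 15, 15, 0)   # 3pm
second_commit = datetime(2024, 1, 15, 16, 0)  # 4pm
print(minutes_since_previous_commit(first_commit, second_commit))  # 60.0
```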
A different flavor of "time since last commit" applies when work spans multiple repos. Depending on project structure, some developers will concurrently author work in multiple git repos. For example, a developer might make a commit in Repo A at 2pm, a commit in Repo B at 3pm, and then make commits at 4pm to both Repo A and Repo B. In this case, GitClear will generally estimate each 4pm commit relative to the last commit that was made in the same repo. We will also reduce the time estimates for those commits so that the committer is not credited for 3 hours of work (2 hours in Repo A, 1 hour in Repo B) during the 2-hour block of wall-clock time (2-4pm) that actually passed.
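The article does not say how the per-repo estimates are reduced, so the sketch below assumes one plausible rule: scale each repo's raw gap proportionally so the total equals the wall-clock time since the earliest of the previous commits. The function name and the scaling rule are assumptions, not GitClear's documented behavior.

```python
from datetime import datetime


def scaled_repo_minutes(last_commit_by_repo: dict[str, datetime], now: datetime) -> dict[str, float]:
    """Per-repo minutes since each repo's last commit, scaled down so the
    total never exceeds the wall-clock time that actually passed."""
    raw = {repo: (now - ts).total_seconds() / 60 for repo, ts in last_commit_by_repo.items()}
    total = sum(raw.values())
    wall_clock = max(raw.values(), default=0.0)  # time since the earliest previous commit
    if total == 0 or total <= wall_clock:
        return raw  # nothing is double counted
    # Assumed rule: shrink every repo's share by the same proportion.
    return {repo: minutes * wall_clock / total for repo, minutes in raw.items()}


previous_commits = {
    "Repo A": datetime(2024, 1, 15, 14, 0),  # 2pm
    "Repo B": datetime(2024, 1, 15, 15, 0),  # 3pm
}
estimates = scaled_repo_minutes(previous_commits, now=datetime(2024, 1, 15, 16, 0))
print(estimates)  # {'Repo A': 80.0, 'Repo B': 40.0}
```

Under this assumed rule, the two 4pm commits are credited 80 and 40 minutes, which sums to the 120 minutes that elapsed between 2pm and 4pm.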
To estimate the time that was consumed for the first commit of the day, it's necessary to approximate the working hours of a developer. GitClear performs this calculation by first aggregating the hour of the day at which each commit from the past year was made, and then calculating the median commit hour. For example, if commits were made at 8am, 12pm and 4pm, then we would say the committer's median commit time is 12pm. The next step is to calculate the minimum time range that contains 90% of the committer's past commit activity. For example, suppose the committer made commits at the following times:
8am
9am
10am
11am
12pm
1pm (median commit)
2pm
3pm
4pm
5pm
In this case, the median commit would be said to take place at 1pm, and 90% of all commit activity occurs between 9am and 5pm. Armed with this time range, if one commit occurs at 3pm and the next commit occurs at 10am the following day, then this minutes estimation method would approximate that (5pm-3pm) + (10am-9am) = 3 hours were used to author the commit at 10am.
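The working-hours calculation can be sketched as a sliding window over the sorted commit hours. The function names, the whole-hour granularity, and the tie-breaking rule (prefer the window centered nearest the median commit hour) are assumptions made for this sketch; the article does not specify how ties between equally narrow windows are resolved.

```python
import math
from statistics import median_high


def working_window(commit_hours: list[int], coverage: float = 0.9) -> tuple[int, int]:
    """Narrowest (start_hour, end_hour) range holding `coverage` of the commits."""
    if not commit_hours:
        raise ValueError("need at least one commit hour")
    hours = sorted(commit_hours)
    # Number of commits the window must contain (epsilon guards float rounding).
    needed = max(1, math.ceil(coverage * len(hours) - 1e-9))
    median = median_high(hours)
    candidates = [
        (hours[i], hours[i + needed - 1])
        for i in range(len(hours) - needed + 1)
    ]
    # Assumed tie-break: among equally narrow windows, pick the one whose
    # midpoint is closest to the median commit hour.
    return min(candidates, key=lambda w: (w[1] - w[0], abs((w[0] + w[1]) / 2 - median)))


def overnight_hours(prev_commit_hour: int, next_commit_hour: int, window: tuple[int, int]) -> int:
    """Working hours between a commit one day and the first commit the next day."""
    start, end = window
    evening = max(0, end - prev_commit_hour)    # rest of the previous working day
    morning = max(0, next_commit_hour - start)  # start of the next working day
    return evening + morning


hours = [8, 9, 10, 11, 12, 13, 14, 15, 16, 17]  # 8am through 5pm, 24-hour clock
window = working_window(hours)
print(window)  # (9, 17), i.e. 9am-5pm
print(overnight_hours(15, 10, window))
```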
When it's not possible to judge a commit's minutes based on the time of the previous commit, another method of estimation is to extrapolate the time used from the Diff Delta of the commit. Because research indicates that Diff Delta correlates with software effort across programming languages, a large enough sample of past work makes it possible to work backward from the Diff Delta of a commit to the number of minutes it consumed. For example, if over the past year we see that a developer has made 1,000 commits to Repo A at an average velocity of 50 Diff Delta per hour, then a commit that generates 100 Diff Delta would, by this method, be estimated to have consumed 2 hours.
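The arithmetic behind this method is a single division. A minimal sketch, with a hypothetical function name:

```python
def minutes_from_diff_delta(diff_delta: float, velocity_per_hour: float) -> float:
    """Estimate minutes for a commit from its Diff Delta and the
    developer's historical velocity (Diff Delta per hour)."""
    if velocity_per_hour <= 0:
        raise ValueError("velocity must be positive")
    return diff_delta / velocity_per_hour * 60


# 100 Diff Delta at 50 Diff Delta per hour is 2 hours.
print(minutes_from_diff_delta(100, 50))  # 120.0
```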
When there is insufficient data to estimate velocity from a developer's history, an alternate method of estimation is to calculate the median Diff Delta per hour of contributors who made a commit to the repo within the past year. Statistically, the median velocity is the most likely estimate to encounter among those who make commits to a particular code base, so when more accurate estimation methods are unavailable, this can serve as a functional fallback.
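A sketch of this fallback, assuming each contributor's velocity over the past year has already been computed. The contributor names, the velocity figures, and the function name are invented for illustration.

```python
from statistics import median


def fallback_velocity(velocity_by_contributor: dict[str, float]) -> float:
    """Median Diff Delta per hour across a repo's recent contributors."""
    if not velocity_by_contributor:
        raise ValueError("no contributor velocities available")
    return median(velocity_by_contributor.values())


# Hypothetical per-contributor velocities (Diff Delta per hour).
repo_velocities = {"ana": 30.0, "ben": 50.0, "chi": 90.0}
velocity = fallback_velocity(repo_velocities)
print(velocity)             # 50.0
print(100 / velocity * 60)  # 120.0 minutes for a commit of 100 Diff Delta
```

Using the median rather than the mean keeps one unusually fast or slow contributor from skewing the fallback.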
In addition to the multiple estimation methods that GitClear analyzes when choosing a duration estimate for the commit, we have evolved a set of safeguards that work together to protect against miscalculations.
Our popular blog post, "Removing duplicates has taken over a year of developer time. Still worth it." explains the blueprint for how GitClear ensures that committers can be uniquely identified across workstations (without admin intervention). In addition to clarifying reports, deduplication turns out to be essential to time estimation. Unless you know that a committer's various git identities are tied together, any "past commit"-based estimation mechanism can yield arbitrary results.
After "Diff Delta" and "duration" have been calculated for a commit, GitClear runs a secondary analysis of the implied commit velocity to evaluate whether it falls within reasonable norms. If it doesn't, then our commit processing engine will remove the duration estimate, leaving it blank and excluding it from tech debt calculations.
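This check can be sketched as a velocity bounds test. The thresholds below are placeholders; the article does not state what GitClear treats as "reasonable norms", and the function name is hypothetical.

```python
from typing import Optional


def validated_minutes(
    diff_delta: float,
    minutes: float,
    min_velocity: float = 1.0,    # assumed lower bound, Diff Delta per hour
    max_velocity: float = 500.0,  # assumed upper bound, Diff Delta per hour
) -> Optional[float]:
    """Return the duration if its implied velocity looks plausible, else None."""
    if minutes <= 0:
        return None
    implied_velocity = diff_delta / (minutes / 60)
    if min_velocity <= implied_velocity <= max_velocity:
        return minutes
    return None  # duration left blank and excluded from downstream calculations


print(validated_minutes(100, 120))  # 120 (50 Diff Delta per hour is within bounds)
print(validated_minutes(5000, 10))  # None (30,000 Diff Delta per hour is not)
```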
Another requirement for accurate estimation is to ensure that commits to git subrepos are accounted for. GitClear makes this possible by recognizing the signature of a git subrepo commit, and attributing a duration estimate only to the source repo that supplies the subrepo to its consumer repos.