Along with calculating Line Impact, the most multifaceted calculation undertaken by GitClear is estimating how much time each commit took to author. In this article, we'll describe the basic mechanism by which GitClear estimates the time used per commit, and some of the curveballs that we handle to maximize the quality of our estimates.


📏 Methods of estimation

GitClear employs heuristics to determine which among 10 different time estimation techniques is most likely to accurately describe the time that a developer consumed to finish the work in a commit.


Time since last commit

The most basic form of minutes estimation is "how much time passed since the last commit was authored?" For example, if an author makes a commit at 3pm and then a second commit at 4pm, we could estimate that the commit at 4pm took 60 minutes.
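As a minimal sketch of this baseline (the function name `minutes_since_last_commit` is hypothetical, not part of GitClear's API), the estimate is simply the gap between two commit timestamps by the same author:

```python
from datetime import datetime


def minutes_since_last_commit(previous: datetime, current: datetime) -> float:
    """Minutes elapsed between two consecutive commits by the same author."""
    if current < previous:
        raise ValueError("current commit must not precede the previous commit")
    return (current - previous).total_seconds() / 60


first = datetime(2024, 1, 15, 15, 0)   # commit at 3pm
second = datetime(2024, 1, 15, 16, 0)  # commit at 4pm
print(minutes_since_last_commit(first, second))  # 60.0
```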


Time since last commit in repo

A different flavor of "time since last commit." Depending on project structure, some developers will concurrently author work in multiple git repos. For example, a developer might make a commit in Repo A at 2pm, a commit in Repo B at 3pm, and then make commits at 4pm to both Repo A and Repo B. In this case, GitClear will generally measure each 4pm commit relative to the last commit that was made in the same repo. We will also reduce the time estimates for these commits so that the committer is not credited for 3 hours of work (2 hours in Repo A, 1 hour in Repo B) during the 2 hours of wall-clock time (2-4pm) that actually passed.
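The sketch below illustrates one way such a reduction could work. It assumes the per-repo estimates are scaled down proportionally so that they sum to the wall-clock time; the proportional rule and the name `scaled_repo_estimates` are assumptions for illustration, since the exact reduction GitClear applies is not specified here.

```python
from datetime import datetime


def scaled_repo_estimates(last_commit_by_repo, now):
    """Per-repo minutes since the last commit in that repo, scaled so the
    total does not exceed the wall-clock time since the earliest commit.

    Proportional scaling is an assumption made for this sketch.
    """
    if not last_commit_by_repo:
        return {}
    raw = {
        repo: (now - ts).total_seconds() / 60
        for repo, ts in last_commit_by_repo.items()
    }
    wall_clock = max(raw.values())  # longest gap = real time that passed
    total = sum(raw.values())
    factor = wall_clock / total if total > wall_clock else 1.0
    return {repo: minutes * factor for repo, minutes in raw.items()}


last = {
    "repo_a": datetime(2024, 1, 15, 14, 0),  # 2pm
    "repo_b": datetime(2024, 1, 15, 15, 0),  # 3pm
}
estimates = scaled_repo_estimates(last, datetime(2024, 1, 15, 16, 0))
print({repo: round(m) for repo, m in estimates.items()})
# {'repo_a': 80, 'repo_b': 40}
```

Under this assumption, the raw 120 and 60 minutes shrink to 80 and 40, which together match the 2 hours that passed.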


Time since last commit special case: first commit of the day

To estimate the time that was consumed for the first commit of the day, it's necessary to approximate the working hours of a developer. GitClear performs this calculation by first aggregating the hour of the day at which each commit from the past year was made, and then calculating the median commit hour. For example, if commits were made at 8am, 12pm and 4pm, then we would say the committer's median commit time is 12pm. The next step is to calculate the minimum time range that contains 90% of the committer's past commit activity. For example, suppose the committer made commits at the following times:

8am

9am

10am

11am

12pm

1pm (median commit)

2pm

3pm

4pm

5pm

In this case, the median commit would be said to take place at 1pm, and 90% of all commit activity occurs between 9am-5pm. Armed with this time range, if one commit occurs at 3pm and the next commit occurs at 10am the following day, then this minutes estimation method would approximate that (5pm-3pm) + (10am-9am) = 3 hours were used to author the commit at 10am.
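The two steps can be sketched as follows. This is an illustration under stated assumptions, not GitClear's implementation: the helper names `working_window` and `first_commit_minutes` are hypothetical, commit times are reduced to whole hours, and when two candidate ranges are equally narrow the later one is chosen so that the ten commits listed above yield a 9am-5pm window.

```python
from datetime import datetime, time, timedelta


def working_window(commit_hours, coverage_pct=90):
    """Narrowest (start_hour, end_hour) range containing coverage_pct percent
    of the given commit hours (integers 0-23)."""
    hours = sorted(commit_hours)
    if not hours:
        raise ValueError("need at least one commit hour")
    size = (len(hours) * coverage_pct + 99) // 100  # integer ceiling
    best = None
    for i in range(len(hours) - size + 1):
        start, end = hours[i], hours[i + size - 1]
        # "<=" breaks ties toward the later window (an assumption).
        if best is None or end - start <= best[1] - best[0]:
            best = (start, end)
    return best


def first_commit_minutes(previous, current, window):
    """Minutes credited to the first commit of a day: the remainder of the
    previous working day plus the elapsed part of the current one."""
    start_hour, end_hour = window
    prev_day_end = datetime.combine(previous.date(), time(end_hour))
    cur_day_start = datetime.combine(current.date(), time(start_hour))
    evening = max(prev_day_end - previous, timedelta(0))
    morning = max(current - cur_day_start, timedelta(0))
    return (evening + morning).total_seconds() / 60


window = working_window(range(8, 18))  # hourly commits from 8am through 5pm
minutes = first_commit_minutes(
    datetime(2024, 1, 15, 15, 0),  # previous commit: 3pm
    datetime(2024, 1, 16, 10, 0),  # first commit next day: 10am
    window,
)
print(window, minutes)
```

Clamping each portion at zero keeps a commit made outside the working window (for example at 7pm) from producing a negative contribution.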


Time estimate based on past velocity

When it's not possible to judge a commit's minutes based on the time of the previous commit, another method of estimation is to extrapolate the time used based on the Line Impact of the commit. Since research has found that Line Impact correlates proportionally with software effort across programming languages, with a large enough sample set of past work, it's possible to work backward from the Line Impact of a commit to the number of minutes that were consumed. For example, if over the past year we see that a developer has made 1,000 commits to Repo A at an average velocity of 50 Line Impact per hour, then a commit that generates 100 Line Impact would, by this method, be estimated to have consumed 2 hours.
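The arithmetic in that example amounts to dividing a commit's Line Impact by the developer's historical velocity. A minimal sketch, with a hypothetical function name:

```python
def minutes_from_velocity(line_impact: float, impact_per_hour: float) -> float:
    """Estimate minutes for a commit from its Line Impact and the author's
    historical average velocity (Line Impact per hour)."""
    if impact_per_hour <= 0:
        raise ValueError("velocity must be positive")
    return line_impact / impact_per_hour * 60


print(minutes_from_velocity(100, 50))  # 120.0 minutes, i.e. 2 hours
```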


Time estimate based on repo norms

When there is insufficient data to estimate velocity from a developer's history, an alternate method of estimation is to calculate the median Line Impact per hour of contributors who made a commit to the repo within the past year. Statistically, the median velocity is the most likely estimate to encounter among those who make commits to a particular code base, so when more accurate estimation methods are unavailable, this can serve as a functional fallback.
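A sketch of this fallback, assuming the developer's own velocity is passed as `None` when their history is too thin to trust. The function name and the way "insufficient data" is signalled are illustrative assumptions:

```python
from statistics import median
from typing import Optional, Sequence


def estimate_minutes(
    line_impact: float,
    own_velocity: Optional[float],
    repo_velocities: Sequence[float],
) -> float:
    """Use the developer's own Line Impact per hour when available;
    otherwise fall back to the median velocity of recent repo contributors."""
    velocity = own_velocity if own_velocity else median(repo_velocities)
    if velocity <= 0:
        raise ValueError("velocity must be positive")
    return line_impact / velocity * 60


# No usable personal history: the repo median (50 per hour) is used instead.
print(estimate_minutes(100, None, [30, 50, 80]))  # 120.0
```

The median is used rather than the mean so that one unusually fast or slow contributor does not skew the fallback.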


🔍 Estimation safeguards

In addition to the multiple estimation methods that GitClear analyzes when choosing a duration estimate for the commit, we have evolved a set of safeguards that work together to protect against miscalculations.


Automatic committer deduplication

Our popular blog post, "Removing duplicates has taken over a year of developer time. Still worth it." explains the blueprint for how GitClear ensures that committers can be uniquely identified across workstations (without admin intervention). In addition to clarifying reports, deduplication turns out to be essential to time estimation. Unless you know that a committer's various git identities are tied together, any "past commit"-based estimation mechanism can yield arbitrary results.


Velocity outlier check

After "Line Impact" and "duration" have been calculated for a commit, GitClear runs a secondary analysis of the implied commit velocity to evaluate whether it falls within reasonable norms. If it doesn't, then our commit processing engine will remove the duration estimate, leaving it blank and excluding it from tech debt calculations.
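A check of this kind could look like the sketch below. The bounds `min_per_hour` and `max_per_hour` are placeholder values chosen for illustration; the thresholds GitClear actually uses are not stated here.

```python
from typing import Optional


def checked_duration(
    line_impact: float,
    minutes: Optional[float],
    min_per_hour: float = 1.0,    # hypothetical lower bound
    max_per_hour: float = 500.0,  # hypothetical upper bound
) -> Optional[float]:
    """Return the duration estimate if the implied velocity is plausible,
    otherwise None so the commit's duration is left blank."""
    if minutes is None or minutes <= 0:
        return None
    implied_velocity = line_impact / (minutes / 60)  # Line Impact per hour
    if not (min_per_hour <= implied_velocity <= max_per_hour):
        return None
    return minutes


print(checked_duration(100, 120))  # 120 (50 per hour is within bounds)
print(checked_duration(5000, 1))   # None (300,000 per hour is an outlier)
```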


Automatic commit deduplication

Another requirement for accurate estimation is to ensure that commits to git subrepos are accounted for. GitClear makes this possible by recognizing the signature of a git subrepo commit, and attributing a duration estimate only to the source repo that supplies the subrepo to its consumer repos.