Two steps were used to generate the data samples collected in the GitClear Code Line Reduction Analysis spreadsheet.
To procure representative commit groups, we started by grabbing commit group IDs within the range being tested:
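The exact query isn't reproduced here, but a minimal sketch of the sampling step might look like the following. The ID bounds, sample size, and method name are illustrative assumptions, not the production values:

```ruby
# Hypothetical sketch: sample candidate commit group IDs from the ID range
# under test. MIN_ID, MAX_ID, and SAMPLE_SIZE are placeholder values.
MIN_ID = 100_000
MAX_ID = 200_000
SAMPLE_SIZE = 10_000

def candidate_commit_group_ids(min_id, max_id, sample_size, rng: Random.new(42))
  # Sample without replacement so no commit group ID is drawn twice
  (min_id..max_id).to_a.sample(sample_size, random: rng)
end

ids = candidate_commit_group_ids(MIN_ID, MAX_ID, SAMPLE_SIZE)
```

Seeding the random generator makes the sample reproducible across runs.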
Then we ran a loop to whittle those groups down to only those with a unique set of commits, so that we wouldn't double-count any groups if we happened to build the same set of commits multiple times:
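A minimal sketch of that deduplication loop follows. The `CommitGroup` stand-in and its fields are assumptions made for illustration; the key idea is treating the sorted list of commit SHAs as the uniqueness key:

```ruby
require 'set'

# Hypothetical stand-in: each commit group wraps a list of commit SHAs.
CommitGroup = Struct.new(:id, :commit_shas)

# Keep only the first group seen for each distinct set of commits, so a
# set of commits that was built multiple times is counted only once.
def unique_commit_groups(groups)
  seen = Set.new
  groups.select do |group|
    key = group.commit_shas.sort
    seen.add?(key) # Set#add? returns nil if the key was already present
  end
end
```

Sorting the SHAs before keying means two groups containing the same commits in different orders still collapse to one entry.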
This loop relies on the derive_comparison_result method to create a Struct that holds the count of changed lines as measured by GitClear and by the git provider. Here is the implementation of that method:
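The actual implementation isn't reproduced in this excerpt; a hedged sketch of what such a method could look like, with all field names assumed for illustration, is:

```ruby
# Hypothetical sketch of derive_comparison_result: builds a Struct holding
# the changed-line counts reported by GitClear and by the git provider's
# conventional patch. Input field names are assumptions, not the real schema.
ComparisonResult = Struct.new(:commit_group_id, :gitclear_lines, :provider_lines) do
  # Percentage of provider-reported lines that GitClear does not count
  def reduction_pct
    return 0.0 if provider_lines.zero?
    100.0 * (provider_lines - gitclear_lines) / provider_lines
  end
end

def derive_comparison_result(group)
  ComparisonResult.new(
    group.fetch(:id),
    group.fetch(:gitclear_changed_lines),
    group.fetch(:provider_changed_lines)
  )
end
```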
These steps reduce 10,000 prospective commit groups down to 3,000-5,000 commit groups per stratum. We then print the results in a consistent order for pasting into the spreadsheet:
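The printing step could be sketched as follows, assuming the results are hashes keyed by hypothetical field names; the point is simply that sorting by a stable key makes the pasted column reproducible across runs:

```ruby
# Hypothetical sketch: emit one value per row, sorted by commit group ID,
# so the column pasted into the spreadsheet is in a consistent order.
# Field names are assumptions for illustration.
def spreadsheet_column(results)
  results.sort_by { |r| r[:commit_group_id] }
         .map { |r| r[:provider_lines].to_s }
end

# Usage: puts spreadsheet_column(results).join("\n")
```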
The resulting list of numbers was pasted into a column of the spreadsheet.
After pasting this data into the spreadsheet, we took two additional steps to reduce noise and outlier results. These steps reduce the skew of the averages, offering a higher-accuracy interpretation of the delta between GitClear and a conventional git patch:
1. Eliminate any results suggesting that GitClear saves more than 70% of lines (a generally implausible result).
2. Eliminate code changes whose patch line count is above the 95th percentile of size relative to the group analyzed.
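The two filters above can be sketched together. This is an illustrative implementation using a nearest-rank percentile and assumed field names, not the spreadsheet formulas actually used:

```ruby
# Nearest-rank percentile on a pre-sorted array (illustrative choice;
# other percentile definitions would give slightly different cutoffs).
def percentile(sorted_values, pct)
  sorted_values[((pct / 100.0) * (sorted_values.size - 1)).round]
end

# Hypothetical sketch of the two noise-reduction filters: drop rows where
# GitClear's implied savings exceed 70%, and rows whose patch size is above
# the 95th percentile of the group. Field names are assumptions.
def filter_outliers(rows)
  p95 = percentile(rows.map { |r| r[:provider_lines] }.sort, 95)
  rows.reject do |r|
    next true if r[:provider_lines].zero? # avoid division by zero
    savings = 100.0 * (r[:provider_lines] - r[:gitclear_lines]) / r[:provider_lines]
    savings > 70 || r[:provider_lines] > p95
  end
end
```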
Without these normalization tactics, the data would most likely show an even stronger reduction in code lines from GitClear. However, because we feel these gains may be reverted as the methodology improves, we proactively eliminate the outlier data before calculating the statistics presented in this research.