Git diff line count data generation method

Two steps were used to generate the data samples collected in the GitClear Code Line Reduction Analysis spreadsheet.


Extracting data samples from the GitClear code database

To procure representative commit groups, we started by grabbing commit group IDs within the range being tested:


cges = CommitGroupExtra.joins(:commit_group).includes(:commit_group).limit(10_000).
  where.not(lines_saved_by_file: [nil, '{}', '[]']).
  where.not(checkup_complete_at: nil).
  where(commit_groups: { value: 301..600 })


Then we ran a loop to whittle those groups down to only those that had a unique set of commits, so we don't double-count any groups if we happen to build the same set of commits multiple times:


indexed = {}
cge_ids = cges.map do |cge|
  next nil unless (comp = cge.derive_comparison_result(cge.commit_group)) && cge.patch_line_count.try(:>, 0) &&
    cge.gitclear_line_count.try(:>, 0)
  next if indexed[cge.distilled_commit_ids]
  indexed[cge.distilled_commit_ids] = true
  cge.id
end
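The de-duplication idea can be tried outside the GitClear database with a minimal, self-contained sketch. The `Group` struct and the `unique_group_ids` helper below are hypothetical stand-ins, not GitClear code, and sorting the commit IDs to build an order-insensitive key is an assumption about what `distilled_commit_ids` provides.

```ruby
# Hypothetical stand-in for a CommitGroupExtra row; only the fields used below.
Group = Struct.new(:id, :commit_ids, :patch_line_count, :gitclear_line_count)

# Keep the first group seen for each distinct set of commits, skipping
# groups whose line counts are missing or zero.
def unique_group_ids(groups)
  seen = {}
  groups.filter_map do |group|
    next unless group.patch_line_count.to_i > 0 && group.gitclear_line_count.to_i > 0

    key = group.commit_ids.sort # order-insensitive key (an assumption)
    next if seen[key]

    seen[key] = true
    group.id
  end
end

groups = [
  Group.new(1, [10, 11], 40, 30),
  Group.new(2, [11, 10], 40, 30), # same commits as group 1, so skipped
  Group.new(3, [12], 0, 0),       # no changed lines, so skipped
  Group.new(4, [13], 25, 20)
]
unique_ids = unique_group_ids(groups)
```

Using `filter_map` drops the skipped groups immediately, so no later `compact` call is needed in this sketch.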


This loop relies on the derive_comparison_result method to create a Struct that holds the changed-line counts reported by GitClear and by the git provider. Here is the implementation of that method:

def derive_comparison_result(commit_group, provider_name: nil)
  return nil unless commit_group
  provider_name ||= commit_group.repo.entity.remote_endpoint.provider_name

  cgcf_ids = lines_saved_by_file.values.pluck("id")
  code_files_by_cgcf_id = commit_group.commit_group_code_files.where(id: cgcf_ids).preload(:code_file).inject({}) { |h, cgcf| h.merge(cgcf.id => cgcf.code_file) }
  cumulative_lines_saved, cumulative_patch_count, cumulative_gitclear_lines = 0, 0, 0

  file_comparisons = lines_saved_by_file.map do |file_path, saved_hash|
    patch_count = saved_hash["patch_changed_line_count"]
    gitclear_count = saved_hash["gitclear_changed_line_count"]
    next nil unless (code_file = code_files_by_cgcf_id[saved_hash["id"]])
    cumulative_patch_count += patch_count
    cumulative_gitclear_lines += gitclear_count
    saved_lines = patch_count - gitclear_count
    cumulative_lines_saved += saved_lines
    html = <<~HTML
      <div class='file_comparison_line'>
        #{ file_path }: #{ provider_name } #{ patch_count }, GitClear #{ gitclear_count }.
        <a href='#code_file_#{ code_file.id }' onClick='GitClear.UI.Commit.compareToProvider(event, #{ code_file.id })' title="Clicking this label will open the same file diff on your git provider, but only if you allow your browser to permit GitClear to open a tab. You will probably need to click an icon to allow it.">
          #{ saved_lines > 0 ? "<span class='saved_pr_file_label'>Saved #{ StringUtility.pluralize(saved_lines, "line") }</span>" : "No lines saved" }
        </a>
      </div>
    HTML
    [ saved_lines, html.squeeze(" ").html_safe ]
  end.compact

  FileComparison.new(file_comparisons, cumulative_patch_count, cumulative_gitclear_lines)
end
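The definition of FileComparison is not shown here. As a rough sketch, it could be a three-member Struct like the one below. Only the constructor's argument order and the `gitclear_line_count` reader appear in the snippets on this page, so the other member names and both helper methods (`lines_saved`, `fraction_saved`) are assumptions added for illustration.

```ruby
# Assumed shape: the source shows only the constructor call and the
# gitclear_line_count reader, so the other member names are guesses.
FileComparison = Struct.new(:file_comparisons, :patch_line_count, :gitclear_line_count) do
  # Hypothetical helper: lines in the provider's patch beyond what GitClear shows.
  def lines_saved
    patch_line_count - gitclear_line_count
  end

  # Hypothetical helper: fraction of patch lines saved, or nil for an empty patch.
  def fraction_saved
    return nil unless patch_line_count.positive?

    lines_saved.to_f / patch_line_count
  end
end

# Illustrative numbers only, not taken from the dataset.
per_file = [[12, "a.rb summary"], [8, "b.rb summary"]]
comparison = FileComparison.new(per_file, 100, 80)
puts comparison.lines_saved
puts comparison.fraction_saved
```

A per-group fraction like this is the quantity the 70% filter described later on this page would be applied to.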


These steps reduce the 10,000 candidate commit groups to 3,000-5,000 commit groups per stratum. We then print them in a consistent order to paste into the spreadsheet:


puts CommitGroupExtra.includes(:commit_group).where(id: cge_ids.compact).order(:id).map { |cge|
  (comp = cge.derive_comparison_result(cge.commit_group)) && comp.gitclear_line_count
}.join("\n")


The resulting list of numbers was pasted into a column of the spreadsheet.


Filtering outliers and potentially low-quality data within the spreadsheet

After pasting this data into the spreadsheet, two additional steps were taken to reduce noise and outlier results. These steps should reduce skew of the averages, to offer a higher-accuracy interpretation of the delta between GitClear and a conventional git patch.


Eliminate any results that suggest GitClear saves more than 70% of lines (a generally implausible result).

Eliminate code changes whose patch line count is above the 95th percentile for the group analyzed.
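The two filters were applied in the spreadsheet, but the same logic can be sketched in Ruby for anyone who wants to reproduce it in code. The `Sample` struct, the `percentile` helper, and `filter_outliers` are hypothetical names. The nearest-rank percentile used here is an assumption, and a spreadsheet's built-in percentile function may interpolate and give a slightly different cutoff.

```ruby
# Hypothetical row: one sample with both changed-line counts.
Sample = Struct.new(:patch_lines, :gitclear_lines)

# Nearest-rank percentile: the smallest value with at least pct percent
# of all values at or below it.
def percentile(values, pct)
  sorted = values.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Drop samples that save more than max_saved_fraction of lines, and samples
# whose patch size is above the given percentile of the group.
def filter_outliers(samples, max_saved_fraction: 0.70, size_percentile: 95)
  return [] if samples.empty?

  cutoff = percentile(samples.map(&:patch_lines), size_percentile)
  samples.select do |sample|
    next false unless sample.patch_lines.positive?

    saved = (sample.patch_lines - sample.gitclear_lines).to_f / sample.patch_lines
    saved <= max_saved_fraction && sample.patch_lines <= cutoff
  end
end

# Illustrative data only: 20 samples that each save 20% of lines,
# with one replaced by an implausible 90% saving.
samples = (1..20).map { |i| Sample.new(i * 10, i * 8) }
samples[4] = Sample.new(50, 5)
kept = filter_outliers(samples)
```

In this example the 95th-percentile cutoff is 190 patch lines, so the 200-line sample and the 90% sample are both dropped and 18 samples remain.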


Without these normalization tactics, the data would most likely show an even stronger reduction in code lines from GitClear. Because we feel that these gains may be reverted as the methodology improves, we proactively eliminate the outlier data before calculating the statistics provided in the current research.