Extracting data samples from the GitClear code database

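To procure representative commit groups, we started by fetching up to 10,000 commit groups within the range being tested (here, a value of 301 to 600), skipping any whose per-file line comparison is empty or whose checkup has not completed: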

cges = CommitGroupExtra.joins(:commit_group).includes(:commit_group).limit(10_000).
  where.not(lines_saved_by_file: [nil, '{}', '[]']).
  where.not(checkup_complete_at: nil).
  where(commit_groups: { value: 301..600 })

Then we ran a loop to whittle those groups down to only those that had a unique set of commits, so we don't double-count any groups if we happen to build the same set of commits multiple times:

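indexed = {}
cge_ids = cges.map do |cge|
  # Skip groups with no comparison result or with no changed lines on either side
  next nil unless (comp = cge.derive_comparison_result(cge.commit_group)) && cge.patch_line_count.try(:>, 0) &&
    cge.gitclear_line_count.try(:>, 0)
  # Skip groups whose exact set of commits has already been counted
  next if indexed[cge.distilled_commit_ids]
  indexed[cge.distilled_commit_ids] = true
  cge.id
end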

This loop relies on the derive_comparison_result method, which builds a Struct holding the count of changed lines reported by GitClear and by the git provider. Here is the implementation of that method:

def derive_comparison_result(commit_group, provider_name: nil)
  return nil unless commit_group
  provider_name ||= commit_group.repo.entity.remote_endpoint.provider_name

  # Look up the code file behind each per-file comparison entry
  cgcf_ids = lines_saved_by_file.values.pluck("id")
  code_files_by_cgcf_id = commit_group.commit_group_code_files.where(id: cgcf_ids).preload(:code_file).inject({}) { |h, cgcf| h.merge(cgcf.id => cgcf.code_file) }
  cumulative_lines_saved, cumulative_patch_count, cumulative_gitclear_lines = 0, 0, 0

  file_comparisons = lines_saved_by_file.map do |file_path, saved_hash|
    patch_count = saved_hash["patch_changed_line_count"]
    gitclear_count = saved_hash["gitclear_changed_line_count"]
    # Entries whose code file can't be found are dropped by the compact below
    next nil unless (code_file = code_files_by_cgcf_id[saved_hash["id"]])
    cumulative_patch_count += patch_count
    cumulative_gitclear_lines += gitclear_count
    saved_lines = patch_count - gitclear_count
    cumulative_lines_saved += saved_lines
    html = <<~HTML
      <div class='file_comparison_line'>
        #{ file_path }: #{ provider_name } #{ patch_count }, GitClear #{ gitclear_count }.
        <a href='#code_file_#{ code_file.id }' onClick='GitClear.UI.Commit.compareToProvider(event, #{ code_file.id })' title="Clicking this label will open the same file diff on your git provider, but only if you allow your browser to permit GitClear to open a tab. You will probably need to click an icon to allow it.">
          #{ saved_lines > 0 ? "<span class='saved_pr_file_label'>Saved #{ StringUtility.pluralize(saved_lines, "line") }</span>" : "No lines saved" }
        </a>
      </div>
    HTML
    [ saved_lines, html.squeeze(" ").html_safe ]
  end.compact

  # Totals cover only the files that were matched to a code file above
  FileComparison.new(file_comparisons, cumulative_patch_count, cumulative_gitclear_lines)
end
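
The definition of FileComparison is not shown in this writeup. As a minimal sketch, assuming it is a plain Ruby Struct whose members line up with the three constructor arguments above and with the gitclear_line_count reader used when printing results, it might look like the following. The member names file_comparisons and patch_line_count, and the helpers lines_saved and percent_saved, are assumptions added for illustration rather than GitClear's actual code.

```ruby
# Hypothetical reconstruction: member order follows the constructor call
# FileComparison.new(file_comparisons, cumulative_patch_count, cumulative_gitclear_lines).
FileComparison = Struct.new(:file_comparisons, :patch_line_count, :gitclear_line_count) do
  # Changed lines shown by the provider's patch beyond what GitClear shows.
  def lines_saved
    patch_line_count - gitclear_line_count
  end

  # Share of patch lines saved, as a percentage; nil when the patch is empty.
  def percent_saved
    return nil unless patch_line_count.positive?
    (lines_saved * 100.0 / patch_line_count).round(1)
  end
end

comp = FileComparison.new([], 120, 90)
puts comp.lines_saved    # 30
puts comp.percent_saved  # 25.0
```

A Struct suits this case because the result is a small, read-only bundle of totals that only needs named accessors.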

These steps reduce the 10,000 candidate commit groups to 3,000-5,000 commit groups per stratum. We then print them in a consistent order so they can be pasted into the spreadsheet:

puts CommitGroupExtra.includes(:commit_group).where(id: cge_ids.compact).order(:id).map { |cge| 
  (comp = cge.derive_comparison_result(cge.commit_group)) && comp.gitclear_line_count }.join("\n")

The resulting list of numbers was pasted into a column of the spreadsheet.

Filtering outliers and potential low quality data within spreadsheet

After pasting this data into the spreadsheet, two additional steps were taken to reduce noise and outlier results. These steps should reduce skew of the averages, to offer a higher-accuracy interpretation of the delta between GitClear and a conventional git patch.

Eliminate any results that suggest GitClear saves more than 70% of lines (a generally implausible result).

Eliminate code changes whose patch line count is above the 95th percentile for the group analyzed.

Without these normalization tactics, the data would most likely show an even stronger reduction in code lines from GitClear. However, we feel those additional gains may be reverted as the methodology improves, so we proactively eliminate the outlier data before calculating the statistics provided in the current research.
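
The two filters were applied inside the spreadsheet, so no code for them exists in the original analysis. For readers who prefer to reproduce them in Ruby, here is a rough sketch under stated assumptions: Row, percentile, and filter_outliers are hypothetical names, the percentile uses the nearest-rank method (spreadsheet percentile functions typically interpolate, so cutoffs may differ slightly), and the 95th percentile is computed over the full group before the 70% filter is applied. Every row is assumed to have a patch line count above zero, which the extraction loop already guarantees.

```ruby
# One row per commit group: changed lines in the provider patch vs. in GitClear.
Row = Struct.new(:patch_lines, :gitclear_lines) do
  # Fraction of patch lines that GitClear saves (0.25 means 25%).
  def fraction_saved
    (patch_lines - gitclear_lines).to_f / patch_lines
  end
end

# Nearest-rank percentile: the smallest value with at least pct% of values at or below it.
def percentile(values, pct)
  sorted = values.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Keep rows that save at most max_fraction_saved and are no larger than the size cutoff.
def filter_outliers(rows, max_fraction_saved: 0.70, size_percentile: 95)
  return [] if rows.empty?
  size_cutoff = percentile(rows.map(&:patch_lines), size_percentile)
  rows.select do |row|
    row.fraction_saved <= max_fraction_saved && row.patch_lines <= size_cutoff
  end
end

# Twenty groups that each save 20%, plus one implausible group that saves 90%.
rows = (1..20).map { |i| Row.new(i * 10, i * 8) } + [Row.new(50, 5)]
kept = filter_outliers(rows)
puts kept.size  # 19: the 200-line group and the 90% group are dropped
```

Both thresholds are keyword arguments so the sensitivity of the averages to each cutoff can be checked by rerunning with different values.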
