Detecting duplicate code blocks is one of the 30 goals that GitClear users can configure and track to help maintain consistently high performance.


GitClear uses the Commit APIs offered by all of the major git providers (GitHub, GitLab, Bitbucket, Azure DevOps) to incrementally piece together the meaningful changes that occur within a repo over time. To detect cloned blocks without having access to the full repo source code, GitClear generates a one-way hash value to represent each changed line. When a line is changed (in a non-ignored file), we gather the 5 lines closest to the change, transform each line into its hashed representation, and search for any other files that contain this same “content signature.”
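
The hashing step described above can be sketched roughly as follows. This is only an illustration, not GitClear's implementation: the function names, the choice of SHA-256, the whitespace normalization, and the way the 5-line window is centered on the change are all assumptions made for the example.

```python
import hashlib

WINDOW = 5  # lines per content signature, as described in the text


def hash_line(line: str) -> str:
    """One-way hash of a single line.

    Assumption: whitespace is collapsed before hashing so that
    re-indented copies still match. The source does not say this.
    """
    normalized = " ".join(line.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def content_signature(lines, changed_index, window=WINDOW):
    """Hash the `window` lines closest to the changed line."""
    half = window // 2
    # Clamp the window so it stays inside the file.
    start = max(0, min(changed_index - half, len(lines) - window))
    return tuple(hash_line(l) for l in lines[start:start + window])


def find_matches(signature, hashed_files):
    """Return (path, start_line) for every place the signature occurs.

    `hashed_files` is a hypothetical index mapping a file path to the
    list of line hashes already stored for that file.
    """
    n = len(signature)
    hits = []
    for path, hashes in hashed_files.items():
        for i in range(len(hashes) - n + 1):
            if tuple(hashes[i:i + n]) == signature:
                hits.append((path, i))
    return hits
```

Because only hashes are stored and compared, a search like `find_matches` never needs the original source text, which is the property the paragraph above relies on.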


When we find a separate code block that matches the content signature, we first gather every location that matches the 5-line “content signature,” then “search outward” from each match to assess the full extent of the overlap between the two code locations.
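
One way to picture the “search outward” step is the sketch below. It assumes each file is available as a list of line hashes and that a 5-line match has already been found at known offsets; the function name `extend_match` and its return shape are hypothetical, not taken from GitClear.

```python
def extend_match(a, i, b, j, n=5):
    """Grow a seed match in both directions.

    `a` and `b` are lists of line hashes, and a[i:i+n] == b[j:j+n] is the
    seed (the 5-line content signature). Returns (start_a, start_b, length)
    for the longest contiguous run of equal hashes containing the seed.
    """
    start_a, start_b = i, j
    # Walk upward while the preceding lines also match.
    while start_a > 0 and start_b > 0 and a[start_a - 1] == b[start_b - 1]:
        start_a -= 1
        start_b -= 1

    end_a, end_b = i + n, j + n
    # Walk downward while the following lines also match.
    while end_a < len(a) and end_b < len(b) and a[end_a] == b[end_b]:
        end_a += 1
        end_b += 1

    return start_a, start_b, end_a - start_a
```

For example, if two files share seven consecutive lines and the seed covers the middle five, `extend_match` would report the full seven-line overlap rather than just the seed.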


If the cloned code locations have not yet been consolidated, GitClear users can set up a “Team Goal” to be notified when the cloned code blocks in their repo exceed a chosen threshold. It is possible to configure goals that alert when 1) any 2+ blocks of 5+ lines are duplicated in the repo, 2) any 2+ blocks of 5+ lines are duplicated within a file, or 3) any 2+ blocks of 5+ lines are duplicated within non-test code. The last option acknowledges that dev teams differ in how much concern they attach to blocks of cloned code within tests. It is less studied, and less certain, whether duplication of blocks within test code carries the same negative outcomes as duplication in non-test code.
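
The three goal scopes can be thought of as filters over the detected clones. The sketch below is purely illustrative: `Scope`, `Clone`, `is_test_path`, and `goal_exceeded` are hypothetical names, the path-based test detection is a naive assumption, and none of this reflects GitClear's actual configuration or API.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    REPO = "repo"            # 1) duplicated anywhere in the repo
    SAME_FILE = "same_file"  # 2) duplicated within a single file
    NON_TEST = "non_test"    # 3) duplicated within non-test code


@dataclass(frozen=True)
class Clone:
    path_a: str
    path_b: str
    length: int  # number of duplicated lines


def is_test_path(path: str) -> bool:
    # Assumption: a simple path heuristic stands in for real test detection.
    parts = path.lower().split("/")
    in_test_dir = any(p in ("test", "tests", "spec") for p in parts[:-1])
    return in_test_dir or parts[-1].startswith("test_")


def counts_toward_goal(clone: Clone, scope: Scope, min_lines: int = 5) -> bool:
    if clone.length < min_lines:
        return False
    if scope is Scope.SAME_FILE:
        return clone.path_a == clone.path_b
    if scope is Scope.NON_TEST:
        # Assumption: a clone is excluded if either side is test code.
        return not (is_test_path(clone.path_a) or is_test_path(clone.path_b))
    return True


def goal_exceeded(clones, scope: Scope, threshold: int) -> bool:
    """True when the number of qualifying clones is above the threshold."""
    return sum(counts_toward_goal(c, scope) for c in clones) > threshold
```

A team that is relaxed about duplication in tests would pick the non-test scope, so that clones between two test files never count toward the alert.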