Update: After the public release of the research presented below, it was reviewed by distinguished academic researcher Dr. Alain Abran (13,000 citations for academic research on "software metrics"). His assessment of this research?
I have looked closely at the research you sent me: it is quite impressive, well-structured, research well conducted and a very careful methodology for data collection, data sets constructions, and data analysis.
For your intended purpose to provide an evidence base that your measurement approach (e.g. 'Line Impact') is better than other alternatives, I am of the opinion that you have clearly demonstrated its superiority.
Read about his comments in depth at our follow-up blog post. Below is the original research announcement.
Almost every git provider offers a graph of git commits made per-repo or per-developer (e.g., GitHub). But none of them provide any guidance as to what this graph actually means, if anything. Does a high commit count correspond to any real-world concerns, like shipping velocity, complex problems solved, or bugs generated? It's tempting to think that a higher commit count must mean more is getting done, but the lack of research makes it anybody's guess whether that's true.
How bad is the lack of published research on software development? A Hacker News story with 780 upvotes from January 2021 linked to a fascinating blog post called "Software effort estimation is mostly fake research" which paints a grimly humorous portrait of existing research:
How large are these datasets that have attracted so many research papers?
The NASA dataset contains 93 rows (that is not a typo, there is no power-of-ten missing), COCOMO 63 rows, Desharnais 81 rows, and ISBSG is licensed by the International Software Benchmarking Standards Group (academics can apply for a limited time use for research purposes, i.e., not pay the $3,000 annual subscription). The China dataset contains 499 rows, and is sometimes used (there is no mention of a supercomputer being required for this amount of data ;-).
This story stirred us to action. Today, we're adding 2,729 data points of research, free for anyone to download and reuse or analyze in whatever ways seem interesting. The attached research cites several sources to substantiate that Story Points are a viable proxy for "software effort estimation." Building on this notion, the research examines how strongly each of three git metrics correlates with Story Points: Commit Count, Lines of Code Changed, and Line Impact.
The data set analyzed in Feb 2021 indicates that Line Impact better approximates software effort estimation than Commit Count or Lines of Code Changed.
Pearson Correlation between estimated effort (Story Points) vs three git metrics analyzed in the five largest repos from the dataset, with 1,847 issues analyzed between them
Here were the average Pearson correlation levels discovered across 61 repos analyzed containing 2,729 issues with Story Points:
Lines of Code Changed: 25% (correlation in large repo range: 13-31%)
Commit Count: 27% (correlation in large repo range: 17-49%)
Line Impact: 38% (correlation in large repo range: 26-61%)
r² correlation between estimated effort (Story Points) vs three git metrics analyzed in the five largest repos from the dataset
The weighted average r² correlation found across the data set:
Lines of Code Changed: 9% (r² in large repo range: 2-10%)
Commit Count: 12% (r² in large repo range: 3-24%)
Line Impact: 19% (r² in large repo range: 7-38%)
To qualify as a "large repo," a repo needed a minimum of 100 issues analyzed.
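For readers who want to run the same kind of comparison on their own data, here is a minimal Python sketch of the arithmetic: a per-repo Pearson correlation between Story Points and a git metric, its square, and an average across repos weighted by issue count. The repo names and numbers are made up for illustration, and weighting by issue count is our assumption here; the paper's exact weighting may differ.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return cov / math.sqrt(var_x * var_y)


def weighted_average(values, weights):
    """Average of values, each weighted by the matching entry in weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical data: repo name -> (story_points, git_metric_value) per issue.
repos = {
    "repo_a": [(1, 10), (2, 25), (3, 20), (5, 60), (8, 70)],
    "repo_b": [(1, 5), (3, 40), (5, 30)],
}

per_repo = {}
for name, issues in repos.items():
    points = [p for p, _ in issues]
    metric = [m for _, m in issues]
    r = pearson_r(points, metric)
    per_repo[name] = {"issues": len(issues), "r": r, "r2": r * r}

# Assumption: repos are weighted by the number of issues they contribute.
overall_r2 = weighted_average(
    [stats["r2"] for stats in per_repo.values()],
    [stats["issues"] for stats in per_repo.values()],
)

for name, stats in per_repo.items():
    print(f"{name}: r={stats['r']:.2f}, r2={stats['r2']:.2f}, issues={stats['issues']}")
print(f"weighted average r2: {overall_r2:.2f}")
```

Running this once per metric (Commit Count, Lines of Code Changed, Line Impact) would give figures comparable in form to the ones listed above.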
As the data shows, there are wide differences in the correlation observed per team. The data gathered so far suggests that a large team can attain a 61% correlation between effort estimation and Line Impact, but the average team sees something closer to a 38% correlation for Line Impact, the most correlative metric studied.
Software teams that bring about a high correlation (0.4+) between their effort estimations and any git-based metric have several advantages. The paper discusses five possible advantages in the "Purpose" section. A few include:
Detect Tech Debt (and estimate when to stop procrastinating its remediation)
If a Jira ticket is estimated as high effort/complexity but results in a disproportionately small measurement in the correlative metric, it becomes more likely that the code in the changed system is difficult to adapt. If the Manager knows that many future tickets will touch this code, their knowledge of effort correlation gives them a data-backed case to pay down the tech debt in that system or library.
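One way to picture this check is the rough Python sketch below. It fits a straight line from Story Points to the metric, then flags tickets whose actual metric value falls well below what the line predicts. The ticket IDs, the numbers, and the 0.5 cutoff are all hypothetical, and a least-squares line is just one simple choice of baseline, not a method taken from the paper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept) for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cannot fit a line when all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def flag_low_output_tickets(tickets, ratio=0.5):
    """Return IDs of tickets whose metric is below `ratio` times the fitted expectation.

    tickets: list of (ticket_id, story_points, metric_value) tuples.
    ratio:   hypothetical cutoff; tune it against your own data.
    """
    points = [t[1] for t in tickets]
    metric = [t[2] for t in tickets]
    slope, intercept = fit_line(points, metric)
    flagged = []
    for ticket_id, story_points, value in tickets:
        expected = intercept + slope * story_points
        # Skip tickets where the fitted line predicts nothing meaningful.
        if expected > 0 and value < ratio * expected:
            flagged.append(ticket_id)
    return flagged


# Hypothetical tickets: (id, Story Points, metric value such as Line Impact).
tickets = [
    ("T-1", 1, 10),
    ("T-2", 2, 20),
    ("T-3", 3, 30),
    ("T-4", 5, 50),
    ("T-5", 8, 80),
    ("T-6", 5, 5),  # estimated as sizable work, but very little measured output
]

print(flag_low_output_tickets(tickets))
```

A flagged ticket is only a prompt for a closer look. A small measurement can also come from a poor estimate or from work that happened outside the repo.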
A/B Test Workplace Policies
Most tech companies have found themselves comfortable working from home over the past year. But is it working for or against productivity? A git metric that correlates with "high development effort" would help Managers adopt a reasoned strategy for returning to the office by comparing WFH metrics against in-office metrics. (Our preliminary data suggests that the switch to WFH has added an average of 0-10% more Line Impact for our customers.)
Proactive Assistance for New or Struggling Developers
Software Managers currently invest hours per week ensuring that their teammates aren't getting stuck on problems or unclear requirements. The typical method of helping is periodic check-ins, but these check-ins have a low hit rate for identifying when a problem is underway (i.e., most developers don't need help at meetings). A highly correlative metric allows the Manager to detect that a Developer has begun to struggle as soon as it happens.
So far we've distributed our research to about 10 professors and Computer Science Institutes for review. In particular, we've submitted it to ESEC/FSE 2021 Industry Papers for review by their Program Committee in May 2021. We are determined to refine our methods as reasonable improvements are proposed to us. We want to help others reproduce this work if they're interested in the opportunities afforded by a metric that correlates with a team's effort estimation.
We are publishing this data now for two reasons. First, because having real data to talk about might help spark useful ideas about what to research next. Second, because the 38% r² correlation for the largest repo in the dataset raises the intriguing prospect that other teams could match or exceed this correlation if they optimized for it.
Our offer to sponsor a Professional Researcher remains open. We want to work with someone who can design and execute high-caliber scientific research that answers pragmatic questions with data.
Here is our PDF research paper to share with colleagues:
FYI it tends to get somewhat behind the latest and greatest version in Google Docs.