Now that I am (hopefully) emerging from a couple weeks where my brain reserved 80% of its processing power for this one problem, I wanted to capture a couple thoughts about the experience of working on very hard problems as an programmer.

Language-specific class/function detection has proven quite the pain to develop over the past couple weeks. I implemented Part I of my solution a month or two ago, with a few thousand Diff Delta over 10 days of work. That gave us an implementation that could start labeling lines with the function that contained them. But I still had a long list of todos when I set aside the v1 implementation for more urgent tasks. Crucial concepts remained unimplemented, like "recognizing when a function had been removed" and should no longer have its bounds propagated forward when we process subsequent commits that touch the file.

I never bothered to estimate how much time that remaining cleanup would require, because it didn't matter. I was going to need to do it at some point regardless. It turned out, after the first 10 days, part two took the better part of two weeks to get to the point where, now, hierarchical function/class detection is pretty reliable and increasingly well-tested in Java, C/C++/C-Sharp, Ruby, JS, Typescript, and Python.

It was a painful two weeks because detecting functions means that we need to step outside the previous paradigm we've always used for interpreting code lines: where each line was its own atomic unit that contained every answer to how it would be stored. In function detection v1, I used a kludgy pair of columns, named input_params and multiline_input_params to reflect that we might know the single-line params before we ever know its full parameter signature. But, to maintain a unique index of function signatures, we need to know the full (multiline) arguments. Anything less is insufficient to uniquely identify a function. That means our v1 approach needed the 🥾. We could no longer instantiate functions upon encountering a function declaration line, because we might not have created lines for its arguments yet. We might not even have all the arguments as part of the commit, since we only get changed lines plus 3 lines context on either side.

Another challenge I underestimated when I started v1 was the distinction between different types of function/class "parents." There are differences in how a "scope parent" like Bingo in Bingo::Bongo should be detected, displayed, and defined, relative to an "inherit parent" like class Bongo < Bingo or class Bongo implements Bingo.

When I think about the numerous iteration to the schema needed to implement function detection, it has been an arduous, time-consuming evolution. But it could be much worse.

For a concept like function detection to work its way through a more conventionally-organized company than ours would take a triumph of coordination and perserverence.

For starters, before development on function detection began, we would need to justify why it belonged on the roadmap. Sales would correctly point out that there are zero near-term customer improvements arising from function detection. We haven't mentioned them in any Jiras yet. In my head, there are visions of sugar plum fairies dancing where there ought to be a list of function use cases.

I presume that at some point we will stick them atop the screen when scrolling through a diff, but I only got that idea earlier this week, seeing GH does it. I can (nay, will!) produce a list of ideas for how we might use it, but suffice to say, there were no meetings held, nor time spent contemplating how and why we will benefit from this implementation. As a developer, I just intuit that it's essential to speak at the granularity of classes and methods. And so, after a month of hard work and re-work, we can.

If a normal company decided to undertake development on this, what instructions would they give? Is a Product Manager going to describe "we want to be able to identify the name of functions, the parameters they take, and the other functions they inherit from?" Probably more granular than the average PM could get, but assume they could. Ok, how do you represent the method from commit to commit? What fields should be encrypted? How to look up and de-duplicate records when their encrypted fields resist indexing? Producing a mechanism to implement that is an uncertain proposition, where one's first attempt will almost certainly be a tech debt-ridden monster that nobody wants to touch (in fairness, even a "cleanly" implemented version of function detection might be avoided by the uninitiated since it is inherently complexity is high).

If GitClear and Amplenote are someday recognized as superior tools by the biggest experts, this general situation -- repeated a few times per year -- probably explains a lot of it.