When I encounter non-trivial programming problems, I try to compare a few LLMs to keep a pulse on which are most effective in practical, day-to-day use. The Twittersphere leaned Anthropic through Q4 2024 and Q1 2025, and my first few examples lean that direction as well.


But in an environment where small differences can be leveraged into 100x results, it seems worthwhile to keep experimenting to confirm that we understand where to get a 15% edge. Thus, we are pleased to present a day-by-day comparison of frontier LLMs on real programming tasks.


Monday May 26: Spot a subtle bug in a file

Query

Explain the following console log output from @cab-viewer.js

```
Dates to load [] known dates 0 -length set Set(0) {size: 0} committers []
peek-transitions.js:41 Couldn't find selected bubble among 1 candidates in f5d45939b01401a58ef4b60f5192fbf10b4e35c3 Error Component Stack
    at PeekView (peek-view.js:57:1)
    at div (<anonymous>)
    at CabViewer (cab-viewer.js:95:1)
overrideMethod @ hook.js:608
presentShas @ peek-transitions.js:41
enterBubble @ peek-view.js:79
(anonymous) @ peek-view.js:130
[several lines of debug stack]
cab-viewer.js:68 datesToLoad 9 given 1736 and 400
cab-viewer.js:135 Setting dates from datesToLoad []
cab-viewer.js:68 datesToLoad 9 given 1736 and 400
cab-viewer.js:147 Dates to load [] known dates 1 -length set Set(1) {Set(0)} committers []
react.development-8cf9dbbe9633e779f2ad9bf026dba93d9b5b6aecfe023a2c1ebff8587c2ad895.js:245 Warning: Failed prop type: The prop dynamicColorByEm is marked as required in `CabDatesFrame`, but its value is `null`.
[several lines of debug stack]
cab-dates-frame.js:120 Going to sort 1 dates in set Set(1) {Set(0)} as array [Set(0)]
cab-viewer.js:68 datesToLoad 10 given 1736 and 325
cab-dates-frame.js:120 Going to sort 1 dates in set Set(1) {Set(0)} as array [Set(0)]
```

How does the datesKnown set end up with a single member that appears to be `Set(0)`?
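
For context (my own sketch, not any model's answer): the telltale `Set(1) {Set(0)}` shape arises when a Set object is added as a member of another Set instead of having its members merged in. Variable names below are assumed, not taken from the real file:

```
const datesToLoad = new Set();          // Set(0) {}
const knownDates = new Set();

// Buggy shape: the empty Set becomes a single member -> Set(1) {Set(0)}
knownDates.add(datesToLoad);

// Intended shape: merge the members, so an empty set adds nothing
datesToLoad.forEach((date) => knownDates.add(date));
```

This would also explain the later `Going to sort 1 dates in set Set(1) {Set(0)} as array [Set(0)]` line: the single "date" being sorted is itself a Set.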

Results

Claude 4 Sonnet proposed a working solution. ChatGPT 4.1, Claude 3.7 Sonnet, and o3 mini did not.


Claude 4 Sonnet

Via Cursor: Useful explanation including a single-line, functional proposed fix. ✅


ChatGPT 4.1

Via Copilot in JetBrains IDE: The proposed fix is identical to the existing code. 🤦‍♂️


Claude 3.7 Sonnet

Via Copilot in JetBrains IDE: Identifies the correct problem line, but proposes a byzantine fix that would add many lines of FUDebt without actually fixing the issue. 😬


o3 mini

Via Copilot in JetBrains IDE: Initially proposes no code, just a terse, hand-wavy explanation. When prompted to propose a fix, it offers code similar to Claude 3.7's non-functional fix.


Sunday May 4: Make a cross-file find and replace

Query

All of the constants and methods now in @segment_display.rb were previously located in `Defines::StatSegments`. Please scan the repo to update any references to the constants or methods in this module that point to Defines::StatSegments. Replace those references with Defines::SegmentDisplay
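
For context, the mechanical change being requested looks like this at a hypothetical call site (the constant name is invented for illustration):

```
# Before: reference points at the module the constants used to live in
colors = Defines::StatSegments::SEGMENT_COLORS

# After: the same reference, pointed at their new home
colors = Defines::SegmentDisplay::SEGMENT_COLORS
```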

Results

There were about 20 files that needed to be changed, so explaining the extent of success/failure requires some extra precision.

Gemini-2.5-Pro

I first let Cursor take a shot at it with Gemini-2.5-Pro "Max". It changed about 25 files over the course of 3-5 minutes (probably closer to 5). Upon review, around 40% of the namespace changes were erroneous: methods that still lived in the original module were updated to point at the new one. I had to carefully review each of the ~25 files to distinguish the false from the accurate namespace changes. Gemini also made three changes that didn't pertain to the namespace at all, and which I can't make heads or tails of: ex 1, ex 2, ex 3. If I were too busy or lazy to carefully review its changes, these would have become part of the "7% increase in bugs for 25% increase in AI adoption" that DORA found in its most recent report.


Claude-3.7-Sonnet

I next reverted the Gemini changes and gave the same query to Claude 3.7. Claude also changed 25 files. I would estimate it performed similarly to Gemini, perhaps marginally worse. Again, around 30-40% of the replacements should not have been made (granted, my query could have been phrased more clearly). Again, there were some bizarre changes that did not seem to relate to the query at all. A handful of the changes reverted work that was committed between the original Gemini query and the Claude query, leading me to suspect that Cursor is not doing a perfect job of detecting when files it has sent to an LLM have changed and need to be re-sent. But a handful of the changes differed from Gemini's, and from changes we had made.


Bottom line

Doing a find-and-replace based on which methods reside in which modules is a step above a "trivial" task, but it is one you would hope an intermediately skilled developer could complete without issue in 60 minutes. Based on the type of mistakes both contenders made, it seems clear I could have rephrased my question for clarity, and included the contents of the original file so the models could better distinguish which methods still lived there. But on balance, I'd judge that Cursor plus frontier models aren't ready for this query and this magnitude of file changes just yet.


On the bright side, Gemini did catch about 95% of the renames that were valid, so it was much more a "false positive" than a "false negative" issue.


Query costs

Here is how the work for this 25-file replacement was described in Cursor's billing:



Digging through the more granular breakdown, those "70 premium tool calls" were split about evenly between Gemini and Claude, so each ~5-minute query cost about $1.50 with Cursor on its frontier "Max" models.


Monday March 17: Look up a VS Code shortcut key

Query

VS Code set hotkey jump to current indent level

Results

Claude 3.7: Appears to become relevant as of 2c. Very text-heavy and burdensome to read, but has two promising suggestions. Upon investigation, both suggestions are semi-relevant, but they do not address the specific question asked: they do not move the indent level forward to the indent level of adjacent lines when the cursor is on a line with no existing spaces (as is arguably best practice to keep commit whitespace standardized).

DeepSeek (v3?): Thorough. Much more skimmable than Claude's answer. Appears to provide two valid options, although upon scrutiny, neither handles the case where the line needs whitespace added before it can indent.

ChatGPT 4o: Reasonably well-formatted, but it provides 5 options and not one of them is good. 🤨

Phind: A very clear, but also very wrong, answer. There is no keyboard shortcut action matching their suggested text, nor any of the five permutations I tried. 🤨

Winner: Claude 3.7 was the most directly actionable. ChatGPT and Phind were bad. DeepSeek was promising, but changing settings is too much to undertake when it involves editing a 500+ line JSON file that would be difficult to revert. None of the answers addressed the specific intent of the question: to create whitespace as necessary to reach the ambient indent level.


Sunday March 9: Find an ambiguously described bug buried in a 30-line method

Given this method, why does Ruby report:

[2025-03-09 06:53:39.983]
[2025-03-09 06:53:39.983] SystemStackError (stack level too deep):
[2025-03-09 06:53:39.983]
[2025-03-09 06:53:39.983] /Users/bill/.rvm/rubies/ruby-3.2.3/lib/ruby/3.2.0/set.rb:511:in `each_key'
[2025-03-09 06:53:39.983] /Users/bill/.rvm/rubies/ruby-3.2.3/lib/ruby/3.2.0/set.rb:511:in `each'
[2025-03-09 06:53:39.983] /Users/bill/.rvm/rubies/ruby-3.2.3/lib/ruby/3.2.0/benchmark.rb:311:in `realtime'
[2025-03-09 06:53:39.983] app/controllers/api/api_controller.rb:131:in `bubble_activity'
[2025-03-09 06:53:39.983] config/initializers/warden_jwt_token_dispatcher_patch.rb:40:in `call'
[2025-03-09 06:53:39.983] lib/middleware/timeout_middleware.rb:16:in `call'
[2025-03-09 06:53:39.983] lib/middleware/okay_check_middleware.rb:23:in `call'
[2025-03-09 06:53:39.983] vendor/gems/rails_modifications/initializers/action_dispatch_no_missing_map_exception.rb:8:in `call'
[2025-03-09 06:53:39.983] lib/middleware/block_ip_middleware.rb:32:in `call'

Anthropic 3.7 Deep Think: Gradually evaluates possibilities over a couple of minutes, eventually pinpointing two lines, one of which was the actual culprit. ✅

ChatGPT 4o: Wrong, not helpful.

Anthropic 3.7 Standard: Gets the same general answer as the Deep Think version, which is correct. ✅ On a second attempt, though, it got it very wrong.

DeepSeek: Thorough, but wrong and not very helpful.

Winner: Anthropic by a good bit over ChatGPT 4o and DeepSeek
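
The method itself isn't reproduced above, but for reference: SystemStackError almost always signals unbounded recursion. A minimal sketch of the general shape, with hypothetical names (not the actual bug):

```
require "benchmark"
require "set"

# A method that re-enters itself while iterating a set inside
# Benchmark.realtime will exhaust the stack, yielding a trace through
# set.rb's `each` and benchmark.rb's `realtime`, as above.
def bubble_activity(shas)
  Benchmark.realtime do
    shas.each { |sha| bubble_activity(shas) } # unbounded re-entry
  end
end

bubble_activity(Set.new(["f5d4593"])) # => SystemStackError: stack level too deep
```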


Friday March 7: Interpret a variety of time zone strings

How to parse a time zone string in Ruby on Rails into the number of hours deviation from UTC? It should parse "US/Pacific" or "Pacific/Fiji" or any other time zone name a user may enter

OpenAI o1 Pro: A disjointed answer. Mildly helpful in that it offers multiple avenues, but it misses the point of wanting a single method that combines all approaches.

DeepSeek DeepThink R1: Partially correct: it doesn't have Anthropic's fallback heuristics, but it better captures DST differences by checking the current time in the zone vs. the current time in UTC ✅

Winner: Anthropic & DeepSeek were both considerably better than OpenAI. Anthropic gave the better "spirit of the question" answer; DeepSeek the more thorough one.
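
For reference, a minimal sketch of the kind of single combined method the query was after, assuming Rails' ActiveSupport is loaded; `ActiveSupport::TimeZone[]` accepts both Rails zone names and TZInfo identifiers like "US/Pacific", and reading the zone's current time keeps DST reflected:

```
require "active_support/time"

# Hours deviation from UTC for whatever zone name a user enters.
# Returns nil for unrecognized names rather than raising.
def utc_offset_hours(zone_name)
  zone = ActiveSupport::TimeZone[zone_name]
  return nil if zone.nil?

  zone.now.utc_offset / 3600.0 # offset in effect right now, so DST counts
end

utc_offset_hours("US/Pacific")   # => -7.0 during DST, -8.0 otherwise
utc_offset_hours("Pacific/Fiji") # => 12.0
```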


Thursday March 6: String diff method (deduce query is asking for LCS)

Query

Write a Ruby method that can take the two strings `bananas o'reilly really for reals` and `banana really(for reals)` and produce the shortest possible difference in characters between the two:

1. "s o'reilly"

2. "()"

No other characters should be present in the diff, since all the rest of the words are common between the two strings. Ensure that the methods produced do not abbreviate variable names.
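
For reference, a sketch of the LCS-based approach the query implies: build the longest-common-subsequence table, then walk it backwards, collecting the characters of each string that fall outside the common subsequence. This is my own sketch, not a contestant's answer; exactly which characters land in the diff depends on how ties in the table are broken:

```
def least_common_subsequence_diff(first_string, second_string)
  first_chars = first_string.chars
  second_chars = second_string.chars

  # Standard LCS dynamic-programming table of prefix match lengths
  lengths = Array.new(first_chars.size + 1) { Array.new(second_chars.size + 1, 0) }
  first_chars.each_index do |first_index|
    second_chars.each_index do |second_index|
      lengths[first_index + 1][second_index + 1] =
        if first_chars[first_index] == second_chars[second_index]
          lengths[first_index][second_index] + 1
        else
          [lengths[first_index + 1][second_index],
           lengths[first_index][second_index + 1]].max
        end
    end
  end

  # Walk back through the table; characters off the common path are the diff
  first_extra = ""
  second_extra = ""
  first_index = first_chars.size
  second_index = second_chars.size
  while first_index > 0 || second_index > 0
    if first_index > 0 && second_index > 0 &&
       first_chars[first_index - 1] == second_chars[second_index - 1]
      first_index -= 1
      second_index -= 1
    elsif second_index > 0 &&
          (first_index == 0 ||
           lengths[first_index][second_index - 1] >= lengths[first_index - 1][second_index])
      second_extra = second_chars[second_index - 1] + second_extra
      second_index -= 1
    else
      first_extra = first_chars[first_index - 1] + first_extra
      first_index -= 1
    end
  end
  [first_extra, second_extra]
end

least_common_subsequence_diff(
  "bananas o'reilly really for reals",
  "banana really(for reals)"
)
# => an 11-and-2 character pair, e.g. ["s o'reilly ", "()"] modulo tie-breaking
```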


Commentary

The examples given in the query weren't quite right: an extra space should have been present in the first example. This led DeepSeek down a multi-minute rabbit hole where it kept trying good ideas, but found they didn't match the expected output because the human hadn't paid close enough attention to give a completely accurate example.


It was interesting that Anthropic Deep Think was also confused that its generated output wasn't exactly right, but it deemed the answer sufficiently correct that it stopped generating in under a minute, similar to o1.


Non-deep thinks

I first submitted the problem to Anthropic & OpenAI without deep thinking. Both gave very deficient answers.


OpenAI deep think

After 90 seconds, it produces this code, whose output isn't very close to what was asked:

DiffUtility.single_chunk_diff("bananas o'reilly really for reals", "banana really(for reals)")
[
[0] "s o'reilly really for reals",
[1] " really(for reals)"
]

Weaksauce.


Anthropic deep think

After 2 minutes, it produces this code, which outputs:

DiffUtility.string_diff("bananas o'reilly really for reals", "banana really(for reals)")
[
[0] "so'illy re ",
[1] "()"
]

Arguably slightly better than desired.


Date: March 2025

Winner: Anthropic over OpenAI by a lot


Wednesday March 5: Give a functional component a ref

How can the functional component CabViewer keep a ref for its functional component child, CabCommittersFrame?


Cursor

Suggests using React.forwardRef on CabCommittersFrame, with no change needed to CabViewer. Certainly preferable if it works (tbd).
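
For reference, a minimal sketch of the pattern Cursor suggested, with the component internals assumed (CabViewer presumably already holds a `useRef`):

```
import React, { forwardRef, useRef } from "react";

// Child opts in to receiving a ref and attaches it to its root node
const CabCommittersFrame = forwardRef(function CabCommittersFrame(props, ref) {
  return <div ref={ref}>{props.children}</div>;
});

function CabViewer() {
  const committersRef = useRef(null); // e.g. committersRef.current.scrollTop
  return <CabCommittersFrame ref={committersRef} />;
}
```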