When I encounter non-trivial programming problems, I try to compare a few LLMs to keep a pulse on which are most effective in practical, day-to-day use. The Twittersphere leaned Anthropic through Q4 2024 and Q1 2025, and my first few examples lean that direction as well.


But in an environment where small differences can be leveraged into 100x results, it seems worthwhile to keep experimenting to confirm that we understand where to get a 15% edge. Thus, we are pleased to present a day-by-day comparison of frontier LLMs on real programming tasks.


Monday May 26: Spot a subtle bug in a file

Query

Explain the following console log output from @cab-viewer.js

```
Dates to load [] known dates 0 -length set Set(0) {size: 0} committers []
peek-transitions.js:41 Couldn't find selected bubble among 1 candidates in f5d45939b01401a58ef4b60f5192fbf10b4e35c3 Error Component Stack
    at PeekView (peek-view.js:57:1)
    at div (<anonymous>)
    at CabViewer (cab-viewer.js:95:1)
overrideMethod @ hook.js:608
presentShas @ peek-transitions.js:41
enterBubble @ peek-view.js:79
(anonymous) @ peek-view.js:130
[several lines of debug stack]
cab-viewer.js:68 datesToLoad 9 given 1736 and 400
cab-viewer.js:135 Setting dates from datesToLoad []
cab-viewer.js:68 datesToLoad 9 given 1736 and 400
cab-viewer.js:147 Dates to load [] known dates 1 -length set Set(1) {Set(0)} committers []
react.development-8cf9dbbe9633e779f2ad9bf026dba93d9b5b6aecfe023a2c1ebff8587c2ad895.js:245 Warning: Failed prop type: The prop dynamicColorByEm is marked as required in `CabDatesFrame`, but its value is `null`.
[several lines of debug stack]
cab-dates-frame.js:120 Going to sort 1 dates in set Set(1) {Set(0)} as array [Set(0)]
cab-viewer.js:68 datesToLoad 10 given 1736 and 325
cab-dates-frame.js:120 Going to sort 1 dates in set Set(1) {Set(0)} as array [Set(0)]
```

How does the datesKnown set end up with a single member that appears to be `Set(0)`?
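
For context (my own sketch, not any model's answer): the telltale `Set(1) {Set(0)}` shape arises when a Set object is added as a member of another Set instead of having its members merged in. Variable names below are assumed, not taken from the real file:

```
const datesToLoad = new Set();          // Set(0) {}
const knownDates = new Set();

// Buggy shape: the empty Set becomes a single member -> Set(1) {Set(0)}
knownDates.add(datesToLoad);

// Intended shape: merge the members, so an empty set adds nothing
datesToLoad.forEach((date) => knownDates.add(date));
```

This would also explain the later `Going to sort 1 dates in set Set(1) {Set(0)} as array [Set(0)]` line: the single "date" being sorted is itself a Set.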

Results

Claude 4 Sonnet proposed a working solution. ChatGPT 4.1, Claude 3.7 Sonnet, and o3 mini did not.


Claude 4 Sonnet

Via Cursor: Useful explanation including a single-line, functional proposed fix. ✅


ChatGPT 4.1

Via Copilot in JetBrains IDE: The proposed fix is identical to the existing code. 🤦‍♂️


Claude 3.7 Sonnet

Via Copilot in JetBrains IDE: Identifies the correct problem line, but proposes a byzantine fix that would add many lines of FUDebt without actually fixing the issue. 😬


o3 mini

Via Copilot in JetBrains IDE: Initially proposes no code, just a terse, hand-wavy explanation. When prompted to propose a fix, it offers code similar to Claude 3.7's non-functional fix.


Sunday May 4: Make a cross-file find and replace

Query

All of the constants and methods now in @segment_display.rb were previously located in `Defines::StatSegments`. Please scan the repo to update any references to the constants or methods in this module that point to Defines::StatSegments. Replace those references with Defines::SegmentDisplay
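
For context, the mechanical change being requested looks like this at a hypothetical call site (the constant name is invented for illustration):

```
# Before: reference points at the module the constants used to live in
colors = Defines::StatSegments::SEGMENT_COLORS

# After: the same reference, pointed at their new home
colors = Defines::SegmentDisplay::SEGMENT_COLORS
```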

Results

There were about 20 files that needed to be changed, so explaining the extent of success/failure requires some extra precision.

Gemini-2.5-Pro

I first let Cursor take a shot at it with Gemini-2.5-Pro "Max". It changed about 25 files over the course of 3-5 minutes (probably closer to 5). Upon review, around 40% of the namespace changes were erroneous: methods that still lived in the original module were updated to point at the new one. I had to carefully review each of the ~25 files to distinguish the false from the accurate namespace changes. Gemini also made three changes that didn't pertain to the namespace at all, and which I can't make heads or tails of: ex 1, ex 2, ex 3. If I were too busy or lazy to carefully review its changes, these would have become part of the "7% increase in bugs for 25% increase in AI adoption" that DORA found in its most recent report.


Claude-3.7-Sonnet

I next reverted the Gemini changes and gave the same query to Claude 3.7. Claude also changed 25 files. I would estimate it performed similarly to Gemini, perhaps marginally worse. Again, around 30-40% of the replacements should not have been made (granted, my query could have been phrased more clearly). Again, there were some bizarre changes that did not seem to relate to the query at all. A handful of the changes reverted work that was committed between the original Gemini query and the Claude query, leading me to suspect that Cursor is not doing a perfect job of detecting when files it has sent to an LLM have changed and need to be re-sent. But a handful of the changes differed from Gemini's, and from changes we had made.


Bottom line

Doing a find-and-replace based on which methods reside in which modules is a step above a "trivial" task, but it is one you would hope an intermediately skilled developer could complete without issue in 60 minutes. Based on the type of mistakes both contenders made, it seems clear I could have rephrased my question for clarity, and included the contents of the original file so the models could better distinguish which methods still lived there. But on balance, I'd judge that Cursor plus frontier models aren't ready for this query and this magnitude of file changes just yet.


On the bright side, Gemini did catch about 95% of the renames that were valid, so it was much more a "false positive" than a "false negative" issue.


Query costs

Here is how the work for this 25-file replacement was described in Cursor's billing:



Digging through the more granular breakdown, those "70 premium tool calls" were split about evenly between Gemini and Claude, so each ~5-minute query cost about $1.50 with Cursor on its frontier "Max" models.


Monday March 17: Look up a VS Code shortcut key

Query

VS Code set hotkey jump to current indent level

Results

Claude 3.7: Appears to become relevant as of 2c. Very text-heavy and burdensome to read, but has two promising suggestions. Upon investigation, both suggestions are semi-relevant, but they do not address the specific question asked: they do not move the indent level forward to the indent level of adjacent lines when the cursor is on a line with no existing spaces (as is arguably best practice to keep commit whitespace standardized).

DeepSeek (v3?): Thorough. Much more skimmable than Claude's answer. Appears to provide two valid options, although upon scrutiny, neither handles the case where the line needs whitespace added before it can indent.

ChatGPT 4o: Reasonably well-formatted, but it provides 5 options and not one of them is good. 🤨

Phind: A very clear, but also very wrong, answer. There is no keyboard shortcut action matching their suggested text, nor any of the five permutations I tried. 🤨

Winner: Claude 3.7 was the most directly actionable. ChatGPT and Phind were bad. DeepSeek was promising, but changing settings is too much to undertake when it involves editing a 500+ line JSON file that would be difficult to revert. None of the answers addressed the specific intent of the question: to create whitespace as necessary to reach the ambient indent level.


Sunday March 9: Find an ambiguously described bug buried in a 30-line method

Given this method, why does Ruby report:

[2025-03-09 06:53:39.983]
[2025-03-09 06:53:39.983] SystemStackError (stack level too deep):
[2025-03-09 06:53:39.983]
[2025-03-09 06:53:39.983] /Users/bill/.rvm/rubies/ruby-3.2.3/lib/ruby/3.2.0/set.rb:511:in `each_key'
[2025-03-09 06:53:39.983] /Users/bill/.rvm/rubies/ruby-3.2.3/lib/ruby/3.2.0/set.rb:511:in `each'
[2025-03-09 06:53:39.983] /Users/bill/.rvm/rubies/ruby-3.2.3/lib/ruby/3.2.0/benchmark.rb:311:in `realtime'
[2025-03-09 06:53:39.983] app/controllers/api/api_controller.rb:131:in `bubble_activity'
[2025-03-09 06:53:39.983] config/initializers/warden_jwt_token_dispatcher_patch.rb:40:in `call'
[2025-03-09 06:53:39.983] lib/middleware/timeout_middleware.rb:16:in `call'
[2025-03-09 06:53:39.983] lib/middleware/okay_check_middleware.rb:23:in `call'
[2025-03-09 06:53:39.983] vendor/gems/rails_modifications/initializers/action_dispatch_no_missing_map_exception.rb:8:in `call'
[2025-03-09 06:53:39.983] lib/middleware/block_ip_middleware.rb:32:in `call'

Anthropic 3.7 Deep Think: Gradually evaluates possibilities over a couple of minutes, eventually pinpointing two lines, one of which was the actual culprit. ✅

ChatGPT 4o: Wrong, not helpful.

Anthropic 3.7 Standard: Gets the same general answer as the Deep Think version, which is correct. ✅ On a second attempt, though, it got it very wrong.

DeepSeek: Thorough, but wrong and not very helpful.

Winner: Anthropic by a good bit over ChatGPT 4o and DeepSeek
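
The method itself isn't reproduced above, but for reference: SystemStackError almost always signals unbounded recursion. A minimal sketch of the general shape, with hypothetical names (not the actual bug):

```
require "benchmark"
require "set"

# A method that re-enters itself while iterating a set inside
# Benchmark.realtime will exhaust the stack, yielding a trace through
# set.rb's `each` and benchmark.rb's `realtime`, as above.
def bubble_activity(shas)
  Benchmark.realtime do
    shas.each { |sha| bubble_activity(shas) } # unbounded re-entry
  end
end

bubble_activity(Set.new(["f5d4593"])) # => SystemStackError: stack level too deep
```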


Friday March 7: Interpret a variety of time zone strings

How to parse a time zone string in Ruby on Rails into the number of hours deviation from UTC? It should parse "US/Pacific" or "Pacific/Fiji" or any other time zone name a user may enter

OpenAI o1 Pro: A disjointed answer. Mildly helpful in that it offers multiple avenues, but it misses the point of wanting a single method that combines all approaches.

DeepSeek DeepThink R1: Partially correct: it doesn't have Anthropic's fallback heuristics, but it better captures DST differences by checking the current time in the zone vs. the current time in UTC ✅

Winner: Anthropic & DeepSeek were both considerably better than OpenAI. Anthropic gave the better "spirit of the question" answer; DeepSeek the more thorough one.
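
For reference, a minimal sketch of the kind of single combined method the query was after, assuming Rails' ActiveSupport is loaded; `ActiveSupport::TimeZone[]` accepts both Rails zone names and TZInfo identifiers like "US/Pacific", and reading the zone's current time keeps DST reflected:

```
require "active_support/time"

# Hours deviation from UTC for whatever zone name a user enters.
# Returns nil for unrecognized names rather than raising.
def utc_offset_hours(zone_name)
  zone = ActiveSupport::TimeZone[zone_name]
  return nil if zone.nil?

  zone.now.utc_offset / 3600.0 # offset in effect right now, so DST counts
end

utc_offset_hours("US/Pacific")   # => -7.0 during DST, -8.0 otherwise
utc_offset_hours("Pacific/Fiji") # => 12.0
```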


Thursday March 6: String diff method (deduce query is asking for LCS)

Query

Write a Ruby method that can take the two strings `bananas o'reilly really for reals` and `banana really(for reals)` and produce the shortest possible difference in characters between the two:

1. "s o'reilly"

2. "()"

No other characters should be present in the diff, since all the rest of the words are common between the two strings. Ensure that the methods produced do not abbreviate variable names.
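
For reference, a sketch of the LCS-based approach the query implies: build the longest-common-subsequence table, then walk it backwards, collecting the characters of each string that fall outside the common subsequence. This is my own sketch, not a contestant's answer; exactly which characters land in the diff depends on how ties in the table are broken:

```
def least_common_subsequence_diff(first_string, second_string)
  first_chars = first_string.chars
  second_chars = second_string.chars

  # Standard LCS dynamic-programming table of prefix match lengths
  lengths = Array.new(first_chars.size + 1) { Array.new(second_chars.size + 1, 0) }
  first_chars.each_index do |first_index|
    second_chars.each_index do |second_index|
      lengths[first_index + 1][second_index + 1] =
        if first_chars[first_index] == second_chars[second_index]
          lengths[first_index][second_index] + 1
        else
          [lengths[first_index + 1][second_index],
           lengths[first_index][second_index + 1]].max
        end
    end
  end

  # Walk back through the table; characters off the common path are the diff
  first_extra = ""
  second_extra = ""
  first_index = first_chars.size
  second_index = second_chars.size
  while first_index > 0 || second_index > 0
    if first_index > 0 && second_index > 0 &&
       first_chars[first_index - 1] == second_chars[second_index - 1]
      first_index -= 1
      second_index -= 1
    elsif second_index > 0 &&
          (first_index == 0 ||
           lengths[first_index][second_index - 1] >= lengths[first_index - 1][second_index])
      second_extra = second_chars[second_index - 1] + second_extra
      second_index -= 1
    else
      first_extra = first_chars[first_index - 1] + first_extra
      first_index -= 1
    end
  end
  [first_extra, second_extra]
end

least_common_subsequence_diff(
  "bananas o'reilly really for reals",
  "banana really(for reals)"
)
# => an 11-and-2 character pair, e.g. ["s o'reilly ", "()"] modulo tie-breaking
```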


Commentary

The examples given in the query weren't quite right: an extra space should have been present in the first example. This led DeepSeek down a multi-minute rabbit hole where it kept trying good ideas, but found they didn't match the expected output because the human hadn't paid close enough attention to give a completely accurate example.


It was interesting that Anthropic Deep Think was also confused that its generated output wasn't exactly right, but it deemed the answer sufficiently correct that it stopped generating in under a minute, similar to o1.


Non-deep thinks

I first submitted the problem to Anthropic & OpenAI without deep thinking. Both gave very deficient answers.


OpenAI deep think

After 90 seconds, it produces this code, whose output isn't very close to what was asked:

DiffUtility.single_chunk_diff("bananas o'reilly really for reals", "banana really(for reals)")
[
[0] "s o'reilly really for reals",
[1] " really(for reals)"
]

Weaksauce.


Anthropic deep think

After 2 minutes, it produces this code, which outputs:

DiffUtility.string_diff("bananas o'reilly really for reals", "banana really(for reals)")
[
[0] "so'illy re ",
[1] "()"
]

Arguably slightly better than desired.


Date: March 2025

Winner: Anthropic over OpenAI by a lot


Wednesday March 5: Give a functional component a ref

How can the functional component CabViewer keep a ref for its functional component child, CabCommittersFrame?


Cursor

Suggests using React.forwardRef on CabCommittersFrame, with no change needed to CabViewer. Certainly preferable if it works (tbd).
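
For reference, a minimal sketch of the pattern Cursor suggested, with the component internals assumed (CabViewer presumably already holds a `useRef`):

```
import React, { forwardRef, useRef } from "react";

// Child opts in to receiving a ref and attaches it to its root node
const CabCommittersFrame = forwardRef(function CabCommittersFrame(props, ref) {
  return <div ref={ref}>{props.children}</div>;
});

function CabViewer() {
  const committersRef = useRef(null); // e.g. committersRef.current.scrollTop
  return <CabCommittersFrame ref={committersRef} />;
}
```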