AI code benchmarks lied to us

Tech

We finally got a benchmark that actually matches reality.

Thank you Browserbase for sponsoring! Check them out at: https://soydev.link/browserbase

SOURCES:
https://deepswe.datacurve.ai/
https://x.com/theo/status/2059352130289651925

Want to sponsor a video? Learn more here: https://soydev.link/sponsor-me

Check out my Twitch, Twitter, Discord, and more at https://t3.gg

S/O @Ph4seon3 for the awesome edit 🙏

Comments 100

madeleinedelahaye639 1 week ago

15:25 YOU NOW NO LONGER HAVE AN EXCUSE: CREATE THIS TOOL!

revadawn74 1 week ago

Didn't know DeepSWE was made by a bunch of Waterloo kids! Feels great to see stuff from there, being a grad myself. Great video B-)

carol_atkinson 1 week ago

Love this video. Your videos are always very useful. These benchmarks are congruent with my experience. It’s kind of a bummer because I want open-weight models to be better. But my anecdotal experience is that they are not.

rebeccareynolds895 1 week ago

I recently moved from open-weight models to frontier models and god, was it better. There's literally a night-and-day difference between these models. And the bench kinda confirms it.

dalton_gentry 1 week, 1 day ago

It's curious how they added the thinking effort for Claude Opus and GPT but not for DeepSeek, MiMo, or GLM...

olivierdrift98 1 week, 1 day ago

Insightful bench from DeepSWE. Absolutely, Theo, good idea to make my own bench. How can you really know if you don't bench it properly? Here goes, starting yet another project... 😑

pedrolucas.abreu 1 week, 1 day ago

What about Composer?

shawn.henderson 1 week, 1 day ago

World would legit be fucked without Theo 🙇‍♂

gabinoirizarry390 1 week, 1 day ago

Could you do a video about proper prompting?

joshuachen282 1 week, 1 day ago

Finally a benchmark that doesn't just test vibes. The gap between what benchmarks promise and what actually works in production has been driving me crazy — glad someone's calling it out. Great breakdown as always Theo 🔥

carrie.chambers 1 week, 1 day ago

Honestly, the bad SWE prompt sounds like the exact kind of prompt I'd need as a non-AI.

angela.patterson 1 week, 1 day ago

That was a great video, I will mark down the conclusion in Obsidian. You should try Obsidian too some day!

meganseraph65 1 week, 1 day ago

Finally, a benchmark that isn’t just vibes 😄

helena_novais 1 week, 1 day ago

Finally a benchmark that actually tests. However based on what I saw DeepSWE is still an "existing codebase engineering" benchmark, not a "greenfield architecture" benchmark. While the former might matter more for large companies managing huge codebases, the latter is certainly important for hobbyists constantly developing new personal projects. I would love to see a benchmark covering that.

christina.lewis 1 week, 1 day ago

I would really love to see Composer 2.5 in this bench

kerry_nicholson 1 week, 1 day ago

Maybe something between pi and mini-swe-agent would be better. Like, just an edit tool, a read tool, and a bash tool. Just so it matches real harnesses slightly better while still not giving an advantage to one model in particular.

normamcconnell229 1 week, 1 day ago

When we can combine the results from this bench and terminal bench 3 (which aims to be more realistic just like this, but focuses on problems that are very hard even for top models, but aren't necessarily long-horizon like DeepSWE), we will already have a more accurate-feeling scoring metric than what Artificial Analysis currently uses.

rafaél_gastélum 1 week, 1 day ago

Interesting, the results of this benchmark line up well with my experience using them.

monica.proctor 1 week, 1 day ago

We need more benchmarks like these. Sadly, open-weight models are behind, but I hope this helps them get ahead in performance per cost.

robert.maldonado 1 week, 1 day ago

Wish Composer was in the test. It feels really good when I use it