We finally got a benchmark that actually matches reality.

Thank you Browserbase for sponsoring! Check them out at: https://soydev.link/browserbase

SOURCES:
https://deepswe.datacurve.ai/
https://x.com/theo/status/2059352130289651925

Want to sponsor a video? Learn more here: https://soydev.link/sponsor-me

Check out my Twitch, Twitter, Discord, and more at https://t3.gg

S/O @Ph4seon3 for the awesome edit 🙏
15:25 YOU NOW NO LONGER HAVE AN EXCUSE: CREATE THIS TOOL!
Didn't know DeepSWE was made by a bunch of Waterloo kids! Feels great to see stuff from there, being a grad myself. Great video B-)
Love this video. Your videos are always very useful. These benchmarks are congruent with my experience. It’s kind of a bummer because I want open weight models to be better. But my anecdotal experience is that they are not.
I recently moved from open weight models to frontier models and god, was it better. There's literally a night-and-day difference between these models. And the bench kinda confirms it.
It's curious how they added the thinking effort for Claude Opus and GPT but not for DeepSeek, MiMo, or GLM...
Insightful bench from DeepSWE. Absolutely a good idea, Theo, to make my own bench; how can you really know if you don't bench it properly? Here goes, starting yet another project... 😑
What about Composer?
World would legit be fucked without Theo 🙇‍♂️
Could you do a video about proper prompting?
Finally a benchmark that doesn't just test vibes. The gap between what benchmarks promise and what actually works in production has been driving me crazy — glad someone's calling it out. Great breakdown as always Theo 🔥
Honestly, the bad SWE prompt sounds like the exact kind of prompt I'd need as a non-AI.
That was a great video, I will mark down the conclusion in Obsidian. You should try Obsidian too some day!
Finally, a benchmark that isn’t just vibes 😄
Finally a benchmark that actually tests. However based on what I saw DeepSWE is still an "existing codebase engineering" benchmark, not a "greenfield architecture" benchmark. While the former might matter more for large companies managing huge codebases, the latter is certainly important for hobbyists constantly developing new personal projects. I would love to see a benchmark covering that.
I would really love to see a composer 2.5 in this bench
Maybe something between pi and mini-swe-agent would be better. Like, just an edit tool, a read tool, and a bash tool. Just so it matches real harnesses slightly better while still not giving an advantage to one model in particular.
When we can combine the results from this bench and terminal bench 3 (which aims to be more realistic just like this, but focuses on problems that are very hard even for top models, but aren't necessarily long-horizon like DeepSWE), we will already have a more accurate feeling scoring metric than what Artificial Analysis currently uses.
Interesting, the results of this benchmark line up well with my experience using them.
We need more benchmarks like these, sadly open-weight models are behind but I hope this helps them get ahead in performance per cost
Wish composer was in the test. It feels really good when I use it