By Alex Swearingen in Video — 06 Aug 2025

GPT-OSS Underperforms in Independent Testing

We put OpenAI's newly released GPT-OSS model through our own independent benchmark, the Brokk Power Ranking, to see how it really performs on coding tasks.

While OpenAI highlighted strong performance in its launch blog post, especially on a single coding benchmark, our results tell a different story. In this video, Jonathan from Brokk walks through how GPT-OSS stacks up against other popular open-weight and proprietary models, including GPT-4 Mini, Flash 2.0, and several others.

Highlights:

A look at the benchmarks OpenAI showcased (and the ones it didn’t)
How GPT-OSS performed in real-world coding scenarios
Comparisons with leading open models
Why these results matter for developers and researchers

Spoiler: GPT-OSS landed near the bottom of the pack in our tests, despite its impressive accessibility and hardware efficiency.
