Why we switched SlopCop to Flash 3.5

The Flash 3.5 release puts the latest iteration of our favourite coding model head-to-head against larger models such as GPT-5.4 and Gemini 3.1 Pro, and it holds its own. The headline-grabbing downside, of course, is price: 3.5 is 3x more expensive than Flash 3 on a per-token basis.

We thought, "let's try it anyway." We added it as another model under a "specialist" role for SlopCop to see how it does...

TLDR: Flash 3.5 is cheaper and faster per completed SlopCop scan than Flash 3, while still improving quality. It finished reliably, used the tools efficiently, and remains close enough on report quality to larger/slower/more-expensive models to be the new SlopCop workhorse.

SlopCop for Dummies

SlopCop is, in a nutshell, a web application that reports on "LLM smells" and maintainability risk, driven by static analysis tools and AI agents, with the end goal of remediation assistance. The goal is largely code review.

We split the workload by separating "find the slop" from "write the report" and giving those jobs to two different model types. The former is split into parallel, well-defined tasks for a smaller model, called the specialist model, while the latter is a single report-writing and review task given to a larger model, called the synthesis model.

Prior to our initial beta launch, we evaluated GPT-5.4-mini and Flash 3 Preview as our specialists. We measure:

  • Stability: Do we produce a report at the end of the scan? How often do we retry?
  • Quality: Does the report pass a basic sniff test? Does the report use evidence to make sound claims and related/effective recommendations?
  • Cost: What is the total cost of a scan?

These questions are evaluated across a mix of open-source repositories, models, and levels of reasoning.

Around launch, we found that Flash 3 was the most stable and cheapest, so we selected it as our launch specialist. This was not without downsides: Flash 3 tends to overstate claims and produce less convincing reports than GPT-5.4-mini.

Since then, we have tuned our prompts and retry logic and improved harness stability, so the release of Flash 3.5 was a good point at which to re-evaluate our model configuration.

Evaluation and Results

This benchmark round produced 360 scans total: 8 repositories, 5 specialist models, 3 specialist reasoning levels, and 3 synthesis reasoning levels. Each specialist model had 72 scans, or 24 per reasoning level. The synthesis model was fixed to Gemini 3.1 Pro Preview, and each specialist agent was limited to 5 tool steps.
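To make the scan counts concrete, here is a minimal sketch of the benchmark matrix. Only the counts come from this post; the repository and model labels are placeholders, and the level names for synthesis reasoning are an assumption.

```python
from itertools import product

# Placeholder labels: only the counts (8, 5, 3, 3) come from the post.
repositories = [f"repo_{i}" for i in range(1, 9)]              # 8 repositories
specialist_models = [f"specialist_{i}" for i in range(1, 6)]   # 5 specialist models
specialist_reasoning = ["low", "medium", "high"]               # 3 levels
synthesis_reasoning = ["low", "medium", "high"]                # 3 levels (names assumed)

# One scan per combination of repository, model, and both reasoning settings.
scans = list(product(repositories, specialist_models,
                     specialist_reasoning, synthesis_reasoning))

total_scans = len(scans)
scans_per_model = total_scans // len(specialist_models)
scans_per_model_and_level = scans_per_model // len(specialist_reasoning)

print(total_scans, scans_per_model, scans_per_model_and_level)
```

The last figure is the denominator behind the "of 24" completion counts quoted in the stability section.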

The repositories were deliberately mixed. Some are newer AI-era projects, such as Codex and Bun. Others are older or pre-AI codebases, such as TitanDB. The mix tests that SlopCop does not treat ordinary legacy complexity as automatic evidence of AI slop.

A quick note on the tables: a scan is one completed SlopCop run. A group is the fair-comparison bucket: same repository, same reasoning setting, same synthesis setting, and every specialist model had to finish. I use groups for cost, speed, and quality comparisons so one easy or unlucky repository does not dominate the story.
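The group definition can be sketched as a filter over scan records. This is an illustration only, not the SlopCop harness: the `Scan` record, its field names, and `matched_groups` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    # Hypothetical record shape, invented for this sketch.
    repository: str
    specialist_model: str
    specialist_reasoning: str
    synthesis_reasoning: str
    completed: bool


def matched_groups(scans, models):
    """Return the buckets in which every specialist model completed its scan."""
    buckets = defaultdict(dict)
    for scan in scans:
        key = (scan.repository, scan.specialist_reasoning, scan.synthesis_reasoning)
        buckets[key][scan.specialist_model] = scan.completed
    # A missing model counts as "did not finish", so the bucket is dropped.
    return sorted(
        key for key, results in buckets.items()
        if all(results.get(model, False) for model in models)
    )


models = ["model_a", "model_b"]
example_scans = [
    Scan("repo_1", "model_a", "low", "low", True),
    Scan("repo_1", "model_b", "low", "low", True),
    Scan("repo_2", "model_a", "low", "low", True),
    Scan("repo_2", "model_b", "low", "low", False),  # one failure drops the bucket
]
groups = matched_groups(example_scans, models)
print(groups)
```

Dropping a whole bucket when any model fails is what keeps the comparison fair: every model is then scored on exactly the same set of repositories and settings.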

Let's start with stability

There are many reasons that a scan may fail, but typically this happens at the specialist level as it is the most demanding in terms of asking for specific shapes of tool inputs and LLM outputs.

The first question is boring but non-negotiable: does the scan actually finish? The answer is a stability pattern that is less about model family and more about how much room the smaller models had to think.

Gemini 3 Flash low finished only 16 of 24 scans, and GPT-5.4-mini low finished 17 of 24. Flash 3.5 was the exception: even if it behaves more like a smaller, faster model operationally, it held up well, with medium going 24 for 24 and low/high both finishing 23 of 24. Larger models were mostly stable regardless, so the choice there became more about cost, speed, and report quality than basic completion.
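For readers who prefer rates to raw counts, the numbers above convert as follows. The counts are the ones quoted in the paragraph; the dictionary layout is just a convenience for this sketch.

```python
# (finished, attempted) counts as quoted in the paragraph above.
completion_counts = {
    ("Gemini 3 Flash", "low"): (16, 24),
    ("GPT-5.4-mini", "low"): (17, 24),
    ("Flash 3.5", "low"): (23, 24),
    ("Flash 3.5", "medium"): (24, 24),
    ("Flash 3.5", "high"): (23, 24),
}

completion_rates = {
    key: finished / attempted
    for key, (finished, attempted) in completion_counts.items()
}

for (model, level), rate in completion_rates.items():
    print(f"{model} {level}: {rate:.1%}")
```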

SlopCop comprises two model roles, and this experiment focuses largely on the specialist agent. We should therefore also consider whether a scan failed during report generation, i.e., synthesis, and which role was at fault.

These failures were not one thing. Some were genuine garbage-in, garbage-out: a specialist got partial static-analysis context, retried, and still could not produce a usable lane. Others were stricter quality gates doing their job a little too bluntly. The system had useful structured material, but a later validation step decided it was not complete or well-grounded enough to publish as a normal scored report. The hard part is not just making models return JSON; it is deciding what to do with partial, cautious, or awkward evidence without pretending it is more certain than it is.

This is exactly why we run regular benchmarks: they show us where we still need to tweak our prompts and harness.

Cost per completed scan

The cost/speed chart was the most surprising one. Gemini 3.5 Flash sits on the efficient frontier: low was the cheapest and fastest setting in the matched runs, while medium cost only a little more and was much more stable.

(Same chart as above)

Gemini 3 Flash was not the obvious budget winner; at low reasoning it was both slower and more expensive than its medium/high settings, which is exactly the kind of “cheap setting becomes expensive through retries or wandering” behavior this benchmark is meant to catch.

Gemini 3.1 Pro cost more than Flash, but was still reasonably quick. GPT-5.4 produced strong reports, but paid for it in both cost and runtime. GPT-5.4-mini was uneven, but not useless: when it completed, some runs were strong for the cost, especially outside the low-reasoning setting.

Why and how is Flash 3.5 cheaper?

One possible explanation was that Flash 3.5 was simply doing less work: fewer tools, less analysis, worse reports. The tool-use data does not quite support that...

Flash 3.5 still used a healthy number of specialist tool calls, especially at medium and high reasoning. It was not cheap because it ignored the repo.

The more interesting pattern was tool discipline. Gemini 3.1 Pro used the fewest tools, especially at low reasoning. GPT-5.4 used a moderate, steady amount. GPT-5.4-mini was uneven, but not obviously wasteful on every completed scan. Flash 3 low was the real warning sign: the “cheap” setting did not translate into cheap scans, because low reasoning made it slower and less stable.

So Flash 3.5 is more expensive per token than Flash 3, but cheaper per SlopCop scan in this run. The actual bill depended on runtime, retries, and how directly the model used the tools.
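A small sketch shows how a pricier per-token model can still come out cheaper per completed scan. The formula is an assumption about how such a cost could be computed, not SlopCop's billing logic, and every number below is made up for illustration (relative units, not benchmark results). Only the 3x price ratio comes from this post.

```python
def cost_per_completed_scan(price_per_token, tokens_per_attempt, attempts, completed_scans):
    """Total spend across all attempts (including retries and failures),
    divided by the number of scans that actually finished."""
    if completed_scans <= 0:
        raise ValueError("need at least one completed scan")
    return price_per_token * tokens_per_attempt * attempts / completed_scans


# Illustrative inputs only. The second model costs 3x more per token,
# but uses fewer tokens per attempt and needs no retries.
baseline = cost_per_completed_scan(
    price_per_token=1.0, tokens_per_attempt=1000, attempts=30, completed_scans=20
)
pricier = cost_per_completed_scan(
    price_per_token=3.0, tokens_per_attempt=400, attempts=24, completed_scans=24
)

print(baseline, pricier)
```

With these hypothetical inputs, the model that is 3x more expensive per token ends up cheaper per completed scan, because retries and token usage outweigh the price ratio.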

What about the quality of each report?

Cost and runtime are easy to measure. Report quality is harder, so I use a qualitative judge pass called the “sniff test.” It is run with GPT-5.4 high and asks for three 1-5 scores:

  • reportQualityScore: the main quality signal. A 1 means the report is poor, misleading, overconfident, or weakly grounded. A 5 means it is specific, evidence-backed, and well-scoped.
  • repositoryRiskScore: the judge’s read of repository risk. A 1 means low apparent risk, and a 5 means severe risk. It can be null when the report evidence is too thin or contradictory to judge the repository itself.
  • confidenceScore: how much to trust the judge assessment. A 1 means the evidence is weak or contradictory. A 5 means the assessment is strongly supported.
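The three scores could be modelled with a small validated record like the one below. The field names match the list above; the class name, the validation helper, and the assumption that each score is an integer are mine, not taken from the SlopCop codebase.

```python
from dataclasses import dataclass
from typing import Optional


def _check_score(name, value):
    # Assumes integer scores; bool is rejected because it is a subclass of int.
    if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
        raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")


@dataclass(frozen=True)
class SniffTestResult:
    reportQualityScore: int             # 1 = poor or weakly grounded, 5 = specific and well-scoped
    repositoryRiskScore: Optional[int]  # 1 = low risk, 5 = severe; None if evidence is too thin
    confidenceScore: int                # 1 = weak or contradictory evidence, 5 = strongly supported

    def __post_init__(self):
        _check_score("reportQualityScore", self.reportQualityScore)
        _check_score("confidenceScore", self.confidenceScore)
        if self.repositoryRiskScore is not None:
            _check_score("repositoryRiskScore", self.repositoryRiskScore)


# A judge result where the report evidence was too thin to rate the repository itself.
result = SniffTestResult(reportQualityScore=3, repositoryRiskScore=None, confidenceScore=4)
print(result)
```

Keeping `repositoryRiskScore` nullable, rather than forcing a number, lets the judge say "cannot tell" instead of guessing.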

The judge also comments on whether the report separated maintainability risk from AI-slop confidence, handled plausible non-AI explanations, kept sampled evidence scoped, avoided broad “clean” claims from non-findings, and tied recommendations back to named evidence.

The sniff test complicated the story...

Flash 3.5 looked excellent on cost, runtime, and stability. The quality signal was more mixed. The recurring critique was that reports sometimes stretched the evidence: sampled findings became broader repo claims, maintainability debt sounded too much like AI-slop evidence, or “no finding in this sampled lane” read like “this area is clean.”

Report quality averaged about 2.9/5, while confidence in the sniff assessment averaged about 3.7/5. The reports were usually useful, and often specific, but the benchmark kept finding places where the wording was stronger than the evidence.

Flash 3.5 ended up being the practical surprise: cheap, fast, stable, and good enough to stay in the running. Flash 3 looked cheaper on paper, but low reasoning was unstable and the completed scans were not compelling enough to justify the tradeoff.

Average cost is the dollar value of the tokens used.

GPT-5.4 consistently produced some of the strongest reports, but cost more and took longer. GPT-5.4-mini had some strong completed scans for the money, but low reasoning was too unstable.

Gemini 3.1 Pro was less compelling than expected: high reasoning helped, but low and medium landed near the bottom on report quality.

Conclusion

We took the Flash 3.5 release as an opportunity to re-evaluate our specialist agent models, benchmarking Flash 3.5 against the incumbent Flash 3, its peer GPT-5.4-mini, and the larger models Gemini 3.1 Pro and GPT-5.4. (Anthropic models and GPT-5.5 were eliminated from the full eval due to cost.)

GPT-5.4 and GPT-5.4-mini both produced some interesting results, but Flash 3.5 medium went 24 for 24, stayed cheap, and was fast enough that the product still feels usable. Flash 3.5 low was tempting on cost and speed, but medium gives us a cleaner stability story for very little extra money. Flash 3.5 medium is what we have landed on for production SlopCop specialist reports.

We will continue to optimize SlopCop report quality and speed over the coming weeks!