Ornith-1.0 Rewrote Its Own Training Wheels. And That’s the Point

TL;DR

– Ornith-1.0-397B hits 77.5 on Terminal-Bench 2.1. Open-weight state-of-the-art.
– The 9B variant runs on a single consumer GPU (about 19GB in bf16; that figure was reported on an 80GB card) and beats larger models at coding.
– MIT license. No usage restrictions. No API dependency. Download it, fine-tune it, deploy it under your own roof.
– DeepReinforce’s self-scaffolding trick lets the model write its own scaffold during RL training. The orchestration layer evolves instead of getting frozen at launch.

—

Most writeups about Ornith-1.0 are chasing the wrong number. The 397B parameter count. The “9B beats larger models” hook. Yeah, those are real. They’re nice.

But they’re not why I’m writing this.

Here’s the thing that got buried: the model builds its own scaffolding.

During training. Without a human hand-coding retry logic or memory management or tool call sequencing. That’s a different animal entirely.

Self-Scaffolding: The Part Everyone Missed

DeepReinforce calls it self-scaffolding.

The RL loop runs two stages every step.

First, the model reads a task and its current scaffold, then proposes a refined scaffold for that task. Second, conditioned on that scaffold and the task description, it generates a solution rollout. GRPO trains both stages together. Reward signal flows back to the scaffold proposal and the solution output simultaneously.
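To make the two-stage step concrete, here is a toy sketch in Python. It is not DeepReinforce's code: `propose_scaffold`, `rollout`, and the reward rule are made-up stand-ins for the model, and the only part taken from GRPO itself is the group-relative advantage (each sample's reward minus the group mean, divided by the group standard deviation).

```python
import random
import statistics

def propose_scaffold(task, scaffold, rng):
    # Stage 1 stand-in (hypothetical): the real model would rewrite the scaffold.
    return scaffold + [rng.choice(["retry", "plan", "run_tests"])]

def rollout(task, scaffold, rng):
    # Stage 2 stand-in (hypothetical): returns a toy reward for one solution attempt.
    # Toy rule: scaffolds that include "run_tests" score higher on average.
    base = 0.6 if "run_tests" in scaffold else 0.3
    return base + rng.uniform(-0.1, 0.1)

def group_relative_advantages(rewards, eps=1e-6):
    # GRPO-style normalization: score each sample against its own group.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def training_step(task, scaffold, group_size=8, seed=0):
    rng = random.Random(seed)
    samples = []
    for _ in range(group_size):
        new_scaffold = propose_scaffold(task, scaffold, rng)  # stage 1
        reward = rollout(task, new_scaffold, rng)             # stage 2
        samples.append((new_scaffold, reward))
    advantages = group_relative_advantages([r for _, r in samples])
    # In a real trainer, the same advantage would weight the log-probabilities
    # of both the scaffold proposal and the solution rollout.
    return [(s, r, a) for (s, r), a in zip(samples, advantages)]

results = training_step("fix the failing test", [])
```

The point of the sketch is the last comment: one reward, two things getting credit for it.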

Over training, higher-reward scaffolds get mutated and selected automatically. By ship time, it’s seen thousands of tasks and written thousands of scaffold variants. Keeps the ones that work.

Dumps the rest.
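The writeups don't spell out how the mutate-and-select part works, so treat this as a generic evolutionary loop, not Ornith's actual mechanism. The step names, `keep`, and `children_per_parent` are all assumptions for illustration.

```python
import random

STEPS = ["retry", "plan", "run_tests", "summarize_memory"]  # hypothetical scaffold steps

def mutate(scaffold, rng):
    # Either swap one existing step or append a new one.
    child = list(scaffold)
    if child and rng.random() < 0.5:
        child[rng.randrange(len(child))] = rng.choice(STEPS)
    else:
        child.append(rng.choice(STEPS))
    return child

def select_and_mutate(scored, keep=2, children_per_parent=2, seed=0):
    # scored: list of (scaffold, reward) pairs from earlier rollouts.
    rng = random.Random(seed)
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    survivors = [scaffold for scaffold, _ in ranked[:keep]]   # keep the ones that work
    children = [mutate(parent, rng)
                for parent in survivors
                for _ in range(children_per_parent)]          # vary the winners
    return survivors + children                               # the rest are dropped

scored = [(["plan"], 0.2), (["run_tests"], 0.9), (["retry"], 0.5), ([], 0.1)]
next_generation = select_and_mutate(scored)
```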

Here’s what that means practically. Most coding agents pair a model with a fixed scaffold, the harness for retries, memory, and tool calls, that someone wrote once and never touched again. You tune it once, ship it, and that’s your ceiling. Ornith flips that. The orchestration layer isn’t static anymore.

It’s a compounding advantage.

The model that writes its own scaffolding gets better at writing scaffolding.

That is not the same moat as more parameters. That’s something else entirely.

The Benchmarks Are Real — But Look at the Small Model First

Okay, the 397B flagship. 77.5 on Terminal-Bench 2.1. DeepReinforce reports it as state-of-the-art among open models at comparable scale.

Reviewers put it near closed-source territory on Terminal-Bench.

It doesn’t quite get there.

And larger models outperform it on some tasks. Fair.

But honestly, the 397B is a flex, not a product. You need serious iron to run it. For most people reading this, the interesting number is the 9B.

Beats larger models on most benchmarks. Ships in GGUF format. Runs on a single consumer GPU. About 19GB in bf16 (a figure reported on an 80GB card), which leaves little headroom on a 24GB card. Zero per-token cost once you’ve bought the hardware.
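The 19GB figure is easy to sanity-check. bf16 stores each weight in 2 bytes, so the weights alone come to about 18GB; the remaining gigabyte or so is presumably runtime overhead, though that split is my assumption. The 4-bit line is an idealized estimate, not a measured size for any published Ornith file.

```python
def weight_memory_gb(params_billion, bytes_per_param):
    # Decimal gigabytes: (params_billion * 1e9 params * bytes) / 1e9 bytes per GB.
    return params_billion * bytes_per_param

bf16_gb = weight_memory_gb(9, 2)    # 18.0 GB of raw weights at 16 bits each
q4_gb = weight_memory_gb(9, 0.5)    # 4.5 GB at an idealized 4 bits per weight
print(f"bf16: {bf16_gb} GB, 4-bit: {q4_gb} GB")
```

Real quantized files carry some extra metadata per block of weights, so expect them to land a bit above the idealized number.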

For a solo dev or small agency running local models, this is the comparison that matters. Not “how close to the largest models” but “can I run something competitive on my desk with no per-token bill.” Yes. At least for the 9B and 35B variants.

Side note: the 35B MoE variant activates roughly 3B parameters per token.

So per-step compute is way lower than the raw number suggests. If you’ve got a decent workstation, it’s within reach.
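Rough numbers, using the common rule of thumb that a forward pass costs about 2 FLOPs per active parameter per token. That rule is an approximation (it ignores attention over long contexts), and the 35B and 3B figures are the rounded ones quoted above.

```python
def forward_flops_per_token(active_params):
    # Rule of thumb: one multiply and one add per active weight.
    return 2 * active_params

dense_equivalent = forward_flops_per_token(35e9)  # if all 35B weights fired
moe_actual = forward_flops_per_token(3e9)         # roughly 3B active per token
ratio = moe_actual / dense_equivalent
print(f"MoE compute per token is about {ratio:.1%} of a dense 35B model")
```

One caveat the compute number hides: all 35B parameters still have to live somewhere, in VRAM or offloaded to system memory. The savings are in compute per token, not in memory footprint.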

MIT License Changes Everything for Small Shops

Ornith-1.0 checkpoints live on Hugging Face under the MIT license.

No usage restrictions. No commercial limitations. No API dependency. Fine-tune it. Serve it. Sell products built on top of it. Nobody sends you a compliance questionnaire or a bill at the end of the month.

Closed models improve and change pricing on someone else’s schedule. Open-weight models you control stay where they are until you decide to update them.

Ornith-1.0 at 9B or 35B gives you a coding agent you can run under your own roof, under a license that doesn’t require you to disclose your client work or your internal tooling.

That matters for agencies. That matters for anyone doing proprietary code.

What You Should Actually Do With This

Look, self-scaffolding is still new territory. The community hasn’t had time to figure out where it breaks down at scale. GRPO-based joint optimization looks solid on paper and on benchmarks. But the production failure modes, the weird edge cases, the things that only show up after a million tasks?

Those aren’t documented yet.

That’s the honest caveat.

File it.

That said, the direction is clear. If models that write their own orchestration logic become standard, the differentiator stops being the base model and becomes the quality of the RL-generated scaffolds. That’s learnable. That’s tunable. That’s infrastructure, not magic.

So here’s what I’d do. Go grab the 9B GGUF from Hugging Face tonight. Spin it up against your existing codebase. Throw it the tasks you actually have. The boring ones, not the cherry-picked benchmarks.
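If you want that to be more than vibes, a tiny harness helps. This is a minimal sketch: each model is just a function from prompt to text, so you can wrap a local runtime or a hosted API however you like. The two stub models and the toy tasks are placeholders, not real backends.

```python
from typing import Callable, Dict, List, Tuple

# A task is a prompt plus a checker that decides whether the answer is acceptable.
Task = Tuple[str, Callable[[str], bool]]

def pass_rates(models: Dict[str, Callable[[str], str]],
               tasks: List[Task]) -> Dict[str, float]:
    if not tasks:
        raise ValueError("need at least one task")
    results = {}
    for name, generate in models.items():
        passed = 0
        for prompt, check in tasks:
            try:
                if check(generate(prompt)):
                    passed += 1
            except Exception:
                pass  # a crash in the model call or the checker counts as a failure
        results[name] = passed / len(tasks)
    return results

# Placeholder models; swap in real calls to your local and hosted setups.
def local_stub(prompt: str) -> str:
    return prompt.upper()

def hosted_stub(prompt: str) -> str:
    return prompt

tasks = [
    ("hello", lambda out: out == "HELLO"),
    ("abc", lambda out: out.isalpha()),
]
rates = pass_rates({"local": local_stub, "hosted": hosted_stub}, tasks)
```

Write the checkers from your own backlog: does the patch apply, do the tests pass, does the output parse. That is the boring stuff that benchmarks skip.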

Compare it to whatever you’re paying for right now. The gap between open and closed models on coding tasks is narrowing fast. Ornith-1.0 is the clearest evidence yet that the gap won’t last forever.

Your move.