Ornith 35B writes its own training scaffold: I put the self-improving coding model to the test on a Mac Mini M4 Pro

June 30, 2026 • 6-minute read

Ornith • LocalLLM • Ollama • ClaudeCode • AppleSilicon • LLM

I spent the last few days running Ornith 1.0 (35B MoE) locally with Ollama and Claude Code. No cloud, no API keys. DeepReinforce claims it surpasses even Qwen3.5-397B on Terminal-Bench 2.1 (64.2 vs 53.5). That is a bold claim for a 35B model, and I wanted to check it for myself. Here are my first impressions, the good and the painful.

First impressions: fast, local, and properly agentic

Setup was genuinely easy. Install Ollama, run ollama pull ornith:35b, then ollama launch claude to wire it straight into Claude Code. That was all it took.

It runs fast because it’s a Mixture-of-Experts model. It has 35B parameters in total, but only about 3B of them activate per token, so only those weights have to be read from memory at each step. On the M4 Pro, token generation felt quick, though I have not measured it yet.
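To put rough numbers on that, here is a back-of-the-envelope sketch of how much weight data each generated token has to pull from memory. The 0.5 bytes per parameter figure is an assumption for a 4-bit quantization (real Q4 files carry some overhead), and the function name is only illustrative.

```shell
#!/usr/bin/env bash
# Rough weight traffic per generated token, in GB.
# usage: weights_gb_per_token <params_in_billions> <bytes_per_param>
# Assumption: ~0.5 bytes per parameter at 4-bit; real Q4 formats are a bit larger.
weights_gb_per_token() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b }'
}

weights_gb_per_token 3 0.5    # active parameters only: 1.5
weights_gb_per_token 35 0.5   # if all 35B were read every token: 17.5
```

Under those assumptions each token moves roughly a tenth of the data a dense 35B model would, which is where the speed comes from.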

The agentic loop worked well. With Claude Code it spun up sub-agents to research online sources, used the Playwright CLI to test the browser-based sites it had built, took screenshots, and ran tests. It was the same workflow I’d run with a frontier model like Opus, except it ran entirely on my own machine.

The context-window wall

Then I hit a wall, and it turned out to be a config problem rather than the model itself. Coding agents need a big context window, and Ollama’s default of 32K filled up almost immediately during planning. Even at 65,536 the worst case was a /goal run that went 7 hours with almost no useful progress. The input context had grown to roughly 65,400 tokens, which left only about 100 tokens of the window for output, so every turn crawled along and hit the output limit.
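The arithmetic behind that stall is simple enough to sketch. This assumes the run was at the 65,536-token setting, which is what the figures imply (65,400 tokens of input would not fit in a 32K window); the function name is only illustrative.

```shell
#!/usr/bin/env bash
# Tokens left for generation once the prompt is in the window.
# usage: headroom <context_window> <input_tokens>
headroom() {
  local window=$1 input=$2
  local left=$(( window - input ))
  # A prompt larger than the window leaves nothing at all.
  if (( left < 0 )); then
    left=0
  fi
  echo "$left"
}

headroom 65536 65400    # 136, i.e. the "about 100 tokens" above
headroom 131072 65400   # 65672 with the larger window
```

Doubling the window turns a hundred-odd tokens of room into tens of thousands, which is why the context setting mattered more than anything about the model.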

The fixes that turned it around:

OLLAMA_CONTEXT_LENGTH set to 65536, then 131072
OLLAMA_FLASH_ATTENTION=1 with OLLAMA_KV_CACHE_TYPE=q8_0
CLAUDE_CODE_MAX_OUTPUT_TOKENS set to 16384, keeping the output cap well below the context window
OLLAMA_KEEP_ALIVE=-1 to keep the model resident in RAM
Models stored on an external SSD

The RAM math surprised me in a good way. Going from 65K to 131K context at q8 KV cache cost only about 2.8 GB extra, or roughly 5.6 GB total for the KV cache. With about 24 GB for the model, the whole thing came to just under 30 GB, still under half of my 64 GB.
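That budget can be reproduced with a few lines of shell. This is a sketch that assumes the KV cache grows linearly with context length, using the 2.8 GB per 65,536 tokens and 24 GB model figures from the paragraph above; the function names are illustrative.

```shell
#!/usr/bin/env bash
# KV cache size in GB, assuming it grows linearly with context length.
# The slope (2.8 GB per 65,536 tokens at q8_0) is the figure quoted above.
kv_cache_gb() {
  awk -v t="$1" 'BEGIN { printf "%.1f\n", t / 65536 * 2.8 }'
}

# Model weights plus KV cache; 24 GB is the resident model size quoted above.
total_gb() {
  awk -v t="$1" 'BEGIN { printf "%.1f\n", 24 + t / 65536 * 2.8 }'
}

kv_cache_gb 65536     # 2.8
kv_cache_gb 131072    # 5.6
total_gb 131072       # 29.6
```

Plugging in other context lengths gives a quick estimate of whether a machine with less RAM can afford the larger window.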

Capability: real, but bounded

A basic tic-tac-toe game came out in one shot. But when I handed /goal a vague prompt for a voxel-based 3D game like Minecraft, the 35B model faltered and generated a game that was unusable. That’s expected of a model this size. Small, iterative tasks are where it does best.

A note on why I went with Ollama for now. LM Studio didn’t have an official Ornith model, but Ollama did, and ollama launch claude made the Claude Code wiring trivial. I may try llama.cpp next.

Why Ornith over other 35B models

The thing that sold me is that DeepReinforce benchmarked it through the Claude Code harness itself, the exact setup I’m running. On Terminal-Bench 2.1 measured through Claude Code, the 35B scores 62.8 against 38.9 for Qwen3.5-35B. On the standard Terminus-2 harness it’s 64.2 against Qwen3.5-35B’s 41.4 and Gemma4-31B’s 42.1, and it even beats Qwen3.5-397B (53.5) at a fraction of the size. The lead holds across the rest of the table too: SWE-Bench Verified 75.6 vs 70.0, SWE-Bench Pro 50.4 vs 44.6, SWE-Bench Multilingual 69.3 vs 60.3. At this size your real alternatives are Qwen3.5/3.6-35B and Gemma4-31B, and on DeepReinforce’s numbers it beats all of them (GLM-5.2 is a 744B model, not a same-size rival).

What’s actually new: it writes its own scaffold

The novelty is in how it was trained. It’s built on top of pretrained Gemma 4 and Qwen3.5. Most models are trained inside a fixed, human-designed harness. Ornith instead learns to write its own scaffold during reinforcement learning. Each RL step has two stages: the model first proposes a refined scaffold for the task, then uses that scaffold to generate the solution, and reward flows back to both. So it’s optimized not just to write better answers but to author the orchestration that produces them, and good per-task strategies emerge on their own without anyone hand-engineering them. DeepReinforce guards against the obvious reward hacking with three mechanisms: a fixed trust boundary the model can’t touch, a deterministic monitor that zeros out any run that reads withheld files or edits the tests, and a frozen LLM judge that can veto.

Two honest caveats

First, all of these numbers are self-reported by DeepReinforce, with independent verification still pending. Second, the benchmarks were run at full precision with very large context windows, so the Q4 build I’m running locally with a 65K to 131K window will likely land somewhat below the headline figures.

Bottom line

If you can run it, try the 35B. It’s fast, surprisingly capable, and fully local. And if you’re already on Qwen3.5-35B, switching is trivial, since Ornith is built right on top of Qwen3.5 (and Gemma 4).

One quick to-do for me: measure the actual tokens per second on Ollama and report back.

Run it yourself

Here is the full setup on a Mac (Apple Silicon, ideally with 64 GB of RAM so the model plus a large context fit comfortably).

Install Ollama, either from ollama.com or with Homebrew, then start it by opening the app or running ollama serve:

brew install ollama   # or download from ollama.com
ollama serve          # or just open the Ollama app

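Set the configuration before you run anything against the model. These are the values that made it stable for me. The exports below only reach Ollama if you start ollama serve from the same shell after setting them. The menu-bar app does not pick up shell exports, so for the app set the OLLAMA_ values in its Settings or with launchctl setenv, then restart Ollama. CLAUDE_CODE_MAX_OUTPUT_TOKENS is read by Claude Code rather than Ollama, so export it in the shell where you launch Claude Code: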

export OLLAMA_CONTEXT_LENGTH=131072      # large window for agentic coding; use 65536 for faster prompt processing
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0         # near-lossless, halves KV cache memory
export OLLAMA_KEEP_ALIVE=-1              # keep the model resident in RAM
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=16384   # keep the output cap well below the context window
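For the menu-bar app, the launchctl route mentioned above would look roughly like this. It is a sketch based on how Ollama’s documentation describes setting environment variables on macOS. OLLAMA_KEEP_ALIVE is left out because I have not verified how launchctl treats the leading dash in -1, so set that one in the app’s Settings instead.

```shell
# For the Ollama menu-bar app: set the server-side values with launchctl,
# then quit and reopen Ollama so it picks them up.
launchctl setenv OLLAMA_CONTEXT_LENGTH 131072
launchctl setenv OLLAMA_FLASH_ATTENTION 1
launchctl setenv OLLAMA_KV_CACHE_TYPE q8_0

# Confirm a value landed.
launchctl getenv OLLAMA_CONTEXT_LENGTH
```

As far as I know, values set this way do not survive a reboot, so they need to be reapplied after restarting the Mac.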

Pull the model (about 21 GB at Q4):

ollama pull ornith:35b

Smoke-test it directly, and check the speed:

ollama run ornith:35b --verbose "Write a Python function to debounce calls."

The eval rate line in the output is your tokens per second.
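To pull just that number out, a small filter helps. This sketch assumes the line has the shape "eval rate: <number> tokens/s" and that the stats are printed to stderr (hence the 2>&1 in the usage comment); the helper name is hypothetical.

```shell
#!/usr/bin/env bash
# Print only the generation speed from `ollama run --verbose` stats on stdin.
# Assumes a line shaped like "eval rate:   <number> tokens/s". The separate
# "prompt eval rate" line is skipped because it does not start with "eval".
eval_rate() {
  awk '/^[ \t]*eval rate:/ { print $3 }'
}

# Usage:
#   ollama run ornith:35b --verbose "Write a Python function to debounce calls." 2>&1 | eval_rate
```

The prompt eval rate measures how fast the input is processed, so it matters for large contexts, but the plain eval rate is the generation speed.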

Launch Claude Code wired to the local model. No keys or base URL to set by hand:

ollama launch claude --model ornith:35b

Test a real task. Open a small repo and ask it to make a scoped change that requires reading and editing a file, then running a command. Confirm it reads, edits, runs, and returns a coherent result, and check that memory pressure in Activity Monitor stays green.

Two things are worth repeating. Keep CLAUDE_CODE_MAX_OUTPUT_TOKENS well below OLLAMA_CONTEXT_LENGTH or you will starve generation. And do not select a giant-context model profile (like a 1M-context Opus profile) in Claude Code against a 65K to 131K local model, or it will never auto-compact and the context will overflow.
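A quick pre-flight check can catch the first of those mistakes before a long run. This is a minimal sketch: the function name is hypothetical, and treating "well below" as "at most a quarter of the window" is an arbitrary threshold of mine that both configurations in this post happen to satisfy.

```shell
#!/usr/bin/env bash
# Sanity-check the output cap against the context window before a long run.
# usage: check_output_cap <context_length> <max_output_tokens>
# Assumption: "well below" is taken to mean at most a quarter of the window.
check_output_cap() {
  local ctx=$1 cap=$2
  if (( cap * 4 <= ctx )); then
    echo "ok: output cap ${cap} fits in a ${ctx}-token window"
  else
    echo "warning: output cap ${cap} is too close to the ${ctx}-token window"
    return 1
  fi
}

# Falls back to the values used in this post when the variables are unset.
check_output_cap "${OLLAMA_CONTEXT_LENGTH:-131072}" "${CLAUDE_CODE_MAX_OUTPUT_TOKENS:-16384}"
```

Run it in the same shell you launch Claude Code from, so it sees the values that session will actually use.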