<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Ornith on Vijay Kodam</title><link>https://vijay.eu/tags/ornith/</link><description>Recent content in Ornith on Vijay Kodam</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Tue, 30 Jun 2026 18:38:33 +0300</lastBuildDate><atom:link href="https://vijay.eu/tags/ornith/index.xml" rel="self" type="application/rss+xml"/><item><title>Ornith 35B writes its own training scaffold: I put the self-improving coding model to the test on a Mac Mini M4 Pro</title><link>https://vijay.eu/posts/ornith-self-scaffolding-llm/</link><pubDate>Tue, 30 Jun 2026 18:38:33 +0300</pubDate><guid>https://vijay.eu/posts/ornith-self-scaffolding-llm/</guid><description>&lt;p&gt;I spent the last few days running &lt;strong&gt;Ornith 1.0 (35B MoE)&lt;/strong&gt; locally with &lt;strong&gt;Ollama and Claude Code&lt;/strong&gt;. No cloud, no API keys. DeepReinforce claims it surpasses even Qwen3.5-397B on Terminal-Bench 2.1 (64.2 vs 53.5). Those are tall claims for a 35B model, and I wanted to see whether they hold up. Here are my first impressions, the good and the painful.&lt;/p&gt;
&lt;h2 id="first-impressions-fast-local-and-properly-agentic"&gt;First impressions: fast, local, and properly agentic&lt;/h2&gt;
&lt;p&gt;Setup was genuinely easy. Install Ollama, run &lt;code&gt;ollama pull ornith:35b&lt;/code&gt;, then &lt;code&gt;ollama launch claude&lt;/code&gt; to wire it straight into Claude Code. That was all it took.&lt;/p&gt;
&lt;p&gt;It runs fast because it&amp;rsquo;s a Mixture-of-Experts model. It has 35B parameters in total, but only about 3B activate per token, so only those weights have to be read from memory at each step. On the M4 Pro, token generation felt quick.&lt;/p&gt;
&lt;p&gt;The agentic loop worked well. With Claude Code it spun up sub-agents to explore internet sources, tested browser-based sites it had built using the Playwright CLI, took screenshots, and ran tests. It was the same workflow I&amp;rsquo;d run with a frontier model like Opus, except it ran entirely on my own machine.&lt;/p&gt;
&lt;h2 id="the-context-window-wall"&gt;The context-window wall&lt;/h2&gt;
&lt;p&gt;Then I hit a wall, and it turned out to be a config problem rather than the model itself. Coding agents need a big context window, and Ollama&amp;rsquo;s default of 32K filled up immediately during planning. The worst case was a &lt;code&gt;/goal&lt;/code&gt; run that went 7 hours with almost no useful progress. The input context had grown to roughly 65,400 tokens, which left only about 100 tokens for output, so it crawled along and hit the output cap every turn.&lt;/p&gt;
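&lt;p&gt;The arithmetic behind that stall can be sketched in a few lines of shell. This is a simplified model, not how Ollama or Claude Code count tokens internally: it assumes the room left for a reply is the context window minus the input, capped by the output limit, and that the window during the stalled run was 65536. The function name &lt;code&gt;remaining_output&lt;/code&gt; is made up for illustration.&lt;/p&gt;

```bash
# remaining_output CTX INPUT CAP
# Hypothetical helper: tokens available for the next reply under the
# simplified model "window minus input, capped by the output limit".
remaining_output() {
  ctx=$1; input=$2; cap=$3
  left=$(( ctx - input ))
  if [ "$left" -lt 0 ]; then left=0; fi          # input already overflows the window
  if [ "$left" -gt "$cap" ]; then left=$cap; fi  # the output cap is the binding limit
  echo "$left"
}

remaining_output 65536 65400 16384    # the stalled /goal run: prints 136
remaining_output 131072 65400 16384   # same input, larger window: prints 16384
```

&lt;p&gt;With the same input, doubling the window moves the limit from what is left of the context to the output cap, which is the situation you want.&lt;/p&gt;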
&lt;p&gt;The fixes that turned it around:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;OLLAMA_CONTEXT_LENGTH&lt;/code&gt; set to 65536, then 131072&lt;/li&gt;
&lt;li&gt;&lt;code&gt;OLLAMA_FLASH_ATTENTION=1&lt;/code&gt; with &lt;code&gt;OLLAMA_KV_CACHE_TYPE=q8_0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CLAUDE_CODE_MAX_OUTPUT_TOKENS&lt;/code&gt; set to 16384, keeping the output cap well below the context window&lt;/li&gt;
&lt;li&gt;&lt;code&gt;OLLAMA_KEEP_ALIVE=-1&lt;/code&gt; to keep the model resident in RAM&lt;/li&gt;
&lt;li&gt;Models stored on an external SSD&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The RAM math surprised me in a good way. Going from 65K to 131K context at q8 KV cache cost only about 2.8 GB extra, or roughly 5.6 GB total for the KV cache. With about 24 GB for the model, the whole thing sat around 29 GB, comfortably under half of my 64 GB.&lt;/p&gt;
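&lt;p&gt;As a rough check on those numbers, here is the same estimate in shell. It assumes the KV cache grows linearly with context length, which matches the 2.8 GB per 65,536 tokens reported above. The variable names are mine, and sizes are kept in tenths of a GB so plain integer arithmetic works.&lt;/p&gt;

```bash
# All sizes are in tenths of a GB so the shell can use integer arithmetic.
model=240        # about 24 GB for the weights (figure from this post)
kv_per_64k=28    # about 2.8 GB of q8_0 KV cache per 65536 tokens (figure from this post)
ctx=131072       # target context window

# Assumption: KV cache size scales linearly with the context length.
kv=$(( kv_per_64k * ctx / 65536 ))
total=$(( model + kv ))

printf 'kv=%d.%d GB total=%d.%d GB\n' \
  $(( kv / 10 )) $(( kv % 10 )) $(( total / 10 )) $(( total % 10 ))
# prints: kv=5.6 GB total=29.6 GB
```

&lt;p&gt;Change &lt;code&gt;ctx&lt;/code&gt; to 65536 to see the smaller footprint before committing to the larger window.&lt;/p&gt;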
&lt;h2 id="capability-real-but-bounded"&gt;Capability: real, but bounded&lt;/h2&gt;
&lt;p&gt;A basic tic-tac-toe game came out in one shot. But when I asked for a voxel-based 3D game like Minecraft with a vague prompt and handed that to &lt;code&gt;/goal&lt;/code&gt;, the 35B model faltered and produced a game that was unusable. That&amp;rsquo;s expected of a model this size. Small, iterative tasks are where it does best.&lt;/p&gt;
&lt;p&gt;A note on why I went with Ollama for now. LM Studio didn&amp;rsquo;t have an official Ornith model, but Ollama did, and &lt;code&gt;ollama launch claude&lt;/code&gt; made the Claude Code wiring trivial. I may try llama.cpp next.&lt;/p&gt;
&lt;h2 id="why-ornith-over-other-35b-models"&gt;Why Ornith over other 35B models&lt;/h2&gt;
&lt;p&gt;Why Ornith and not another 35B model? The thing that sold me is that DeepReinforce benchmarked it through the Claude Code harness itself, the exact setup I&amp;rsquo;m running. On Terminal-Bench 2.1 measured through Claude Code, the 35B scores 62.8 against 38.9 for Qwen3.5-35B. On the standard Terminus-2 harness it&amp;rsquo;s 64.2 against Qwen3.5-35B&amp;rsquo;s 41.4 and Gemma4-31B&amp;rsquo;s 42.1, and it even edges past Qwen3.5 at 397B (53.5) while being a fraction of the size. The lead holds across the rest of the table too: SWE-Bench Verified 75.6 vs 70.0, SWE-Bench Pro 50.4 vs 44.6, SWE-Bench Multilingual 69.3 vs 60.3. At this size your real alternatives are Qwen3.5/3.6-35B and Gemma4-31B, and it beats all of them (GLM-5.2 is a 744B model, not a same-size rival).&lt;/p&gt;
&lt;h2 id="whats-actually-new-it-writes-its-own-scaffold"&gt;What&amp;rsquo;s actually new: it writes its own scaffold&lt;/h2&gt;
&lt;p&gt;What&amp;rsquo;s actually new is how it was trained. It&amp;rsquo;s built on top of pretrained Gemma 4 and Qwen3.5. Most models are trained inside a fixed, human-designed harness. Ornith instead learns to write its own scaffold during reinforcement learning. Each RL step has two stages: the model first proposes a refined scaffold for the task, then uses that scaffold to generate the solution, and reward flows back to both. So it&amp;rsquo;s optimized not just to write better answers but to author the orchestration that produces them, and good per-task strategies emerge on their own without anyone hand-engineering them. DeepReinforce guards against the obvious reward-hacking with a fixed trust boundary the model can&amp;rsquo;t touch, a deterministic monitor that zeros out any run that reads withheld files or edits the tests, and a frozen LLM judge that can veto.&lt;/p&gt;
&lt;h2 id="two-honest-caveats"&gt;Two honest caveats&lt;/h2&gt;
&lt;p&gt;First, all of these numbers are self-reported by DeepReinforce, with independent verification still pending. Second, the benchmarks were run at full precision with very large context windows, so the Q4 build I&amp;rsquo;m running locally with a 65K to 131K window will likely land somewhat below the headline figures.&lt;/p&gt;
&lt;h2 id="bottom-line"&gt;Bottom line&lt;/h2&gt;
&lt;p&gt;To sum up: if you can run it, try the 35B. It&amp;rsquo;s fast, surprisingly capable, and fully local. And if you&amp;rsquo;re already on Qwen3.5-35B, switching is trivial, since Ornith is built right on top of Qwen3.5 (and Gemma 4).&lt;/p&gt;
&lt;p&gt;One quick to-do for me: measure the actual tokens per second on Ollama and report back.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="run-it-yourself"&gt;Run it yourself&lt;/h2&gt;
&lt;p&gt;Want to run it yourself? Here is the full setup on a Mac (Apple Silicon, ideally 64 GB so the model plus a large context fit comfortably).&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Install Ollama. Download it from ollama.com, or with Homebrew run &lt;code&gt;brew install ollama&lt;/code&gt;. Start it (open the app, or run &lt;code&gt;ollama serve&lt;/code&gt;).&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;brew install ollama &lt;span style="color:#75715e"&gt;# or download from ollama.com&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ollama serve &lt;span style="color:#75715e"&gt;# or just open the Ollama app&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ol start="2"&gt;
&lt;li&gt;Set the configuration before pulling. These are the values that made it stable for me. The menu-bar app does not pick up shell &lt;code&gt;export&lt;/code&gt;s, so either set them in the app&amp;rsquo;s Settings or use &lt;code&gt;launchctl setenv&lt;/code&gt;, then restart Ollama:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;export OLLAMA_CONTEXT_LENGTH&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;131072&lt;/span&gt; &lt;span style="color:#75715e"&gt;# large window for agentic coding; use 65536 for faster prompt processing&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;export OLLAMA_FLASH_ATTENTION&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;export OLLAMA_KV_CACHE_TYPE&lt;span style="color:#f92672"&gt;=&lt;/span&gt;q8_0 &lt;span style="color:#75715e"&gt;# near-lossless, halves KV cache memory&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;export OLLAMA_KEEP_ALIVE&lt;span style="color:#f92672"&gt;=&lt;/span&gt;-1 &lt;span style="color:#75715e"&gt;# keep the model resident in RAM&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;export CLAUDE_CODE_MAX_OUTPUT_TOKENS&lt;span style="color:#f92672"&gt;=&lt;/span&gt;&lt;span style="color:#ae81ff"&gt;16384&lt;/span&gt; &lt;span style="color:#75715e"&gt;# keep the output cap well below the context window&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ol start="3"&gt;
&lt;li&gt;Pull the model (about 21 GB at Q4):&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ollama pull ornith:35b
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ol start="4"&gt;
&lt;li&gt;Smoke-test it directly, and check the speed:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ollama run ornith:35b --verbose &lt;span style="color:#e6db74"&gt;&amp;#34;Write a Python function to debounce calls.&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The &lt;code&gt;eval rate&lt;/code&gt; line in the output is your tokens per second.&lt;/p&gt;
&lt;ol start="5"&gt;
&lt;li&gt;Launch Claude Code wired to the local model. No keys or base URL to set by hand:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"&gt;&lt;code class="language-bash" data-lang="bash"&gt;&lt;span style="display:flex;"&gt;&lt;span&gt;ollama launch claude --model ornith:35b
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ol start="6"&gt;
&lt;li&gt;Test a real task. Open a small repo and ask it to make a scoped change that requires reading and editing a file, then running a command. Confirm it reads, edits, runs, and returns a coherent result. Watch Activity Monitor memory pressure stay green.&lt;/li&gt;
&lt;/ol&gt;
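&lt;p&gt;For the menu-bar app mentioned in step 2, the &lt;code&gt;launchctl setenv&lt;/code&gt; route might look like the sketch below. It only covers the Ollama-side variables. &lt;code&gt;CLAUDE_CODE_MAX_OUTPUT_TOKENS&lt;/code&gt; is read by Claude Code, so I am assuming a plain &lt;code&gt;export&lt;/code&gt; in the shell that launches it is enough.&lt;/p&gt;

```bash
# Assumption: Ollama runs as the macOS menu-bar app, which does not see shell exports.
launchctl setenv OLLAMA_CONTEXT_LENGTH 131072
launchctl setenv OLLAMA_FLASH_ATTENTION 1
launchctl setenv OLLAMA_KV_CACHE_TYPE q8_0
# If launchctl rejects the leading dash, set this one in the app's Settings instead.
launchctl setenv OLLAMA_KEEP_ALIVE -1
# Quit and reopen the Ollama app so it picks up the new values.
```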
&lt;p&gt;Two last warnings: keep &lt;code&gt;CLAUDE_CODE_MAX_OUTPUT_TOKENS&lt;/code&gt; well below &lt;code&gt;OLLAMA_CONTEXT_LENGTH&lt;/code&gt; or you will starve generation, and do not select a giant-context model profile (like a 1M-context Opus profile) in Claude Code against a 65K to 131K local model, or it will never auto-compact and the context will overflow.&lt;/p&gt;</description></item></channel></rss>