I Ran an 80B Coding Model Locally with Claude Code. It Took 1 Hour Instead of 9 Minutes. Here's What Was Wrong.

The Dream Setup

I finally got Claude Code running with my own local LLM. Everything local, everything private, running on my Mac Mini M4 Pro. No API keys, no cloud costs, no data leaving my network. Just me, my hardware, and a 79.7 billion parameter coding model.

But something was very wrong. A simple task (scaffolding a Django REST API with JWT authentication) took just over an hour, about 66 minutes of wall time.

At 23 tokens per second, the ~13,000 tokens of output should have taken about 9 minutes. Where did the other 57 minutes go?

This post walks through how I diagnosed the problem, what I learned about how LLM inference actually works, and the specific fixes that brought the time down significantly. If you’re running Claude Code with Ollama (or any local LLM), this will save you a lot of frustration.

My Setup

  • Hardware: Mac Mini M4 Pro, 64GB unified memory
  • Model: qwen3-coder-next:q4_K_M, a 79.7B parameter coding model, quantized to Q4_K_M (~48 GB on disk, ~54 GB in GPU memory)
  • Inference server: Ollama, running on the Mac Mini
  • Client: Claude Code v2.1.42, connecting to Ollama from another Mac on the local network
  • Launch command:
    ANTHROPIC_AUTH_TOKEN=ollama \
    ANTHROPIC_BASE_URL=http://192.168.10.1:11434 \
    ANTHROPIC_API_KEY="" \
    claude --model qwen3-coder-next:q4_K_M
    

Raw benchmarks looked promising. When I tested the model directly via Ollama’s API:

  • Token generation: ~23 tokens/second
  • Prompt evaluation: ~20 tokens/second

Those numbers aren’t blazing fast, but they’re perfectly usable. So why was the full Claude Code session so painfully slow?

The Problem

The numbers told a clear story:

| Metric | Value |
| --- | --- |
| Total wall time | ~66 minutes |
| Output tokens | ~13,000 |
| Generation speed | 23 tok/s |
| Expected generation time | ~9 minutes |
| Unexplained overhead | ~57 minutes |
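
The two derived rows come straight from the arithmetic:

13,000 tokens ÷ 23 tok/s ≈ 565 seconds ≈ 9.4 minutes of expected generation time
66 minutes − 9.4 minutes ≈ 57 minutes of overhead (≈ 86% of the session)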

Nearly 90% of the time was spent doing something other than generating tokens. But what?

How I Found the Bottleneck

Claude Code writes debug logs to ~/.claude/debug/. Each session gets its own log file. I opened up the log for my slow session (f519f739-d0f9-41b1-ab4c-a7e01a0e447e.txt, 395 KB of trace data) and started looking at timestamps.

The key was comparing two events for each API call:

  1. API request sent: when Claude Code sends a request to Ollama
  2. Stream started: when the first token comes back

The gap between these two timestamps is the prompt evaluation time, the model “reading” all the input before it can start “writing” a response.
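
If you want to repeat this on your own session, a rough starting point is below. The grep patterns are only illustrative; the exact wording of these events in the debug log varies between Claude Code versions, so adjust them to whatever your log actually contains.

LOG=~/.claude/debug/f519f739-d0f9-41b1-ab4c-a7e01a0e447e.txt
# List timestamped lines around each API call, then compare the gap between
# the "request sent" and "stream started" events by hand.
grep -inE "request|stream" "$LOG" | less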

Here’s what I found:

| Request | Wait Time | Input Tokens |
| --- | --- | --- |
| First call (cold start) | 77 seconds | 121 |
| Call at 17K tokens | 15 seconds | 17,173 |
| Mid-session | 66 seconds | ~25,000 |
| Late session | 132 seconds | ~30,000 |
| Near the end | 194 seconds | ~34,566 |

The wait times were growing because the input context was growing, and it was growing far beyond what Ollama’s configured context window could actually hold. More on that in a moment.

With 57 API round-trips in a single session, each one incurring 15–194 seconds of prompt evaluation overhead, the math becomes obvious: most of the hour was spent waiting for the model to re-read the conversation history before generating each response.

[Image: optimizingllms (credits: generated using NotebookLM)]

Understanding What Happens During Inference

Before getting into the fixes, it helps to understand what happens when a local LLM processes your request. I’ll use analogies here. For the full technical breakdown with formulas and memory calculations, see my companion post: LLM Inference Internals: KV Cache, Flash Attention, and Optimizing for Apple Silicon.

Two Phases: Reading vs Writing

LLM inference has two distinct phases:

  1. Prompt evaluation (prefill): The model reads the entire input (system prompt, conversation history, tool definitions, your latest message). This is like reading an entire book before you can answer a question about it. It processes all tokens in parallel, but it still takes time proportional to how much there is to read.

  2. Token generation (decode): The model writes its response, one token at a time. This is the “thinking and typing” phase. Each new token depends on everything that came before, so it can only produce one at a time.

The generation speed (23 tok/s) is what you see as the response streams in. The prompt evaluation (20 tok/s) happens silently before any text appears. It’s the “thinking pause” before the model starts responding.
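
You can see both phases in Ollama’s own response metrics. Here is a quick check, assuming Ollama is running on localhost:11434 and jq is installed: prompt_eval_count and prompt_eval_duration describe the prefill ("reading") phase, eval_count and eval_duration describe generation ("writing"), and durations are reported in nanoseconds.

curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen3-coder-next:q4_K_M",
  "prompt": "Write a Python function that reverses a string.",
  "stream": false
}' | jq '{prompt_eval_count, prompt_eval_duration, eval_count, eval_duration}'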

The KV Cache: Bookmarks for the Model

When the model reads your input during prompt evaluation, it computes something called the KV cache (Key-Value cache). Think of these as bookmarks. They capture what the model “understood” about each part of the text, so it doesn’t have to re-read everything for every single token it generates.

Without the KV cache, generating a 500-token response would require re-reading the entire input 500 times. With it, the model reads once and then references its bookmarks.

The catch: the KV cache takes memory, and it grows linearly with the length of the conversation. A 32K token context needs roughly twice the KV cache memory of a 16K token context.
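
As a rule of thumb (the exact per-token cost depends on the model’s attention layer count, KV head count, head dimension, and cache precision):

KV cache bytes ≈ 2 (K and V) × n_layers × n_kv_heads × head_dim × bytes_per_element × context_tokens

Only context_tokens changes as the conversation grows, which is where that linear scaling comes from.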

Context Window: The Model’s Reading Capacity

The context window is how much total text the model can consider at once (input and output combined). It’s like the size of the desk the model is working at. Bigger desk means more documents can be open at once, but it also means more memory used.

My model’s native context length is 262K tokens, but Ollama was configured with a 16K context window (num_ctx: 16384). That’s the actual working size.
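
Both numbers are easy to check: ollama show prints the model’s native context length, and the --parameters flag shows any num_ctx baked into the Modelfile (the effective value can also come from the API request or, for the Mac app, the Settings pane).

ollama show qwen3-coder-next:q4_K_M                 # model info, including native context length
ollama show qwen3-coder-next:q4_K_M --parameters    # any num_ctx set in the Modelfile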

Flash Attention: Speed-Reading

Flash Attention is an optimization for how the model processes attention (the mechanism that relates different parts of the text to each other). Standard attention requires memory proportional to the square of the sequence length. Flash Attention brings that down to linear by processing the attention in small tiles rather than all at once.

Think of it as the difference between spreading out every page of a book on the floor to compare them all simultaneously (standard attention) versus reading through the book chapter by chapter, keeping only the current chapter in front of you (Flash Attention).
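
To put rough numbers on that difference: at the ~34,566-token contexts this session reached, the full attention score matrix would be

34,566² ≈ 1.2 × 10⁹ entries per head per layer ≈ 2.4 GB in fp16, if materialized in one piece

Flash Attention never builds that matrix; it streams over small tiles and keeps only running statistics, so the extra memory stays proportional to the sequence length.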

Autocompaction: Summarizing Old Messages

Claude Code has a built-in mechanism called autocompaction. When the conversation gets too long for the context window, it summarizes older messages to free up space. This keeps the conversation going without hitting the context limit.

The trigger point is based on what Claude Code thinks the context window size is. This turns out to be very important, as we’ll see.

Prompt Caching: Reusing the Bookmarks

When Ollama receives a new request where the beginning of the prompt is identical to the previous request (which is typical in a conversation, since the history stays the same and you just add a new message at the end), it can reuse the KV cache from the previous request. Only the new tokens at the end need to be evaluated.

This is a big optimization, but only works if the model stays loaded in memory and the context hasn’t been evicted.
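
You can watch this happen from the command line. A minimal sketch, assuming the model stays loaded between the two calls: send the same prompt twice and compare prompt_eval_count. The second request should evaluate far fewer tokens because the shared prefix is served from the KV cache.

PROMPT="Explain what a KV cache is in two sentences."
for run in 1 2; do
  curl -s http://localhost:11434/api/generate -d "{
    \"model\": \"qwen3-coder-next:q4_K_M\",
    \"prompt\": \"$PROMPT\",
    \"stream\": false
  }" | jq "{run: $run, prompt_eval_count, prompt_eval_duration}"
done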

The Five Root Causes

Armed with an understanding of inference, I could now identify exactly what was going wrong.

1. Prompt Evaluation Dominated Every Request

Each of the 57 API round-trips paid a prompt evaluation cost. Even with KV cache reuse, the model had to evaluate the new portion of the prompt each time. As the conversation grew, these costs escalated from seconds early in the session to over three minutes per request near the end.

Total prompt evaluation time across 57 calls: ~50 minutes out of the 66-minute session, an average of roughly 53 seconds per call.

2. Context Window Mismatch

This was the biggest problem. Claude Code was configured to think the model had a 180,000 token context window (its default assumption when it doesn’t recognize the model). The autocompaction threshold was set at 167,000 tokens:

autocompact: tokens=34566 threshold=167000 effectiveWindow=180000

But Ollama’s actual context window was only 16,384 tokens. This meant:

  • Claude Code never triggered autocompaction (it thought there was tons of room)
  • The conversation context ballooned to 34,566 tokens, more than 2x what Ollama could actually fit
  • Ollama was silently truncating or struggling with context that exceeded its window
  • Each prompt evaluation was processing far more tokens than necessary

3. Fifty-Seven API Round-Trips

Claude Code is an agentic system. It doesn’t just answer once. It plans, calls tools (file reads, writes, bash commands), observes results, and iterates. For my Django project, this meant 57 separate API calls in one session. Each one triggered a new prompt evaluation cycle.

This is normal behavior for Claude Code, but it means prompt evaluation speed is the dominant factor in session time, not token generation speed.

4. Failed Haiku Calls

After nearly every API call, Claude Code tried to count tokens by calling claude-haiku-4-5-20251001, a cheap, fast model used for token counting in the normal Claude API setup. But this model doesn’t exist in Ollama:

countTokensWithFallback: haiku fallback failed: 404
{"type":"error","error":{"type":"not_found_error",
"message":"model 'claude-haiku-4-5-20251001' not found"}}

Each failed call involved a streaming attempt (404), then a non-streaming fallback (also 404). That’s two wasted HTTP round-trips per token counting attempt, happening dozens of times per session.

5. Streaming 404 Fallback Pattern

For each haiku call, the sequence was:

  1. Try streaming endpoint → 404
  2. Fall back to non-streaming endpoint → also 404
  3. Give up and continue without token count

This wasn’t a big deal on its own, but multiplied across 57+ attempts, it added up to unnecessary network overhead and wasted time.

The Fixes

Ollama Server Optimizations

If you’re running the Ollama Mac app, set these environment variables using launchctl setenv (and create a LaunchAgent plist to make them persist across reboots):

launchctl setenv OLLAMA_FLASH_ATTENTION 1
launchctl setenv OLLAMA_KV_CACHE_TYPE q8_0
launchctl setenv OLLAMA_KEEP_ALIVE 60m
launchctl setenv OLLAMA_NUM_BATCH 2048
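
To make those settings survive a reboot, one option is a LaunchAgent that re-runs the setenv calls at login. This is a hypothetical sketch (the label and file name are my own choices, not anything Ollama ships):

cat > ~/Library/LaunchAgents/com.user.ollama-env.plist <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>com.user.ollama-env</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/sh</string><string>-c</string>
    <string>launchctl setenv OLLAMA_FLASH_ATTENTION 1; launchctl setenv OLLAMA_KV_CACHE_TYPE q8_0; launchctl setenv OLLAMA_KEEP_ALIVE 60m; launchctl setenv OLLAMA_NUM_BATCH 2048</string>
  </array>
  <key>RunAtLoad</key><true/>
</dict>
</plist>
EOF
launchctl load ~/Library/LaunchAgents/com.user.ollama-env.plist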

If you’re running ollama serve from the command line instead, set them as environment variables:

OLLAMA_FLASH_ATTENTION=1 \
OLLAMA_KV_CACHE_TYPE=q8_0 \
OLLAMA_KEEP_ALIVE=60m \
OLLAMA_NUM_BATCH=2048 \
ollama serve

What each one does:

  • OLLAMA_FLASH_ATTENTION=1: Enables Flash Attention, reducing attention memory from O(n²) to O(n) and speeding up prompt evaluation
  • OLLAMA_KV_CACHE_TYPE=q8_0: Quantizes the KV cache to 8-bit, cutting KV cache memory roughly in half with minimal quality loss. This frees up room for larger context
  • OLLAMA_KEEP_ALIVE=60m: Keeps the model loaded in memory for 60 minutes between requests. Prevents expensive cold-start reloads and enables KV cache reuse
  • OLLAMA_NUM_BATCH=2048: Increases the prompt evaluation batch size from the default 512 to 2048. Processes more tokens per forward pass, speeding up prompt eval by roughly 2-3x. Note: this env var is not in the official Ollama docs but is a known undocumented setting that may change in future versions

Also set Context Length to 262144 (256K) in the Ollama app settings (menu bar icon > Settings). This was the single most impactful fix. With 256K context, Claude Code’s default autocompaction threshold (~171K tokens) works correctly without any overrides.

RAM usage with qwen3-coder-next at 256K context and q8_0 KV cache settles at ~60 GB on a 64 GB Mac. This is possible because qwen3-coder-next uses a hybrid DeltaNet architecture where only 12 of 48 layers need a traditional KV cache, so 256K context uses only ~3 GB of KV cache memory.

Then quit and reopen Ollama, and warm the model:

curl http://localhost:11434/api/generate -d '{
  "model": "qwen3-coder-next:q4_K_M",
  "keep_alive": "60m"
}'

Note: Do not specify num_ctx in API calls when the Ollama Mac app GUI already has Context Length configured. Setting it in both places causes the model to fail to load.

Claude Code Client Optimizations

With Ollama context set to 256K, the launch command is straightforward:

ANTHROPIC_AUTH_TOKEN=ollama \
ANTHROPIC_BASE_URL=http://192.168.10.1:11434 \
ANTHROPIC_API_KEY="" \
CLAUDE_CODE_EFFORT_LEVEL=low \
ANTHROPIC_DEFAULT_HAIKU_MODEL=qwen3-coder-next:q4_K_M \
claude --model qwen3-coder-next:q4_K_M

No CLAUDE_AUTOCOMPACT_PCT_OVERRIDE is needed with 256K context. Claude Code’s default autocompaction at ~171K tokens works correctly within the 256K window.

CLAUDE_CODE_EFFORT_LEVEL=low reduces the system prompt size and makes the model less verbose, leaving more of the context window for actual conversation and directly reducing prompt eval time.

ANTHROPIC_DEFAULT_HAIKU_MODEL=qwen3-coder-next:q4_K_M fixes the Haiku 404 errors (root cause #4 above). Claude Code calls claude-haiku-4-5-20251001 for background tasks like token counting. Ollama doesn’t have this model, so every call fails with a 404, wasting two HTTP round-trips each time. Pointing it to the same Qwen model eliminates these errors.

If you can’t use 256K context (smaller machines)

If your machine doesn’t have enough RAM for 256K context, use CLAUDE_AUTOCOMPACT_PCT_OVERRIDE to force earlier compaction. Add it to the launch command above.

Formula: CLAUDE_AUTOCOMPACT_PCT_OVERRIDE = (ollama_num_ctx / 180000) * 100

| Ollama num_ctx | CLAUDE_AUTOCOMPACT_PCT_OVERRIDE | Compaction triggers at | q8_0 KV cache (qwen3-coder-next) |
| --- | --- | --- | --- |
| 32,768 (32K) | 18 | ~32K tokens | 384 MB |
| 65,536 (64K) | 36 | ~65K tokens | 768 MB |
| 131,072 (128K) | 73 | ~131K tokens | 1.5 GB |
| 262,144 (256K) | not needed (use default) | ~171K tokens | 3.0 GB |

Claude Code hardcodes effectiveWindow=180000 for unknown models. Without the override, it won’t compact until 167K tokens, far beyond what smaller context windows can handle.
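
A quick shell check of the formula against the table above (integer division, so the result rounds down):

NUM_CTX=65536
echo $(( NUM_CTX * 100 / 180000 ))    # prints 36, matching the 64K row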

A note on auto-compaction with non-Claude models: Compaction may fail with local models at smaller context sizes. The model sometimes responds with tool calls instead of a text summary (GitHub #5778, #18168). If this happens, disable auto-compaction with claude config set -g autoCompactEnabled false (stored in ~/.claude.json) and use /clear to start fresh when you hit the context limit. At 256K context, this was not observed.

Watch out for hallucinated env vars: AI assistants (including the one that helped write this post) commonly suggest env vars that don’t exist in Claude Code. I initially used ANTHROPIC_MODEL_CONTEXT_LENGTH and CLAUDE_CODE_MAX_TURNS, neither of which is real. Always verify against the official Claude Code settings docs and model configuration docs.

Inference Tools for Apple Silicon

If you’re running local LLMs on a Mac, there are several inference engines to consider:

| Engine | Backend | Apple Silicon Optimized | Best For |
| --- | --- | --- | --- |
| Ollama | llama.cpp + Metal | Yes | General use, Claude Code integration |
| llama.cpp | Native + Metal | Yes | Maximum control, latest features |
| MLX / mlx-lm | Apple ML framework | Best (native Metal) | Maximum Apple Silicon performance |
| LM Studio | llama.cpp | Yes | GUI-based experimentation |
| vLLM | CUDA-based | No (CPU fallback) | NVIDIA GPU servers only |

Ollama is the easiest path for Claude Code since it provides the Anthropic-compatible API that Claude Code expects. MLX is potentially the fastest on Apple Silicon (native Metal optimization with better memory access patterns), but requires model conversion and an API proxy.

For a detailed technical breakdown of how these engines work internally (KV cache memory math, Flash Attention algorithms, and Apple Silicon memory architecture), read my companion post: LLM Inference Internals: KV Cache, Flash Attention, and Optimizing for Apple Silicon.

Results and Conclusion

After setting Ollama context to 256K and applying the other optimizations, the slowness was resolved. RAM usage settled at ~60 GB on the 64 GB Mac Mini.

| Metric | Before | After |
| --- | --- | --- |
| Ollama context | 16,384 | 262,144 (256K) |
| Autocompaction | Never triggered (threshold 167K) | Works correctly (threshold ~171K, within 256K window) |
| Context window used | Up to 34K+ (silently truncated) | Managed by autocompaction within 256K |
| Prompt eval per call | 15-194 seconds | Significantly reduced |
| Batch size | 512 (default) | 2048 (~2-3x faster eval) |
| Flash Attention | Off | On |
| KV cache type | fp16 | q8_0 (3 GB for 256K context) |
| RAM usage | N/A | ~60 GB |

The single biggest fix was increasing the Ollama context to 256K. This eliminated the context window mismatch between Claude Code (expects 180K) and Ollama (was set to only 16K). With 256K context, Claude Code’s built-in autocompaction works as designed without any overrides. No compaction failures were observed at 256K.

A 79.7B model running locally at 20 tok/s prompt eval on M4 Pro will always be slower than cloud APIs. Each agentic round-trip takes seconds for prompt evaluation. But the difference between “painfully slow” and “totally usable” comes down to configuration. The model was fine. The wiring between Claude Code and Ollama just needed tuning.

Running local is worth it for privacy and cost. You just need to understand what’s happening under the hood.

Follow me

If you are new to my posts, I regularly post about AI, LLMs, AWS, EKS, Kubernetes and Cloud computing related topics. Do follow me on LinkedIn and visit my dev.to posts. You can find all my previous blog posts on my blog.