Ollama + Claude Code = 99% CHEAPER

Nate Herk | AI Automation · 25m 23s · 20 sources

Decision Card

Effort: One evening (~1–2 hours) to install Ollama, pull a ~7–9B model, and run ollama launch claude; or ~30 minutes for the OpenRouter route (edit .claude/settings.local.json, create an API key, load a one-time $10 credit).

Honest take: The “99% cheaper / free” framing is undercut by the video’s own footage. The local 9B Qwen model took 4 minutes to answer one question and needed a manual context-window fix to work at all, and the “free” OpenRouter config silently billed paid Anthropic Haiku for tool calls until the creator overrode every model env var. Both paths also require $5–10 upfront, which he concedes with “there’s really no such thing as free.”

Concrete next steps:

Set up the OpenRouter route using the official guide at https://openrouter.ai/docs/cookbook/coding-agents/claude-code-integration — ~30 min, plus a one-time $10 credit to raise the free-model cap from 50 to 1,000 requests/day.
After the first session, audit the OpenRouter activity log for paid Haiku/Sonnet fallback calls; confirm every model env var (including the small/fast model) points at a :free model — ~10 min.
If trying local: install Ollama, pull a model sized to your RAM/VRAM, and raise the context window to 64k+ per https://docs.ollama.com/context-length before judging quality — ~1 hour.
Skip if you already have a Claude subscription and rarely hit limits — a 9B-class local model is dramatically slower and weaker than Sonnet/Opus, and the savings won’t repay the fiddling for heavy production work.
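
The first two steps above can be sketched in Python as a settings generator plus a quick audit. This is a minimal sketch, not the creator's exact config: the base URL, the auth-token variable, and the empty API key come from the OpenRouter guide cited in this summary, while the list of model variables is an assumption that should be checked against the current Claude Code documentation, and the model slug is a hypothetical placeholder.

```python
import json

# Base URL as given in the OpenRouter guide cited in this summary.
OPENROUTER_BASE_URL = "https://openrouter.ai/api"

# ASSUMPTION: these are the model-selection variables Claude Code reads.
# Verify the names against the current Claude Code docs before relying on them.
MODEL_VARS = [
    "ANTHROPIC_MODEL",
    "ANTHROPIC_DEFAULT_OPUS_MODEL",
    "ANTHROPIC_DEFAULT_SONNET_MODEL",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL",
    "ANTHROPIC_SMALL_FAST_MODEL",
    "CLAUDE_CODE_SUBAGENT_MODEL",
]


def build_settings(api_key: str, free_model: str) -> dict:
    """Build the contents of .claude/settings.local.json for the OpenRouter route."""
    env = {
        "ANTHROPIC_BASE_URL": OPENROUTER_BASE_URL,
        "ANTHROPIC_AUTH_TOKEN": api_key,
        "ANTHROPIC_API_KEY": "",  # explicitly empty, per the OpenRouter guide
    }
    # Point every model variable at the same free model so nothing falls
    # back to a paid Anthropic model.
    for name in MODEL_VARS:
        env[name] = free_model
    return {"env": env}


def non_free_model_vars(settings: dict) -> list[str]:
    """Return the model variables that are missing or not set to a ':free' model."""
    env = settings.get("env", {})
    return [name for name in MODEL_VARS if not env.get(name, "").endswith(":free")]


if __name__ == "__main__":
    # Hypothetical key and model slug; substitute real values.
    settings = build_settings("sk-or-REPLACE_ME", "example-vendor/example-model:free")
    print(json.dumps(settings, indent=2))
    print("Non-free model vars:", non_free_model_vars(settings))
```

Running non_free_model_vars on an existing settings file is a cheap local check, but it does not replace reading the OpenRouter activity log, which is the only place actual billing shows up.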

TL;DR

The video shows two ways to run the Claude Code harness on non-Anthropic models: locally via Ollama (ollama launch claude against a downloaded open-weight model) and via OpenRouter’s free model tier (pointing ANTHROPIC_BASE_URL at OpenRouter in settings.local.json). Both work but come with real trade-offs — slow local inference on consumer hardware, silent context truncation, missing native web search, daily rate limits, and a hidden paid-Haiku fallback unless every model variable is overridden.

Key Points

Claude Code is a harness (“car”) wrapped around a swappable model (“engine”); by default you pay Anthropic for Opus/Sonnet/Haiku, but the engine can be replaced with open-weight models. 00:36
The gap between closed-source and open-source models on SWE-bench Verified is shrinking, and some open-weight models now beat Claude Sonnet 3.7 — a model that was state-of-the-art at release. 03:12
The creator claims using non-Anthropic models inside Claude Code does not violate Anthropic’s terms of service. 04:54
Method 1: ollama pull <model> then ollama launch claude boots Claude Code against a locally hosted model — fully private, no per-token cost. 09:04
Claude Code onboarding still requires either a subscription or an API key with a $5 minimum credit purchase, even if you never consume those credits. 09:58
Ollama’s default context window is far smaller than the model’s advertised maximum (e.g., a “200k” window is displayed but not actually allocated), which broke state between turns until the creator built a custom 64k-context variant. 12:19
Ollama cloud models (e.g., MiniMax) feel close to real Sonnet-in-Claude-Code — spawning four subagents, fast responses — but are no longer local/private and eventually hit paid tiers. 15:58
Best-fit uses for open models: low-stakes/high-volume work — summarizing, grepping, scaffolding, triaging — plus fallback when Claude is down or session limits are hit. 16:21
Method 2: OpenRouter, configured through env vars in .claude/settings.local.json (base URL, auth token, and model names), with $10 loaded once to raise free-model limits from 50 to 1,000 requests/day. 17:35
Critical gotcha: setting only the main model (as OpenRouter’s docs suggest) leaves Claude Code silently calling paid Anthropic Haiku for tool calls and file searches — every model env var must be overridden. 20:04
Open models routed this way lack Claude’s native web search; workarounds are direct fetches or external search tools (Brave Search, Perplexity, Tavily). 23:23
Cheap paid models are the pragmatic middle path: Gemma 4 31B at ~$0.14/$0.40 per million tokens versus Opus 4.6 at $5/$25 — “50 to 100x cheaper” rather than strictly free. 24:28
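
The context-window fix in the key points above (12:19) can be illustrated with a small Python helper that renders an Ollama Modelfile. This is a sketch under assumptions: the summary does not say exactly how the creator built the 64k variant, so a Modelfile with PARAMETER num_ctx is shown as one common way to do it, and the model tag and variant name are hypothetical placeholders.

```python
def render_modelfile(base_model: str, num_ctx: int = 65536) -> str:
    """Render a Modelfile that derives a larger-context variant of base_model.

    65536 tokens corresponds to the 64k window mentioned in the video.
    """
    if num_ctx <= 0:
        raise ValueError("num_ctx must be positive")
    return f"FROM {base_model}\nPARAMETER num_ctx {num_ctx}\n"


def create_command(variant_name: str, modelfile_path: str = "Modelfile") -> list[str]:
    """Return the ollama CLI invocation that builds the variant from the Modelfile."""
    return ["ollama", "create", variant_name, "-f", modelfile_path]


if __name__ == "__main__":
    # Hypothetical model tag and variant name; substitute the model you pulled.
    print(render_modelfile("your-model:tag"))
    print(" ".join(create_command("your-model-64k")))
```

A larger num_ctx increases memory use, so the value should be sized to the available RAM/VRAM rather than copied blindly.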

Notable Quotes

“What I’m going to show you guys how to do is we basically just open up the hood and we switch out the engine.” 01:15

“There’s really no such thing as free because if you want to run a really good model locally, then you need the hardware to support it.” 14:34

“It would have, by default, used Sonnet or Haiku for basically all of these things, and it would have charged you without you even knowing.” 20:30

Verified Claims

Claude Code can be pointed at OpenRouter by overriding environment variables (ANTHROPIC_BASE_URL, auth token, model names) in project settings. 17:35 Sources: OpenRouter — Claude Code Integration, OpenRouter blog — Claude Code with OpenRouter Verdict: Confirmed — official docs specify ANTHROPIC_BASE_URL=https://openrouter.ai/api, the OpenRouter key as ANTHROPIC_AUTH_TOKEN, and an explicitly empty ANTHROPIC_API_KEY.
Without credits you get 50 free-model requests/day; a one-time $10 credit purchase raises that to 1,000/day. 18:15 Sources: OpenRouter — API Credit & Rate Limits, OpenRouter Help Center — Rate Limits Verdict: Confirmed — with an additional 20 requests/minute cap on :free variants the video doesn’t dwell on.
ollama launch claude starts Claude Code against local or Ollama-cloud models with no manual env-var setup. 09:04 Sources: Ollama blog — ollama launch, Ollama docs — Claude Code integration, Ollama blog — Anthropic API compatibility Verdict: Confirmed — enabled by Ollama’s Anthropic Messages API compatibility (v0.14.0+); docs recommend 64k+ context for coding.
Ollama’s default context window is much smaller than a model’s advertised maximum and silently truncates, breaking Claude Code sessions. 12:19 Sources: Ollama docs — Context length, The Autodidacts — Increase Ollama context length Verdict: Confirmed — historical default of 4,096 tokens (now VRAM-scaled) truncates from the start of the conversation with no warning; fix via num_ctx/OLLAMA_CONTEXT_LENGTH.
Running non-Anthropic models through the Claude Code harness does not violate Anthropic’s terms of service. 04:54 Sources: Claude Code docs — Legal and compliance, The Register — Anthropic clarifies ban on third-party tool access Verdict: Confirmed — the direction matters, though: pointing Claude Code at other providers is supported, but the inverse (using Claude subscription OAuth tokens in third-party harnesses) is explicitly banned.
Some open-weight models now score higher than Claude Sonnet 3.7 on SWE-bench Verified. 03:12 Sources: BenchLM — SWE-bench Verified leaderboard, MorphLLM — Claude benchmarks Verdict: Confirmed — top open-weight models (DeepSeek, MiniMax, Qwen) cluster near ~80% on SWE-bench Verified, well above Sonnet 3.7’s launch-era scores, though still below current Claude frontier models (~88–95%).
Gemma 4 delivers top open-model quality at the smallest size in its class. 04:21 Sources: Google blog — Gemma 4 Verdict: Confirmed with nuance — Gemma 4 31B’s ~1,452 LMArena Elo outperforms open models with up to 20x more parameters, though it ranks third among open models overall, not first as the chart framing implies.
Opus 4.6 costs $5/$25 per million input/output tokens versus ~$0.14/$0.40 for Gemma 4 31B on OpenRouter. 24:38 Sources: Anthropic — Claude pricing, OpenRouter — Gemma 4 31B Verdict: Confirmed — Opus 4.6 pricing matches exactly; Gemma 4 31B rates vary slightly by underlying provider ($0.10–0.14 input, $0.35–0.40 output), consistent with the video’s “50–100x cheaper” math.
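
The pricing comparison in the last claim can be checked with a few lines of arithmetic. The sketch below uses only the prices quoted in this summary; the token counts in the blended example are a hypothetical workload, and the real ratio depends on the input/output mix of an actual session.

```python
# Prices in USD per million tokens, as quoted in this summary: (input, output).
OPUS_4_6 = (5.00, 25.00)
GEMMA_4_31B_HIGH = (0.14, 0.40)  # upper end of the provider range
GEMMA_4_31B_LOW = (0.10, 0.35)   # lower end of the provider range


def price_ratio(expensive: float, cheap: float) -> float:
    """How many times more expensive one per-token price is than another."""
    return expensive / cheap


def session_cost(prices: tuple[float, float], input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a session with the given token counts."""
    input_price, output_price = prices
    return (input_price * input_tokens + output_price * output_tokens) / 1_000_000


if __name__ == "__main__":
    for label, gemma in (("high", GEMMA_4_31B_HIGH), ("low", GEMMA_4_31B_LOW)):
        print(label, "input ratio:", round(price_ratio(OPUS_4_6[0], gemma[0]), 1))
        print(label, "output ratio:", round(price_ratio(OPUS_4_6[1], gemma[1]), 1))
    # Hypothetical workload: 1M input tokens, 200k output tokens.
    opus = session_cost(OPUS_4_6, 1_000_000, 200_000)
    gemma = session_cost(GEMMA_4_31B_HIGH, 1_000_000, 200_000)
    print("blended ratio:", round(opus / gemma, 1))
```

On these quoted prices the per-token ratios fall between roughly 36x and 71x, which overlaps the lower part of the video's “50 to 100x” range.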

Tools, Papers & Standards Mentioned

Claude Code — Legal and compliance / docs; pricing at platform.claude.com
Ollama — Claude Code integration, ollama launch, Anthropic API compatibility, context length
OpenRouter — Claude Code integration guide, rate limits, pricing
Gemma 4 (Google) — announcement, OpenRouter listing / free tier
Qwen (Alibaba) — Qwen3-Coder-Next technical report
SWE-bench Verified — leaderboard
MiniMax M2.7, Brave Search, Perplexity, Tavily, VS Code — mentioned in passing; no canonical source surfaced during research (video references status.claude.com for Anthropic status checks)

Follow-up Questions

How do open-weight models actually score on agentic tool-calling reliability inside Claude Code specifically (not raw SWE-bench) — is there a benchmark for harness compatibility across Qwen, Gemma 4, and MiniMax?
What data-retention and training policies apply to prompts sent through OpenRouter’s :free model variants, and how does that compare to fully local Ollama inference for sensitive codebases?
What is the minimum practical hardware (VRAM/RAM) for a local model that sustains a 64k+ context window at usable tokens/sec for Claude Code sessions, and where is the price crossover versus cheap cloud models like Gemma 4 31B?
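
As a starting point for the hardware question above, the memory taken by the attention key/value cache at a given context length can be estimated with a standard back-of-the-envelope formula. This is a rough sketch only: the architecture numbers in the example are hypothetical and do not describe any model named in the video, and the estimate excludes model weights, activations, and any cache quantization a runtime may apply.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    n_ctx: int,
    bytes_per_value: int = 2,  # 2 bytes per value for fp16
) -> int:
    """Approximate KV-cache size for one sequence of n_ctx tokens.

    The leading factor of 2 accounts for storing both a key and a value
    tensor in every layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value


if __name__ == "__main__":
    # HYPOTHETICAL architecture, for illustration only.
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=65536)
    print(f"{size / 1024**3:.1f} GiB")  # prints "8.0 GiB"
```

For these illustrative numbers a 64k context adds about 8 GiB on top of the weights, which suggests why context length, not just parameter count, drives the hardware requirement.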