Running models locally means no API keys, no usage bills, and no sending proprietary code to someone else’s servers. It also means slower responses and lower quality on hard problems. Whether that tradeoff makes sense depends on what you’re doing.

If you came here from a search for “openclaw local model” or “openclaw local llm,” start simple: use Gemma 4 8B or Qwen3.6 9B on 16GB machines, Qwen3.6 27B on 24GB+ GPUs (single RTX 5090 or M5 Pro), and keep one cloud fallback for jobs where local models get stuck. Local OpenClaw is good now. It is not magic.

Two things changed the math in 2026. The Qwen3.6 release on April 22 dropped a 27B dense coding model that beats a 397B MoE on SWE-bench. And Apple’s M5 Max (announced March 2026) put 128GB of unified memory and Neural Accelerators in every GPU core — 70B-class models now run on a laptop. Combined with Ollama becoming an official OpenClaw provider, the setup is simpler than it has ever been.

Model rankings

Current local models ranked for coding work in OpenClaw, based on SWE-bench Verified scores, tool-calling reliability, and real-world agent performance:

ModelParametersActivationVRAM NeededSWE-benchSpeed (RTX 5090)Best For
Qwen3.6 27B27B27B (dense)18GB+77.2%~70 t/sBest quality-to-size ratio for coding
Qwen3 Coder Plus72B72B (dense)48GB+70.6%~30 t/sHardest coding tasks, full agent loops
Qwen3.6 35B-A3B35B3B (MoE)16GB+~180 t/sSpeed-critical work, high throughput
Qwen3.6 9B9B9B (dense)8GB+~186 t/sEntry-level hardware, simple tasks
Llama 3.3 70B70B70B (dense)dual GPU~27 t/s (2x 5090)General coding, good instruction following
Gemma 4 8B8B8B (dense)8GB+~150 t/sPrivacy-first, lightweight setups
Qwen3 32B32B32B (dense)24GB+~60 t/sSolid all-rounder, widely tested

Qwen3.6 27B is the new headline. 77.2% on SWE-bench Verified, 59.3% on Terminal-Bench 2.0 (matching Claude Opus 4.5 exactly), and it runs on 18GB of VRAM. A dense 27B model beating a 397B MoE on coding is the result of architectural changes in the 3.6 release — Gated Delta Networks plus targeted post-training on real PRs.

The 35B-A3B MoE is the wildcard. Only 3B parameters activate per forward pass, so it runs at ~180 t/s on a single RTX 5090. Quality is lower than 27B dense on hard problems, but for file reads, boilerplate generation, and simple edits it feels like a cloud API.

Hardware requirements

Local model quality scales with model size, and model size scales with hardware needs. The tiers below assume 2026 hardware — RTX 50-series on the NVIDIA side, M5 on the Apple side.

8–16GB VRAM (entry level)

RTX 5060 / 5070 or 16GB unified memory (M5 base, M5 Pro entry). Enough for Qwen3.6 9B and the 35B-A3B MoE. The 9B handles simple tasks and code summarization at ~186 t/s on a 5090; expect ~120 t/s on a 5070. The 35B-A3B uses far less memory than its parameter count suggests because only 3B parameters activate per pass.

Models: Qwen3.6 9B, Qwen3.6 35B-A3B, Gemma 4 8B

Single RTX 5090 (32GB GDDR7, 1,792 GB/s) or 36–64GB unified memory (M5 Pro / M5 Max base). This is where local models become practical for real work. Qwen3.6 27B runs comfortably on 18GB and its SWE-bench score (77.2%) rivals cloud models you’d pay per token to use.

The RTX 5090’s bandwidth jump over the 4090 (1,792 GB/s vs 1,008 GB/s — a 78% increase) is the relevant number for inference, not the raw FLOPS. Token generation is memory-bandwidth bound, and a single 5090 generates around 186 t/s on Qwen 8B and 124 t/s on 14B-class models.

Models: Qwen3.6 27B, Qwen3 32B, Qwen3.6 Plus (when fitted)

48GB+ effective memory (premium)

Dual RTX 5090 (64GB combined) or 96–128GB M5 Max unified memory. Qwen3 Coder Plus and Llama 3.3 70B live here.

Two routes:

  • Dual 5090 rig: 27 t/s on Llama 70B Q4_K_M with vLLM tensor parallelism — within shouting distance of an H100 at a fraction of the cost. Best for desktops where you can fit two cards.
  • M5 Max 128GB MacBook Pro: A 70B Q4_K_M model (~40GB on disk) loads entirely into unified memory with room to spare for context. Apple’s Neural Accelerators embedded in every GPU core push prompt processing 3.3–4x faster than M4 Max, and steady-state generation lands around 18–25 t/s. The trade-off is portability: this is the only setup that fits in a backpack.

Models: Qwen3 Coder Plus, Llama 3.3 70B, Qwen3.6 Plus full precision

The M5 Max with 128GB of unified memory is the new sweet spot for serious local work on a laptop. Apple’s MLX framework now ships with Neural Accelerator support, which means the GPU and the Neural Engine both work in parallel on every forward pass instead of one or the other.

Setting up Ollama

Ollama is the simplest way to run local models. Install it, pull a model, and you have an OpenAI-compatible API running on localhost.

# Install
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (pick one based on your hardware)
ollama pull qwen3.6:27b          # Best quality, needs 18GB+ VRAM
ollama pull qwen3.6:35b-a3b      # Fast MoE model, runs on 16GB
ollama pull qwen3.6:9b           # Lightweight, runs on 8GB
ollama pull qwen3-coder-plus     # Premium, needs 48GB+ or 96GB unified

Ollama serves an API at http://localhost:11434 by default.

Tip from r/LocalLLaMA: Several users report better performance switching from Ollama to llama.cpp directly for the 27B and larger models. Ollama adds convenience, but llama.cpp gives you more control over quantization and memory allocation. Start with Ollama — switch to llama.cpp if you hit performance walls.

OpenClaw configuration

Since Ollama is now an official provider, the setup is straightforward. Run the onboarding wizard:

openclaw onboard --auth-choice ollama

Or add Ollama manually in ~/.openclaw/openclaw.json:

{
  models: {
    providers: {
      ollama: {
        baseUrl: "http://localhost:11434/v1",
        api: "openai-completions",
        models: [
          {
            id: "qwen3.6:27b",
            name: "Qwen3.6 27B",
            reasoning: false,
            contextWindow: 131072,
            maxTokens: 8192
          }
        ]
      }
    }
  },
  agents: {
    defaults: {
      model: { primary: "ollama/qwen3.6:27b" },
      models: {
        "ollama/qwen3.6:27b": { alias: "qwen-local" }
      }
    }
  }
}

Switch to your local model:

/model qwen-local

What local models handle well

After running Qwen3.6 27B locally for several weeks, a few things hold up:

  • Reading and summarizing code. Ask it what a function does and it gives you a solid answer. Not as nuanced as Sonnet 4.6, but good enough for navigating unfamiliar codebases.
  • Code generation for common patterns. Boilerplate, CRUD operations, config files, test scaffolding. It writes functional code on the first try most of the time. The 3.6 release was post-trained on real merged PRs, and it shows in the diff quality.
  • File operations and simple refactoring. Listing files, searching for patterns, renaming variables across a file. Mechanical tasks that don’t require deep reasoning.
  • Agentic tool calling. Qwen3.6 raised the bar on function-calling reliability — Terminal-Bench 2.0 at 59.3% matches Claude Opus 4.5 exactly. For OpenClaw’s tool loop, that translates to fewer “model called the wrong function with the wrong args” errors.

Where local models fall short

  • Multi-file refactors. Anything that requires holding context across 5+ files gets unreliable. The model either loses track or makes inconsistent changes. Cloud models with 200K+ context windows still have an advantage here, though the gap narrowed with Qwen3.6.
  • Complex debugging. If the bug requires reasoning through multiple abstraction layers, local models suggest surface-level fixes when the problem runs deeper. Claude Opus 4.7 still beats Qwen3.6 27B by ~14 points on hard SWE-bench tasks.
  • Speed on dense models on older hardware. The 27B model runs at about 70 tokens/second on a single RTX 5090 and ~22 t/s on an M5 Max. If you’re still on a 3090 or 4090, expect closer to 30-40 t/s. (The 35B-A3B MoE at 180+ t/s is the exception.)
  • Very long context. Qwen3.6 supports up to 256K tokens in theory, but inference quality degrades on consumer hardware past 32K. Keep contextWindow realistic in your config.

The hybrid approach

Most people who try local models end up with a hybrid setup: local for the cheap stuff, cloud for the hard stuff.

{
  agents: {
    defaults: {
      model: {
        primary: "ollama/qwen3.6:27b",
        thinking: "anthropic/claude-sonnet-4-6-20260514"
      }
    }
  }
}

The local model handles file reads, simple edits, and boilerplate — maybe 60-70% of a typical coding session. Sonnet handles the debugging, architecture decisions, and multi-file work. Your API bill drops to a few dollars a day instead of $20-50.

Switch manually when you know a task needs more capability:

/model sonnet

Or use Haimaker’s auto-router to handle the routing for you. The auto-router detects task complexity and sends hard problems to cloud models automatically, so you don’t have to think about when to switch.

Troubleshooting

Model loads slowly or crashes. You’re probably out of memory. Try a smaller quantization: ollama pull qwen3.6:27b-q4_K_M uses less memory at a small quality cost. Q4_K_M is the sweet spot for most people — minimal quality loss, significant memory savings.

Tool calls fail. Set "reasoning": false in your model config and stick to Qwen3.6 models — they handle OpenClaw’s tool-calling format more reliably than Mistral or older Llama models. If tool calls still break, update Ollama to the latest version. The official provider integration fixed several edge cases.

Context window errors. Set contextWindow accurately in your config. For Qwen3.6 models, 131072 (128K) is a safe default on 24GB+ VRAM hardware (single RTX 5090 fits this comfortably). On 16GB, stick to 32768 to avoid quality degradation.

Slow generation speed. If you’re getting under 40 t/s on the 27B model on a 5090 or under 18 t/s on M5 Max, check whether other processes are using your GPU. Close any browser tabs running WebGL or video. On Mac, Activity Monitor → GPU History will show what’s competing for unified memory. M5 users should also confirm Ollama is using the MLX backend (v0.21+) — the speed gap between the Metal-only and MLX paths is roughly 2x on prompt processing.

TRY HAIMAKER FOR CLOUD ROUTING


For model pricing comparisons, see cheapest models for OpenClaw. For reducing token costs on cloud models, see cutting costs by 96% with QMD.