Cohere's North Mini Code Bets That Agent-Aware Training Beats Brute-Force Scale

Most open-source coding models are still chasing the same playbook: bigger parameter counts, more code tokens, hope for the best on SWE-Bench. Cohere just released a 30B Mixture-of-Experts model with only 3B active parameters that reportedly outscores 120B-class systems on agentic coding tasks — and the interesting part isn’t the size. It’s that the model was trained to live inside multiple agent harnesses from day one, rather than being grafted onto one after the fact. North Mini Code is a wager that the next gap in code models isn’t raw capability. It’s behavioral fit with the scaffolds that actually call them.

Why a 3B-Active Model Outscoring 120B Systems Matters

According to Cohere’s release, North Mini Code scores 33.4 on Artificial Analysis’ Coding Index, ahead of Qwen3.5 (35B-A3B), Gemma 4 (26B-A4B), and Devstral Small 2 (24B Dense), as well as substantially larger models including Nemotron 3 Super (120B-A12B), Mistral Small 4 (119B-A6B), and Devstral 2 (123B). The model is released on Hugging Face under Apache 2.0 in both BF16 and FP8 weights.

The practical implication for engineering teams is cost-per-rollout. Agentic coding runs are expensive because trajectories are long and variable: a single SWE-Bench-style task can chew through hundreds of tool calls. A sparse MoE that activates only 3B parameters per token can serve those rollouts at a fraction of the memory and latency of a 120B-class competitor, while landing in the same accuracy band. If you’re running an internal code review bot or a refactoring agent at scale, the cost ceiling on “acceptable” agent depth just moved. Expect the gap between “open model you can self-host” and “frontier API you have to rent” to keep narrowing for code work specifically, because code has verifiable rewards, so reinforcement learning gains can compound in a way they can’t for general chat.
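
The cost argument can be made concrete with a back-of-the-envelope sketch. It assumes the common rule of thumb that a forward pass costs roughly 2 FLOPs per active parameter per token, and that weight memory is total parameters times bytes per parameter. It ignores KV cache, attention cost, and expert-routing overhead, and the parameter counts are simply read off the model names quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    total_params_b: float   # total parameters, in billions
    active_params_b: float  # parameters used per token, in billions


def forward_gflops_per_token(spec: ModelSpec) -> float:
    """Rough forward-pass compute: ~2 FLOPs per active parameter (assumption)."""
    return 2.0 * spec.active_params_b  # billions of FLOPs, i.e. GFLOPs


def weight_memory_gb(spec: ModelSpec, bytes_per_param: float) -> float:
    """Weight storage only. Every expert must be resident, so this uses total params."""
    return spec.total_params_b * bytes_per_param  # 1e9 params * bytes = GB


north_mini = ModelSpec("North Mini Code (30B-A3B)", 30, 3)
moe_120b = ModelSpec("120B-A12B MoE", 120, 12)
dense_123b = ModelSpec("123B dense", 123, 123)

for spec in (north_mini, moe_120b, dense_123b):
    print(
        f"{spec.name}: "
        f"{forward_gflops_per_token(spec):.0f} GFLOPs/token, "
        f"{weight_memory_gb(spec, 2.0):.0f} GB in BF16, "
        f"{weight_memory_gb(spec, 1.0):.0f} GB in FP8"
    )
```

The split matters when reading the numbers: per-token compute tracks active parameters, while the memory footprint tracks total parameters, so a sparse model saves far more on compute than on memory.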

How Multi-Harness Training Changes the Agent Stack

Cohere’s most interesting design choice isn’t the architecture. It’s that they deliberately trained the model against several agent harnesses (SWE-Agent, mini-SWE-agent, OpenCode, Terminus 2) instead of overfitting to one. They report that adding just 6% of harness data to the second SFT stage yielded a 10% gain on the OpenCode harness without degrading SWE-Agent performance, and that the model hits 61.0% pass@1 on mini-SWE-agent essentially for free via cross-harness transfer.

The agent harness ecosystem has fractured. SWE-Agent gives a model a rich CLI with bash, str_replace_editor and submit tools. mini-SWE-agent strips that down to one bash tool with raw stdout. OpenCode hands the model typed JSON tools like edit, grep, todowrite and task. Terminus 2 abandons native tool calling entirely and forces plain-text chat turns. A model that only knows one of these is a model that’s locked into one vendor’s scaffold. If you’re a platform team evaluating which open coding model to standardize on, harness robustness is the new portability metric — it’s the difference between a model you can drop into your existing agent framework and a model that forces you to rebuild around it. The deeper question of whether to even own that stack is the same one covered in the agents-versus-automation decision guide: not every workflow needs a tool-calling loop, but the ones that do should not be tied to a single scaffold.
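
To see why the same intent looks different in each scaffold, here is a minimal sketch that renders one harness-neutral action into three surface forms. Everything in it is hypothetical: the `ShellAction` type, the style names, and the payload shapes are invented for illustration and are not the actual wire formats of SWE-Agent, mini-SWE-agent, OpenCode, or Terminus 2.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ShellAction:
    """Harness-neutral intent: run one shell command."""
    command: str


def as_bash_tool_call(action: ShellAction) -> dict:
    # Illustrative shape for a scaffold that exposes a single bash tool.
    return {"name": "bash", "arguments": {"command": action.command}}


def as_typed_json_call(action: ShellAction) -> str:
    # Illustrative shape for a scaffold that exchanges typed JSON tool calls.
    return json.dumps({"tool": "bash", "input": {"command": action.command}})


def as_plain_text_turn(action: ShellAction) -> str:
    # Illustrative shape for a scaffold with no native tool calling.
    return "COMMAND:\n" + action.command + "\n"


# Hypothetical style names mapped to their renderers.
RENDERERS = {
    "single_bash_tool": as_bash_tool_call,
    "typed_json_tools": as_typed_json_call,
    "plain_text_chat": as_plain_text_turn,
}


def render(action: ShellAction, harness_style: str):
    """Render the same intent in the surface form a given scaffold expects."""
    try:
        renderer = RENDERERS[harness_style]
    except KeyError:
        raise ValueError(f"unknown harness style: {harness_style!r}") from None
    return renderer(action)


action = ShellAction("grep -rn 'TODO' src/")
for style in RENDERERS:
    print(style, "->", render(action, style))
```

A model trained on only one of these surface forms has to relearn the other two at inference time, which is the portability gap multi-harness training is meant to close.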

The prediction: harness-agnosticism becomes a baseline expectation for serious open coding models within the next two release cycles. Vendors that ship single-harness specialists will get reviewed as toys.

Asynchronous RL and Why the Training Loop Is the Real Moat

The post-training pipeline is where Cohere is doing the most under-discussed work. They run a two-stage SFT (64K then 128K context) followed by reinforcement learning with verifiable rewards (RLVR), using over 70k verifiable tasks across roughly 5k unique repositories. They deduplicate against SWE-Bench and SWE-Bench-Pro sources to avoid leakage. The SFT-only checkpoint already hits 80.2% pass@10 on SWE-Bench Verified and 55.1% pass@10 on Terminal-Bench v2. RLVR then adds a further 7.9 percentage points of pass@1 on Terminal-Bench v2 and 3.0 points on SWE-Bench. The two metrics are not interchangeable: pass@10 credits a task if any of ten attempts succeeds, while pass@1 measures single-attempt reliability.
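
For readers comparing the pass@10 and pass@1 figures, here is a minimal sketch of the widely used unbiased pass@k estimator, where `n` attempts are sampled per task and `c` of them pass. Whether Cohere computed its reported numbers this way is an assumption; the source does not say.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled attempts passes.

    n: attempts sampled for the task, c: attempts that passed, k: budget.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer failures than draws: every k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# A task solved in 3 of 10 attempts looks very different under each metric.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=10))
```

A checkpoint can score high on pass@10 while still being unreliable on any single attempt, which is why gains in pass@1 are the more relevant signal for an agent that gets one shot per task.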

Coding-agent rollouts have brutal length variance — Cohere notes the slowest trajectories are routinely an order of magnitude longer than the median. A naive synchronous RL loop would stall the trainer waiting on stragglers. Their fix: decouple sampling from learning with a vLLM sidecar serving rollouts continuously, refresh policy weights every K=4 learner steps, and use a windowed FIFO queue to drain the slowest trajectories without distorting the data distribution. They train using CISPO with token-level loss aggregation, so long agent traces — where most credit assignment lives — aren’t down-weighted against short ones.
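
The effect of token-level loss aggregation can be shown with a toy sketch. It compares averaging per trajectory first against pooling every token in the batch. The per-token loss values are made up, and the CISPO objective itself (its clipping and importance weights) is not modeled here; only the aggregation step is.

```python
def sequence_level_mean(per_token_losses: list[list[float]]) -> float:
    """Average within each trajectory first, then across trajectories."""
    per_sequence = [sum(seq) / len(seq) for seq in per_token_losses]
    return sum(per_sequence) / len(per_sequence)


def token_level_mean(per_token_losses: list[list[float]]) -> float:
    """Pool every token in the batch and average once."""
    total = sum(sum(seq) for seq in per_token_losses)
    count = sum(len(seq) for seq in per_token_losses)
    return total / count


# Toy batch (made-up values): one short trajectory and one long one.
# Both functions assume every trajectory has at least one token.
short_trace = [1.0] * 10
long_trace = [3.0] * 990
batch = [short_trace, long_trace]

print("sequence-level:", sequence_level_mean(batch))
print("token-level:   ", token_level_mean(batch))
```

Under sequence-level averaging, each token of the 990-token trace carries 99 times less weight than a token of the 10-token trace. Token-level pooling gives every token the same weight, so long agent traces contribute in proportion to their length.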

For a team building a production code agent, this is the gap between a model that can solve a task and one that reliably terminates. Cohere reports the RLVR model produces shorter trajectories, fewer invalid or failing tool calls, and less repetitive tool-call looping. Anyone who has watched an LLM agent fall into a grep → cat → grep death spiral knows that termination quality matters more than peak capability. The infrastructure for asynchronous agentic RL — the queue management, the off-policy correction, the per-task step budgets — is becoming the actual moat. Weights are getting commoditized; the training loop that produces well-behaved agents is not. The same split applies when evaluating custom AI versus off-the-shelf SaaS AI: the part you can’t buy is the post-training discipline that decides how an agent fails.
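
Teams that want to track these behaviors in their own agent logs could start from a sketch like the one below. The `ToolCall` record, the definition of a "repeat", and the window size are assumptions made for illustration; the source does not describe how Cohere measures invalid calls or looping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: str
    ok: bool  # False if the call was malformed or the tool returned an error


def invalid_call_rate(calls: list[ToolCall]) -> float:
    """Fraction of tool calls that were malformed or failed."""
    if not calls:
        return 0.0
    return sum(not call.ok for call in calls) / len(calls)


def repeated_call_rate(calls: list[ToolCall], window: int = 4) -> float:
    """Fraction of calls that exactly repeat one of the previous `window` calls."""
    if not calls:
        return 0.0
    repeats = 0
    for i, call in enumerate(calls):
        recent = calls[max(0, i - window):i]
        if any(call.name == r.name and call.arguments == r.arguments for r in recent):
            repeats += 1
    return repeats / len(calls)


# A made-up trajectory showing the grep -> cat -> grep loop.
trajectory = [
    ToolCall("grep", "foo src/", True),
    ToolCall("cat", "src/a.py", True),
    ToolCall("grep", "foo src/", True),
    ToolCall("cat", "src/a.py", True),
    ToolCall("grep", "foo src/", False),
    ToolCall("submit", "", True),
]

print("length:", len(trajectory))
print("invalid rate:", invalid_call_rate(trajectory))
print("repeat rate:", repeated_call_rate(trajectory))
```

Tracking these three numbers per task alongside the pass rate makes it visible when a model change improves accuracy but worsens termination behavior, or the reverse.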

What Human Evaluation Reveals About RLVR’s Real Wins

Cohere also ran an internal pairwise human evaluation across four task types — code explanation, code editing, data visualization, and implementation from scratch — using OpenCode through the Harbor framework. The headline number: the final RLVR checkpoint won 66.1% of pairwise comparisons against the SFT-only checkpoint across 85 samples, with the largest gains specifically on code editing.
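
With only 85 samples, it is worth checking how much uncertainty sits around that win rate. The sketch below computes a Wilson score interval for a binomial proportion. It assumes the 85 comparisons are independent and that the 66.1% figure is a plain win fraction; the source does not say how ties were handled, so treat the result as indicative only.

```python
from math import sqrt


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must be in [0, 1]")
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(0.661, 85)
print(f"95% interval for the win rate: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval runs from roughly 0.56 to 0.75. The lower bound stays above an even split, but the range is wide enough that per-task-type breakdowns on subsets of the 85 samples should be read cautiously.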

That the biggest human-preference jump is on editing, not generation from scratch, tracks with what production teams report. Editing tasks demand the model preserve surrounding context, respect existing conventions, and not hallucinate APIs that aren’t in the file — all things RLVR with executable verification can punish directly. Generation-from-scratch tasks have more degrees of freedom, so the SFT baseline already covers a lot of acceptable output. If you’re embedding a coding model into an AI-integrated product, the takeaway is to weight your eval suite toward edit-in-place benchmarks rather than greenfield generation. Editing is where users actually catch failures, and editing is where RLVR’s gains compound.

FAQ

Q: What is North Mini Code and who is it for?
A: North Mini Code is Cohere’s first developer-focused model, a 30B-parameter Mixture-of-Experts system with 3B active parameters per token, released on Hugging Face under Apache 2.0. It’s designed for agentic software engineering: running inside coding agent harnesses like OpenCode, SWE-Agent, and mini-SWE-agent rather than as a standalone autocomplete model.

Q: How does North Mini Code compare to other open-source coding models?
A: On Artificial Analysis’ Coding Index, Cohere reports North Mini Code scoring 33.4, ahead of similarly sized models like Qwen3.5 (35B-A3B), Gemma 4 (26B-A4B), and Devstral Small 2 (24B Dense), and also ahead of substantially larger models including Nemotron 3 Super (120B-A12B), Mistral Small 4 (119B-A6B), and Devstral 2 (123B). The SFT-only checkpoint hits 80.2% pass@10 on SWE-Bench Verified and 55.1% pass@10 on Terminal-Bench v2.

Q: What is RLVR and why does it matter for coding agents?
A: RLVR stands for reinforcement learning with verifiable rewards. Instead of learning from human preference labels, the model is rewarded based on whether its code actually passes unit tests or whether its terminal actions produce the correct state. For coding agents this is unusually clean: correctness is binary, the verifier is automated, and the model learns to terminate trajectories properly rather than loop on broken tool calls.
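
The binary reward described in the last answer can be sketched in a few lines. This is a minimal illustration that treats a zero exit status from a test command as success; the function name, timeout, and commands are assumptions, and a real pipeline would run the command inside an isolated sandbox rather than directly on the host.

```python
import subprocess
import sys


def verifiable_reward(test_command: list[str], workdir: str, timeout_s: float = 300.0) -> float:
    """Return 1.0 if the test command exits with status 0, otherwise 0.0."""
    try:
        result = subprocess.run(
            test_command,
            cwd=workdir,
            capture_output=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung or unlaunchable test run counts as a failure.
        return 0.0
    return 1.0 if result.returncode == 0 else 0.0


# Stand-in commands: a "test suite" that passes and one that fails.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]
print(verifiable_reward(passing, "."), verifiable_reward(failing, "."))
```

In practice the stand-in commands would be replaced by the repository’s own test runner, executed against the working tree the agent left behind.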

Key Takeaways

Teams evaluating open coding models should add harness portability to their checklist — a model that only works inside one scaffold is a vendor lock-in waiting to happen.
Sparse MoE architectures with small active-parameter counts are making self-hosted agentic coding economically viable; the cost calculus for renting frontier APIs versus running open weights is shifting fast.
The competitive frontier for code models is moving from base capability to behavioral quality — termination, tool-call validity, and trajectory length — and only RL-style post-training reliably moves those metrics.
Internal eval suites should over-index on code editing tasks rather than greenfield generation, because that’s where post-training gains and real-world user complaints concentrate.
Expect the next wave of open coding releases to ship with explicit harness-compatibility matrices, the same way models today advertise context length and quantization formats.
