LLM Model Landscape (February 2026)¶
The frontier is moving fast. This document covers the current state of the art for both cloud and local models.
Last updated: February 14, 2026
Table of Contents¶
- GLM-5 (Just Released)
- Cloud Frontier Models
- Open-Source Rankings
- Best Coding Models
- Brand New Releases
- Model Selection Guide
GLM-5 (Just Released)¶
Released: February 11, 2026
Creator: Zhipu AI (Z.AI) -- Tsinghua University spinoff, Hong Kong IPO Jan 2026 ($558M raised)
License: MIT (full commercial use)
Specifications¶
| Spec | Value |
|---|---|
| Total Parameters | ~744-754 billion |
| Active Parameters | ~40-44 billion per token |
| Architecture | Mixture of Experts (MoE) -- 256 experts, 8 activated |
| Context Window | 200K tokens |
| Max Output | 131,000 tokens |
| Training Data | 28.5 trillion tokens |
| Training Hardware | Huawei Ascend 910 (zero NVIDIA dependency) |
| Model Size on Disk | 1.51 TB |
| Predecessor | GLM-4.7 (368B total / ~32B active) |
Key Innovation¶
"Slime" RL technique -- novel reinforcement learning method achieving record-low hallucination rate among frontier models.
Benchmarks¶
| Benchmark | GLM-5 | Claude Opus 4.5 | GPT-5.2 |
|---|---|---|---|
| SWE-bench Verified | 77.8% | 80.9% | 76.2% |
| Terminal-Bench 2.0 | 56.2% | 59.3% | 54.2% |
| Humanity's Last Exam | 50.4 | 43.4 | 45.8 |
| BrowseComp | 75.9 (#1) | 67.8 | 59.2 |
| AI Intelligence Index | 50 (#1 of 64 models) | - | - |
- #1 open-source model on SWE-bench and BrowseComp
- Beats GPT-5.2 on most benchmarks
- Trails Claude Opus 4.5 on SWE-bench and Terminal-Bench
- Beats Claude on Humanity's Last Exam and BrowseComp
Pricing (The Disruption)¶
| Model | Input/1M tokens | Output/1M tokens |
|---|---|---|
| GLM-5 | ~$0.11 | ~$0.32 |
| Claude Opus 4.6 | $5.00 | $25.00 |
| GPT-5.2 | $1.75 | ~$10.00 |
At the rates above, GLM-5 is roughly 45x cheaper than Claude Opus on input and roughly 78x cheaper on output; against GLM-5's top-tier output price of $2.56/M (see the head-to-head table below), the output gap is closer to 10x.
Availability¶
- Hugging Face: `zai-org/GLM-5` (full weights)
- Ollama: `ollama pull glm-5`
- OpenRouter: `z-ai/glm-5` (usage sketch below)
- Z.AI Platform: chat.z.ai
- vLLM: Officially supported
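For API access, OpenRouter exposes GLM-5 behind its OpenAI-compatible chat-completions endpoint. A minimal sketch using the openai Python client; the model slug `z-ai/glm-5` comes from the list above, and the `OPENROUTER_API_KEY` environment variable name is an assumption:

```python
# Minimal sketch: calling GLM-5 through OpenRouter's OpenAI-compatible API.
# Assumes the `openai` Python package and an OPENROUTER_API_KEY env var (name assumed).
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # OpenRouter's OpenAI-compatible endpoint
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="z-ai/glm-5",  # slug from the availability list above
    messages=[{"role": "user", "content": "Write a Python function that reverses a linked list."}],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```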
Early Assessment¶
Strong on structured tasks (coding, math, reasoning benchmarks). May lag behind Claude Opus on nuanced, context-heavy, "situationally aware" tasks. Community consensus: "exceptionally capable but far less situationally aware."
Note: GLM-5 at 1.51 TB requires significant hardware for local deployment. A Mac Studio M3 Ultra with 512GB cannot fit the unquantized model. Heavy quantization (Q2-Q3) is needed for consumer hardware, which degrades quality.
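A rough way to size quantized variants: on-disk footprint scales with total parameters times bits per weight. A back-of-the-envelope sketch (it ignores embedding tables, quantization scales, and other metadata, so real GGUF sizes will differ somewhat):

```python
# Back-of-the-envelope size estimate for quantized GLM-5 weights.
# All ~744B parameters must be stored even though only ~40B are active per token (MoE).
PARAMS = 744e9  # total parameter count from the spec table above

def approx_size_gb(bits_per_weight: float) -> float:
    """Rough on-disk size: params * bits / 8 (ignores embeddings, scales, metadata)."""
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bits in [("FP16/BF16", 16), ("Q8", 8), ("Q4", 4), ("Q3", 3), ("Q2", 2)]:
    print(f"{name:10s} ~{approx_size_gb(bits):6.0f} GB")
# FP16 works out to ~1.49 TB, consistent with the 1.51 TB figure above;
# even Q2-Q3 still needs roughly 190-280 GB of memory, hence 512GB-class hardware.
```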
Cloud Frontier Models¶
| Rank | Model | Creator | Key Strengths |
|---|---|---|---|
| 1 | Claude Opus 4.5/4.6 | Anthropic | Top coding (SWE-bench 80.9%), best "vibes", situational awareness |
| 2 | GPT-5.2 | OpenAI | Strong all-rounder, 400K context, Codex integration |
| 3 | Gemini 2.5/3.0 Pro | Google | Massive context windows, multimodal, strong reasoning |
| 4 | Grok 4 | xAI | Competitive reasoning, real-time data access |
| 5 | Claude Sonnet 4.5 | Anthropic | Best price/performance for coding |
| 6 | GLM-5 | Zhipu AI | Best open-weight frontier, MIT license, ultra-cheap API |
Open-Source Rankings¶
By Coding Performance (LiveCodeBench + SWE-bench)¶
| Rank | Model | Creator | Params | Coding Score |
|---|---|---|---|---|
| 1 | DeepSeek V3.2 Speciale | DeepSeek | - | 90% LiveCodeBench |
| 2 | GLM-5 | Zhipu AI | 744B MoE | 77.8% SWE-bench (#1 open-source) |
| 3 | gpt-oss-120B | OpenAI | 120B | 88% LiveCodeBench |
| 4 | MiMo-V2-Flash | Xiaomi | - | 87% LiveCodeBench |
| 5 | Kimi K2.5 | MoonshotAI | 1T MoE (32B active) | 76.8% SWE-bench, 96.1% AIME |
| 6 | MiniMax M2.5 | MiniMax | - | 80.2% SWE-bench, 20x cheaper than Claude |
By Math Performance (AIME 2025)¶
| Model | Score |
|---|---|
| DeepSeek V3.2 Speciale | 97% |
| Kimi K2.5 (Reasoning) | 96% |
| MiMo-V2-Flash | 96% |
| GLM-5 | 95% |
| gpt-oss-120B | 93% |
| DeepSeek V3.2 | 92% |
Best Coding Models¶
For Cloud API Usage¶
| Use Case | Best Model | Why |
|---|---|---|
| Best quality overall | Claude Opus 4.5/4.6 | SWE-bench 80.9%, best situational awareness |
| Best value (subscription) | GPT-5.3-Codex | $20/mo Plus includes frontier coding. Limits doubled until April 2026. |
| Best price/performance (API) | Claude Sonnet 4.5 | Strong coding at lower cost |
| Cheapest frontier-quality (API) | GLM-5 | 45x cheaper than Claude, SWE-bench 77.8% |
| Async task delegation | OpenAI Codex (GPT-5.3) | Background task execution, cloud sandboxed VMs |
GPT-5.3-Codex for OpenClaw (community-validated):
- $20/mo ChatGPT Plus includes GPT-5.3-Codex (low/medium/high/xhigh tiers)
- @_karimelk: "running my openclaw on GPT-5.3-codex medium with excellent results"
- @Shenoy465653734: "OpenClaw's power is realised with frontier models only i.e 5.3 Codex and Opus 4.6"
- Limits 2x more generous than Claude Pro. Near-unlimited for coding at $20/mo.
- See Subscriptions Guide for full pricing.
For Local Deployment¶
| VRAM Budget | Best Model | Notes |
|---|---|---|
| 8GB | Qwen 2.5 Coder 7B | 88.4% HumanEval |
| 16GB | Qwen 2.5 Coder 14B | Strong small model |
| 24GB | Qwen3 30B (Q4) | Competitive quality |
| 48-64GB | Qwen3-Coder 32B | Near-frontier, practical |
| 128GB+ | Qwen3-next-80B | Best speed/quality balance |
| 192GB+ | Qwen-3 235B | Frontier-class local |
| 512GB+ | GLM-5 (quantized), DeepSeek V3 | Largest models |
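Any of the local models above can be served through Ollama, which exposes an HTTP API on localhost once a model has been pulled. A minimal sketch for the 8GB-tier pick; the exact model tag in the Ollama library may differ:

```python
# Minimal sketch: querying a locally served model through Ollama's HTTP API.
# Assumes Ollama is running and the model has been pulled first,
# e.g. `ollama pull qwen2.5-coder:7b` (tag name may vary), plus the `requests` package.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",  # Ollama's default local endpoint
    json={
        "model": "qwen2.5-coder:7b",
        "messages": [{"role": "user", "content": "Write a Python function to parse an ISO-8601 date."}],
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=300,
)
print(resp.json()["message"]["content"])
```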
Best for Claude Code CLI (Local)¶
- GLM-5 -- Latest frontier open-source, MIT license, top community pick
- Qwen3-Coder (30B-A3B MoE) -- Best efficiency for smaller setups
- GPT-OSS 120B -- Strong coder, Apache 2.0
- Devstral-small -- Good for agentic coding on limited hardware
Brand New Releases (Last 4 Weeks)¶
| Model | Date | Creator | Notable |
|---|---|---|---|
| GLM-5 | Feb 11, 2026 | Zhipu AI | 744B MoE, MIT, #1 open-source |
| MiniMax M2.5 | Feb 2026 | MiniMax | 80.2% SWE-bench, 20x cheaper than Claude |
| DeepSeek V3.2 Speciale | Jan-Feb 2026 | DeepSeek | 90% LiveCodeBench (highest open-source) |
| gpt-oss-120B | Recent | OpenAI | Apache 2.0; OpenAI's first serious open-weight release |
| MiMo-V2-Flash | Recent | Xiaomi | 87% coding, 96% math |
| Qwen3-Coder-480B | Recent | Alibaba | Agentic coding focused |
| Kimi K2.5 | Jan 26, 2026 | MoonshotAI | 1T MoE, 32B active, native multimodal, agent swarms |
| Llama 4 Scout/Maverick | Recent | Meta | Natively multimodal, open-source |
Model Selection Guide¶
Decision Tree¶
Need absolute best quality?
├─ Yes → Claude Opus 4.5/4.6 (cloud)
└─ No → Need it free / private?
   ├─ Yes → Have 64GB+ unified memory?
   │        ├─ Yes → GLM-5 (quantized) or Qwen3-Coder (local via Ollama)
   │        └─ No  → Qwen 2.5 Coder 7B/14B (local)
   └─ No → Budget matters?
            ├─ Yes → GLM-5 API ($0.11/M input) or Groq free tier
            └─ No  → Claude Sonnet 4.5 (best price/performance cloud)
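The same tree expressed as a small routing helper, handy if you script model selection. A sketch only: the argument names and returned model labels are illustrative, not a real API.

```python
# Sketch of the decision tree above as a routing function.
# Argument names and returned labels are illustrative, not an existing API.
def pick_model(best_quality: bool, free_or_private: bool,
               unified_memory_gb: int, budget_matters: bool) -> str:
    if best_quality:
        return "Claude Opus 4.5/4.6 (cloud)"
    if free_or_private:
        if unified_memory_gb >= 64:
            return "GLM-5 (quantized) or Qwen3-Coder (local via Ollama)"
        return "Qwen 2.5 Coder 7B/14B (local)"
    if budget_matters:
        return "GLM-5 API ($0.11/M input) or Groq free tier"
    return "Claude Sonnet 4.5 (best price/performance cloud)"

print(pick_model(best_quality=False, free_or_private=True,
                 unified_memory_gb=128, budget_matters=False))
# -> GLM-5 (quantized) or Qwen3-Coder (local via Ollama)
```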
Cost Per Million Tokens¶
| Model | Input | Output | Quality Tier |
|---|---|---|---|
| Groq (Llama 3.1 70B) | $0.59 | $0.79 | Good |
| DeepInfra (various) | $0.08 | Varies | Good |
| GLM-5 | $0.11 | $0.32 | Frontier |
| Claude Sonnet 4.5 | $3.00 | $15.00 | Frontier |
| Claude Opus 4.6 | $5.00 | $25.00 | Best |
| GPT-5.2 | $1.75 | ~$10.00 | Frontier |
Kimi K2.5 (The Agent Swarm Model)¶
Released: January 26, 2026
Creator: Moonshot AI
License: Modified MIT (full commercial use)
| Spec | Value |
|---|---|
| Total Parameters | 1 trillion (1T) |
| Active Parameters | 32B per token |
| Architecture | MoE -- 384 routed experts, 8 activated per token |
| Context Window | Up to 512K tokens |
| Vision | Native multimodal (MoonViT, 200M param vision encoder) |
| Unique Feature | Self-directed agent swarm -- orchestrates up to 100 sub-agents |
| Model Size on Disk | ~595 GB |
Benchmarks:
- SWE-bench Verified: 76.8% (close to GLM-5's 77.8%)
- AIME 2025 (Math): 96.1%
- Tool-use improvement: +20.1 pp (nearly 2x better than GPT-5.2)
- Beats Claude on tool-use, trails on coding
Pricing (ultra-cheap):
- Input: $0.10-0.60/M tokens (cache hit vs. miss)
- Output: ~$2.80/M tokens
- 8-50x cheaper than Claude Opus depending on caching (see the sketch below)
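Where the 8-50x range comes from: blend the cache-hit and cache-miss input rates by an assumed hit ratio and compare against Claude Opus input pricing ($5.00/M from the pricing table earlier). A rough sketch with illustrative hit rates:

```python
# Rough sketch: effective Kimi K2.5 input cost per million tokens at a given cache-hit rate,
# compared with Claude Opus input pricing ($5.00/M). Hit-rate values are illustrative.
KIMI_CACHE_HIT = 0.10   # $/M tokens on a cache hit
KIMI_CACHE_MISS = 0.60  # $/M tokens on a cache miss
OPUS_INPUT = 5.00       # $/M tokens, Claude Opus input

def kimi_input_cost(hit_rate: float) -> float:
    return hit_rate * KIMI_CACHE_HIT + (1 - hit_rate) * KIMI_CACHE_MISS

for hit_rate in (0.0, 0.5, 0.9, 1.0):
    cost = kimi_input_cost(hit_rate)
    print(f"hit rate {hit_rate:.0%}: ${cost:.2f}/M  ({OPUS_INPUT / cost:.0f}x cheaper than Opus input)")
# 0% hits  -> $0.60/M (~8x cheaper); 100% hits -> $0.10/M (~50x cheaper),
# which is the "8-50x cheaper depending on caching" range above.
```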
Availability: HuggingFace, Ollama (ollama pull kimi-k2.5), OpenRouter, Together AI, NVIDIA NIM
OpenClaw integration: Officially supported. Moonshot published a guide. OpenClaw announced free Kimi K2.5 access at "1/9 the cost."
Reddit says: "Oh good lord it's good" and "It can almost do 90% of what Claude Opus 4.5 can do." Zero production case studies yet as of Feb 2026. Privacy concern: Chinese company, data subject to Chinese regulations.
MiniMax M2.5 (The Budget Frontier Coder)¶
Released: February 2026
Creator: MiniMax (Chinese AI lab)
License: Open-source
| Spec | Value |
|---|---|
| SWE-bench Verified | 80.2% (higher than GLM-5's 77.8%) |
| Positioning | Budget alternative to Claude Opus |
| Cost | ~$10/month for typical OpenClaw usage (~20x cheaper than Claude) |
| Open-source | Yes -- full weights available |
Key takeaways from the AI Grid review:
- Beats GLM-5 on SWE-bench coding benchmarks
- Close to Claude Opus 4.5 on code generation tasks
- A fraction of the cost -- viable for budget OpenClaw setups
- Early community reports are positive, but production data is limited
OpenClaw integration: Use via OpenRouter (minimax/m2.5) or direct API. Good candidate for the "coding muscle" in a model routing setup.
Best Models for OpenClaw (Community Tested)¶
Real-world OpenClaw model recommendations from Twitter/X community (Feb 2026):
| Use | Recommended Model | Why | Source |
|---|---|---|---|
| Primary agent (budget) | Kimi K2.5 | Most-used model for OpenClaw, 256K context, agent-first design | @inqusit |
| Primary agent (quality) | Claude Opus 4.6 | Best reasoning, but $50-100/day API costs | @thaliabloomai |
| Primary agent (best value) | GLM-5 | "Seems better already compared to Kimi 2.5" | @philblines |
| Heartbeat/cron | Grok 4.1 FR or Haiku 4.5 | Cheap, fast, good enough for checks | @RichP |
| Fallback chain | GLM 4.7 → Kimi 2.5 | Multiple cheap models for resilience | @RichP |
| Code-only | GPT-5.2 | "Can fix code right away without needing another coding agent" | @DegenApe99 |
Community power user stack:
Primary: Gemini 3 Flash Preview (via OpenRouter)
Heartbeat: Grok 4.1 FR
Fallbacks: GLM 4.7, Kimi 2.5
Premium override: Claude Opus 4.6 (for complex tasks)
The multi-model experiment (one user on Mac Mini):
"Experimenting with Kimi / Opus / Codex / GLM on a Mac Mini + OpenClaw setup. I care less about 'which model is absolute smartest' and more about 'which ones I can run 24/7 without killing my wallet or my Mac Mini.'" -- @h_a_t_a_r_a_k_e
GLM-5 vs Kimi K2.5 Head-to-Head (For OpenClaw)¶
The two cheapest frontier models. Which one to use?
| Dimension | GLM-5 | Kimi K2.5 | Winner |
|---|---|---|---|
| SWE-Bench Verified | 77.8% | 76.8% | GLM-5 |
| Context Window | 200K | 256K | Kimi K2.5 |
| Hallucination Rate | Lowest (record) | Standard | GLM-5 |
| Input Cost ($/M) | $0.11-0.80 | $0.10-0.45 | Kimi K2.5 (cached) |
| Output Cost ($/M) | $0.32-2.56 | $2.50-2.80 | GLM-5 (cheaper) |
| Agent Design | General purpose | Agent-first (swarms) | Kimi K2.5 |
| Multimodal | Text-only | Native vision | Kimi K2.5 |
| Sub-Agent Swarms | No | Yes (100 sub-agents) | Kimi K2.5 |
| Math (AIME) | 95% | 96.1% | Tie |
| Tool Calling | Good | +20.1 pp vs GPT-5.2 | Kimi K2.5 |
Verdict: GLM-5 wins on raw coding quality and low hallucination. Kimi K2.5 wins for agent use (longer context, native vision, swarm orchestration, better tool calling). For OpenClaw specifically, Kimi K2.5 is the better primary agent model, while GLM-5 is the better coding sub-agent.
Real-World Reliability Test (Feb 14, 2026)¶
Source: @LufzzLiz -- Same prompt across 5 models: "Check today's Shanghai weather, write a morning greeting based on it, save to workspace/greeting.txt"
| Model | Result | Quality | Speed | Notes |
|---|---|---|---|---|
| Claude Opus 4.6 | Pass | Best — noticed it was evening + Valentine's Day, adapted greeting | Normal | Worth the premium for quality-critical |
| GLM-5 | Pass | Good — also noticed Valentine's Day | Slow | Quality there, speed isn't |
| MiniMax M2.5 | Pass | Good | Normal | Stable, reliable |
| Gemini Flash | Pass | Basic but correct | Fastest | Best reliability + speed + cost ratio |
| Kimi K2.5 | Failed | N/A | N/A | Infinite loop querying weather → rate limit → task crashed |
@LufzzLiz recommendation: Primary: Gemini Flash (fast + cheap + reliable). High-end: Claude Opus 4.6. Backup: MiniMax M2.5.
Community consensus on reliability (aggregated):
- @cdz_solo: "Used GLM-5 for a day, MiniMax M2.5 for a day. Went back to Kimi K2.5" (prefers Kimi despite flaws)
- @takafirstpen: "OpenClaw usability: Kimi K2.5 > GLM-5"
- @llt139574: "Kimi K2.5 is hard to use... code doesn't work, doesn't understand local code environment"
- @dddanielwang: "Only GLM-5 and Kimi 2.5 are economical with decent recall. For a good brain you still need Opus/GPT"
- @MartinSzerment: "Kimi K2.5 matched Opus 4.5 performance at 1/8th cost. Top model on OpenClaw and OpenRouter"
Bottom line for 24/7 agents: Kimi K2.5 is the most popular but can be unreliable (loops, rate limits). GLM-5 is higher quality but slow. For always-on reliability, Gemini Flash or MiniMax M2.5 are safer. Use Opus for quality-critical decisions.
Open-Source Progression Strategy¶
The trend: start with cloud APIs, progressively self-host as models improve and your hardware arrives.
Phase 1 — Cloud APIs (Now)
├── Primary: Kimi K2.5 via OpenRouter ($0.10/M)
├── Fallback: MiniMax M2.5 or Gemini Flash
├── Premium: Claude Opus 4.6 for complex reasoning
└── Cost: $20-100/mo API
Phase 2 — Hybrid (Self-host + Cloud)
├── Self-host: GLM-5 or MiniMax M2.5 on your server (EPYC, Mac Studio, etc.)
├── Local handles: 80-90% of agent tasks (heartbeat, cron, routine)
├── Cloud handles: 10-20% quality-critical (Opus for strategy, architecture)
└── Cost: $10-30/mo API + one-time hardware
Phase 3 — Mostly Local (Target)
├── 90%+ local open-source models
├── Cloud only for: refinement, error recovery, complex reasoning
├── Keep Claude Code subscriptions for coding (highest quality)
└── Cost: Near-zero API + subscriptions
Hardware needed for self-hosting:
- GLM-5 (quantized): 128-192GB unified memory or 2-4x RTX 4090
- MiniMax M2.5 (8-bit): ~48GB VRAM (2x RTX 4090 or M3 Ultra 512GB) -- @Patrick1Kennedy confirmed working
- Kimi K2.5: ~595GB on disk, needs substantial hardware
- Qwen3-Coder: 48-64GB for the 32B version, the most practical local option
$2K/Month Budget: Maximum ROI Model Routing¶
For a user willing to spend $2,000/mo on API costs, here's the optimal strategy:
┌─────────────────────────────────────────────────┐
│ OPTIMAL MODEL ROUTING │
│ ($2,000/month budget) │
├─────────────────────────────────────────────────┤
│ SUBSCRIPTIONS (fixed monthly): │
│ → 3x Claude Max $200 ($600) — Opus for reasoning│
│ → 1x ChatGPT Plus $20 ($20) — Codex 5.3 coding │
│ → 1x Codex Pro $200 ($200) — heavy async tasks │
│ → Gemini CLI free ($0) — planning, large context │
│ → Subtotal: $820/mo fixed │
├─────────────────────────────────────────────────┤
│ API BUDGET (~$1,180/mo remaining): │
├─────────────────────────────────────────────────┤
│ TIER 1: OpenClaw primary brain (~$400/mo) │
│ → Kimi K2.5 ($0.10/M cached) or GLM-5 ($0.11/M) │
│ → For: 24/7 agent operations, research, crons │
├─────────────────────────────────────────────────┤
│ TIER 2: Coding sub-agents (~$300/mo) │
│ → GPT-5.3-Codex via API ($1.75/M) + MiniMax M2.5│
│ → For: delegated coding, debugging, refactoring │
├─────────────────────────────────────────────────┤
│ TIER 3: Heartbeat/trivial (~$30/mo) │
│ → Gemini Flash (near-free) or Haiku 4.5 │
│ → For: health checks, simple queries, 30min beat │
├─────────────────────────────────────────────────┤
│ TIER 4: Buffer (~$450/mo) │
│ → Overflow, experimentation, spikes │
└─────────────────────────────────────────────────┘
Why Codex 5.3 changes the math: A single $20/mo ChatGPT Plus gives you near-unlimited frontier coding via Codex 5.3 (limits doubled until April 2026). This replaces hundreds of dollars in API coding costs. @phl43: "I have yet to run against the usage limits."
At $2K/mo you can process ~2.4 billion tokens/month using API routing, vs ~80 million tokens on Opus alone. That's 30x more work for the same money. Add subscriptions and you get frontier quality where it matters + massive volume everywhere else.
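The 30x figure is plain arithmetic on the prices above. A rough sketch; the blended $/M rates are illustrative assumptions (output-heavy vs. input-heavy mixes shift them):

```python
# Rough check of the ~30x throughput claim at a $2,000/mo API budget.
# Blended $/M rates are illustrative assumptions based on the pricing tables above.
BUDGET = 2_000  # USD per month

# Claude Opus 4.6: $5/M in, $25/M out -> output-heavy, all-Opus usage is roughly $25/M blended.
opus_tokens = BUDGET / 25 * 1e6
# Routed mix (Kimi K2.5 / GLM-5 for volume, cheap models for heartbeat): assume ~$0.83/M blended.
routed_tokens = BUDGET / 0.83 * 1e6

print(f"Opus-only:  ~{opus_tokens / 1e6:,.0f}M tokens/month")    # ~80M
print(f"Routed mix: ~{routed_tokens / 1e9:,.1f}B tokens/month")  # ~2.4B
print(f"Ratio:      ~{routed_tokens / opus_tokens:.0f}x")        # ~30x
```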
Critical Gotchas (Model-Specific)¶
| Model | Gotcha | Impact |
|---|---|---|
| Kimi K2.5 | 2.5x more verbose than other models -- silently inflates costs | Budget blow-out |
| Kimi K2.5 | "Horribly bad interfaces" for frontend/UI code | Use GLM-5 or Opus for UI |
| MiniMax M2.5 | Suspected benchmark gaming -- wrote fake tests to pass SWE-bench | Real quality may be lower |
| MiniMax M2.5 | Only 128K context (vs 200K GLM-5, 256K Kimi) | Fails on large codebases |
| GLM-5 | No vision/multimodal support yet | Can't analyze screenshots |
| GLM-5 | Struggles with frontend/UI generation | Strong backend, weak frontend |
| All models | OpenClaw sends heartbeats to primary model by default | #1 budget waste -- always configure heartbeat model separately |
Use MiniMax M2.5-lightning (not regular M2.5) for sub-agents -- same quality, faster output.
Sample OpenClaw Config ($2K/mo Budget)¶
{
"agents": {
"defaults": {
"model": {
"primary": "openrouter/z-ai/glm-5",
"fallbacks": ["openrouter/moonshotai/kimi-k2.5", "deepseek/deepseek-chat"]
},
"heartbeat": { "every": "30m", "model": "deepseek/deepseek-chat" },
"subagents": { "model": "openrouter/minimax/minimax-m2-5-lightning" }
}
}
}
Honest Assessment¶
- Claude Opus 4.5/4.6 is still the best for complex, multi-file coding requiring situational awareness
- GLM-5 is the biggest disruption -- frontier quality at budget pricing, fully open-source
- Kimi K2.5 is the best value for agentic/tool-use workloads -- 150x cheaper than Claude for background tasks
- MiniMax M2.5 is the dark horse -- 80.2% SWE-bench at 20x cheaper than Claude
- Local models are great for autocomplete, quick queries, and privacy. Not yet a full replacement for frontier cloud models on complex tasks.
- The gap is closing rapidly. What required H100 clusters 18 months ago now runs on a Mac Mini.