LLM Model Landscape (February 2026)¶
The frontier is moving fast. This document covers the current state of the art for both cloud and local models.
Last updated: February 14, 2026
Table of Contents¶
- GLM-5 (Just Released)
- Cloud Frontier Models
- Open-Source Rankings
- Best Coding Models
- Brand New Releases
- Model Selection Guide
GLM-5 (Just Released)¶
Released: February 11, 2026
Creator: Zhipu AI (Z.AI) -- Tsinghua University spinoff, Hong Kong IPO Jan 2026 ($558M raised)
License: MIT (full commercial use)
Specifications¶
| Spec | Value |
|---|---|
| Total Parameters | ~744-754 billion |
| Active Parameters | ~40-44 billion per token |
| Architecture | Mixture of Experts (MoE) -- 256 experts, 8 activated |
| Context Window | 200K tokens |
| Max Output | 131,000 tokens |
| Training Data | 28.5 trillion tokens |
| Training Hardware | Huawei Ascend 910 (zero NVIDIA dependency) |
| Model Size on Disk | 1.51 TB |
| Predecessor | GLM-4.7 (368B total / ~32B active) |
Key Innovation¶
"Slime" RL technique -- novel reinforcement learning method achieving record-low hallucination rate among frontier models.
Benchmarks¶
| Benchmark | GLM-5 | Claude Opus 4.5 | GPT-5.2 |
|---|---|---|---|
| SWE-bench Verified | 77.8% | 80.9% | 76.2% |
| Terminal-Bench 2.0 | 56.2% | 59.3% | 54.2% |
| Humanity's Last Exam | 50.4 | 43.4 | 45.8 |
| BrowseComp | 75.9 (#1) | 67.8 | 59.2 |
| AI Intelligence Index | 50 (#1 of 64 models) | - | - |
- #1 open-source model on SWE-bench and BrowseComp
- Beats GPT-5.2 on most benchmarks
- Trails Claude Opus 4.5 on SWE-bench and Terminal-Bench
- Beats Claude on Humanity's Last Exam and BrowseComp
Pricing (The Disruption)¶
| Model | Input/1M tokens | Output/1M tokens |
|---|---|---|
| GLM-5 | ~$0.11 | ~$0.32 |
| Claude Opus 4.6 | $5.00 | $25.00 |
| GPT-5.2 | $1.75 | ~$10.00 |
At the rates above, GLM-5 is roughly 45x cheaper than Claude Opus on input and roughly 78x cheaper on output; against GLM-5's top-tier output price of $2.56/M (see the head-to-head table below), the output gap is closer to 10x.
Availability¶
- Hugging Face: `zai-org/GLM-5` (full weights)
- Ollama: `ollama pull glm-5`
- OpenRouter: `z-ai/glm-5` (usage sketch below)
- Z.AI Platform: chat.z.ai
- vLLM: Officially supported
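For API access, OpenRouter exposes GLM-5 behind its OpenAI-compatible chat-completions endpoint. A minimal sketch using the openai Python client; the model slug `z-ai/glm-5` comes from the list above, and the `OPENROUTER_API_KEY` environment variable name is an assumption:

```python
# Minimal sketch: calling GLM-5 through OpenRouter's OpenAI-compatible API.
# Assumes the `openai` Python package and an OPENROUTER_API_KEY env var (name assumed).
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # OpenRouter's OpenAI-compatible endpoint
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="z-ai/glm-5",  # slug from the availability list above
    messages=[{"role": "user", "content": "Write a Python function that reverses a linked list."}],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```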
Early Assessment¶
Strong on structured tasks (coding, math, reasoning benchmarks). May lag behind Claude Opus on nuanced, context-heavy, "situationally aware" tasks. Community consensus: "exceptionally capable but far less situationally aware."
Note: GLM-5 at 1.51 TB requires significant hardware for local deployment. A Mac Studio M3 Ultra with 512GB cannot fit the unquantized model. Heavy quantization (Q2-Q3) is needed for consumer hardware, which degrades quality.
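A rough way to size quantized variants: on-disk footprint scales with total parameters times bits per weight. A back-of-the-envelope sketch (it ignores embedding tables, quantization scales, and other metadata, so real GGUF sizes will differ somewhat):

```python
# Back-of-the-envelope size estimate for quantized GLM-5 weights.
# All ~744B parameters must be stored even though only ~40B are active per token (MoE).
PARAMS = 744e9  # total parameter count from the spec table above

def approx_size_gb(bits_per_weight: float) -> float:
    """Rough on-disk size: params * bits / 8 (ignores embeddings, scales, metadata)."""
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bits in [("FP16/BF16", 16), ("Q8", 8), ("Q4", 4), ("Q3", 3), ("Q2", 2)]:
    print(f"{name:10s} ~{approx_size_gb(bits):6.0f} GB")
# FP16 works out to ~1.49 TB, consistent with the 1.51 TB figure above;
# even Q2-Q3 still needs roughly 190-280 GB of memory, hence 512GB-class hardware.
```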
Cloud Frontier Models¶
| Rank | Model | Creator | Key Strengths |
|---|---|---|---|
| 1 | Claude Opus 4.5/4.6 | Anthropic | Top coding (SWE-bench 80.9%), best "vibes", situational awareness |
| 2 | GPT-5.2 | OpenAI | Strong all-rounder, 400K context, Codex integration |
| 3 | Gemini 2.5/3.0 Pro | Google | Massive context windows, multimodal, strong reasoning |
| 4 | Grok 4 | xAI | Competitive reasoning, real-time data access |
| 5 | Claude Sonnet 4.5 | Anthropic | Best price/performance for coding |
| 6 | GLM-5 | Zhipu AI | Best open-weight frontier, MIT license, ultra-cheap API |
Open-Source Rankings¶
By Coding Performance (LiveCodeBench + SWE-bench)¶
| Rank | Model | Creator | Params | Coding Score |
|---|---|---|---|---|
| 1 | DeepSeek V3.2 Speciale | DeepSeek | - | 90% LiveCodeBench |
| 2 | GLM-5 | Zhipu AI | 744B MoE | 77.8% SWE-bench (#1 open-source) |
| 3 | gpt-oss-120B | OpenAI | 120B | 88% LiveCodeBench |
| 4 | MiMo-V2-Flash | Xiaomi | - | 87% LiveCodeBench |
| 5 | Kimi K2.5 | MoonshotAI | 1T MoE (32B active) | 76.8% SWE-bench, 96.1% AIME |
| 6 | MiniMax M2.5 | MiniMax | - | 80.2% SWE-bench, 20x cheaper than Claude |
By Math Performance (AIME 2025)¶
| Model | Score |
|---|---|
| DeepSeek V3.2 Speciale | 97% |
| Kimi K2.5 (Reasoning) | 96% |
| MiMo-V2-Flash | 96% |
| GLM-5 | 95% |
| gpt-oss-120B | 93% |
| DeepSeek V3.2 | 92% |
Best Coding Models¶
For Cloud API Usage¶
| Use Case | Best Model | Why |
|---|---|---|
| Best quality overall | Claude Opus 4.5/4.6 | SWE-bench 80.9%, best situational awareness |
| Best value (subscription) | GPT-5.3-Codex | $20/mo Plus includes frontier coding. Limits doubled until April 2026. |
| Best price/performance (API) | Claude Sonnet 4.5 | Strong coding at lower cost |
| Cheapest frontier-quality (API) | GLM-5 | 45x cheaper than Claude, SWE-bench 77.8% |
| Async task delegation | OpenAI Codex (GPT-5.3) | Background task execution, cloud sandboxed VMs |
GPT-5.3-Codex for OpenClaw (community-validated):
- $20/mo ChatGPT Plus includes GPT-5.3-Codex (low/medium/high/xhigh tiers)
- @_karimelk: "running my openclaw on GPT-5.3-codex medium with excellent results"
- @Shenoy465653734: "OpenClaw's power is realised with frontier models only i.e 5.3 Codex and Opus 4.6"
- Limits 2x more generous than Claude Pro. Near-unlimited for coding at $20/mo.
- See Subscriptions Guide for full pricing.
For Local Deployment¶
| VRAM Budget | Best Model | Notes |
|---|---|---|
| 8GB | Qwen 2.5 Coder 7B | 88.4% HumanEval |
| 16GB | Qwen 2.5 Coder 14B | Strong small model |
| 24GB | Qwen3 30B (Q4) | Competitive quality |
| 48-64GB | Qwen3-Coder 32B | Near-frontier, practical |
| 128GB+ | Qwen3-next-80B | Best speed/quality balance |
| 192GB+ | Qwen-3 235B | Frontier-class local |
| 512GB+ | GLM-5 (quantized), DeepSeek V3 | Largest models |
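Any of the local models above can be served through Ollama, which exposes an HTTP API on localhost once a model has been pulled. A minimal sketch for the 8GB-tier pick; the exact model tag in the Ollama library may differ:

```python
# Minimal sketch: querying a locally served model through Ollama's HTTP API.
# Assumes Ollama is running and the model has been pulled first,
# e.g. `ollama pull qwen2.5-coder:7b` (tag name may vary), plus the `requests` package.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",  # Ollama's default local endpoint
    json={
        "model": "qwen2.5-coder:7b",
        "messages": [{"role": "user", "content": "Write a Python function to parse an ISO-8601 date."}],
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=300,
)
print(resp.json()["message"]["content"])
```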
Best for Claude Code CLI (Local)¶
- GLM-5 -- Latest frontier open-source, MIT license, top community pick
- Qwen3-Coder (30B-A3B MoE) -- Best efficiency for smaller setups
- GPT-OSS 120B -- Strong coder, Apache 2.0
- Devstral-small -- Good for agentic coding on limited hardware
Brand New Releases (Last 4 Weeks)¶
| Model | Date | Creator | Notable |
|---|---|---|---|
| GLM-5 | Feb 11, 2026 | Zhipu AI | 744B MoE, MIT, #1 open-source |
| MiniMax M2.5 | Feb 2026 | MiniMax | 80.2% SWE-bench, 20x cheaper than Claude |
| DeepSeek V3.2 Speciale | Jan-Feb 2026 | DeepSeek | 90% LiveCodeBench (highest open-source) |
| gpt-oss-120B | Recent | OpenAI | Apache 2.0; OpenAI's first serious open-weight release |
| MiMo-V2-Flash | Recent | Xiaomi | 87% coding, 96% math |
| Qwen3-Coder-480B | Recent | Alibaba | Agentic coding focused |
| Kimi K2.5 | Jan 26, 2026 | MoonshotAI | 1T MoE, 32B active, native multimodal, agent swarms |
| Llama 4 Scout/Maverick | Recent | Meta | Natively multimodal, open-source |
Model Selection Guide¶
Decision Tree¶
Need absolute best quality?
├─ Yes → Claude Opus 4.5/4.6 (cloud)
└─ No → Need it free / private?
   ├─ Yes → Have 64GB+ unified memory?
   │        ├─ Yes → GLM-5 (quantized) or Qwen3-Coder (local via Ollama)
   │        └─ No  → Qwen 2.5 Coder 7B/14B (local)
   └─ No → Budget matters?
            ├─ Yes → GLM-5 API ($0.11/M input) or Groq free tier
            └─ No  → Claude Sonnet 4.5 (best price/performance cloud)
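The same tree expressed as a small routing helper, handy if you script model selection. A sketch only: the argument names and returned model labels are illustrative, not a real API.

```python
# Sketch of the decision tree above as a routing function.
# Argument names and returned labels are illustrative, not an existing API.
def pick_model(best_quality: bool, free_or_private: bool,
               unified_memory_gb: int, budget_matters: bool) -> str:
    if best_quality:
        return "Claude Opus 4.5/4.6 (cloud)"
    if free_or_private:
        if unified_memory_gb >= 64:
            return "GLM-5 (quantized) or Qwen3-Coder (local via Ollama)"
        return "Qwen 2.5 Coder 7B/14B (local)"
    if budget_matters:
        return "GLM-5 API ($0.11/M input) or Groq free tier"
    return "Claude Sonnet 4.5 (best price/performance cloud)"

print(pick_model(best_quality=False, free_or_private=True,
                 unified_memory_gb=128, budget_matters=False))
# -> GLM-5 (quantized) or Qwen3-Coder (local via Ollama)
```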
Cost Per Million Tokens¶
| Model | Input | Output | Quality Tier |
|---|---|---|---|
| Groq (Llama 3.1 70B) | $0.59 | $0.79 | Good |
| DeepInfra (various) | $0.08 | Varies | Good |
| GLM-5 | $0.11 | $0.32 | Frontier |
| Claude Sonnet 4.5 | $3.00 | $15.00 | Frontier |
| Claude Opus 4.6 | $5.00 | $25.00 | Best |
| GPT-5.2 | $1.75 | ~$10.00 | Frontier |
Kimi K2.5 (The Agent Swarm Model)¶
Released: January 26, 2026
Creator: Moonshot AI
License: Modified MIT (full commercial use)
| Spec | Value |
|---|---|
| Total Parameters | 1 trillion (1T) |
| Active Parameters | 32B per token |
| Architecture | MoE -- 384 routed experts, 8 activated per token |
| Context Window | Up to 512K tokens |
| Vision | Native multimodal (MoonViT, 200M param vision encoder) |
| Unique Feature | Self-directed agent swarm -- orchestrates up to 100 sub-agents |
| Model Size on Disk | ~595 GB |
Benchmarks:
- SWE-bench Verified: 76.8% (close to GLM-5's 77.8%)
- AIME 2025 (Math): 96.1%
- Tool-use improvement: +20.1 pp (nearly 2x better than GPT-5.2)
- Beats Claude on tool-use, trails on coding
Pricing (ultra-cheap):
- Input: $0.10-0.60/M tokens (cache hit vs. miss)
- Output: ~$2.80/M tokens
- 8-50x cheaper than Claude Opus depending on caching (see the sketch below)
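Where the 8-50x range comes from: blend the cache-hit and cache-miss input rates by an assumed hit ratio and compare against Claude Opus input pricing ($5.00/M from the pricing table earlier). A rough sketch with illustrative hit rates:

```python
# Rough sketch: effective Kimi K2.5 input cost per million tokens at a given cache-hit rate,
# compared with Claude Opus input pricing ($5.00/M). Hit-rate values are illustrative.
KIMI_CACHE_HIT = 0.10   # $/M tokens on a cache hit
KIMI_CACHE_MISS = 0.60  # $/M tokens on a cache miss
OPUS_INPUT = 5.00       # $/M tokens, Claude Opus input

def kimi_input_cost(hit_rate: float) -> float:
    return hit_rate * KIMI_CACHE_HIT + (1 - hit_rate) * KIMI_CACHE_MISS

for hit_rate in (0.0, 0.5, 0.9, 1.0):
    cost = kimi_input_cost(hit_rate)
    print(f"hit rate {hit_rate:.0%}: ${cost:.2f}/M  ({OPUS_INPUT / cost:.0f}x cheaper than Opus input)")
# 0% hits  -> $0.60/M (~8x cheaper); 100% hits -> $0.10/M (~50x cheaper),
# which is the "8-50x cheaper depending on caching" range above.
```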
Availability: HuggingFace, Ollama (ollama pull kimi-k2.5), OpenRouter, Together AI, NVIDIA NIM
OpenClaw integration: Officially supported. Moonshot published a guide. OpenClaw announced free Kimi K2.5 access at "1/9 the cost."
Reddit says: "Oh good lord it's good" and "It can almost do 90% of what Claude Opus 4.5 can do." Zero production case studies yet as of Feb 2026. Privacy concern: Chinese company, data subject to Chinese regulations.
MiniMax M2.5 (The Budget Frontier Coder)¶
Released: February 2026
Creator: MiniMax (Chinese AI lab)
License: Open-source
| Spec | Value |
|---|---|
| SWE-bench Verified | 80.2% (higher than GLM-5's 77.8%) |
| Positioning | Budget alternative to Claude Opus |
| Cost | ~$10/month for typical OpenClaw usage (~20x cheaper than Claude) |
| Open-source | Yes -- full weights available |
Key takeaways from the AI Grid review:
- Beats GLM-5 on SWE-bench coding benchmarks
- Close to Claude Opus 4.5 on code generation tasks
- A fraction of the cost -- viable for budget OpenClaw setups
- Early community reports are positive, but production data is limited
OpenClaw integration: Use via OpenRouter (minimax/m2.5) or direct API. Good candidate for the "coding muscle" in a model routing setup.
Best Models for OpenClaw (Community Tested)¶
Real-world OpenClaw model recommendations from Twitter/X community (Feb 2026):
| Use | Recommended Model | Why | Source |
|---|---|---|---|
| Primary agent (budget) | Kimi K2.5 | Most-used model for OpenClaw, 256K context, agent-first design | @inqusit |
| Primary agent (quality) | Claude Opus 4.6 | Best reasoning, but $50-100/day API costs | @thaliabloomai |
| Primary agent (best value) | GLM-5 | "Seems better already compared to Kimi 2.5" | @philblines |
| Heartbeat/cron | Grok 4.1 FR or Haiku 4.5 | Cheap, fast, good enough for checks | @RichP |
| Fallback chain | GLM 4.7 → Kimi 2.5 | Multiple cheap models for resilience | @RichP |
| Code-only | GPT-5.2 | "Can fix code right away without needing another coding agent" | @DegenApe99 |
Community power user stack:
Primary: Gemini 3 Flash Preview (via OpenRouter)
Heartbeat: Grok 4.1 FR
Fallbacks: GLM 4.7, Kimi 2.5
Premium override: Claude Opus 4.6 (for complex tasks)
The multi-model experiment (one user on Mac Mini):
"Experimenting with Kimi / Opus / Codex / GLM on a Mac Mini + OpenClaw setup. I care less about 'which model is absolute smartest' and more about 'which ones I can run 24/7 without killing my wallet or my Mac Mini.'" -- @h_a_t_a_r_a_k_e
GLM-5 vs Kimi K2.5 Head-to-Head (For OpenClaw)¶
The two cheapest frontier models. Which one to use?
| Dimension | GLM-5 | Kimi K2.5 | Winner |
|---|---|---|---|
| SWE-Bench Verified | 77.8% | 76.8% | GLM-5 |
| Context Window | 200K | 256K | Kimi K2.5 |
| Hallucination Rate | Lowest (record) | Standard | GLM-5 |
| Input Cost ($/M) | $0.11-0.80 | $0.10-0.45 | Kimi K2.5 (cached) |
| Output Cost ($/M) | $0.32-2.56 | $2.50-2.80 | GLM-5 (cheaper) |
| Agent Design | General purpose | Agent-first (swarms) | Kimi K2.5 |
| Multimodal | Text-only | Native vision | Kimi K2.5 |
| Sub-Agent Swarms | No | Yes (100 sub-agents) | Kimi K2.5 |
| Math (AIME) | 95% | 96.1% | Tie |
| Tool Calling | Good | +20.1 pp vs GPT-5.2 | Kimi K2.5 |
Verdict: GLM-5 wins on raw coding quality and low hallucination. Kimi K2.5 wins for agent use (longer context, native vision, swarm orchestration, better tool calling). For OpenClaw specifically, Kimi K2.5 is the better primary agent model, while GLM-5 is the better coding sub-agent.
Real-World Reliability Test (Feb 14, 2026)¶
Source: @LufzzLiz -- Same prompt across 5 models: "Check today's Shanghai weather, write a morning greeting based on it, save to workspace/greeting.txt"
| Model | Result | Quality | Speed | Notes |
|---|---|---|---|---|
| Claude Opus 4.6 | Pass | Best — noticed it was evening + Valentine's Day, adapted greeting | Normal | Worth the premium for quality-critical |
| GLM-5 | Pass | Good — also noticed Valentine's Day | Slow | Quality there, speed isn't |
| MiniMax M2.5 | Pass | Good | Normal | Stable, reliable |
| Gemini Flash | Pass | Basic but correct | Fastest | Best reliability + speed + cost ratio |
| Kimi K2.5 | Failed | N/A | N/A | Infinite loop querying weather → rate limit → task crashed |
@LufzzLiz recommendation: Primary: Gemini Flash (fast + cheap + reliable). High-end: Claude Opus 4.6. Backup: MiniMax M2.5.
Community consensus on reliability (aggregated):
- @cdz_solo: "Used GLM-5 for a day, MiniMax M2.5 for a day. Went back to Kimi K2.5" (prefers Kimi despite flaws)
- @takafirstpen: "OpenClaw usability: Kimi K2.5 > GLM-5"
- @llt139574: "Kimi K2.5 is hard to use... code doesn't work, doesn't understand local code environment"
- @dddanielwang: "Only GLM-5 and Kimi 2.5 are economical with decent recall. For a good brain you still need Opus/GPT"
- @MartinSzerment: "Kimi K2.5 matched Opus 4.5 performance at 1/8th cost. Top model on OpenClaw and OpenRouter"
Bottom line for 24/7 agents: Kimi K2.5 is the most popular but can be unreliable (loops, rate limits). GLM-5 is higher quality but slow. For always-on reliability, Gemini Flash or MiniMax M2.5 are safer. Use Opus for quality-critical decisions.
Open-Source Progression Strategy¶
The trend: start with cloud APIs, progressively self-host as models improve and your hardware arrives.
Phase 1 — Cloud APIs (Now)
├── Primary: Kimi K2.5 via OpenRouter ($0.10/M)
├── Fallback: MiniMax M2.5 or Gemini Flash
├── Premium: Claude Opus 4.6 for complex reasoning
└── Cost: $20-100/mo API
Phase 2 — Hybrid (Self-host + Cloud)
├── Self-host: GLM-5 or MiniMax M2.5 on your server (EPYC, Mac Studio, etc.)
├── Local handles: 80-90% of agent tasks (heartbeat, cron, routine)
├── Cloud handles: 10-20% quality-critical (Opus for strategy, architecture)
└── Cost: $10-30/mo API + one-time hardware
Phase 3 — Mostly Local (Target)
├── 90%+ local open-source models
├── Cloud only for: refinement, error recovery, complex reasoning
├── Keep Claude Code subscriptions for coding (highest quality)
└── Cost: Near-zero API + subscriptions
Hardware needed for self-hosting:
- GLM-5 (quantized): 128-192GB unified memory or 2-4x RTX 4090
- MiniMax M2.5 (8-bit): ~48GB VRAM (2x RTX 4090 or M3 Ultra 512GB) -- @Patrick1Kennedy confirmed working
- Kimi K2.5: ~595GB on disk, needs substantial hardware
- Qwen3-Coder: 48-64GB for the 32B version, the most practical local option
$2K/Month Budget: Maximum ROI Model Routing¶
For a user willing to spend $2,000/mo on API costs, here's the optimal strategy:
┌─────────────────────────────────────────────────┐
│ OPTIMAL MODEL ROUTING │
│ ($2,000/month budget) │
├─────────────────────────────────────────────────┤
│ SUBSCRIPTIONS (fixed monthly): │
│ → 3x Claude Max $200 ($600) — Opus for reasoning│
│ → 1x ChatGPT Plus $20 ($20) — Codex 5.3 coding │
│ → 1x Codex Pro $200 ($200) — heavy async tasks │
│ → Gemini CLI free ($0) — planning, large context │
│ → Subtotal: $820/mo fixed │
├─────────────────────────────────────────────────┤
│ API BUDGET (~$1,180/mo remaining): │
├─────────────────────────────────────────────────┤
│ TIER 1: OpenClaw primary brain (~$400/mo) │
│ → Kimi K2.5 ($0.10/M cached) or GLM-5 ($0.11/M) │
│ → For: 24/7 agent operations, research, crons │
├─────────────────────────────────────────────────┤
│ TIER 2: Coding sub-agents (~$300/mo) │
│ → GPT-5.3-Codex via API ($1.75/M) + MiniMax M2.5│
│ → For: delegated coding, debugging, refactoring │
├─────────────────────────────────────────────────┤
│ TIER 3: Heartbeat/trivial (~$30/mo) │
│ → Gemini Flash (near-free) or Haiku 4.5 │
│ → For: health checks, simple queries, 30min beat │
├─────────────────────────────────────────────────┤
│ TIER 4: Buffer (~$450/mo) │
│ → Overflow, experimentation, spikes │
└─────────────────────────────────────────────────┘
Why Codex 5.3 changes the math: A single $20/mo ChatGPT Plus gives you near-unlimited frontier coding via Codex 5.3 (limits doubled until April 2026). This replaces hundreds of dollars in API coding costs. @phl43: "I have yet to run against the usage limits."
At $2K/mo you can process ~2.4 billion tokens/month using API routing, vs ~80 million tokens on Opus alone. That's 30x more work for the same money. Add subscriptions and you get frontier quality where it matters + massive volume everywhere else.
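The 30x figure is plain arithmetic on the prices above. A rough sketch; the blended $/M rates are illustrative assumptions (output-heavy vs. input-heavy mixes shift them):

```python
# Rough check of the ~30x throughput claim at a $2,000/mo API budget.
# Blended $/M rates are illustrative assumptions based on the pricing tables above.
BUDGET = 2_000  # USD per month

# Claude Opus 4.6: $5/M in, $25/M out -> output-heavy, all-Opus usage is roughly $25/M blended.
opus_tokens = BUDGET / 25 * 1e6
# Routed mix (Kimi K2.5 / GLM-5 for volume, cheap models for heartbeat): assume ~$0.83/M blended.
routed_tokens = BUDGET / 0.83 * 1e6

print(f"Opus-only:  ~{opus_tokens / 1e6:,.0f}M tokens/month")    # ~80M
print(f"Routed mix: ~{routed_tokens / 1e9:,.1f}B tokens/month")  # ~2.4B
print(f"Ratio:      ~{routed_tokens / opus_tokens:.0f}x")        # ~30x
```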
Critical Gotchas (Model-Specific)¶
| Model | Gotcha | Impact |
|---|---|---|
| Kimi K2.5 | 2.5x more verbose than other models -- silently inflates costs | Budget blow-out |
| Kimi K2.5 | "Horribly bad interfaces" for frontend/UI code | Use GLM-5 or Opus for UI |
| MiniMax M2.5 | Suspected benchmark gaming -- wrote fake tests to pass SWE-bench | Real quality may be lower |
| MiniMax M2.5 | Only 128K context (vs 200K GLM-5, 256K Kimi) | Fails on large codebases |
| GLM-5 | No vision/multimodal support yet | Can't analyze screenshots |
| GLM-5 | Struggles with frontend/UI generation | Strong backend, weak frontend |
| All models | OpenClaw sends heartbeats to primary model by default | #1 budget waste -- always configure heartbeat model separately |
Use MiniMax M2.5-lightning (not regular M2.5) for sub-agents -- same quality, faster output.
Sample OpenClaw Config ($2K/mo Budget)¶
{
"agents": {
"defaults": {
"model": {
"primary": "openrouter/z-ai/glm-5",
"fallbacks": ["openrouter/moonshotai/kimi-k2.5", "deepseek/deepseek-chat"]
},
"heartbeat": { "every": "30m", "model": "deepseek/deepseek-chat" },
"subagents": { "model": "openrouter/minimax/minimax-m2-5-lightning" }
}
}
}
Honest Assessment¶
- Claude Opus 4.5/4.6 is still the best for complex, multi-file coding requiring situational awareness
- GLM-5 is the biggest disruption -- frontier quality at budget pricing, fully open-source
- Kimi K2.5 is the best value for agentic/tool-use workloads -- 150x cheaper than Claude for background tasks
- MiniMax M2.5 is the dark horse -- 80.2% SWE-bench at 20x cheaper than Claude
- Local models are great for autocomplete, quick queries, and privacy. Not yet a full replacement for frontier cloud models on complex tasks.
- The gap is closing rapidly. What required H100 clusters 18 months ago now runs on a Mac Mini.