LLM Model Landscape (February 2026)

The frontier is moving fast. This document covers the current state of the art for both cloud and local models.

Last updated: February 14, 2026


GLM-5 (Just Released)

Released: February 11, 2026
Creator: Zhipu AI (Z.AI) -- Tsinghua University spinoff; Hong Kong IPO January 2026 ($558M raised)
License: MIT (full commercial use)

Specifications

| Spec | Value |
|---|---|
| Total Parameters | ~744-754 billion |
| Active Parameters | ~40-44 billion per token |
| Architecture | Mixture of Experts (MoE) -- 256 experts, 8 activated |
| Context Window | 200K tokens |
| Max Output | 131,000 tokens |
| Training Data | 28.5 trillion tokens |
| Training Hardware | Huawei Ascend 910 (zero NVIDIA dependency) |
| Model Size on Disk | 1.51 TB |
| Predecessor | GLM-4.7 (368B total / ~32B active) |

Key Innovation

"Slime" RL technique -- novel reinforcement learning method achieving record-low hallucination rate among frontier models.

Benchmarks

| Benchmark | GLM-5 | Claude Opus 4.5 | GPT-5.2 |
|---|---|---|---|
| SWE-bench Verified | 77.8% | 80.9% | 76.2% |
| Terminal-Bench 2.0 | 56.2% | 59.3% | 54.2% |
| Humanity's Last Exam | 50.4 | 43.4 | 45.8 |
| BrowseComp | 75.9 (#1) | 67.8 | 59.2 |
| AI Intelligence Index | 50 (#1 of 64 models) | - | - |
  • #1 open-source model on SWE-bench and BrowseComp
  • Beats GPT-5.2 on most benchmarks
  • Trails Claude Opus 4.5 on SWE-bench and Terminal-Bench
  • Beats Claude on Humanity's Last Exam and BrowseComp

Pricing (The Disruption)

| Model | Input / 1M tokens | Output / 1M tokens |
|---|---|---|
| GLM-5 | ~$0.11 | ~$0.32 |
| Claude Opus 4.6 | $5.00 | $25.00 |
| GPT-5.2 | $1.75 | ~$10.00 |

At the listed base rates, GLM-5 is ~45x cheaper than Claude Opus on input ($0.11 vs $5.00) and ~78x cheaper on output ($0.32 vs $25.00); at the top of GLM-5's tiered output pricing (~$2.56/M, see the head-to-head table below), the output gap narrows to ~10x.
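
A back-of-envelope sketch of what that means for a month of heavy usage (prices are this document's figures, not live quotes; the traffic mix is illustrative):

```python
# Monthly cost at the per-million-token rates listed above.
PRICES = {  # model: (input $/M tokens, output $/M tokens)
    "glm-5": (0.11, 0.32),
    "claude-opus-4.6": (5.00, 25.00),
    "gpt-5.2": (1.75, 10.00),
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Dollars per month for the given millions of input/output tokens."""
    in_rate, out_rate = PRICES[model]
    return input_mtok * in_rate + output_mtok * out_rate

# Illustrative 24/7 agent: 500M input + 50M output tokens per month.
for name in PRICES:
    print(f"{name}: ${monthly_cost(name, 500, 50):,.2f}/mo")
# glm-5: $71.00/mo; claude-opus-4.6: $3,750.00/mo; gpt-5.2: $1,375.00/mo
```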

Availability

  • Hugging Face: zai-org/GLM-5 (full weights)
  • Ollama: ollama pull glm-5
  • OpenRouter: z-ai/glm-5
  • Z.AI Platform: chat.z.ai
  • vLLM: Officially supported

Early Assessment

Strong on structured tasks (coding, math, reasoning benchmarks). May lag behind Claude Opus on nuanced, context-heavy, "situationally aware" tasks. Community consensus: "exceptionally capable but far less situationally aware."

Note: GLM-5 at 1.51 TB requires significant hardware for local deployment. A Mac Studio M3 Ultra with 512GB cannot fit the unquantized model. Heavy quantization (Q2-Q3) is needed for consumer hardware, which degrades quality.
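
The arithmetic behind that note, as a rough sketch (size ≈ parameters × bits per weight; the effective bits-per-weight values below are approximations for common quant levels, and real GGUF files add a few percent of overhead):

```python
def model_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weights-only size in GB: params x bits / 8."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

# Effective bits/weight are rough figures, not exact GGUF sizes.
for label, bits in [("BF16", 16.0), ("Q8", 8.5), ("Q4", 4.8), ("Q3", 3.5), ("Q2", 2.6)]:
    print(f"GLM-5 (744B) at {label}: ~{model_size_gb(744, bits):,.0f} GB")
# BF16 ~1,488 GB (the ~1.51 TB figure above); Q4 ~446 GB, which leaves
# little headroom on a 512 GB machine once KV cache and OS are counted.
```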


Cloud Frontier Models

| Rank | Model | Creator | Key Strengths |
|---|---|---|---|
| 1 | Claude Opus 4.5/4.6 | Anthropic | Top coding (SWE-bench 80.9%), best "vibes", situational awareness |
| 2 | GPT-5.2 | OpenAI | Strong all-rounder, 400K context, Codex integration |
| 3 | Gemini 2.5/3.0 Pro | Google | Massive context windows, multimodal, strong reasoning |
| 4 | Grok 4 | xAI | Competitive reasoning, real-time data access |
| 5 | Claude Sonnet 4.5 | Anthropic | Best price/performance for coding |
| 6 | GLM-5 | Zhipu AI | Best open-weight frontier, MIT license, ultra-cheap API |

Open-Source Rankings

By Coding Performance (LiveCodeBench + SWE-bench)

| Rank | Model | Creator | Params | Coding Score |
|---|---|---|---|---|
| 1 | DeepSeek V3.2 Speciale | DeepSeek | - | 90% LiveCodeBench |
| 2 | GLM-5 | Zhipu AI | 744B MoE | 77.8% SWE-bench (#1 open-source) |
| 3 | gpt-oss-120B | OpenAI | 120B | 88% LiveCodeBench |
| 4 | MiMo-V2-Flash | Xiaomi | - | 87% LiveCodeBench |
| 5 | Kimi K2.5 | MoonshotAI | 1T MoE (32B active) | 76.8% SWE-bench, 96.1% AIME |
| 6 | MiniMax M2.5 | MiniMax | - | 80.2% SWE-bench, 20x cheaper than Claude |

By Math Performance (AIME 2025)

| Model | Score |
|---|---|
| DeepSeek V3.2 Speciale | 97% |
| Kimi K2.5 (Reasoning) | 96% |
| MiMo-V2-Flash | 96% |
| GLM-5 | 95% |
| gpt-oss-120B | 93% |
| DeepSeek V3.2 | 92% |

Best Coding Models

For Cloud API Usage

| Use Case | Best Model | Why |
|---|---|---|
| Best quality overall | Claude Opus 4.5/4.6 | SWE-bench 80.9%, best situational awareness |
| Best value (subscription) | GPT-5.3-Codex | $20/mo Plus includes frontier coding; limits doubled until April 2026 |
| Best price/performance (API) | Claude Sonnet 4.5 | Strong coding at lower cost |
| Cheapest frontier-quality (API) | GLM-5 | 45x cheaper than Claude, SWE-bench 77.8% |
| Async task delegation | OpenAI Codex (GPT-5.3) | Background task execution, cloud-sandboxed VMs |

GPT-5.3-Codex for OpenClaw (community-validated):

  • $20/mo ChatGPT Plus includes GPT-5.3-Codex (low/medium/high/xhigh tiers)
  • @_karimelk: "running my openclaw on GPT-5.3-codex medium with excellent results"
  • @Shenoy465653734: "OpenClaw's power is realised with frontier models only i.e 5.3 Codex and Opus 4.6"
  • Limits are 2x more generous than Claude Pro -- near-unlimited coding at $20/mo
  • See the Subscriptions Guide for full pricing

For Local Deployment

| VRAM Budget | Best Model | Notes |
|---|---|---|
| 8GB | Qwen 2.5 Coder 7B | 88.4% HumanEval |
| 16GB | Qwen 2.5 Coder 14B | Strong small model |
| 24GB | Qwen3 30B (Q4) | Competitive quality |
| 48-64GB | Qwen3-Coder 32B | Near-frontier, practical |
| 128GB+ | Qwen3-next-80B | Best speed/quality balance |
| 192GB+ | Qwen-3 235B | Frontier-class local |
| 512GB+ | GLM-5 (quantized), DeepSeek V3 | Largest models |
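
If you want that ladder in code, a minimal picker keyed on the table above (thresholds and model names are this document's, not an official compatibility matrix):

```python
# Local-model picker mirroring the VRAM/unified-memory table above.
LADDER = [
    (8,   "Qwen 2.5 Coder 7B"),
    (16,  "Qwen 2.5 Coder 14B"),
    (24,  "Qwen3 30B (Q4)"),
    (48,  "Qwen3-Coder 32B"),
    (128, "Qwen3-next-80B"),
    (192, "Qwen-3 235B"),
    (512, "GLM-5 (quantized) / DeepSeek V3"),
]

def pick_local_model(vram_gb: int) -> str:
    """Return the largest tier that fits the given VRAM budget."""
    best = "none (under 8 GB: use a cloud API)"
    for need, model in LADDER:
        if vram_gb >= need:
            best = model
    return best

print(pick_local_model(64))   # Qwen3-Coder 32B
print(pick_local_model(512))  # GLM-5 (quantized) / DeepSeek V3
```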

Best for Claude Code CLI (Local)

  1. GLM-5 -- Latest frontier open-source, MIT license, top community pick
  2. Qwen3-Coder (30B-A3B MoE) -- Best efficiency for smaller setups
  3. GPT-OSS 120B -- Strong coder, Apache 2.0
  4. Devstral-small -- Good for agentic coding on limited hardware

Brand New Releases (Last 4 Weeks)

| Model | Date | Creator | Notable |
|---|---|---|---|
| GLM-5 | Feb 11, 2026 | Zhipu AI | 744B MoE, MIT, #1 open-source |
| MiniMax M2.5 | Feb 2026 | MiniMax | 80.2% SWE-bench, 20x cheaper than Claude |
| DeepSeek V3.2 Speciale | Jan-Feb 2026 | DeepSeek | 90% LiveCodeBench (highest open-source) |
| gpt-oss-120B | Recent | OpenAI | Apache 2.0, OpenAI's first serious open-source release |
| MiMo-V2-Flash | Recent | Xiaomi | 87% coding, 96% math |
| Qwen3-Coder-480B | Recent | Alibaba | Focused on agentic coding |
| Kimi K2.5 | Jan 26, 2026 | MoonshotAI | 1T MoE, 32B active, native multimodal, agent swarms |
| Llama 4 Scout/Maverick | Recent | Meta | Natively multimodal, open-source |

Model Selection Guide

Decision Tree

Need absolute best quality?
├─ Yes → Claude Opus 4.5/4.6 (cloud)
└─ No → Need it free / private?
    ├─ Yes → Have 64GB+ unified memory?
    │   ├─ Yes → GLM-5 (quantized) or Qwen3-Coder (local via Ollama)
    │   └─ No  → Qwen 2.5 Coder 7B/14B (local)
    └─ No → Budget matters?
        ├─ Yes → GLM-5 API ($0.11/M input) or Groq free tier
        └─ No  → Claude Sonnet 4.5 (best price/performance cloud)
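
The same tree as a small routing function, for anyone wiring this into scripts (a sketch; model names are as listed above):

```python
def choose_model(best_quality: bool, private: bool, mem_gb: int, budget_tight: bool) -> str:
    """Encodes the decision tree above."""
    if best_quality:
        return "Claude Opus 4.5/4.6 (cloud)"
    if private:
        if mem_gb >= 64:
            return "GLM-5 (quantized) or Qwen3-Coder (local via Ollama)"
        return "Qwen 2.5 Coder 7B/14B (local)"
    if budget_tight:
        return "GLM-5 API ($0.11/M input) or Groq free tier"
    return "Claude Sonnet 4.5 (best price/performance cloud)"

print(choose_model(best_quality=False, private=True, mem_gb=96, budget_tight=False))
# GLM-5 (quantized) or Qwen3-Coder (local via Ollama)
```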

Cost Per Million Tokens

| Model | Input | Output | Quality Tier |
|---|---|---|---|
| Groq (Llama 3.1 70B) | $0.59 | $0.79 | Good |
| DeepInfra (various) | $0.08 | Varies | Good |
| GLM-5 | $0.11 | $0.32 | Frontier |
| Claude Sonnet 4.5 | $3.00 | $15.00 | Frontier |
| Claude Opus 4.6 | $5.00 | $25.00 | Best |
| GPT-5.2 | $1.75 | ~$10.00 | Frontier |

Kimi K2.5 (The Agent Swarm Model)

Released: January 26, 2026
Creator: Moonshot AI
License: Modified MIT (full commercial use)

| Spec | Value |
|---|---|
| Total Parameters | 1 trillion (1T) |
| Active Parameters | 32B per token |
| Architecture | MoE -- 384 routed experts, 8 activated per token |
| Context Window | Up to 512K tokens |
| Vision | Native multimodal (MoonViT, 200M-param vision encoder) |
| Unique Feature | Self-directed agent swarm -- orchestrates up to 100 sub-agents |
| Model Size on Disk | ~595 GB |

Benchmarks:

  • SWE-bench Verified: 76.8% (close to GLM-5's 77.8%)
  • AIME 2025 (Math): 96.1%
  • Tool-use improvement: +20.1 pp (nearly 2x better than GPT-5.2)
  • Beats Claude on tool use, trails on coding

Pricing (ultra-cheap):

  • Input: $0.10-0.60/M tokens (cache hit vs. miss)
  • Output: ~$2.80/M tokens
  • 8-50x cheaper than Claude Opus, depending on caching
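
Because the input rate spans cache hit vs. miss, the effective price depends on your hit rate; a quick sketch using the rates above:

```python
def effective_input_price(hit_rate: float, hit_price=0.10, miss_price=0.60) -> float:
    """Blended $/M input tokens for a given prompt-cache hit rate (rates above)."""
    return hit_rate * hit_price + (1 - hit_rate) * miss_price

for rate in (0.0, 0.5, 0.9):
    print(f"{rate:.0%} cache hits -> ${effective_input_price(rate):.3f}/M input")
# 0% -> $0.600, 50% -> $0.350, 90% -> $0.150
```

Long-running agents that resend a stable system prompt should land near the high-hit-rate end of that range.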

Availability: HuggingFace, Ollama (ollama pull kimi-k2.5), OpenRouter, Together AI, NVIDIA NIM

OpenClaw integration: Officially supported. Moonshot published a guide. OpenClaw announced free Kimi K2.5 access at "1/9 the cost."

Reddit says: "Oh good lord it's good" and "It can almost do 90% of what Claude Opus 4.5 can do." Zero production case studies yet as of Feb 2026. Privacy concern: Chinese company, data subject to Chinese regulations.


MiniMax M2.5 (The Budget Frontier Coder)

Released: February 2026
Creator: MiniMax (Chinese AI lab)
License: Open-source

| Spec | Value |
|---|---|
| SWE-bench Verified | 80.2% (higher than GLM-5's 77.8%) |
| Positioning | Budget alternative to Claude Opus |
| Cost | ~$10/month for typical OpenClaw usage (~20x cheaper than Claude) |
| Open-source | Yes -- full weights available |

Key takeaways from the AI Grid review:

  • Beats GLM-5 on SWE-bench coding benchmarks
  • Close to Claude Opus 4.5 on code generation tasks
  • A fraction of the cost -- viable for budget OpenClaw setups
  • Early community reports are positive, but production data is limited

OpenClaw integration: Use via OpenRouter (minimax/m2.5) or direct API. Good candidate for the "coding muscle" in a model routing setup.
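
A minimal call through OpenRouter's OpenAI-compatible endpoint, using the minimax/m2.5 slug quoted above (verify the exact slug on openrouter.ai; this is a sketch, not an official integration):

```python
# Minimal OpenRouter request via the OpenAI SDK's compatible interface.
# Assumes OPENROUTER_API_KEY is set in the environment.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="minimax/m2.5",  # slug from this document; confirm on openrouter.ai
    messages=[{"role": "user", "content": "Refactor this function for clarity: ..."}],
)
print(resp.choices[0].message.content)
```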


Best Models for OpenClaw (Community Tested)

Real-world OpenClaw model recommendations from Twitter/X community (Feb 2026):

| Use | Recommended Model | Why | Source |
|---|---|---|---|
| Primary agent (budget) | Kimi K2.5 | Most-used model for OpenClaw, 256K context, agent-first design | @inqusit |
| Primary agent (quality) | Claude Opus 4.6 | Best reasoning, but $50-100/day API costs | @thaliabloomai |
| Primary agent (best value) | GLM-5 | "Seems better already compared to Kimi 2.5" | @philblines |
| Heartbeat/cron | Grok 4.1 FR or Haiku 4.5 | Cheap, fast, good enough for checks | @RichP |
| Fallback chain | GLM 4.7 → Kimi 2.5 | Multiple cheap models for resilience | @RichP |
| Code-only | GPT-5.2 | "Can fix code right away without needing another coding agent" | @DegenApe99 |

Community power user stack:

Primary: Gemini 3 Flash Preview (via OpenRouter)
Heartbeat: Grok 4.1 FR
Fallbacks: GLM 4.7, Kimi 2.5
Premium override: Claude Opus 4.6 (for complex tasks)

The multi-model experiment (one user on Mac Mini):

"Experimenting with Kimi / Opus / Codex / GLM on a Mac Mini + OpenClaw setup. I care less about 'which model is absolute smartest' and more about 'which ones I can run 24/7 without killing my wallet or my Mac Mini.'" -- @h_a_t_a_r_a_k_e


GLM-5 vs Kimi K2.5 Head-to-Head (For OpenClaw)

The two cheapest frontier models. Which one to use?

| Dimension | GLM-5 | Kimi K2.5 | Winner |
|---|---|---|---|
| SWE-bench Verified | 77.8% | 76.8% | GLM-5 |
| Context Window | 200K | 256K (up to 512K) | Kimi K2.5 |
| Hallucination Rate | Lowest (record) | Standard | GLM-5 |
| Input Cost ($/M) | $0.11-0.80 | $0.10-0.45 | Kimi K2.5 (cached) |
| Output Cost ($/M) | $0.32-2.56 | $2.50-2.80 | GLM-5 (cheaper) |
| Agent Design | General purpose | Agent-first (swarms) | Kimi K2.5 |
| Multimodal | Text-only | Native vision | Kimi K2.5 |
| Sub-Agent Swarms | No | Yes (100 sub-agents) | Kimi K2.5 |
| Math (AIME) | 95% | 96.1% | Near-tie |
| Tool Calling | Good | +20.1 pp vs GPT-5.2 | Kimi K2.5 |

Verdict: GLM-5 wins on raw coding quality and low hallucination. Kimi K2.5 wins for agent use (longer context, native vision, swarm orchestration, better tool calling). For OpenClaw specifically, Kimi K2.5 is better as primary agent model, GLM-5 is better as coding sub-agent.
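
Expressed as routing logic, that verdict looks roughly like this (task categories and model slugs are illustrative, not OpenClaw's actual schema):

```python
# Toy router reflecting the verdict above: agentic work to Kimi K2.5,
# coding sub-tasks to GLM-5. Task categories here are illustrative.
ROUTES = {
    "agent":  "moonshotai/kimi-k2.5",  # longer context, vision, tool calling
    "coding": "z-ai/glm-5",            # higher SWE-bench, lowest hallucination
}

def route(task_type: str) -> str:
    """Unknown task types fall back to the primary agent brain."""
    return ROUTES.get(task_type, ROUTES["agent"])

assert route("coding") == "z-ai/glm-5"
assert route("browse") == "moonshotai/kimi-k2.5"
```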

Real-World Reliability Test (Feb 14, 2026)

Source: @LufzzLiz -- Same prompt across 5 models: "Check today's Shanghai weather, write a morning greeting based on it, save to workspace/greeting.txt"

| Model | Result | Quality | Speed | Notes |
|---|---|---|---|---|
| Claude Opus 4.6 | Pass | Best: noticed it was evening and Valentine's Day, adapted the greeting | Normal | Worth the premium for quality-critical work |
| GLM-5 | Pass | Good: also noticed Valentine's Day | Slow | Quality is there, speed isn't |
| MiniMax M2.5 | Pass | Good | Normal | Stable, reliable |
| Gemini Flash | Pass | Basic but correct | Fastest | Best reliability + speed + cost ratio |
| Kimi K2.5 | Failed | N/A | N/A | Infinite loop querying weather → rate limit → task crashed |

@LufzzLiz recommendation: Primary: Gemini Flash (fast + cheap + reliable). High-end: Claude Opus 4.6. Backup: MiniMax M2.5.

Community consensus on reliability (aggregated):

  • @cdz_solo: "Used GLM-5 for a day, MiniMax M2.5 for a day. Went back to Kimi K2.5" (prefers Kimi despite flaws)
  • @takafirstpen: "OpenClaw usability: Kimi K2.5 > GLM-5"
  • @llt139574: "Kimi K2.5 is hard to use... code doesn't work, doesn't understand local code environment"
  • @dddanielwang: "Only GLM-5 and Kimi 2.5 are economical with decent recall. For a good brain you still need Opus/GPT"
  • @MartinSzerment: "Kimi K2.5 matched Opus 4.5 performance at 1/8th cost. Top model on OpenClaw and OpenRouter"

Bottom line for 24/7 agents: Kimi K2.5 is the most popular but can be unreliable (loops, rate limits). GLM-5 is higher quality but slow. For always-on reliability, Gemini Flash or MiniMax M2.5 are safer. Use Opus for quality-critical decisions.
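
The practical defense against the failure mode above (loops, rate limits) is a hard per-call timeout plus a fallback chain. A sketch with illustrative model names; call_model stands in for your real API client:

```python
import concurrent.futures

FALLBACKS = ["moonshotai/kimi-k2.5", "z-ai/glm-5", "google/gemini-flash"]
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def call_with_fallbacks(prompt: str, call_model, timeout_s: float = 120.0) -> str:
    """Try each model in turn; abandon any call that exceeds timeout_s.
    call_model(model, prompt) is a stand-in for your actual API client."""
    last_err = None
    for model in FALLBACKS:
        future = _pool.submit(call_model, model, prompt)
        try:
            return future.result(timeout=timeout_s)
        except Exception as err:  # TimeoutError, rate limit, etc.
            last_err = err        # note: a timed-out thread keeps running
    raise RuntimeError(f"all fallbacks failed: {last_err}")
```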

Open-Source Progression Strategy

The trend: start with cloud APIs, progressively self-host as models improve and your hardware arrives.

Phase 1 — Cloud APIs (Now)
├── Primary: Kimi K2.5 via OpenRouter ($0.10/M)
├── Fallback: MiniMax M2.5 or Gemini Flash
├── Premium: Claude Opus 4.6 for complex reasoning
└── Cost: $20-100/mo API

Phase 2 — Hybrid (Self-host + Cloud)
├── Self-host: GLM-5 or MiniMax M2.5 on your server (EPYC, Mac Studio, etc.)
├── Local handles: 80-90% of agent tasks (heartbeat, cron, routine)
├── Cloud handles: 10-20% quality-critical (Opus for strategy, architecture)
└── Cost: $10-30/mo API + one-time hardware

Phase 3 — Mostly Local (Target)
├── 90%+ local open-source models
├── Cloud only for: refinement, error recovery, complex reasoning
├── Keep Claude Code subscriptions for coding (highest quality)
└── Cost: Near-zero API + subscriptions

Hardware needed for self-hosting:

  • GLM-5 (quantized): 128-192GB unified memory or 2-4x RTX 4090
  • MiniMax M2.5 (8-bit): ~48GB VRAM (2x RTX 4090 or M3 Ultra 512GB) -- @Patrick1Kennedy confirmed working
  • Kimi K2.5: ~595GB on disk, needs substantial hardware
  • Qwen3-Coder: 48-64GB for the 32B version, the most practical local option

$2K/Month Budget: Maximum ROI Model Routing

For a user willing to spend $2,000/mo on API costs, here's the optimal strategy:

┌─────────────────────────────────────────────────┐
│           OPTIMAL MODEL ROUTING                  │
│           ($2,000/month budget)                  │
├─────────────────────────────────────────────────┤
│ SUBSCRIPTIONS (fixed monthly):                   │
│ → 3x Claude Max $200 ($600) — Opus for reasoning│
│ → 1x ChatGPT Plus $20 ($20) — Codex 5.3 coding  │
│ → 1x Codex Pro $200 ($200) — heavy async tasks   │
│ → Gemini CLI free ($0) — planning, large context │
│ → Subtotal: $820/mo fixed                        │
├─────────────────────────────────────────────────┤
│ API BUDGET (~$1,180/mo remaining):               │
├─────────────────────────────────────────────────┤
│ TIER 1: OpenClaw primary brain (~$400/mo)        │
│ → Kimi K2.5 ($0.10/M cached) or GLM-5 ($0.11/M) │
│ → For: 24/7 agent operations, research, crons    │
├─────────────────────────────────────────────────┤
│ TIER 2: Coding sub-agents (~$300/mo)             │
│ → GPT-5.3-Codex via API ($1.75/M) + MiniMax M2.5│
│ → For: delegated coding, debugging, refactoring  │
├─────────────────────────────────────────────────┤
│ TIER 3: Heartbeat/trivial (~$30/mo)              │
│ → Gemini Flash (near-free) or Haiku 4.5          │
│ → For: health checks, simple queries, 30min beat │
├─────────────────────────────────────────────────┤
│ TIER 4: Buffer (~$450/mo)                        │
│ → Overflow, experimentation, spikes              │
└─────────────────────────────────────────────────┘

Why Codex 5.3 changes the math: A single $20/mo ChatGPT Plus gives you near-unlimited frontier coding via Codex 5.3 (limits doubled until April 2026). This replaces hundreds of dollars in API coding costs. @phl43: "I have yet to run against the usage limits."

At $2K/mo you can process ~2.4 billion tokens/month using API routing, vs ~80 million tokens on Opus alone. That's 30x more work for the same money. Add subscriptions and you get frontier quality where it matters + massive volume everywhere else.
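
A back-of-envelope check of those numbers (the blended $/M figures are rough assumptions implied by the text, not billed rates):

```python
# Verifies the ~30x throughput claim above.
budget = 2000.0        # monthly API budget, dollars
opus_blended = 25.0    # assumed ~$/M tokens for output-heavy Opus usage
routed_blended = 0.83  # assumed ~$/M tokens blended across the tiers above

opus_tokens = budget / opus_blended * 1e6      # ~80M tokens
routed_tokens = budget / routed_blended * 1e6  # ~2.4B tokens
print(f"Opus only: {opus_tokens / 1e6:.0f}M tokens")
print(f"Routed:    {routed_tokens / 1e9:.1f}B tokens ({routed_tokens / opus_tokens:.0f}x)")
```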

Critical Gotchas (Model-Specific)

| Model | Gotcha | Impact |
|---|---|---|
| Kimi K2.5 | 2.5x more verbose than other models -- silently inflates costs | Budget blow-out |
| Kimi K2.5 | "Horribly bad interfaces" for frontend/UI code | Use GLM-5 or Opus for UI |
| MiniMax M2.5 | Suspected benchmark gaming -- wrote fake tests to pass SWE-bench | Real quality may be lower |
| MiniMax M2.5 | Only 128K context (vs 200K GLM-5, 256K Kimi) | Fails on large codebases |
| GLM-5 | No vision/multimodal support yet | Can't analyze screenshots |
| GLM-5 | Struggles with frontend/UI generation | Strong backend, weak frontend |
| All models | OpenClaw sends heartbeats to the primary model by default | #1 budget waste -- always configure the heartbeat model separately |

Use MiniMax M2.5-lightning (not regular M2.5) for sub-agents -- same quality, faster output.

Sample OpenClaw Config ($2K/mo Budget)

```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "openrouter/z-ai/glm-5",
        "fallbacks": ["openrouter/moonshotai/kimi-k2.5", "deepseek/deepseek-chat"]
      },
      "heartbeat": { "every": "30m", "model": "deepseek/deepseek-chat" },
      "subagents": { "model": "openrouter/minimax/minimax-m2-5-lightning" }
    }
  }
}
```

Honest Assessment

  • Claude Opus 4.5/4.6 is still the best for complex, multi-file coding requiring situational awareness
  • GLM-5 is the biggest disruption -- frontier quality at budget pricing, fully open-source
  • Kimi K2.5 is the best value for agentic/tool-use workloads -- 150x cheaper than Claude for background tasks
  • MiniMax M2.5 is the dark horse -- 80.2% SWE-bench at 20x cheaper than Claude
  • Local models are great for autocomplete, quick queries, and privacy. Not yet a full replacement for frontier cloud models on complex tasks.
  • The gap is closing rapidly. What required H100 clusters 18 months ago now runs on a Mac Mini.