
Hardware Guide: Local AI Inference

Apple Silicon, AMD EPYC, NVIDIA GPUs, and everything in between. What to buy for local LLM inference, what performance to expect, and when to self-host vs rent cloud GPUs.


Why Apple Silicon

Unified Memory Architecture (The Killer Feature)

Traditional NVIDIA systems:

CPU <---> System RAM (DDR5, ~90 GB/s)
              |
        [PCIe Bus - bottleneck: ~32 GB/s]
              |
GPU <---> VRAM (GDDR6X, ~1000 GB/s but isolated to 24-32GB)

Apple Silicon:

CPU + GPU + Neural Engine + Media Engine
              |
        [ALL share the SAME memory pool]
              |
     Unified LPDDR5x Memory (up to 819 GB/s on M3 Ultra)

Why this matters for LLMs: LLM inference is memory-bandwidth-bound. The model weights stream through the processor for every token generated. An RTX 4090 has 24GB VRAM -- a 70B model at Q4 needs ~40GB and simply doesn't fit. A Mac Studio with 192GB runs it directly.
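
A rough ceiling on decode speed falls straight out of this: every generated token has to stream essentially the whole set of weights through the chip once, so tokens/s is at most memory bandwidth divided by model size in bytes. A minimal sketch of that estimate (illustrative numbers only; real throughput lands below the ceiling due to compute, KV-cache reads, and overhead):

# Back-of-envelope ceiling for decode speed on memory-bandwidth-bound inference.
# Assumes each token reads every weight once; real numbers come in lower.

def tokens_per_sec_ceiling(params_billion: float, bits_per_weight: float, bandwidth_gb_s: float) -> float:
    model_bytes = params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / model_bytes

# 70B model at ~4 bits/weight (~35GB) on an M3 Ultra (819 GB/s):
print(tokens_per_sec_ceiling(70, 4, 819))    # ~23 tok/s ceiling
# The RTX 4090's 1,008 GB/s never enters the picture -- the weights don't fit in 24GB.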

Key Advantages

  1. Massive memory -- 128-512GB unified, all accessible to GPU. No PCIe bottleneck.
  2. Power efficiency -- 5-10x better performance-per-watt than NVIDIA for inference.
  3. Silent operation -- 0 dB possible. No server room needed.
  4. 24/7 viable -- 2-4W idle. Runs on a desk, costs ~$10-25/year in electricity.
  5. No driver hell -- No CUDA configuration, no multi-GPU wiring.
  6. Data privacy -- Everything stays local.

Key Limitations

  1. Training is slow -- 3x slower than NVIDIA. Use cloud for training.
  2. CUDA ecosystem -- Most ML research code is NVIDIA-first.
  3. Not upgradeable -- Soldered RAM. Buy what you need upfront.
  4. Raw speed for small models -- NVIDIA wins when models fit entirely in VRAM.

What to Buy

Mac Mini M4 Lineup

| Config | RAM | Price | Best For |
|---|---|---|---|
| M4 base | 16GB | $599 | Experimentation only. Runs 7-8B models, barely. |
| M4 | 24GB | $999 | Budget entry. Runs 14B models. |
| M4 | 32GB | $1,199 | Good starter. Comfortable 14B, tight 32B. |
| M4 Pro 12-core | 48GB | $1,799 | Can run 70B quantized. |
| M4 Pro 14-core | 64GB | ~$2,200 | Crowd favorite. 32B+ comfortably. |

Mac Studio Lineup

| Config | RAM | Price | Best For |
|---|---|---|---|
| M4 Max 14-core | 36GB | $1,999 | Entry Studio. Marginal over Mac Mini. |
| M4 Max 16-core | 128GB | ~$4,750 | Serious LLM work. 70-100B+ models. |
| M3 Ultra 32-core | 192GB | ~$5,600 | 200B+ models. Professional AI dev. |
| M3 Ultra 32-core | 512GB | ~$9,499 | Runs DeepSeek V3 671B locally. |

Memory Bandwidth (The Real Bottleneck)

| Chip | Bandwidth | Impact |
|---|---|---|
| M4 base | 120 GB/s | Slowest. Limits all models. |
| M4 Pro | 273 GB/s | 2.3x faster than base. |
| M4 Max | 546 GB/s | Strong for large models. |
| M3 Ultra | 819 GB/s | Currently the bandwidth king. |

Key insight: "Memory bandwidth is the real bottleneck, not GPU cores." The M4 Pro with 64GB offers more performance than two base M4 units clustered.


Performance Benchmarks

Single-Machine Token Generation

| Hardware | Model | Quantization | Tokens/s |
|---|---|---|---|
| M4 base 16GB | Llama 3.1 8B | Q4 | ~15-18 |
| M4 Pro 64GB | Qwen2.5-Coder 32B | Q4_K_M | ~18 |
| M4 Pro 64GB | 8B models | Q4 | 35-45 |
| M4 Max 128GB | Qwen3-next-80B | Q4 | ~70 |
| M3 Ultra 512GB | Gemma-3 27B | Q4 | ~41 |
| M3 Ultra 512GB | Qwen-3 235B | Q4 | ~30 |
| M3 Ultra 512GB | DeepSeek V3 671B | Q2 | ~5-10 |

vs NVIDIA

| Metric | M3 Ultra (Mac Studio) | RTX 4090 | H100 |
|---|---|---|---|
| Memory | Up to 512GB unified | 24GB VRAM | 80GB VRAM |
| Bandwidth | 819 GB/s | 1,008 GB/s | 3,350 GB/s |
| Power | 40-80W under load | 450W | 700W |
| Price | $5,600-9,500 | $2,000+ | $30,000+ |
| Llama 7B tok/s | 30-40 | 60-80 | 200+ |
| Can run 70B? | Yes (192GB+) | No (24GB limit) | Yes |

Apple wins on capacity. NVIDIA wins on speed per GB. When your model doesn't fit in 24GB VRAM, all that NVIDIA bandwidth is irrelevant.


Models You Can Run

On Mac Mini M4 Pro 64GB

| Model | Params | Quantization | Tok/s |
|---|---|---|---|
| Qwen2.5-Coder 32B | 32B | Q4_K_M | ~18 |
| Qwen3-Coder 14B | 14B | Q8 | 10-12 |
| Llama 3.1 8B | 8B | Q4 | 18-22 |
| Mistral 7B | 7B | Q4 | 20-25 |
| GLM-4.5-Air | ~30B | Q4 | ~53 |

On Mac Studio M3 Ultra 192-512GB

| Model | Params | Quantization | Tok/s |
|---|---|---|---|
| Qwen-3 235B | 235B | Q4_K_M | ~30 |
| GLM-4.7 | 368B MoE | Q4 | ~25 |
| Qwen3-vl 235B | 235B | Q4_K_M | ~30 |
| DeepSeek V3.1 671B | 671B | Q2-Q4 | ~5-10 |
| Falcon 180B | 180B | Q5_K_M | 5-8 |

Clustering with RDMA

The macOS 26.2 Revolution (December 2025)

Apple enabled RDMA (Remote Direct Memory Access) over Thunderbolt 5:

  - Zero-copy data transfers between device memory
  - 5-9 microsecond latency (matches datacenter InfiniBand)
  - 80 Gb/s over Thunderbolt 5
  - 90-95% practical bandwidth utilization

Before RDMA: Clustering actually degraded performance. llama.cpp dropped from 20.4 tok/s (1 node) to 15.2 tok/s (4 nodes) due to TCP overhead.

After RDMA: Performance scales. With Exo + RDMA, the same workload goes from 19.5 tok/s on one node to 31.9 tok/s across 4 nodes.
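
Why such a small latency number changes the picture: with a model split across nodes, each generated token only has to pass a small activation vector from node to node, so the cost that matters is per-hop latency, not data volume. A back-of-envelope sketch under assumed values (the hidden size of 8192, fp16 activations, and 3 hops for a 4-node pipeline are illustrative assumptions, not figures from the benchmarks below):

# Rough per-token communication cost for a pipeline-split model over Thunderbolt 5 RDMA.
# Assumptions (illustrative): hidden size 8192, fp16 activations, 3 node-to-node hops,
# 7 microsecond one-way latency, 80 Gb/s link speed.

hidden_size = 8192
bytes_per_hop = hidden_size * 2              # one fp16 activation vector per hop
hops = 3
latency_s = 7e-6                             # mid-range of the 5-9 microsecond figure
bandwidth_bytes_s = 80e9 / 8                 # 80 Gb/s expressed in bytes/s

comm_ms = hops * (latency_s + bytes_per_hop / bandwidth_bytes_s) * 1e3
print(f"~{comm_ms:.3f} ms of inter-node traffic per token")   # ~0.03 ms

# At ~30 tok/s each token takes ~33 ms to compute, so the RDMA overhead is well under 1%.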

Cluster Benchmarks

| Config | Model | Nodes | Tok/s |
|---|---|---|---|
| 4x Mac Studio M3 Ultra | Qwen3 235B | 1 | 20.4 |
| 4x Mac Studio M3 Ultra | Qwen3 235B | 4 (RDMA) | 31.9 |
| 4x Mac Studio M3 Ultra | DeepSeek V3.1 671B | 4 (RDMA) | 32.5 |
| 4x Mac Studio M3 Ultra | Kimi K2 Thinking 1T | 4 (RDMA) | ~15 |

Cluster Cost vs NVIDIA

| Cluster | Cost | Total Memory | Power |
|---|---|---|---|
| 4x Mac Studio M3 Ultra 512GB | ~$38-50K | 2TB | <500W |
| Equivalent 8x H200 GPU | $270K+ | 640GB HBM | 5,600W |

Jeff Geerling's landmark test: 4x Mac Studio M3 Ultra (Apple-provided hardware) = 1.5TB combined memory. Ran Kimi K2 Thinking (1 trillion parameters) at ~15 tok/s. Cost: ~$40K. Equivalent NVIDIA: $780K+.


Software Stack

Inference Runtimes (Ranked by Popularity)

  1. Ollama -- Most popular. brew install ollama && ollama run llama3.3. OpenAI-compatible API (see the example after this list). Since v0.14.0, supports the Anthropic API (Claude Code compatible).
  2. LM Studio -- GUI-based. Great model management. Beginner-friendly.
  3. MLX -- Apple's native ML framework. Best raw performance on Apple Silicon.
  4. llama.cpp -- More control over quantization. Power users prefer it over Ollama.
  5. Exo -- Distributed inference. RDMA support. Key tool for clustering.
  6. vLLM-MLX -- Production-grade. Achieved 464 tok/s on M4 Max.
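
Because most of these runtimes expose an OpenAI-compatible endpoint, pointing an existing script at a local model is usually just a base-URL change. A minimal sketch against Ollama's OpenAI-compatible API (assumes ollama serve is running and the model has already been pulled; the model tag is just an example):

# Talk to a local Ollama server through the standard OpenAI Python client.
# Assumes Ollama is running locally and `ollama pull llama3.3` (or similar) was done.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",   # Ollama's OpenAI-compatible endpoint
    api_key="ollama",                       # the client requires a key; Ollama ignores it
)

response = client.chat.completions.create(
    model="llama3.3",                       # any locally pulled model tag
    messages=[{"role": "user", "content": "Why is LLM inference memory-bandwidth-bound?"}],
)
print(response.choices[0].message.content)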

Coding Agent Tools (Local Models)

| Tool | Setup | Best Models |
|---|---|---|
| Claude Code + Ollama | Set ANTHROPIC_BASE_URL=http://localhost:11434 | GLM-4.7, Qwen3-Coder |
| OpenAI Codex CLI | Config via codex.toml profiles | Same + GPT-OSS |
| Roo Code | VS Code extension | Devstral, Qwen3 |
| Aider | Terminal CLI | Any OpenAI-compatible |
| Continue.dev | VS Code/JetBrains | Most flexible |

The ollama launch Command (v0.15)

ollama launch claude-code    # Auto-configures Claude Code with local model
ollama launch codex          # Sets up OpenAI Codex CLI
ollama launch opencode       # Sets up OpenCode

No manual environment variables needed.


Power Efficiency

| Hardware | Power Under AI Load | Monthly Electric Cost |
|---|---|---|
| Mac Mini M4 | 10-30W | ~$2-5 |
| Mac Studio M3 Ultra | 40-80W | ~$6-12 |
| 4x Mac Studio cluster | <500W total | ~$36 |
| Single RTX 4090 PC | 450-600W | ~$32-43 |
| 8x H200 GPU equivalent | 5,600W | ~$403 |

A Mac Mini draws less power than a light bulb. A 4x Mac Studio cluster uses about one-tenth the power of the NVIDIA equivalent.
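
The monthly cost column is straightforward to reproduce: watts × hours × electricity rate. A quick sketch (the $0.10/kWh rate is an assumption; substitute your local rate):

# Reproduce the monthly electricity estimates: watts -> dollars per month.
# Assumes 24/7 operation at $0.10/kWh (adjust for your local rate).

def monthly_electric_cost(watts: float, rate_per_kwh: float = 0.10, hours: float = 24 * 30) -> float:
    return watts / 1000 * hours * rate_per_kwh

for name, watts in [("Mac Mini M4", 25), ("Mac Studio M3 Ultra", 80),
                    ("RTX 4090 PC", 500), ("8x H200 node", 5600)]:
    print(f"{name:<22} ~${monthly_electric_cost(watts):.0f}/month")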


What NOT to Buy

| Config | Why Not |
|---|---|
| Mac Mini M4 base 16GB ($599) | Too little RAM for anything beyond toy models |
| Any Mac with 24GB "for AI" | The jump to 32GB ($200 more) is worth every penny |
| Mac Studio for speed alone | Bandwidth matches cheaper configs. You're paying for capacity. |
| Mac Studio 512GB if you only need 70B models | 192GB is enough. Save $4,000. |

Honest Reality Check

A Medium post ran the numbers on 4.5B tokens consumed over six months of Cursor/coding-agent use and concluded that serving that volume locally would be impractical. Local works for supplemental use, not as a full replacement for frontier cloud models on heavy workloads.

The pragmatic approach: Local for iteration/autocomplete/privacy. Cloud for complex multi-file coding and heavy reasoning.


Non-Apple Hardware

Apple Silicon dominates the "quiet home office" niche, but NVIDIA GPUs and AMD CPUs dominate everything else -- especially raw speed, enterprise scale, and cost-per-token.

When to Choose Non-Apple

| Scenario | Apple Silicon | NVIDIA/AMD |
|---|---|---|
| Silent home office | Winner | Loud fans |
| Models that fit in 24GB VRAM | Slower | Winner (2-4x faster) |
| Models >80GB | Winner (unified memory) | Need multi-GPU ($$$) |
| Training/fine-tuning | Too slow | Winner (CUDA ecosystem) |
| Enterprise/datacenter | Not scalable | Winner |
| Power efficiency | Winner (10x better) | 450-700W per GPU |
| Budget <$2,000 | Mac Mini M4 | RTX 4090 PC |
| Budget $5-20K | Mac Studio | Winner (Threadripper + multi-GPU) |

GPU Comparison

Consumer GPUs (As of Feb 2026)

| GPU | VRAM | Tokens/s (8B Q4) | Price | Best For |
|---|---|---|---|---|
| RTX 5090 | 32GB | ~213 | $1,999 | New king. 67% faster than 4090. |
| RTX 4090 | 24GB | ~128 | ~$1,600 | Best value. Matches $17K A100 for many tasks. |
| RTX 4080 | 16GB | ~80 | ~$1,000 | Good for 7-13B models |
| RTX 4060 | 8GB | ~40 | ~$300 | Entry-level. 7B models only. |
| RTX 3090 | 24GB | ~95 | $700-800 | Used market bargain |

Enterprise GPUs

| GPU | VRAM | Tokens/s (8B) | Price | Use Case |
|---|---|---|---|---|
| H100 | 80GB HBM3 | ~144 | $30,000+ | Enterprise standard |
| H200 | 141GB HBM3e | ~200+ | $35,000+ | Latest datacenter |
| B200 | 192GB HBM3e | TBD | $40,000+ | Blackwell (Feb 2026) |
| A100 | 80GB HBM2e | ~138 | $15,000+ | Workhorse, widely available |
| L40S | 48GB GDDR6 | ~110 | $8,000+ | High-throughput inference |
| RTX A6000 | 48GB GDDR6 | ~100 | $5,000 | Prosumer. Runs 70B models. |

Key insight: The RTX 4090 at $1,600 delivers ~85% of the performance of an A100 at $15,000 for inference. The 4090 is the productivity sweet spot for individual developers and small teams.

RTX 5090 game-changer: 67% faster than 4090, only 25% more expensive. If buying new in 2026, the 5090 is the obvious choice.


CPU Options: AMD vs Intel

AMD EPYC 9005 Series (Turin) -- The Enterprise Standard

| Feature | Spec |
|---|---|
| Cores | Up to 192 cores (384 threads) per socket |
| RAM | Up to 1.5TB DDR5-6400 per socket |
| PCIe | Gen 5.0, 128 lanes |
| Performance | 1.4-1.76x throughput vs Intel Xeon (same workloads) |
| Best models | EPYC 9555, 9575F, 9965 |

Real benchmark: The AMD EPYC 9575F achieves 10x better "goodput" than Intel under a 400 ms latency constraint running Llama 3.3 70B with vLLM -- up to 380 tok/s with AMD PACE parallelization.

Best for: Dedicated AI inference servers, OpenClaw hosting, multi-agent deployments.

AMD Threadripper 7000 -- Desktop Powerhouse

| Model | Cores | RAM Channels | Price |
|---|---|---|---|
| 7960X | 24 | 4x DDR5 | ~$1,500 |
| 7970X | 32 | 4x DDR5 | ~$2,500 |
| 7980X | 64 | 4x DDR5 | ~$5,000 |

Why Threadripper for AI: Quad-channel DDR5 (8-channel on the Threadripper PRO WX series) gives far more memory bandwidth than a desktop platform for CPU offloading, plus plenty of PCIe 5.0 lanes for multi-GPU setups.

One Reddit user reports that a Threadripper + RTX A6000 build handles 70B models "too fast to read."

Best for: Desktop AI workstations, multi-GPU inference rigs.

Intel Xeon (Emerald Rapids and Xeon 6)

| Model | Status |
|---|---|
| Xeon 8592V | Intel claims parity, but AMD wins 1.4-10x in benchmarks |
| Xeon 6980P | Latest, but AMD EPYC still faster per-dollar |
| Gaudi 3 | AI accelerator alternative to NVIDIA, but less community support |

Verdict: Intel is losing the AI inference war. AMD EPYC dominates 2026 benchmarks. Intel's market share dropped to 37% (from 72% unit share). Choose AMD unless you have existing Intel infrastructure commitments.


VRAM Requirements by Model Size

| Model Size | Q4_K_M Quant | FP16 (Full) | Minimum GPU |
|---|---|---|---|
| 7-8B | 6-7 GB | 14-16 GB | RTX 4060 (8GB) |
| 13-14B | 10-12 GB | 26-28 GB | RTX 4070 (12GB) |
| 27-32B | 18-23 GB | 54-64 GB | RTX 4090 (24GB) |
| 70B | 37-46 GB | 140 GB | Dual RTX 4090s OR A6000 (48GB) |
| 120-180B | 70-100 GB | 240-360 GB | H100 (80GB) or multi-GPU |
| 405B | 150+ GB | 810 GB | Multi-H100 cluster |

Tips:

  - Q4_K_M quantization reduces VRAM by ~50% vs FP16 with minimal quality loss
  - Q3_K_M saves an additional 15-20% VRAM (more quality loss)
  - CPU offloading can supplement -- move some layers to system RAM (slower but works)
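
The table reduces to a simple estimate: parameters × bits-per-weight ÷ 8 for the weights, plus headroom for the KV cache and runtime overhead. A rough calculator (the ~4.8 bits/weight figure for Q4_K_M is an approximation, and the headroom is left as a separate note):

# Rough VRAM estimate from parameter count and quantization level.
# Q4_K_M averages roughly 4.8 bits per weight (approximate); add a few GB of
# headroom for KV cache and runtime overhead (grows with context length).

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8    # 1B params at 8 bits = 1 GB

for size in (8, 14, 32, 70, 180, 405):
    q4 = weight_gb(size, 4.8)
    fp16 = weight_gb(size, 16)
    print(f"{size:>3}B   Q4_K_M ~{q4:6.1f} GB   FP16 ~{fp16:6.1f} GB")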


Community Builds

Budget Build: ~$2,700 (7-13B models)

CPU:     AMD Ryzen 7 7700X (~$400)
GPU:     NVIDIA RTX 4090 ($1,600)
RAM:     64GB DDR5 ($300)
Storage: 2TB NVMe ($150)
Case/PSU/Cooling: ~$250
─────────────────────────────
Total:   ~$2,700
Performance: 100+ tokens/sec on 8B models
Can run: 7B, 8B, 13B, 14B comfortably

Mid-Range Build: ~$14,000 (Up to 70B models)

CPU:     AMD Threadripper 7980X (~$5,000)
GPU:     NVIDIA RTX A6000 48GB ($5,000)
RAM:     256GB DDR5 ($2,000)
Platform/Cooling: $2,000
─────────────────────────────
Total:   ~$14,000
Performance: 40+ tokens/sec on 70B models
Can run: Everything up to 70B. CPU offload for larger.

Enterprise Build: $120-180K (70B+ production)

CPU:     AMD EPYC 9555 dual-socket ($30,000)
GPU:     2-4x NVIDIA H100 ($60,000-120,000)
RAM:     1.5TB DDR5 ($20,000)
Network: High-speed fabric ($10,000+)
─────────────────────────────
Total:   $120,000-180,000
Performance: 300+ tokens/sec concurrent
Can run: Any model. Production-grade multi-tenant.

Self-Host vs Cloud: When Does Buying Win?

Cloud GPU Pricing (Continuous Use)

| Model Size | GPU Needed | Cloud Cost/hr | Monthly (24/7) |
|---|---|---|---|
| 7B | A100 | $2-3/hr | $1,440-2,160 |
| 13B | A100 | $3-4/hr | $2,160-2,880 |
| 70B | 4-8x H100 | $12-48/hr | $8,640-34,560 |

Break-Even Analysis

| Utilization | 7B Models | 13B Models | 70B Models |
|---|---|---|---|
| 10% | Cloud wins | Self-host wins | Cloud wins |
| 25% | Self-host wins | Self-host wins | Cloud wins |
| 50% | Self-host wins | Self-host wins | Self-host wins |
| 100% | Self-host wins | Self-host wins | Self-host wins |

RTX 4090 ROI: $1,600 hardware cost. Cloud H100 equivalent: ~$48/hr. Break-even: ~33-50 hours of continuous use (~2-3 days).
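
The break-even arithmetic is easy to sanity-check: hardware cost divided by the hourly cloud rate gives the hours of use at which buying wins (electricity adds only cents per hour for a consumer GPU). A quick sketch using the figures above, plus a more conservative comparison against the single-A100 rate from the pricing table:

# Hours of use at which buying hardware beats renting cloud GPUs.
# Ignores electricity (cents/hour for a consumer card) and resale value.

def break_even_hours(hardware_cost: float, cloud_rate_per_hr: float) -> float:
    return hardware_cost / cloud_rate_per_hr

print(break_even_hours(1600, 48))   # RTX 4090 vs ~$48/hr multi-H100: ~33 hours
print(break_even_hours(1600, 3))    # RTX 4090 vs a single ~$3/hr A100: ~533 hours (~22 days)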

Rule of thumb:

  - <10% utilization: Use cloud (pay per hour)
  - 10-50% utilization: Self-host with consumer GPUs (RTX 4090/5090)
  - >50% utilization: Self-host with enterprise hardware (EPYC + H100)
  - Need >8,000 conversations/day: Infrastructure investment justified