Hardware Guide: Local AI Inference¶
Apple Silicon, AMD EPYC, NVIDIA GPUs, and everything in between. What to buy for local LLM inference, what performance to expect, and when to self-host vs rent cloud GPUs.
Table of Contents¶
- Why Apple Silicon
- What to Buy (Apple)
- Performance Benchmarks (Apple)
- Models You Can Run
- Clustering with RDMA
- Software Stack
- Power Efficiency
- What NOT to Buy
- Non-Apple Hardware
- GPU Comparison
- CPU Options: AMD vs Intel
- VRAM Requirements by Model Size
- Community Builds
- Self-Host vs Cloud: When Does Buying Win?
Why Apple Silicon¶
Unified Memory Architecture (The Killer Feature)¶
Traditional NVIDIA systems:
CPU <---> System RAM (DDR5, ~90 GB/s)
|
[PCIe Bus - bottleneck: ~32 GB/s]
|
GPU <---> VRAM (GDDR6X, ~1000 GB/s but isolated to 24-32GB)
Apple Silicon:
CPU + GPU + Neural Engine + Media Engine
|
[ALL share the SAME memory pool]
|
Unified LPDDR5x Memory (up to 819 GB/s on M3 Ultra)
Why this matters for LLMs: LLM inference is memory-bandwidth-bound. The model weights stream through the processor for every token generated. An RTX 4090 has 24GB VRAM -- a 70B model at Q4 needs ~40GB and simply doesn't fit. A Mac Studio with 192GB runs it directly.
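A quick back-of-the-envelope check makes the capacity argument concrete. This is a minimal sketch, assuming roughly 0.6 bytes per parameter for a Q4-class quant plus ~20% headroom for KV cache and runtime buffers; real figures vary by runtime, quant, and context length.

```python
# Rough fit check: can a quantized model live entirely in a machine's memory?
# Assumes ~0.6 bytes/parameter for Q4_K_M-style quants plus ~20% overhead
# for KV cache and runtime buffers (both assumptions, not exact figures).

def fits_in_memory(params_billions: float, memory_gb: float,
                   bytes_per_param: float = 0.6, overhead: float = 1.2) -> bool:
    """Return True if a quantized model of this size should fit."""
    weights_gb = params_billions * bytes_per_param   # e.g. 70B * 0.6 ≈ 42 GB
    return weights_gb * overhead <= memory_gb

print(fits_in_memory(70, 24))    # False -- an RTX 4090's 24 GB can't hold a 70B Q4 model
print(fits_in_memory(70, 192))   # True  -- a 192 GB Mac Studio runs it directly
```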
Key Advantages¶
- Massive memory -- 128-512GB unified, all accessible to GPU. No PCIe bottleneck.
- Power efficiency -- 5-10x better performance-per-watt than NVIDIA for inference.
- Silent operation -- 0 dB possible. No server room needed.
- 24/7 viable -- 2-4W idle. Runs on a desk, costs ~$10-25/year in electricity.
- No driver hell -- No CUDA configuration, no multi-GPU wiring.
- Data privacy -- Everything stays local.
Key Limitations¶
- Training is slow -- 3x slower than NVIDIA. Use cloud for training.
- CUDA ecosystem -- Most ML research code is NVIDIA-first.
- Not upgradeable -- Soldered RAM. Buy what you need upfront.
- Raw speed for small models -- NVIDIA wins when models fit entirely in VRAM.
What to Buy¶
Mac Mini M4 Lineup¶
| Config | RAM | Price | Best For |
|---|---|---|---|
| M4 base | 16GB | $599 | Experimentation only. Barely runs 7-8B models. |
| M4 | 24GB | $999 | Budget entry. Runs 14B models. |
| M4 | 32GB | $1,199 | Good starter. Comfortable 14B, tight 32B. |
| M4 Pro 12-core | 48GB | $1,799 | Can run 70B quantized. |
| M4 Pro 14-core | 64GB | ~$2,200 | Crowd favorite. 32B+ comfortably. |
Mac Studio Lineup¶
| Config | RAM | Price | Best For |
|---|---|---|---|
| M4 Max 14-core | 36GB | $1,999 | Entry Studio. Only a marginal step up from the Mac Mini M4 Pro. |
| M4 Max 16-core | 128GB | ~$4,750 | Serious LLM work. 70-100B+ models. |
| M3 Ultra 32-core | 192GB | ~$5,600 | 200B+ models. Professional AI dev. |
| M3 Ultra 32-core | 512GB | ~$9,499 | Runs DeepSeek V3 671B locally. |
Memory Bandwidth (The Real Bottleneck)¶
| Chip | Bandwidth | Impact |
|---|---|---|
| M4 base | 120 GB/s | Slowest. Limits all models. |
| M4 Pro | 273 GB/s | 2.3x faster than base. |
| M4 Max | 546 GB/s | Strong for large models. |
| M3 Ultra | 819 GB/s | Currently the bandwidth king. |
Key insight: "Memory bandwidth is the real bottleneck, not GPU cores." The M4 Pro with 64GB offers more performance than two base M4 units clustered.
Performance Benchmarks¶
Single-Machine Token Generation¶
| Hardware | Model | Quantization | Tokens/s |
|---|---|---|---|
| M4 base 16GB | Llama 3.1 8B | Q4 | ~15-18 |
| M4 Pro 64GB | Qwen2.5-Coder 32B | Q4_K_M | ~18 |
| M4 Pro 64GB | 8B models | Q4 | 35-45 |
| M4 Max 128GB | Qwen3-next-80B | Q4 | ~70 |
| M3 Ultra 512GB | Gemma-3 27B | Q4 | ~41 |
| M3 Ultra 512GB | Qwen-3 235B | Q4 | ~30 |
| M3 Ultra 512GB | DeepSeek V3 671B | Q2 | ~5-10 |
vs NVIDIA¶
| Metric | M3 Ultra (Mac Studio) | RTX 4090 | H100 |
|---|---|---|---|
| Memory | Up to 512GB unified | 24GB VRAM | 80GB VRAM |
| Bandwidth | 819 GB/s | 1,008 GB/s | 3,350 GB/s |
| Power | 40-80W under load | 450W | 700W |
| Price | $5,600-9,500 | $2,000+ | $30,000+ |
| Llama 7B tok/s | 30-40 | 60-80 | 200+ |
| Can run 70B? | Yes (192GB+) | No (24GB limit) | Yes |
Apple wins on capacity. NVIDIA wins on speed per GB. When your model doesn't fit in 24GB VRAM, all that NVIDIA bandwidth is irrelevant.
Models You Can Run¶
On Mac Mini M4 Pro 64GB¶
| Model | Params | Quantization | Tok/s |
|---|---|---|---|
| Qwen2.5-Coder 32B | 32B | Q4_K_M | ~18 |
| Qwen3-Coder 14B | 14B | Q8 | 10-12 |
| Llama 3.1 8B | 8B | Q4 | 18-22 |
| Mistral 7B | 7B | Q4 | 20-25 |
| GLM-4.5-Air | 106B MoE (12B active) | Q4 | ~53 |
On Mac Studio M3 Ultra 192-512GB¶
| Model | Params | Quantization | Tok/s |
|---|---|---|---|
| Qwen-3 235B | 235B | Q4_K_M | ~30 |
| GLM-4.7 | 368B MoE | Q4 | ~25 |
| Qwen3-vl 235B | 235B | Q4_K_M | ~30 |
| DeepSeek V3.1 671B | 671B | Q2-Q4 | ~5-10 |
| Falcon 180B | 180B | Q5_K_M | 5-8 |
Clustering with RDMA¶
The macOS 26.2 Revolution (December 2025)¶
Apple enabled RDMA (Remote Direct Memory Access) over Thunderbolt 5:
- Zero-copy data transfers between device memory
- 5-9 microsecond latency (matches datacenter InfiniBand)
- 80 Gb/s over Thunderbolt 5
- 90-95% practical bandwidth utilization
Before RDMA: Clustering actually degraded performance. llama.cpp dropped from 20.4 tok/s (1 node) to 15.2 tok/s (4 nodes) due to TCP overhead.
After RDMA: Performance scales. Exo + RDMA: 19.5 to 31.9 tok/s across 4 nodes.
Cluster Benchmarks¶
| Config | Model | Nodes | Tok/s |
|---|---|---|---|
| 4x Mac Studio M3 Ultra | Qwen3 235B | 1 (single-node baseline) | 20.4 |
| 4x Mac Studio M3 Ultra | Qwen3 235B | 4 (RDMA) | 31.9 |
| 4x Mac Studio M3 Ultra | DeepSeek V3.1 671B | 4 (RDMA) | 32.5 |
| 4x Mac Studio M3 Ultra | Kimi K2 Thinking 1T | 4 (RDMA) | ~15 |
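Plugging the table's own numbers into a quick calculation shows what RDMA clustering actually buys: a modest speedup per node, but a pooled memory footprint no single machine has.

```python
# Scaling efficiency of the 4-node RDMA cluster, using the benchmark numbers above.
single_node, cluster, nodes = 20.4, 31.9, 4

speedup = cluster / single_node      # ~1.56x over one node
efficiency = speedup / nodes         # ~39% per node -- communication still costs
print(f"speedup: {speedup:.2f}x, per-node efficiency: {efficiency:.0%}")
```

The point of clustering is capacity, not raw throughput: four 512GB nodes can load models (like 1T-parameter Kimi K2) that no single machine can hold.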
Cluster Cost vs NVIDIA¶
| Cluster | Cost | Total Memory | Power |
|---|---|---|---|
| 4x Mac Studio M3 Ultra 512GB | ~$38-50K | 2TB | <500W |
| Equivalent 8x H200 GPU | $270K+ | 640GB HBM | 5,600W |
Jeff Geerling's landmark test: 4x Mac Studio M3 Ultra (Apple-provided hardware) = 1.5TB combined memory. Ran Kimi K2 Thinking (1 trillion parameters) at ~15 tok/s. Cost: ~$40K. Equivalent NVIDIA: $780K+.
Software Stack¶
Inference Runtimes (Ranked by Popularity)¶
- Ollama -- Most popular. `brew install ollama && ollama run llama3.3`. OpenAI-compatible API (see the Python sketch after this list). Since v0.14.0, supports the Anthropic API (Claude Code compatible).
- LM Studio -- GUI-based. Great model management. Beginner-friendly.
- MLX -- Apple's native ML framework. Best raw performance on Apple Silicon.
- llama.cpp -- More control over quantization. Power users prefer over Ollama.
- Exo -- Distributed inference. RDMA support. Key tool for clustering.
- vLLM-MLX -- Production-grade. Achieved 464 tok/s on M4 Max.
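Because Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1, any OpenAI client library can talk to a local model. A minimal sketch, assuming `pip install openai`, a running `ollama serve`, and a pulled `llama3.3`:

```python
# Minimal sketch: query a local Ollama model through its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="ollama",                      # required by the client, ignored by Ollama
)

resp = client.chat.completions.create(
    model="llama3.3",
    messages=[{"role": "user", "content": "Why does unified memory help LLM inference?"}],
)
print(resp.choices[0].message.content)
```

The same pattern works with any local server that exposes an OpenAI-compatible endpoint (LM Studio, vLLM); only `base_url` and the model name change.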
Coding Agent Tools (Local Models)¶
| Tool | Setup | Best Models |
|---|---|---|
| Claude Code + Ollama | Set `ANTHROPIC_BASE_URL=http://localhost:11434` | GLM-4.7, Qwen3-Coder |
| OpenAI Codex CLI | Config via `codex.toml` profiles | Same + GPT-OSS |
| Roo Code | VS Code extension | Devstral, Qwen3 |
| Aider | Terminal CLI | Any OpenAI-compatible |
| Continue.dev | VS Code/JetBrains | Most flexible |
The ollama launch Command (v0.15)¶
ollama launch claude-code # Auto-configures Claude Code with local model
ollama launch codex # Sets up OpenAI Codex CLI
ollama launch opencode # Sets up OpenCode
No manual environment variables needed.
Power Efficiency¶
| Hardware | Power Under AI Load | Monthly Electric Cost |
|---|---|---|
| Mac Mini M4 | 10-30W | ~$2-5 |
| Mac Studio M3 Ultra | 40-80W | ~$6-12 |
| 4x Mac Studio cluster | <500W total | ~$36 |
| Single RTX 4090 PC | 450-600W | ~$32-43 |
| 8x H200 GPU equivalent | 5,600W | ~$403 |
A Mac Mini draws less power than a light bulb. A 4x Mac Studio cluster draws roughly one-tenth the power of the equivalent NVIDIA setup.
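The monthly figures above follow from a simple watts-to-dollars conversion. A minimal sketch, assuming ~$0.12/kWh, 24/7 operation, and the average draws shown; substitute your own rate, since electricity price is the dominant variable:

```python
# Monthly electricity cost: watts -> kWh -> dollars.
# $0.12/kWh and 730 hours/month are assumptions; adjust for your utility.

def monthly_cost(avg_watts: float, usd_per_kwh: float = 0.12, hours: float = 730) -> float:
    return avg_watts / 1000 * hours * usd_per_kwh

for name, watts in [("Mac Mini M4", 20), ("Mac Studio M3 Ultra", 60),
                    ("4x Mac Studio cluster", 450), ("RTX 4090 PC", 500)]:
    # wattages are assumed average draws under AI load
    print(f"{name:24s} ~${monthly_cost(watts):.0f}/month")
```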
What NOT to Buy¶
| Config | Why Not |
|---|---|
| Mac Mini M4 base 16GB ($599) | Too little RAM for anything beyond toy models |
| Any Mac with 24GB "for AI" | The jump to 32GB ($200 more) is worth every penny |
| Mac Studio for speed alone | Bandwidth matches cheaper configs. You're paying for capacity. |
| Mac Studio 512GB if you only need 70B models | 192GB is enough. Save $4,000. |
Honest Reality Check¶
A Medium post calculated that a developer who burned through 4.5B tokens over six months of Cursor/coding-agent use could not practically serve that volume locally. Local works for supplemental use, not as a full replacement for frontier cloud models on heavy workloads.
The pragmatic approach: Local for iteration/autocomplete/privacy. Cloud for complex multi-file coding and heavy reasoning.
Non-Apple Hardware¶
Apple Silicon dominates the "quiet home office" niche, but NVIDIA GPUs and AMD CPUs dominate everything else -- especially raw speed, enterprise scale, and cost-per-token.
When to Choose Non-Apple¶
| Scenario | Apple Silicon | NVIDIA/AMD |
|---|---|---|
| Silent home office | Winner | Loud fans |
| Models that fit in 24GB VRAM | Slower | Winner (2-4x faster) |
| Models >80GB | Winner (unified memory) | Need multi-GPU ($$$) |
| Training/fine-tuning | Too slow | Winner (CUDA ecosystem) |
| Enterprise/datacenter | Not scalable | Winner |
| Power efficiency | Winner (10x better) | 450-700W per GPU |
| Budget <$2,000 | Mac Mini M4 | RTX 4090 PC |
| Budget $5-20K | Mac Studio | Winner (Threadripper + multi-GPU) |
GPU Comparison¶
Consumer GPUs (As of Feb 2026)¶
| GPU | VRAM | Tokens/s (8B Q4) | Price | Best For |
|---|---|---|---|---|
| RTX 5090 | 32GB | ~213 | $1,999 | New king. 67% faster than 4090. |
| RTX 4090 | 24GB | ~128 | ~$1,600 | Best value. Matches $17K A100 for many tasks. |
| RTX 4080 | 16GB | ~80 | ~$1,000 | Good for 7-13B models |
| RTX 4060 | 8GB | ~40 | ~$300 | Entry-level. 7B models only. |
| RTX 3090 | 24GB | ~95 | $700-800 | Used market bargain |
Enterprise GPUs¶
| GPU | VRAM | Tokens/s (8B) | Price | Use Case |
|---|---|---|---|---|
| H100 | 80GB HBM3 | ~144 | $30,000+ | Enterprise standard |
| H200 | 141GB HBM3e | ~200+ | $35,000+ | Latest datacenter |
| B200 | 192GB HBM3e | TBD | $40,000+ | Blackwell (Feb 2026) |
| A100 | 80GB HBM2e | ~138 | $15,000+ | Workhorse, widely available |
| L40S | 48GB GDDR6 | ~110 | $8,000+ | High-throughput inference |
| RTX A6000 | 48GB GDDR6 | ~100 | $5,000 | Prosumer. Runs 70B models. |
Key insight: The RTX 4090 at ~$1,600 delivers roughly 85% of the inference performance of a $15,000+ A100. The 4090 is the price/performance sweet spot for individual developers and small teams.
RTX 5090 game-changer: 67% faster than 4090, only 25% more expensive. If buying new in 2026, the 5090 is the obvious choice.
CPU Options: AMD vs Intel¶
AMD EPYC 9th Gen (Turin) -- The Enterprise Standard¶
| Feature | Spec |
|---|---|
| Cores | Up to 192 cores (384 threads) per socket |
| RAM | Up to 1.5TB DDR5-6400 per socket |
| PCIe | Gen 5.0, 128 lanes |
| Performance | 1.4-1.76x throughput vs Intel Xeon (same workloads) |
| Best models | EPYC 9555, 9575F, 9965 |
Real benchmark: the AMD EPYC 9575F achieves 10x better "goodput" than Intel Xeon under a 400 ms latency constraint running Llama 3.3 70B with vLLM, and up to 380 tok/s with AMD PACE parallelization.
Best for: Dedicated AI inference servers, OpenClaw hosting, multi-agent deployments.
AMD Threadripper PRO 7000 WX -- Desktop Powerhouse¶
| Model | Cores | RAM Channels | Price |
|---|---|---|---|
| 7975WX | 32 | 8x DDR5 | ~$3,900 |
| 7985WX | 64 | 8x DDR5 | ~$7,350 |
| 7995WX | 96 | 8x DDR5 | ~$10,000 |
Why Threadripper for AI: 8-channel DDR5 = massive memory bandwidth for CPU offloading. Huge PCIe bandwidth for multi-GPU setups.
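The bandwidth claim is simple arithmetic: theoretical peak DRAM bandwidth is channels x transfer rate x 8 bytes per 64-bit channel. A quick sketch (peak figures, not sustained throughput):

```python
# Theoretical peak DRAM bandwidth = channels x transfer rate (MT/s) x 8 bytes
# per 64-bit channel. Shows why channel count matters for CPU-side inference.

def peak_bw_gb_s(channels: int, mt_per_s: int) -> float:
    return channels * mt_per_s * 8 / 1000   # MT/s * 8 B = MB/s, then -> GB/s

print(peak_bw_gb_s(2, 6000))    # ~96  GB/s  -- typical dual-channel desktop
print(peak_bw_gb_s(8, 6400))    # ~410 GB/s  -- 8-channel Threadripper PRO
print(peak_bw_gb_s(12, 6400))   # ~614 GB/s  -- 12-channel EPYC Turin
```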
Reddit user: Threadripper + RTX A6000 = handles 70B models "too fast to read"
Best for: Desktop AI workstations, multi-GPU inference rigs.
Intel Xeon (Emerald Rapids / Xeon 6)¶
| Model | Status |
|---|---|
| Xeon 8592V | Intel claims parity, but AMD wins 1.4-10x in benchmarks |
| Xeon 6980P | Latest, but AMD EPYC still faster per-dollar |
| Gaudi 3 | AI accelerator alternative to NVIDIA, but less community support |
Verdict: Intel is losing the AI inference war. AMD EPYC dominates 2026 benchmarks, and Intel's server CPU unit share has dropped to 37% (from 72%). Choose AMD unless you have existing Intel infrastructure commitments.
VRAM Requirements by Model Size¶
| Model Size | Q4_K_M Quant | FP16 (Full) | Minimum GPU |
|---|---|---|---|
| 7-8B | 6-7 GB | 14-16 GB | RTX 4060 (8GB) |
| 13-14B | 10-12 GB | 26-28 GB | RTX 4070 (12GB) |
| 27-32B | 18-23 GB | 54-64 GB | RTX 4090 (24GB) |
| 70B | 37-46 GB | 140 GB | Dual RTX 4090s OR A6000 (48GB) |
| 120-180B | 70-100 GB | 240-360 GB | H100 (80GB) or multi-GPU |
| 405B | 150+ GB | 810 GB | Multi-H100 cluster |
Tips:
- Q4_K_M quantization reduces VRAM by roughly 50-70% vs FP16 with minimal quality loss
- Q3_K_M saves an additional 15-20% VRAM (with more quality loss)
- CPU offloading can supplement -- move some layers to system RAM (slower, but it works)
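A small estimator reproduces the table's Q4_K_M column reasonably well. It is a sketch, assuming ~4.8 bits per weight for Q4_K_M and a flat ~1.5 GB of headroom for KV cache and runtime buffers; long contexts need more.

```python
# Rough VRAM estimate from parameter count and quantization bit-width.
# 4.8 bits/weight for Q4_K_M and 1.5 GB of fixed overhead are assumptions.

def vram_gb(params_b: float, bits_per_weight: float, overhead_gb: float = 1.5) -> float:
    weights_gb = params_b * bits_per_weight / 8   # params in billions -> GB directly
    return weights_gb + overhead_gb

print(f"8B  @ Q4_K_M: ~{vram_gb(8, 4.8):.0f} GB")    # ~6 GB  -> fits an 8 GB card
print(f"32B @ Q4_K_M: ~{vram_gb(32, 4.8):.0f} GB")   # ~21 GB -> fits a 24 GB card
print(f"70B @ Q4_K_M: ~{vram_gb(70, 4.8):.0f} GB")   # ~44 GB -> dual 24 GB cards or one 48 GB card
print(f"70B @ FP16:   ~{vram_gb(70, 16):.0f} GB")    # ~142 GB -> multi-GPU territory
```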
Community Builds¶
Budget Build: ~$2,700 (7-13B models)¶
CPU: AMD Ryzen 7 7700X (~$400)
GPU: NVIDIA RTX 4090 ($1,600)
RAM: 64GB DDR5 ($300)
Storage: 2TB NVMe ($150)
Case/PSU/Cooling: ~$250
─────────────────────────────
Total: ~$2,700
Performance: 100+ tokens/sec on 8B models
Can run: 7B, 8B, 13B, 14B comfortably
Mid-Range Build: $19,000 (Up to 70B models)¶
CPU: AMD Threadripper PRO 7995WX (~$10,000)
GPU: NVIDIA RTX A6000 48GB ($5,000)
RAM: 256GB DDR5 ($2,000)
Platform/Cooling: $2,000
─────────────────────────────
Total: ~$19,000
Performance: 40+ tokens/sec on 70B models
Can run: Everything up to 70B. CPU offload for larger.
Enterprise Build: $120-150K (70B+ production)¶
CPU: AMD EPYC 9555 dual-socket ($30,000)
GPU: 2-4x NVIDIA H100 ($60,000-120,000)
RAM: 1.5TB DDR5 ($20,000)
Network: High-speed fabric ($10,000+)
─────────────────────────────
Total: $120,000-150,000
Performance: 300+ tokens/sec concurrent
Can run: Any model. Production-grade multi-tenant.
Self-Host vs Cloud: When Does Buying Win?¶
Cloud GPU Pricing (Continuous Use)¶
| Model Size | GPU Needed | Cloud Cost/hr | Monthly (24/7) |
|---|---|---|---|
| 7B | A100 | $2-3/hr | $1,440-2,160 |
| 13B | A100 | $3-4/hr | $2,160-2,880 |
| 70B | 4-8x H100 | $12-48/hr | $8,640-34,560 |
Break-Even Analysis¶
| Utilization | 7B Models | 13B Models | 70B Models |
|---|---|---|---|
| 10% | Cloud wins | Self-host wins | Cloud wins |
| 25% | Self-host wins | Self-host wins | Cloud wins |
| 50% | Self-host wins | Self-host wins | Self-host wins |
| 100% | Self-host wins | Self-host wins | Self-host wins |
RTX 4090 ROI: $1,600 hardware cost. Cloud H100 equivalent: ~$48/hr. Break-even: ~33-50 hours of continuous use (~2-3 days).
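A rough break-even calculator, treating utilization as the fraction of a 730-hour month the hardware is actually busy, and assuming ~500W under load at $0.12/kWh (both assumptions):

```python
# Months until buying hardware beats renting cloud GPUs, at a given utilization.
# Cloud rate and hardware cost come from the figures quoted in this guide;
# wattage and electricity price are assumptions.

def breakeven_months(hw_cost: float, cloud_usd_hr: float, utilization: float,
                     watts: float = 500, usd_kwh: float = 0.12) -> float:
    hours_month = 730 * utilization
    cloud_monthly = cloud_usd_hr * hours_month
    power_monthly = watts / 1000 * hours_month * usd_kwh
    return hw_cost / (cloud_monthly - power_monthly)

# ~$2,700 RTX 4090 build vs a cloud A100 at $3/hr
print(f"{breakeven_months(2700, 3.0, 0.10):.1f} months at 10% utilization")
print(f"{breakeven_months(2700, 3.0, 0.50):.1f} months at 50% utilization")
```

Under these assumptions, the ~$2,700 budget build pays for itself in a few months at 50% utilization, and in roughly a year at 10%.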
Rule of thumb:
- <10% utilization: use cloud (pay per hour)
- 10-50% utilization: self-host with consumer GPUs (RTX 4090/5090)
- >50% utilization: self-host with enterprise hardware (EPYC + H100)
- Need >8,000 conversations/day: infrastructure investment justified