Hardware Guide: Local AI Inference¶
Apple Silicon, AMD EPYC, NVIDIA GPUs, and everything in between. What to buy for local LLM inference, what performance to expect, and when to self-host vs rent cloud GPUs.
Table of Contents¶
- Why Apple Silicon
- What to Buy (Apple)
- Performance Benchmarks (Apple)
- Models You Can Run
- Clustering with RDMA
- Software Stack
- Power Efficiency
- What NOT to Buy
- Non-Apple Hardware
- GPU Comparison
- CPU Options: AMD vs Intel
- VRAM Requirements by Model Size
- Community Builds
- Self-Host vs Cloud: When Does Buying Win?
Why Apple Silicon¶
Unified Memory Architecture (The Killer Feature)¶
Traditional NVIDIA systems:
CPU <---> System RAM (DDR5, ~90 GB/s)
|
[PCIe Bus - bottleneck: ~32 GB/s]
|
GPU <---> VRAM (GDDR6X, ~1000 GB/s but isolated to 24-32GB)
Apple Silicon:
CPU + GPU + Neural Engine + Media Engine
|
[ALL share the SAME memory pool]
|
Unified LPDDR5x Memory (up to 819 GB/s on M3 Ultra)
Why this matters for LLMs: LLM inference is memory-bandwidth-bound. The model weights stream through the processor for every token generated. An RTX 4090 has 24GB VRAM -- a 70B model at Q4 needs ~40GB and simply doesn't fit. A Mac Studio with 192GB runs it directly.
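A quick back-of-the-envelope check makes the capacity argument concrete. This is a minimal sketch, assuming roughly 0.6 bytes per parameter for a Q4-class quant plus ~20% headroom for KV cache and runtime buffers; real figures vary by runtime, quant, and context length.

```python
# Rough fit check: can a quantized model live entirely in a machine's memory?
# Assumes ~0.6 bytes/parameter for Q4_K_M-style quants plus ~20% overhead
# for KV cache and runtime buffers (both assumptions, not exact figures).

def fits_in_memory(params_billions: float, memory_gb: float,
                   bytes_per_param: float = 0.6, overhead: float = 1.2) -> bool:
    """Return True if a quantized model of this size should fit."""
    weights_gb = params_billions * bytes_per_param   # e.g. 70B * 0.6 ≈ 42 GB
    return weights_gb * overhead <= memory_gb

print(fits_in_memory(70, 24))    # False -- an RTX 4090's 24 GB can't hold a 70B Q4 model
print(fits_in_memory(70, 192))   # True  -- a 192 GB Mac Studio runs it directly
```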
Key Advantages¶
- Massive memory -- 128-512GB unified, all accessible to GPU. No PCIe bottleneck.
- Power efficiency -- 5-10x better performance-per-watt than NVIDIA for inference.
- Silent operation -- 0 dB possible. No server room needed.
- 24/7 viable -- 2-4W idle. Runs on a desk, costs ~$10-25/year in electricity.
- No driver hell -- No CUDA configuration, no multi-GPU wiring.
- Data privacy -- Everything stays local.
Key Limitations¶
- Training is slow -- 3x slower than NVIDIA. Use cloud for training.
- CUDA ecosystem -- Most ML research code is NVIDIA-first.
- Not upgradeable -- Soldered RAM. Buy what you need upfront.
- Raw speed for small models -- NVIDIA wins when models fit entirely in VRAM.
What to Buy¶
Mac Mini M4 Lineup¶
| Config | RAM | Price | Best For |
|---|---|---|---|
| M4 base | 16GB | $599 | Experimentation only. Barely runs 7-8B models. |
| M4 | 24GB | $999 | Budget entry. Runs 14B models. |
| M4 | 32GB | $1,199 | Good starter. Comfortable 14B, tight 32B. |
| M4 Pro 12-core | 48GB | $1,799 | Can run 70B quantized. |
| M4 Pro 14-core | 64GB | ~$2,200 | Crowd favorite. 32B+ comfortably. |
Mac Studio Lineup¶
| Config | RAM | Price | Best For |
|---|---|---|---|
| M4 Max 14-core | 36GB | $1,999 | Entry Studio. Only a marginal step up from the Mac Mini M4 Pro. |
| M4 Max 16-core | 128GB | ~$4,750 | Serious LLM work. 70-100B+ models. |
| M3 Ultra 32-core | 192GB | ~$5,600 | 200B+ models. Professional AI dev. |
| M3 Ultra 32-core | 512GB | ~$9,499 | Runs DeepSeek V3 671B locally. |
Memory Bandwidth (The Real Bottleneck)¶
| Chip | Bandwidth | Impact |
|---|---|---|
| M4 base | 120 GB/s | Slowest. Limits all models. |
| M4 Pro | 273 GB/s | 2.3x faster than base. |
| M4 Max | 546 GB/s | Strong for large models. |
| M3 Ultra | 819 GB/s | Currently the bandwidth king. |
Key insight: "Memory bandwidth is the real bottleneck, not GPU cores." The M4 Pro with 64GB offers more performance than two base M4 units clustered.
Performance Benchmarks¶
Single-Machine Token Generation¶
| Hardware | Model | Quantization | Tokens/s |
|---|---|---|---|
| M4 base 16GB | Llama 3.1 8B | Q4 | ~15-18 |
| M4 Pro 64GB | Qwen2.5-Coder 32B | Q4_K_M | ~18 |
| M4 Pro 64GB | 8B models | Q4 | 35-45 |
| M4 Max 128GB | Qwen3-next-80B | Q4 | ~70 |
| M3 Ultra 512GB | Gemma-3 27B | Q4 | ~41 |
| M3 Ultra 512GB | Qwen-3 235B | Q4 | ~30 |
| M3 Ultra 512GB | DeepSeek V3 671B | Q2 | ~5-10 |
vs NVIDIA¶
| Metric | M3 Ultra (Mac Studio) | RTX 4090 | H100 |
|---|---|---|---|
| Memory | Up to 512GB unified | 24GB VRAM | 80GB VRAM |
| Bandwidth | 819 GB/s | 1,008 GB/s | 3,350 GB/s |
| Power | 40-80W under load | 450W | 700W |
| Price | $5,600-9,500 | $2,000+ | $30,000+ |
| Llama 7B tok/s | 30-40 | 60-80 | 200+ |
| Can run 70B? | Yes (192GB+) | No (24GB limit) | Yes |
Apple wins on capacity. NVIDIA wins on speed per GB. When your model doesn't fit in 24GB VRAM, all that NVIDIA bandwidth is irrelevant.
Models You Can Run¶
On Mac Mini M4 Pro 64GB¶
| Model | Params | Quantization | Tok/s |
|---|---|---|---|
| Qwen2.5-Coder 32B | 32B | Q4_K_M | ~18 |
| Qwen3-Coder 14B | 14B | Q8 | 10-12 |
| Llama 3.1 8B | 8B | Q4 | 18-22 |
| Mistral 7B | 7B | Q4 | 20-25 |
| GLM-4.5-Air | 106B MoE (12B active) | Q4 | ~53 |
On Mac Studio M3 Ultra 192-512GB¶
| Model | Params | Quantization | Tok/s |
|---|---|---|---|
| Qwen-3 235B | 235B | Q4_K_M | ~30 |
| GLM-4.7 | 368B MoE | Q4 | ~25 |
| Qwen3-vl 235B | 235B | Q4_K_M | ~30 |
| DeepSeek V3.1 671B | 671B | Q2-Q4 | ~5-10 |
| Falcon 180B | 180B | Q5_K_M | 5-8 |
Clustering with RDMA¶
The macOS 26.2 Revolution (December 2025)¶
Apple enabled RDMA (Remote Direct Memory Access) over Thunderbolt 5:
- Zero-copy data transfers between device memory
- 5-9 microsecond latency (matches datacenter InfiniBand)
- 80 Gb/s over Thunderbolt 5
- 90-95% practical bandwidth utilization
Before RDMA: Clustering actually degraded performance. llama.cpp dropped from 20.4 tok/s (1 node) to 15.2 tok/s (4 nodes) due to TCP overhead.
After RDMA: Performance scales. Exo + RDMA: 19.5 to 31.9 tok/s across 4 nodes.
Cluster Benchmarks¶
| Config | Model | Nodes | Tok/s |
|---|---|---|---|
| 4x Mac Studio M3 Ultra | Qwen3 235B | 1 (single-node baseline) | 20.4 |
| 4x Mac Studio M3 Ultra | Qwen3 235B | 4 (RDMA) | 31.9 |
| 4x Mac Studio M3 Ultra | DeepSeek V3.1 671B | 4 (RDMA) | 32.5 |
| 4x Mac Studio M3 Ultra | Kimi K2 Thinking 1T | 4 (RDMA) | ~15 |
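Plugging the table's own numbers into a quick calculation shows what RDMA clustering actually buys: a modest speedup per node, but a pooled memory footprint no single machine has.

```python
# Scaling efficiency of the 4-node RDMA cluster, using the benchmark numbers above.
single_node, cluster, nodes = 20.4, 31.9, 4

speedup = cluster / single_node      # ~1.56x over one node
efficiency = speedup / nodes         # ~39% per node -- communication still costs
print(f"speedup: {speedup:.2f}x, per-node efficiency: {efficiency:.0%}")
```

The point of clustering is capacity, not raw throughput: four 512GB nodes can load models (like 1T-parameter Kimi K2) that no single machine can hold.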
Cluster Cost vs NVIDIA¶
| Cluster | Cost | Total Memory | Power |
|---|---|---|---|
| 4x Mac Studio M3 Ultra 512GB | ~$38-50K | 2TB | <500W |
| Equivalent 8x H200 GPU | $270K+ | 640GB HBM | 5,600W |
Jeff Geerling's landmark test: 4x Mac Studio M3 Ultra (Apple-provided hardware) = 1.5TB combined memory. Ran Kimi K2 Thinking (1 trillion parameters) at ~15 tok/s. Cost: ~$40K. Equivalent NVIDIA: $780K+.
Software Stack¶
Inference Runtimes (Ranked by Popularity)¶
- Ollama -- Most popular. `brew install ollama && ollama run llama3.3`. OpenAI-compatible API (see the Python sketch after this list). Since v0.14.0, supports the Anthropic API (Claude Code compatible).
- LM Studio -- GUI-based. Great model management. Beginner-friendly.
- MLX -- Apple's native ML framework. Best raw performance on Apple Silicon.
- llama.cpp -- More control over quantization. Power users prefer over Ollama.
- Exo -- Distributed inference. RDMA support. Key tool for clustering.
- vLLM-MLX -- Production-grade. Achieved 464 tok/s on M4 Max.
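Because Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1, any OpenAI client library can talk to a local model. A minimal sketch, assuming `pip install openai`, a running `ollama serve`, and a pulled `llama3.3`:

```python
# Minimal sketch: query a local Ollama model through its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="ollama",                      # required by the client, ignored by Ollama
)

resp = client.chat.completions.create(
    model="llama3.3",
    messages=[{"role": "user", "content": "Why does unified memory help LLM inference?"}],
)
print(resp.choices[0].message.content)
```

The same pattern works with any local server that exposes an OpenAI-compatible endpoint (LM Studio, vLLM); only `base_url` and the model name change.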
Coding Agent Tools (Local Models)¶
| Tool | Setup | Best Models |
|---|---|---|
| Claude Code + Ollama | Set `ANTHROPIC_BASE_URL=http://localhost:11434` | GLM-4.7, Qwen3-Coder |
| OpenAI Codex CLI | Config via `codex.toml` profiles | Same + GPT-OSS |
| Roo Code | VS Code extension | Devstral, Qwen3 |
| Aider | Terminal CLI | Any OpenAI-compatible |
| Continue.dev | VS Code/JetBrains | Most flexible |
The ollama launch Command (v0.15)¶
ollama launch claude-code # Auto-configures Claude Code with local model
ollama launch codex # Sets up OpenAI Codex CLI
ollama launch opencode # Sets up OpenCode
No manual environment variables needed.
Power Efficiency¶
| Hardware | Power Under AI Load | Monthly Electric Cost |
|---|---|---|
| Mac Mini M4 | 10-30W | ~$2-5 |
| Mac Studio M3 Ultra | 40-80W | ~$6-12 |
| 4x Mac Studio cluster | <500W total | ~$36 |
| Single RTX 4090 PC | 450-600W | ~$32-43 |
| 8x H200 GPU equivalent | 5,600W | ~$403 |
A Mac Mini draws less power than a light bulb. A 4x Mac Studio cluster draws roughly one-tenth the power of the equivalent NVIDIA setup.
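The monthly figures above follow from a simple watts-to-dollars conversion. A minimal sketch, assuming ~$0.12/kWh, 24/7 operation, and the average draws shown; substitute your own rate, since electricity price is the dominant variable:

```python
# Monthly electricity cost: watts -> kWh -> dollars.
# $0.12/kWh and 730 hours/month are assumptions; adjust for your utility.

def monthly_cost(avg_watts: float, usd_per_kwh: float = 0.12, hours: float = 730) -> float:
    return avg_watts / 1000 * hours * usd_per_kwh

for name, watts in [("Mac Mini M4", 20), ("Mac Studio M3 Ultra", 60),
                    ("4x Mac Studio cluster", 450), ("RTX 4090 PC", 500)]:
    # wattages are assumed average draws under AI load
    print(f"{name:24s} ~${monthly_cost(watts):.0f}/month")
```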
What NOT to Buy¶
| Config | Why Not |
|---|---|
| Mac Mini M4 base 16GB ($599) | Too little RAM for anything beyond toy models |
| Any Mac with 24GB "for AI" | The jump to 32GB ($200 more) is worth every penny |
| Mac Studio for speed alone | Bandwidth matches cheaper configs. You're paying for capacity. |
| Mac Studio 512GB if you only need 70B models | 192GB is enough. Save $4,000. |
Honest Reality Check¶
A Medium post calculated that a developer who burned through 4.5B tokens over six months of Cursor/coding-agent use could not practically serve that volume locally. Local works for supplemental use, not as a full replacement for frontier cloud models on heavy workloads.
The pragmatic approach: Local for iteration/autocomplete/privacy. Cloud for complex multi-file coding and heavy reasoning.
Non-Apple Hardware¶
Apple Silicon dominates the "quiet home office" niche, but NVIDIA GPUs and AMD CPUs dominate everything else -- especially raw speed, enterprise scale, and cost-per-token.
When to Choose Non-Apple¶
| Scenario | Apple Silicon | NVIDIA/AMD |
|---|---|---|
| Silent home office | Winner | Loud fans |
| Models that fit in 24GB VRAM | Slower | Winner (2-4x faster) |
| Models >80GB | Winner (unified memory) | Need multi-GPU ($$$) |
| Training/fine-tuning | Too slow | Winner (CUDA ecosystem) |
| Enterprise/datacenter | Not scalable | Winner |
| Power efficiency | Winner (10x better) | 450-700W per GPU |
| Budget <$2,000 | Mac Mini M4 | RTX 4090 PC |
| Budget $5-20K | Mac Studio | Winner (Threadripper + multi-GPU) |
GPU Comparison¶
Consumer GPUs (As of Feb 2026)¶
| GPU | VRAM | Tokens/s (8B Q4) | Price | Best For |
|---|---|---|---|---|
| RTX 5090 | 32GB | ~213 | $1,999 | New king. 67% faster than 4090. |
| RTX 4090 | 24GB | ~128 | ~$1,600 | Best value. Matches $17K A100 for many tasks. |
| RTX 4080 | 16GB | ~80 | ~$1,000 | Good for 7-13B models |
| RTX 4060 | 8GB | ~40 | ~$300 | Entry-level. 7B models only. |
| RTX 3090 | 24GB | ~95 | $700-800 | Used market bargain |
Enterprise GPUs¶
| GPU | VRAM | Tokens/s (8B) | Price | Use Case |
|---|---|---|---|---|
| H100 | 80GB HBM3 | ~144 | $30,000+ | Enterprise standard |
| H200 | 141GB HBM3e | ~200+ | $35,000+ | Latest datacenter |
| B200 | 192GB HBM3e | TBD | $40,000+ | Blackwell (Feb 2026) |
| A100 | 80GB HBM2e | ~138 | $15,000+ | Workhorse, widely available |
| L40S | 48GB GDDR6 | ~110 | $8,000+ | High-throughput inference |
| RTX A6000 | 48GB GDDR6 | ~100 | $5,000 | Prosumer. Runs 70B models. |
Key insight: The RTX 4090 at ~$1,600 delivers roughly 85% of the inference performance of a $15,000+ A100. The 4090 is the price/performance sweet spot for individual developers and small teams.
RTX 5090 game-changer: 67% faster than 4090, only 25% more expensive. If buying new in 2026, the 5090 is the obvious choice.
CPU Options: AMD vs Intel¶
AMD EPYC 9th Gen (Turin) -- The Enterprise Standard¶
| Feature | Spec |
|---|---|
| Cores | Up to 192 cores (384 threads) per socket |
| RAM | Up to 1.5TB DDR5-6400 per socket |
| PCIe | Gen 5.0, 128 lanes |
| Performance | 1.4-1.76x throughput vs Intel Xeon (same workloads) |
| Best models | EPYC 9555, 9575F, 9965 |
Real benchmark: the AMD EPYC 9575F achieves 10x better "goodput" than Intel Xeon under a 400 ms latency constraint running Llama 3.3 70B with vLLM, and up to 380 tok/s with AMD PACE parallelization.
Best for: Dedicated AI inference servers, OpenClaw hosting, multi-agent deployments.
AMD Threadripper PRO 7000 WX -- Desktop Powerhouse¶
| Model | Cores | RAM Channels | Price |
|---|---|---|---|
| 7975WX | 32 | 8x DDR5 | ~$3,900 |
| 7985WX | 64 | 8x DDR5 | ~$7,350 |
| 7995WX | 96 | 8x DDR5 | ~$10,000 |
Why Threadripper for AI: 8-channel DDR5 = massive memory bandwidth for CPU offloading. Huge PCIe bandwidth for multi-GPU setups.
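The bandwidth claim is simple arithmetic: theoretical peak DRAM bandwidth is channels x transfer rate x 8 bytes per 64-bit channel. A quick sketch (peak figures, not sustained throughput):

```python
# Theoretical peak DRAM bandwidth = channels x transfer rate (MT/s) x 8 bytes
# per 64-bit channel. Shows why channel count matters for CPU-side inference.

def peak_bw_gb_s(channels: int, mt_per_s: int) -> float:
    return channels * mt_per_s * 8 / 1000   # MT/s * 8 B = MB/s, then -> GB/s

print(peak_bw_gb_s(2, 6000))    # ~96  GB/s  -- typical dual-channel desktop
print(peak_bw_gb_s(8, 6400))    # ~410 GB/s  -- 8-channel Threadripper PRO
print(peak_bw_gb_s(12, 6400))   # ~614 GB/s  -- 12-channel EPYC Turin
```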
Reddit user: Threadripper + RTX A6000 = handles 70B models "too fast to read"
Best for: Desktop AI workstations, multi-GPU inference rigs.
Intel Xeon (Emerald Rapids / Xeon 6)¶
| Model | Status |
|---|---|
| Xeon 8592V | Intel claims parity, but AMD wins 1.4-10x in benchmarks |
| Xeon 6980P | Latest, but AMD EPYC still faster per-dollar |
| Gaudi 3 | AI accelerator alternative to NVIDIA, but less community support |
Verdict: Intel is losing the AI inference war. AMD EPYC dominates 2026 benchmarks, and Intel's server CPU unit share has dropped to 37% (from 72%). Choose AMD unless you have existing Intel infrastructure commitments.
VRAM Requirements by Model Size¶
| Model Size | Q4_K_M Quant | FP16 (Full) | Minimum GPU |
|---|---|---|---|
| 7-8B | 6-7 GB | 14-16 GB | RTX 4060 (8GB) |
| 13-14B | 10-12 GB | 26-28 GB | RTX 4070 (12GB) |
| 27-32B | 18-23 GB | 54-64 GB | RTX 4090 (24GB) |
| 70B | 37-46 GB | 140 GB | Dual RTX 4090s OR A6000 (48GB) |
| 120-180B | 70-100 GB | 240-360 GB | H100 (80GB) or multi-GPU |
| 405B | 150+ GB | 810 GB | Multi-H100 cluster |
Tips:
- Q4_K_M quantization reduces VRAM by roughly 50-70% vs FP16 with minimal quality loss
- Q3_K_M saves an additional 15-20% VRAM (with more quality loss)
- CPU offloading can supplement -- move some layers to system RAM (slower, but it works)
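A small estimator reproduces the table's Q4_K_M column reasonably well. It is a sketch, assuming ~4.8 bits per weight for Q4_K_M and a flat ~1.5 GB of headroom for KV cache and runtime buffers; long contexts need more.

```python
# Rough VRAM estimate from parameter count and quantization bit-width.
# 4.8 bits/weight for Q4_K_M and 1.5 GB of fixed overhead are assumptions.

def vram_gb(params_b: float, bits_per_weight: float, overhead_gb: float = 1.5) -> float:
    weights_gb = params_b * bits_per_weight / 8   # params in billions -> GB directly
    return weights_gb + overhead_gb

print(f"8B  @ Q4_K_M: ~{vram_gb(8, 4.8):.0f} GB")    # ~6 GB  -> fits an 8 GB card
print(f"32B @ Q4_K_M: ~{vram_gb(32, 4.8):.0f} GB")   # ~21 GB -> fits a 24 GB card
print(f"70B @ Q4_K_M: ~{vram_gb(70, 4.8):.0f} GB")   # ~44 GB -> dual 24 GB cards or one 48 GB card
print(f"70B @ FP16:   ~{vram_gb(70, 16):.0f} GB")    # ~142 GB -> multi-GPU territory
```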
Community Builds¶
Budget Build: ~$2,700 (7-13B models)¶
CPU: AMD Ryzen 7 7700X (~$400)
GPU: NVIDIA RTX 4090 ($1,600)
RAM: 64GB DDR5 ($300)
Storage: 2TB NVMe ($150)
Case/PSU/Cooling: ~$250
─────────────────────────────
Total: ~$2,700
Performance: 100+ tokens/sec on 8B models
Can run: 7B, 8B, 13B, 14B comfortably
Mid-Range Build: $19,000 (Up to 70B models)¶
CPU: AMD Threadripper PRO 7995WX (~$10,000)
GPU: NVIDIA RTX A6000 48GB ($5,000)
RAM: 256GB DDR5 ($2,000)
Platform/Cooling: $2,000
─────────────────────────────
Total: ~$19,000
Performance: 40+ tokens/sec on 70B models
Can run: Everything up to 70B. CPU offload for larger.
Enterprise Build: $120-150K (70B+ production)¶
CPU: AMD EPYC 9555 dual-socket ($30,000)
GPU: 2-4x NVIDIA H100 ($60,000-120,000)
RAM: 1.5TB DDR5 ($20,000)
Network: High-speed fabric ($10,000+)
─────────────────────────────
Total: $120,000-150,000
Performance: 300+ tokens/sec concurrent
Can run: Any model. Production-grade multi-tenant.
Self-Host vs Cloud: When Does Buying Win?¶
Cloud GPU Pricing (Continuous Use)¶
| Model Size | GPU Needed | Cloud Cost/hr | Monthly (24/7) |
|---|---|---|---|
| 7B | A100 | $2-3/hr | $1,440-2,160 |
| 13B | A100 | $3-4/hr | $2,160-2,880 |
| 70B | 4-8x H100 | $12-48/hr | $8,640-34,560 |
Break-Even Analysis¶
| Utilization | 7B Models | 13B Models | 70B Models |
|---|---|---|---|
| 10% | Cloud wins | Self-host wins | Cloud wins |
| 25% | Self-host wins | Self-host wins | Cloud wins |
| 50% | Self-host wins | Self-host wins | Self-host wins |
| 100% | Self-host wins | Self-host wins | Self-host wins |
RTX 4090 ROI: $1,600 hardware cost. Cloud H100 equivalent: ~$48/hr. Break-even: ~33-50 hours of continuous use (~2-3 days).
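A rough break-even calculator, treating utilization as the fraction of a 730-hour month the hardware is actually busy, and assuming ~500W under load at $0.12/kWh (both assumptions):

```python
# Months until buying hardware beats renting cloud GPUs, at a given utilization.
# Cloud rate and hardware cost come from the figures quoted in this guide;
# wattage and electricity price are assumptions.

def breakeven_months(hw_cost: float, cloud_usd_hr: float, utilization: float,
                     watts: float = 500, usd_kwh: float = 0.12) -> float:
    hours_month = 730 * utilization
    cloud_monthly = cloud_usd_hr * hours_month
    power_monthly = watts / 1000 * hours_month * usd_kwh
    return hw_cost / (cloud_monthly - power_monthly)

# ~$2,700 RTX 4090 build vs a cloud A100 at $3/hr
print(f"{breakeven_months(2700, 3.0, 0.10):.1f} months at 10% utilization")
print(f"{breakeven_months(2700, 3.0, 0.50):.1f} months at 50% utilization")
```

Under these assumptions, the ~$2,700 budget build pays for itself in a few months at 50% utilization, and in roughly a year at 10%.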
Rule of thumb:
- <10% utilization: use cloud (pay per hour)
- 10-50% utilization: self-host with consumer GPUs (RTX 4090/5090)
- >50% utilization: self-host with enterprise hardware (EPYC + H100)
- Need >8,000 conversations/day: infrastructure investment justified