Best local and self-hosted AI in 2026: eight platforms scored hardware-first
Most local-AI content in 2026 leads with software features. The actual buying decision is hardware-constrained before it is software-feature-driven. A Mac with M3 Max and 96GB unified memory can run Llama 3.3 70B at usable speed; a $1500 RTX 4070 PC tops out comfortably at 30B class models; CPU-only hardware stops at Phi-3-mini and Llama 3.2 3B. Every other choice (which UI, which RAG layer, which API server) is downstream of which model catalog your hardware can actually load into memory. Layered on top is the irreducible value local AI delivers that no cloud provider can match contractually: zero data egress. We scored Ollama, LM Studio, GPT4All, Jan, LocalAI, Open WebUI, AnythingLLM, and Msty across hardware-tier compatibility, feature depth, privacy posture, and developer ergonomics. Match a stack to your machine with our AI stack optimizer in 60 seconds, sharpen your local-LLM prompts in the prompt compiler, or track GPU pricing in the AI tool pricing tracker. Jump to the hardware-first decision tree.
- Hardware decides first: Which model catalog you can run is determined by your hardware before any software choice matters. CPU-only tops out at Phi-3-mini and Llama 3.2 3B; an M3 Max with 64GB unified memory runs Llama 3.3 70B at usable speed.
- Best pairings: Ollama plus Open WebUI is the strongest open-source stack on Apple Silicon. LM Studio is the smoothest GUI for Windows or Linux RTX users. GPT4All and Jan are the practical limit on CPU-only hardware.
- Privacy advantage: A local model running on your own hardware with no outbound connection during inference is the only privacy guarantee no cloud provider can contractually match.
- Cost reality: All eight platforms are free software. A capable consumer-GPU build costs $1,500 to $2,200; a Mac M3 Max 64GB runs $4,000 to $4,500. Break-even vs cloud API spend occurs around 10 to 50 million tokens per month.
The eight platforms at a glance
Quick verdict by use case. Each pick names the winner and a one-line rationale; the matrices and deep dives below show the work. Use the hardware tabs further down to filter by what your machine can actually run.
Hardware tier reality: what your machine can actually run
The single largest determinant of the local-AI experience is the hardware underneath. Four tiers cover ~95% of real-world setups. Click a tab below to see the model catalog and tokens-per-second range each tier sustains for the eight tools in this comparison. Specifications are May 2026 reference builds; substitute equivalent silicon.
Apple Silicon: unified memory is the cheat code
Apple's unified-memory architecture lets the GPU address the same memory pool as the CPU, so a 70B model that requires roughly 40GB of VRAM (and which the largest single consumer NVIDIA card, the RTX 4090 at 24GB, cannot load) runs natively. The trade-off is memory bandwidth: M-series tops out around 400 GB/s on the Max tier and 800 GB/s on the Ultra, versus the RTX 4090 at roughly 1 TB/s. Macs win on what fits in memory; RTX wins on raw inference speed for what does fit.
| Model | Size on disk (Q4) | Tokens/sec (M3 Max 64GB) | Notes |
|---|---|---|---|
| Llama 3.2 3B | 2 GB | ~80-100 | Trivial; runs while browsing |
| Llama 3.1 8B | 5 GB | ~40-55 | Daily-driver model class |
| Mixtral 8x7B | 26 GB | ~18-25 | MoE architecture is friendly to unified memory |
| Qwen 2.5 32B | 20 GB | ~15-22 | Strong code and reasoning |
| Llama 3.3 70B Q4 | 40 GB | ~8-15 | Usable for batch work, slow for chat with long context |
| Llama 3.1 405B | 230 GB | N/A | Out of reach on consumer Macs; needs 256GB+ unified |
NVIDIA RTX consumer cards: bandwidth wins on what fits
RTX cards deliver the highest tokens-per-second per dollar for models that fit in VRAM. The hard ceiling is VRAM capacity: the RTX 4090 (24GB) is the largest single consumer card as of May 2026, and the RTX 5090 (32GB) is the next-gen anchor. Multi-GPU setups (2x or 4x cards) scale model size but add complexity. The 24GB ceiling means 70B-class models in Q4 only run when split across CPU offload or multiple GPUs, with material speed penalty.
| Model | VRAM needed (Q4) | Tokens/sec (RTX 4070 12GB) | Notes |
|---|---|---|---|
| Llama 3.2 3B | 2 GB | ~120-160 | Real-time chat |
| Llama 3.1 8B | 5 GB | ~60-90 | Sweet spot for 12GB cards |
| Phi-3-medium 14B | 9 GB | ~35-50 | Fits comfortably in 12GB |
| Mistral Small 22B | 13 GB | ~12-20 | Spills to CPU on 12GB; needs RTX 4090 (24GB) for pure GPU |
| Qwen 2.5 32B | 20 GB | ~3-8 on 4070 | Needs RTX 4090 / 5090 for usable speed |
| Llama 3.3 70B Q4 | 40 GB | ~1-3 on 4070 | Needs 2x 24GB cards or unified-memory Mac |
CPU-only: the small-model floor
CPU-only inference is bottlenecked by memory bandwidth (DDR4 at roughly 50 GB/s, DDR5 at roughly 100 GB/s, versus dedicated VRAM at 500-1000 GB/s). Practical models stop at the 3-8B class with Q4 quantization. Phi-3-mini is the standout: Microsoft optimized it for edge inference, and it punches above its weight on CPU. Anything 13B+ is technically loadable on 16GB RAM but practically unusable for interactive chat.
| Model | RAM needed (Q4) | Tokens/sec (modern laptop CPU) | Notes |
|---|---|---|---|
| Phi-3-mini 3.8B | 3 GB | ~10-18 | Best CPU-only daily driver |
| Llama 3.2 1B | 1 GB | ~25-40 | Fast but quality drops noticeably |
| Llama 3.2 3B | 2.5 GB | ~8-14 | Solid quality for the size class |
| Gemma 2 2B | 2 GB | ~12-20 | Google's small-model entry |
| Llama 3.1 8B | 5 GB | ~3-6 | Borderline usable, painful for chat |
| Anything 13B+ | 8+ GB | <2 | Loadable, not interactive |
Cloud GPU VPS: bursty workloads + frontier models
Cloud GPU rental shifts the economics from capex to opex. Useful for evaluating frontier models (Llama 3.1 405B, DeepSeek 67B) that consumer hardware cannot load, for bursty training and fine-tuning, and for self-hosted production deployments where the team needs an always-on inference endpoint. The meter never stops, so steady-state inference at moderate volume usually costs more than a desktop build amortized over 24 months.
| GPU tier | Approx hourly rate | Largest comfortable model | Best for |
|---|---|---|---|
| RTX 3090 / 4090 (24GB) | $0.40-$0.80/hr | 30B class Q4 | Personal experimentation |
| RTX A6000 (48GB) | $0.80-$1.40/hr | 70B class Q4 | Small-team self-hosted |
| L40S (48GB) | $1.10-$1.80/hr | 70B class Q5 | Production inference |
| H100 80GB | $2.00-$3.50/hr | 405B Q4 with care | Frontier-model eval, fine-tuning |
| 2x H100 / H200 (160-282GB) | $4.50-$9.00/hr | 405B full precision | Research and training |
Tokens-per-second across the matrix
Three model classes (small / mid / large), three hardware tiers, eight tools. The chart below normalizes the achievable throughput for the most-cited combinations. Numbers are aggregates of llama.cpp upstream benchmarks, Ollama community reports, and Nesyona spot checks in May 2026.
Pricing reality: software free, hardware not
Every platform on this list is free to install. Six of eight are open-source under permissive licenses (Ollama MIT, GPT4All MIT, Jan AGPL, LocalAI MIT, Open WebUI MIT, AnythingLLM MIT). LM Studio is closed-source freeware. Msty is freemium with a paid Aurum tier. The cost surface that matters is hardware capex (one-time) and electricity (recurring), not software licensing.
| Platform | License | Software cost | Hardware floor | Best-fit hardware tier |
|---|---|---|---|---|
| Ollama | MIT (open source) | $0 | 16GB RAM / Apple Silicon or 8GB VRAM | All tiers; scales from laptop to multi-GPU |
| LM Studio | Closed-source freeware | $0 | 16GB RAM / Apple Silicon or 8GB VRAM | Mac and RTX desktops; smoothest GUI |
| GPT4All | MIT (open source) | $0 | 8GB RAM (CPU-only OK) | CPU-only laptops + entry hardware |
| Jan | AGPL (open source) | $0 | 16GB RAM / Apple Silicon or 8GB VRAM | Mac, Windows, Linux desktops |
| LocalAI | MIT (open source) | $0 | 32GB RAM + GPU recommended | Self-hosted server, Docker, cloud VPS |
| Open WebUI | MIT (open source) | $0 | 32GB RAM + Ollama backend | Self-hosted multi-user team chat |
| AnythingLLM | MIT (open source) | $0 desktop / $50+/mo cloud | 16GB RAM + LLM backend | RAG / document Q&A workspaces |
| Msty | Freemium | $0 / $69 lifetime Aurum | 16GB RAM / Apple Silicon or 8GB VRAM | Mac-native power-user, branching chat |
Project home pages and download surfaces: Ollama, LM Studio, GPT4All by Nomic, Jan, LocalAI, Open WebUI, AnythingLLM, and Msty.
Capability matrix: ten axes across all eight platforms
Ten capability axes covering interface category (CLI vs GUI vs API vs full stack), hardware support, model catalog, and feature depth. Read across the row for what a single tool covers; read down a column to see which tools cover a given capability. The "Privacy posture" column at the right captures the zero-data-egress reality each platform delivers when run with no outbound network connection.
| Tool | Interface | Apple Silicon | NVIDIA CUDA | AMD ROCm | CPU-only usable | OpenAI-API server | RAG built-in | Multi-user | Model catalog | Privacy posture |
|---|---|---|---|---|---|---|---|---|---|---|
| Ollama | CLI + HTTP API | Native | Native | Partial | Yes | Yes | No (use Open WebUI) | Single-user (multi via wrapper) | Huge (Ollama library + HF GGUF) | Zero egress when offline |
| LM Studio | Desktop GUI + API | Native | Native | Partial | Yes | Yes | No | Single-user | Huge (HF GGUF browser) | Closed source; egress only on explicit cloud |
| GPT4All | Desktop GUI | Native | Native | Partial | Best in class | Yes | LocalDocs (built-in) | Single-user | Curated catalog + custom GGUF | Zero egress when offline |
| Jan | Desktop GUI + API | Native | Native | Partial | Yes | Yes | Partial | Single-user | Curated + HF GGUF | Zero egress when offline |
| LocalAI | API server (OpenAI-compat) | Yes | Native | Yes | Yes | Native | Via embeddings API | Multi-user (API) | Huge (text + image + audio) | Zero egress when offline |
| Open WebUI | Web UI (self-hosted) | Via Ollama | Via Ollama | Via Ollama | Via Ollama | Yes (proxy) | Yes (built-in) | Yes (RBAC) | Yes (Ollama catalog) | Zero egress when offline |
| AnythingLLM | Desktop + Docker + web | Yes | Yes | Via backend | Via backend | Yes | Best in class | Yes (workspaces) | Yes (any backend) | Zero egress when offline |
| Msty | Desktop GUI | Native | Native | Partial | Yes | Yes | Knowledge stacks | Single-user | HF GGUF + cloud BYO key | Closed source; egress only on BYO cloud |
Ease-of-setup ร feature-depth tier ladder
Platforms ranked by the combined ease of getting a model running today and the depth of features available once you do. A high tier means the platform delivers a strong experience with minimal friction for the modal self-hosting user. A low tier does not mean the tool is bad; it means either the on-ramp is steeper or the feature ceiling is lower than the peers. Tier placement is independent of which hardware tier you sit on.
-
S-tier ยท Workhorse backends and the GUI flagships
Pick firstOllama, LM Studio, Open WebUIOllama is the default backend the rest of the local-AI ecosystem layers on top of: install once, pull a model with one command, hit the API from anything. LM Studio is the smoothest first-experience for non-CLI users and is the easiest place to evaluate models before committing one to an Ollama deployment. Open WebUI gives a self-hosted ChatGPT-style web frontend that, paired with Ollama, replaces a SaaS chat UI for an entire team. These three cover the largest share of real-world local-AI deployments; everything else is a specialization on top.
-
A-tier ยท Specialized power-user tools
Strong fit by use caseAnythingLLM, LocalAI, MstyAnythingLLM is the production pattern for company-knowledge chatbots and document-Q-and-A workspaces; the RAG and workspace surface is the most mature on this list. LocalAI is the OpenAI-API drop-in: point any OpenAI SDK at LocalAI's endpoint and the calls work, which collapses migration friction for teams porting an OpenAI-built app to self-hosted inference. Msty is the polished Mac-native power-user experience with split-chat, branching, and knowledge stacks that LM Studio does not match; Aurum unlocks the deeper feature set.
-
A-tier ยท Fully-open desktop alternative
Open-source-firstJanJan is the open-source answer to LM Studio for users who require an AGPL or otherwise fully-open desktop. Feature parity is close on the core chat and model-management surface; the UI lags LM Studio on polish but ships steady updates and a healthy extension ecosystem. For privacy-paranoid users who will not run closed-source binaries even when offline, Jan is the right pick.
-
B-tier ยท CPU-only on-ramp + curated catalog
Best entry on weak hardwareGPT4AllGPT4All is the easiest install for users on CPU-only hardware and the curated model list shields newcomers from quantization-format debates. LocalDocs (built-in RAG) is functional out of the box. The trade-off is feature ceiling: power users outgrow GPT4All faster than the other tools on this list. Use it to onboard, graduate to Ollama plus a UI as comfort grows.
-
C-tier ยท Out-of-scope tools commonly conflated with local AI
Not actually localChatGPT desktop app, Claude desktop, Perplexity desktopDesktop apps for cloud LLM services are not local AI. Every prompt and response transits the provider's infrastructure regardless of which app surfaces the chat. If the privacy-sovereignty axis matters at all, these are out of scope. Useful for cloud-LLM access with a nicer keyboard shortcut and tray icon, never for zero-data-egress workflows.
Who picks what: persona grid
Five recurring personas across the local-AI buyer set. Match yourself to the closest situation and use the pick as the first-cut recommendation; the deep dives below adjust for edge cases.
Decision tree: hardware first, then goal, then tool
Deep dives: when each tool is the right pick
Ollama: the workhorse backend everyone layers on
Strengths: one-command install on macOS, Linux, and Windows; a curated model library plus support for any Hugging Face GGUF; native Apple Metal and NVIDIA CUDA acceleration; HTTP API on localhost:11434 that every UI in the ecosystem speaks; vibrant community shipping wrappers, plugins, and integrations weekly. The closest thing to a default in the local-LLM stack as of May 2026, with over 100K GitHub stars. Weaknesses: CLI-first (which is a feature for developers and a friction wall for non-technical users), no built-in chat UI (intentional; pair with Open WebUI or LM Studio), AMD ROCm support trails NVIDIA materially. Best for: developers, teams building on top of local inference, anyone running a UI layer (Open WebUI, AnythingLLM, Msty) that expects an Ollama backend, and as the long-lived service on a headless server. Cost: free, open source MIT per ollama.com.
LM Studio: the smoothest desktop GUI
Strengths: polished cross-platform desktop app with the cleanest model-browser in the category, deep Hugging Face GGUF integration including quantization filter and hardware-fit warning, OpenAI-compatible local server with one-click on, parameter-tuning UI that exposes context length, temperature, and sampler choice without dropping to CLI. Weaknesses: closed source (deal-breaker for some privacy postures), single-user, slower release cadence on niche features than the open-source peers. Best for: Mac and Windows desktops where the user wants the smoothest first-experience, rapid model evaluation, and an OpenAI-compatible endpoint for app development without setting up Ollama. Cost: free freeware per lmstudio.ai.
GPT4All: the lightest install and the CPU-only champion
Strengths: the lowest-friction install on this list, curated model catalog that protects newcomers from quantization-format debates, LocalDocs built-in for RAG over a folder of files, runs respectably on CPU-only hardware where the alternatives crawl. Maintained by Nomic, with active development through 2025-26 and an OpenAI-compatible API server in recent versions. Weaknesses: smaller catalog than Ollama or LM Studio, feature ceiling lower than the power-user tools, less Hugging Face GGUF flexibility. Best for: CPU-only laptops, first-time local-AI users, low-spec hardware, and anyone who wants a no-decisions install that just works. Cost: free, open source MIT per nomic.ai/gpt4all.
Jan: the fully-open desktop alternative
Strengths: AGPL-licensed open-source desktop app, feature-comparable to LM Studio on the core chat and model-management surface, bring-your-own-key cloud-model option for users who want to mix local and remote in one UI, extension architecture. The right pick when LM Studio's closed-source posture is disqualifying. Weaknesses: UI polish lags LM Studio, smaller community, occasional release-cadence gaps. Best for: open-source-first desktop users, privacy-paranoid setups that will not run closed-source binaries even offline, and developers who want a hackable desktop frontend. Cost: free, open source AGPL per jan.ai.
LocalAI: the OpenAI-API drop-in for self-hosted servers
Strengths: exposes a faithful OpenAI-compatible REST API for text completion, chat completion, embeddings, image generation (Stable Diffusion), and audio (Whisper). Point any OpenAI SDK at LocalAI's endpoint and the calls just work, which collapses migration friction for teams porting an OpenAI-built app to self-hosted inference. Runs in Docker, on bare metal, on cloud VPS GPUs. NVIDIA CUDA and AMD ROCm both supported. Weaknesses: server-only (no first-party GUI; pair with Open WebUI or AnythingLLM), setup is more involved than desktop apps, model-fetch workflow assumes some comfort with config files. Best for: teams self-hosting an OpenAI-SDK-built app, server deployments on cloud VPS GPUs, and any stack that needs text plus image plus audio plus embeddings from one process. Cost: free, open source MIT per localai.io.
Open WebUI: the self-hosted ChatGPT-style web frontend
Strengths: the closest open-source equivalent to the ChatGPT web UI, with multi-user authentication and RBAC, built-in RAG over uploaded documents, plugin system, web-search integration, voice input and output, and a polished chat surface that holds its own next to commercial peers. Pairs natively with Ollama as the backend. The default "team chat replacement" pattern for orgs moving off SaaS chat tools. Weaknesses: requires a separate backend (Ollama is the canonical pairing), setup is web-app deployment work rather than desktop install, plugin ecosystem still maturing. Best for: teams that want a self-hosted ChatGPT-style UI for shared internal use, organizations with privacy or compliance constraints, and anyone replacing a SaaS chat subscription at team scale. Cost: free, open source MIT per openwebui.com.
AnythingLLM: the RAG and workspace flagship
Strengths: the most mature self-hosted RAG and document-Q-and-A workspace tool in the category, with document ingestion across PDFs, Word, Markdown, websites, GitHub repos, and Confluence; vector-store support across LanceDB, Pinecone, Weaviate, Chroma, and pgvector; multi-user workspaces with per-workspace knowledge bases; LLM backend agnostic (Ollama, LocalAI, LM Studio, OpenAI-compatible, or hosted APIs). Desktop app for solo users plus Docker for team deployment. Weaknesses: cloud-hosted paid tier exists alongside the free desktop and Docker editions (read the licensing carefully if commercializing), heavier resource footprint than a pure chat UI. Best for: company-knowledge chatbots, internal documentation Q&A, customer-support knowledge bases, research teams ingesting paper sets, and anyone whose primary local-AI use case is "chat with my documents" rather than open chat. Cost: free desktop and Docker (MIT), paid cloud tier per anythingllm.com.
Msty: the Mac-native power-user pick
Strengths: the most polished single-app experience on Apple Silicon, with split-chat and side-by-side model comparison, conversation branching that retains alternate threads, knowledge stacks for per-project RAG, real-time web search, and prompt library. Supports both local models (via embedded llama.cpp) and cloud models via bring-your-own-key. Aurum tier unlocks deeper features like advanced branching and additional integrations. Weaknesses: closed source (same caveat as LM Studio), single-user, freemium model means some features sit behind the Aurum paywall. Best for: Mac power-users on M-series silicon with 32GB+ unified memory who want a polished single-app for serious daily local-LLM work, prompt engineers comparing models, and researchers who want branching conversation trees natively. Cost: free + $69 lifetime Aurum tier per msty.app.
Known limitations across the field
No tool on this list is failure-free. Limitations are largely shared (model-quality ceiling, hardware capex, AMD GPU support gaps) and a few are tool-specific. None of these are deal-breakers; all of them are inputs to the procurement-diligence checklist for a serious local-AI deployment.
- Frontier-model ceiling. No consumer hardware in May 2026 runs GPT-4-class or Claude-3.7-Sonnet-class models at competitive quality. The local-LLM frontier (Llama 3.3 70B, Qwen 2.5 72B, DeepSeek 67B, Mixtral 8x22B) is impressive and pragmatic for many workloads but lags the closed frontier on hard reasoning, long-context fidelity, and tool use. Right-size the expectation: local AI replaces 70-90% of cloud-LLM workloads cleanly, not all of them.
- AMD ROCm support trails NVIDIA materially. Every tool on this list supports NVIDIA CUDA as a first-class target; AMD ROCm support ranges from partial (Ollama, LM Studio) to functional but rougher than NVIDIA (LocalAI). Buyers picking an AMD GPU specifically for local AI should validate ROCm posture against their target tool before purchase.
- Quantization confusion. Model files come in Q2, Q3, Q4_K_S, Q4_K_M, Q5_K_M, Q6_K, Q8, and full-precision variants. The catalog is intimidating for newcomers. Default to Q4_K_M for the speed-quality balance on consumer hardware; move to Q5_K_M or Q6_K only if quality matters more than throughput and the hardware can fit it.
- Long-context throughput collapse. Marketing numbers (tokens-per-second) reflect short-context generation. Push context to 32K or 128K tokens and throughput drops sharply on every consumer architecture. Plan accordingly for document-Q-and-A workloads.
- Maintenance overhead. Self-hosted means self-maintained. Model updates, security patches, GPU driver updates, and CUDA-version upgrades land on the operator's plate. Cloud LLM APIs hide all of this; account for the engineering time in any cost comparison.
The privacy-sovereignty axis: why this category exists at all
Cloud LLM providers can offer no-training-on-input clauses, zero-retention policies, BAAs for healthcare data, SOC 2 reports, and tenant-isolated deployments. None of those eliminate the basic fact that the prompt and the response transit the provider's infrastructure and sit (briefly or persistently) on the provider's storage. A local LLM running on hardware the user owns with the network disabled cannot leak. There is no transmission surface, no provider-side log, no contract clause to interpret. For three buyer archetypes this is irreducible:
- Regulated data handlers. Attorneys (work product, privileged communications), healthcare practitioners (PHI), government contractors (classified or CUI), and trade-secret owners often face contractual or statutory restrictions that no SaaS contract clause fully satisfies. A local-only inference path collapses the entire compliance surface to "is the network disabled?" which is an architectural question, not a contractual one.
- Privacy-paranoid individuals. Some users will not run inference against their personal writing, conversations, or research through a third-party service regardless of policy language. Local AI is the only category that respects this preference architecturally.
- Cost-shaped privacy. Some users discover privacy second, after a cloud-spend bill arrives. Local inference happens to be cheaper at scale and happens to be private; both attributes survive the buying decision.