On this page
TL;DR
Self-hosting an LLM for business in 2026 is genuinely viable - the open-weight models are good enough and the runtimes are mature - but the model and the GPU are the cheap, easy part. The expensive part is everything around them: data loss prevention, identity, audit, and ongoing operations. Use Ollama for the fastest path to a working local model, vLLM when you need production throughput, and a chat layer such as AnythingLLM, LibreChat, or Open WebUI on top. Businesses self-host for three reasons: data control (prompts never leave the boundary), data residency (you choose the jurisdiction), and cost at scale (fixed infrastructure beats per-seat pricing above roughly 50-100 daily users). The trap is that a DIY stack delivers the chat but not the governance - DLP, SSO, RBAC, and immutable audit are not included, and building them is a multi-quarter programme. For regulated data or any organisation that has to answer to an auditor, a managed private deployment such as Areebi ships the governance layer integrated, which is why most businesses that pilot DIY end up needing it. Updated 2026-06-10.
Why businesses self-host an LLM
There are exactly three durable reasons a business self-hosts an LLM, and "it is cheaper" is only sometimes one of them. Being honest about which reason applies to you determines whether self-hosting is the right call or an expensive distraction.
Reason 1: Data control. When you self-host, prompts, responses, uploaded documents, and usage logs stay inside infrastructure you control. They never transit a third-party provider, are never retained under someone else's terms, and are never available for model training. For organisations whose realistic daily use involves customer records, financial data, legal matters, or source code, this is not a preference - it is the entire point. The cautionary tale is well known: Samsung banned generative AI tools company-wide in 2023 after engineers pasted internal source code into ChatGPT. Self-hosting removes that entire class of exposure.
Reason 2: Data residency. When you self-host, you choose the country, data centre, and boundary where inference happens. That matters because cross-border transfer of personal data is restricted under the GDPR (Regulation (EU) 2016/679), and the Australian Privacy Principles (APP 8) make organisations accountable for overseas disclosures of personal information. Public AI availability is itself a variable you do not control - Italian regulators temporarily blocked ChatGPT in 2023 on privacy grounds. A self-hosted model inside your own jurisdiction collapses the transfer analysis. See what is data residency for AI.
Reason 3: Cost at scale. Per-seat and per-token public AI pricing scales linearly with usage forever. Self-hosted infrastructure cost is mostly fixed. The crossover is real but later than DIY advocates claim - for organisations above roughly 50 to 100 daily active users, fixed infrastructure plus operations routinely undercuts per-seat enterprise AI subscriptions on a three-year view, which we model in the ChatGPT Enterprise pricing breakdown. Below that threshold, the economics usually favour a subscription, and self-hosting is justified by control and residency, not cost.
The risk side is not hypothetical either. The IBM Cost of a Data Breach Report 2025 found that one in five organisations suffered a breach involving shadow AI, and those breaches cost an average of USD 670,000 more than breaches without it. A sanctioned self-hosted assistant is one of the few remediations employees will actually adopt instead of routing around.
The realistic 2026 self-hosted LLM stack
A self-hosted LLM is not one product - it is a stack with three layers: an inference engine that runs the model, a chat and application layer that users interact with, and a governance layer that makes it safe for a business. The open-source ecosystem covers the first two well and the third barely at all, which is the single most important fact in this guide.
The inference layer loads model weights and serves tokens. The dominant choices are Ollama (developer-friendly, single-node, excellent for getting started and small-team use) and vLLM (high-throughput, production-grade serving with continuous batching, the right choice when you need to serve many concurrent users efficiently). Both run open-weight models such as Llama, Mistral, Qwen, and Gemma.
The chat and application layer sits on top of the inference engine and gives users a ChatGPT-style interface, conversation history, document upload, and often RAG over uploaded files. The leading open-source options are AnythingLLM, LibreChat, and Open WebUI. This is the layer people mean when they say "self-hosted ChatGPT."
The governance layer - DLP, SSO, RBAC, audit, policy - is the layer the open-source stack does not meaningfully provide. The chat layers offer basic multi-user support and sometimes simple roles, but real-time PII and PHI redaction, SAML, immutable per-user audit, and a no-code policy engine are absent. This is not a criticism of the projects - it is outside their scope. It is, however, the gap that turns a successful pilot into a stalled rollout, because the gap is exactly what security, legal, and audit ask about. We compare the full DIY assembly against an integrated platform in Areebi versus DIY open source.
Self-hosted LLM tools compared
The table below compares the five tools you will actually evaluate in 2026, split by the layer each occupies. Note that the inference engines (Ollama, vLLM) and the chat layers (AnythingLLM, LibreChat, Open WebUI) are complementary, not competing - a typical stack pairs one of each.
| Tool | Layer | Best for | Strengths | Governance gaps |
|---|---|---|---|---|
| Ollama | Inference | Fast start, small teams, local-only | Trivial install, broad model library, single-node simplicity | No throughput batching at scale; no governance |
| vLLM | Inference | Production throughput, many concurrent users | Continuous batching, high GPU efficiency, OpenAI-compatible API | Steeper ops; serving only, no UI or governance |
| AnythingLLM | Chat + RAG | Document chat with workspaces | Built-in RAG, workspace concept, multi-provider, desktop or server | Basic roles; no inline DLP, limited audit and SSO |
| LibreChat | Chat | ChatGPT-style multi-model UI | Many providers, plugins, familiar UX, active development | No real-time DLP; audit and enterprise SSO limited |
| Open WebUI | Chat + RAG | Polished self-hosted front end for Ollama | Clean UX, RAG, pairs naturally with Ollama, model management | No inline DLP or policy engine; audit basic |
A common, sensible DIY stack is Ollama or vLLM for inference plus Open WebUI or AnythingLLM for chat. That gets a small team a private, working assistant in a day. What it does not get them is anything they can show an auditor, any inline control over what employees paste in, or a tamper-evident record of who did what. Those gaps are the subject of the operational-cost section, and they are why the chat-layer "governance gaps" column matters more than the strengths column for a regulated business.
Project specifics evolve quickly; verify current capabilities at each project's repository: Ollama, vLLM, AnythingLLM, LibreChat, and Open WebUI.
Get your free AI Risk Score
Take our 2-minute assessment and get a personalised AI governance readiness report with specific recommendations for your organisation.
Start Free AssessmentHardware sizing basics
The first-order driver of self-hosted LLM hardware is the model's parameter count and quantisation, because together they determine how much GPU memory (VRAM) you need to hold the weights. Get this wrong and the model either will not load or runs unusably slowly by spilling to system RAM.
The useful rule of thumb: a model needs roughly 2 bytes of VRAM per parameter at 16-bit precision (FP16), and about 0.5-1 byte per parameter when quantised to 4-bit, plus headroom for the context window (the KV cache) and overhead. So a 7-billion-parameter model needs roughly 14 GB at FP16 or around 4-6 GB at 4-bit; a 70-billion-parameter model needs roughly 140 GB at FP16 or around 40-48 GB quantised. Hugging Face's optimisation documentation covers the memory mechanics in detail.
Translated into realistic 2026 hardware tiers:
- Small team / single workstation (7B-13B class): a single high-memory consumer or workstation GPU, or an Apple Silicon machine with unified memory, runs a quantised 7B-13B model credibly via Ollama or LM Studio. Good for a handful of users or a pilot.
- Department scale (mixed, up to ~34B): a single data-centre GPU with 48-80 GB of VRAM, or two consumer GPUs, serves a small department, especially with vLLM batching concurrent requests.
- Larger / 70B class: 70B-class models at quality typically need one or more data-centre GPUs (multiple high-VRAM cards or aggressive quantisation). This is where capex and power become real line items.
Three sizing factors people forget. Concurrency: VRAM holds the weights once, but each simultaneous user consumes KV-cache memory, so serving 50 concurrent users needs far more headroom than serving one - this is exactly what vLLM's continuous batching optimises. Context length: long context windows enlarge the KV cache substantially, sometimes more than the weights. Throughput versus latency: quantisation and batching trade quality and per-request latency for aggregate throughput, and the right balance depends on whether you are running interactive chat or batch processing. For most businesses, a model-agnostic platform that can route to the right-sized model per task beats over-provisioning one large model for every request - the logic behind a model-agnostic enterprise LLM strategy.
When a managed private deployment beats DIY
The decision is not "self-host or use public AI" - it is "self-host raw or self-host with the governance layer included." Both keep your data private; only one is something you can put in front of an auditor. A managed private deployment such as Areebi runs inside your own infrastructure - the same data-control and residency benefits as DIY - but ships the governance layer the open-source stack lacks.
DIY raw self-hosting is the right answer when: your use case is low-sensitivity, you have a platform team with spare capacity to operate it permanently, you have no external audit or regulatory obligations, and you genuinely need only chat over a model. For a developer team experimenting internally, an Ollama plus Open WebUI stack is excellent.
A managed private deployment wins when any of the following is true, which for most mid-market businesses is the common case:
- You handle regulated or customer data and therefore need inline DLP, not a promise that employees will be careful.
- You answer to an auditor or regulator and need a tamper-evident, per-user audit trail and compliance evidence aligned to SOC 2, HIPAA, or GDPR.
- You do not have a platform team to spare for permanent LLM operations, identity integration, and security hardening.
- You want model freedom across many providers rather than re-engineering each time a better open-weight model ships.
- You need residency or air-gap with governance intact - Areebi deploys via Docker, Kubernetes, VM, fully air-gapped, or local-only via Ollama or LM Studio.
The honest framing for a buyer: count the engineering months to build DLP, SSO and RBAC, immutable audit, a policy engine, and ongoing operations on top of the open-source stack, then compare that - plus the permanent operational burden - against a platform that ships them integrated and runs on your own infrastructure. The model layer is cheap and getting cheaper; the governance layer is where the cost and the liability live, and it is the same governance layer whether you build it or buy it. That comparison is what the on-premise AI chatbot buyer's guide turns into a procurement process.
Next steps
If you are evaluating self-hosting, separate the two decisions explicitly: which model and runtime, and which governance layer. The first is a fast, low-stakes experiment; the second determines whether the deployment ever leaves pilot.
- What is a private LLM? - the four deployment models and the control-versus-effort trade-off.
- What is an enterprise LLM? - the five controls that separate a model endpoint from an enterprise deployment, with a platform checklist.
- On-premise AI chatbot buyer's guide - the requirements checklist, evaluation criteria, vendor questions, and red flags.
- Areebi vs DIY open source - the full cost and capability comparison of building versus buying the governance layer.
- What is LLM security? - the runtime controls a self-hosted model still needs inside the boundary.
To see a governed private deployment running on infrastructure you control, book a demo or review pricing. The fastest way to de-risk the decision is to test the governance layer against your own data and your own auditors' questions before you commit a quarter of engineering to rebuilding it.
External sources
- IBM, Cost of a Data Breach Report 2025: ibm.com/reports/data-breach.
- Regulation (EU) 2016/679 (GDPR): eur-lex.europa.eu/eli/reg/2016/679/oj.
- Office of the Australian Information Commissioner, Australian Privacy Principles: oaic.gov.au/privacy/australian-privacy-principles.
- Hugging Face, LLM inference optimisation: huggingface.co/docs/transformers.
- Ollama project: github.com/ollama/ollama. vLLM project: github.com/vllm-project/vllm.
- AnythingLLM: github.com/Mintplex-Labs/anything-llm. LibreChat: github.com/danny-avila/LibreChat. Open WebUI: github.com/open-webui/open-webui.
Frequently Asked Questions
Is it cheaper to self-host an LLM than to pay for ChatGPT Enterprise?
It depends on scale. Per-seat public AI pricing scales linearly with headcount, while self-hosted infrastructure cost is mostly fixed, so above roughly 50 to 100 daily active users self-hosting can undercut per-seat subscriptions on a three-year view. Below that threshold, a subscription is usually cheaper and self-hosting is justified by data control and residency rather than cost. Crucially, the honest comparison must include the operational and governance costs - DLP, SSO, audit, and permanent operations - not just the GPU and model, because those hidden costs frequently dominate.
What is the best self-hosted ChatGPT alternative for business?
For the chat experience itself, AnythingLLM, LibreChat, and Open WebUI are the leading open-source options, typically paired with Ollama or vLLM for inference. They deliver a familiar ChatGPT-style interface with conversation history and often RAG over your documents. However, none of them provides the governance layer a business needs - real-time DLP, enterprise SSO and RBAC, immutable audit, and a policy engine. For regulated data or any organisation with audit obligations, a managed private deployment that includes the governance layer and runs on your own infrastructure is the more complete answer.
What hardware do I need to self-host an LLM?
It is driven by the model's parameter count and quantisation. As a rule of thumb, a model needs about 2 bytes of VRAM per parameter at 16-bit precision and roughly 0.5 to 1 byte per parameter at 4-bit, plus headroom for the context window and concurrency. A quantised 7B to 13B model runs on a single high-memory workstation GPU or Apple Silicon machine; a 70B-class model typically needs one or more data-centre GPUs with 40 GB or more of VRAM. Concurrency and long context windows enlarge the memory requirement substantially beyond the weights alone.
Is a self-hosted LLM automatically secure and compliant?
No. Self-hosting closes one important attack surface - the external data path to a third-party provider - but it does not deliver security or compliance by itself. Prompt injection, insecure output handling, over-permissive RAG retrieval, absent DLP, and missing audit all survive the move to self-hosted infrastructure. Privacy of hosting and governance of usage are different problems. A self-hosted model still needs inline DLP, SSO and RBAC, an immutable audit trail, and runtime security controls before it is safe and compliant for business use.
What is the difference between Ollama and vLLM?
Both are inference engines that run open-weight models, but they target different needs. Ollama prioritises developer experience and simplicity - it is trivial to install, has a broad model library, and is ideal for getting started, local-only use, and small teams on a single node. vLLM prioritises production throughput, using continuous batching to serve many concurrent users efficiently on GPU hardware, and exposes an OpenAI-compatible API. Use Ollama to start fast and for small-scale use; move to vLLM when you need to serve a department or more with good GPU efficiency.
When should we buy a managed private deployment instead of building one?
Buy when you handle regulated or customer data and need inline DLP, when you answer to an auditor and need a tamper-evident per-user audit trail, when you lack a platform team to operate the stack permanently, when you want model freedom across providers, or when you need residency or air-gap with governance intact. Build raw only when the use case is low-sensitivity, you have spare platform capacity, you have no audit obligations, and you need nothing more than chat over a model. For most mid-market businesses, the governance requirements push the decision toward a managed private deployment that runs on their own infrastructure.
Related Resources
Stay ahead of AI governance
Weekly insights on enterprise AI security, compliance updates, and governance best practices.
Stay ahead of AI governance
Weekly insights on enterprise AI security, compliance updates, and best practices.
About the Author
Areebi Research
The Areebi research team combines hands-on enterprise security work with deep AI governance research. Our analysis is informed by primary sources (NIST, ISO, OECD, federal registers, IAPP) and the operational realities of CISOs running AI programs in regulated industries today.
Ready to govern your AI?
See how Areebi can help your organization adopt AI securely and compliantly.