Self-Hosted vs BYOK Cloud for Regulated Teams
Self-hosted wins on absolute data residency. BYOK cloud wins on model quality and ops cost. The honest tradeoff math.
The two paths to "private" AI for a regulated team
If you run a clinic, a law firm, a financial advisory shop, or any team handling regulated data, you've probably been told the same two things. One: AI is going to change how you work. Two: you can't just paste patient records into ChatGPT. Both are true. The interesting question is what you do about it.
There are two real paths. Anything else is marketing.
Path one: self-host an open-weights model. You take Llama 3.3, Mistral, or Qwen, you run it on your own GPU (on-prem or in a VPC you control), and you serve it with something like Ollama, vLLM, or LM Studio. The data never leaves your network. There is no vendor in the data path.
Path two: BYOK cloud with a frontier model. You sign a BAA with a major vendor (Anthropic, OpenAI Enterprise, Azure OpenAI), you bring your own API key, and you use Claude, GPT, or Gemini under zero-retention terms. Your data hits the vendor's model briefly and comes back. Nothing is stored, nothing is trained on, nothing persists.
Both can satisfy HIPAA. Both can satisfy GDPR. The right one for you depends on your threat model, your team, and your tolerance for ops work. Let's get into the actual tradeoffs.
Self-hosted: what you get
The headline benefit of self-hosted is simple and powerful: absolute data residency. Your prompts and responses never touch a third party. Not for processing, not for logging, not for safety screening, not for anything.
That maps cleanly to a few real-world requirements:
- The strictest interpretations of HIPAA. Some compliance teams read the rule as "no PHI leaves the covered entity, period." A BAA doesn't satisfy that read. Self-hosted does.
- GDPR data residency. If your contract requires that EU data stay on EU soil, on infrastructure you control, self-hosted lets you draw that boundary cleanly.
- FedRAMP-aligned thinking. Government and quasi-government workloads with strict supply-chain assumptions tend to land here.
- Defense-in-depth. If a vendor compromise (breach, insider threat, subpoena) is in your threat model, removing the vendor removes the risk.
You also own the stack. The model weights are yours. You can fine-tune. You can run offline. You can run in an air-gapped network. Nobody can deprecate your model out from under you.
Self-hosted: what you give up
Here's where the brochures stop and the engineering starts.
Model quality. Llama 3.3 70B is genuinely impressive. For routine summarization, drafting, structured extraction, and Q&A, it's competitive with mid-tier hosted models. But it is not Claude Opus. It is not GPT-5. The gap shows up in long-document reasoning, complex multi-step instruction following, refusal calibration on edge cases, and anything that needs careful judgment. For a clinician who wants help reasoning over a complex chart, that gap is real.
Ops cost. Running a 70B model is not running a Postgres database. You need:
- GPU hardware. At minimum one A100 80GB, or two smaller cards, assuming a quantized build; full-precision fp16 weights for a 70B model need roughly 140 GB on their own. On-demand A100s run roughly $0.50 to $1.50 per hour. Reserved or owned brings that down significantly, but you eat the capital cost.
- Capacity planning. Concurrent users multiply VRAM needs. A 70B serving five clinicians at once is a different sizing exercise than one user at a time.
- Model updates. New Llama versions ship every few months. Each update is a deploy, a regression test, and a rollback plan.
- SRE on-call. When the inference server crashes at 2am, someone is paged. That someone is you, or someone you pay.
The total carrying cost for a small team is typically 20 to 40 engineering hours per month, plus hardware. If you don't already have an SRE function, you're standing one up. That's the honest math.
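The sizing intuition behind those hardware and capacity numbers can be sketched as simple arithmetic. This is a rough back-of-envelope estimate, not a deployment calculator: it counts weights only, and the 4-bit quantization factor is a common deployment assumption rather than a fixed rule. KV cache, which grows with context length and concurrent users, comes on top.

```python
# Rough VRAM sizing for serving a 70B-parameter model (weights only).
# KV cache adds more per concurrent user, which is why five clinicians
# at once is a different sizing exercise than one user at a time.

def weights_vram_gb(params_billions: float, bytes_per_param: float) -> float:
    """GB of VRAM consumed by model weights alone (1 GB = 1e9 bytes)."""
    return params_billions * bytes_per_param

# fp16 (2 bytes/param): ~140 GB -> needs two 80 GB cards just for weights.
print(weights_vram_gb(70, 2.0))   # -> 140.0

# 4-bit quantized (0.5 bytes/param): ~35 GB -> fits a single A100 80GB
# with headroom left for KV cache.
print(weights_vram_gb(70, 0.5))   # -> 35.0
```

The same function answers the concurrency question: the weights term is fixed, so added users cost you only KV cache, but that cache is what eventually forces the second card.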
BYOK cloud: what you get
BYOK cloud flips the tradeoff. You give up absolute residency. You get four things back.
Frontier model quality. Claude Opus, GPT-5, Gemini Ultra. These are the best models in the world. The gap to the best open-weights model is meaningful for hard tasks, and it's growing, not shrinking. If your work involves complex reasoning, you feel that difference daily.
Zero ops. No GPUs to manage. No model updates to test. No capacity planning. No 2am pages. The vendor handles all of it. Your team focuses on the application, not the inference layer.
Fast model updates. When Anthropic ships a better Claude, you get it within hours of switching a model string. No deploy, no migration. With self-hosted, the same upgrade is a project.
BAA from a major vendor. Anthropic, OpenAI Enterprise, and Azure all sign BAAs and offer zero-retention terms. You inherit the vendor's SOC 2 posture, their physical security, their patch cadence. For most small teams, that's better than what they could build themselves.
This is the architecture Private Claude for Business uses: BYOK Claude under a BAA, zero data retention, no chat history stored anywhere.
BYOK cloud: what you give up
The trust boundary expands. Your data, briefly, is processed by a third party. That's the whole tradeoff in one sentence.
Even with a BAA, even with zero retention, even with strong contractual terms, the prompt and response transit the vendor's infrastructure. They run the model. They are, however briefly, in the data path. If your threat model says "no third party touches this data, ever," BYOK cloud does not satisfy it.
Some specific failure modes worth naming:
- Vendor breach. Rare but not zero. Anthropic, OpenAI, and Azure have strong security postures. Strong is not invulnerable.
- Subpoena during the retention window. Even with ZDR, operational logs typically live for a short window (Anthropic's is 7 days). Within that window, a court order could compel disclosure.
- Contract changes. A vendor can update their terms. You'd get notice, but it's a moving floor.
- Cross-border processing. Some EU contracts forbid any cross-Atlantic processing. Most US-based vendors process in the US by default. Azure and some Anthropic enterprise tiers offer EU-region processing, but you have to actively configure for it.
None of these are blockers for most teams. They are real for some. Know which group you're in. We cover the BAA and ZDR mechanics in detail in zero-retention AI for regulated teams.
The tradeoff matrix
Here's the side-by-side, with rough numbers for a small team scenario (5 to 20 users, moderate volume).
| Dimension | Self-hosted (Llama 3.3 70B) | BYOK cloud (Claude Opus + BAA) |
|---|---|---|
| Model quality | Good. Competitive with mid-tier hosted. | Frontier. Best available. |
| Data residency | Absolute. Data never leaves your network. | Vendor processes briefly under BAA + ZDR. |
| Ops cost (eng hrs/mo) | 20 to 40 hours. | 0 to 2 hours. |
| Monthly hosting cost | $500 to $2,500 (GPU + infra). | $50 to $500 (API usage only). |
| Model update cadence | Quarterly, manual deploy. | Continuous, change a string. |
| BAA | You sign your own (with hosting provider). | Signed with model vendor. |
| Audit-friendly | Excellent. Full local logs. | Good. Vendor SOC 2 + your access logs. |
| Time to deploy | 2 to 8 weeks. | Hours to days. |
The numbers are illustrative, not contractual. Your team, your volume, and your existing infrastructure shift them. The shape of the tradeoff is what matters.
The hybrid path
Nobody talks about this enough. You don't have to pick one architecture for everything.
The most defensible setup for a serious regulated team is a tiered approach based on data sensitivity:
- Tier one (most sensitive): active patient PHI in real-time clinical workflows, attorney-client privileged matter under active litigation, customer financial records during transaction. Use a self-hosted open model. The data does not leave your perimeter.
- Tier two (regulated but lower-stakes): drafting internal policies, training documentation, summarizing publicly available case law, running internal research. Use BYOK cloud under BAA. The model quality is worth the marginal trust boundary expansion.
- Tier three (non-sensitive): marketing copy, internal email drafts, code generation against open-source repos. Use whatever's fastest and cheapest. ChatGPT Pro is fine here.
The hard part isn't the routing. The hard part is the data classification work. Most teams have never formally classified their data. Doing that exercise is independently valuable, regardless of which AI architecture you pick.
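The routing itself really is the easy part. A minimal sketch of the tier-to-backend mapping, where the tier numbers mirror the list above and the backend labels (`self_hosted_llama`, `byok_claude`, `consumer`) are hypothetical identifiers, not real endpoints:

```python
# Sensitivity-tier routing sketch. The hard work is the classification
# that assigns a tier to each request; this mapping is the trivial part.
ROUTING = {
    1: "self_hosted_llama",  # tier one: PHI, privileged matter -> stays on-prem
    2: "byok_claude",        # tier two: regulated, lower-stakes -> BAA + ZDR
    3: "consumer",           # tier three: non-sensitive -> fastest and cheapest
}

def route(tier: int) -> str:
    """Pick a backend by data-sensitivity tier; fail closed on unknowns."""
    if tier not in ROUTING:
        # An unclassified or unknown request defaults to the most
        # restrictive path, never the cheapest one.
        return ROUTING[1]
    return ROUTING[tier]
```

The one design choice worth copying is failing closed: anything your classifier can't place goes to the most restrictive backend, not the most convenient one.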
The honest call
For 95% of small to mid-size regulated teams, BYOK cloud with a BAA is the right answer. The model quality and ops savings outweigh the marginal residency benefit. Self-hosted is correct for true zero-trust threat models, teams with existing GPU infrastructure, or contracts that explicitly forbid third-party processing. If you're not in one of those three buckets, BYOK is the better fit and you should stop torturing yourself about it.
Here's the reasoning in plain terms.
The marginal privacy benefit of self-hosted over BYOK-cloud-with-BAA-and-ZDR is real but small. Both keep your data off training datasets. Both prevent long-term retention. Both satisfy HIPAA's actual statutory requirements. The difference is whether a vendor processes your data in memory for a few seconds.
The cost difference is not small. Self-hosted is $500 to $2,500 a month in hosting plus 20 to 40 engineering hours, plus a meaningful step down in model quality. BYOK cloud is API costs plus near-zero ops, with the best models available.
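Putting the article's own ranges into annual terms makes the gap concrete. The engineering rate ($120/hour, fully loaded) is an illustrative assumption; the hosting, API, and hours figures come from the tradeoff matrix above.

```python
# Annualized cost comparison using the ranges from the tradeoff matrix.
# The $120/hr engineering rate is an illustrative assumption.

def annual_self_hosted(hosting_mo: float, eng_hours_mo: float,
                       eng_rate: float = 120.0) -> float:
    """Self-hosted: monthly hosting plus engineering carrying cost, x12."""
    return (hosting_mo + eng_hours_mo * eng_rate) * 12

def annual_byok(api_mo: float, eng_hours_mo: float = 1,
                eng_rate: float = 120.0) -> float:
    """BYOK cloud: API usage plus near-zero ops, x12."""
    return (api_mo + eng_hours_mo * eng_rate) * 12

print(annual_self_hosted(500, 20))    # low end   -> 34800.0
print(annual_self_hosted(2500, 40))   # high end  -> 87600.0
print(annual_byok(50))                # low end   -> 2040.0
print(annual_byok(500))               # high end  -> 7440.0
```

Even at the cheap end, self-hosted carries roughly a 5x annual premium over the expensive end of BYOK, before counting the quality gap.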
If you have the threat model that justifies that cost, the math works. If you don't, you're paying tens of thousands of dollars a year and degrading your output quality to defend against a risk that your threat model doesn't actually flag. That's how teams end up with worse AI than the consumer using ChatGPT at home.
The teams that do well with self-hosted have one of three things: an existing SRE function, an existing GPU footprint, or a customer who explicitly contracts for it. The teams that do well with BYOK cloud have a clear-eyed read on what their compliance actually requires, signed paperwork that satisfies it, and the discipline to stop arguing with themselves about marginal cases.
Frequently asked questions
What does self-hosted AI actually mean for a regulated team?
It means running an open-weights model (like Llama 3.3, Mistral, or Qwen) on your own GPU infrastructure, on-prem or in your own VPC. No prompt or response ever leaves your network. Tools like Ollama, vLLM, and LM Studio handle the serving layer. You own the box, the model weights, and the data path.
How does BYOK cloud AI differ from self-hosted?
BYOK cloud means you bring your own API key to a hosted frontier model (Claude, GPT, Gemini) and the vendor signs a BAA with zero retention terms. Your data leaves your premises briefly to hit the model, then comes back. The model quality is much higher than open-weights, but a third party is in the data path under contract.
Is Llama 3.3 70B as good as Claude Opus?
No, not at the high end. Llama 3.3 70B is excellent at routine summarization, drafting, and Q&A and is competitive with mid-tier hosted models. It falls behind Claude Opus and GPT-5 on long-document reasoning, complex instruction following, and refusal calibration. For most everyday tasks the gap is small. For your hardest problems, the gap is real.
What does it cost to self-host an open model?
GPU hosting runs roughly $0.50 to $1.50 per hour for an on-demand A100, much cheaper on reserved or owned hardware. A 70B model needs at least one A100 80GB or two smaller cards. Beyond hardware, count engineering time: model updates, scaling, monitoring, and on-call. A small team typically spends 20 to 40 engineering hours a month keeping it running.
Can BYOK cloud meet HIPAA requirements?
Yes, if the vendor signs a BAA and offers zero data retention. Anthropic, OpenAI Enterprise, and Azure OpenAI all support BAAs with ZDR terms. Your prompts and responses are processed in memory and not retained. Many compliance teams accept this. Some, with stricter zero-trust postures, don't. The right answer depends on your threat model, not the regulation.
When is self-hosted the right call?
Self-hosted is the right call when your threat model treats any third-party vendor in the data path as unacceptable, when you already have GPU infrastructure and SRE talent, or when contractual data residency requirements forbid any cross-border processing. For most small to mid-size regulated teams, none of those apply, and BYOK cloud is the better fit.
Can I run a hybrid setup?
Yes, and many teams do. Use a self-hosted open model for the most sensitive class of data (active patient PHI, real-time clinical context). Use BYOK cloud for less-sensitive but still regulated work (drafting policies, training docs, internal research). The data classification work is the hard part; the routing is straightforward.
Does Private Claude support self-hosting?
Private Claude is a BYOK cloud product. We use Anthropic's developer API under a BAA with zero data retention. We don't host open-weights models. If you need a fully self-hosted setup, we can advise on architecture, but the deployment is on your infrastructure.
Private Claude for regulated teams.
BAA available. Zero data retention. Self-serve or deploy in your VPC. Talk to us about your compliance requirements.
Contact sales