When to self-host your AI inference and when to rent it
The most common decision we walk operators through is whether to self-host their AI inference or pay a hyperscaler. The Canadian press has framed this as a sovereignty story; sovereignty matters, but it is rarely the deciding factor. The deciding factors are cost at your volume, latency for your use case, and the operator skill available on your team.
Cost crosses over at roughly five to ten million tokens per day of inbound prompt traffic, depending on model size. Below that threshold, paying Anthropic or OpenAI is almost certainly cheaper than amortizing a GPU server over its expected useful life. Above it, a workstation-class GPU box bought for ten thousand dollars pays for itself faster than the equivalent monthly API bills accumulate. The breakeven shifts as model prices drop, so we recommend redoing the calculation quarterly.
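The arithmetic is simple enough to keep in a script. A minimal sketch, assuming illustrative numbers throughout; the API rate, hardware price, useful life, and operating overhead below are placeholders to replace with your own quotes:

```python
# Breakeven sketch: self-hosted GPU box vs. pay-per-token API.
# Every price here is an illustrative placeholder; substitute current quotes.

HARDWARE_COST = 10_000.00         # workstation-class GPU box, dollars
USEFUL_LIFE_MONTHS = 36           # expected amortization window
POWER_AND_OPS_PER_MONTH = 250.00  # electricity, rack space, spares (rough)
API_PRICE_PER_MTOK = 3.00         # blended API price per million tokens

def monthly_api_cost(tokens_per_day: float) -> float:
    """API spend for a month at a given daily token volume."""
    return tokens_per_day * 30 / 1_000_000 * API_PRICE_PER_MTOK

def monthly_selfhost_cost() -> float:
    """Amortized hardware plus running costs; flat regardless of volume."""
    return HARDWARE_COST / USEFUL_LIFE_MONTHS + POWER_AND_OPS_PER_MONTH

if __name__ == "__main__":
    for tokens_per_day in (1e6, 5e6, 10e6, 20e6):
        api = monthly_api_cost(tokens_per_day)
        local = monthly_selfhost_cost()
        verdict = "self-host" if local < api else "rent"
        print(f"{tokens_per_day / 1e6:>4.0f}M tok/day: "
              f"API ${api:>6.0f}/mo vs self-host ${local:>6.0f}/mo -> {verdict}")
```

With these particular placeholders the crossover lands between five and ten million tokens per day, the range quoted above; your own quotes will move it, which is why the quarterly re-run matters.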
Latency matters when an interactive user is staring at the screen waiting. Local inference on commodity GPUs typically returns a first token in 150 to 400 milliseconds. Hyperscaler APIs add round-trip network latency on top of their compute, often 600 to 1200 milliseconds before the first token streams. For batch workloads, none of this matters. For chat interfaces and agent loops, it is the difference between a system that feels alive and a system that feels broken.
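Time-to-first-token is cheap to measure yourself rather than trust published figures. A rough sketch, assuming an OpenAI-compatible streaming endpoint; the URL and model name are placeholders for whatever your local server (vLLM and Ollama both expose this API shape) or hosted provider gives you:

```python
import time
import requests  # pip install requests

# Placeholder endpoint and model; point these at your own server.
URL = "http://localhost:8000/v1/chat/completions"
PAYLOAD = {
    "model": "your-model-name",
    "messages": [{"role": "user", "content": "Say hello."}],
    "stream": True,  # stream so we observe the first token, not the full response
}

start = time.perf_counter()
with requests.post(URL, json=PAYLOAD, stream=True, timeout=30) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:  # first non-empty server-sent event = first token on the wire
            ttft_ms = (time.perf_counter() - start) * 1000
            print(f"time to first token: {ttft_ms:.0f} ms")
            break
```

Run it against both your local box and your API provider from the network your users actually sit on; the gap you measure there is the one your chat interface will feel.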
Operator skill is the unglamorous variable that decides whether self-hosting works. Running a GPU node in production requires comfort with Linux, with monitoring stacks, with model quantization, and with debugging memory pressure on a card you cannot just reboot. If your team already operates other infrastructure, this is incremental. If your team has never carried a pager for a Linux box, self-hosting your AI is a bigger commitment than it looks.
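The day-to-day version of that skill is mundane: noticing GPU memory pressure before it becomes an incident. A minimal sketch of the kind of check you would wire into an existing monitoring stack, assuming NVIDIA hardware with nvidia-smi on the PATH; the alert threshold is an arbitrary placeholder:

```python
import subprocess

def gpu_memory_usage() -> list[tuple[int, int]]:
    """Return (used_mib, total_mib) per GPU via nvidia-smi's query interface."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [tuple(int(v) for v in line.split(","))
            for line in out.strip().splitlines()]

# Placeholder threshold; tune to the headroom your models actually need.
ALERT_AT = 0.90

for idx, (used, total) in enumerate(gpu_memory_usage()):
    frac = used / total
    status = "ALERT" if frac >= ALERT_AT else "ok"
    print(f"gpu{idx}: {used}/{total} MiB ({frac:.0%}) {status}")
```

If a script like this looks routine to your team, self-hosting is incremental work; if it looks foreign, that is the gap to price in before buying hardware.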
The Canadian regulatory layer adds a finger on the scale toward self-hosting in two specific cases. First, public-sector bodies and regulated industries (healthcare, legal, financial services) where data-residency obligations make non-Canadian inference legally awkward. Second, organizations subject to PIPA in BC or PIPEDA federally that need to demonstrate granular control over how personal information is processed. Self-hosting satisfies these by construction: the data never leaves hardware you control.
We help operators run this calculation as a paid Infrastructure Planning engagement. The answer is more often hybrid than either extreme: open models hosted locally for the high-volume paths, closed models via API for the workloads that genuinely need their accuracy. The architecture that survives is the one that does not lock you into either choice.
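In code, avoiding that lock-in usually reduces to a thin routing layer that owns the local-versus-hosted choice per workload, so call sites never name a provider. A sketch under assumptions; the backends, workload names, and routing policy here are all illustrative:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Backend:
    name: str
    complete: Callable[[str], str]  # prompt -> completion

# Illustrative stand-ins; in practice these wrap your local inference
# server and a hosted provider's client library.
local = Backend("local-open-model", lambda p: f"[local] {p[:20]}...")
hosted = Backend("hosted-closed-model", lambda p: f"[hosted] {p[:20]}...")

# The routing table is the only place that knows about the hybrid split.
# Placeholder policy: bulk work stays local, accuracy-critical work goes out.
ROUTES = {
    "classification": local,
    "summarization": local,
    "legal_review": hosted,
}

def complete(workload: str, prompt: str) -> str:
    backend = ROUTES.get(workload, local)  # default to the cheap path
    return backend.complete(prompt)

print(complete("classification", "Route this support ticket to a queue"))
print(complete("legal_review", "Assess this indemnification clause"))
```

Because call sites only name the workload, moving a path between local and hosted is a one-line change to the routing table, which is what refusing the lock-in looks like operationally.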