Self-Hosting Ollama with GPU Passthrough on Proxmox
Running local LLMs on consumer GPU hardware via a Proxmox VM with PCIe passthrough. IOMMU groups, VFIO binding, and the boring tuning that makes inference actually fast.
Overview
Set up GPU passthrough on a Proxmox host so an LLM-serving VM gets exclusive access to a consumer NVIDIA card. Goal: run Ollama locally for personal use and prototyping without leaking prompts to a third-party API.
What worked
The hard parts were the boring parts: enabling IOMMU in the BIOS, identifying the card's IOMMU group, blacklisting nouveau, and binding the card to VFIO at boot. Once the VM saw the card cleanly, the Ollama setup itself took an evening.
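For the IOMMU group step, a small script run on the Proxmox host can save some squinting at sysfs. This is a minimal sketch, assuming the kernel exposes groups under /sys/kernel/iommu_groups (it only does so when IOMMU is enabled). The function names and the PCI address in the usage note are illustrative, not part of the original setup.

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted PCI addresses it contains."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        # Directory is absent when IOMMU is disabled or not exposed by the kernel.
        return groups
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if group_dir.name.isdigit() and devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(p.name for p in devices_dir.iterdir())
    return groups


def group_of(pci_address, groups):
    """Return (group_number, devices_in_group) for a PCI address, or None if not found."""
    for number, devices in groups.items():
        if pci_address in devices:
            return number, devices
    return None
```

Calling group_of("0000:0b:00.0", iommu_groups()) with the GPU's own address (0000:0b:00.0 is a made-up example) shows everything that shares its group. Ideally that is just the GPU and its HDMI audio function, since every device in a group generally has to be passed through together.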
What surprised me
The biggest performance gain didn't come from any model tuning. It came from setting OLLAMA_KEEP_ALIVE=-1 on the Ollama server so the model stayed resident in VRAM instead of being unloaded after an idle period. First-token latency went from ~3s to under 200ms.
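The same behaviour can be requested per call: Ollama's HTTP API accepts a keep_alive field in the request body, and -1 asks the server to keep the model loaded indefinitely. The sketch below assumes Ollama is listening on its default port 11434 on localhost. The helper names are hypothetical, and any model name you pass is whatever you have pulled locally.

```python
import json
import urllib.request

# Default Ollama endpoint; adjust host/port to match your VM (assumption).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model, prompt, keep_alive=-1):
    """Request body for /api/generate; keep_alive=-1 keeps the model resident."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,          # return a single JSON object instead of a stream
        "keep_alive": keep_alive,
    }


def generate(model, prompt, keep_alive=-1, url=OLLAMA_URL, timeout=300):
    """POST a prompt to Ollama and return the parsed JSON response."""
    body = json.dumps(build_payload(model, prompt, keep_alive)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())
```

Calling generate() twice with the same model and comparing the load_duration field of the two responses (reported in nanoseconds) is one way to confirm the model stayed in VRAM between requests.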
Hardware
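Consumer desktop hardware: Ryzen 9 5950X, 64 GB RAM, RTX 3090 with 24 GB VRAM. The 3090's VRAM is the limiting factor. A 70B model at Q4 quantization needs roughly 40 GB for weights alone, so it does not fit on the card and only runs with some layers offloaded to system RAM. For 8B and 13B models, performance is excellent at higher quantizations.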
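A back-of-the-envelope check helps when deciding which model and quantization to pull. This is a rough sketch and counts weights only. The bits-per-weight figures are those of the GGUF Q4_0 and Q8_0 block formats plus plain FP16, and the 2 GB allowance for KV cache and runtime overhead is an assumption that grows with context length.

```python
# Effective bits per weight, including per-block scales.
BITS_PER_WEIGHT = {"Q4_0": 4.5, "Q8_0": 8.5, "F16": 16.0}


def weights_gb(params_billion, bits_per_weight):
    """Approximate size of the weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, bits_per_weight, vram_gb=24.0, overhead_gb=2.0):
    """True if weights plus an assumed overhead fit entirely in VRAM."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    for size in (8, 13, 70):
        for name, bpw in BITS_PER_WEIGHT.items():
            verdict = "fits" if fits(size, bpw) else "needs CPU offload"
            print(f"{size}B {name}: {weights_gb(size, bpw):.1f} GB -> {verdict}")
```

Anything the estimate marks as needing offload can still run in Ollama with some layers in system RAM, at a noticeable cost in tokens per second.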