Self-Hosting Ollama with GPU Passthrough on Proxmox

Overview

I set up GPU passthrough on a Proxmox host so that an LLM-serving VM gets exclusive access to a consumer NVIDIA card. The goal: run Ollama locally for personal use and prototyping without sending prompts to a third-party API.

What worked

The hard parts were the boring parts: enabling IOMMU in the BIOS, identifying the right IOMMU group for the card, blacklisting nouveau, and binding the card to VFIO at boot. Once the VM saw the card cleanly, the Ollama setup itself took an evening.
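Finding the card's IOMMU group is the step that is easiest to script. The sketch below reads the kernel's sysfs tree and lists which PCI addresses share a group; it assumes a Linux host that exposes /sys/kernel/iommu_groups, and the function names are made up for illustration. Everything in the GPU's group has to be handed to the VM together, so a group that also contains unrelated devices is a sign of trouble.

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted PCI addresses it contains."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        # A missing directory usually means the IOMMU is off in firmware or kernel.
        return groups
    for group_dir in base.iterdir():
        devices = group_dir / "devices"
        if group_dir.name.isdigit() and devices.is_dir():
            groups[int(group_dir.name)] = sorted(p.name for p in devices.iterdir())
    return groups


def group_of(pci_address, groups):
    """Return the group number that holds pci_address, or None if it is absent."""
    for number, devices in groups.items():
        if pci_address in devices:
            return number
    return None


def print_groups(groups):
    """Print one line per group, lowest group number first."""
    for number in sorted(groups):
        print(f"IOMMU group {number}: {', '.join(groups[number])}")


if __name__ == "__main__":
    print_groups(iommu_groups())
```

A GPU normally shows up as two functions (video and audio) at the same bus address, and both should land in the same group. An empty result suggests IOMMU is not actually enabled yet.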

What surprised me

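The biggest performance gain came not from model tuning but from setting OLLAMA_KEEP_ALIVE=-1, which keeps the model resident in VRAM indefinitely. By default Ollama unloads an idle model after five minutes, so the next request pays the full reload cost. With the model pinned, first-token latency went from ~3s to under 200ms.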
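One way to check the effect is to time the first streamed token from Ollama's /api/generate endpoint. The sketch below is a minimal version of that measurement, not the method used for the numbers above. It assumes Ollama is listening on its default port 11434 on localhost, and it sets the request's keep_alive field, which is the per-request counterpart of the environment variable. The helper names are hypothetical.

```python
import json
import time
import urllib.request

# Default Ollama address; an assumption, adjust to wherever the VM is reachable.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt, keep_alive=-1, url=OLLAMA_URL):
    """Build a streaming generate request; keep_alive=-1 asks Ollama not to unload the model."""
    payload = {"model": model, "prompt": prompt, "stream": True, "keep_alive": keep_alive}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_token(lines):
    """Return the first non-empty 'response' fragment from a stream of JSON lines."""
    for raw in lines:
        if not raw.strip():
            continue
        chunk = json.loads(raw)
        if chunk.get("response"):
            return chunk["response"]
    return None


def time_to_first_token(model, prompt, keep_alive=-1, url=OLLAMA_URL):
    """Seconds from sending the request until the first generated text arrives."""
    start = time.perf_counter()
    with urllib.request.urlopen(build_request(model, prompt, keep_alive, url)) as resp:
        token = first_token(resp)
    if token is None:
        raise RuntimeError("stream ended without any generated text")
    return time.perf_counter() - start
```

Calling time_to_first_token with a model tag that is already pulled, once right after a restart and once again a few seconds later, should show the cold-load and warm cases side by side.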

Hardware

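Mid-range desktop hardware: Ryzen 9 5950X, 64 GB RAM, RTX 3090 with 24 GB VRAM. The 3090's VRAM is the limiting factor: at Q4 the weights of a 70B model alone are roughly 40 GB, so 70B models do not fit entirely in VRAM and run only with layers offloaded to system RAM. For 8B and 13B models, performance is excellent at higher-precision quantizations.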
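A back-of-the-envelope estimate helps when deciding which model and quantization to pull. The sketch below multiplies parameter count by bits per weight; the Q4_0 and Q8_0 figures follow from the GGUF block layout, while the 2 GB overhead for KV cache and buffers is purely an assumption and grows with context length. Other Q4 variants differ slightly in bits per weight.

```python
# Bits per weight. Q4_0 and Q8_0 store 32 weights plus one 16-bit scale per
# block (4.5 and 8.5 bits per weight); F16 is 16 bits by definition.
BITS_PER_WEIGHT = {"Q4_0": 4.5, "Q8_0": 8.5, "F16": 16.0}


def weights_gb(params_billions, quant):
    """Size of the weights alone, in decimal gigabytes."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8


def fits(params_billions, quant, vram_gb=24.0, overhead_gb=2.0):
    """True if weights plus an assumed overhead fit in VRAM.

    overhead_gb stands in for KV cache and runtime buffers; it is an
    assumption, not a measured value.
    """
    return weights_gb(params_billions, quant) + overhead_gb <= vram_gb


for size in (8, 13, 70):
    for quant in BITS_PER_WEIGHT:
        verdict = "fits" if fits(size, quant) else "does not fit"
        print(f"{size}B {quant}: {weights_gb(size, quant):.1f} GB weights, {verdict} in 24 GB")
```

The sketch only counts weights that must sit in VRAM. It does not model partial offload of layers to system RAM, which trades speed for capacity.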