    Self-Hosting Ollama with GPU Passthrough on Proxmox

    Posted September 14, 2025 · Updated April 22, 2026

    Running local LLMs on consumer GPU hardware via a Proxmox VM with PCIe passthrough. IOMMU groups, VFIO binding, and the boring tuning that makes inference actually fast.

    Proxmox, Ollama, NVIDIA, VFIO, Linux
    // Impact: Local Llama 3.1 70B inference at 18 tok/s on consumer hardware

    Overview

    Set up GPU passthrough on a Proxmox host so an LLM-serving VM gets exclusive access to a consumer NVIDIA card. Goal: run Ollama locally for personal use and prototyping without leaking prompts to a third-party API.

    What worked

    The hard parts were the boring parts: enabling IOMMU in the BIOS, identifying the right PCIe group, blacklisting nouveau, and binding the card to VFIO at boot. Once the VM saw the card cleanly, the Ollama setup itself was an evening.
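
    Checking the PCIe group and the VFIO binding comes down to reading two places in sysfs. The sketch below is one way to do that from the Proxmox host, not the exact procedure used here: it lists every IOMMU group under /sys/kernel/iommu_groups and reports which kernel driver each device is bound to. The function names are illustrative.

```python
import os
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group number: sorted PCI addresses} read from sysfs."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        return groups  # IOMMU is off, or the kernel does not expose groups
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if group_dir.name.isdigit() and devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(p.name for p in devices_dir.iterdir())
    return groups


def bound_driver(pci_addr, root="/sys/bus/pci/devices"):
    """Return the name of the driver bound to a PCI device, or None if unbound."""
    link = Path(root) / pci_addr / "driver"
    if not link.is_symlink():
        return None
    # The "driver" entry is a symlink whose last path component is the driver name.
    return os.path.basename(os.readlink(link))


if __name__ == "__main__":
    for number, devices in sorted(iommu_groups().items()):
        for addr in devices:
            print(f"group {number:>3}  {addr}  driver={bound_driver(addr)}")
```

    On a host that is ready for passthrough, the GPU and its audio function should show up in a group with nothing else the host depends on, and both should report vfio-pci as their driver. An empty result usually means IOMMU is still disabled in the BIOS or on the kernel command line.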

    What surprised me

    The biggest performance gain didn't come from model tuning. It came from setting OLLAMA_KEEP_ALIVE=-1, which keeps the model resident in VRAM instead of letting Ollama unload it after its default five-minute idle timeout. First-token latency went from ~3 s to under 200 ms.
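
    The same keep-alive behaviour can also be requested per call through Ollama's HTTP API, using the keep_alive field on /api/generate. A minimal sketch, assuming Ollama is listening on its default address (localhost:11434) and that the model tag is llama3.1:70b; both are assumptions rather than details of this setup. A request with no prompt only loads the model.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default listen address


def preload_request(model, keep_alive=-1):
    """Build a /api/generate request that loads `model` and sets its keep-alive.

    With no prompt in the body, Ollama only loads the model; keep_alive=-1
    asks it to keep the model in memory indefinitely.
    """
    body = json.dumps({"model": model, "keep_alive": keep_alive, "stream": False})
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def preload(model, keep_alive=-1, timeout=600):
    """Send the preload request and return Ollama's decoded JSON reply."""
    request = preload_request(model, keep_alive)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())
```

    Calling preload("llama3.1:70b") once after the VM boots would warm the model before the first real prompt. The environment variable sets the server-wide default; the request field overrides it for a single model.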

    Hardware

    Mid-range desktop hardware: Ryzen 9 5950X, 64 GB RAM, RTX 3090 with 24 GB VRAM. The 3090's VRAM is the limiting factor: 70B models are only workable at Q4 quantization. For 8B and 13B models, performance is excellent at higher-precision quantizations.
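
    A back-of-the-envelope way to see why VRAM is the constraint: weight memory is roughly parameter count × bits per weight ÷ 8. The sketch below uses nominal bit widths (4 for Q4, 8 for Q8), which is an assumption, since real quantized files use slightly more bits per weight. It also ignores the KV cache and runtime overhead. By this estimate, Q4 weights for a 70B model come to about 35 GB, which is more than 24 GB, so part of the model would sit outside VRAM even at Q4. 8B and 13B models fit at 8 bits with room to spare.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate size of the model weights in GB (10**9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion, bits_per_weight, vram_gb=24.0):
    """True if the weights alone are smaller than the available VRAM."""
    return weight_gb(params_billion, bits_per_weight) < vram_gb


if __name__ == "__main__":
    # (parameters in billions, nominal bits per weight)
    for params, bits in [(8, 8), (13, 8), (70, 4)]:
        size = weight_gb(params, bits)
        verdict = "fits" if fits_in_vram(params, bits) else "does not fit"
        print(f"{params}B at {bits}-bit: ~{size:.0f} GB of weights, {verdict} in 24 GB")
```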

    // © 2026 Chisom Onuegbu