Migrating Vaultwarden (Bitwarden) behind WireGuard

Overview

Architected and implemented a distributed microservices system handling peak traffic of 100K+ requests per minute. The platform features event-driven communication, automated scaling, circuit breakers, and comprehensive observability with distributed tracing.

The project replaced a monolithic Rails application that had become a bottleneck for both development velocity and production reliability. The migration was done incrementally over 14 months with zero planned downtime.

Architecture

Service Decomposition

The domain was decomposed into 22 services along bounded context lines: catalogue, inventory, pricing, cart, checkout, order management, fulfilment, payments, notifications, search, reviews, and several supporting infrastructure services. Service boundaries were drawn to minimise cross-service transactions and maximise team autonomy.

Inter-Service Communication

Services communicate through two channels depending on the use case. Synchronous requests use gRPC with Protocol Buffers for type safety and efficient binary serialisation. Asynchronous events flow through Kafka topics. Each service owns its topics and is the sole writer to them, following the log-as-truth pattern.

Data Architecture

Each service owns its own datastore, chosen to fit its access patterns. The catalogue service uses MongoDB for flexible document storage. The order service uses PostgreSQL for strong consistency guarantees. The search service uses Elasticsearch. No service queries another service's database directly.

Infrastructure and Orchestration

All services run on Kubernetes, with Helm charts for deployment configuration. The cluster autoscaler provisions nodes within 90 seconds of a scale-out trigger. Istio provides the service mesh layer, handling mTLS between services, traffic management, and telemetry collection without requiring changes to application code.

Key Challenges

Resilient Inter-Service Communication

In a distributed system, every network call can fail. Every service client implements the circuit breaker pattern using a shared library that tracks error rates per upstream and opens the circuit when the error rate exceeds a threshold. Fallback behaviour (cached responses, degraded UI) is defined at the call site so failures degrade gracefully rather than cascading.
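
The behaviour described above can be sketched as follows. This is a minimal illustration, not the shared library itself: the class name, the sliding window of recent calls, the cooldown, and the half-open trial call are all assumptions made for the example.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens when the recent error rate for one upstream exceeds a threshold."""

    def __init__(self, threshold: float = 0.5, window: int = 10,
                 cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold   # error rate above which the circuit opens
        self.window = window         # number of recent calls considered
        self.cooldown = cooldown     # seconds to stay open before a trial call
        self.clock = clock
        self.results: List[bool] = []          # True = success, False = failure
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                return fallback()    # open: do not touch the upstream
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            value = fn()
        except Exception:
            self._record(False)
            return fallback()        # degrade at the call site
        self._record(True)
        return value

    def _record(self, ok: bool) -> None:
        if self.opened_at is not None:
            # This was the trial call: close on success, re-open on failure.
            if ok:
                self.opened_at = None
                self.results = []
            else:
                self.opened_at = self.clock()
            return
        self.results = (self.results + [ok])[-self.window:]
        failures = self.results.count(False)
        if len(self.results) == self.window and failures / self.window > self.threshold:
            self.opened_at = self.clock()
```

A caller would wrap each upstream request, for example cb.call(fetch_price, cached_price), where both function names are hypothetical. Passing the fallback at the call site keeps the degraded behaviour next to the code that knows what a sensible default is.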

Distributed Transactions

The checkout flow spans the cart, inventory, pricing, payments, and order services. Achieving consistency without a distributed transaction coordinator required redesigning the flow as a saga: a sequence of local transactions, each paired with a compensating transaction for rollback. The saga orchestrator runs as a durable workflow using a state machine persisted to PostgreSQL.
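
The core of an orchestrated saga can be sketched in a few lines. This is an in-memory illustration only: the Step type, the run_saga function, and the step names in the usage note are hypothetical, and the persistence of the state machine to PostgreSQL is deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]      # local transaction in one service
    compensate: Callable[[dict], None]  # undoes that local transaction


def run_saga(steps: List[Step], ctx: dict) -> str:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    done: List[Step] = []
    for step in steps:
        try:
            step.action(ctx)
        except Exception as exc:
            ctx["failed_step"] = step.name
            ctx["error"] = str(exc)
            # Roll back only what actually completed, newest first.
            for finished in reversed(done):
                finished.compensate(ctx)
            return "compensated"
        done.append(step)
    return "completed"
```

With steps such as reserve_inventory, lock_price, and charge_payment, a failure in charge_payment would trigger the compensations for lock_price and then reserve_inventory. A durable version would record each transition before acting on it so that the orchestrator can resume after a crash.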

Data Consistency Under an Eventual Consistency Model

Accepting eventual consistency required explicit handling of the states between "event emitted" and "event processed." The order management service tracks each event it has emitted alongside its processing status, enabling it to retry failed downstream processors and detect when a consumer has fallen behind its expected processing time.
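
The bookkeeping this implies can be sketched as below. The class and field names are hypothetical, timestamps are plain seconds, and an in-memory dictionary stands in for whatever store the order management service actually uses.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EmittedEvent:
    event_id: str
    emitted_at: float
    processed_at: Optional[float] = None  # None until a consumer confirms
    attempts: int = 1


class EventTracker:
    """Tracks emitted events and flags those not processed in time."""

    def __init__(self, expected_seconds: float) -> None:
        self.expected_seconds = expected_seconds
        self.events: Dict[str, EmittedEvent] = {}

    def record_emitted(self, event_id: str, now: float) -> None:
        self.events[event_id] = EmittedEvent(event_id, now)

    def record_processed(self, event_id: str, now: float) -> None:
        self.events[event_id].processed_at = now

    def overdue(self, now: float) -> List[str]:
        """Events still unprocessed after the expected processing time."""
        return [
            e.event_id
            for e in self.events.values()
            if e.processed_at is None
            and now - e.emitted_at > self.expected_seconds
        ]

    def mark_retried(self, event_id: str, now: float) -> None:
        """Restart the clock for an event that was re-sent downstream."""
        event = self.events[event_id]
        event.attempts += 1
        event.emitted_at = now
```

A periodic job could call overdue() and re-send each returned event, then call mark_retried(). A growing overdue list for a single consumer is the signal that it has fallen behind.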

Deployment Orchestration Across 22 Services

Coordinating releases across 22 independently deployable services required a structured promotion process. Services are deployed to a staging environment automatically on merge to main. Production deployments are triggered manually with a canary rollout: the new version receives 5% of traffic for 30 minutes and is then promoted to 100% if error rates stay within baseline. Each service can be rolled back independently.
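
The promotion gate can be expressed as a small decision function. The sketch below is an assumption about how "within baseline" might be evaluated: the function name, the per-interval samples, and the 10% tolerance are all illustrative and do not come from the actual pipeline.

```python
from typing import List, Tuple

CANARY_TRAFFIC_PERCENT = 5   # share of traffic sent to the new version
CANARY_WINDOW_MINUTES = 30   # observation window before promotion


def canary_verdict(samples: List[Tuple[int, int]],
                   baseline_error_rate: float,
                   tolerance: float = 0.1) -> str:
    """Decide whether to promote a canary.

    samples holds (requests, errors) per observation interval in the window.
    Any interval whose error rate exceeds baseline * (1 + tolerance) fails it.
    """
    limit = baseline_error_rate * (1 + tolerance)
    for requests, errors in samples:
        if requests == 0:
            continue  # no traffic in this interval, nothing to judge
        if errors / requests > limit:
            return "rollback"
    return "promote"
```

For example, with a baseline error rate of 0.5%, intervals at 0.4% and 0.5% would promote, while a single interval at 0.9% would roll back.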

Features & Implementation

Event-Driven Architecture with Kafka

The Kafka cluster runs with a replication factor of 3 and a retention period of 7 days, allowing consumers to replay the event log for reprocessing or bootstrapping new services. Consumer groups implement the competing-consumers pattern, so that multiple instances of the same service share the work of processing a topic.

Kubernetes-Native Auto-Scaling

The Horizontal Pod Autoscaler scales each service based on a combination of CPU utilisation and a custom metric: queue depth for Kafka consumer services, active connections for the API gateway. The scaling parameters are tuned per service based on load testing rather than using generic defaults.
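
For reference, the Kubernetes documentation describes the autoscaler's core calculation as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), with the largest proposal winning when several metrics are configured. The sketch below reproduces only that calculation. The function name is hypothetical, the 0.1 tolerance is the documented default rather than a value from this project, and stabilisation windows and min/max replica bounds are ignored.

```python
import math
from typing import List, Tuple


def desired_replicas(current_replicas: int,
                     metrics: List[Tuple[float, float]],
                     tolerance: float = 0.1) -> int:
    """metrics holds (current_value, target_value) pairs, one per metric."""
    proposals = []
    for current, target in metrics:
        ratio = current / target
        if abs(ratio - 1.0) <= tolerance:
            # Close enough to target: this metric proposes no change.
            proposals.append(current_replicas)
        else:
            proposals.append(math.ceil(current_replicas * ratio))
    # With multiple metrics, the highest proposed replica count is used.
    return max(proposals)
```

With 4 replicas, CPU at 90% against a 60% target, and queue depth at 500 against a target of 1000, the CPU metric proposes 6 replicas and the queue metric proposes 2, so the result is 6.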

Distributed Tracing with Jaeger

Every request carries a trace context propagated through gRPC metadata and Kafka message headers. Jaeger collects and indexes spans from all 22 services, making it possible to visualise the complete execution path of any request across service boundaries. Trace sampling is set to 100% for error requests and 1% for successful requests.
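
A sketch of context propagation and the sampling rule follows. It assumes the W3C Trace Context traceparent format (version, trace id, parent span id, flags), which the text above does not confirm, and the function names are hypothetical. Keeping every error trace means the keep/drop decision has to be made once the outcome is known, so the sketch takes the error status as an input.

```python
import secrets


def new_traceparent() -> str:
    """Start a trace: 16-byte trace id, 8-byte span id, sampled flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(parent: str) -> str:
    """Keep the trace id, mint a new span id for the outgoing call."""
    version, trace_id, _parent_span, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def keep_trace(trace_id: str, had_error: bool,
               success_per_10k: int = 100) -> bool:
    """Keep all error traces and 1% (100 per 10,000) of successful ones.

    Bucketing on the trace id makes every service reach the same decision
    for the same trace without coordination.
    """
    if had_error:
        return True
    return int(trace_id, 16) % 10_000 < success_per_10k
```

The same traceparent string could be carried as a gRPC metadata entry or as a Kafka message header, so a consumer would call child_traceparent() on the value it reads before doing its own work.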

Health Monitoring and Alerting

Each service exposes a standardised health endpoint reporting the status of its dependencies. Prometheus scrapes metrics from all services and the infrastructure layer. Grafana dashboards provide service-level overviews and per-endpoint latency breakdowns. PagerDuty receives alerts when any SLO is at risk, with routing rules that page the owning team rather than a central operations team.
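
A standardised health payload might be assembled as shown below. The response shape, the status values, and the function name are assumptions for illustration, since the actual endpoint contract is not described here.

```python
from typing import Callable, Dict


def health_report(checks: Dict[str, Callable[[], bool]]) -> dict:
    """Run each dependency check and summarise the result.

    A check that raises is treated the same as one that returns False.
    """
    dependencies = {}
    for name, check in checks.items():
        try:
            dependencies[name] = "up" if check() else "down"
        except Exception:
            dependencies[name] = "down"
    all_up = all(state == "up" for state in dependencies.values())
    return {
        "status": "ok" if all_up else "degraded",
        "dependencies": dependencies,
    }
```

Each service would register its own checks, for example one that pings its database and one that verifies its Kafka connection, and serve the resulting dictionary as JSON from the health endpoint.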

Outcomes

The migration completed 2 months ahead of the original 16-month estimate. Production results after full cutover:

99.99% uptime across the first 12 months of full operation, with all incidents contained to individual services
10x improvement in peak throughput, from 10K to 100K requests per minute without additional application servers
Deployment frequency increased from bi-weekly to multiple times per day as teams gained the ability to ship independently
Mean time to recovery dropped from 45 minutes to 6 minutes due to isolated blast radius and faster rollback

Lessons Learned

The hardest part of microservices is not the technology but the organisational alignment required to make independent teams truly independent. Conway's Law is real: the system architecture reflects the communication structure of the teams that build it. Getting the team boundaries right before the service boundaries was the prerequisite for everything else.

The second lesson was about observability-driven development. Building the tracing and metrics infrastructure in the first sprint, before any business logic, meant that every subsequent service was observable from day one. This investment paid back many times over during the debugging of distributed issues that would have been nearly impossible to diagnose with logs alone.