homelab
Six-node k3s cluster on Proxmox, fully GitOps-managed with Flux v2, triple-replicated DNS, edge TLS termination with CrowdSec, and a 3-2-1 backup strategy. Built to mirror production patterns — not as a toy, but as the platform I use every day.
6
Nodes
24
vCPUs
72 GB
RAM
450 GB
Storage
how it's built
Five layers, each isolated and independently replaceable.
compute
k3s-server-1
2 vCPU · 6 GB
prx-prod-1
k3s-server-2
2 vCPU · 6 GB
prx-prod-2
k3s-server-3
2 vCPU · 6 GB
prx-prod-3
k3s-worker-1
6 vCPU · 18 GB
prx-prod-1
k3s-worker-2
6 vCPU · 18 GB
prx-prod-2
k3s-worker-3
6 vCPU · 18 GB
prx-prod-3
k3s v1.35.4+k3s1 · Flannel VXLAN · kube-vip (control-plane VIP) · MetalLB 0.15.3 L2 ARP
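The headline totals follow from the node list above. A quick arithmetic check, with the specs copied from that list; the tuple layout and function names are only for this illustration:

```python
# Node specs copied from the list above: (name, vCPUs, RAM in GB, Proxmox host).
NODES = [
    ("k3s-server-1", 2, 6, "prx-prod-1"),
    ("k3s-server-2", 2, 6, "prx-prod-2"),
    ("k3s-server-3", 2, 6, "prx-prod-3"),
    ("k3s-worker-1", 6, 18, "prx-prod-1"),
    ("k3s-worker-2", 6, 18, "prx-prod-2"),
    ("k3s-worker-3", 6, 18, "prx-prod-3"),
]


def totals(nodes):
    """Return (node count, total vCPUs, total RAM in GB)."""
    return len(nodes), sum(n[1] for n in nodes), sum(n[2] for n in nodes)


def per_host(nodes):
    """Return {proxmox_host: (vCPUs, RAM in GB)} summed over the VMs on that host."""
    usage = {}
    for _name, vcpus, ram_gb, host in nodes:
        cpu, ram = usage.get(host, (0, 0))
        usage[host] = (cpu + vcpus, ram + ram_gb)
    return usage


if __name__ == "__main__":
    print(totals(NODES))    # (6, 24, 72)
    print(per_host(NODES))
```

Each Proxmox host carries one server and one worker, so losing a host removes one control-plane member and one worker at the same time.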
authoritative dns
Three-node BIND9 platform running outside Kubernetes — dedicated VMs, Keepalived VRRP failover, and the same GitOps discipline as the cluster.
dns-01
Holds all zone files · Allows AXFR to slaves · Not exposed via VIP
dns-02
Holds VIP · VRRP Master · priority 150
dns-03
Standby · VRRP Backup · priority 100
vrrp failover
A floating VRRP VIP migrates between dns-02 and dns-03 via Keepalived.
Keepalived checks the local BIND9 process every 2 seconds. Three consecutive failures trigger VIP failover to the standby node. DNS stays up — the master (dns-01) is never the serving path.
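The failover rule above (a check every 2 seconds, three consecutive failures) can be modelled as a small counter. This is a sketch of the logic only, not the Keepalived configuration, which the page does not show; the class and function names are hypothetical, and recovery behaviour (rise count, preemption) is deliberately left out.

```python
from dataclasses import dataclass

CHECK_INTERVAL_S = 2  # from the text: health check every 2 seconds
FALL_THRESHOLD = 3    # from the text: three consecutive failures


@dataclass
class VipHolder:
    """Tracks consecutive health-check failures on the node holding the VIP."""

    failures: int = 0
    holds_vip: bool = True

    def observe(self, bind_healthy: bool) -> bool:
        """Record one check result; return True while the VIP is still held."""
        if bind_healthy:
            self.failures = 0  # any success resets the streak
        else:
            self.failures += 1
            if self.failures >= FALL_THRESHOLD:
                self.holds_vip = False  # VIP released to the standby
        return self.holds_vip


def seconds_to_failover(results):
    """Return seconds until the VIP is released, or None if it never is."""
    holder = VipHolder()
    for check_number, healthy in enumerate(results, start=1):
        if not holder.observe(healthy):
            return check_number * CHECK_INTERVAL_S
    return None
```

With these numbers the VIP moves roughly 6 seconds after BIND9 stops responding, and a single successful check in between resets the count.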
TSIG Authentication
Zone transfers are authenticated with a shared HMAC-SHA256 key. The key is deployed manually to all three servers and never stored in the repository.
GitLab CI Validation
Every commit runs named-checkconf, named-checkzone for each zone, and a PTR consistency check. The pipeline blocks on any failure before touching the servers.
GitOps — No Direct Edits
Zone files are never modified on servers directly. All changes go through PR → CI → merge. The git log is the full audit trail of every DNS change.
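The PTR consistency check in the CI stage could look something like the sketch below. The real pipeline script is not shown here, so treat this as an assumption about its shape: it takes already-parsed records as plain dictionaries, and the hostnames and addresses are made up for illustration.

```python
import ipaddress


def ptr_mismatches(a_records, ptr_records):
    """Return sorted (hostname, ip, problem) tuples for A records lacking a matching PTR.

    a_records:   {hostname: ip_address_string}
    ptr_records: {reverse_name: target_hostname}, e.g. "11.0.0.10.in-addr.arpa"
    """
    problems = []
    for host, ip in a_records.items():
        # reverse_pointer turns 10.0.0.11 into 11.0.0.10.in-addr.arpa
        reverse_name = ipaddress.ip_address(ip).reverse_pointer
        target = ptr_records.get(reverse_name)
        if target is None:
            problems.append((host, ip, "missing PTR"))
        elif target.rstrip(".") != host.rstrip("."):
            problems.append((host, ip, f"PTR points to {target}"))
    return sorted(problems)


if __name__ == "__main__":
    # Hypothetical records, not taken from the real zones.
    forward = {
        "dns-01.example.lab.": "10.0.0.11",
        "dns-02.example.lab.": "10.0.0.12",
    }
    reverse = {"11.0.0.10.in-addr.arpa": "dns-01.example.lab."}
    for problem in ptr_mismatches(forward, reverse):
        print(problem)
```

A CI job would exit non-zero whenever the returned list is non-empty, which is what blocks the merge.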
zones served
observability
kube-prometheus-stack with custom dashboards, 55 hand-written alert rules, and ntfy push notifications — everything visible, nothing silent.
116
Dashboard panels
55
Alert rules
50+
Scrape targets
5
ntfy channels
Prometheus ×2
2 replicas with deduplication, 15-day / 18 GB retention, WAL compression, 30-min out-of-order window.
Alertmanager ×3
3-node gossip cluster, 1 per worker, 5-day silence retention, PDB minAvailable 2 for quorum.
Grafana
OIDC login via Authentik, sidecar-discovered dashboards and datasources from all namespaces.
Loki + Alloy
Log aggregation with tsdb schema v13. Alloy DaemonSet ships logs from every pod, labelled by namespace, pod, container, node.
Blackbox Exporter
DNS probes (UDP + TCP), HTTP/2, and ICMP checks — all with 5-second timeouts and results fed into alert rules.
node-exporter + kube-state-metrics
Full host and cluster metrics via kube-prometheus-stack defaults, supplemented by 13 custom scrape jobs.
custom dashboards
Homelab Overview
93 panels · Unified view across all layers: Proxmox node health, k3s cluster state (pods, CPU, memory gauges), Traefik error rate and p99 latency, CrowdSec bans by attack category, DNS master/slave/VIP status, Proxmox VM table with per-VM resource usage, and top-10 pods by CPU and memory.
DNS Monitoring
23 panels · BIND9-specific: VIP owner and master/slave status, query rate and type distribution, latency histogram, cache hit ratio, DNSSEC validation counters, zone transfer tracking, serial drift detection, and per-server CPU/memory/network usage.
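Serial drift detection boils down to comparing each secondary's SOA serial with the master's. A minimal sketch, assuming the serials have already been collected into a dictionary; the function name and the sample serial values are hypothetical, and serial-number wraparound is ignored.

```python
def serial_drift(serials, primary="dns-01"):
    """Return {server: lag} for servers whose SOA serial differs from the primary's.

    A positive lag means the server is behind the primary.
    """
    reference = serials[primary]
    return {
        name: reference - serial
        for name, serial in serials.items()
        if name != primary and serial != reference
    }


if __name__ == "__main__":
    # Hypothetical serials in the common YYYYMMDDNN style.
    observed = {"dns-01": 2024010103, "dns-02": 2024010103, "dns-03": 2024010101}
    print(serial_drift(observed))
```

An empty result means every secondary has caught up; anything else is a candidate for the drift alert.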
55 alert rules · 6 PrometheusRule CRDs
System
13 alerts · CPU, memory, disk, load, NTP drift, fd exhaustion, predict_linear disk fill.
Traefik / Edge
8 alerts · 5xx error rate, p99 latency, CrowdSec health, cert expiry <14 days, unusual request spike.
GitLab & VMs
15 alerts · Per-VM monitoring: OOM kill, read-only filesystem, high memory, metrics endpoint down.
Proxmox
8 alerts · Node/cluster quorum, VM state, CPU/memory/storage thresholds, backup failure, recent restart.
DNS
10 alerts · Server and VIP probes, query latency >0.5s, cache hit ratio <50%, zone serial drift, transfer failures.
Watchdog
1 alert · Always-firing test alert that validates the full pipeline from Prometheus to ntfy on every reconcile.
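The disk-fill alert in the System group relies on predict_linear, which fits a straight line to recent samples and extrapolates it. The sketch below reproduces that idea in plain Python to show the arithmetic. It is a simplification: it extrapolates from the last sample rather than from the rule's evaluation time, and the sample values and the 24-hour horizon are invented for the example rather than taken from the actual rule.

```python
def predict_linear(samples, horizon_s):
    """Least-squares extrapolation of (timestamp_s, value) samples.

    Returns the predicted value horizon_s seconds after the last sample.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must span more than one timestamp")
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + horizon_s) + intercept


def will_fill(free_samples, horizon_s):
    """True if free space is predicted to drop below zero within the horizon."""
    return predict_linear(free_samples, horizon_s) < 0


if __name__ == "__main__":
    # Hypothetical free space in GB, losing 1 GB per hour.
    free_gb = [(0, 10.0), (3600, 9.0), (7200, 8.0)]
    print(predict_linear(free_gb, 4 * 3600))
    print(will_fill(free_gb, 24 * 3600))
```

The advantage over a static threshold is that a disk at 40 % that is filling quickly alerts earlier than a disk at 85 % that has been stable for months.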
notification routing
5 ntfy webhook receivers — routed by severity and category.
Alertmanager groups by alertname, category, and instance. Critical alerts repeat every hour; everything else every 4 hours. Resolved notifications always fire.
homelab-critical
1 h repeat
homelab-alerts
default · 4 h repeat
homelab-dns
DNS category
homelab-proxmox
Proxmox category
homelab-ceph
Ceph category
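The routing described above can be sketched as a single lookup. The real Alertmanager config is not reproduced here, so several things are assumptions: the label names `severity` and `category`, the lowercase category values, and the ordering in which critical severity wins over category.

```python
from datetime import timedelta

# Category-specific receivers, named after the channels listed above.
CATEGORY_RECEIVERS = {
    "dns": "homelab-dns",
    "proxmox": "homelab-proxmox",
    "ceph": "homelab-ceph",
}


def route(labels):
    """Return (receiver, repeat_interval) for an alert's label set."""
    # Assumption: critical alerts are matched first, regardless of category.
    if labels.get("severity") == "critical":
        return "homelab-critical", timedelta(hours=1)
    category = labels.get("category", "").lower()
    # Anything without a dedicated channel lands on the default receiver.
    receiver = CATEGORY_RECEIVERS.get(category, "homelab-alerts")
    return receiver, timedelta(hours=4)


if __name__ == "__main__":
    print(route({"alertname": "DiskFill", "severity": "warning"}))
    print(route({"alertname": "SerialDrift", "severity": "warning", "category": "dns"}))
```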
self-hosted
Everything self-hosted, all behind Authentik SSO.
GitLab EE
Self-hosted VCS, CI/CD pipelines, container registry
Flux v2
GitOps operator — cluster state pulled from Git
Renovate
Automated Helm chart and dependency updates
Authentik
OIDC provider and ForwardAuth relay for all services
Vaultwarden
Bitwarden-compatible self-hosted password manager
Infisical
Secrets management with Kubernetes native injection
Prometheus
Metrics collection, 2 replicas, 15-day retention
Grafana
Dashboards and visualisation, Authentik OIDC login
Loki + Alloy
Log aggregation with DaemonSet log shipping
Uptime Kuma
Service uptime monitoring and public status page
ntfy
Self-hosted push notification server
Code-server
VS Code in the browser, accessible from anywhere
Shlink
Self-hosted URL shortener with analytics
Homepage
Unified dashboard for all running services
Homelab Docs
MkDocs Material site, auto-deployed via GitLab CI
NetBox
IPAM and DCIM — network source of truth
AWX
Self-hosted Ansible controller — run playbooks via web UI, manage inventories, schedule jobs with RBAC
day 0
Before Flux can reconcile anything, the infrastructure has to exist. Terraform provisions VMs, Ansible configures them, AWX keeps ongoing operations scriptable and auditable.
day 0 · provisioning
Terraform + Ansible
VM defined in Terraform → provisioned on Proxmox → OS configured by Ansible → node joins k3s cluster.
day 2 · operations
Flux + GitLab CI
GitOps takes over — every application and configuration change flows through Git from this point on.
Terraform
Provisions all VMs on Proxmox — k3s nodes, DNS servers, Docker hosts, and supporting infrastructure. Manages VM specs, network interfaces, disk allocation, and cloud-init configuration. State stored in GitLab.
Ansible
Handles OS-level configuration across all nodes — package installation, sysctl tuning, user management, SSH hardening, and k3s installation with automated cluster join. Also runs ad-hoc operational tasks outside GitOps scope.
AWX
Self-hosted Ansible controller running in the cluster. Run playbooks via web UI, manage inventories, schedule recurring jobs, enforce RBAC. Playbooks sourced directly from GitLab — no SSH key distribution needed.
deploy flow
golden rule
Nothing reaches the cluster via kubectl apply in production.
Every change goes through Git. The repo is the cluster. The git log is the audit trail.
PR opened in GitLab
Change to a Helm release, ConfigMap, secret, or app manifest.
GitLab CI validates
YAML lint → Helm template render → kube-score policy check. Pipeline blocks on any failure.
PR merged to main
The only path to production. Requires passing CI and code review. No exceptions.
Flux detects the diff
Flux polls Git every 5 min (apps) or 1 h (infrastructure) and queues a reconcile.
Cluster converges
Flux applies the diff. Reloader watches ConfigMap/Secret changes and rolls affected pods automatically.
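The last two steps amount to a reconcile loop: compare what Git declares with what the cluster runs, then apply the difference. The sketch below is a conceptual model of that loop using plain dictionaries. It is not how Flux is implemented, and the resource names and specs are placeholders.

```python
def plan(desired, actual):
    """Compare desired (Git) and actual (cluster) state, keyed by resource name."""
    create = sorted(name for name in desired if name not in actual)
    update = sorted(
        name for name in desired if name in actual and desired[name] != actual[name]
    )
    delete = sorted(name for name in actual if name not in desired)
    return {"create": create, "update": update, "delete": delete}


def reconcile(desired, actual):
    """Return (new_state, changes) after applying the plan to a copy of actual."""
    changes = plan(desired, actual)
    new_state = dict(actual)
    for name in changes["create"] + changes["update"]:
        new_state[name] = desired[name]
    for name in changes["delete"]:
        del new_state[name]
    return new_state, changes


if __name__ == "__main__":
    # Placeholder manifests, not the real repository contents.
    git = {"deploy/shlink": {"replicas": 2}, "deploy/ntfy": {"replicas": 1}}
    cluster = {"deploy/shlink": {"replicas": 1}, "deploy/old-app": {"replicas": 1}}
    state, changes = reconcile(git, cluster)
    print(changes)
    print(plan(git, state))  # second pass finds nothing to do
```

The second pass returning an empty plan is the property that matters: reconciliation is idempotent, so polling every 5 minutes is harmless when nothing has changed.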
rationale
The choices that shaped the current design, and why.
01
k3s ships with embedded etcd and a footprint small enough for 6 VMs. Three control-plane nodes plus kube-vip give the same HA story as a cloud-managed control plane, without kubeadm certificate rotation complexity or the overhead of a separate etcd cluster.
02
Nothing reaches the cluster via kubectl apply in production. Every change goes through PR → GitLab CI (lint/validate/kube-score) → merge → Flux reconcile. The repo is the cluster. The audit trail is the git log.
03
kube-vip provides a stable API server VIP for control-plane HA. MetalLB handles LoadBalancer service IPs via L2 ARP from a local pool. One Traefik pod gets the public-facing IP — everything else is internal. Zero cloud dependency.
04
All stateful workloads use Longhorn with replication factor 2. Prometheus is the exception: it runs at RF=1 because the two Prometheus replicas already hold redundant copies of the same data, so halving the Longhorn I/O on time-series writes is a free win.
05
CrowdSec reads Traefik access logs, applies community detection scenarios (CVE patterns, HTTP bruteforce, scanner signatures), and blocks via the bouncer middleware. Configured fail-open at 2s — a slow CrowdSec never blocks legitimate traffic.
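Fail-open, as used for the CrowdSec bouncer in item 05, means a decision lookup that is slow or broken lets the request through instead of blocking it. A minimal sketch of that behaviour follows. The `check` callable stands in for the bouncer's lookup and is hypothetical, and creating a thread pool per request is only for illustration.

```python
import concurrent.futures
import time

TIMEOUT_S = 2.0  # from the text: fail open at 2 seconds


def allow_request(check, ip, timeout_s=TIMEOUT_S):
    """Return True if the request may proceed.

    check(ip) is a hypothetical lookup that returns True when the IP is banned.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(check, ip)
    try:
        banned = future.result(timeout=timeout_s)
    except Exception:
        # Timeout or lookup error: fail open and let the request through.
        return True
    finally:
        # Do not wait for a slow lookup to finish before answering.
        pool.shutdown(wait=False)
    return not banned


if __name__ == "__main__":
    def slow_lookup(ip):
        time.sleep(0.3)  # simulates an unresponsive decision service
        return True

    print(allow_request(lambda ip: True, "203.0.113.7"))                # banned
    print(allow_request(slow_lookup, "203.0.113.7", timeout_s=0.05))    # fails open
```

The trade-off is deliberate: availability for legitimate users is valued over blocking every malicious request while the decision service is degraded.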
resilience
Tier 1
Longhorn RF=2
Synchronous replication across workers. Zero data loss on any single node failure. Included in the cluster.
Tier 2
TrueNAS NFS
Daily Longhorn snapshots at 02:00 UTC (7-day retention). Weekly full backups on Sundays via ZFS snapshots.
Tier 3
Backblaze B2
Offsite via TrueNAS Cloud Sync. Independent of on-prem hardware. Survives total site loss.
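The 7-day retention on the daily snapshots in Tier 2 is a simple cutoff. A sketch of the pruning decision, assuming snapshots are identified by their UTC creation time; in practice Longhorn and TrueNAS enforce retention themselves, so this only illustrates the rule.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)  # from the text: 7-day retention


def expired(snapshots, now, retention=RETENTION):
    """Return the snapshot timestamps that fall outside the retention window."""
    cutoff = now - retention
    return sorted(ts for ts in snapshots if ts < cutoff)


if __name__ == "__main__":
    # Hypothetical daily snapshots taken at 02:00 UTC.
    now = datetime(2025, 1, 10, 2, 0, tzinfo=timezone.utc)
    daily = [datetime(2025, 1, day, 2, 0, tzinfo=timezone.utc) for day in range(1, 11)]
    for ts in expired(daily, now):
        print("prune", ts.isoformat())
```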
stack
let's work together
I design and operate production-grade infrastructure both here and at scale for global financial institutions. If you need someone who's done this for real, let's talk.