homelab

Production-grade infrastructure, at home.

Six-node k3s cluster on Proxmox, fully GitOps-managed with Flux v2, triple-replicated DNS, edge TLS termination with CrowdSec, and a 3-2-1 backup strategy. Built to mirror production patterns — not as a toy, but as the platform I use every day.

6 Nodes · 24 vCPUs · 72 GB RAM · 450 GB Storage

how it's built

Architecture

Five layers, each isolated and independently replaceable.

01

Edge

Traefik v3: TLS termination (Cloudflare DNS-01 ACME), routing, rate limiting
CrowdSec: IPS with community threat intelligence, bouncer middleware on Traefik
Authentik: OIDC/SSO provider, ForwardAuth relay for all protected services
02

Compute

Proxmox (3 nodes): Anti-affinity, 1 control-plane + 1 worker per physical host
k3s control planes ×3: Embedded etcd HA, kube-vip stable control-plane VIP
k3s workers ×3: Workloads + Longhorn storage nodes, 18 GB RAM / 6 vCPU each
03

GitOps

Flux v2: Reconciles cluster state from GitLab every 5 min (apps) / 1 h (infra)
GitLab EE: Self-hosted VCS, CI/CD pipelines, container registry
Renovate: Automated Helm chart and image update PRs
04

Persistence

Longhorn 1.11.1: Distributed block storage, RF=2 across workers, 450 GB total
TrueNAS (NFS): Daily snapshots (7-day retention) + weekly ZFS backups
Backblaze B2: Offsite disaster recovery via TrueNAS Cloud Sync
05

Observability

Prometheus: 2 replicas, 15-day retention; Alertmanager 3-node gossip cluster
Grafana + Loki: Dashboards and log aggregation; Alloy DaemonSet on every node
Alerting: Discord + ntfy push notifications for critical events

compute

Cluster topology

Control Plane — HA with embedded etcd

k3s-server-1 · 2 vCPU · 6 GB · prx-prod-1
k3s-server-2 · 2 vCPU · 6 GB · prx-prod-2
k3s-server-3 · 2 vCPU · 6 GB · prx-prod-3

Workers — 150 GB Longhorn data disk each

k3s-worker-1 · 6 vCPU · 18 GB · prx-prod-1
k3s-worker-2 · 6 vCPU · 18 GB · prx-prod-2
k3s-worker-3 · 6 vCPU · 18 GB · prx-prod-3

k3s v1.35.4+k3s1 · Flannel VXLAN · kube-vip (control-plane VIP) · MetalLB 0.15.3 L2 ARP
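The headline figures and the anti-affinity rule can be cross-checked from the topology listing. The sketch below is only an illustration: the `Node` type and the helper names are invented for this example, and the values are copied from the listing above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    role: str    # "server" (control plane) or "worker"
    vcpus: int
    ram_gb: int
    host: str    # Proxmox host the VM runs on


# Values copied from the topology listing above.
NODES = [
    Node("k3s-server-1", "server", 2, 6, "prx-prod-1"),
    Node("k3s-server-2", "server", 2, 6, "prx-prod-2"),
    Node("k3s-server-3", "server", 2, 6, "prx-prod-3"),
    Node("k3s-worker-1", "worker", 6, 18, "prx-prod-1"),
    Node("k3s-worker-2", "worker", 6, 18, "prx-prod-2"),
    Node("k3s-worker-3", "worker", 6, 18, "prx-prod-3"),
]


def totals(nodes):
    """Return (total vCPUs, total RAM in GB)."""
    return sum(n.vcpus for n in nodes), sum(n.ram_gb for n in nodes)


def anti_affinity_ok(nodes):
    """True if every host carries exactly one server and one worker."""
    per_host = Counter((n.host, n.role) for n in nodes)
    hosts = {n.host for n in nodes}
    return all(
        per_host[(h, "server")] == 1 and per_host[(h, "worker")] == 1
        for h in hosts
    )


print(totals(NODES))            # (24, 72)
print(anti_affinity_ok(NODES))  # True
```

Losing any single Proxmox host therefore takes out one control-plane node and one worker, which leaves two of three etcd members running.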

authoritative dns

DNS Infrastructure

Three-node BIND9 platform running outside Kubernetes — dedicated VMs, Keepalived VRRP failover, and the same GitOps discipline as the cluster.

Master — Authoritative Source

dns-01

Holds all zone files · Allows AXFR to slaves · Not exposed via VIP

AXFR · TSIG HMAC-SHA256
Slaves — Serve DNS via VRRP VIP

dns-02

Holds VIP

VRRP Master · priority 150

dns-03

Standby

VRRP Backup · priority 100

vrrp failover

A floating VRRP VIP migrates between dns-02 and dns-03 via Keepalived.

Keepalived checks the local BIND9 process every 2 seconds. Three consecutive failures trigger VIP failover to the standby node. DNS stays up — the master (dns-01) is never the serving path.
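The failover rule can be pictured as a consecutive-failure counter. The following is a simplified model, not Keepalived's implementation: the function name and the `fall` parameter are chosen for this sketch, each list entry stands for one 2-second check on the current VIP holder, and failback to dns-02 is deliberately left out.

```python
def vip_holder(check_results, fall=3):
    """Replay health-check results for dns-02, one entry per 2 s interval.

    Returns the node holding the VIP after each check. The VIP moves to
    dns-03 once `fall` consecutive checks have failed. Failback is not
    modelled.
    """
    holder = "dns-02"
    consecutive_failures = 0
    timeline = []
    for ok in check_results:
        if holder == "dns-02":
            # A single success resets the counter; only a run of failures counts.
            consecutive_failures = 0 if ok else consecutive_failures + 1
            if consecutive_failures >= fall:
                holder = "dns-03"
        timeline.append(holder)
    return timeline


# Two isolated failures are tolerated; the third consecutive one moves the VIP.
print(vip_holder([True, False, False, True, False, False, False, True]))
```

With a 2-second interval and three required failures, detection takes on the order of six seconds in this model.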

TSIG Authentication

Zone transfers are authenticated with a shared HMAC-SHA256 key. The key is deployed manually to all three servers and never stored in the repository.
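The idea behind shared-key authentication can be shown with Python's standard `hmac` module. This is a conceptual sketch only: it does not reproduce the TSIG wire format, which also covers fields such as the key name and a timestamp, and the key below is a made-up placeholder.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; a real TSIG key is random and secret.
SHARED_KEY = b"example-key-not-a-real-secret"


def sign(message: bytes, key: bytes = SHARED_KEY) -> bytes:
    """Return the HMAC-SHA256 tag for a message."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(message: bytes, tag: bytes, key: bytes = SHARED_KEY) -> bool:
    """Constant-time check that the tag matches the message under the key."""
    return hmac.compare_digest(sign(message, key), tag)


tag = sign(b"zone transfer request")
print(verify(b"zone transfer request", tag))                     # True
print(verify(b"tampered request", tag))                          # False
print(verify(b"zone transfer request", tag, key=b"wrong-key"))   # False
```

A server without the key can neither produce a valid tag nor accept a forged one, which is why the key stays out of the repository.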

GitLab CI Validation

Every commit runs named-checkconf, named-checkzone for each zone, and a PTR consistency check. The pipeline blocks on any failure before touching the servers.
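A PTR consistency check could look roughly like the sketch below. The actual pipeline script is not shown in this page, so the function name, the dictionary inputs, and the hostnames are all assumptions; a real check would first parse the zone files into these mappings.

```python
import ipaddress


def missing_ptrs(forward, reverse):
    """Return (hostname, ip) pairs whose PTR record is absent or points elsewhere.

    forward: dict of FQDN -> IP address string (A/AAAA records)
    reverse: dict of reverse name (e.g. "5.0.0.10.in-addr.arpa") -> FQDN
    """
    problems = []
    for host, ip in sorted(forward.items()):
        # reverse_pointer yields the in-addr.arpa / ip6.arpa owner name.
        ptr_name = ipaddress.ip_address(ip).reverse_pointer
        if reverse.get(ptr_name) != host:
            problems.append((host, ip))
    return problems


# Hypothetical records for illustration.
forward = {"a.example.test": "10.0.0.5", "b.example.test": "10.0.0.6"}
reverse = {"5.0.0.10.in-addr.arpa": "a.example.test"}

print(missing_ptrs(forward, reverse))  # [('b.example.test', '10.0.0.6')]
```

A CI job would exit non-zero whenever the returned list is non-empty, which is what blocks the merge.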

GitOps — No Direct Edits

Zone files are never modified on servers directly. All changes go through PR → CI → merge. The git log is the full audit trail of every DNS change.

zones served

rcrg.me (Forward): Public + internal records
cranganu.me (Forward): Public + internal records
k3slab.net (Forward): Cluster-internal hostnames
Infrastructure subnets ×4 (Reverse, PTR): DNS, management, cluster, services
Extended subnets ×4 (Reverse, PTR): Additional internal addressing

observability

Monitoring & Alerting

kube-prometheus-stack with custom dashboards, 55 hand-written alert rules, and ntfy push notifications — everything visible, nothing silent.

116 Dashboard panels · 55 Alert rules · 50+ Scrape targets · 5 ntfy channels

Prometheus ×2

2 replicas with deduplication, 15-day / 18 GB retention, WAL compression, 30-min out-of-order window.

Alertmanager ×3

3-node gossip cluster, 1 per worker, 5-day silence retention, PDB minAvailable 2 for quorum.

Grafana

OIDC login via Authentik, sidecar-discovered dashboards and datasources from all namespaces.

Loki + Alloy

Log aggregation with tsdb schema v13. Alloy DaemonSet ships logs from every pod, labelled by namespace, pod, container, node.

Blackbox Exporter

DNS probes (UDP + TCP), HTTP/2, and ICMP checks — all with 5-second timeouts and results fed into alert rules.

node-exporter + kube-state-metrics

Full host and cluster metrics via kube-prometheus-stack defaults, supplemented by 13 custom additional scrape jobs.

custom dashboards

Homelab Overview

93 panels

Unified view across all layers: Proxmox node health, k3s cluster state (pods, CPU, memory gauges), Traefik error rate and p99 latency, CrowdSec bans by attack category, DNS master/slave/VIP status, Proxmox VM table with per-VM resource usage, and top-10 pods by CPU and memory.

DNS Monitoring

23 panels

BIND9-specific: VIP owner and master/slave status, query rate and type distribution, latency histogram, cache hit ratio, DNSSEC validation counters, zone transfer tracking, serial drift detection, and per-server CPU/memory/network usage.

55 alert rules · 6 PrometheusRule resources

System

13 alerts

CPU, memory, disk, load, NTP drift, fd exhaustion, predict_linear disk fill.

Traefik / Edge

8 alerts

5xx error rate, p99 latency, CrowdSec health, cert expiry <14 days, unusual request spike.

GitLab & VMs

15 alerts

Per-VM monitoring: OOM kill, read-only filesystem, high memory, metrics endpoint down.

Proxmox

8 alerts

Node/cluster quorum, VM state, CPU/memory/storage thresholds, backup failure, recent restart.

DNS

10 alerts

Server and VIP probes, query latency >0.5s, cache hit ratio <50%, zone serial drift, transfer failures.

Watchdog

1 alert

Always-firing test alert that validates the full pipeline from Prometheus to ntfy on every reconcile.
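The disk-fill alert in the System group relies on linear extrapolation. As a rough illustration of that idea, the sketch below fits a least-squares line through recent samples and projects it forward. It is a simplification: Prometheus's `predict_linear` extrapolates from the evaluation time, whereas this version extrapolates from the last sample, and the sample data is invented.

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares extrapolation of (timestamp, value) samples.

    Fits a straight line through the samples and returns the predicted
    value `seconds_ahead` seconds after the last sample.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = cov / var
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept


# Hypothetical free space in GB, sampled hourly, shrinking by 1 GB per hour.
free_gb = [(0, 100.0), (3600, 99.0), (7200, 98.0)]

# Projected free space four hours after the last sample (about 94 GB).
print(round(predict_linear(free_gb, 4 * 3600), 3))
```

An alert built on this would fire when the projected free space drops below zero within the chosen horizon, well before the disk is actually full.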

notification routing

5 ntfy webhook receivers — routed by severity and category.

Alertmanager groups by alertname, category, and instance. Critical alerts repeat every hour; everything else every 4 hours. Resolved notifications always fire.

homelab-critical

1 h repeat

homelab-alerts

default · 4 h repeat

homelab-dns

DNS category

homelab-proxmox

Proxmox category

homelab-ceph

Ceph category
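The routing described above can be summarised as a small decision function. This is a sketch under assumptions, not the actual Alertmanager configuration: it assumes severity is matched before category, that label values are lowercase, and that category channels use the default 4-hour repeat.

```python
def route(alert):
    """Pick an ntfy topic and repeat interval (hours) for an alert dict."""
    labels = alert.get("labels", {})
    # Assumption: critical severity wins over any category match.
    if labels.get("severity") == "critical":
        return "homelab-critical", 1
    category_topics = {
        "dns": "homelab-dns",
        "proxmox": "homelab-proxmox",
        "ceph": "homelab-ceph",
    }
    topic = category_topics.get(labels.get("category"), "homelab-alerts")
    return topic, 4


print(route({"labels": {"severity": "critical", "category": "dns"}}))  # ('homelab-critical', 1)
print(route({"labels": {"severity": "warning", "category": "dns"}}))   # ('homelab-dns', 4)
print(route({"labels": {"severity": "warning"}}))                      # ('homelab-alerts', 4)
```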

self-hosted

Running services

Everything self-hosted, all behind Authentik SSO.

GitOps & VCS

GitLab EE

Self-hosted VCS, CI/CD pipelines, container registry

Flux v2

GitOps operator — cluster state pulled from Git

Renovate

Automated Helm chart and dependency updates

Identity & Secrets

Authentik

OIDC provider and ForwardAuth relay for all services

Vaultwarden

Bitwarden-compatible self-hosted password manager

Infisical

Secrets management with Kubernetes native injection

Observability

Prometheus

Metrics collection, 2 replicas, 15-day retention

Grafana

Dashboards and visualisation, Authentik OIDC login

Loki + Alloy

Log aggregation with DaemonSet log shipping

Uptime Kuma

Service uptime monitoring and public status page

ntfy

Self-hosted push notification server

Applications

Code-server

VS Code in the browser, accessible from anywhere

Shlink

Self-hosted URL shortener with analytics

Homepage

Unified dashboard for all running services

Homelab Docs

MkDocs Material site, auto-deployed via GitLab CI

NetBox

IPAM and DCIM — network source of truth

AWX

Self-hosted Ansible controller — run playbooks via web UI, manage inventories, schedule jobs with RBAC

day 0

Provisioning & Automation

Before Flux can reconcile anything, the infrastructure has to exist. Terraform provisions VMs, Ansible configures them, AWX keeps ongoing operations scriptable and auditable.

day 0 · provisioning

Terraform + Ansible

VM defined in Terraform → provisioned on Proxmox → OS configured by Ansible → node joins k3s cluster.

day 2 · operations

Flux + GitLab CI

GitOps takes over — every application and configuration change flows through Git from this point on.

Terraform

Provisions all VMs on Proxmox — k3s nodes, DNS servers, Docker hosts, and supporting infrastructure. Manages VM specs, network interfaces, disk allocation, and cloud-init configuration. State stored in GitLab.

Ansible

Handles OS-level configuration across all nodes — package installation, sysctl tuning, user management, SSH hardening, and k3s installation with automated cluster join. Also runs ad-hoc operational tasks outside GitOps scope.

AWX

Self-hosted Ansible controller running in the cluster. Run playbooks via web UI, manage inventories, schedule recurring jobs, enforce RBAC. Playbooks sourced directly from GitLab — no SSH key distribution needed.

deploy flow

From PR to running pod

golden rule

Nothing reaches the cluster via kubectl apply in production.

Every change goes through Git. The repo is the cluster. The git log is the audit trail.

1

PR opened in GitLab

Change to a Helm release, ConfigMap, secret, or app manifest.

2

GitLab CI validates

YAML lint → Helm template render → kube-score policy check. Pipeline blocks on any failure.

3

PR merged to main

The only path to production. Requires passing CI and code review. No exceptions.

4

Flux detects the diff

Flux polls Git every 5 min (apps) or 1 h (infrastructure) and queues a reconcile.

5

Cluster converges

Flux applies the diff. Reloader watches ConfigMap/Secret changes and rolls affected pods automatically.
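The "blocks on any failure" behaviour in step 2 amounts to running the checks in order and stopping at the first one that fails. A minimal sketch, with lambdas standing in for the real YAML lint, Helm render, and kube-score jobs (the stage names here are illustrative):

```python
def run_pipeline(stages):
    """Run (name, check) stages in order; stop at the first failure.

    Each check is a zero-argument callable returning True on success.
    Returns (passed, names_of_stages_that_ran).
    """
    ran = []
    for name, check in stages:
        ran.append(name)
        if not check():
            return False, ran
    return True, ran


# Hypothetical run in which the Helm render fails, so kube-score never starts.
stages = [
    ("yaml-lint", lambda: True),
    ("helm-template", lambda: False),
    ("kube-score", lambda: True),
]
print(run_pipeline(stages))  # (False, ['yaml-lint', 'helm-template'])
```

Only a run that returns `True` would leave the PR mergeable, so Flux never sees a change that failed validation.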

rationale

Engineering decisions

The choices that shaped the current design, and why.

01

k3s over full Kubernetes

k3s ships with embedded etcd and a footprint small enough for 6 VMs. Three control-plane nodes plus kube-vip give the same HA story as a cloud-managed control plane, without kubeadm certificate rotation complexity or the overhead of a separate etcd cluster.

02

Flux v2 — GitOps, no exceptions

Nothing reaches the cluster via kubectl apply in production. Every change goes through PR → GitLab CI (lint/validate/kube-score) → merge → Flux reconcile. The repo is the cluster. The audit trail is the git log.

03

MetalLB + kube-vip, no cloud provider

kube-vip provides a stable API server VIP for control-plane HA. MetalLB handles LoadBalancer service IPs via L2 ARP from a local pool. One Traefik pod gets the public-facing IP — everything else is internal. Zero cloud dependency.

04

Longhorn RF=2 with tiered backup

All stateful workloads use Longhorn with replication factor 2. Prometheus is the exception: it runs at RF=1 because the two Prometheus replicas each already hold their own copy of the data, so halving the Longhorn I/O on time-series writes is a free win.
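The capacity cost of replication is simple arithmetic. The sketch below is a back-of-the-envelope estimate only: it ignores Longhorn's reserved space and over-provisioning settings, and the function name is made up for this example.

```python
def usable_gb(raw_gb, replicas):
    """Usable capacity when every volume keeps `replicas` full copies."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return raw_gb / replicas


print(usable_gb(450, 2))  # 225.0
print(usable_gb(450, 1))  # 450.0
```

The same factor applies to writes: at RF=2 every block is written twice, which is the I/O that the RF=1 Prometheus volumes avoid.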

05

CrowdSec at the edge, fail-open

CrowdSec reads Traefik access logs, applies community detection scenarios (CVE patterns, HTTP bruteforce, scanner signatures), and blocks via the bouncer middleware. The bouncer is configured fail-open with a 2-second timeout, so a slow CrowdSec never blocks legitimate traffic.
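Fail-open means that a missing answer is treated as "allow". The sketch below shows the pattern in general terms and is not the bouncer's code: `lookup` is a hypothetical decision call, and the thread pool is just one way to enforce a deadline in Python.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_POOL = ThreadPoolExecutor(max_workers=4)


def allow_request(ip, lookup, timeout_s=2.0):
    """Return True if the request should pass.

    `lookup(ip)` is a hypothetical decision call returning True when the
    IP is banned. If it does not answer within `timeout_s`, or raises,
    the request is allowed (fail-open).
    """
    future = _POOL.submit(lookup, ip)
    try:
        banned = future.result(timeout=timeout_s)
    except FutureTimeout:
        return True   # decision engine too slow: let the request through
    except Exception:
        return True   # decision engine errored: same fail-open behaviour
    return not banned


def slow_lookup(ip):
    """Simulate a stalled decision engine that would have said 'banned'."""
    time.sleep(0.3)
    return True


print(allow_request("203.0.113.7", lambda ip: True))              # False: banned
print(allow_request("203.0.113.7", slow_lookup, timeout_s=0.05))  # True: timed out
```

The trade-off is explicit: during a CrowdSec outage, banned addresses get through, in exchange for the site staying reachable.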

resilience

Backup & disaster recovery

3 copies of data: Longhorn RF=2 + TrueNAS + B2
2 different media: Local disk + NFS network
1 offsite copy: Backblaze B2, geographically separate

Tier 1

Longhorn RF=2

Synchronous replication across workers. Zero data loss on any single node failure. Included in the cluster.

Tier 2

TrueNAS NFS

Daily Longhorn snapshots at 02:00 UTC (7-day retention). Weekly full backups on Sundays via ZFS snapshots.

Tier 3

Backblaze B2

Offsite via TrueNAS Cloud Sync. Independent of on-prem hardware. Survives total site loss.
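The 3-2-1 rule across the three tiers can be expressed as a short check. This is an illustrative sketch: the `Copy` type is invented here, and the "object storage" medium label for B2 is an assumption, since the summary above only names local disk and NFS as media.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    name: str
    medium: str
    offsite: bool


def satisfies_3_2_1(copies):
    """True if there are >= 3 copies, on >= 2 media, with >= 1 offsite."""
    return (
        len(copies) >= 3
        and len({c.medium for c in copies}) >= 2
        and any(c.offsite for c in copies)
    )


COPIES = [
    Copy("Longhorn RF=2", "local disk", offsite=False),
    Copy("TrueNAS", "NFS", offsite=False),
    # Medium label below is an assumption made for this sketch.
    Copy("Backblaze B2", "object storage", offsite=True),
]

print(satisfies_3_2_1(COPIES))      # True
print(satisfies_3_2_1(COPIES[:2]))  # False: no offsite copy, only two copies
```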

stack

Full stack

k3s · Flux v2 · Terraform · Ansible · Proxmox · Traefik v3 · Longhorn · MetalLB · kube-vip · Prometheus · Grafana · Loki · Alloy · Authentik · CrowdSec · BIND9 · Keepalived · GitLab EE · TrueNAS · Backblaze B2 · Cloudflare

let's work together

Want to talk infrastructure?

I design and operate production-grade infrastructure both here and at scale for global financial institutions. If you need someone who's done this for real, let's talk.