Tuesday, September 16, 2025
Minimal K8s Stack for Multi-Tenant SaaS
Deploying many SaaS applications with enterprise reliability does not require heavyweight platforms. The stack below is a lightweight, proven combination that delivers autoscaling, GitOps, TLS/DNS automation, metrics/logs, and backups with minimal operational overhead. It favors defaults that are boring, stable, and easy to reason about.
Overview
Core control plane (ultra-lightweight)
- K3s (HA, embedded etcd) — compact Kubernetes with a small, steady footprint
- Traefik — ingress controller (ships with K3s)
- cert-manager — automatic TLS issuance (prefer DNS-01)
- ExternalDNS — manages DNS records from Ingress/HTTPRoutes
Platform capabilities
- Argo CD — GitOps continuous delivery (declarative, auditable rollouts)
- KEDA + HPA — autoscaling from CPU/memory and event/metrics triggers, including scale-to-zero
- VictoriaMetrics stack — time-series storage, Alertmanager, Grafana dashboards
- Loki (+ Promtail/Vector) — efficient log aggregation
- Velero — cluster & volume backups to S3-compatible storage
- CloudNativePG — PostgreSQL operator: HA, failover, backups, PITR
- Longhorn (optional) — distributed block storage with snapshots/backups
Access
- Tailscale — private admin access to dashboards/SSH (keep pod-to-pod traffic on the CNI)
Why This Architecture Works
Boring by design
- Three K3s servers form an HA control plane with embedded etcd (no external database).
- Defaults are kept wherever viable (Traefik, metrics-server, CoreDNS) to reduce moving parts.
- Each component is mainstream and commonly deployed in production.
Git as source of truth
- Argo CD continuously converges the cluster to what Git declares, enabling reproducible rollouts, easy rollbacks, and drift correction without granting developers cluster access.
Autoscaling that matches reality
- HPA covers CPU/memory bursts.
- KEDA scales on meaningful signals (queue depth, Prometheus queries, cron windows, HTTP rates) and supports scale-to-zero for cost-sensitive services.
Data safety first
- Velero protects cluster objects and volumes.
- CloudNativePG provides HA Postgres with continuous backups and point-in-time recovery to S3-compatible storage.
- Longhorn adds distributed volumes and scheduled S3 backups when block-storage replication is required.
Reference Architecture
Control plane (3× K3s servers)
- Argo CD (GitOps)
- KEDA + metrics-server (autoscaling)
- Traefik (ingress)
- cert-manager + ExternalDNS (TLS/DNS automation)
- VictoriaMetrics stack (vmagent/VictoriaMetrics/Grafana/Alertmanager)
- Loki (cluster logs)
- Velero (S3-compatible backups)
- CloudNativePG (Postgres operator)
Worker fleet (e.g., 15 K3s agents)
- Application workloads (per namespace/app)
- Optional Longhorn data plane
- Tailscale agent for admin access
Networking note: avoid overlay-on-overlay for data plane traffic. Keep Tailscale for operator access; let the Kubernetes CNI carry pod traffic.
Multi-Tenancy & Isolation
Per application (or tenant), create a separate namespace with:
- ResourceQuotas and LimitRanges to prevent noisy-neighbor issues (see the quota sketch below)
- NetworkPolicies to default-deny and allow only what is needed
- Dedicated Ingress/hostnames, separate secrets
- A database cluster per app or per tenant group, managed by CloudNativePG
- Independent HPA/KEDA policies
Minimal default-deny policy:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
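Pair it with a per-namespace quota and container defaults; a sketch with illustrative values, sized per tenant in practice:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "50"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-defaults
spec:
  limits:
    - type: Container
      default: { cpu: "500m", memory: 512Mi }        # applied when a container sets no limits
      defaultRequest: { cpu: "100m", memory: 128Mi } # applied when a container sets no requests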
Deployment Workflow (GitOps)
- CI builds and pushes tagged images.
- A Git repository stores Kubernetes manifests/Helm values.
- Argo CD monitors the repo and syncs changes automatically (or on PR merge).
- Traefik routes traffic to the new pods after readiness passes.
- KEDA/HPA adjust replicas to match demand.
Minimal Argo CD Application:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: saas-platform
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/company/infrastructure
    targetRevision: main
    path: kubernetes/apps
  destination:
    server: https://kubernetes.default.svc
    namespace: apps
  syncPolicy:
    automated: { prune: true, selfHeal: true }
Autoscaling Patterns
Use HPA on CPU/memory for general web APIs; use KEDA for event-driven scaling on queues, schedules, or request-rate targets.
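For the plain CPU case, a minimal HPA sketch (the target utilization and the api-deployment name are illustrative):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-deployment
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average CPU exceeds 70% of requests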
Example KEDA ScaledObject (Prometheus rate):
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-scaler
spec:
  minReplicaCount: 2
  maxReplicaCount: 50
  scaleTargetRef:
    name: api-deployment
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://victoria-metrics.monitoring.svc:8428
        query: sum(rate(http_requests_total[30s]))
        threshold: "100"
Guardrails
- Start with replicas: 3 for stateless web tiers.
- Use PodDisruptionBudgets to keep at least 2 pods available during maintenance/rollouts.
- Add topologySpreadConstraints to distribute replicas across nodes.
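Both guardrails as a sketch, assuming pods labeled app: api:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2   # never evict below 2 pods during voluntary disruptions
  selector:
    matchLabels:
      app: api
And in the pod template:
topologySpreadConstraints:
  - maxSkew: 1                          # replica counts may differ by at most 1 per node
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway   # prefer spread, but do not block scheduling
    labelSelector:
      matchLabels:
        app: api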
Observability & Alerting
Metrics & dashboards
- vmagent scrapes at short intervals; VictoriaMetrics stores long-term data efficiently.
- Grafana provides per-service and per-node dashboards.
- Recording rules pre-compute expensive SLO queries.
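A sketch in Prometheus rule format, which vmalert can evaluate (assuming vmalert is deployed alongside the stack; metric and label names are illustrative):
groups:
  - name: slo
    interval: 30s
    rules:
      - record: service:http_error_ratio:rate5m   # pre-computed 5xx error ratio per service
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m])) by (service)
          / sum(rate(http_requests_total[5m])) by (service)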
Alerting
- Alertmanager routes to Telegram/Slack/PagerDuty.
- Recommended symptom-based alerts: error rate over SLO, latency p95 over threshold, disk space low, certificate nearing expiration, backup job failed.
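For instance, a certificate-expiry rule on cert-manager's exported metric (assuming cert-manager metrics are scraped):
groups:
  - name: certificates
    rules:
      - alert: CertificateExpiringSoon
        expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 86400
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate {{ $labels.name }} expires within 14 days"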
Logs
- Loki stores logs without heavy indexing.
- Promtail or Vector ships container logs; Grafana links metrics ↔ logs for fast triage.
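A minimal Promtail configuration sketch, assuming Loki runs in the monitoring namespace:
server:
  http_listen_port: 9080
positions:
  filename: /run/promtail/positions.yaml   # where Promtail tracks read offsets
clients:
  - url: http://loki.monitoring.svc:3100/loki/api/v1/push
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod                          # discover pods via the Kubernetes API
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace            # keep the namespace as a queryable label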
Backup & Recovery
Cluster/PV — Velero backs up cluster objects and volumes to S3-compatible storage on a schedule (retained according to policy); a sample Schedule appears at the end of this section.
Postgres — CloudNativePG clusters with:
- Synchronous or async replication (multiple replicas)
- Continuous WAL archiving to S3
- Periodic base backups
- Declarative restore jobs for point-in-time recovery
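A sketch of such a cluster; names, sizes, and the bucket path are illustrative:
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: app-db
spec:
  instances: 3   # one primary, two replicas with automated failover
  storage:
    size: 20Gi
  backup:
    retentionPolicy: 30d
    barmanObjectStore:
      destinationPath: s3://backups/app-db   # hypothetical bucket
      s3Credentials:
        accessKeyId:
          name: backup-creds                 # Secret holding S3 credentials
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: backup-creds
          key: ACCESS_SECRET_KEY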
Optional block-storage layer — Longhorn for replicated volumes and automated S3 snapshot backups per volume.
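And the Velero schedule referenced above; a sketch with illustrative cadence and retention:
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly
  namespace: velero
spec:
  schedule: "0 2 * * *"   # 02:00 daily, cron syntax
  template:
    includedNamespaces: ["*"]
    ttl: 720h             # keep backups for 30 days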
Operational Defaults (that prevent 3 AM surprises)
- Replicas: web/API at replicas: 3 minimum; background workers sized by queue throughput.
- Probes: readiness < 2s; liveness conservative to avoid flapping.
- Rollouts: maxSurge: 1, maxUnavailable: 0 for zero-downtime updates.
- Budgets & spread: PDB minAvailable: 2; topology spread across nodes/hosts.
- Resources: set both requests and limits for CPU/memory.
Minimal resource block:
resources:
  requests: { cpu: "100m", memory: "256Mi" }
  limits: { cpu: "1000m", memory: "512Mi" }
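Matching probe defaults, with an illustrative health endpoint and port:
readinessProbe:
  httpGet: { path: /healthz, port: 8080 }
  periodSeconds: 5
  timeoutSeconds: 2    # readiness answers fast or the pod leaves rotation
livenessProbe:
  httpGet: { path: /healthz, port: 8080 }
  initialDelaySeconds: 30
  periodSeconds: 30
  failureThreshold: 5  # restart only after sustained failure, not a blip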
Security Essentials
- NetworkPolicies: default-deny, explicit allow lists.
- Secrets: consider External Secrets Operator, Sealed Secrets, or SOPS for Git-friendly encryption.
- Supply chain: image scanning (e.g., Trivy), admission policies (e.g., Gatekeeper), image signing (e.g., Cosign).
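On top of default-deny, allow only what is needed; this sketch admits ingress solely from the bundled Traefik (K3s runs it in kube-system; the pod label is an assumption about that chart) to pods labeled app: api:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-traefik
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              app.kubernetes.io/name: traefik   # assumed label on K3s's Traefik pods
      ports:
        - port: 8080   # illustrative container port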
Migration Path (incremental, low risk)
- Stand up K3s HA and Traefik with cert-manager/ExternalDNS.
- Add Argo CD and migrate stateless apps via Helm/manifest repos.
- Layer KEDA for event-driven autoscaling.
- Move databases to CloudNativePG with backups/PITR.
- Complete observability with VictoriaMetrics stack and Loki.
- Optionally add Longhorn for replicated PVs.
Flows
- Git push updates desired state → Argo CD syncs → zero-downtime rollout.
- Traffic spikes → HPA/KEDA scale replicas.
- Node failures → workloads reschedule automatically.
- Data incidents → restore with Velero or recover databases to a point in time.
- Incidents → alerts arrive in Telegram/Slack with links to Grafana dashboards.
- Upgrades → K3s binary update + Helm chart bumps in Git.
This stack assembles lightweight components that are widely deployed in production and intentionally conservative in design. The result is predictable reliability with minimal day-to-day fuss: declarative deployments, meaningful autoscaling, automated TLS/DNS, comprehensive telemetry, and robust backups. It fits small servers, scales cleanly, and keeps operations boring—exactly what production infrastructure should be.