Recommendations

This page provides best practices and recommendations for different cluster sizes and use cases with hetzner-k3s.

Small to Medium Clusters (1-50 nodes)

The default configuration works well for small to medium-sized clusters, providing a simple, reliable setup with minimal configuration required.

Key Considerations

  • Private Network: Enabled by default for better security
  • CNI: Flannel for simplicity or Cilium for advanced features
  • Storage: hcloud-volumes for persistence
  • Load Balancers: Hetzner Load Balancers for production workloads
  • High Availability: 3 master nodes for production clusters

Example configuration:
hetzner_token: <your token>
cluster_name: my-cluster
kubeconfig_path: "./kubeconfig"
k3s_version: v1.32.0+k3s1

networking:
  ssh:
    port: 22
    use_agent: false
    public_key_path: "~/.ssh/id_ed25519.pub"
    private_key_path: "~/.ssh/id_ed25519"
  allowed_networks:
    ssh:
      - 0.0.0.0/0
    api:
      - 10.0.0.0/16  # Restrict to private network
  public_network:
    ipv4: true
    ipv6: true
  private_network:
    enabled: true
    subnet: 10.0.0.0/16
  cni:
    enabled: true
    encryption: false
    mode: flannel

masters_pool:
  instance_type: cpx21
  instance_count: 3  # For HA
  locations:
    - nbg1

worker_node_pools:
- name: workers
  instance_type: cpx31
  instance_count: 3
  location: nbg1
  autoscaling:
    enabled: true
    min_instances: 1
    max_instances: 5

protect_against_deletion: true
create_load_balancer_for_the_kubernetes_api: true

Large Clusters (50+ nodes)

For larger clusters, the default setup has some limitations that need to be addressed.

Limitations of Default Setup

Hetzner's private networks, which hetzner-k3s uses by default, support a maximum of 100 nodes. If your cluster is going to grow beyond that, you need to disable the private network in your configuration.

Large Cluster Architecture (Since v2.2.8)

Support for large clusters has significantly improved since version 2.2.8. The main changes include:

  1. Custom Firewall: Instead of using Hetzner's firewall (which is slow to update), a custom firewall solution was implemented
  2. IP Query Server: A simple container that checks the Hetzner API every 30 seconds to get the list of all node IPs
  3. Automatic Updates: Firewall rules are automatically updated without manual intervention

Setting Up Large Clusters

Step 1: Set Up IP Query Server

The IP query server runs as a simple container. You can easily set it up on any Docker-enabled server using the docker-compose.yml file in the ip-query-server folder of this repository.

# docker-compose.yml
version: '3.8'
services:
  ip-query-server:
    build: ./ip-query-server
    ports:
      - "8080:80"  # expose the IP list on host port 8080
    environment:
      - HETZNER_TOKEN=your_token_here  # token used to query the Hetzner API for node IPs
  caddy:
    image: caddy:2  # reverse proxy that terminates TLS with Let's Encrypt
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile
    depends_on:
      - ip-query-server

Replace example.com in the Caddyfile with your actual domain name and mail@example.com with your email address for Let's Encrypt certificates.

Step 2: Update Cluster Configuration

hetzner_token: <your token>
cluster_name: large-cluster
kubeconfig_path: "./kubeconfig"
k3s_version: v1.32.0+k3s1

networking:
  ssh:
    port: 22
    use_agent: true  # Recommended for large clusters
    public_key_path: "~/.ssh/id_ed25519.pub"
    private_key_path: "~/.ssh/id_ed25519"
  allowed_networks:
    ssh:
      - 0.0.0.0/0  # Required for public network access
    api:
      - 0.0.0.0/0  # Required when private network is disabled
  public_network:
    ipv4: true
    ipv6: true
    # Use custom IP query server for large clusters
    hetzner_ips_query_server_url: https://ip-query.example.com
    use_local_firewall: true  # Enable custom firewall
  private_network:
    enabled: false  # Disable private network for >100 nodes
  cni:
    enabled: true
    encryption: true  # Enable encryption for public network
    mode: cilium  # Better for large scale deployments

# Larger cluster CIDR ranges
cluster_cidr: 10.244.0.0/15  # Larger range for more pods
service_cidr: 10.96.0.0/16   # Larger range for more services
cluster_dns: 10.96.0.10

datastore:
  mode: etcd  # or external for very large clusters
  # external_datastore_endpoint: postgres://...

masters_pool:
  instance_type: cpx31
  instance_count: 3
  locations:
    - nbg1
    - hel1
    - fsn1

worker_node_pools:
- name: compute
  instance_type: cpx41
  location: nbg1
  autoscaling:
    enabled: true
    min_instances: 5
    max_instances: 50
- name: storage
  instance_type: cpx51
  location: hel1
  autoscaling:
    enabled: true
    min_instances: 3
    max_instances: 20

embedded_registry_mirror:
  enabled: true  # Recommended for large clusters

protect_against_deletion: true
create_load_balancer_for_the_kubernetes_api: true
k3s_upgrade_concurrency: 2  # Can upgrade more nodes simultaneously

Additional Large Cluster Considerations

Network Configuration

  • CIDR Sizing: Use larger cluster and service CIDR ranges to accommodate more pods and services
  • Encryption: Enable CNI encryption when using public networks
  • Firewall: The custom firewall automatically manages allowed IPs without opening ports to the public

High Availability Setup

For production large clusters, consider:

  1. Multiple IP Query Servers: Set up 2-3 instances behind a load balancer for better availability
  2. External Datastore: Use PostgreSQL instead of etcd for better scalability (see the sketch after this list)
  3. Distributed Master Nodes: Place masters in different locations
  4. Multiple Node Pools: Different instance types for different workloads
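Switching to an external datastore is a small configuration change. A minimal sketch, using the datastore options shown in the configuration above; the hostname, credentials, and database name below are placeholders you must replace with your own:

datastore:
  mode: external
  # placeholder endpoint: substitute your own host, credentials, and database
  external_datastore_endpoint: postgres://k3s_user:k3s_password@db.example.com:5432/k3s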

Cluster Sizing Guidelines

Development/Tiny Clusters (< 5 nodes)

masters_pool:
  instance_type: cpx11
  instance_count: 1  # Single master for testing
worker_node_pools:
- name: workers
  instance_type: cpx11
  instance_count: 1

Small Production Clusters (5-20 nodes)

masters_pool:
  instance_type: cpx21
  instance_count: 3  # HA masters
  locations:
    - fsn1
    - hel1
    - nbg1
worker_node_pools:
- name: workers
  instance_type: cpx31
  instance_count: 3
  autoscaling:
    enabled: true
    min_instances: 1
    max_instances: 5

Medium Production Clusters (20-50 nodes)

masters_pool:
  instance_type: cpx31
  instance_count: 3
  locations:
    - fsn1
    - hel1
    - nbg1
worker_node_pools:
- name: web
  instance_type: cpx31
  location: nbg1
  autoscaling:
    enabled: true
    min_instances: 3
    max_instances: 10
- name: backend
  instance_type: cpx41
  location: hel1
  autoscaling:
    enabled: true
    min_instances: 2
    max_instances: 8

Large Production Clusters (50-200+ nodes)

Use the large cluster configuration shown above, with:

  • Multiple node pools for different workloads
  • Custom firewall and IP query server
  • Larger instance types for masters
  • External datastore if needed

Performance Optimization

Embedded Registry Mirror

Since v2.0.0, there is an option to enable the embedded registry mirror in k3s; see the k3s documentation for details. This feature uses Spegel to enable peer-to-peer distribution of container images across cluster nodes.

Benefits:

  • Faster pod startup times
  • Reduced external registry calls
  • Better reliability when external registries are inaccessible
  • Cost savings on egress bandwidth

Configuration:

embedded_registry_mirror:
  enabled: true

Note: Ensure your k3s version supports this feature before enabling.

Storage Selection

Use hcloud-volumes for:

  • Production databases where the application does not already handle replication
  • Persistent application data
  • Content that must survive pod restarts
  • Applications requiring high availability

Use local-path for:

  • High-performance caching (Redis, Memcached)
  • High-performance databases (Postgres, MySQL) where the application already handles replication
  • Temporary file storage
  • Applications that can tolerate data loss
  • Maximum IOPS performance
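In practice the choice comes down to the storageClassName on a PersistentVolumeClaim. A minimal sketch (the claim name is illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data  # illustrative name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: hcloud-volumes  # swap for local-path to use node-local storage
  resources:
    requests:
      storage: 10Gi

Keep in mind that local-path volumes live on a single node's disk, so a pod using one is tied to that node.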

CNI Selection

Flannel

  • Pros: Simple, lightweight, good for small clusters
  • Cons: Limited features, doesn't scale well to very large clusters
  • Best for: Small to medium clusters, simplicity

Cilium

  • Pros: Advanced features, better performance, scales well
  • Cons: More complex setup, higher resource usage
  • Best for: Medium to large clusters, advanced networking needs
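Whichever you pick, the CNI is selected through the networking.cni block of the cluster configuration, as in the examples earlier on this page:

networking:
  cni:
    enabled: true
    mode: cilium      # or flannel for a simpler setup
    encryption: true  # recommended when traffic crosses the public network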

Security Recommendations

Network Security

  1. Restrict SSH and API Access: Use CIDR restrictions in allowed_networks.api and allowed_networks.ssh (see the snippet after this list)
  2. Use Private Networks: When possible, use private networks for cluster communication
  3. Monitor Network Traffic: Implement network policies and monitoring
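A minimal sketch of such restrictions, where 203.0.113.0/24 is a placeholder for your office or VPN range:

networking:
  allowed_networks:
    ssh:
      - 203.0.113.0/24  # placeholder: your office/VPN range
    api:
      - 203.0.113.0/24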

SSH Security

  1. Use SSH Keys: hetzner-k3s configures nodes with SSH keys by default
  2. SSH Agent: Enable use_agent: true for passphrase-protected keys (as shown below)
  3. Key Rotation: Regularly rotate SSH keys if needed
  4. Access Logs: Monitor SSH access logs
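For passphrase-protected keys, the relevant settings look like this:

networking:
  ssh:
    port: 22
    use_agent: true  # the SSH agent supplies the passphrase
    public_key_path: "~/.ssh/id_ed25519.pub"
    private_key_path: "~/.ssh/id_ed25519"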

Cluster Security

  1. RBAC: Implement proper role-based access control
  2. Network Policies: Use Kubernetes network policies (a minimal example follows this list)
  3. Pod Security: Implement pod security standards
  4. Regular Updates: Keep k3s and components updated
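As a starting point for network policies, a default-deny ingress policy per namespace is a common baseline (the namespace name is illustrative). k3s ships an embedded network policy controller, so policies are enforced with Flannel as well as Cilium:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: my-app   # illustrative namespace
spec:
  podSelector: {}     # matches every pod in the namespace
  policyTypes:
    - Ingress         # no ingress rules listed, so all ingress is denied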

Cost Optimization

Instance Selection

  • Right-size Instances: Start smaller and scale up as needed
  • Use Autoscaling: Only pay for what you use

Storage Optimization

  • Clean Up Volumes: Regularly delete unused volumes
  • Use Local Storage: For temporary data where appropriate
  • Monitor Usage: Set up monitoring to identify unused storage

Network Optimization

  • Use Private Networks: Reduce egress costs
  • Optimize Images: Use smaller container images
  • Registry Mirror: Reduce registry egress costs

Monitoring and Observability

Essential Monitoring

  1. Node Resources: CPU, memory, disk usage
  2. Cluster Health: Node readiness, pod status
  3. Network Traffic: Bandwidth usage, connection counts
  4. Storage Performance: I/O operations, latency

Recommended Tools

  • Prometheus + Grafana: For metrics and dashboards
  • Loki: For log aggregation
  • Alertmanager: For alerting
  • Node Exporter: For node metrics