Deploy Nebari on Bare Metal with K3s

This how-to guide covers deploying Nebari on bare metal infrastructure using K3s (a lightweight Kubernetes distribution). Choose the approach that best fits your needs:

🚀 Quick Start

Best for: Testing, development, learning

Time: 15-30 minutes

Servers: 1 node

Get Started →

🏭 Production Setup

Best for: Production workloads, HA deployments

Time: 2-3 hours

Servers: 3+ nodes

Get Started →
About This Guide

This replaces the deprecated nebari-slurm project, providing a modern Kubernetes-based approach for bare metal deployments. For cloud deployments, see Deploy on Existing Kubernetes.


Quick Start: Single-Node

Get Nebari running quickly on a single machine for testing, development, or small-scale use.

Prerequisites

System Requirements (click to expand)
  • One bare metal server or VM
  • Ubuntu 20.04+ (or compatible Linux distribution)
  • 8 vCPU / 32 GB RAM minimum
  • 200 GB disk space
  • Root or sudo access

Steps

  1. Install K3s:

    curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--disable traefik --disable servicelb" sh -
    Why These Flags?
    • --disable traefik: Nebari installs its own ingress controller
    • --disable servicelb: MetalLB will provide LoadBalancer services instead
  2. Verify installation:

    sudo k3s kubectl get nodes

    You should see your node in Ready state.

  3. Install MetalLB for LoadBalancer support:

    # Apply MetalLB manifest
    sudo k3s kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.14.8/config/manifests/metallb-native.yaml

    # Wait for MetalLB pods to be ready
    sudo k3s kubectl wait --namespace metallb-system \
    --for=condition=ready pod \
    --selector=app=metallb \
    --timeout=90s
  4. Configure MetalLB IP pool:

    cat <<EOF | sudo k3s kubectl apply -f -
    apiVersion: metallb.io/v1beta1
    kind: IPAddressPool
    metadata:
      name: default-pool
      namespace: metallb-system
    spec:
      addresses:
        - 192.168.1.200-192.168.1.220 # Adjust to your network
    ---
    apiVersion: metallb.io/v1beta1
    kind: L2Advertisement
    metadata:
      name: default
      namespace: metallb-system
    spec:
      ipAddressPools:
        - default-pool
    EOF
    Choosing IP Addresses for Single-Node

    For single-node testing/development, you have two options:

    Option 1: Use the node's own IP (simplest for testing)

    • If your server is 192.168.1.50, use range 192.168.1.50-192.168.1.50

    • MetalLB will assign services to the same IP as your node

    • Good for: Quick testing, local development

    Option 2: Use a separate IP range (more production-like)

    • Select IPs in the same subnet: 192.168.1.200-192.168.1.220

    • IPs must not be assigned to other devices

    • IPs must not be in your DHCP range

    • Good for: Testing ingress routing, simulating production

    Example: If your server is 192.168.1.50, either use:

    • Simple: 192.168.1.50-192.168.1.50 (same as node)

    • Separate: 192.168.1.200-192.168.1.220 (dedicated range)

  5. Export kubeconfig:

    # Copy kubeconfig to standard location
    mkdir -p ~/.kube
    sudo cat /etc/rancher/k3s/k3s.yaml > ~/.kube/k3s-config
    chmod 600 ~/.kube/k3s-config
  6. Label the node (optional but recommended):

    # Get node name and apply labels
    NODE_NAME=$(sudo k3s kubectl get nodes -o jsonpath='{.items[0].metadata.name}')

    # A label key holds a single value per node, so one node cannot carry
    # the general, user, and worker values at once. Label it once and point
    # every node_selector in nebari-config.yaml at that same value.
    sudo k3s kubectl label node $NODE_NAME \
      node-role.nebari.io/group=general
  7. Initialize Nebari:

    nebari init existing \
    --project my-nebari \
    --domain nebari.example.com \
    --auth-provider github
  8. Configure nebari-config.yaml:

    Click to see minimal configuration
    provider: existing
    kubeconfig_path: ~/.kube/k3s-config
    kubernetes_context: default

    local:
      kube_context: default
      node_selectors:
        # Single node: a label key holds one value, so every group
        # points at the same value applied to the node.
        general:
          key: node-role.nebari.io/group
          value: general
        user:
          key: node-role.nebari.io/group
          value: general
        worker:
          key: node-role.nebari.io/group
          value: general
  9. Deploy Nebari:

    nebari deploy -c nebari-config.yaml

Next Steps

What's Next?

✅ Update DNS A record to point to your MetalLB IP
✅ Access Nebari at your configured domain
✅ For production workloads, continue to the Production Deployment section
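To find which IP the DNS record should point at, read the EXTERNAL-IP column of `kubectl get svc -A`. The helper below is an illustrative sketch (the function name and the sample output are assumptions, not Nebari output) that filters that output down to LoadBalancer addresses:

```shell
# Hypothetical helper: print EXTERNAL-IPs of LoadBalancer services from
# `kubectl get svc -A` output (TYPE is column 3, EXTERNAL-IP is column 5).
lb_external_ips() {
  awk 'NR > 1 && $3 == "LoadBalancer" && $5 != "<pending>" { print $5 }' <<<"$1"
}

# Sample output for illustration:
svc_output='NAMESPACE   NAME                TYPE           CLUSTER-IP    EXTERNAL-IP     PORT(S)        AGE
dev         nebari-ingress-main LoadBalancer   10.43.12.34   192.168.1.200   80:30080/TCP   5m
kube-system kube-dns            ClusterIP      10.43.0.10    <none>          53/UDP         5m'
lb_external_ips "$svc_output"   # prints 192.168.1.200
```

In practice you would pipe `kubectl get svc -A` straight into the function; the sample variable just makes the sketch self-contained.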


Production Deployment

Deploy a high-availability Nebari cluster on multiple bare metal servers using automated configuration management.

When to use this:

  • ✅ Production workloads requiring high availability
  • ✅ Multiple servers for resource isolation
  • ✅ Need for automated cluster management
  • ✅ Growing user base requiring scalability

Architecture Overview

A production deployment uses:

K3s
Lightweight Kubernetes
KubeVIP
Virtual IP for HA
MetalLB
LoadBalancer implementation
Ansible
Automation tool

Prerequisites

Minimum 3 servers (recommended 6+ for production):

| Node Type | vCPU | RAM | Disk | Count | Purpose |
| --- | --- | --- | --- | --- | --- |
| Control Plane (Primary) | 8 | 32 GB | 500 GB | 1 | K3s control + Nebari general workloads |
| Control Plane (Secondary) | 4 | 16 GB | 200 GB | 2 | K3s control (HA only) |
| Worker | 8+ | 32+ GB | 200+ GB | 3+ | User sessions, Dask workers |
Primary Control Plane Node

One control plane node should have significantly more resources (8 vCPU / 32 GB RAM minimum) because it will:

  • Run Kubernetes control plane components (API server, scheduler, controller manager)
  • Host Nebari's general workloads (JupyterHub, monitoring, databases)
  • Serve as the primary management node

The other control plane nodes can be smaller (4 vCPU / 16 GB RAM) as they primarily provide high availability for the Kubernetes API.

Network requirements:

  • All servers on same network subnet
  • Static IP addresses for all servers
  • One virtual IP address (for Kubernetes API)
  • IP range for MetalLB (5-20 addresses)

Understanding MetalLB IP Ranges:

MetalLB requires a range of IP addresses to assign to Kubernetes LoadBalancer services (like Nebari's ingress). Your networking setup determines how you configure this:

Scenario: All servers on a single internal network (e.g., 192.168.1.0/24)

Main Router (192.168.1.1)

├── K3s Nodes: 192.168.1.101-106
└── MetalLB Range: 192.168.1.200-220
  • Use IPs from the same subnet as your nodes
  • Ensure IPs are outside DHCP range
  • No additional routing needed
  • Example: metal_lb_ip_range: 192.168.1.200-192.168.1.220
Choosing the Right Approach
  • Testing/Development: a simple internal network, as in the scenario above
  • Production on-premises: a dedicated network interface for service traffic
  • Colocation/Dedicated servers: routed public IPs
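A quick sanity check on the range before committing it to configuration can catch typos. A minimal bash sketch (an assumed helper, not part of nebari-k3s) that verifies an `A-B` range stays inside one /24 and reports how many addresses it spans:

```shell
# Assumed helper: validate a MetalLB layer2 range of the form
# "192.168.1.200-192.168.1.220" and print the number of addresses.
range_size() {
  local start="${1%-*}" end="${1#*-}"
  local s_net="${start%.*}" e_net="${end%.*}"
  if [ "$s_net" != "$e_net" ]; then
    echo "error: range must stay within one /24" >&2
    return 1
  fi
  echo $(( ${end##*.} - ${start##*.} + 1 ))
}

range_size 192.168.1.200-192.168.1.220   # prints 21
```

Twenty-one addresses is comfortably more than the 5-20 the prerequisites call for; a size of 1 (node's own IP) is the single-node shortcut described earlier.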

Step 1: Clone nebari-k3s

git clone https://github.com/nebari-dev/nebari-k3s.git
cd nebari-k3s

Step 2: Create Inventory

Create inventory.yml with your server details:

all:
  vars:
    ansible_user: ubuntu
    ansible_ssh_private_key_file: ~/.ssh/id_rsa

  children:
    master:
      hosts:
        node1:
          ansible_host: 192.168.1.101
        node2:
          ansible_host: 192.168.1.102
        node3:
          ansible_host: 192.168.1.103

    node:
      hosts:
        node4:
          ansible_host: 192.168.1.104
        node5:
          ansible_host: 192.168.1.105
        node6:
          ansible_host: 192.168.1.106

    k3s_cluster:
      children:
        master:
        node:

Step 3: Configure Variables

Create group_vars/all.yaml with minimal required configuration:

---
# K3s version
k3s_version: v1.30.2+k3s2

# SSH user
ansible_user: ubuntu

# Network interface (find with: ip addr show)
flannel_iface: ens192

# KubeVIP Configuration
kube_vip_interface: ens192
apiserver_endpoint: 192.168.1.100 # Virtual IP

# Cluster token (generate with: openssl rand -hex 20)
k3s_token: your-secure-cluster-token

# MetalLB IP range
metal_lb_ip_range: 192.168.1.200-192.168.1.220
💡 Network Interface Reference

Common interface names by platform:

  • VMware: ens192, ens160
  • AWS/Basic: eth0, eth1
  • Dell/HP servers: eno1, eno2

Find your interface:

ip addr show
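When scripting the setup, the interface name can also be derived from the node's IP. A hedged sketch (the `iface_for_ip` helper and the sample line are illustrative assumptions, not part of nebari-k3s) using the one-line `ip -o -4 addr show` format:

```shell
# Sketch: given `ip -o -4 addr show` output and a node IP, print the
# interface that carries that address (field 2 is the interface name,
# field 4 is "ADDR/PREFIX").
iface_for_ip() {
  awk -v ip="$2" 'index($4, ip "/") == 1 { print $2 }' <<<"$1"
}

# Sample line for illustration:
ip_out='2: ens192    inet 192.168.1.101/24 brd 192.168.1.255 scope global ens192'
iface_for_ip "$ip_out" 192.168.1.101   # prints ens192
```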

Step 4: Deploy K3s Cluster

Deployment Time

⏱️ 10-20 minutes depending on network speed and number of nodes

Run the Ansible playbook:

ansible-playbook -i inventory.yml playbook.yaml
What gets installed?

✅ K3s on all nodes (control plane + workers)
✅ KubeVIP for control plane HA
✅ MetalLB for LoadBalancer services
✅ Proper node labels and configurations

Step 5: Sync Kubeconfig

Copy kubeconfig from cluster to your local machine:

export SSH_USER="root"
export SSH_HOST="192.168.1.101" # Any control plane node
export SSH_KEY_FILE="~/.ssh/id_rsa"

make kubeconfig-sync

Verify cluster access:

kubectl get nodes -o wide

Step 6: Label Nodes

Label nodes for Nebari workload scheduling:

# Control plane nodes (general workloads)
kubectl label nodes node1 node2 node3 \
node-role.nebari.io/group=general

# User workload nodes
kubectl label nodes node4 \
node-role.nebari.io/group=user

# Dask worker nodes
kubectl label nodes node5 node6 \
node-role.nebari.io/group=worker

Verify labels:

kubectl get nodes --show-labels
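The raw `--show-labels` output is wide and hard to scan. A small filter can reduce it to node-to-group pairs; this is an illustrative sketch (helper name and sample line assumed, not Nebari tooling):

```shell
# Sketch: print "<node> <nebari group>" for each node from
# `kubectl get nodes --show-labels` output; "(none)" if unlabeled.
nodes_by_group() {
  awk 'NR > 1 {
    grp = "(none)"
    # "node-role.nebari.io/group=" is 26 characters long
    if (match($0, /node-role\.nebari\.io\/group=[^,]+/))
      grp = substr($0, RSTART + 26, RLENGTH - 26)
    print $1, grp
  }' <<<"$1"
}
```

Piping real `kubectl get nodes --show-labels` output through the function should list every node next to its assigned group, making mislabeled nodes obvious before deploying Nebari.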

Step 7: Initialize Nebari

nebari init existing \
--project my-nebari \
--domain nebari.example.com \
--auth-provider github

Step 8: Configure Nebari

Edit nebari-config.yaml:

project_name: my-nebari
provider: existing
domain: nebari.example.com

certificate:
  type: lets-encrypt
  acme_email: admin@example.com

security:
  authentication:
    type: GitHub
    config:
      client_id: <your-github-oauth-client-id>
      client_secret: <your-github-oauth-client-secret>

local:
  kube_context: default
  node_selectors:
    general:
      key: node-role.nebari.io/group
      value: general
    user:
      key: node-role.nebari.io/group
      value: user
    worker:
      key: node-role.nebari.io/group
      value: worker

profiles:
  jupyterlab:
    - display_name: Small Instance
      description: 2 CPU / 8 GB RAM
      default: true
      kubespawner_override:
        cpu_limit: 2
        cpu_guarantee: 1.5
        mem_limit: 8G
        mem_guarantee: 5G

    - display_name: Medium Instance
      description: 4 CPU / 16 GB RAM
      kubespawner_override:
        cpu_limit: 4
        cpu_guarantee: 3
        mem_limit: 16G
        mem_guarantee: 10G

  dask_worker:
    Small Worker:
      worker_cores_limit: 2
      worker_cores: 1.5
      worker_memory_limit: 8G
      worker_memory: 5G
    Medium Worker:
      worker_cores_limit: 4
      worker_cores: 3
      worker_memory_limit: 16G
      worker_memory: 10G

Step 9: Deploy Nebari

nebari deploy -c nebari-config.yaml

Step 10: Verify Deployment

# Check if all pods are running
kubectl get pods -A

All pods should be in Running or Completed state.
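For clusters with many namespaces, filtering the pod list down to the stragglers is faster than eyeballing it. An illustrative sketch (helper name and sample output assumed):

```shell
# Sketch: list pods that are neither Running nor Completed from
# `kubectl get pods -A` output (STATUS is column 4).
pods_not_ready() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $2 }' <<<"$1"
}

# Sample output for illustration:
pods='NAMESPACE NAME        READY STATUS  RESTARTS AGE
dev       hub-abc     1/1   Running 0        5m
dev       jupyter-x   0/1   Pending 0        1m'
pods_not_ready "$pods"   # prints jupyter-x
```

An empty result from `pods_not_ready "$(kubectl get pods -A)"` means the deployment converged.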

🎉 Final Step

Update your DNS A record to point to one of the MetalLB IP addresses, then access Nebari at your configured domain.


Reference

Configuration Variables

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| k3s_version | ✅ Yes | - | K3s version (e.g., v1.30.2+k3s2) |
| ansible_user | ✅ Yes | - | SSH user with passwordless sudo |
| flannel_iface | ✅ Yes | - | Network interface for pod networking |
| kube_vip_interface | ✅ Yes | - | Network interface for virtual IP |
| kube_vip_tag_version | ❌ No | v0.8.2 | KubeVIP container version |
| apiserver_endpoint | ✅ Yes | - | Virtual IP for Kubernetes API |
| k3s_token | ✅ Yes | - | Cluster auth token (alphanumeric) |
| metal_lb_ip_range | ✅ Yes | - | IP range for LoadBalancer services |
| metal_lb_type | ❌ No | native | MetalLB type (native/frr) |
| metal_lb_mode | ❌ No | layer2 | MetalLB mode (layer2/bgp) |

Node Selector Labels

📊 general
Purpose: Core services (JupyterHub, monitoring)
Typical nodes: Control plane nodes
node-role.nebari.io/group=general

👥 user
Purpose: User JupyterLab sessions
Typical nodes: Dedicated user nodes
node-role.nebari.io/group=user

⚙️ worker
Purpose: Dask workers, batch jobs
Typical nodes: High-resource worker nodes
node-role.nebari.io/group=worker


Troubleshooting

Pods Not Scheduling

Symptom

🚨 Pods remain in Pending state

Quick diagnosis:

kubectl describe pod <pod-name> -n <namespace>
Common Causes & Solutions

1. Node labels don't match selectors

# Check actual labels
kubectl get nodes --show-labels

# Compare with nebari-config.yaml node_selectors
# Fix: Apply correct labels
kubectl label node <node-name> node-role.nebari.io/group=<value>

2. Insufficient resources

# Check node resources
kubectl describe nodes
kubectl top nodes # Requires metrics-server

# Fix: Add more nodes or adjust resource requests

3. Node taints

# Check for taints
kubectl get nodes -o json | jq '.items[].spec.taints'

# Fix: Remove unwanted taints
kubectl taint nodes <node-name> <taint-key>-

LoadBalancer Service Pending

Symptom

🚨 Service stuck in Pending with no external IP

Quick diagnosis:

kubectl get svc -A | grep LoadBalancer
kubectl get pods -n metallb-system
MetalLB Troubleshooting Steps

1. Verify MetalLB is running

# Check MetalLB pods
kubectl get pods -n metallb-system

# All pods should be Running

2. Check MetalLB configuration

# Verify IP pool
kubectl get ipaddresspool -n metallb-system -o yaml

# Verify L2 advertisement
kubectl get l2advertisement -n metallb-system -o yaml

3. Check for IP conflicts

# Ping IPs in your range to check if already in use
ping 192.168.1.200

# Check MetalLB logs
kubectl logs -n metallb-system -l app=metallb --tail=50

4. Common fixes

  • Ensure IP range doesn't overlap with DHCP
  • Verify IPs are in same subnet as nodes
  • Check firewall rules allow ARP traffic
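One way to check for conflicts is to expand the configured range and probe each address. The helper below is an assumed sketch (not shipped with MetalLB) that turns an `A-B` range into one IP per line:

```shell
# Assumed helper: expand "192.168.1.200-192.168.1.205" into one IP per
# line so each address can be probed before MetalLB claims it.
expand_range() {
  local start="${1%-*}" end="${1#*-}"
  local net="${start%.*}" i
  for (( i = ${start##*.}; i <= ${end##*.}; i++ )); do
    echo "$net.$i"
  done
}

# Example probe loop (commented out; pings your real network):
# expand_range 192.168.1.200-192.168.1.220 | while read -r ip; do
#   ping -c1 -W1 "$ip" >/dev/null 2>&1 && echo "$ip already answers - conflict"
# done
```

Any address that answers a ping before MetalLB assigns it is already in use and should be removed from the range.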

API Server Unreachable

Symptom

🚨 Cannot connect to cluster with kubectl

Quick diagnosis:

# Test virtual IP connectivity
ping <apiserver_endpoint>

# Test API server port
telnet <apiserver_endpoint> 6443
KubeVIP Troubleshooting Steps

1. Check KubeVIP status

# SSH to a control plane node
ssh ubuntu@<control-plane-ip>

# Check KubeVIP pods
sudo k3s kubectl get pods -n kube-system | grep kube-vip

# Check KubeVIP logs
sudo k3s kubectl logs -n kube-system <kube-vip-pod>

2. Verify network configuration

# Check if virtual IP is assigned
ip addr show | grep <apiserver_endpoint>

# Verify correct interface
ip addr show <kube_vip_interface>

3. Common fixes

  • Verify kube_vip_interface matches actual network interface
  • Ensure virtual IP is in same subnet as nodes
  • Check firewall allows traffic on port 6443
  • Verify ARP is enabled (kube_vip_arp: true)

Advanced Topics

Custom Data Directory

For production with dedicated storage volumes, configure K3s to use custom data directories.

Why: Separate OS and application data, use high-performance storage, better disk management.

Add to group_vars/all.yaml:

extra_server_args: >-
  --data-dir /mnt/k3s-data
  [... other args ...]

extra_agent_args: >-
  --data-dir /mnt/k3s-data

Prepare storage on each node:

# For standard disk
sudo mkfs.ext4 /dev/sdb
sudo mkdir -p /mnt/k3s-data
echo '/dev/sdb /mnt/k3s-data ext4 defaults 0 0' | sudo tee -a /etc/fstab
sudo mount -a

# For LVM with XFS (better for large files)
sudo lvcreate -L 1400G -n k3s-data ubuntu-vg
sudo mkfs.xfs /dev/ubuntu-vg/k3s-data
UUID=$(sudo blkid -s UUID -o value /dev/ubuntu-vg/k3s-data)
echo "UUID=$UUID /mnt/k3s-data xfs defaults 0 2" | sudo tee -a /etc/fstab
sudo mount -a
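Before running the playbook, it is worth confirming every node's `/etc/fstab` actually has the data-mount entry. A hedged sketch (the helper is an assumption, written for illustration) that checks fstab text for a mount point:

```shell
# Sketch: given /etc/fstab contents as text, exit 0 if an uncommented
# entry mounts the given mount point, non-zero otherwise.
fstab_has_mount() {
  awk -v mp="$2" '$1 !~ /^#/ && $2 == mp { found = 1 } END { exit !found }' <<<"$1"
}

# Usage on a node:
# fstab_has_mount "$(cat /etc/fstab)" /mnt/k3s-data && echo "mount configured"
```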

Storage Configuration

K3s includes a local-path storage provisioner suitable for development. For production:

Options:

  • Local storage: Use K3s default local-path storage class
  • NFS: Configure NFS server and use NFS storage class
  • Ceph/Rook: Distributed storage for multi-node persistent volumes
  • Cloud CSI: If hybrid cloud, use provider-specific CSI drivers

Example NFS configuration:

# In nebari-config.yaml
default_storage_class: nfs-client

Migrating User Data

When migrating from existing systems:

  1. Copy data to storage node:

    rsync -avhP -e ssh /old/home/ user@k3s-node:/mnt/k3s-data/backup/home/
  2. Check JupyterHub UIDs:

    kubectl exec -it jupyter-<username> -- id
  3. Adjust ownership if needed:

    sudo chown -R <jupyter-uid>:<jupyter-gid> /mnt/k3s-data/backup/home/<username>
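The uid and gid for the `chown` can be scripted from the `id` output of step 2. A sketch (helper name and sample output assumed; the actual ids in your pod may differ):

```shell
# Sketch: extract "uid:gid" from `id` output for use with chown.
parse_ids() {
  sed -n 's/^uid=\([0-9][0-9]*\)([^)]*) gid=\([0-9][0-9]*\).*/\1:\2/p' <<<"$1"
}

parse_ids 'uid=1000(jovyan) gid=100(users) groups=100(users)'   # prints 1000:100
```

Combined with step 2, something like `sudo chown -R "$(parse_ids "$(kubectl exec jupyter-<username> -- id)")" <dir>` applies the right ownership in one go.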

Scaling the Cluster

Add worker nodes:

  1. Add nodes to inventory.yml
  2. Run playbook targeting new nodes:
    ansible-playbook -i inventory.yml playbook.yaml --limit new-node
  3. Label new nodes for Nebari

Upgrade K3s:

  1. Update k3s_version in group_vars/all.yaml
  2. Run playbook:
    ansible-playbook -i inventory.yml playbook.yaml
warning

Always test upgrades in non-production first. Backup data before upgrading.


Next Steps

👨‍💻 Environment Management
Configure conda environments and packages
→ Learn more
📊 Monitoring
Set up Prometheus and Grafana
→ Learn more
💾 Backups
Configure backup strategies
→ Learn more
⚡ Distributed Computing
Explore Dask for parallel processing
→ Learn more

Additional Resources

Quick Start: Single-Node Setup

If you're just getting started or want to test Nebari on bare metal, you can deploy K3s directly on a single machine. This is perfect for development, testing, or small-scale deployments.

Install K3s on a Single Node

  1. Prepare your machine (Ubuntu 20.04+ or similar):

    # Update system packages
    sudo apt update && sudo apt upgrade -y
  2. Install K3s with the default installer:

    curl -sfL https://get.k3s.io | sh -

    This single command downloads and installs K3s, sets it up as a systemd service, and configures everything needed to run Kubernetes.

  3. Verify K3s is running:

    sudo k3s kubectl get nodes

    You should see your node in a "Ready" state.

  4. Get your kubeconfig for Nebari deployment:

    sudo cat /etc/rancher/k3s/k3s.yaml

    Copy this kubeconfig content. You'll need to:

    • Save it to a file (e.g., ~/.kube/k3s-config)
    • Replace 127.0.0.1 with your server's actual IP address if deploying from another machine
  5. Configure Nebari to use your K3s cluster:

    In your nebari-config.yaml:

    provider: existing

    kubeconfig_path: ~/.kube/k3s-config # Path to the kubeconfig file you saved

    kubernetes_context: default # The context name from your kubeconfig

    # Optional: Configure node groups
    default_node_groups:
      general:
        instance: general-instance
        min_nodes: 1
        max_nodes: 1
      user:
        instance: user-instance
        min_nodes: 1
        max_nodes: 1
      worker:
        instance: worker-instance
        min_nodes: 1
        max_nodes: 1

    # Single node: a label key holds one value per node, so point every
    # group at the same label value applied to your node.
    node_selectors:
      general:
        node-role.nebari.io/group: general
      user:
        node-role.nebari.io/group: general
      worker:
        node-role.nebari.io/group: general
  6. Label your node (optional but recommended):

    # Get your node name
    sudo k3s kubectl get nodes

    # Apply a label (a key holds a single value per node; on a single-node
    # cluster, point all node_selectors in nebari-config.yaml at this value)
    sudo k3s kubectl label node <your-node-name> node-role.nebari.io/group=general
  7. Install MetalLB for LoadBalancer support:

    Nebari requires LoadBalancer services for ingress. Install MetalLB to provide this on bare metal:

    # Download and apply MetalLB manifest
    kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.14.8/config/manifests/metallb-native.yaml

    # Wait for MetalLB to be ready
    kubectl wait --namespace metallb-system \
    --for=condition=ready pod \
    --selector=app=metallb \
    --timeout=90s

    Configure MetalLB with an IP address pool (adjust IP range to match your network):

    cat <<EOF | kubectl apply -f -
    apiVersion: metallb.io/v1beta1
    kind: IPAddressPool
    metadata:
      name: default-pool
      namespace: metallb-system
    spec:
      addresses:
        - 192.168.1.200-192.168.1.220 # Adjust to your available IP range
    ---
    apiVersion: metallb.io/v1beta1
    kind: L2Advertisement
    metadata:
      name: default
      namespace: metallb-system
    spec:
      ipAddressPools:
        - default-pool
    EOF
    IP Address Range

    The IP range should be:

    • In the same subnet as your K3s node
    • Not used by DHCP or other devices
    • Accessible from your network
    • Reserve at least 5-10 addresses for Nebari services
  8. Deploy Nebari:

    nebari deploy -c nebari-config.yaml

When to Move to Production

This single-node setup works great for testing, but for production you'll want:

  • High availability: Multiple control plane nodes so your cluster survives node failures
  • Better resource isolation: Separate nodes for different workloads (user sessions, background jobs, system services)
  • Easier management: Automated provisioning and configuration with Ansible
  • Load balancing: Proper ingress and service load balancing with MetalLB

That's where the production setup comes in - continue reading to learn how to deploy a production-ready, highly available Nebari cluster.


Production Setup with nebari-k3s

For production deployments, we recommend using nebari-k3s - an Ansible-based solution that sets up a production-ready K3s cluster with KubeVIP and MetalLB.

What You'll Be Using

K3s - A lightweight, certified Kubernetes distribution designed for resource-constrained environments and edge computing. Unlike full Kubernetes, K3s:

  • Uses a single binary of less than 100MB
  • Has lower memory and CPU requirements
  • Is easier to install and maintain
  • Is perfect for bare metal deployments where you want Kubernetes without the complexity

Ansible - An automation tool that uses simple YAML files (playbooks) to configure and manage servers. In this guide, Ansible:

  • Installs and configures K3s on all your nodes
  • Sets up networking components (KubeVIP, MetalLB)
  • Ensures consistent configuration across your cluster
  • Runs from your local machine or a control node

KubeVIP - Provides a highly-available virtual IP address for your Kubernetes API server. This means:

  • Your cluster remains accessible even if a control plane node fails
  • All nodes use a single IP to access the Kubernetes API
  • Essential for multi-master (HA) setups

MetalLB - A load balancer implementation for bare metal clusters. Since cloud providers automatically provide load balancers but bare metal doesn't, MetalLB:

  • Assigns external IP addresses to Kubernetes services
  • Enables Nebari's ingress to be accessible from outside the cluster
  • Uses Layer 2 (ARP) or BGP to advertise service IPs

Overview

The nebari-k3s project provides Ansible playbooks to:

  • Deploy a lightweight K3s Kubernetes cluster on bare metal servers
  • Configure KubeVIP for high-availability control plane
  • Set up MetalLB for load balancing
  • Prepare the cluster for Nebari deployment

This approach is ideal for:

  • On-premises deployments
  • Organizations with existing bare metal infrastructure
  • HPC environments transitioning from traditional batch systems
  • Cost-sensitive deployments requiring full hardware control
info

This solution replaces the deprecated nebari-slurm project, providing a modern Kubernetes-based alternative for bare metal deployments.

Prerequisites

Infrastructure Requirements

  • Minimum 3 bare metal servers (recommended for HA):

    • Control plane nodes: 8 vCPU / 32 GB RAM minimum
    • Worker nodes: 4 vCPU / 16 GB RAM minimum per node
    • 200 GB disk space per node
  • Network requirements:

    • All nodes on the same subnet
    • Static IP addresses assigned to each node
    • SSH access to all nodes
    • IP range reserved for MetalLB load balancer
    • Virtual IP address for the Kubernetes API server

Software Requirements

On your local machine (where you'll run Ansible):

  • Python 3.8+
  • Ansible 2.10+
  • kubectl
  • SSH key access to all nodes

On bare metal nodes:

  • Ubuntu 20.04+ or compatible Linux distribution
  • Passwordless sudo access for the SSH user
Running Ansible

Ansible requires a Linux/Unix environment. If your workstation runs Windows:

  • Use WSL2 (Windows Subsystem for Linux)
  • Deploy from one of your Linux nodes (e.g., the first control plane node)
  • Use a Linux VM or container

The deployment examples below assume you're running from a Linux environment with direct SSH access to all cluster nodes.

Step 1: Clone nebari-k3s Repository

git clone https://github.com/nebari-dev/nebari-k3s.git
cd nebari-k3s

Step 2: Configure Inventory

Create an Ansible inventory file describing your cluster:

# inventory.yml
all:
  vars:
    ansible_user: ubuntu
    ansible_ssh_private_key_file: ~/.ssh/id_rsa

    # K3s configuration
    k3s_version: v1.28.5+k3s1
    apiserver_endpoint: "192.168.1.100" # Virtual IP for API server

    # KubeVIP configuration
    kube_vip_tag_version: "v0.7.0"
    kube_vip_interface: "ens5" # Network interface for VIP (default: ens5)
    kube_vip_lb_ip_range: "192.168.1.200-192.168.1.220" # IPs for services

    # MetalLB configuration
    metal_lb_ip_range:
      - "192.168.1.200-192.168.1.220"

  children:
    master:
      hosts:
        node1:
          ansible_host: 192.168.1.101
        node2:
          ansible_host: 192.168.1.102
        node3:
          ansible_host: 192.168.1.103

    node:
      hosts:
        node4:
          ansible_host: 192.168.1.104
        node5:
          ansible_host: 192.168.1.105
        node6:
          ansible_host: 192.168.1.106

    k3s_cluster:
      children:
        master:
        node:

Advanced Configuration with Custom Data Directory

For production deployments, especially when using dedicated storage volumes, configure K3s to use a custom data directory. This is particularly important when:

  • You have multiple disks (OS disk and separate data disk)
  • You want to use high-performance storage for Kubernetes data
  • You need to manage disk space separately for system and application data

Create or update your group_vars/all.yaml:

---
# K3s version to install
# Check https://github.com/k3s-io/k3s/releases for available versions
k3s_version: v1.30.2+k3s2

# Ansible connection user (must have passwordless sudo on all nodes)
ansible_user: ubuntu

# Network interface used by flannel CNI for pod networking
# Run 'ip addr show' on your nodes to find the correct interface
flannel_iface: ens192

# ============ KubeVIP Configuration ============
# KubeVIP provides a virtual IP for the Kubernetes API server (HA)

# Enable ARP broadcasts for virtual IP
kube_vip_arp: true

# Network interface where the virtual IP will be configured
# Must match the interface with connectivity to other nodes
kube_vip_interface: ens192

# KubeVIP container image version
kube_vip_tag_version: v0.8.2

# Virtual IP address for Kubernetes API server
# This IP must be:
# - In the same subnet as your nodes
# - Not currently in use by any other device
# - Accessible from all nodes
apiserver_endpoint: 192.168.1.100

# ============ Cluster Security ============
# Shared secret token for K3s cluster nodes to authenticate
# IMPORTANT: Must be alphanumeric only (no special characters)
# Generate a secure random token: openssl rand -hex 20
k3s_token: your-secure-cluster-token

# ============ K3s Server Arguments ============
# Additional arguments passed to K3s server nodes (control plane)
extra_server_args: >-
  --tls-san {{ apiserver_endpoint }}
  --disable servicelb
  --disable traefik
  --write-kubeconfig-mode 644
  --flannel-iface={{ flannel_iface }}
  --data-dir /mnt/k3s-data

# --tls-san: Add virtual IP to API server TLS certificate
# --disable servicelb: Disable built-in load balancer (we use MetalLB)
# --disable traefik: Disable built-in ingress (Nebari installs its own)
# --write-kubeconfig-mode 644: Make kubeconfig readable
# --flannel-iface: Network interface for pod networking
# --data-dir: Custom location for K3s data (optional, see Step 2.1)

# ============ K3s Agent Arguments ============
# Additional arguments passed to K3s agent nodes (workers)
extra_agent_args: >-
  --flannel-iface={{ flannel_iface }}
  --data-dir /mnt/k3s-data

# ============ MetalLB Configuration ============
# MetalLB provides LoadBalancer services on bare metal

# MetalLB type: 'native' (recommended) or 'frr'
metal_lb_type: native

# MetalLB mode: 'layer2' (simple ARP-based) or 'bgp' (requires BGP router)
metal_lb_mode: layer2

# MetalLB speaker image version
metal_lb_speaker_tag_version: v0.14.8

# MetalLB controller image version
metal_lb_controller_tag_version: v0.14.8

# IP address range for LoadBalancer services
# Can be a string or list: "192.168.1.200-192.168.1.220" or ["192.168.1.200-192.168.1.220"]
# These IPs will be assigned to Nebari's ingress and other LoadBalancer services
# Requirements:
# - Must be in the same subnet as your nodes
# - Must not overlap with DHCP ranges or other static IPs
# - Reserve enough IPs for all services (typically 5-10 is sufficient)
metal_lb_ip_range: 192.168.1.200-192.168.1.220 # Can also be a list: ["192.168.1.200-192.168.1.220"]
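Since the token must be strictly alphanumeric, a quick check before writing it into `group_vars/all.yaml` avoids a failed cluster join later. A sketch (the helper name is illustrative, not part of nebari-k3s):

```shell
# Sketch: report whether a candidate k3s_token is strictly alphanumeric.
valid_k3s_token() {
  case "$1" in
    *[!A-Za-z0-9]*|'') echo invalid ;;
    *) echo ok ;;
  esac
}

valid_k3s_token abc123DEF      # prints ok
valid_k3s_token 'bad-token!'   # prints invalid
```

A token produced with `openssl rand -hex 20` is hex-only and always passes this check.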

Variable Reference Summary

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| k3s_version | Yes | - | K3s version to install |
| ansible_user | Yes | - | SSH user with sudo access |
| flannel_iface | Yes | - | Network interface for pod networking |
| kube_vip_interface | Yes | - | Network interface for virtual IP |
| kube_vip_tag_version | No | v0.8.2 | KubeVIP image version |
| kube_vip_arp | No | true | Enable ARP for virtual IP |
| apiserver_endpoint | Yes | - | Virtual IP for Kubernetes API |
| k3s_token | Yes | - | Cluster authentication token (alphanumeric) |
| extra_server_args | No | - | Additional K3s server arguments |
| extra_agent_args | No | - | Additional K3s agent arguments |
| metal_lb_type | No | native | MetalLB implementation type |
| metal_lb_mode | No | layer2 | MetalLB operating mode |
| metal_lb_speaker_tag_version | No | v0.14.8 | MetalLB speaker image version |
| metal_lb_controller_tag_version | No | v0.14.8 | MetalLB controller image version |
| metal_lb_ip_range | Yes | - | IP range for LoadBalancer services (string or list) |

:::warning[Important: Custom Data Directory]
If you specify `--data-dir /mnt/k3s-data`, you **must** ensure this directory exists and is properly mounted on **all** nodes before running the Ansible playbook. See Step 2.1 below.
:::

### Step 2.1: Prepare Storage (Required for Custom Data Directory)

If you're using a custom data directory with dedicated storage volumes, prepare them on each node:

#### For worker nodes with separate data disks:

```bash
# On each node, identify the data disk
lsblk

# Format the disk (example: /dev/sdb - verify your disk name!)
sudo mkfs.ext4 /dev/sdb

# Create mount point
sudo mkdir -p /mnt/k3s-data

# Add to fstab for persistence
echo '/dev/sdb /mnt/k3s-data ext4 defaults 0 0' | sudo tee -a /etc/fstab

# Mount the disk
sudo mount -a

# Verify
df -h /mnt/k3s-data
```

For control plane with large storage requirements (using LVM):

If your control plane node needs flexible storage management (e.g., for backups, persistent volumes):

# Check available volume groups
sudo vgs

# Create logical volume (example: 1.4TB from existing volume group)
sudo lvcreate -L 1400G -n k3s-data ubuntu-vg

# Format with XFS for better performance with large files
sudo mkfs.xfs /dev/ubuntu-vg/k3s-data

# Create mount point
sudo mkdir -p /mnt/k3s-data

# Add to fstab using UUID for reliability
UUID=$(sudo blkid -s UUID -o value /dev/ubuntu-vg/k3s-data)
echo "UUID=$UUID /mnt/k3s-data xfs defaults 0 2" | sudo tee -a /etc/fstab

# Mount
sudo mount -a

# Verify
df -h /mnt/k3s-data
lsblk
Storage Recommendations
  • XFS: Better for large files and high I/O workloads (recommended for nodes with databases or large datasets)
  • ext4: General purpose, good default choice for most workloads
  • Leave space for expansion: Don't allocate 100% of available storage to allow for future growth
  • Consistent paths: Use the same mount point (/mnt/k3s-data) on all nodes

Step 2.2: Verify Network Interfaces

Ensure you're using the correct network interface names in your configuration:

# On each node, list network interfaces
ip addr show

# Common interface names:
# - ens192, ens160 (VMware)
# - eth0, eth1 (AWS, some bare metal)
# - eno1, eno2 (Dell, HP servers)

Update flannel_iface and kube_vip_interface in your group_vars/all.yaml to match your actual interface names.

Step 3: Run Ansible Playbook

Deploy the K3s cluster:

ansible-playbook -i inventory.yml playbook.yaml

This will:

  1. Install K3s on all nodes
  2. Configure the control plane with high availability
  3. Deploy KubeVIP for API server load balancing
  4. Install and configure MetalLB for service load balancing
  5. Set up proper node labels and taints
Known Issue: Multi-Master Join

There's a known issue in nebari-k3s where additional master nodes may fail to join the cluster correctly due to the IP filtering task returning multiple IPs. If you encounter this:

  1. Check that additional master nodes are running K3s:

    ssh user@node2 "sudo systemctl status k3s"
  2. Verify they can reach the first master node:

    ssh user@node2 "curl -k https://192.168.1.101:6443/ping"
  3. If a node is running but not joined, you may need to manually re-run the join command on that node or investigate the Ansible task that filters the flannel interface IP.

Step 4: Sync Kubeconfig

After the playbook completes, sync the kubeconfig to your local machine:

# Set environment variables
export SSH_USER="root" # Default: root (change if using a different user)
export SSH_HOST="192.168.1.101" # IP of any master node
export SSH_KEY_FILE="$HOME/.ssh/id_rsa" # use $HOME rather than "~": tilde does not expand inside double quotes

# Sync kubeconfig
make kubeconfig-sync

Verify cluster access:

kubectl get nodes -o wide

You should see all your nodes in a Ready state.

Step 5: Label Nodes for Nebari

Nebari requires specific node labels for scheduling workloads. For optimal resource utilization and proper workload distribution, use the recommended node-role.nebari.io/group label:

# Label control plane/general nodes
# Label control plane/general nodes
kubectl label nodes node1 node2 node3 \
  node-role.nebari.io/group=general

# Label user workload nodes
kubectl label nodes node4 \
  node-role.nebari.io/group=user

# Label Dask worker nodes
kubectl label nodes node5 node6 \
  node-role.nebari.io/group=worker
Node Labeling Best Practices
  • Consistent labeling: Using node-role.nebari.io/group as the label key ensures consistent behavior across all Nebari components
  • Shared nodes: a Kubernetes label key holds a single value, so one node cannot carry two node-role.nebari.io/group values; to share a node between roles (e.g., user and worker), point both node selectors at the same label value
  • Control plane nodes: Typically labeled as general to host core Nebari services
  • Resource optimization: Proper labeling enables Horizontal Pod Autoscaling (HPA) to fully utilize your cluster resources

Alternative labeling scheme (legacy):

# This also works but is not recommended for new deployments
kubectl label nodes node1 node-role.kubernetes.io/general=true

Verify your labels:

kubectl get nodes --show-labels

Step 6: Initialize Nebari Configuration

Now initialize Nebari for deployment on your existing cluster:

nebari init existing \
--project my-nebari \
--domain nebari.example.com \
--auth-provider github

Step 7: Configure Nebari for Bare Metal

Edit the generated nebari-config.yaml to configure it for your K3s cluster:

project_name: my-nebari
provider: existing
domain: nebari.example.com

certificate:
  type: lets-encrypt
  acme_email: admin@example.com
  acme_server: https://acme-v02.api.letsencrypt.org/directory

security:
  authentication:
    type: GitHub
    config:
      client_id: <github-oauth-app-client-id>
      client_secret: <github-oauth-app-client-secret>
      oauth_callback_url: https://nebari.example.com/hub/oauth_callback

local:
  # Specify the kubectl context name from your kubeconfig
  kube_context: "default" # Or the context name from your K3s cluster

  # Configure node selectors to match your labeled nodes
  node_selectors:
    general:
      key: node-role.nebari.io/group
      value: general
    user:
      key: node-role.nebari.io/group
      value: user
    worker:
      key: node-role.nebari.io/group
      value: worker

# Configure default profiles
profiles:
  jupyterlab:
    - display_name: Small Instance
      description: 2 CPU / 8 GB RAM
      default: true
      kubespawner_override:
        cpu_limit: 2
        cpu_guarantee: 1.5
        mem_limit: 8G
        mem_guarantee: 5G

    - display_name: Medium Instance
      description: 4 CPU / 16 GB RAM
      kubespawner_override:
        cpu_limit: 4
        cpu_guarantee: 3
        mem_limit: 16G
        mem_guarantee: 10G

  dask_worker:
    Small Worker:
      worker_cores_limit: 2
      worker_cores: 1.5
      worker_memory_limit: 8G
      worker_memory: 5G
      worker_threads: 2

    Medium Worker:
      worker_cores_limit: 4
      worker_cores: 3
      worker_memory_limit: 16G
      worker_memory: 10G
      worker_threads: 4

# Optional: Configure storage class
# default_storage_class: local-path # K3s default storage class

Important Configuration Notes

Kubernetes Context

The kube_context field must match the context name in your kubeconfig. To find available contexts:

kubectl config get-contexts

Use the name from the NAME column in the output.

Node Selectors

Node selectors tell Nebari where to schedule different types of workloads:

  • general: Core Nebari services (JupyterHub, monitoring, etc.)
  • user: User JupyterLab pods
  • worker: Dask worker pods for distributed computing

Make sure the key and value match the labels you applied to your nodes in Step 5.

Step 8: Deploy Nebari

Deploy Nebari to your K3s cluster:

nebari deploy --config nebari-config.yaml

During deployment, you'll be prompted to update your DNS records. Add an A record pointing your domain at the external IP that MetalLB assigned to Nebari's ingress LoadBalancer service.

Step 9: Verify Deployment

Once deployment completes, verify all components are running:

kubectl get pods -A
kubectl get ingress -A

Access Nebari at https://nebari.example.com and log in with your configured authentication provider.

Troubleshooting

Pods Not Scheduling

If pods remain in Pending state:

kubectl describe pod <pod-name> -n <namespace>

Common issues:

  • Node selector mismatch: Verify labels match between nebari-config.yaml and actual node labels
  • Insufficient resources: Ensure nodes have enough CPU/memory available
  • Taints: Check if nodes have taints that prevent scheduling

LoadBalancer Services Pending

If services of type LoadBalancer remain in Pending state:

kubectl get svc -A | grep LoadBalancer

Verify MetalLB is running:

kubectl get pods -n metallb-system

Check MetalLB configuration:

kubectl get ipaddresspool -n metallb-system
kubectl get l2advertisement -n metallb-system
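
MetalLB in layer-2 mode only announces address pools that are referenced by an L2Advertisement. If the second command returns nothing, a minimal manifest matching the default-pool created during setup looks like:

```yaml
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-adv
  namespace: metallb-system
spec:
  ipAddressPools:
    - default-pool   # name of the IPAddressPool created during setup
```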

API Server Unreachable

If you cannot connect to the cluster:

  1. Verify KubeVIP is running on control plane nodes:

    ssh ubuntu@192.168.1.101 "sudo k3s kubectl get pods -n kube-system | grep kube-vip"
  2. Check if the virtual IP is responding:

    ping 192.168.1.100
  3. Verify the network interface is correct in your inventory configuration

Storage Considerations

K3s includes a default local-path storage provisioner that works well for development. For production:

  • Local storage: K3s local-path provisioner (default)
  • Network storage: Configure NFS, Ceph, or other storage classes
  • Cloud storage: If running in a hybrid environment, configure cloud CSI drivers

Example NFS storage class configuration:

# Add to nebari-config.yaml under theme.jupyterhub
storage_class_name: nfs-client
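
The nfs-client storage class itself must exist in the cluster before Nebari can use it. If you install the widely used nfs-subdir-external-provisioner, the resulting StorageClass looks roughly like this (the provisioner name and parameters depend on how the provisioner was installed, so treat this as a sketch):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-client
# provisioner name depends on your install; this is the chart's common default
provisioner: cluster.local/nfs-subdir-external-provisioner
parameters:
  archiveOnDelete: "true"   # keep a copy of the data when PVCs are deleted
```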


Migrating Existing User Data

If you're migrating from an existing system (e.g., Slurm cluster), you can pre-populate user data:

  1. Copy data to the storage node (typically a control plane node with large storage):

    # From old system to new K3s storage
    rsync -avhP -e ssh /old/home/ user@k3s-node:/mnt/k3s-data/backup/home/
  2. Note about user IDs: User IDs in JupyterHub pods may differ from your existing system. After Nebari deployment:

    • Check the UID used by JupyterHub: kubectl exec -it jupyter-<username> -- id
    • Adjust file ownership if needed:
      # On the storage node
      sudo chown -R <jupyter-uid>:<jupyter-gid> /mnt/k3s-data/backup/home/<username>
  3. Create persistent volume for user data (if using custom storage):

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: user-data-pv
    spec:
      capacity:
        storage: 1000Gi
      accessModes:
        - ReadWriteMany
      hostPath:
        path: /mnt/k3s-data/users
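
A claim that binds to this volume requests the same access mode and capacity; a minimal sketch (names are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: user-data-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""        # empty string: bind to a pre-created PV, skip dynamic provisioning
  volumeName: user-data-pv    # pin the claim to the PV defined above
  resources:
    requests:
      storage: 1000Gi
```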
User Data Best Practices
  • Test data migration with a single user first
  • Verify file permissions match JupyterHub pod UIDs
  • Consider using NFS or similar for multi-node access to user data
  • Keep backups of original data during migration

Scaling Your Cluster

Adding Worker Nodes

  1. Add new nodes to your Ansible inventory
  2. Run the playbook targeting only new nodes:
    ansible-playbook -i inventory.yml playbook.yaml --limit new-node
  3. Label the new nodes for Nebari workloads

Upgrading K3s

To upgrade your K3s cluster:

  1. Update k3s_version in your inventory
  2. Run the playbook:
    ansible-playbook -i inventory.yml playbook.yaml
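
The version is typically pinned as an inventory or group variable; k3s_version is the variable this guide references, and the release string below is only an example (pick an actual tag from the K3s releases page):

```yaml
# group_vars/all.yaml (illustrative fragment)
k3s_version: v1.30.4+k3s1   # example only; use a real K3s release tag
```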
Warning

Test upgrades in a non-production environment first, and always back up your data before upgrading.

Next Steps

Additional Resources