
Deploy Nebari on Bare Metal with K3s

This guide walks you through deploying Nebari on bare metal infrastructure using nebari-k3s, an Ansible-based solution that sets up a production-ready K3s cluster with KubeVIP and MetalLB.

Overview

The nebari-k3s project provides Ansible playbooks to:

  • Deploy a lightweight K3s Kubernetes cluster on bare metal servers
  • Configure KubeVIP for high-availability control plane
  • Set up MetalLB for load balancing
  • Prepare the cluster for Nebari deployment

This approach is ideal for:

  • On-premises deployments
  • Organizations with existing bare metal infrastructure
  • HPC environments transitioning from traditional batch systems
  • Cost-sensitive deployments requiring full hardware control
:::info

This solution replaces the deprecated nebari-slurm project, providing a modern, Kubernetes-based alternative for bare metal deployments.

:::

Prerequisites

Infrastructure Requirements

  • Minimum 3 bare metal servers (recommended for HA):

    • Control plane nodes: 8 vCPU / 32 GB RAM minimum
    • Worker nodes: 4 vCPU / 16 GB RAM minimum per node
    • 200 GB disk space per node
  • Network requirements:

    • All nodes on the same subnet
    • Static IP addresses assigned to each node
    • SSH access to all nodes
    • IP range reserved for MetalLB load balancer
    • Virtual IP address for the Kubernetes API server
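Before provisioning, it can help to sanity-check the addressing plan. The sketch below is an illustration (not part of nebari-k3s; the subnet and IPs are the example values used throughout this guide): it uses Python's stdlib `ipaddress` module to confirm the API virtual IP and the MetalLB range fall inside the node subnet and don't collide with node addresses.

```python
import ipaddress

# Assumed example addressing plan; substitute your own values.
subnet = ipaddress.ip_network("192.168.1.0/24")
node_ips = [ipaddress.ip_address(f"192.168.1.{i}") for i in range(101, 107)]
api_vip = ipaddress.ip_address("192.168.1.100")
lb_start = ipaddress.ip_address("192.168.1.200")
lb_end = ipaddress.ip_address("192.168.1.220")

# The virtual IP and every MetalLB IP must live in the node subnet
assert api_vip in subnet
assert lb_start in subnet and lb_end in subnet

# The MetalLB range must not overlap node IPs or the virtual IP
reserved = set(node_ips) | {api_vip}
lb_range = {ipaddress.ip_address(ip) for ip in range(int(lb_start), int(lb_end) + 1)}
assert not (lb_range & reserved)

print(f"{len(lb_range)} LoadBalancer IPs reserved")  # → 21 LoadBalancer IPs reserved
```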

Software Requirements

On your local machine (where you'll run Ansible):

  • Python 3.8+
  • Ansible 2.10+
  • kubectl
  • SSH key access to all nodes

On bare metal nodes:

  • Ubuntu 20.04+ or compatible Linux distribution
  • Passwordless sudo access for the SSH user
:::note[Running Ansible]

Ansible requires a Linux/Unix environment. If your workstation runs Windows:

  • Use WSL2 (Windows Subsystem for Linux)
  • Deploy from one of your Linux nodes (e.g., the first control plane node)
  • Use a Linux VM or container

The deployment examples below assume you're running from a Linux environment with direct SSH access to all cluster nodes.

:::

Step 1: Clone nebari-k3s Repository

```bash
git clone https://github.com/nebari-dev/nebari-k3s.git
cd nebari-k3s
```

Step 2: Configure Inventory

Create an Ansible inventory file describing your cluster:

```yaml
# inventory.yml
all:
  vars:
    ansible_user: ubuntu
    ansible_ssh_private_key_file: ~/.ssh/id_rsa

    # K3s configuration
    k3s_version: v1.28.5+k3s1
    apiserver_endpoint: "192.168.1.100"  # Virtual IP for API server

    # KubeVIP configuration
    kube_vip_tag_version: "v0.7.0"
    kube_vip_interface: "ens5"  # Network interface for VIP (default: ens5)
    kube_vip_lb_ip_range: "192.168.1.200-192.168.1.220"  # IPs for services

    # MetalLB configuration
    metal_lb_ip_range:
      - "192.168.1.200-192.168.1.220"

  children:
    master:
      hosts:
        node1:
          ansible_host: 192.168.1.101
        node2:
          ansible_host: 192.168.1.102
        node3:
          ansible_host: 192.168.1.103

    node:
      hosts:
        node4:
          ansible_host: 192.168.1.104
        node5:
          ansible_host: 192.168.1.105
        node6:
          ansible_host: 192.168.1.106

    k3s_cluster:
      children:
        master:
        node:
```

Advanced Configuration with Custom Data Directory

For production deployments, especially when using dedicated storage volumes, configure K3s to use a custom data directory. This is particularly important when:

  • You have multiple disks (OS disk and separate data disk)
  • You want to use high-performance storage for Kubernetes data
  • You need to manage disk space separately for system and application data

Create or update your `group_vars/all.yaml`:

```yaml
---
# K3s version to install
# Check https://github.com/k3s-io/k3s/releases for available versions
k3s_version: v1.30.2+k3s2

# Ansible connection user (must have passwordless sudo on all nodes)
ansible_user: ubuntu

# Network interface used by the flannel CNI for pod networking
# Run 'ip addr show' on your nodes to find the correct interface
flannel_iface: ens192

# ============ KubeVIP Configuration ============
# KubeVIP provides a virtual IP for the Kubernetes API server (HA)

# Enable ARP broadcasts for the virtual IP
kube_vip_arp: true

# Network interface where the virtual IP will be configured
# Must match the interface with connectivity to the other nodes
kube_vip_interface: ens192

# KubeVIP container image version
kube_vip_tag_version: v0.8.2

# Virtual IP address for the Kubernetes API server
# This IP must be:
# - In the same subnet as your nodes
# - Not currently in use by any other device
# - Accessible from all nodes
apiserver_endpoint: 192.168.1.100

# ============ Cluster Security ============
# Shared secret token used by K3s cluster nodes to authenticate
# IMPORTANT: Must be alphanumeric only (no special characters)
# Generate a secure random token: openssl rand -hex 20
k3s_token: ChangeMe0123456789abcdef

# ============ K3s Server Arguments ============
# Additional arguments passed to K3s server nodes (control plane)
extra_server_args: >-
  --tls-san {{ apiserver_endpoint }}
  --disable servicelb
  --disable traefik
  --write-kubeconfig-mode 644
  --flannel-iface={{ flannel_iface }}
  --data-dir /mnt/k3s-data

# --tls-san: Add the virtual IP to the API server TLS certificate
# --disable servicelb: Disable the built-in load balancer (we use MetalLB)
# --disable traefik: Disable the built-in ingress (Nebari installs its own)
# --write-kubeconfig-mode 644: Make the kubeconfig readable
# --flannel-iface: Network interface for pod networking
# --data-dir: Custom location for K3s data (optional, see Step 2.1)

# ============ K3s Agent Arguments ============
# Additional arguments passed to K3s agent nodes (workers)
extra_agent_args: >-
  --flannel-iface={{ flannel_iface }}
  --data-dir /mnt/k3s-data

# ============ MetalLB Configuration ============
# MetalLB provides LoadBalancer services on bare metal

# MetalLB type: 'native' (recommended) or 'frr'
metal_lb_type: native

# MetalLB mode: 'layer2' (simple ARP-based) or 'bgp' (requires a BGP router)
metal_lb_mode: layer2

# MetalLB speaker image version
metal_lb_speaker_tag_version: v0.14.8

# MetalLB controller image version
metal_lb_controller_tag_version: v0.14.8

# IP address range for LoadBalancer services
# Can be a string or a list: "192.168.1.200-192.168.1.220" or ["192.168.1.200-192.168.1.220"]
# These IPs will be assigned to Nebari's ingress and other LoadBalancer services
# Requirements:
# - Must be in the same subnet as your nodes
# - Must not overlap with DHCP ranges or other static IPs
# - Reserve enough IPs for all services (typically 5-10 is sufficient)
metal_lb_ip_range: 192.168.1.200-192.168.1.220
```
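Since the token must be strictly alphanumeric, a quick way to generate and check one is Python's `secrets` module; `openssl rand -hex 20` produces an equivalent hex string. This is an illustrative helper, not part of the playbooks:

```python
import secrets

def make_k3s_token(nbytes: int = 20) -> str:
    """Generate a random hex token, which is alphanumeric by construction."""
    token = secrets.token_hex(nbytes)
    assert token.isalnum(), "K3s tokens must contain only letters and digits"
    return token

token = make_k3s_token()
print(len(token))  # → 40 (two hex characters per random byte)
```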

Variable Reference Summary

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| `k3s_version` | Yes | - | K3s version to install |
| `ansible_user` | Yes | - | SSH user with sudo access |
| `flannel_iface` | Yes | - | Network interface for pod networking |
| `kube_vip_interface` | Yes | - | Network interface for the virtual IP |
| `kube_vip_tag_version` | No | v0.8.2 | KubeVIP image version |
| `kube_vip_arp` | No | true | Enable ARP for the virtual IP |
| `apiserver_endpoint` | Yes | - | Virtual IP for the Kubernetes API |
| `k3s_token` | Yes | - | Cluster authentication token (alphanumeric) |
| `extra_server_args` | No | - | Additional K3s server arguments |
| `extra_agent_args` | No | - | Additional K3s agent arguments |
| `metal_lb_type` | No | native | MetalLB implementation type |
| `metal_lb_mode` | No | layer2 | MetalLB operating mode |
| `metal_lb_speaker_tag_version` | No | v0.14.8 | MetalLB speaker image version |
| `metal_lb_controller_tag_version` | No | v0.14.8 | MetalLB controller image version |
| `metal_lb_ip_range` | Yes | - | IP range for LoadBalancer services |

:::warning[Important: Custom Data Directory]
If you specify `--data-dir /mnt/k3s-data`, you **must** ensure this directory exists and is properly mounted on **all** nodes before running the Ansible playbook. See Step 2.1 below.
:::

### Step 2.1: Prepare Storage (Required for Custom Data Directory)

If you're using a custom data directory with dedicated storage volumes, prepare them on each node:

#### For worker nodes with separate data disks:

```bash
# On each node, identify the data disk
lsblk

# Format the disk (example: /dev/sdb - verify your disk name!)
sudo mkfs.ext4 /dev/sdb

# Create mount point
sudo mkdir -p /mnt/k3s-data

# Add to fstab for persistence
echo '/dev/sdb /mnt/k3s-data ext4 defaults 0 0' | sudo tee -a /etc/fstab

# Mount the disk
sudo mount -a

# Verify
df -h /mnt/k3s-data
```

#### For control plane nodes with large storage requirements (using LVM):

If your control plane node needs flexible storage management (e.g., for backups, persistent volumes):

```bash
# Check available volume groups
sudo vgs

# Create a logical volume (example: 1.4 TB from an existing volume group)
sudo lvcreate -L 1400G -n k3s-data ubuntu-vg

# Format with XFS for better performance with large files
sudo mkfs.xfs /dev/ubuntu-vg/k3s-data

# Create the mount point
sudo mkdir -p /mnt/k3s-data

# Add to fstab using the UUID for reliability
UUID=$(sudo blkid -s UUID -o value /dev/ubuntu-vg/k3s-data)
echo "UUID=$UUID /mnt/k3s-data xfs defaults 0 2" | sudo tee -a /etc/fstab

# Mount
sudo mount -a

# Verify
df -h /mnt/k3s-data
lsblk
```
:::tip[Storage Recommendations]

  • XFS: better for large files and high-I/O workloads (recommended for nodes hosting databases or large datasets)
  • ext4: general purpose; a good default for most workloads
  • Leave space for expansion: don't allocate 100% of available storage, so volumes can grow later
  • Consistent paths: use the same mount point (/mnt/k3s-data) on all nodes

:::

Step 2.2: Verify Network Interfaces

Ensure you're using the correct network interface names in your configuration:

```bash
# On each node, list network interfaces
ip addr show
```

Common interface names:

  • ens192, ens160 (VMware)
  • eth0, eth1 (AWS, some bare metal)
  • eno1, eno2 (Dell and HP servers)

Update flannel_iface and kube_vip_interface in your group_vars/all.yaml to match your actual interface names.
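If you'd rather script this check, Python's stdlib `socket.if_nameindex()` (Linux) lists the kernel's interface names; any value you put in `flannel_iface` or `kube_vip_interface` should appear in this list. A minimal sketch, with the expected interface name assumed from the example config:

```python
import socket

# Enumerate (index, name) pairs for all network interfaces on this host
interfaces = [name for _, name in socket.if_nameindex()]
print(interfaces)  # e.g. ['lo', 'ens192'] -- names vary per machine

# Interfaces referenced in group_vars/all.yaml must exist on every node
for wanted in ("ens192",):  # assumed example value; use your own
    if wanted not in interfaces:
        print(f"warning: interface {wanted!r} not found on this host")
```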

Step 3: Run Ansible Playbook

Deploy the K3s cluster:

```bash
ansible-playbook -i inventory.yml playbook.yaml
```

This will:

  1. Install K3s on all nodes
  2. Configure the control plane with high availability
  3. Deploy KubeVIP for API server load balancing
  4. Install and configure MetalLB for service load balancing
  5. Set up proper node labels and taints
:::warning[Known Issue: Multi-Master Join]

There is a known issue in nebari-k3s where additional master nodes may fail to join the cluster because the IP-filtering task returns multiple IPs. If you encounter this:

  1. Check that the additional master nodes are running K3s:

     ssh user@node2 "sudo systemctl status k3s"
  2. Verify they can reach the first master node:

     ssh user@node2 "curl -k https://192.168.1.101:6443/ping"
  3. If a node is running but has not joined, manually re-run the join command on that node, or investigate the Ansible task that filters the flannel interface IP.

:::

Step 4: Sync Kubeconfig

After the playbook completes, sync the kubeconfig to your local machine:

```bash
# Set environment variables
export SSH_USER="root"              # Default: root (change if using a different user)
export SSH_HOST="192.168.1.101"     # IP of any master node
export SSH_KEY_FILE=~/.ssh/id_rsa   # Unquoted so the shell expands ~

# Sync kubeconfig
make kubeconfig-sync
```

Verify cluster access:

```bash
kubectl get nodes -o wide
```

You should see all of your nodes in the Ready state.
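For scripted health checks, you can parse the `kubectl get nodes` output and fail if any node is not Ready. The sketch below is illustrative; the sample output is an assumption (in practice you would capture the real output, e.g. with `subprocess`):

```python
# Sample `kubectl get nodes` output; in practice capture it from the real command.
sample = """\
NAME    STATUS     ROLES                       AGE   VERSION
node1   Ready      control-plane,etcd,master   10m   v1.30.2+k3s2
node2   Ready      control-plane,etcd,master   9m    v1.30.2+k3s2
node4   NotReady   <none>                      8m    v1.30.2+k3s2
"""

def not_ready_nodes(output: str) -> list[str]:
    """Return names of nodes whose STATUS column is not exactly 'Ready'."""
    lines = output.strip().splitlines()[1:]  # skip the header row
    return [line.split()[0] for line in lines if line.split()[1] != "Ready"]

print(not_ready_nodes(sample))  # → ['node4']
```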

Step 5: Label Nodes for Nebari

Nebari requires specific node labels for scheduling workloads. For optimal resource utilization and proper workload distribution, use the recommended node-role.nebari.io/group label:

```bash
# Label control plane/general nodes
kubectl label nodes node1 node2 node3 \
  node-role.nebari.io/group=general

# Label user workload nodes
kubectl label nodes node4 \
  node-role.nebari.io/group=user

# Label Dask worker nodes
kubectl label nodes node5 node6 \
  node-role.nebari.io/group=worker
```
:::tip[Node Labeling Best Practices]

  • Consistent labeling: using node-role.nebari.io/group as the label key ensures consistent behavior across all Nebari components
  • Multiple roles: a node can carry multiple roles if needed (e.g., both user and worker on the same node)
  • Control plane nodes: typically labeled general to host core Nebari services
  • Resource optimization: proper labeling lets Horizontal Pod Autoscaling (HPA) fully utilize your cluster resources

:::

Alternative labeling schemes (legacy):

```bash
# These also work but are not recommended for new deployments
kubectl label nodes node1 node-role.kubernetes.io/general=true
```

Verify your labels:

```bash
kubectl get nodes --show-labels
```
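If you have many nodes, it can be convenient to generate the label commands from a mapping. This helper is an illustration; the node-to-group mapping mirrors the six-node example cluster above, not your real one:

```python
# Map Nebari scheduling groups to node names (assumed example layout)
groups = {
    "general": ["node1", "node2", "node3"],
    "user": ["node4"],
    "worker": ["node5", "node6"],
}

LABEL_KEY = "node-role.nebari.io/group"

def label_commands(groups: dict[str, list[str]]) -> list[str]:
    """Build one `kubectl label` command per Nebari group."""
    return [
        f"kubectl label nodes {' '.join(nodes)} {LABEL_KEY}={group}"
        for group, nodes in groups.items()
    ]

for cmd in label_commands(groups):
    print(cmd)
```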

Step 6: Initialize Nebari Configuration

Now initialize Nebari for deployment on your existing cluster:

```bash
nebari init existing \
  --project my-nebari \
  --domain nebari.example.com \
  --auth-provider github
```

Step 7: Configure Nebari for Bare Metal

Edit the generated nebari-config.yaml to configure it for your K3s cluster:

```yaml
project_name: my-nebari
provider: existing
domain: nebari.example.com

certificate:
  type: lets-encrypt
  acme_email: admin@example.com
  acme_server: https://acme-v02.api.letsencrypt.org/directory

security:
  authentication:
    type: GitHub
    config:
      client_id: <github-oauth-app-client-id>
      client_secret: <github-oauth-app-client-secret>
      oauth_callback_url: https://nebari.example.com/hub/oauth_callback

local:
  # Specify the kubectl context name from your kubeconfig
  kube_context: "default"  # Or the context name from your K3s cluster

  # Configure node selectors to match your labeled nodes
  node_selectors:
    general:
      key: node-role.nebari.io/group
      value: general
    user:
      key: node-role.nebari.io/group
      value: user
    worker:
      key: node-role.nebari.io/group
      value: worker

# Configure default profiles
profiles:
  jupyterlab:
    - display_name: Small Instance
      description: 2 CPU / 8 GB RAM
      default: true
      kubespawner_override:
        cpu_limit: 2
        cpu_guarantee: 1.5
        mem_limit: 8G
        mem_guarantee: 5G

    - display_name: Medium Instance
      description: 4 CPU / 16 GB RAM
      kubespawner_override:
        cpu_limit: 4
        cpu_guarantee: 3
        mem_limit: 16G
        mem_guarantee: 10G

  dask_worker:
    Small Worker:
      worker_cores_limit: 2
      worker_cores: 1.5
      worker_memory_limit: 8G
      worker_memory: 5G
      worker_threads: 2

    Medium Worker:
      worker_cores_limit: 4
      worker_cores: 3
      worker_memory_limit: 16G
      worker_memory: 10G
      worker_threads: 4

# Optional: Configure storage class
# default_storage_class: local-path  # K3s default storage class
```

Important Configuration Notes

Kubernetes Context

The kube_context field must match a context name in your kubeconfig. To list the available contexts:

```bash
kubectl config get-contexts
```

Use the name from the NAME column of the output.

Node Selectors

Node selectors tell Nebari where to schedule different types of workloads:

  • general: Core Nebari services (JupyterHub, monitoring, etc.)
  • user: User JupyterLab pods
  • worker: Dask worker pods for distributed computing

Make sure the key and value match the labels you applied to your nodes in Step 5.
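A selector/label mismatch is the most common cause of Pending pods, so it's worth checking mechanically. The sketch below is illustrative (the label data is hard-coded rather than fetched from the cluster): it verifies that every node_selectors entry is satisfied by at least one labeled node.

```python
# node_selectors as configured in nebari-config.yaml (example values)
node_selectors = {
    "general": ("node-role.nebari.io/group", "general"),
    "user": ("node-role.nebari.io/group", "user"),
    "worker": ("node-role.nebari.io/group", "worker"),
}

# Labels per node, as you would see from `kubectl get nodes --show-labels`
node_labels = {
    "node1": {"node-role.nebari.io/group": "general"},
    "node4": {"node-role.nebari.io/group": "user"},
    "node5": {"node-role.nebari.io/group": "worker"},
}

def unmatched_selectors(selectors, labels):
    """Return selector groups that no node currently satisfies."""
    return [
        group
        for group, (key, value) in selectors.items()
        if not any(nl.get(key) == value for nl in labels.values())
    ]

print(unmatched_selectors(node_selectors, node_labels))  # → []
```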

Step 8: Deploy Nebari

Deploy Nebari to your K3s cluster:

```bash
nebari deploy --config nebari-config.yaml
```

During deployment, you'll be prompted to update your DNS records. Add an A record pointing your domain to one of the MetalLB IP addresses.

Step 9: Verify Deployment

Once deployment completes, verify all components are running:

```bash
kubectl get pods -A
kubectl get ingress -A
```

Access Nebari at https://nebari.example.com and log in with your configured authentication provider.

Troubleshooting

Pods Not Scheduling

If pods remain in Pending state:

```bash
kubectl describe pod <pod-name> -n <namespace>
```

Common issues:

  • Node selector mismatch: Verify labels match between nebari-config.yaml and actual node labels
  • Insufficient resources: Ensure nodes have enough CPU/memory available
  • Taints: Check if nodes have taints that prevent scheduling

LoadBalancer Services Pending

If services of type LoadBalancer remain in Pending state:

```bash
kubectl get svc -A | grep LoadBalancer
```

Verify MetalLB is running:

```bash
kubectl get pods -n metallb-system
```

Check MetalLB configuration:

```bash
kubectl get ipaddresspool -n metallb-system
kubectl get l2advertisement -n metallb-system
```

API Server Unreachable

If you cannot connect to the cluster:

  1. Verify KubeVIP is running on control plane nodes:

    ssh ubuntu@192.168.1.101 "sudo k3s kubectl get pods -n kube-system | grep kube-vip"
  2. Check if the virtual IP is responding:

    ping 192.168.1.100
  3. Verify the network interface is correct in your inventory configuration

Storage Considerations

K3s includes a default local-path storage provisioner that works well for development. For production:

  • Local storage: K3s local-path provisioner (default)
  • Network storage: Configure NFS, Ceph, or other storage classes
  • Cloud storage: If running in a hybrid environment, configure cloud CSI drivers

Example NFS storage class configuration:

```yaml
# Add to nebari-config.yaml under theme.jupyterhub
storage_class_name: nfs-client
```


Migrating Existing User Data

If you're migrating from an existing system (e.g., Slurm cluster), you can pre-populate user data:

  1. Copy data to the storage node (typically a control plane node with large storage):

    # From old system to new K3s storage
    rsync -avhP -e ssh /old/home/ user@k3s-node:/mnt/k3s-data/backup/home/
  2. Note about user IDs: User IDs in JupyterHub pods may differ from your existing system. After Nebari deployment:

    • Check the UID used by JupyterHub: kubectl exec -it jupyter-<username> -- id
    • Adjust file ownership if needed:
      # On the storage node
      sudo chown -R <jupyter-uid>:<jupyter-gid> /mnt/k3s-data/backup/home/<username>
  3. Create persistent volume for user data (if using custom storage):

    ```yaml
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: user-data-pv
    spec:
      capacity:
        storage: 1000Gi
      accessModes:
        - ReadWriteMany
      hostPath:
        path: /mnt/k3s-data/users
    ```
:::tip[User Data Best Practices]

  • Test the data migration with a single user first
  • Verify that file permissions match the JupyterHub pod UIDs
  • Consider using NFS or similar for multi-node access to user data
  • Keep backups of the original data during migration

:::
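Before running a recursive chown, a dry run that lists mismatched files is safer. This sketch is illustrative (the path and UID in the commented-out example are assumptions): it reports files under a directory whose owner differs from the expected JupyterHub UID, without changing anything.

```python
import os

def files_with_wrong_owner(root: str, expected_uid: int) -> list[str]:
    """Walk `root` and return paths whose owner UID differs from expected_uid."""
    mismatched = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.stat(path).st_uid != expected_uid:
                mismatched.append(path)
    return mismatched

# Example dry run: check migrated home directories against the pod UID
# (assumed path and UID -- substitute the values from your own cluster):
# for path in files_with_wrong_owner("/mnt/k3s-data/backup/home", 1000):
#     print(path)
```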

Scaling Your Cluster

Adding Worker Nodes

  1. Add new nodes to your Ansible inventory
  2. Run the playbook targeting only new nodes:
    ansible-playbook -i inventory.yml playbook.yaml --limit new-node
  3. Label the new nodes for Nebari workloads

Upgrading K3s

To upgrade your K3s cluster:

  1. Update k3s_version in your inventory
  2. Run the playbook:
    ansible-playbook -i inventory.yml playbook.yaml
:::warning

Test upgrades in a non-production environment first, and always back up your data before upgrading.

:::

Next Steps

Additional Resources