Healing Rancher when a cluster node changes external IP

Whenever one of your Rancher cluster nodes changes its external IP (I am looking at you, AWS!), here is how you can fix it manually.

In this case the primary (and only) control plane node (aka the k8s master) got its external IP changed. Rancher v2.6.3, RKE v1.3.7.

Say your old IP was 2.22.222.222. First, find all the CRDs that still reference it:

$ rancher kubectl get crd | grep -v ^NAME | awk '{print $1}' | while read CRD; do echo === $CRD ===; rancher kubectl get $CRD -A -o yaml | grep '2.22.222.222'; done

=== clusters.management.cattle.io ===
    apiEndpoint: https://2.22.222.222:6443
        - address: 2.22.222.222
        "https://2.22.222.222:6443/api/v1/namespaces/kube-system?timeout=45s": context

=== clusters.provisioning.cattle.io ===
        "https://2.22.222.222:6443/api/v1/namespaces/kube-system?timeout=45s": context

=== nodes.management.cattle.io ===
      rke.cattle.io/external-ip: 2.22.222.222
      address: 2.22.222.222

1. Fix the external IP in nodes.management.cattle.io CRD

kubectl -n <cluster> edit nodes.management.cattle.io <your machine>

Replace all occurrences of 2.22.222.222 with your cluster node's new external IP and save.
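
For illustration, assuming the node's new external IP is 3.33.33.33 (a made-up value), the two lines that the search above flagged in this CRD should end up as:

      rke.cattle.io/external-ip: 3.33.33.33
      address: 3.33.33.33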

Expected lines in the logs:

$ kubectl -n cattle-system logs rancher-567cdddb7c-b77jj -f
2022/05/01 22:58:20 [INFO] Starting /v1, Kind=ConfigMap controller
2022/05/01 22:58:20 [INFO] Starting /v1, Kind=ConfigMap controller
2022/05/01 22:58:22 [INFO] Registering istio for cluster "<cluster>"
2022/05/01 22:58:22 [INFO] Starting cluster controllers for <cluster>
2022/05/01 22:58:22 [INFO] Starting management.cattle.io/v3, Kind=SamlToken controller
...

2. Fix the external IP in clusters.management.cattle.io CRD

Important: do this only after you have fixed the IP in the nodes.management.cattle.io CRD, otherwise provisioning will fail (see the etcd plane error below) and you will have to do both steps over again.

rancher kubectl edit clusters.management.cattle.io <cluster>

Replace all occurrences of 2.22.222.222 with your cluster node's new external IP and save.
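
Again with the made-up new IP 3.33.33.33, the lines that the earlier search flagged in this CRD should now read:

    apiEndpoint: https://3.33.33.33:6443
        - address: 3.33.33.33
        "https://3.33.33.33:6443/api/v1/namespaces/kube-system?timeout=45s": context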

Expected lines in the logs:

$ kubectl -n cattle-system logs rancher-567cdddb7c-b77jj -f
2022/05/01 22:53:10 [INFO] Provisioning cluster [<cluster>]
2022/05/01 22:53:10 [INFO] Updating cluster [<cluster>]
...
2022/05/01 22:53:38 [INFO] Updated cluster [<cluster>] with node version [13]
2022/05/01 22:53:38 [INFO] Provisioned cluster [<cluster>]
2022/05/01 22:53:38 [INFO] checking cluster [<cluster>] for worker nodes upgrade

To recap: fixing the IP in the nodes.management.cattle.io CRD gets the node registered within Rancher, and fixing the IP in the clusters.management.cattle.io CRD afterwards gets the node re-provisioned.
That's pretty much it!

If you get "Failed to reconcile etcd plane: Etcd plane nodes are replaced. Stopping provisioning. Please restore your cluster from backup." error in Rancher (either Rancher web UI or the rancher pod logs), that's because you have fixed the IPs in CRD's in a wrong order. Do not worry, you do not have to restore your cluster from the backup - simply fix the IP in CRD's again but in the right order.

After that your cluster should show as green in the Rancher UI.
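
If you prefer the command line, the Rancher CLI can list cluster states as well (assuming you are already logged in via rancher login):

$ rancher clusters ls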

If you are still seeing errors such as "Unknown schema for type: management.cattle.io.cluster" when opening your cluster in the Rancher UI, simply log out and log back in.

If the Rancher UI gives "503 Service Temporarily Unavailable", change replicas: 3 to replicas: 1 in the Rancher deployment:

kubectl -n cattle-system edit deployment rancher
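
Equivalently, you can scale it down without opening an editor:

kubectl -n cattle-system scale deployment rancher --replicas=1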

This might not be the optimal fix, but it worked for me since I only had a single Rancher node. You may prefer a different solution such as https://forums.rancher.com/t/solved-rancher-inaccessible-503-service-temporarily-unavailable/36865

When the cluster and the Rancher UI are back, don't forget to download a new kubeconfig for your cluster from the Rancher UI, since it will contain the new external IP as well as the updated x509 certificate (the certificate-authority-data field in the kubeconfig file).
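
As a quick sanity check with the freshly downloaded kubeconfig (the file path here is just an example):

$ kubectl --kubeconfig ~/Downloads/<cluster>.yaml get nodes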

If nothing helps yet, you might want to try bouncing the cattle agents on the node that got the new external IP. In order to run kubectl against that node, you will have to correct the kubeconfig: set the new external IP on the server: line, remove the certificate-authority-data: field entirely, and add insecure-skip-tls-verify: true at the same indent as server:.
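
For illustration, the clusters entry of the corrected kubeconfig would look roughly like this (3.33.33.33 is the made-up new IP, 6443 the usual kube-apiserver port):

clusters:
- name: <cluster>
  cluster:
    server: https://3.33.33.33:6443
    insecure-skip-tls-verify: true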
After that you can talk to your K8s cluster. Bounce the cattle agents there and you might need to repeat the CRD steps from above to push things forward:

kubectl -n cattle-system delete pods -l app=cattle-cluster-agent
kubectl -n cattle-system delete pods -l app=cattle-agent
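
To confirm the agents come back up, you can watch the pods in the cattle-system namespace until they are Running again:

kubectl -n cattle-system get pods -w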