How to Set Up Failover for DigitalOcean Static Routes Operator


Introduction

The static routes operator gives you more flexibility and control over network traffic in your Kubernetes environment. It lets you route traffic for specific destination IP addresses through a gateway of your choice, so you can adapt the routing configuration to your application's requirements. The operator is deployed as a DaemonSet, so it runs on every node of your DigitalOcean Managed Kubernetes cluster and manages that node's routing table.

In this tutorial, you will learn how to manage the routing table of each worker node based on the StaticRoute custom resource (CRD) specification, and how to set up a failover gateway.

Prerequisites
  • A DigitalOcean Managed Kubernetes cluster that you have access to.
  • The kubectl CLI installed on your local machine and configured to point to your DigitalOcean Managed Kubernetes cluster.
  • Two or more NAT gateway (NAT GW) Droplets, configured as detailed here.
  • A mechanism for detecting failures of the gateway Droplet. Tailor it to your needs so that it detects real outages accurately with minimal false alarms. For example, use monitoring services such as Prometheus or Nagios, set up health check endpoints on the Droplet, and use alerting tools such as Alertmanager for notifications. A monitoring stack from the DigitalOcean Marketplace can serve this purpose.
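As a starting point for failure detection, here is a minimal sketch of a reachability check. It makes several assumptions:

* It runs on a host inside the same VPC as the gateway Droplets.
* ICMP to the gateway's private IP address is allowed.
* `ping` is the Linux iputils version, where `-W` is a timeout in seconds.

The function name `gateway_is_down` is hypothetical, and so are its defaults of three probes five seconds apart. A production setup would rely on the monitoring tools mentioned above.

```bash
# Hypothetical helper: reports a gateway as down only after several
# consecutive failed probes, which reduces false alarms from a single lost packet.
#
# Usage: gateway_is_down GATEWAY_IP [ATTEMPTS] [INTERVAL_SECONDS]
# Returns 0 (down) if every probe fails, 1 (up) as soon as one probe succeeds.
gateway_is_down() {
  local gateway_ip="$1" attempts="${2:-3}" interval="${3:-5}" i=1
  while [ "$i" -le "$attempts" ]; do
    # Assumption: Linux iputils ping (-c 1 sends one request, -W 2 waits up to 2 seconds).
    if ping -c 1 -W 2 "$gateway_ip" >/dev/null 2>&1; then
      return 1
    fi
    # Pause between probes, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
  return 0
}
```

For example, `gateway_is_down 10.116.0.4 && echo "primary gateway is down"` prints the message only when all probes fail.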

Below is the architecture diagram:

Deploying the Kubernetes static routes operator

Deploy the static routes operator to your DigitalOcean Managed Kubernetes cluster using kubectl. This example installs release v1.0.0; check the project's releases for newer versions:

```bash
kubectl apply -f https://raw.githubusercontent.com/digitalocean/k8s-staticroute-operator/main/releases/v1/k8s-staticroute-operator-v1.0.0.yaml
```
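If you script the installation, you may want to wait until the operator DaemonSet has a ready Pod on every node before continuing. The following is a minimal sketch that assumes the DaemonSet name `k8s-staticroute-operator` and the `static-routes` namespace used elsewhere in this tutorial. The helper names `operator_ready` and `wait_for_operator` are hypothetical.

```bash
# Hypothetical helper: succeeds when the operator DaemonSet reports as many
# ready Pods as it wants scheduled (one per node).
operator_ready() {
  local desired ready
  desired=$(kubectl get daemonset k8s-staticroute-operator -n static-routes \
    -o jsonpath='{.status.desiredNumberScheduled}') || return 1
  ready=$(kubectl get daemonset k8s-staticroute-operator -n static-routes \
    -o jsonpath='{.status.numberReady}') || return 1
  [ -n "$desired" ] && [ "$desired" -gt 0 ] && [ "$desired" = "$ready" ]
}

# Polls every 5 seconds for up to TIMEOUT_SECONDS (default 120).
wait_for_operator() {
  local timeout="${1:-120}" waited=0
  until operator_ready; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "operator not ready after ${timeout}s" >&2
      return 1
    fi
    sleep 5
    waited=$((waited + 5))
  done
  echo "operator ready"
}
```

kubectl also offers a built-in alternative: `kubectl rollout status daemonset/k8s-staticroute-operator -n static-routes` waits for the DaemonSet rollout to finish.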
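
Check that the operator Pods are up and running:

```bash
kubectl get pods -n static-routes
```

Because the operator is a DaemonSet, each node should run one k8s-staticroute-operator Pod in the Running state.

Once StaticRoute objects exist (for example, after you apply the test manifests from the Testing the setup section), you can list them:

```bash
kubectl get staticroutes -o wide
```

The output looks similar to the following:

```
NAME                       AGE    DESTINATIONS      GATEWAY
static-route-ifconfig.me   119s   ["XX.XX.XX.XX"]   XX.XX.XX.XX
static-route-ipinfo.io     111s   ["XX.XX.XX.XX"]   XX.XX.XX.XX
```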

Next, check the operator logs. No exceptions should be reported:

```bash
kubectl logs -f ds/k8s-staticroute-operator -n static-routes
```

The output should look similar to the following:

```
Found 2 pods, using pod/k8s-staticroute-operator-498vv
[2023-05-15 14:12:32,282] kopf._core.reactor.r [DEBUG ] Starting Kopf 1.35.6.
[2023-05-15 14:12:32,282] kopf._core.engines.a [INFO ] Initial authentication has been initiated.
[2023-05-15 14:12:32,283] kopf.activities.auth [DEBUG ] Activity 'login_via_pykube' is invoked.
[2023-05-15 14:12:32,285] kopf.activities.auth [DEBUG ] Pykube is configured in cluster with service account.
[2023-05-15 14:12:32,286] kopf.activities.auth [INFO ] Activity 'login_via_pykube' succeeded.
[2023-05-15 14:12:32,286] kopf.activities.auth [DEBUG ] Activity 'login_via_client' is invoked.
[2023-05-15 14:12:32,287] kopf.activities.auth [DEBUG ] Client is configured in cluster with service account.
[2023-05-15 14:12:32,288] kopf.activities.auth [INFO ] Activity 'login_via_client' succeeded.
[2023-05-15 14:12:32,288] kopf._core.engines.a [INFO ] Initial authentication has finished.
[2023-05-15 14:12:32,328] kopf._cogs.clients.w [DEBUG ] Starting the watch-stream for customresourcedefinitions.v1.apiextensions.k8s.io cluster-wide.
[2023-05-15 14:12:32,330] kopf._cogs.clients.w [DEBUG ] Starting the watch-stream for staticroutes.v1.networking.digitalocean.com cluster-wide.
```
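`kubectl logs ds/...` reads from only one of the DaemonSet's Pods, as the `Found 2 pods, using pod/...` line shows. To check the operator instance on every node, you can loop over all Pods in the namespace. This minimal sketch rests on two assumptions:

* The `static-routes` namespace contains only operator Pods.
* Problems are logged at the `ERROR` level or as Python tracebacks (Kopf is a Python framework).

The function name `operator_logs_clean` is hypothetical.

```bash
# Hypothetical helper: prints error lines from every operator Pod and
# succeeds only if none are found.
operator_logs_clean() {
  local pod clean=0
  for pod in $(kubectl get pods -n static-routes -o name); do
    # Match Kopf's "[ERROR ]" level marker and Python tracebacks.
    if kubectl logs "$pod" -n static-routes | grep -E '\[ERROR|Traceback'; then
      echo "errors found in $pod" >&2
      clean=1
    fi
  done
  return "$clean"
}
```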

The operator does not currently support true high availability (HA). To mitigate the impact of a gateway failure, keep a standby gateway ready and fail over to it when necessary. A failover keeps the service disruption short.

Suppose traffic to the destination IP address 34.160.111.145 is forwarded through the active (primary) gateway at 10.116.0.4. The corresponding StaticRoute is stored in the primary.yaml file:

```yaml
apiVersion: networking.digitalocean.com/v1
kind: StaticRoute
metadata:
  name: primary
spec:
  destinations:
    - "34.160.111.145"
  gateway: "10.116.0.4"
```

You also have a standby (secondary) gateway at 10.116.0.12 that is ready to handle traffic to the same destination IP address. Its StaticRoute definition is stored in the secondary.yaml file. It is identical to the primary definition except for the object name and the gateway IP address:

```yaml
apiVersion: networking.digitalocean.com/v1
kind: StaticRoute
metadata:
  name: secondary
spec:
  destinations:
    - "34.160.111.145"
  gateway: "10.116.0.12"
```
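Because primary.yaml and secondary.yaml differ only in the object name and the gateway IP address, you can generate both from a single template. Here is a minimal sketch using a hypothetical `render_static_route` shell function. It only prints YAML with the structure shown above and does not validate it against the CRD.

```bash
# Hypothetical helper: prints a StaticRoute manifest for NAME and GATEWAY
# with one or more DESTINATION addresses.
# Usage: render_static_route NAME GATEWAY DESTINATION...
render_static_route() {
  local name="$1" gateway="$2" dest
  shift 2
  printf 'apiVersion: networking.digitalocean.com/v1\n'
  printf 'kind: StaticRoute\n'
  printf 'metadata:\n'
  printf '  name: %s\n' "$name"
  printf 'spec:\n'
  printf '  destinations:\n'
  for dest in "$@"; do
    printf '    - "%s"\n' "$dest"
  done
  printf '  gateway: "%s"\n' "$gateway"
}

# Example: generate the two manifests used in this tutorial.
# render_static_route primary 10.116.0.4 34.160.111.145 > primary.yaml
# render_static_route secondary 10.116.0.12 34.160.111.145 > secondary.yaml
```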

The actual failover process involves the following steps:

  • Detect that the active gateway at 10.116.0.4 is down.
  • Delete the currently active StaticRoute.
  • Apply the standby StaticRoute.
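Once your monitoring system reports that the primary gateway is down (step 1), you can script steps 2 and 3 from the list above. This is a minimal sketch with a hypothetical function name. It assumes primary.yaml and secondary.yaml are in the current directory, and it uses a default settling time of 60 seconds, the upper end of the window recommended below. Between the delete and the apply, traffic to the destination follows each node's default route, so keep this window short.

```bash
# Hypothetical failover helper for steps 2 and 3. Step 1 (detection) is
# left to your monitoring system, which would call this function.
# Usage: failover_to_secondary [PRIMARY_MANIFEST] [SECONDARY_MANIFEST] [SETTLE_SECONDS]
failover_to_secondary() {
  local primary="${1:-primary.yaml}" secondary="${2:-secondary.yaml}" settle="${3:-60}"

  # Step 2: delete the active StaticRoute so every operator instance removes
  # the route from its node. --ignore-not-found makes a retry safe.
  kubectl delete -f "$primary" --ignore-not-found || return 1

  # Give the operator instances time to process the deletion.
  sleep "$settle"

  # Step 3: apply the standby StaticRoute so the operators install the new route.
  kubectl apply -f "$secondary"
}
```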

Delete Active StaticRoute

Delete the currently active StaticRoute:

```bash
kubectl delete -f primary.yaml
```

Wait 30 to 60 seconds to give each operator instance enough time to process the deletion, that is, to remove the corresponding route from its node.

Apply Standby StaticRoute

Enable the secondary StaticRoute:

```bash
kubectl apply -f secondary.yaml
```

Each operator instance should pick up the new StaticRoute and add the corresponding routing table entry on its node. After that, the failover is complete.
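After applying secondary.yaml, you can confirm that the cluster's API objects reflect the failover. The primary StaticRoute should be gone, and the secondary one should point at the standby gateway. This minimal sketch uses a hypothetical function name and the object names and IP address from this tutorial. It checks only the Kubernetes objects, not the node routing tables; the curl test in the next section confirms the actual traffic path.

```bash
# Hypothetical check: the primary StaticRoute no longer exists and the
# secondary StaticRoute uses the expected gateway (default 10.116.0.12).
verify_failover() {
  local expected_gateway="${1:-10.116.0.12}" gateway
  if kubectl get staticroute primary >/dev/null 2>&1; then
    echo "StaticRoute 'primary' still exists" >&2
    return 1
  fi
  gateway=$(kubectl get staticroute secondary -o jsonpath='{.spec.gateway}') || return 1
  if [ "$gateway" != "$expected_gateway" ]; then
    echo "secondary gateway is '$gateway', expected '$expected_gateway'" >&2
    return 1
  fi
  echo "failover to $gateway is in place"
}
```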

Testing the setup

The test uses two StaticRoute objects, one for each of two websites that report your public IP address: ifconfig.me/ip and ipinfo.io/ip. A typical static route definition looks like this:

```yaml
apiVersion: networking.digitalocean.com/v1
kind: StaticRoute
metadata:
  name: static-route-ifconfig.me
spec:
  destinations:
    - "34.160.111.145"
  gateway: "10.116.0.5"
```

To test the setup, download the sample manifests for ifconfig.me and ipinfo.io from the operator repository's examples directory:

```bash
curl -O https://raw.githubusercontent.com/digitalocean/k8s-staticroute-operator/main/examples/static-route-ifconfig.me.yaml
curl -O https://raw.githubusercontent.com/digitalocean/k8s-staticroute-operator/main/examples/static-route-ipinfo.io.yaml
```

After downloading the manifests, replace the <> placeholders in each manifest file with the values for your setup (for example, the gateway IP address of your NAT GW Droplet). Then, apply each manifest using kubectl:

```bash
kubectl apply -f static-route-ifconfig.me.yaml
kubectl apply -f static-route-ipinfo.io.yaml
```

Finally, check that requests from the curl-test Pod report your NAT gateway's public IP address for each route. This assumes that a Pod named curl-test, with curl installed, is running in the current namespace:

```bash
kubectl exec -it curl-test -- curl ifconfig.me/ip
kubectl exec -it curl-test -- curl ipinfo.io/ip
```

Use the same test to verify the failover. While the primary StaticRoute is active, the commands should return the public IP address of the primary NAT GW Droplet. After the failover to the secondary StaticRoute, they should return the public IP address of the secondary NAT GW Droplet.
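To repeat this check before and after a failover without reading the output by eye, you can compare the reported address with the one you expect. Here is a minimal sketch with a hypothetical function name. Like the commands above, it assumes a running curl-test Pod with curl installed. It omits `-it` because a script has no terminal to attach.

```bash
# Hypothetical helper: runs curl inside the curl-test Pod for each URL and
# compares the reported public IP address with EXPECTED_IP.
# Usage: check_egress_ip EXPECTED_IP URL...
check_egress_ip() {
  local expected="$1" url actual rc=0
  shift
  for url in "$@"; do
    # -s silences curl's progress output; tr strips the trailing newline.
    actual=$(kubectl exec curl-test -- curl -s "$url" | tr -d '[:space:]')
    if [ "$actual" = "$expected" ]; then
      echo "OK   $url -> $actual"
    else
      echo "FAIL $url -> $actual (expected $expected)"
      rc=1
    fi
  done
  return "$rc"
}
```

For example, run `check_egress_ip PRIMARY_PUBLIC_IP ifconfig.me/ip ipinfo.io/ip` before the failover, then run it again with the secondary Droplet's public IP address afterward. Replace the placeholder with your Droplet's actual address.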

Troubleshooting

Check the StaticRoute object. If an error occurs, first look for it in the StaticRoute's events for each node where the rule applies:

```bash
kubectl get staticroute <static-route-name> -o yaml
```

Check the logs. To dig deeper, check the static routes operator logs for errors:

```bash
kubectl logs -f ds/k8s-staticroute-operator -n static-routes
```
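To collect the object's details and its events in one step, you can combine `kubectl describe` with an event query filtered by object name. This minimal sketch uses a hypothetical function name. It assumes the operator records events on the StaticRoute object, as the troubleshooting advice above implies. It searches all namespaces because events for cluster-wide objects are not necessarily stored in your current namespace.

```bash
# Hypothetical helper: gathers the StaticRoute's details and related events.
# Usage: collect_route_diagnostics STATIC_ROUTE_NAME
collect_route_diagnostics() {
  local route="$1"
  echo "== StaticRoute $route =="
  # describe shows the object together with its recent events.
  kubectl describe staticroute "$route"
  echo "== Events referencing $route (all namespaces) =="
  kubectl get events -A --field-selector "involvedObject.name=$route"
}
```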

Uninstalling the operator

To remove the operator and its associated resources, run the following kubectl command. Make sure to use the same release version that you used during the installation:

```bash
kubectl delete -f https://raw.githubusercontent.com/digitalocean/k8s-staticroute-operator/main/releases/v1/k8s-staticroute-operator-v1.0.0.yaml
```

The output is similar to the following:

```
customresourcedefinition.apiextensions.k8s.io "staticroutes.networking.digitalocean.com" deleted
serviceaccount "k8s-staticroute-operator" deleted
clusterrole.rbac.authorization.k8s.io "k8s-staticroute-operator" deleted
clusterrolebinding.rbac.authorization.k8s.io "k8s-staticroute-operator" deleted
daemonset.apps "k8s-staticroute-operator" deleted
```

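If you now run the same curl commands, the output is the public IP address of the worker node that runs the curl-test Pod, because the static routes no longer send traffic through the NAT gateway:

```bash
kubectl exec -it curl-test -- curl ifconfig.me/ip
kubectl exec -it curl-test -- curl ipinfo.io/ip
```

Compare it with the worker nodes' public IP addresses, shown in the EXTERNAL-IP column:

```bash
kubectl get nodes -o wide
```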
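The curl output should match the external IP address of the specific node that runs curl-test, which `kubectl get nodes -o wide` does not single out. A minimal sketch that looks it up with kubectl's JSONPath output, assuming a Pod named curl-test in the current namespace; the function name is hypothetical:

```bash
# Hypothetical helper: prints the external IP address of the node that runs
# the curl-test Pod, for comparison with the curl output after uninstalling.
curl_test_node_ip() {
  local node
  node=$(kubectl get pod curl-test -o jsonpath='{.spec.nodeName}') || return 1
  kubectl get node "$node" \
    -o jsonpath='{.status.addresses[?(@.type=="ExternalIP")].address}'
}
```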

Conclusion

Even though the operator does not currently support true high availability (HA), implementing failover is a recommended way to minimize the impact of gateway failures. With a standby gateway ready to take over, you can significantly reduce the duration of service disruptions.

The exact implementation depends on your needs. Keeping a backup gateway in place and preparing the transition in advance helps you maintain reliable, uninterrupted service.
