Apache Airflow on AWS EKS with Spot Instances


Intro

Apache Airflow (or simply Airflow) is a platform to programmatically author, schedule, and monitor workflows.

Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.

Amazon Elastic Kubernetes Service (Amazon EKS) is a managed service that you can use to run Kubernetes on AWS without needing to install, operate, and maintain your own Kubernetes control plane or nodes. Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications.

Problem

Running Airflow on a single instance can be painful. On the other hand, running Airflow on EKS with On-Demand instances can be expensive. In this post, we'll focus on how to run Airflow worker nodes on EKS Spot instances.

Required Tools
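
Judging from the commands used in the rest of this post, you will need aws, eksctl, kubectl, docker, and git on your PATH. That list is inferred from the commands, not an official requirements list. A minimal sketch to check for them before you start (missing_tools is a hypothetical helper name):

```shell
# Prints the name of every tool from the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Tool names are inferred from the commands in this post.
missing=$(missing_tools aws eksctl kubectl docker git)
if [ -n "$missing" ]; then
  echo "Not found on PATH:" $missing
else
  echo "All tools found."
fi
```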

AWS EKS Setup

I won't go into much detail about the EKS setup in this article. I'll set up a basic EKS cluster with two node groups, one On-Demand and one Spot.
If you want to learn more about the EKS setup, you can visit this article.

💡
I assume you already configured aws-cli.

Before starting the installation, set a few environment variables.

export AWS_REGION="<your-aws-region>"             # e.g. eu-central-1
export ACCOUNT_ID="<your-aws-account-id>"
export EKS_CLUSTER_NAME="<your-eks-cluster-name>"
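
The cluster config in the next step is written through an unquoted heredoc, so an empty variable would silently produce an empty name or region in the file. As a rough sketch, a small guard can catch that early (require_vars is a hypothetical helper, not part of the repository):

```shell
# Returns non-zero and names each variable that is unset or empty.
require_vars() {
  status=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Environment variable $name is not set" >&2
      status=1
    fi
  done
  return "$status"
}

# Check the three variables before writing ekscluster.yaml.
if require_vars AWS_REGION ACCOUNT_ID EKS_CLUSTER_NAME; then
  echo "All variables are set."
else
  echo "Set the missing variables, then retry."
fi
```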

Create a cluster config YAML file for eksctl.

cat << EOF > ekscluster.yaml
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
 name: ${EKS_CLUSTER_NAME}
 region: ${AWS_REGION}
 version: "1.24"

availabilityZones: ["${AWS_REGION}a", "${AWS_REGION}b", "${AWS_REGION}c"]

managedNodeGroups:
- name: on-demand-node-group
  desiredCapacity: 2
  minSize: 2
  maxSize: 4
  labels:
    lifecycle: OnDemand
  iam:
    withAddonPolicies:
      autoScaler: true
 
- name: spot-node-group
  desiredCapacity: 2
  minSize: 1
  maxSize: 4
  instanceTypes: ["m5.large","m4.large","m5a.large"]
  spot: true
  labels:
    lifecycle: Ec2Spot
EOF

Create the cluster from the YAML file with eksctl.

eksctl create cluster -f ekscluster.yaml 

You can follow the cluster creation status in the AWS Console. After a successful installation, running

kubectl get nodes --label-columns=lifecycle --selector=lifecycle=Ec2Spot

returns output similar to the following:

NAME                                            STATUS   ROLES    AGE     VERSION               LIFECYCLE
ip-10-40-41-184.eu-central-1.compute.internal   Ready    <none>   5h32m   v1.24.7-eks-fb459a0   Ec2Spot
ip-10-40-69-108.eu-central-1.compute.internal   Ready    <none>   3h8m    v1.24.7-eks-fb459a0   Ec2Spot

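Next, we need to taint the Spot instances with PreferNoSchedule. This is a soft taint: the scheduler tries to avoid placing pods that do not tolerate it on Spot instances, but it does not guarantee it. If you need a hard rule, use the NoSchedule effect instead.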

# Taint every node labelled lifecycle=Ec2Spot
for node in $(kubectl get nodes --selector=lifecycle=Ec2Spot -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}'); do
  kubectl taint nodes "$node" spotInstance=true:PreferNoSchedule
done
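
kubectl taint also accepts a label selector, so the loop above can be collapsed into a single command. The sketch below wraps that in a hypothetical helper, taint_spot_nodes, that validates the taint effect first. It only prints the command while DRY_RUN=1 is set (DRY_RUN is a variable made up for this example):

```shell
# Taints all nodes labelled lifecycle=Ec2Spot with the given effect.
# Valid effects are NoSchedule, PreferNoSchedule, and NoExecute.
# Note: NoExecute also evicts running pods that do not tolerate the taint.
taint_spot_nodes() {
  effect="${1:-PreferNoSchedule}"
  case "$effect" in
    NoSchedule|PreferNoSchedule|NoExecute) ;;
    *) echo "unknown taint effect: $effect" >&2; return 1 ;;
  esac
  set -- kubectl taint nodes -l lifecycle=Ec2Spot "spotInstance=true:${effect}" --overwrite
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$@"    # print the command instead of running it
  else
    "$@"
  fi
}

# Print the command for a hard taint without touching the cluster.
DRY_RUN=1 taint_spot_nodes NoSchedule
```

Keep in mind that the tolerations on the worker pods must name the same effect as the taint, so switching to NoSchedule means updating them as well.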

After this operation, you may want to inspect one of the Spot instances. To do that, run

kubectl describe node "your spot node name"

which returns output similar to the following:

Name:               ip-10-40-41-184.eu-central-1.compute.internal
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m4.large
                    beta.kubernetes.io/os=linux
                    eks.amazonaws.com/capacityType=SPOT
                    eks.amazonaws.com/nodegroup-image=ami-06fe8fd245108ccc9
CreationTimestamp:  Wed, 15 Feb 2023 14:01:48 +0300
.....

Install Apache Airflow

To build the Airflow image and bootstrap the components, you need to download some scripts from my repository.

git clone git@github.com:habil/habil-dev-blog.git
cd habil-dev-blog/airflow-on-eks-spot-instance/scripts
./setup_infra.sh

This script will install

  • Cluster AutoScaler
  • EFS CSI Driver

and create

  • Amazon ECR repository
  • EFS filesystem
  • EFS access point
  • Mount points in three Availability Zones
💡
This article does not set up an RDS instance. If you want to create one, uncomment the related lines in setup_infra.sh. If you already have one, just change line 29 of ../kube/secrets.yaml.

With the infrastructure in place, we can install Airflow. First, we need to build and publish our Airflow image.

cd docker

aws ecr get-login-password \
  --region $AWS_REGION | \
  docker login \
  --username AWS \
  --password-stdin \
  $AIRFLOW_REPOSITORY
    
docker build -t $AIRFLOW_REPOSITORY .
docker push ${AIRFLOW_REPOSITORY}:latest
💡
I added a code editor plugin to the Airflow container as a bonus.
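
The build and push commands above read the image repository from AIRFLOW_REPOSITORY, which this post never sets explicitly. Presumably it holds the URI of the ECR repository created by setup_infra.sh. If it is empty in your shell, a sketch like the following can rebuild it from the usual ECR URI layout for standard commercial regions. The repository name airflow-eks-demo and the fallback account ID and region are placeholders, not values from the repository:

```shell
# Builds an ECR repository URI: <account>.dkr.ecr.<region>.amazonaws.com/<name>
ecr_repo_uri() {
  echo "${1}.dkr.ecr.${2}.amazonaws.com/${3}"
}

# "airflow-eks-demo" is a hypothetical name; use the repository that
# setup_infra.sh actually created. The fallbacks are placeholders too.
AIRFLOW_REPOSITORY="$(ecr_repo_uri "${ACCOUNT_ID:-123456789012}" "${AWS_REGION:-eu-central-1}" "airflow-eks-demo")"
export AIRFLOW_REPOSITORY
echo "$AIRFLOW_REPOSITORY"
```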

Start the installation with deploy.sh.

cd ../kube
./deploy.sh

After a successful installation, you can get the URL of the Airflow login page with

echo "http://$(kubectl get service airflow -n airflow \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'):8080/login"

Assigning Airflow workers to Spot instances

If you look at the configmap applied in the previous step, you can see the nodeAffinity and tolerations that Airflow will apply to every worker pod it creates.
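
To make that concrete, here is a rough sketch of what such a pod spec can look like, written to a file with a heredoc. It is not copied from the repository's configmap: the file name, pod name, and busybox image are placeholders, and whether the real config uses a required or a preferred node affinity is an assumption. The label and taint match the ones set up earlier in this post (lifecycle=Ec2Spot and spotInstance=true:PreferNoSchedule).

```shell
# Hypothetical file name; the real settings live in the repository's configmap.
cat << 'EOF' > spot-worker-sketch.yaml
apiVersion: v1
kind: Pod
metadata:
  name: spot-worker-sketch        # hypothetical name
  namespace: airflow
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: only nodes labelled lifecycle=Ec2Spot qualify.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: lifecycle
            operator: In
            values: ["Ec2Spot"]
  tolerations:
  # Tolerates the spotInstance=true:PreferNoSchedule taint on Spot nodes.
  - key: spotInstance
    operator: Equal
    value: "true"
    effect: PreferNoSchedule
  containers:
  - name: worker
    image: busybox                # placeholder image, not the Airflow image
    command: ["sleep", "3600"]
EOF
```

The toleration alone only allows a pod onto the tainted nodes. It is the node affinity that actually steers the pod toward the Spot node group. You can validate the file without creating anything by running kubectl apply --dry-run=client -f spot-worker-sketch.yaml.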

Testing

💡
If you want to import the sample DAGs, uncomment line 30 of ../kube/build/configmaps.yml.

Trigger any DAG, then list the pods running on each Spot node:

for node in $(kubectl get nodes \
  --selector=lifecycle=Ec2Spot \
  -o=jsonpath='{.items[*].metadata.name}')
do (kubectl get pods -n airflow \
   --field-selector spec.nodeName="$node")
done
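
The loop above shows what runs on the Spot nodes. The reverse question, which pods ended up somewhere else, can be answered with a small filter. This is only a sketch: pods_off_spot is a hypothetical helper, and the demo at the end uses made-up pod and node names so it runs without a cluster.

```shell
# Reads "pod node" pairs on stdin and prints each pod whose node is not
# among the Spot node names passed as arguments.
pods_off_spot() {
  spot_nodes=" $* "
  while read -r pod node; do
    [ -n "$pod" ] || continue
    case "$spot_nodes" in
      *" $node "*) ;;          # pod runs on a Spot node
      *) echo "$pod" ;;        # pod runs somewhere else
    esac
  done
}

# Against a real cluster, the input could come from:
#   spot=$(kubectl get nodes -l lifecycle=Ec2Spot -o jsonpath='{.items[*].metadata.name}')
#   kubectl get pods -n airflow --no-headers \
#     -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName | pods_off_spot $spot

# Demo with made-up pod and node names; no cluster needed.
printf '%s\n' 'worker-a node-1' 'scheduler-x node-9' | pods_off_spot node-1 node-2
```

Pods other than workers, such as the scheduler or the webserver, may legitimately show up in this list. Only worker pods are expected on the Spot nodes.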

Conclusion

With this setup, whether your workflow is an ETL job, a media processing pipeline, or a machine learning workload, an Airflow worker runs it. If you run these workers on Spot instances, you can reduce their compute cost by up to 90% compared to On-Demand pricing.
