Apache Airflow on AWS EKS with Spot Instances
Intro
Apache Airflow (or simply Airflow) is a platform to programmatically author, schedule, and monitor workflows.
Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
Also, Amazon Elastic Kubernetes Service (Amazon EKS) is a managed service that you can use to run Kubernetes on AWS without needing to install, operate, and maintain your own Kubernetes control plane or nodes. Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications.
Problem
Running Airflow on a single instance can be painful. On the other hand, running Airflow on EKS with only On-Demand instances can be expensive. In this post, we'll focus on how we manage Airflow worker nodes with EKS Spot instances.
Required Tools
Everything in this post uses the following command-line tools:
- AWS CLI
- eksctl
- kubectl
- Docker
- git
AWS EKS Setup
I'm not going to cover the EKS setup in much detail in this article. I'll set up a basic EKS cluster with two different node groups.
If you want to learn more details about the EKS setup, you can visit this article.
Just before starting the installation, we need to set some environment variables.
export AWS_REGION=#<-- enter your aws region
export ACCOUNT_ID=#<-- enter your aws account id
export EKS_CLUSTER_NAME=#<-- enter your eks cluster name
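Every command that follows assumes all three variables are set. As a small sketch (the function name `check_env` is mine, not from the repository), you can fail fast if any of them is missing:

```shell
# Sanity check: make sure the three variables above are set and
# non-empty before generating the eksctl config or calling the AWS CLI.
check_env() {
  missing=0
  [ -n "$AWS_REGION" ]       || { echo "missing: AWS_REGION" >&2; missing=1; }
  [ -n "$ACCOUNT_ID" ]       || { echo "missing: ACCOUNT_ID" >&2; missing=1; }
  [ -n "$EKS_CLUSTER_NAME" ] || { echo "missing: EKS_CLUSTER_NAME" >&2; missing=1; }
  return "$missing"
}
```

Running `check_env` returns non-zero and prints the missing variable names if anything was skipped.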
Create a cluster configuration YAML file for eksctl.
cat << EOF > ekscluster.yaml
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ${EKS_CLUSTER_NAME}
  region: ${AWS_REGION}
  version: "1.24"
availabilityZones: ["${AWS_REGION}a", "${AWS_REGION}b", "${AWS_REGION}c"]
managedNodeGroups:
  - name: on-demand-node-group
    desiredCapacity: 2
    minSize: 2
    maxSize: 4
    labels:
      lifecycle: OnDemand
    iam:
      withAddonPolicies:
        autoScaler: true
  - name: spot-node-group
    desiredCapacity: 2
    minSize: 1
    maxSize: 4
    instanceTypes: ["m5.large", "m4.large", "m5a.large"]
    spot: true
    labels:
      lifecycle: Ec2Spot
EOF
Create the cluster from the YAML file with eksctl.
eksctl create cluster -f ekscluster.yaml
You can monitor the cluster creation status from the AWS Console. After a successful installation,
kubectl get nodes --label-columns=lifecycle --selector=lifecycle=Ec2Spot
will give you something like below
NAME                                            STATUS   ROLES    AGE     VERSION               LIFECYCLE
ip-10-40-41-184.eu-central-1.compute.internal   Ready    <none>   5h32m   v1.24.7-eks-fb459a0   Ec2Spot
ip-10-40-69-108.eu-central-1.compute.internal   Ready    <none>   3h8m    v1.24.7-eks-fb459a0   Ec2Spot
Next, we need to taint the Spot instances with PreferNoSchedule. This is a soft taint: the scheduler will try to avoid placing pods that don't tolerate it on Spot instances.
for node in `kubectl get nodes --label-columns=lifecycle --selector=lifecycle=Ec2Spot -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}'` ; do
kubectl taint nodes $node spotInstance=true:PreferNoSchedule
done
After this operation, you may want to describe one of the Spot instances. To do that,
kubectl describe node "your spot node name"
will give you something like below
Name:               ip-10-40-41-184.eu-central-1.compute.internal
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m4.large
                    beta.kubernetes.io/os=linux
                    eks.amazonaws.com/capacityType=SPOT
                    eks.amazonaws.com/nodegroup-image=ami-06fe8fd245108ccc9
CreationTimestamp:  Wed, 15 Feb 2023 14:01:48 +0300
.....
Install Apache Airflow
In order to build the Airflow image and bootstrap the components, you need to download some scripts from my repository.
git clone git@github.com:habil/habil-dev-blog.git
cd airflow-on-eks-spot-instance/scripts
./setup_infra.sh
This script will install
- Cluster AutoScaler
- EFS CSI Driver
and create
- Amazon ECR repository
- EFS filesystem
- EFS access point
- Mount points in three Availability Zones
After infra setup, now we can install Airflow. In order to start, we need to build & publish our Airflow image.
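The commands below reference $AIRFLOW_REPOSITORY. setup_infra.sh should take care of it; if you need to set it yourself, here's a sketch of how the URI can be derived (the repository name `airflow` and the helper `airflow_repo_uri` are my assumptions — check the script's output for the actual name):

```shell
# Hypothetical helper: rebuild the ECR repository URI from the variables
# exported earlier. ECR repository URIs always follow the scheme
# <account-id>.dkr.ecr.<region>.amazonaws.com/<repo-name>.
# The repository name "airflow" is an assumption; use whatever name
# setup_infra.sh actually created.
airflow_repo_uri() {
  echo "${ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/airflow"
}
export AIRFLOW_REPOSITORY="$(airflow_repo_uri)"
```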
cd docker
aws ecr get-login-password \
--region $AWS_REGION | \
docker login \
--username AWS \
--password-stdin \
$AIRFLOW_REPOSITORY
docker build -t $AIRFLOW_REPOSITORY .
docker push ${AIRFLOW_REPOSITORY}:latest
Start the installation with the deploy.sh script.
cd ../kube
./deploy.sh
After a successful installation, you can reach the Airflow login page via
echo "http://$(kubectl get service airflow -n airflow \
  -o jsonpath="{.status.loadBalancer.ingress[].hostname}"):8080/login"
Assigning Airflow workers to Spot instances
If you look at the configmap applied earlier, you can see the nodeAffinity and tolerations that Airflow applies to every worker pod it creates.
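As an illustrative sketch (the exact keys and weights live in the repository's configmap, not here), the worker pod template combines a preferred node affinity for the Ec2Spot label with a toleration for the taint we added above:

```yaml
# Illustrative pod-spec fragment -- check the configmap in the
# repository for the actual values.
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
            - key: lifecycle
              operator: In
              values:
                - Ec2Spot
tolerations:
  - key: spotInstance
    operator: Equal
    value: "true"
    effect: PreferNoSchedule
```

Because the affinity is `preferred` rather than `required`, workers still schedule onto On-Demand nodes when no Spot capacity is available.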
Testing
Trigger any DAG and check the pods.
for node in $(kubectl get nodes \
--selector=lifecycle=Ec2Spot \
-o=jsonpath='{.items[*].metadata.name}')
do (kubectl get pods -n airflow \
--field-selector spec.nodeName="$node")
done
Conclusion
With this setup, whether your workflow is an ETL job, a media processing pipeline, or a machine learning workload, an Airflow worker runs it. If you run these workers on Spot instances, you can reduce the cost of running your Airflow cluster by up to 90%.
Sources:
- https://aws.amazon.com/blogs/containers/amazon-eks-now-supports-provisioning-and-managing-ec2-spot-instances-in-managed-node-groups/
- https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-best-practices.html
- https://aws.amazon.com/blogs/big-data/orchestrate-big-data-workflows-with-apache-airflow-genie-and-amazon-emr-part-1/
- https://aws.amazon.com/blogs/containers/running-airflow-workflow-jobs-on-amazon-eks-spot-nodes/