Setup Mini Data Lake and Platform on M1 Mac — Part 6

Jun Li
Jan 4, 2022

In this part, I'll walk you through installing the data-processing tools: Apache Spark and Apache Airflow.

Apache Spark

Download Spark 3.2.0 pre-built for Apache Hadoop 3.3 and later with Scala 2.13 from here, and unzip it to your local file system. In this article, I will show you how to run Spark on top of local MinIO using Kubernetes. It assumes you already have Docker Desktop with Kubernetes enabled, as described in Part 2.

1. Install Scala 2.13 and PySpark 3.2.0. You can also install delta-spark, which adds Delta Lake support.
$ brew install scala@2.13
$ pip install pyspark==3.2.0
$ pip install delta-spark==1.1.0

2. Set SPARK_HOME as an environment variable and add its bin folder to PATH in ~/.zshrc, then source the file.

export SPARK_HOME=<spark home folder>
export PATH=$PATH:$SPARK_HOME/bin

3. In iTerm2, navigate to $SPARK_HOME; we will build the Spark job runner images there. You can build two images: one for Java/Scala and one for PySpark.

First, modify the default Dockerfile for Kubernetes under $SPARK_HOME/kubernetes/dockerfiles/spark/. It uses JRE 11 by default; you can change that if you prefer another JRE version. You also need to create the folder ‘/tmp/spark-events’ in the Dockerfile, which is used for Spark event logs. See the full version of the Dockerfile for my case.

Dockerfile for kubernetes

Second, you can modify the Dockerfile for Python under $SPARK_HOME/kubernetes/dockerfiles/spark/bindings/python/. Since it builds on the image above, you don’t need to base it on another image. I changed the default Dockerfile a bit: I installed Python 3.8.12 from scratch, and also installed pandas and pyarrow. If you need more Python libraries for your PySpark jobs in the future, you can modify it and build new images, or provide your packages to the PySpark virtual environment at runtime. Here is the Dockerfile for Python in my version.

Dockerfile for kubernetes on spark-py

Once you have done that, you can build these two images together using Spark’s docker-image-tool, pointing it at the Python Dockerfile.

$ $SPARK_HOME/bin/docker-image-tool.sh -r <your docker repo> -t <your image tag> -p $SPARK_HOME/kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

Docker will then create two images: <docker repo>/spark:<tag> and <docker repo>/spark-py:<tag>. If you want to run both Scala and PySpark jobs, use the spark-py image; otherwise use the spark one.

4. Configure Kubernetes

We need to create a new service account ‘spark’ and bind it to the ‘edit’ cluster role.

$ kubectl create serviceaccount spark
$ kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default

Optionally, you can enable the Kubernetes dashboard.

$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.4/aio/deploy/recommended.yaml
$ kubectl get pods --all-namespaces
# it should be in kubernetes-dashboard namespace
# start kubernetes dashboard
$ kubectl proxy

You can use the default token to access the Kubernetes dashboard at http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/#/login.

$ kubectl -n kube-system describe secret default

5. Configure Spark

You can set Kubernetes as the default master and MinIO as the S3 endpoint in $SPARK_HOME/conf/spark-defaults.conf. See my version of spark-defaults.conf. For “spark.hadoop.fs.s3a.endpoint”, use your local network IP, such as “192.168.X.X” depending on your Wi-Fi router settings, rather than “localhost”.

spark-defaults.conf

To enable Spark to access MinIO/S3, you also need to put the aws-java-sdk-bundle and hadoop-aws jars into $SPARK_HOME/jars, which you can download from here via the ‘Download hadoop-aws.jar (3.3.1)’ link. You also need to place delta-core_2.13-1.1.0.jar into the jars folder, which you can download from here via the ‘Download delta-core_2.13.jar (1.1.0)’ link. This lets you work with Delta Lake if you need it for your development.
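If you prefer to see the settings spelled out, the same configuration can also be set programmatically instead of through spark-defaults.conf. Below is a minimal PySpark sketch, assuming the hadoop-aws, aws-java-sdk-bundle and delta-core jars from this step are already in $SPARK_HOME/jars; the endpoint IP, access key and secret key are placeholders for your own MinIO values.

from pyspark.sql import SparkSession

# Minimal sketch: the same kind of S3A/Delta settings spark-defaults.conf would carry,
# set programmatically. Replace the endpoint and credentials with your MinIO values.
spark = (
    SparkSession.builder.appName("minio-delta-check")
    .master("local[*]")  # run this quick check locally; spark-submit can still default to k8s
    # MinIO as the S3 endpoint; use your local network IP, not "localhost"
    .config("spark.hadoop.fs.s3a.endpoint", "http://192.168.1.10:9000")
    .config("spark.hadoop.fs.s3a.access.key", "<your minio access key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<your minio secret key>")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    # Delta Lake support (requires the delta-core jar in $SPARK_HOME/jars)
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Quick smoke test against a bucket that already exists in MinIO
spark.range(3).write.format("delta").mode("overwrite").save("s3a://delta/smoke_test")
spark.read.format("delta").load("s3a://delta/smoke_test").show()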

6. Start history server

You can start your Spark history server locally; the default event log directory is ‘/tmp/spark-events’. The history server can be accessed at http://localhost:18080 by default.

# If the folder doesn't exist, then create it.
$ mkdir /tmp/spark-events
$ $SPARK_HOME/sbin/start-history-server.sh

Test Runs

Once everything is installed and configured, we can do some test runs to verify the installation. First, run a spark-submit job with the example jar.

$ $SPARK_HOME/bin/spark-submit --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=5 file:///<spark home>/examples/jars/spark-examples_2.13-3.2.0.jar

It will automatically upload spark-examples_2.13-3.2.0.jar to MinIO at runtime and start Kubernetes pods to run the job. You can access the driver logs from the pod like this:

# The pod name should be in the pattern spark-pi-<random string>-driver
# You can find your pods in the default namespace like this
$ kubectl get pods
$ kubectl logs <pod name>

You can see the event logs in the history server if you run this job with deploy-mode set to ‘client’.

Second, likewise, we can test a spark-submit job with a Python example.

$ $SPARK_HOME/bin/spark-submit --deploy-mode cluster <spark home>/examples/src/main/python/pi.py 100

The result should be similar to the example jar run.

Third, we can launch a pyspark shell with Delta enabled. You should be able to see the event logs in the history server once it starts. Let’s test whether it can write some data in Delta format to MinIO. This example assumes you have created a bucket called ‘delta’ before running it.

$ pyspark --packages io.delta:delta-core_2.13:1.1.0,org.apache.hadoop:hadoop-aws:3.3.1 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
>>> spark.range(5).write.format("delta").save("s3a://delta/test_table")
>>> df = spark.read.format("delta").load("s3a://delta/test_table")
>>> df.show()
+---+
| id|
+---+
| 1|
| 2|
| 3|
| 0|
| 4|
+---+

Fourth, we can also test running on master ‘yarn’ rather than Kubernetes, since we already set up Hadoop and YARN in Part 3. Just override the master argument in the ‘spark-submit’ command. Once it’s done, you should be able to see the finished application in the Hadoop cluster UI at http://localhost:8088/cluster

$ spark-submit --deploy-mode cluster --master yarn --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=5 file:///<spark home>/examples/jars/spark-examples_2.13-3.2.0.jar

Airflow

Custom Airflow image

Airflow will be installed on top of Docker Desktop for Mac via ‘docker compose’. To make it work with Spark and local MinIO, we need to customize the official Airflow image. The image should include Airflow itself, Python 3.8, JDK 8, and Spark 3.2 with hadoop-aws support for the S3 file system. I created a Docker image project for the custom Airflow image, shown in the screenshot below. The jars folder includes all required jars for the S3 file system, as mentioned in the ‘Apache Spark’ section above. ‘spark-defaults.conf’ is the same configuration file as for your local Spark. ‘generate_kubeconfig.sh’ is adapted from this GitHub gist and generates a kubeconfig for the service account ‘spark’, which will be used by the Airflow worker (you can also reference it below). We need to comment out the ‘create_service_account’ function call in ‘generate_kubeconfig.sh’, since we already created and role-bound the service account ‘spark’ in the ‘Apache Spark’ section above.

build image project for airflow

generate_kubeconfig.sh

Source from https://gist.github.com/innovia/fbba8259042f71db98ea8d4ad19bd708
$ cd <path of the project>
$ chmod +x generate_kubeconfig.sh
$ ./generate_kubeconfig.sh spark default
$ cp /tmp/kube/k8s-spark-default-conf .

After that, you will see the file ‘k8s-spark-default-conf’ under the project folder.

Build the Airflow image

Specify the ‘Dockerfile’ as below.

Dockerfile for custom airflow

In iTerm2, run

$ docker build -t <your docker repo>/airflow-custom:2.2.3 .

Start Airflow with docker compose

You can reference the official Apache Airflow documentation here. I would recommend creating an ‘airflow’ folder on your file system and downloading docker-compose.yaml into that folder as below.

$ cd <path of airflow>
$ curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.2.3/docker-compose.yaml'

In docker-compose.yaml, change the base image to the custom Airflow image you built above.

# Comment the default official image, use your custom one
#image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.2.3}
image: <your docker repo>/airflow-custom:2.2.3

Optionally, you can modify the default local port for the Airflow webserver in the ‘airflow-webserver’ section; the default port is 8080. You can also modify other configurations in docker-compose.yaml depending on your use case.

airflow-webserver:
  <<: *airflow-common
  command: webserver
  ports:
    - <Your new port>:8080

Now you can start Airflow with docker compose.

$ docker compose -f <path of airflow folder>/docker-compose.yaml up -d

You may need to wait a few minutes until all services are up and running. Use the command below to check the status.

$ docker compose -f <path of airflow folder>/docker-compose.yaml ps

You can run the command below to completely tear down the services.

$ docker compose -f <path of airflow folder>/docker-compose.yaml down --volumes --remove-orphans

You can access the Airflow webserver at http://localhost:<your port> with the default username/password ‘airflow/airflow’.

Test the Airflow installation

The dockerized Airflow uses the ‘dags’ and ‘logs’ folders to synchronize DAGs and logs between the host and the Airflow workers in containers. So you can just put your DAGs under the dags folder, which is under the ‘airflow’ folder in my case.

I created a ‘my_test.py’ DAG under the ‘dags’ folder which includes two BashOperators. Each BashOperator calls the spark-submit command to run the spark-pi jar and the Python pi example, respectively. I placed the related files under the ‘mytest’ folder, a subfolder of the ‘dags’ folder, and copied the spark-examples jar and pi.py there from the $SPARK_HOME examples folder.

Test airflow installation

my_test.py

For the bash_command path, remember it is the path as seen from the Airflow worker inside the container; the ‘dags’ folder is located under ‘/opt/airflow’ by default.
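As a rough illustration, a DAG along these lines would do the job. This is a minimal sketch, not the exact my_test.py gist above: the dag_id, schedule, task ordering, and script paths (under /opt/airflow/dags/mytest/) are assumptions based on the layout described in this section.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Minimal sketch of a DAG with two BashOperators calling the spark-submit wrapper
# scripts. Paths assume the scripts live under dags/mytest/ as described above.
with DAG(
    dag_id="my_test",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,  # trigger manually from the webserver
    catchup=False,
) as dag:
    run_jar = BashOperator(
        task_id="spark_pi_jar",
        # trailing space keeps Airflow from treating the .sh path as a Jinja template file
        bash_command="/opt/airflow/dags/mytest/run_jar.sh ",
    )
    run_py = BashOperator(
        task_id="spark_pi_python",
        bash_command="/opt/airflow/dags/mytest/run_py.sh ",
    )

    # run sequentially here just for illustration; they could also run in parallel
    run_jar >> run_py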

run_jar.sh

run_py.sh

By default, Airflow lists all example DAGs, all of them paused. When you refresh the Airflow webserver, you should be able to see the ‘my_test’ DAG; you need to unpause it.

my_test dag in airflow webserver

You can click the DAG and trigger it manually.

Trigger the DAG manually

The DAG should run the Spark jobs in Kubernetes pods, with exactly the same result as shown in the ‘Apache Spark’ section above.

Limitations

Running the same spark-submit job from the Airflow worker in a container can sometimes be a bit slower than running it from the host, and the container may impose some resource restrictions. In my Airflow configuration, I haven’t configured YARN yet; if you need to run Spark jobs on YARN, you may have to run them from the host rather than from Airflow.
