Setup Mini Data Lake and Platform on M1 Mac — Part 1

Jun Li · Published in SelectFrom · 6 min read · Jan 4, 2022


Data Lake Solution from the Public Cloud

Take Amazon Web Services (AWS) as an example: below is a reference data lake architecture from its solution implementations. In production, in the real commercial world, a big data platform and data lake can be set up like this. But for personal use, we don’t need that many services to meet our day-to-day development and learning purposes locally on a personal computer.

Source: https://aws.amazon.com/solutions/implementations/data-lake-solution/?anld_mp1

At the time of writing this series, I had been using my M1 Max MacBook Pro for a couple of months, and it has become my favorite laptop ever. Its performance is incredible compared to its predecessor and to other laptops at a similar level. I decided to use it as my primary laptop for day-to-day local development and learning purposes.

Purposes

To meet my day-to-day local development requirements on a single laptop, the system should have services and tools for data storage, a data metastore, and data processing, plus capabilities for BI, data analytics, data science, machine learning, and so on. Fortunately, there are open-source tools for each of these categories. It’s not a perfect solution compared to the cloud, but it’s a good alternative for individual developers like myself to do some learning, prototyping, and small non-commercial data projects on a personal computer during after-hours while keeping costs low.

Architecture

Because the ARM-based M1 has a different processor architecture than x86, I experienced some software compatibility issues, although most everyday development apps have been fine since the M1 was first released. In this series of articles, I would like to share my experience setting up these services and tools on my M1 MacBook Pro. The diagram below shows the architecture of the mini data lake and platform, with the related services and tools, on my laptop.

Mini data lake and platform on my M1 MBP (Image by author)

Infrastructure

I wish I could set up everything within local Docker containers and a Kubernetes cluster, but I decided to take a hybrid approach instead. Docker Desktop for M1 is a popular tool that makes it easy to run Docker and a local Kubernetes cluster on a personal computer, but it still has limitations and performance issues on the Apple chip, and that is the key reason I ended up with a hybrid installation strategy; you can find more details on this online. For example, Airflow installed via Kubernetes and Helm on Docker Desktop was very slow for me, but it runs fine with docker-compose; likewise, Presto ran queries much more slowly in Docker than when installed directly on my laptop.

Data Storage

I use MinIO as the default data storage; it is an S3-compatible object store. It has a cloud-native architecture with high performance and scalability, fits multi-cloud and hybrid-cloud setups, and is friendly to developers. For local development, MinIO is a great free alternative to Amazon S3, and it works well with my local Spark, Hive, and other services. Check out the official site for more details.
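
Since MinIO speaks the S3 API, any S3 client works against it. Here is a minimal sketch using boto3, assuming MinIO’s default local endpoint (http://localhost:9000) and default minioadmin credentials; the bucket, file, and key names are placeholders, not values from this article:

```python
import boto3

# Assumed local MinIO endpoint and its default credentials; adjust to
# match your own setup.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

# Create a bucket for the data lake and upload a local file under a
# "raw" prefix (hypothetical names).
s3.create_bucket(Bucket="datalake")
s3.upload_file("events.csv", "datalake", "raw/events/dt=2022-01-04/events.csv")

# List what landed under the raw prefix.
for obj in s3.list_objects_v2(Bucket="datalake", Prefix="raw/").get("Contents", []):
    print(obj["Key"], obj["Size"])
```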

Since Hive requires Hadoop for MapReduce jobs (even though that’s optional in a dev environment), Hadoop is still one of the services on my laptop. I also sometimes use HDFS as an additional distributed storage and file system. You can definitely find plenty of articles on Hadoop online.
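
As a small illustration, here is one way to reach a local HDFS from Python via pyarrow; it assumes a namenode on the default port 8020 and a local Hadoop install with libhdfs available (HADOOP_HOME and CLASSPATH set), and the directory path is hypothetical:

```python
from pyarrow import fs

# Connect to a local HDFS namenode (assumed default port 8020); this
# requires libhdfs from a local Hadoop installation.
hdfs = fs.HadoopFileSystem(host="localhost", port=8020)

# List files under a hypothetical directory.
for info in hdfs.get_file_info(fs.FileSelector("/user/me/data")):
    print(info.path, info.size)
```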

The storage part is the core of the mini data lake: it can hold structured, semi-structured, and unstructured data for your day-to-day local development. You can store raw data and transformed data in different buckets, or in the same bucket under different prefixes.

Data Metastore

I mainly use Hive as the data metastore but rarely query data through it, due to its slower query performance. Most of the data is stored in MinIO; Hive can create either managed or external tables on top of that data with the desired columns and partitions.
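
For example, an external table can be declared over files that already live in MinIO, so Hive keeps only the metadata. Below is a sketch using PyHive against a local HiveServer2; the connection details, table layout, and s3a location are assumptions, and Hive needs the hadoop-aws connector configured for s3a to work:

```python
from pyhive import hive  # pip install 'pyhive[hive]'

# Assumed local HiveServer2 on its default port.
conn = hive.connect(host="localhost", port=10000)
cur = conn.cursor()

# The data stays at the s3a:// location in MinIO; Hive only records the
# schema, partitioning, and location (all names here are hypothetical).
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events (
        user_id STRING,
        event_type STRING
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
    LOCATION 's3a://datalake/raw/events/'
""")
```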

MySQL serves as the backing database for Hive’s metadata; you can also use PostgreSQL for that. I use MySQL whenever there are RDBMS demands as well.

Tools for Data

Once your data is ready, you can use it for further purposes such as querying, analytics, data science, and machine learning.

Apache Presto is an open-source distributed SQL query engine that can run interactive analytic queries against data sources of all sizes. Through its connectors you can query data in Hive, relational databases, and more, and you can even combine data from multiple sources in a single query. Check out the official documentation for details.
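
As a quick sketch, the presto-python-client package can query a local coordinator; the host, port, user, and the table being queried are assumptions carried over from the earlier examples:

```python
import prestodb  # pip install presto-python-client

# Assumed local Presto coordinator, querying the Hive catalog.
conn = prestodb.dbapi.connect(
    host="localhost",
    port=8080,
    user="demo",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
cur.execute("SELECT event_type, count(*) FROM events GROUP BY event_type")
print(cur.fetchall())
```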

If you have real-time or streaming data, or a user-facing application on top of your data, then the Druid database is worth considering. It integrates with existing data pipelines and can load data from Kafka, HDFS, S3, and more. Druid’s official site provides great documentation if you want to explore further.
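
Druid also exposes a SQL endpoint over HTTP, so querying it from Python needs nothing more than requests. In this sketch the router is assumed to be on its default port 8888, and the datasource name is hypothetical:

```python
import requests

# POST a SQL query to the Druid router's SQL endpoint (default port 8888).
resp = requests.post(
    "http://localhost:8888/druid/v2/sql",
    json={"query": "SELECT event_type, COUNT(*) AS n FROM clicks GROUP BY event_type LIMIT 10"},
)
print(resp.json())
```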

For data science and machine learning or deep learning, Jupyter notebooks are one of the tools I use most often. You’ll need to install Python along with frequently used data science and machine learning libraries such as pandas and TensorFlow.
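
Inside a notebook, pandas can read data straight out of MinIO through s3fs by pointing it at the local endpoint. This is a sketch: the credentials, endpoint, and object path are the same placeholder values as before, and it needs pandas with the s3fs and pyarrow packages installed:

```python
import pandas as pd  # requires the s3fs and pyarrow packages

# Read Parquet data directly from MinIO into a DataFrame; all names and
# credentials here are placeholders for a local setup.
df = pd.read_parquet(
    "s3://datalake/curated/events/",
    storage_options={
        "key": "minioadmin",
        "secret": "minioadmin",
        "client_kwargs": {"endpoint_url": "http://localhost:9000"},
    },
)
print(df.head())
```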

Data Processing

Apache Spark is obviously a popular and powerful tool for data processing, and its website gives good explanations with examples. Instead of running Spark on top of HDFS and YARN, I prefer using Spark with Kubernetes and S3-compatible storage like MinIO: for one, it’s closer to the cloud-native way, and for another, the storage is shared with the other services installed on my laptop.
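
Here is a minimal PySpark sketch of pointing Spark’s s3a connector at a local MinIO instance; the endpoint and credentials are the placeholder values from earlier, and the hadoop-aws and AWS SDK jars need to be on Spark’s classpath:

```python
from pyspark.sql import SparkSession

# Configure the s3a connector for a local MinIO endpoint (placeholder
# credentials; path-style access is required for MinIO).
spark = (
    SparkSession.builder.appName("mini-datalake")
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read raw CSV data and write it back as Parquet under a curated prefix.
df = spark.read.csv("s3a://datalake/raw/events/", header=True)
df.write.mode("overwrite").parquet("s3a://datalake/curated/events/")
```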

I use Airflow to run my data pipelines and schedule data processing jobs. Airflow is a platform where you can programmatically author, schedule, and monitor your workflows. All workflows are defined as code, which makes them easy to maintain. Check out the official documentation for more information.
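
To give a flavor of what that looks like, here is a minimal Airflow 2.x DAG sketch; the DAG id, schedule, and tasks are placeholders rather than a real pipeline:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A two-step daily pipeline; the bash commands are stand-ins for real
# extract/transform jobs.
with DAG(
    dag_id="daily_events_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    extract >> transform
```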

Limits

Due to compatibility issues and the limits of ARM-based macOS, not all services and tools work exactly as they do on x86-based machines, but the setup can still meet most day-to-day local development requirements. As time goes on, more and more software and tools will support ARM-based macOS natively, and it will become even more powerful for professionals.

If you still prefer Windows or Linux-based machines for similar purposes, this solution can also help; just replace the services and tools with their x86 versions. You can also try an all-container approach and install them on Windows WSL2 or Linux.


An engineer enthusiastic about software/web/mobile development, cloud native, data and AI.