Setup a Mini Data Lake and Platform on M1 Mac — Part 3

Jun Li · Published in SelectFrom · 3 min read · Jan 4, 2022


Photo by Alexandre Debiève on Unsplash

In this part, I will show you how to set up data storage in the mini data lake.

MinIO

It’s quite straightforward to install MinIO via brew.

$ brew install minio/stable/minio

After it’s installed, you can simply run ‘minio server <path of data>’ to start the MinIO server with all the default configurations. You can also specify your root user and root password, which serve as the S3 access key ID and S3 secret access key.

$ export MINIO_ROOT_USER="<your root user>"
$ export MINIO_ROOT_PASSWORD="<your root password>"
$ nohup minio server <path of data> --address ":9000" --console-address ":9001" >> <path of log> &
$ echo $! > <path of pid>

This exposes the MinIO API for data access on port 9000 and the console on port 9001. I use nohup to launch MinIO because I want it to run in the background. I also write the PID to a file in case I want to stop the service later by running:

$ kill -9 $(cat <path of pid>)

If you prefer launching the service at login time, refer to this article.

Once the service is started, you can access the console from the browser at http://localhost:9001. After you fill in the root username and password, you will see the console:

Local MinIO Dashboard

The dashboard gives you an overview of your service, such as the number of buckets, objects, etc. The experience is similar to AWS S3, but versioning, object locking, and quota features are disabled in the single-disk setup.
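Because MinIO speaks the S3 API, you can also talk to it from the command line. The snippet below is a quick sanity check I’d suggest rather than a step from this setup; it assumes you have the AWS CLI installed, and the bucket name is just an example.

$ export AWS_ACCESS_KEY_ID="<your root user>"
$ export AWS_SECRET_ACCESS_KEY="<your root password>"
$ export AWS_DEFAULT_REGION="us-east-1"   # any region value works for a local MinIO
$ # create a test bucket against the local endpoint and list buckets
$ aws --endpoint-url http://localhost:9000 s3 mb s3://test-bucket
$ aws --endpoint-url http://localhost:9000 s3 ls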

Hadoop

First, download the Hadoop ARM-based binary tar.gz file. Then, extract it into your local file system.

You can follow this tutorial for the installation. I made some changes to its steps for this series.

Once it is ready, configure the environment variables shown below in ~/.zshrc.

## Set HADOOP environment variables
export HADOOP_HOME=<path of hadoop home folder>
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export YARN_HOME=$HADOOP_HOME
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_LIBEXEC_DIR=$HADOOP_HOME/libexec
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
# export hadoop bin to PATH for easy access
export PATH=$PATH:$HADOOP_HOME/bin
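After editing ~/.zshrc, reload it and confirm the Hadoop binaries are picked up. This quick check is my addition rather than part of the original steps:

$ source ~/.zshrc
$ hadoop version        # should print the version of the Hadoop build you unpacked
$ echo $HADOOP_CONF_DIR # should point to $HADOOP_HOME/etc/hadoop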

Second, set up the Hadoop-related configuration files under $HADOOP_HOME/etc/hadoop.

core-site.xml

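The embedded gist is not reproduced here, so the snippet below is only a sketch of what a minimal single-node core-site.xml typically contains. The fs.defaultFS port (8020) is my assumption, chosen so it does not clash with MinIO on 9000; use whatever port your own config defines.

<configuration>
  <!-- default file system URI used by HDFS clients -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>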

hdfs-site.xml

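Again, the gist itself is not shown here; a typical single-node hdfs-site.xml looks like the sketch below. The storage paths are placeholders of my own, and replication is set to 1 because there is only one datanode.

<configuration>
  <!-- single-node setup: keep only one copy of each block -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- local directories for namenode metadata and datanode blocks -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file://<path of hadoop data folder>/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file://<path of hadoop data folder>/datanode</value>
  </property>
</configuration>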

hadoop-env.sh

You can keep most of the default content of this file; just add two additional environment variables:

export HADOOP_HOME=<path of hadoop home folder>
export JAVA_HOME=/Library/Java/JavaVirtualMachines/zulu-8.jdk/Contents/Home

mapred-site.xml

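For mapred-site.xml, the usual single-node content (a sketch, not the exact gist from the original) tells MapReduce to run on YARN and makes the MapReduce jars visible to containers:

<configuration>
  <!-- run MapReduce jobs on YARN instead of the local runner -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <!-- expose the MapReduce jars to YARN containers -->
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>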

yarn-site.xml

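And for yarn-site.xml, a minimal sketch (again, not the original gist) enables the shuffle service and whitelists the Hadoop environment variables for containers:

<configuration>
  <!-- auxiliary shuffle service required by MapReduce -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- pass the Hadoop environment variables through to containers -->
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>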

Before you start the Hadoop services, you need to set up passwordless SSH to localhost, as it is required by the Hadoop namenode and secondary namenode. To achieve this, create a key pair first and add the public key to ~/.ssh/authorized_keys.

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Then, you need to enable Remote Login for your account under System Preferences -> Sharing.

Enable remote login to your laptop
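To double-check the setup before moving on (my own sanity check, not a step from the original walkthrough), SSH to localhost; it should connect without asking for a password:

$ ssh localhost
$ exit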

Lastly, you need to format your namenode.

$ hdfs namenode -format

Once it’s done, you can start your Hadoop server by running:

$ $HADOOP_HOME/sbin/start-all.sh

Use the ‘jps’ command to check whether all the services have started.

$ jps
# You should see output like below
73772 DataNode
73644 NameNode
74126 ResourceManager
74228 NodeManager
79479 Jps
73910 SecondaryNameNode

You can access the Hadoop cluster health check at http://localhost:9870 and the YARN applications UI at http://localhost:8088.
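As an optional smoke test that is not part of the original steps, you can push a small file into HDFS and list it back; the paths below are arbitrary examples:

$ echo "hello hdfs" > /tmp/hello.txt
$ hdfs dfs -mkdir -p /smoke-test
$ hdfs dfs -put /tmp/hello.txt /smoke-test/
$ hdfs dfs -ls /smoke-test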

To stop the Hadoop services, just run:

$ $HADOOP_HOME/sbin/stop-all.sh

