Setup Mini Data Lake and Platform on M1 Mac — Part 5

Jun Li
Jan 4, 2022
Photo by Stephen Dawson on Unsplash

In this article, I will show you the services and tools that I installed on my laptop for data querying, data analytics, data science and machine learning.

Druid

First, download Apache Druid from here and unzip it to your local file system. Druid supports both cluster and single-server installs; we will use the single-server one. To use it as a local service, I chose the micro-quickstart size. For each server size, you can see configurations for 'broker', 'coordinator-overlord', 'historical', 'middleManager', 'router' and '_common' under the folder <druid_home>/conf/druid/single-server/<server size>. The server size here is one of 'xlarge', 'large', 'medium', 'small', 'micro-quickstart' and 'nano-quickstart'. They have different resource allocations in their jvm.config files.

By default, the Amazon S3 extension is disabled in Druid's single-server install. To connect to MinIO, you need to enable the extension and configure S3 for MinIO in '<druid_home>/conf/druid/single-server/micro-quickstart/_common/common.runtime.properties'. Below is an example for my case.

common.runtime.properties
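Roughly, the S3-related part of the file looks like the sketch below; the MinIO endpoint http://localhost:9000 and the bucket name 'druid' are placeholders from my setup, so adjust them to your own, and the keys can also be supplied via a password provider instead of clear text.

druid.extensions.loadList=["druid-s3-extensions", "druid-hdfs-storage", "druid-kafka-indexing-service", "druid-datasketches"]
# point the S3 client at the local MinIO
druid.s3.endpoint.url=http://localhost:9000
druid.s3.enablePathStyleAccess=true
druid.s3.accessKey=<your minio user>
druid.s3.secretKey=<your minio user password>
# use the MinIO bucket created for druid as deep storage
druid.storage.type=s3
druid.storage.bucket=druid
druid.storage.baseKey=segments
# keep indexing task logs in the same bucket
druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=druid
druid.indexer.logs.s3Prefix=indexing-logs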

You also need to create a bucket for Druid in your local MinIO. The other services each have a runtime.properties file where you can configure the port number, service name, service discovery and so on. If you need a larger server, make the same changes under the corresponding size folder, such as small, medium or large.

You can export your MinIO username/password as environment variables first, then start the Druid service locally. The environment variables will be used when loading data from the 'Amazon S3' data source. If you don't use environment variables, you have to specify the keys with the default key type, which is clear text.

$ export DRUID_S3_ACCESS_KEY_ID=<your minio user>
$ export DRUID_S3_ACCESS_SECRET_KEY=<your minio user password>
$ nohup <druid home>/bin/start-micro-quickstart >> <path of log> &

Example of loading data from MinIO

This example shows how to load a data file from my datasources bucket in local MinIO into Druid. The data file is 'wikiticker-2015-09-12-sampled.json.gz' from the '<druid home>/quickstart/tutorial' folder, which I then uploaded to my datasources bucket. Open your local Druid console in a browser at http://localhost:18888 (I use port 18888 for the router so it doesn't conflict with my default Jupyter notebook port). Then select 'Amazon S3' from the 'Load Data' menu and click 'Connect data'. Specify the data source S3 URL, with the S3 access keys taken from the environment variables.

Load data from Amazon S3 in Druid
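The S3 URI to enter points at the uploaded object in the datasources bucket, something like the line below (the bucket name comes from my setup), with the access key and secret key set to come from the environment variables exported earlier.

s3://datasources/wikiticker-2015-09-12-sampled.json.gz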

Then click through the next steps ('Parse data', 'Parse time', 'Transform', 'Filter' and 'Configure schema'), keeping the defaults, and select 'all' for 'Primary partitioning by time', which is the segment granularity. After that you can publish and submit your data source, and Druid will run a job to process the data.

Process data from s3

Once it’s completed, you can query data like below.

Query data
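For example, assuming the datasource was published under the name 'wikiticker', a simple aggregation in the Query view could look like this (adjust the datasource name to whatever you chose at the publish step):

SELECT channel, COUNT(*) AS edits
FROM "wikiticker"
GROUP BY 1
ORDER BY edits DESC
LIMIT 10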

Presto

As I mentioned in Part 1, Presto performs very poorly under Docker on an M1 MacBook, so it's better to install it directly on the local system. The official Presto site provides this article for quickly installing PrestoDB on macOS via 'brew'; however, it doesn't work on arm-based macOS yet with the latest version, 0.268, at the time I was writing this article. It gives you the error 'Presto requires x86_64 on Mac OS X' when you launch the Presto service locally. I also raised an issue on the GitHub presto repo and hopefully they can fix it in an upcoming release. If they haven't, please look at my workaround below.

Workaround

We can build our own version from source to avoid this error. First, git clone the source code from the GitHub presto repo. If you don't need to commit and push, you can just use the master branch; otherwise, it's better to fork it to your GitHub account and create a new branch. Open the source code in your favorite IDE, such as JetBrains IntelliJ, navigate to the class 'PrestoSystemRequirements.java' under the package 'com.facebook.presto.server', and in the method 'verifyOsArchitecture' accept 'aarch64' alongside 'x86_64' as below.

Update the condition to verify OS architecture for Mac OS
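Roughly, the macOS branch of 'verifyOsArchitecture' ends up like the snippet below (a sketch against the 0.268 source; the surrounding code stays unchanged):

// allow Apple Silicon (aarch64) in addition to x86_64 on Mac OS X
else if ("Mac OS X".equals(osName)) {
    if (!"x86_64".equals(osArch) && !"aarch64".equals(osArch)) {
        failRequirement("Presto requires x86_64 or aarch64 on Mac OS X (found %s)", osArch);
    }
}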

Then, in iTerm2 or the 'Terminal' window inside IntelliJ, run the following command from the root folder of the presto project:

$ ./mvnw clean install -DskipTests

Make sure Java and Python 3 are installed as mentioned in Part 2 before running this command. It will take some time to build and generate the new version of the package. If the current version is 0.268, your new version will be 0.269-SNAPSHOT. The new package file should be presto-server-0.269-SNAPSHOT.tar.gz under the folder '<presto project root>/presto-server/target', and it should be around 1.2 GB in size.

Configure Presto

First, you need to create three configuration files, 'config.properties', 'jvm.config' and 'node.properties', under the <presto home>/etc folder.

config.properties
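A minimal single-node example looks like the following; port 8089 matches what I use later, and the memory limits are just starting values you can tune.

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8089
query.max-memory=4GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://localhost:8089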

jvm.config
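For jvm.config, the standard flags from the Presto deployment docs work fine; the heap size (-Xmx8G) is just what I can spare on my laptop.

-server
-Xmx8G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-Djdk.attach.allowAttachSelf=true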

node.properties
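For node.properties, something like this is enough for a local single node; the node id and data directory are arbitrary choices.

node.environment=production
node.id=presto-local-1
node.data-dir=<presto home>/data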

Second, create a new folder 'catalog' under <presto home>/etc, which hosts all connector configurations. The one we need is the connector for the Hive metastore from Part 4. You can also create connectors for jmx, memory and tpch. Place all of them into the 'catalog' folder.

hive.properties
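A sketch of hive.properties for my setup is below; the metastore port 9083 and the MinIO endpoint/credentials are assumptions based on the Hive metastore and MinIO from the earlier parts, so adjust them to yours.

connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083
hive.s3.endpoint=http://localhost:9000
hive.s3.path-style-access=true
hive.s3.aws-access-key=<your minio user>
hive.s3.aws-secret-key=<your minio user password>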

For jmx, memory and tpch, each properties file just needs to name its connector, as below.

# jmx.properties
connector.name=jmx
# memory.properties
connector.name=memory
# tpch.properties
connector.name=tpch

Now you can launch your Presto server in the background by running

$ <presto home>/bin/launcher start
# to stop, just run
$ <presto home>/bin/launcher stop

You can access the web UI via http://localhost:8089 in my case; it depends on the port number you set in the configuration.

Connect to Presto

You can connect to the Presto server via the presto-cli or a third-party tool. I still use JetBrains DataGrip to connect to it.

## Use presto-cli

If Presto is still not compatible with arm-based macOS, the released presto-cli unfortunately will not work either; it raises the error described in the open issue.

For more about using the presto-cli, see the documentation here.
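Once a compatible build is available (or if you use the CLI jar built from the same patched source under presto-cli/target), invoking it against the local server would look roughly like this, with the port and catalog matching my configuration above:

$ ./presto-cli-0.269-SNAPSHOT-executable.jar --server localhost:8089 --catalog hive --schema test_schema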

## Use JetBrains DataGrip

Create a new connection to the 'Presto' data source under the 'Other' menu in DataGrip.

Create data source from Presto

Download the driver from the popup window; you only need to do this once. Give the connection a name like 'prestolocal'. In the 'General' tab, fill in 'Host' as 'localhost' and the port as '8089' in my case. Select 'No auth' for Authentication. In the 'SSH/SSL' tab, make sure 'Use SSH tunnel' and 'Use SSL' are unchecked. In the 'Schemas' tab, check only hive if you only need to see the hive schemas. In the 'Advanced' tab, create a new name/value pair 'user'/'root', and keep the others as default. Then click 'Test Connection'; it should succeed if everything is configured properly.

Configure connection — 1
Configure connection — 2
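For reference, DataGrip essentially builds a Presto JDBC URL from these fields; the equivalent for my setup would be roughly the following, with user=root passed as a driver property.

jdbc:presto://localhost:8089/hive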

Now, run a test query. Assume we already have ‘test_schema’ under hive, and ‘test_table’ under ‘test_schema’. The syntax for a table under Presto is “<connector_name>.<schema_name>.<table_name>”. In my case, it should be ‘hive.test_schema.test_table’.

Test query from presto
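For example, a quick sanity check (assuming test_table already contains some rows) could be:

SELECT * FROM hive.test_schema.test_table LIMIT 10;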

Jupyter, Python Libraries for Data Science and Machine Learning

Assume you have already installed miniforge, as mentioned in Part 2.

  1. Install tensorflow for the M1 Mac. Create a new conda environment with Python 3.8.12. PyTorch support for the M1 GPU was in the works but not yet complete at the time I was writing this article. If you still want to install PyTorch, it runs on the CPU with the osx-64 version and should be fine. For deep learning on an M1 Mac, tensorflow is still the recommended option.
$ conda create --name tf_m1 python=3.8.12
$ conda activate tf_m1
$ conda install -c apple tensorflow-deps
$ pip install tensorflow-macos
$ pip install tensorflow-metal
# Optional
$ conda install -c pytorch pytorch

2. Install pandas and scikit-learn

$ conda install pandas
$ conda install scikit-learn

3. Install jupyter and jupyterlab

$ conda install -c conda-forge jupyter jupyterlab

For other Python libraries, you can just use conda install, or pip install if the package is not found by conda. However, some libraries may still have compatibility issues.

Now, we can verify whether tensorflow is able to see the GPU. Run ipython under iTerm2. You should see 'GPU' listed as one of the devices if everything is installed properly.

$ ipython
In [1]: import tensorflow as tf
In [2]: tf.__version__
Out[2]: '2.7.0'
In [3]: tf.config.list_physical_devices()
Out[3]:
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

You can also find some useful information from this article.
