Setup Mini Data Lake and Platform on M1 Mac — Part 4

Jun Li
5 min read · Jan 4, 2022

In this article, I will walk you through installing the data metastore. As mentioned in Part 1, Hive serves as the metastore on top of the data in MinIO. To install Hive, we first need MySQL, which stores Hive's metadata; it can also serve as the RDBMS in your data system. This article assumes you already have Hadoop installed, as described in Part 3.

MySQL

First, download the ARM-based MySQL server and follow the binary DMG installer step by step. This installs MySQL as a service, and you will find a 'MySQL' pane in System Preferences.

Second, download the Hive tar.gz package, version 3.1.2, from the Apache Hive downloads page, and extract it.

Third, add the MySQL bin folder to your PATH in the ~/.zshrc file, then source it.

export PATH=$PATH:/usr/local/mysql-8.0.27-macos11-arm64/bin
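After sourcing ~/.zshrc, you can verify the MySQL client is on your PATH (the exact version string will vary with your install):

$ mysql --version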

Fourth, connect to MySQL in your terminal as the root user, with the password set during the MySQL installation. Once you log in, create the 'metastore' database, import the default schema script shipped with Hive, then create a new user and grant it all privileges on this database.

$ mysql -u root -p
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE <path of apache hive 3.1.2 bin>/scripts/metastore/upgrade/mysql/hive-schema-3.1.0.mysql.sql
mysql> CREATE USER '<your hive user>'@'%' IDENTIFIED BY '<your hive password>';
mysql> GRANT ALL ON metastore.* TO '<your hive user>'@'%';
mysql> FLUSH PRIVILEGES;

Fifth, download the MySQL Connector/J package, unzip it, and put the 'mysql-connector-java-8.0.27.jar' file under the apache-hive-3.1.2-bin/lib folder.
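For example, assuming both archives were extracted under your home directory (adjust the paths to your layout):

$ cp ~/mysql-connector-java-8.0.27/mysql-connector-java-8.0.27.jar ~/apache-hive-3.1.2-bin/lib/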

Hive

We need to add the MinIO connection settings to hive-site.xml, along with the thrift metastore URI. To support MinIO, you also need to download hadoop-aws (and its companion aws-java-sdk-bundle jar) and place the jars under the apache-hive-3.1.2-bin/lib folder; you can get them from the download page via the 'Download hadoop-aws.jar (3.3.1)' link. Create or edit hive-site.xml under apache-hive-3.1.2-bin/conf as below.

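A minimal hive-site.xml sketch, assuming MinIO listens on http://127.0.0.1:9000 and the MySQL 'metastore' database from the previous section; the user names, passwords, and keys are placeholders to replace with your own:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <!-- Default location for managed Hive tables in HDFS -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <!-- Thrift URI that clients use to reach the metastore service -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
  <!-- MySQL-backed metastore connection -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>your-hive-user</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>your-hive-password</value>
  </property>
  <!-- MinIO (S3-compatible) access via the s3a connector -->
  <property>
    <name>fs.s3a.endpoint</name>
    <value>http://127.0.0.1:9000</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>your-minio-access-key</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>your-minio-secret-key</value>
  </property>
  <!-- MinIO serves buckets by path, not by virtual host -->
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
</configuration>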

If you haven’t create the default hdfs folder for hive data warehouse as mentioned in hive-site.xml, run below

$ hdfs dfs -mkdir -p /user/hive/warehouse

You can also add the Hive home folder as an environment variable and its bin folder to your PATH in ~/.zshrc:

export HIVE_HOME=<path of hive home folder>
export PATH=$PATH:$HIVE_HOME/bin

Next, you need to give the proper permissions to the default Hive warehouse folder in HDFS and create a tmp folder for it.

$ hdfs dfs -chmod g+w /user/hive/warehouse
$ hdfs dfs -mkdir -p /tmp/hive
$ hdfs dfs -chmod g+w /tmp
$ hdfs dfs -chmod 777 /tmp/hive

Connect to Hive

You can connect to Hive either from the command line or from a third-party data tool such as JetBrains DataGrip, which is the one I'm currently using.

Before you connect to Hive, you need to start the HiveServer2 and metastore services. You can use nohup if you want to run them in the background rather than launching them at login. It will take some time for the services to fully start.

$ nohup $HIVE_HOME/bin/hive --service hiveserver2 >> <path of log> &
$ echo $! > <path of hiveserver2 pid>
$ nohup $HIVE_HOME/bin/hive --service metastore >> <path of log> &
$ echo $! > <path of metastore pid>
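To confirm HiveServer2 has finished starting, you can check that it is listening on its default port, 10000 (assuming you haven't changed it):

$ lsof -i :10000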

Connect via CLI

In iTerm2, type 'hive' (assuming you already put the Hive bin folder on your PATH) and it will start the Hive interactive shell. You can test the connection by listing all databases.

$ hive
hive> show databases;
OK
default
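As a further sanity check, you can create a small table, write a row, and query it back; a quick sketch (the table name is arbitrary):

hive> CREATE TABLE sanity_check (id INT, name STRING);
hive> INSERT INTO sanity_check VALUES (1, 'hello');
hive> SELECT * FROM sanity_check;
hive> DROP TABLE sanity_check;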

Connect via JetBrains DataGrip

In DataGrip, click '+' on the left panel and select 'Apache Hive' from the data source menu.

Create a new data source from Apache Hive

You can give your connection a name. If you are connecting for the first time, you can download the driver directly from this window. 'Host' will be 'localhost' and the port is 10000 by default. The default user and password are 'scott' and 'tiger'. Schema is optional; all schemas will be listed once connected. In the 'SSH/SSL' tab, make sure 'Use SSH tunnel' and 'Use SSL' are unchecked, since you are connecting to a local service. Keep the other options in each tab at their defaults, then click 'Test Connection'. If it succeeds, you will see a success message.
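For reference, the JDBC URL that DataGrip assembles from these settings should look like the following:

jdbc:hive2://localhost:10000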

You can create a new schema directly from the connection. In my case, the connection is called 'hivelocal'. Right-click on 'hivelocal' and select 'Schema' from the 'New' menu; you will see a window like the one below. Fill in 'Name' and 'Location' (the other fields can stay empty), then click 'OK' and your new schema will be created. In my case, I created a new schema called 'myschema' under the 'hivedata' bucket with the 'myschema' prefix in MinIO via the 's3a' protocol. If your bucket is not created yet, create it first in your local MinIO.

Create a new Hive schema with its location in MinIO
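The same thing can be done in HiveQL from the CLI or any connected client; a sketch matching the example above ('hivedata' bucket, 'myschema' prefix):

hive> CREATE SCHEMA myschema LOCATION 's3a://hivedata/myschema';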

For more about the usage of Hive, check out its wiki page.

