Deploy StarRocks with Docker
This tutorial covers:
- Running StarRocks in a single Docker container
- Loading two public datasets, including basic transformation of the data
- Analyzing the data with SELECT and JOIN (a sketch follows this list)
- Basic data transformation (the T in ETL)
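As a preview of the analysis step, a JOIN across the two datasets (introduced below) might look like the following sketch. The table and column names (crashdata, weatherdata, crash_time, observed_at, condition) are illustrative assumptions, not the exact schema built later in the tutorial:

```sql
-- Hypothetical sketch: count crashes per weather condition by joining
-- the crash dataset to the hourly weather dataset on the hour.
-- Table and column names are assumptions for illustration only.
SELECT w.condition, COUNT(*) AS crashes
FROM crashdata AS c
JOIN weatherdata AS w
  ON DATE_FORMAT(c.crash_time, '%Y-%m-%d %H') = DATE_FORMAT(w.observed_at, '%Y-%m-%d %H')
GROUP BY w.condition
ORDER BY crashes DESC;
```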
The data used is provided by NYC OpenData and the National Centers for Environmental Information.
Both of these datasets are very large, and because this tutorial is intended to help you get exposed to working with StarRocks, we are not going to load 120 years of data. You can run the Docker image and load this data on a machine with 4 GB of RAM assigned to Docker. For larger fault-tolerant and scalable deployments, we have other documentation; it is referenced later in this guide.
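A minimal way to launch the single-container deployment looks like the sketch below. The image name and published ports are based on StarRocks' all-in-one image; verify the image tag against the release you intend to run:

```bash
# Run the FE and BE together in a single container (all-in-one image).
# 9030: MySQL-protocol port on the FE; 8030: FE HTTP port; 8040: BE HTTP port.
docker run -p 9030:9030 -p 8030:8030 -p 8040:8040 \
    -itd --name quickstart starrocks/allin1-ubuntu
```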
There is a lot of information in this document; the step-by-step content comes first, and the technical details are at the end. This order serves these purposes:
- Allow the reader to load data in StarRocks and analyze that data.
- Explain the basics of data transformation during loading.
Prerequisites
Docker
- Docker
- 4 GB RAM assigned to Docker
- 10 GB free disk space assigned to Docker
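One way to confirm the resources available to Docker is shown below; this is a sketch, and the numbers reported depend on your Docker Desktop or daemon settings:

```bash
# Print the total memory visible to the Docker daemon, in bytes.
docker info --format '{{.MemTotal}}'

# Show disk usage by images, containers, and volumes.
docker system df
```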
SQL client
You can use the SQL client provided in the Docker environment, or one on your system. Many MySQL-compatible clients will work, and this guide covers the configuration of DBeaver and MySQL Workbench.
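If you have the mysql CLI installed on your system, connecting looks like this sketch. It assumes the container publishes port 9030 and uses the default root user with no password:

```bash
# Connect to the StarRocks FE over the MySQL protocol.
mysql -P 9030 -h 127.0.0.1 -u root --prompt="StarRocks > "
```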
curl
curl is used to issue the data load job to StarRocks and to download the datasets. Check whether it is installed by running curl or curl.exe at your OS prompt. If curl is not installed, download it from curl.se.
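As a preview, a load job issued with curl uses StarRocks' Stream Load HTTP API against the FE HTTP port. The database, table, file, and label names below are placeholders, not the names used later in this tutorial:

```bash
# Hypothetical Stream Load: send a local CSV to the FE HTTP port (8030).
# quickstart.crashdata, data.csv, and the label are placeholder names.
curl --location-trusted -u root: \
    -T ./data.csv \
    -H "label:quickstart-load-1" \
    -H "column_separator:," \
    -XPUT http://localhost:8030/api/quickstart/crashdata/_stream_load
```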
Terminology
FE
Frontend nodes are responsible for metadata management, client connection management, query planning, and query scheduling. Each FE stores and maintains a complete copy of the metadata in its memory, so any FE can serve a request equivalently.
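Once connected with a SQL client, you can list the FEs and check their status:

```sql
-- List frontend nodes; the Alive column shows node health.
SHOW FRONTENDS;
```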
BE
Backend nodes are responsible for both data storage and executing query plans.
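The BEs can be inspected the same way; in this single-container deployment you should see one BE:

```sql
-- List backend nodes with their storage and health status.
SHOW BACKENDS;
```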