ClickHouse Usage Tutorial Series

Article 1: Quick Start with ClickHouse

Overview

This article will guide you through a quick start with ClickHouse. You’ll learn how to download an appropriate binary for your OS, run the ClickHouse server, create a table, insert data into it, and query your table using the ClickHouse client.

Prerequisites

You’ll need curl or another command-line HTTP client to fetch the ClickHouse binary.

Download the binary

ClickHouse runs natively on Linux, FreeBSD, and macOS, and runs on Windows via WSL (Windows Subsystem for Linux). The simplest way to download ClickHouse locally is to run the following curl command:

curl https://clickhouse.com/ | sh

You should see:

Successfully downloaded the ClickHouse binary, you can run it as:
    ./clickhouse
You can also install it:
    sudo ./clickhouse install

At this stage, you can ignore the prompt to run the install command.

Start the server

Run the following command to start the ClickHouse server:

./clickhouse server

You should see the terminal fill up with logging. This is expected as the default logging level in ClickHouse is set to trace rather than warning.

Start the client

Use clickhouse-client to connect to your ClickHouse service. Open a new terminal, change directories to where your clickhouse binary is saved, and run the following command:

./clickhouse client

You should see a smiling face as it connects to your service running on localhost.

Create a table

Use CREATE TABLE to define a new table. In ClickHouse, tables require an ENGINE clause. Use MergeTree to take advantage of the performance benefits of ClickHouse:

CREATE TABLE test_table (
    id UInt32,
    name String
) ENGINE = MergeTree()
ORDER BY id;

Insert data

You can use the familiar INSERT INTO TABLE command. For example:

INSERT INTO test_table (id, name) VALUES (1, 'John'), (2, 'Jane');

Query the table

To query the table, you can use a simple SELECT statement:

SELECT * FROM test_table;

Article 2: Data Ingestion in ClickHouse

Introduction

ClickHouse integrates with a number of solutions for data integration and transformation. This article will introduce some common data ingestion tools.

Data Ingestion Tools

  • Airbyte: An open-source data integration platform. It allows the creation of ELT data pipelines and is shipped with more than 140 out-of-the-box connectors.
  • Apache Spark: A multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
  • Amazon Glue: A fully managed, serverless data integration service provided by Amazon Web Services (AWS), simplifying the process of discovering, preparing, and transforming data for analytics, machine learning, and application development.
  • Azure Synapse: A fully managed, cloud-based analytics service provided by Microsoft Azure, combining big data and data warehousing to simplify data integration, transformation, and analytics at scale using SQL, Apache Spark, and data pipelines.
  • Apache Beam: An open-source, unified programming model that enables developers to define and execute both batch and stream (continuous) data processing pipelines.
  • dbt: Enables analytics engineers to transform data in their warehouses by simply writing select statements.
  • dlt: An open-source library that you can add to your Python scripts to load data from various and often messy data sources into well-structured, live datasets.
  • Fivetran: An automated data movement platform moving data out of, into, and across your cloud data platforms.
  • NiFi: An open-source workflow management software designed to automate data flow between software systems.
  • Vector: A high-performance observability data pipeline that puts organizations in control of their observability data.

Article 3: Query Optimization in ClickHouse

Understanding Query Performance

The best time to think about performance optimization is when you are setting up your data schema, before importing data into ClickHouse for the first time. However, it’s difficult to predict how much your data will grow or what types of queries will be executed.
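
For example, the ORDER BY key you choose in the table definition determines how data is sorted on disk and which parts of it ClickHouse can skip at query time. A minimal sketch, assuming a hypothetical trips table that is mostly filtered by pickup_datetime:

-- Hypothetical schema: the ORDER BY key matches the most common filter column,
-- so queries restricting on pickup_datetime can skip irrelevant granules.
CREATE TABLE trips (
    trip_id UInt64,
    pickup_datetime DateTime,
    fare_amount Float32
) ENGINE = MergeTree()
ORDER BY pickup_datetime;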

General Considerations

When ClickHouse executes a query, the following steps occur:

  1. Query Parsing and Analysis: The query is parsed and analyzed to generate a general query execution plan.
  2. Query Optimization: The query execution plan is optimized, unnecessary data is trimmed, and a query pipeline is built from the query plan.
  3. Query Pipeline Execution: Data is read and processed in parallel. This is the stage where ClickHouse actually performs query operations such as filtering, aggregating, and sorting.
  4. Final Processing: The results are merged, sorted, and formatted into the final result, then sent to the client.

Finding Slow Queries

By default, ClickHouse collects and records information about each executed query in the system.query_log table. You can find slow-running queries and display resource usage information for each query. For example, to find the five longest-running queries in the NYC taxi dataset:

SELECT query, query_duration_ms, read_rows, memory_usage
FROM system.query_log
ORDER BY query_duration_ms DESC
LIMIT 5;
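
The query log records several entries per query (start, finish, exceptions), so in practice you may want to restrict the search to finished queries within a recent time window. A variation of the query above; the one-day window is only an illustration:

SELECT query, query_duration_ms, read_rows, memory_usage
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 DAY
ORDER BY query_duration_ms DESC
LIMIT 5;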

Using EXPLAIN Statements

ClickHouse supports the EXPLAIN statement to understand how queries are executed. For example, EXPLAIN indexes = 1 shows the query plan, and EXPLAIN PIPELINE shows the specific execution strategy.

EXPLAIN indexes = 1 SELECT * FROM nyc_taxi.trips_small_inferred WHERE speed > 30;
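
To inspect the execution strategy for the same query, you can run EXPLAIN PIPELINE on it:

EXPLAIN PIPELINE SELECT * FROM nyc_taxi.trips_small_inferred WHERE speed > 30;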

Article 4: Monitoring ClickHouse

Monitoring in ClickHouse Cloud

The monitoring data in ClickHouse Cloud can be accessed through the built-in dashboard ($HOST:$PORT/dashboard, requires user and password) and directly in the main service console.

Metrics in the Dashboard

The built-in advanced observability dashboard shows the following metrics:

  • Queries/second
  • CPU usage (cores)
  • Queries running
  • Merges running
  • Selected bytes/second
  • IO wait
  • CPU wait
  • OS CPU Usage (userspace)
  • OS CPU Usage (kernel)
  • Read from disk
  • Read from filesystem
  • Memory (tracked)
  • Inserted rows/second
  • Total MergeTree parts
  • Max parts for partition

Server Metrics

ClickHouse server has embedded instruments for self-state monitoring. You can track server events using server logs. Metrics can be found in the system.metrics, system.events, and system.asynchronous_metrics tables. You can configure ClickHouse to export metrics to Graphite or Prometheus. To monitor server availability, you can send an HTTP GET request to /ping. For cluster configurations, use /replicas_status.
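
For instance, current metric values can be read with ordinary SQL from the system tables mentioned above; the filters below are only illustrations:

-- Instantaneous metrics, e.g. anything memory-related
SELECT metric, value, description
FROM system.metrics
WHERE metric ILIKE '%memory%';

-- Cumulative event counters, e.g. the number of SELECT queries served
SELECT event, value
FROM system.events
WHERE event = 'SelectQuery';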

Article 5: Backup and Restore in ClickHouse

Background

Replication in ClickHouse protects against hardware failures but not human errors. To mitigate possible human errors, you should prepare a backup and restore strategy in advance.

Backup to a Local Disk

To configure a backup destination, add a file to /etc/clickhouse-server/config.d/backup_disk.xml specifying the backup disk and path. The destination can then be referenced in backup commands, for example as Disk('backups', '1.zip').
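
A minimal sketch of what that configuration file might contain, assuming backups should be written to /backups/ on the local filesystem (the disk name and path are illustrative):

<clickhouse>
    <storage_configuration>
        <disks>
            <!-- local disk named "backups"; referenced by Disk('backups', ...) -->
            <backups>
                <type>local</type>
                <path>/backups/</path>
            </backups>
        </disks>
    </storage_configuration>
    <backups>
        <allowed_disk>backups</allowed_disk>
        <allowed_path>/backups/</allowed_path>
    </backups>
</clickhouse>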

Backup and Restore Commands

The general syntax for backup and restore commands is:

BACKUP|RESTORE
  TABLE [db.]table_name [AS [db.]table_name_in_backup]
    [PARTITION[S] partition_expr [,...]] |
  DICTIONARY [db.]dictionary_name [AS [db.]name_in_backup] |
  DATABASE database_name [AS database_name_in_backup]
    [EXCEPT TABLES ...] |
  TEMPORARY TABLE table_name [AS table_name_in_backup] |
  VIEW view_name [AS view_name_in_backup] |
  ALL TEMPORARY TABLES [EXCEPT ...] |
  ALL [EXCEPT ...] [,...]
  [ON CLUSTER 'cluster_name']
  TO|FROM File('<path>/<filename>') | Disk('<disk_name>', '<path>/') | S3('<S3 endpoint>/<path>', '<Access key ID>', '<Secret access key>')
  [SETTINGS base_backup = File('<path>/<filename>') | Disk(...) | S3('<S3 endpoint>/<path>', '<Access key ID>', '<Secret access key>')]
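
As a concrete illustration, backing up the test_table from Article 1 to the backups disk configured above, and restoring it under a new name, might look like this (the archive name is arbitrary):

BACKUP TABLE test_table TO Disk('backups', '1.zip');

-- Restore into a different name so the original table is left untouched
RESTORE TABLE test_table AS test_table_restored FROM Disk('backups', '1.zip');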

If you back something up but never try to restore it, the restore may not work properly when you actually need it. So, automate the restore process and practice it regularly on a spare ClickHouse cluster.