三台机器部署 ClickHouse 高可用集群实战记录
本文是一份可发布版部署记录。真实 IP、域名、账号、密码、下载链接、业务目录名、机器唯一标识等敏感信息已经替换为占位符。命令中的 <...> 需要按自己的环境替换。目标与拓扑
这次目标是用三台数据节点部署一套 ClickHouse 高可用集群,拓扑采用:
1 shard x 3 replicas
含义是:集群只有一个逻辑分片,三台机器都保存同一份数据的完整副本。任意一台数据节点宕机时,只要 ClickHouse Keeper 仍然有多数派,剩余节点仍可继续提供读写服务。
规划节点如下:
| 主机名 | 示例地址 | 角色 |
|---|---|---|
| ch-01 | <ch-01-ip> | ClickHouse Server + ClickHouse Keeper |
| ch-02 | <ch-02-ip> | ClickHouse Server + ClickHouse Keeper |
| ch-03 | <ch-03-ip> | ClickHouse Server + ClickHouse Keeper |
| ops-01 | <ops-01-ip> | 监控、统一查询入口、后续业务服务 |
机器规格:
- ClickHouse 数据节点:16 核、64GB 内存。
- 每台数据节点有一块约 3TB 的数据盘,挂载到
/www。 - ClickHouse 程序、配置、数据、日志、临时目录和 Keeper 数据都尽量放到
/www/clickhouse。 - ops 节点用于放 Prometheus、Grafana、Alertmanager、HAProxy 和后续业务服务。
容量口径要提前说清楚:
1 shard x 3 replicas的目标是高可用,不是扩大有效容量。- 三台机器各 3TB,裸盘总量看起来是 9TB。
- 但因为三台都保存同一份数据,所以有效 ClickHouse 物理数据容量接近单台 3TB。
- 商业运行不建议把数据盘打满,通常 70%-80% 就要开始规划扩容。
初始连接与 SSH 互信
先从本地机器检查三台服务器 SSH 是否可达:
for h in ch-01 ch-02 ch-03; do
ssh -o ConnectTimeout=5 "$h" 'hostname; uname -r; cat /etc/redhat-release'
done
如果还没有配置本地 SSH alias,可以在本地 ~/.ssh/config 加:
Host ch-01
HostName <ch-01-ip>
User <ssh-user>
IdentityFile <path-to-private-key>
Host ch-02
HostName <ch-02-ip>
User <ssh-user>
IdentityFile <path-to-private-key>
Host ch-03
HostName <ch-03-ip>
User <ssh-user>
IdentityFile <path-to-private-key>
Host ops-01
HostName <ops-01-ip>
User <ssh-user>
IdentityFile <path-to-private-key>
为了后续批量部署,让三台 ClickHouse 节点之间可以 SSH 免密互访。下面命令以有 root 权限的运维用户为例,如果使用普通用户,需要把路径和权限改成对应用户并配合 sudo。
每台机器先生成密钥:
ssh ch-01 'test -f /root/.ssh/id_ed25519 || ssh-keygen -t ed25519 -N "" -f /root/.ssh/id_ed25519'
ssh ch-02 'test -f /root/.ssh/id_ed25519 || ssh-keygen -t ed25519 -N "" -f /root/.ssh/id_ed25519'
ssh ch-03 'test -f /root/.ssh/id_ed25519 || ssh-keygen -t ed25519 -N "" -f /root/.ssh/id_ed25519'
收集三台公钥,再写入每台机器的 authorized_keys:
tmpfile="$(mktemp)"
for h in ch-01 ch-02 ch-03; do
ssh "$h" 'cat /root/.ssh/id_ed25519.pub'
done | sort -u > "$tmpfile"
for h in ch-01 ch-02 ch-03; do
scp "$tmpfile" "$h:/tmp/cluster-authorized-keys"
ssh "$h" '
mkdir -p /root/.ssh
touch /root/.ssh/authorized_keys
cat /tmp/cluster-authorized-keys >> /root/.ssh/authorized_keys
sort -u /root/.ssh/authorized_keys -o /root/.ssh/authorized_keys
chmod 700 /root/.ssh
chmod 600 /root/.ssh/authorized_keys
'
done
rm -f "$tmpfile"
验证 3x3 互访:
for src in ch-01 ch-02 ch-03; do
for dst in ch-01 ch-02 ch-03; do
ssh "$src" "ssh -o BatchMode=yes -o ConnectTimeout=5 $dst hostname"
done
done
主机名与 hosts
三台机器如果都是模板克隆出来的,初始 hostname 可能相同。先改成清晰的名字:
ssh ch-01 'hostnamectl set-hostname ch-01'
ssh ch-02 'hostnamectl set-hostname ch-02'
ssh ch-03 'hostnamectl set-hostname ch-03'
在三台数据节点和 ops 节点都写入 /etc/hosts:
for h in ch-01 ch-02 ch-03 ops-01; do
ssh "$h" 'cat >>/etc/hosts <<EOF
<ch-01-ip> ch-01
<ch-02-ip> ch-02
<ch-03-ip> ch-03
<ops-01-ip> ops-01
EOF'
done
验证解析:
for h in ch-01 ch-02 ch-03 ops-01; do
ssh "$h" 'getent hosts ch-01 ch-02 ch-03 ops-01'
done
磁盘检查
确认 /www 是否挂载到数据盘:
for h in ch-01 ch-02 ch-03; do
echo "===== $h ====="
ssh "$h" 'df -h /www; findmnt /www; lsblk -f'
done
检查系统日志中是否有明显磁盘错误:
for h in ch-01 ch-02 ch-03; do
echo "===== $h ====="
ssh "$h" "dmesg | egrep -i 'i/o error|medium error|buffer i/o|blk_update_request|reset|ata' | tail -50 || true"
done
做一次轻量 direct IO 测试,确认数据盘没有离谱瓶颈:
for h in ch-01 ch-02 ch-03; do
echo "===== $h write ====="
ssh "$h" 'dd if=/dev/zero of=/www/io-test.bin bs=1G count=2 oflag=direct status=progress'
echo "===== $h read ====="
ssh "$h" 'dd if=/www/io-test.bin of=/dev/null bs=1G iflag=direct status=progress'
ssh "$h" 'rm -f /www/io-test.bin'
done
如果机器是虚拟机,客机内通常看不到真实物理盘 SMART。虚拟盘健康还需要在云平台、虚拟化平台或底层存储侧确认。
网络与下载能力
安装前先测试外部软件源是否可访问:
for h in ch-01 ch-02 ch-03; do
echo "===== $h ====="
ssh "$h" '
curl -I --connect-timeout 10 https://packages.clickhouse.com/tgz/lts/ | head
curl -I --connect-timeout 10 https://github.com/ClickHouse/ClickHouse | head
'
done
如果服务器外网速度不稳定,可以先在本地下载 ClickHouse 安装包,再通过 scp 或 rsync 分发到三台机器:
VERSION=26.3.13.31
mkdir -p "packages/clickhouse-$VERSION"
cd "packages/clickhouse-$VERSION"
for pkg in common-static server client; do
curl -fLO "https://packages.clickhouse.com/tgz/lts/clickhouse-${pkg}-${VERSION}-amd64.tgz"
curl -fLO "https://packages.clickhouse.com/tgz/lts/clickhouse-${pkg}-${VERSION}-amd64.tgz.sha512"
done
sha512sum -c *.sha512
for h in ch-01 ch-02 ch-03; do
ssh "$h" "mkdir -p /www/clickhouse/packages/$VERSION"
rsync -av ./ "$h:/www/clickhouse/packages/$VERSION/"
done
部署前风险项
部署前重点检查几项:
for h in ch-01 ch-02 ch-03; do
echo "===== $h ====="
ssh "$h" '
echo -n "THP enabled: "; cat /sys/kernel/mm/transparent_hugepage/enabled
echo -n "THP defrag: "; cat /sys/kernel/mm/transparent_hugepage/defrag
ulimit -n
systemctl is-active firewalld || true
'
done
这里重点处理:
- 关闭 THP,减少数据库服务延迟抖动。
- 提高
nofile和nproc。 - 只对集群内网开放 ClickHouse、Keeper、监控端口。
- 修复克隆机器可能重复的 machine-id。
修复 machine-id
模板克隆机器常见问题是 /etc/machine-id 完全相同。重复的 machine-id 会影响日志、监控、服务注册和部分系统组件识别。
检查:
for h in ch-01 ch-02 ch-03; do
ssh "$h" 'echo "$(hostname) $(cat /etc/machine-id)"'
done
修复命令:
for h in ch-01 ch-02 ch-03; do
ssh "$h" '
set -e
ts=$(date +%Y%m%d%H%M%S)
cp -f /etc/machine-id /etc/machine-id.bak.$ts 2>/dev/null || true
cp -f /var/lib/dbus/machine-id /var/lib/dbus/machine-id.bak.$ts 2>/dev/null || true
rm -f /etc/machine-id /var/lib/dbus/machine-id
systemd-machine-id-setup
cp -f /etc/machine-id /var/lib/dbus/machine-id
cat /etc/machine-id
'
done
修复后重新检查,确保三台输出不同。发布文章时不建议贴真实 machine-id。
ClickHouse 版本选择
生产环境第一套集群不建议追最新版本。我的选择是 LTS:
ClickHouse 26.3.13.31 LTS
选择理由:
- LTS 更适合生产起步。
- 不追最新 stable,减少新版本变化带来的风险。
- 三台机器 CPU 支持 ClickHouse 当前 x86_64 二进制需要的指令集。
- 使用官方
tgz/lts包,便于把程序主体放在/www。
检查 CPU 指令:
for h in ch-01 ch-02 ch-03; do
ssh "$h" "lscpu | egrep 'Model name|Flags' | head -2"
done
关闭 THP
创建 systemd 服务,确保现在和开机后都关闭 THP:
for h in ch-01 ch-02 ch-03; do
ssh "$h" 'cat >/etc/systemd/system/disable-thp.service <<EOF
[Unit]
Description=Disable Transparent Huge Pages
After=multi-user.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c "echo never > /sys/kernel/mm/transparent_hugepage/enabled"
ExecStart=/bin/sh -c "echo never > /sys/kernel/mm/transparent_hugepage/defrag"
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now disable-thp.service
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag'
done
看到 [never] 即可。
ClickHouse 基础安装
目录规划:
/www/clickhouse/current -> /www/clickhouse/releases/<version>
/www/clickhouse/releases/<version>
/www/clickhouse/etc/clickhouse-server
/www/clickhouse/etc/clickhouse-keeper
/www/clickhouse/data
/www/clickhouse/tmp
/www/clickhouse/logs/clickhouse-server
/www/clickhouse/logs/clickhouse-keeper
/www/clickhouse/keeper
创建系统用户:
for h in ch-01 ch-02 ch-03; do
ssh "$h" 'id clickhouse >/dev/null 2>&1 || useradd --system --home-dir /www/clickhouse --shell /sbin/nologin clickhouse'
done
解压安装包:
VERSION=26.3.13.31
for h in ch-01 ch-02 ch-03; do
ssh "$h" "
set -e
BASE=/www/clickhouse
PKG_DIR=\$BASE/packages/$VERSION
RELEASE=\$BASE/releases/$VERSION
EXTRACT=\$RELEASE/.extract
mkdir -p \$RELEASE/bin \$EXTRACT
cd \$PKG_DIR
sha512sum -c *.sha512
rm -rf \$EXTRACT
mkdir -p \$EXTRACT
tar -xzf clickhouse-common-static-$VERSION-amd64.tgz -C \$EXTRACT
tar -xzf clickhouse-server-$VERSION-amd64.tgz -C \$EXTRACT
tar -xzf clickhouse-client-$VERSION-amd64.tgz -C \$EXTRACT
install -m 0755 \$EXTRACT/clickhouse-common-static-$VERSION/usr/bin/clickhouse \$RELEASE/bin/clickhouse
for tool in clickhouse-server clickhouse-client clickhouse-local clickhouse-keeper clickhouse-keeper-client clickhouse-benchmark clickhouse-format; do
ln -sfn clickhouse \$RELEASE/bin/\$tool
done
ln -sfn \$RELEASE \$BASE/current
\$BASE/current/bin/clickhouse local --version
"
done
创建基础目录:
for h in ch-01 ch-02 ch-03; do
ssh "$h" '
BASE=/www/clickhouse
mkdir -p \
$BASE/etc/clickhouse-server \
$BASE/etc/clickhouse-keeper \
$BASE/data \
$BASE/tmp \
$BASE/user_files \
$BASE/format_schemas \
$BASE/access \
$BASE/logs/clickhouse-server \
$BASE/logs/clickhouse-keeper \
$BASE/keeper/coordination/log \
$BASE/keeper/coordination/snapshots
chown -R clickhouse:clickhouse $BASE
'
done
提高 limits:
for h in ch-01 ch-02 ch-03; do
ssh "$h" 'cat >/etc/security/limits.d/clickhouse.conf <<EOF
clickhouse soft nofile 1048576
clickhouse hard nofile 1048576
clickhouse soft nproc 65536
clickhouse hard nproc 65536
EOF'
done
ClickHouse Keeper 配置
Keeper 是 ClickHouse 的协调服务,负责复制元数据、分布式 DDL 队列、锁和副本协作。它不是业务主节点,也不直接搬运大数据。
三节点 Keeper 的关键点:
server_id三台必须不同。- Raft 端口用于 Keeper 节点间通信。
- Client 端口供 ClickHouse Server 连接 Keeper。
- 三台 Keeper 允许挂 1 台;挂 2 台时多数派丢失。
示例 keeper_config.xml:
<clickhouse>
<logger>
<level>information</level>
<log>/www/clickhouse/logs/clickhouse-keeper/clickhouse-keeper.log</log>
<errorlog>/www/clickhouse/logs/clickhouse-keeper/clickhouse-keeper.err.log</errorlog>
<size>1000000000</size>
<count>10</count>
</logger>
<keeper_server>
<tcp_port>9181</tcp_port>
<server_id>1</server_id>
<log_storage_path>/www/clickhouse/keeper/coordination/log</log_storage_path>
<snapshot_storage_path>/www/clickhouse/keeper/coordination/snapshots</snapshot_storage_path>
<coordination_settings>
<operation_timeout_ms>10000</operation_timeout_ms>
<session_timeout_ms>30000</session_timeout_ms>
<raft_logs_level>information</raft_logs_level>
</coordination_settings>
<raft_configuration>
<server>
<id>1</id>
<hostname>ch-01</hostname>
<port>9234</port>
</server>
<server>
<id>2</id>
<hostname>ch-02</hostname>
<port>9234</port>
</server>
<server>
<id>3</id>
<hostname>ch-03</hostname>
<port>9234</port>
</server>
</raft_configuration>
</keeper_server>
</clickhouse>
在 ch-02、ch-03 上分别把 <server_id> 改成 2、3。
systemd 服务:
[Unit]
Description=ClickHouse Keeper
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=clickhouse
Group=clickhouse
ExecStart=/www/clickhouse/current/bin/clickhouse-keeper --config-file=/www/clickhouse/etc/clickhouse-keeper/keeper_config.xml
Restart=always
RestartSec=5
LimitNOFILE=1048576
LimitNPROC=65536
[Install]
WantedBy=multi-user.target
启动:
for h in ch-01 ch-02 ch-03; do
ssh "$h" 'systemctl daemon-reload && systemctl enable --now clickhouse-keeper.service'
done
内网端口开放
只对集群内网 CIDR 放通必要端口:
| 端口 | 用途 |
|---|---|
| 8123 | ClickHouse HTTP |
| 9000 | ClickHouse Native TCP |
| 9009 | ClickHouse interserver |
| 9181 | Keeper client |
| 9234 | Keeper Raft |
| 9363 | ClickHouse Prometheus metrics |
| 9100 | node_exporter |
firewalld 示例:
CLUSTER_CIDR="<cluster-cidr>"
for h in ch-01 ch-02 ch-03; do
for port in 8123 9000 9009 9181 9234 9363 9100; do
ssh "$h" "firewall-cmd --permanent --add-rich-rule='rule family=\"ipv4\" source address=\"$CLUSTER_CIDR\" port port=\"$port\" protocol=\"tcp\" accept'"
done
ssh "$h" 'firewall-cmd --reload'
done
不建议把 ClickHouse 和 Keeper 端口直接暴露到公网。
Keeper Quorum 验证
检查 Keeper 端口:
for h in ch-01 ch-02 ch-03; do
echo "===== $h ====="
ssh "$h" 'ss -lntp | egrep ":9181|:9234"'
done
检查服务:
for h in ch-01 ch-02 ch-03; do
ssh "$h" 'systemctl is-active clickhouse-keeper.service'
done
用 keeper-client 验证:
for h in ch-01 ch-02 ch-03; do
ssh "$h" '/www/clickhouse/current/bin/clickhouse keeper-client --host 127.0.0.1 --port 9181 --query "ls /"'
done
看 leader/follower:
for h in ch-01 ch-02 ch-03; do
echo "===== $h ====="
ssh "$h" "grep -E 'leader|follower|elected|term' /www/clickhouse/logs/clickhouse-keeper/clickhouse-keeper.log | tail -20"
done
ClickHouse Server 配置
Server 的核心配置片段如下:
<clickhouse>
<display_name>ch-01</display_name>
<listen_host>0.0.0.0</listen_host>
<http_port>8123</http_port>
<tcp_port>9000</tcp_port>
<interserver_http_port>9009</interserver_http_port>
<interserver_http_host>ch-01</interserver_http_host>
<timezone>Asia/Shanghai</timezone>
<path>/www/clickhouse/data/</path>
<tmp_path>/www/clickhouse/tmp/</tmp_path>
<user_files_path>/www/clickhouse/user_files/</user_files_path>
<format_schema_path>/www/clickhouse/format_schemas/</format_schema_path>
<access_control_path>/www/clickhouse/access/</access_control_path>
<user_directories>
<users_xml>
<path>/www/clickhouse/etc/clickhouse-server/users.xml</path>
</users_xml>
<local_directory>
<path>/www/clickhouse/access/</path>
</local_directory>
</user_directories>
<prometheus>
<endpoint>/metrics</endpoint>
<port>9363</port>
<metrics>true</metrics>
<events>true</events>
<asynchronous_metrics>true</asynchronous_metrics>
<errors>true</errors>
</prometheus>
<remote_servers>
<cluster_1s3r>
<shard>
<internal_replication>true</internal_replication>
<replica><host>ch-01</host><port>9000</port><user>cluster</user><password>cluster_password_placeholder</password></replica>
<replica><host>ch-02</host><port>9000</port><user>cluster</user><password>cluster_password_placeholder</password></replica>
<replica><host>ch-03</host><port>9000</port><user>cluster</user><password>cluster_password_placeholder</password></replica>
</shard>
</cluster_1s3r>
</remote_servers>
<zookeeper>
<node index="1"><host>ch-01</host><port>9181</port></node>
<node index="2"><host>ch-02</host><port>9181</port></node>
<node index="3"><host>ch-03</host><port>9181</port></node>
</zookeeper>
<macros>
<cluster>cluster_1s3r</cluster>
<shard>01</shard>
<replica>ch-01</replica>
</macros>
<distributed_ddl>
<path>/clickhouse/task_queue/ddl</path>
</distributed_ddl>
</clickhouse>
注意:
- 每台的
display_name、interserver_http_host和<macros><replica>要改成本机名。 user_directories很重要,漏掉后users.xml不会被加载,可能出现Settings profile default not found。- 密码不要明文写在公开文章里,生产环境也建议用独立密钥文件或部署系统注入。
Server systemd:
[Unit]
Description=ClickHouse Server
After=network-online.target clickhouse-keeper.service
Wants=network-online.target
[Service]
Type=simple
User=clickhouse
Group=clickhouse
ExecStart=/www/clickhouse/current/bin/clickhouse-server --config-file=/www/clickhouse/etc/clickhouse-server/config.xml
Restart=always
RestartSec=5
LimitNOFILE=1048576
LimitNPROC=65536
[Install]
WantedBy=multi-user.target
启动:
for h in ch-01 ch-02 ch-03; do
ssh "$h" 'systemctl daemon-reload && systemctl enable --now clickhouse-server.service'
done
用户、权限和日志
生产环境不建议保留内网无密码 default。这次处理思路:
default只允许 localhost 和集群节点访问,并设置密码。cluster用户用于节点间remote_servers。app_readonly用户用于业务查询,只读,限制来源为 ops 节点。monitor用户用于监控查询,只读,资源限制更低。
生成密码 hash 示例:
printf '%s' '<strong-password>' | sha256sum | awk '{print $1}'
users.xml 关键片段:
<clickhouse>
<profiles>
<default>
<max_memory_usage>17179869184</max_memory_usage>
<max_execution_time>600</max_execution_time>
<use_uncompressed_cache>0</use_uncompressed_cache>
</default>
<app_readonly>
<readonly>1</readonly>
<max_memory_usage>8589934592</max_memory_usage>
<max_execution_time>120</max_execution_time>
<max_threads>8</max_threads>
</app_readonly>
<monitor>
<readonly>1</readonly>
<max_memory_usage>536870912</max_memory_usage>
<max_execution_time>10</max_execution_time>
</monitor>
</profiles>
<users>
<default>
<password_sha256_hex>default_password_sha256_placeholder</password_sha256_hex>
<networks>
<ip>127.0.0.1</ip>
<ip>::1</ip>
<ip><ch-01-ip></ip>
<ip><ch-02-ip></ip>
<ip><ch-03-ip></ip>
</networks>
<profile>default</profile>
<quota>default</quota>
</default>
<cluster>
<password_sha256_hex>cluster_password_sha256_placeholder</password_sha256_hex>
<networks>
<ip><ch-01-ip></ip>
<ip><ch-02-ip></ip>
<ip><ch-03-ip></ip>
</networks>
<profile>default</profile>
<quota>default</quota>
</cluster>
<app_readonly>
<password_sha256_hex>app_password_sha256_placeholder</password_sha256_hex>
<networks>
<ip>127.0.0.1</ip>
<ip>::1</ip>
<ip><ops-01-ip></ip>
</networks>
<profile>app_readonly</profile>
<quota>default</quota>
</app_readonly>
<monitor>
<password_sha256_hex>monitor_password_sha256_placeholder</password_sha256_hex>
<networks>
<ip>127.0.0.1</ip>
<ip>::1</ip>
<ip><ops-01-ip></ip>
</networks>
<profile>monitor</profile>
<quota>default</quota>
</monitor>
</users>
</clickhouse>
查询日志和 part 日志开启,并设置 TTL:
<query_log>
<database>system</database>
<table>query_log</table>
<partition_by>toYYYYMM(event_date)</partition_by>
<ttl>event_date + INTERVAL 14 DAY DELETE</ttl>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
</query_log>
<part_log>
<database>system</database>
<table>part_log</table>
<partition_by>toYYYYMM(event_date)</partition_by>
<ttl>event_date + INTERVAL 14 DAY DELETE</ttl>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
</part_log>
重启并验证:
for h in ch-01 ch-02 ch-03; do
ssh "$h" 'systemctl restart clickhouse-server.service && systemctl is-active clickhouse-server.service'
done
ssh ops-01 'curl -sS "http://127.0.0.1:8123/" --data-binary "SELECT 1"'
集群启动验证
检查端口:
for h in ch-01 ch-02 ch-03; do
echo "===== $h ====="
ssh "$h" 'ss -lntp | egrep ":8123|:9000|:9009|:9363"'
done
基础查询:
for h in ch-01 ch-02 ch-03; do
ssh "$h" '/www/clickhouse/current/bin/clickhouse-client --query "SELECT hostName(), version()"'
done
集群拓扑:
SELECT cluster, shard_num, replica_num, host_name, port
FROM system.clusters
WHERE cluster = 'cluster_1s3r'
ORDER BY shard_num, replica_num;
三副本应看到:
cluster_1s3r 1 1 ch-01 9000
cluster_1s3r 1 2 ch-02 9000
cluster_1s3r 1 3 ch-03 9000
分片、副本和 Keeper 的理解
副本是同一份数据复制多份。当前三台机器各保存一份完整数据:
ch-01 = replica
ch-02 = replica
ch-03 = replica
分片是把不同数据拆到不同机器组。当前只有 1 个分片,所以没有把数据拆开,主要目标是高可用。
Keeper 负责协调复制:
- 某台 ClickHouse 写入本地 part。
- 它把 part 信息登记到 Keeper。
- 其他副本从 Keeper 看到复制任务。
- 其他 ClickHouse Server 通过 9009 等节点间链路拉取 part。
Keeper 内部有 leader/follower,但这不是业务主库。ClickHouse 查询层没有固定主节点,三台健康副本都可以查。
复制表验证
创建测试库:
CREATE DATABASE IF NOT EXISTS ch_test ON CLUSTER cluster_1s3r;
创建复制表:
CREATE TABLE IF NOT EXISTS ch_test.replica_test ON CLUSTER cluster_1s3r
(
id UInt64,
source_node String,
created_at DateTime DEFAULT now()
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/ch_test/replica_test', '{replica}')
ORDER BY id;
创建 Distributed 表:
CREATE TABLE IF NOT EXISTS ch_test.replica_test_all ON CLUSTER cluster_1s3r
AS ch_test.replica_test
ENGINE = Distributed(cluster_1s3r, ch_test, replica_test, rand());
写入测试:
INSERT INTO ch_test.replica_test (id, source_node) VALUES (1, hostName());
查询三副本:
SELECT hostName(), *
FROM clusterAllReplicas(cluster_1s3r, ch_test.replica_test)
ORDER BY hostName(), id;
检查复制状态:
SELECT
hostName(),
database,
table,
is_readonly,
absolute_delay,
total_replicas,
active_replicas,
queue_size
FROM clusterAllReplicas(cluster_1s3r, system.replicas)
WHERE database = 'ch_test';
期望:
is_readonly = 0
absolute_delay = 0
total_replicas = 3
active_replicas = 3
queue_size = 0
单节点下线测试
模拟下线 ch-01:
ssh ch-01 'systemctl stop clickhouse-server.service clickhouse-keeper.service'
从 ch-02 验证 Keeper 仍可访问、表仍可查询:
ssh ch-02 '/www/clickhouse/current/bin/clickhouse keeper-client --host 127.0.0.1 --port 9181 --query "ls /"'
ssh ch-02 '/www/clickhouse/current/bin/clickhouse-client --query "SELECT count() FROM ch_test.replica_test_all"'
下线期间写入第二行:
INSERT INTO ch_test.replica_test (id, source_node) VALUES (2, hostName());
检查剩余副本:
SELECT hostName(), active_replicas, total_replicas, is_readonly, absolute_delay, queue_size
FROM clusterAllReplicas(cluster_1s3r, system.replicas)
WHERE database = 'ch_test';
恢复 ch-01:
ssh ch-01 'systemctl start clickhouse-keeper.service clickhouse-server.service'
如果需要主动等待追平:
SYSTEM SYNC REPLICA ch_test.replica_test;
最终应恢复到 active_replicas=3。
HAProxy 统一查询入口
业务不建议直接写死某一台 ClickHouse。可以在 ops 节点部署 HAProxy:
业务服务 -> ops-01:8123 -> ch-01/ch-02/ch-03:8123
业务服务 -> ops-01:9000 -> ch-01/ch-02/ch-03:9000
安装:
ssh ops-01 'yum install -y haproxy'
示例配置:
global
log 127.0.0.1 local0
maxconn 4096
defaults
log global
mode tcp
option tcplog
timeout connect 5s
timeout client 10m
timeout server 10m
listen clickhouse_http
bind 0.0.0.0:8123
mode http
option httplog
option httpchk GET /ping
http-check expect string Ok.
balance roundrobin
server ch-01 ch-01:8123 check
server ch-02 ch-02:8123 check
server ch-03 ch-03:8123 check
listen clickhouse_native
bind 0.0.0.0:9000
mode tcp
balance roundrobin
server ch-01 ch-01:9000 check
server ch-02 ch-02:9000 check
server ch-03 ch-03:9000 check
listen stats
bind 127.0.0.1:8404
mode http
stats enable
stats uri /
启动:
ssh ops-01 'systemctl enable --now haproxy.service && systemctl status haproxy.service --no-pager'
验证 HTTP 入口:
for i in $(seq 1 10); do
ssh ops-01 'curl -sS http://127.0.0.1:8123 --data-binary "SELECT hostName()"'
done
验证 Native 入口:
ssh ops-01 'clickhouse-client --host 127.0.0.1 --port 9000 --query "SELECT hostName()"'
业务部署在 ops 本机时,优先连:
HTTP: http://127.0.0.1:8123
Native: 127.0.0.1:9000
公网业务入口占位
ClickHouse 不直接暴露公网。公网入口只进入业务服务,业务服务再通过内网访问 HAProxy。
示例链路:
用户 -> https://api.example.com -> 前置网关/反代 -> ops-01:443 -> 业务服务 -> HAProxy -> ClickHouse
如果先用 Caddy 占位:
:443 {
handle /healthz {
respond "ok"
}
handle {
reverse_proxy 127.0.0.1:8080
}
}
验证:
curl -I https://api.example.com/healthz
高并发业务场景下,Caddy、Nginx、OpenResty 都能做入口。真正选择建议基于压测,而不是凭感觉。这里 ClickHouse 查询入口已经由 HAProxy 处理,公网反代只负责业务 HTTP 层。
ops 节点和 Docker
ops 节点安装 Docker:
ssh ops-01 'bash <(curl -fsSL https://get.docker.com)'
ssh ops-01 'systemctl enable --now docker'
ssh ops-01 'docker version'
ssh ops-01 'docker compose version'
如果企业环境有内部镜像源,可以写入 /etc/docker/daemon.json:
{
"registry-mirrors": ["https://<your-docker-mirror>"]
}
重启:
ssh ops-01 'systemctl restart docker && docker info | grep -A3 "Registry Mirrors"'
安装 node_exporter:
NODE_EXPORTER_VERSION=1.9.1
curl -fLO "https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz"
for h in ch-01 ch-02 ch-03 ops-01; do
scp "node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz" "$h:/tmp/"
ssh "$h" "
cd /tmp
tar -xzf node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
mkdir -p /opt/node_exporter/releases
rm -rf /opt/node_exporter/releases/${NODE_EXPORTER_VERSION}
mv node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64 /opt/node_exporter/releases/${NODE_EXPORTER_VERSION}
ln -sfn /opt/node_exporter/releases/${NODE_EXPORTER_VERSION} /opt/node_exporter/current
useradd --system --shell /sbin/nologin node_exporter 2>/dev/null || true
chown -R node_exporter:node_exporter /opt/node_exporter
"
done
systemd:
[Unit]
Description=Prometheus Node Exporter
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
ExecStart=/opt/node_exporter/current/node_exporter
Restart=always
[Install]
WantedBy=multi-user.target
验证:
for h in ch-01 ch-02 ch-03 ops-01; do
ssh "$h" 'systemctl enable --now node_exporter.service; curl -sS http://127.0.0.1:9100/metrics | head'
done
Prometheus、Grafana 和 Alertmanager
监控链路:
ClickHouse /metrics -> Prometheus
node_exporter -> Prometheus
blackbox_exporter -> Prometheus
light SQL exporter -> Prometheus
Prometheus -> Grafana
Prometheus -> Alertmanager
目录:
/opt/ch-monitoring/monitoring
├── docker-compose.yml
├── prometheus/
│ ├── prometheus.yml
│ └── rules/clickhouse-basic.yml
├── alertmanager/alertmanager.yml
├── blackbox/blackbox.yml
└── grafana/provisioning/
docker-compose.yml 核心:
services:
prometheus:
image: prom/prometheus:v3.5.0
container_name: ch-monitor-prometheus
restart: unless-stopped
ports:
- "127.0.0.1:19090:9090"
volumes:
- ./prometheus:/etc/prometheus:ro
- prometheus_data:/prometheus
command:
- --config.file=/etc/prometheus/prometheus.yml
- --storage.tsdb.path=/prometheus
- --storage.tsdb.retention.time=30d
- --web.enable-lifecycle
alertmanager:
image: prom/alertmanager:v0.28.1
container_name: ch-monitor-alertmanager
restart: unless-stopped
ports:
- "127.0.0.1:19093:9093"
volumes:
- ./alertmanager:/etc/alertmanager:ro
- alertmanager_data:/alertmanager
command:
- --config.file=/etc/alertmanager/alertmanager.yml
- --storage.path=/alertmanager
grafana:
image: grafana/grafana-oss:12.0.2
container_name: ch-monitor-grafana
restart: unless-stopped
ports:
- "0.0.0.0:13000:3000"
volumes:
prometheus_data:
alertmanager_data:
grafana_data:
Prometheus 配置重点:
global:
scrape_interval: 15s
evaluation_interval: 15s
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
rule_files:
- /etc/prometheus/rules/*.yml
基础告警规则:
groups:
- name: clickhouse-basic
rules:
- alert: ClickHouseMetricsDown
expr: up{job="clickhouse"} == 0
for: 2m
labels:
severity: critical
- alert: ClickHouseHTTPPingDown
expr: probe_success{job="clickhouse-http-ping"} == 0
for: 2m
labels:
severity: critical
- alert: ClickHouseReplicaReadonly
expr: clickhouse_sql_readonly_replicas > 0
for: 1m
labels:
severity: critical
- alert: ClickHouseActiveReplicasLow
expr: clickhouse_sql_replicated_tables > 0 and clickhouse_sql_min_active_replicas < 3
for: 2m
labels:
severity: critical
- alert: ClickHouseReplicaDelayHigh
expr: clickhouse_sql_max_replica_absolute_delay_seconds > 60
for: 5m
labels:
severity: warning
- alert: ClickHouseDiskAlmostFull
expr: |
(node_filesystem_avail_bytes{job="node",mountpoint="/www",fstype!~"tmpfs|overlay"} /
node_filesystem_size_bytes{job="node",mountpoint="/www",fstype!~"tmpfs|overlay"}) < 0.15
for: 5m
labels:
severity: warning
- alert: NodeRootDiskAlmostFull
expr: |
(node_filesystem_avail_bytes{job="node",mountpoint="/",fstype!~"tmpfs|overlay"} /
node_filesystem_size_bytes{job="node",mountpoint="/",fstype!~"tmpfs|overlay"}) < 0.15
for: 5m
labels:
severity: warning
Alertmanager 先只接收不外发:
global:
resolve_timeout: 5m
route:
receiver: noop
group_by:
- alertname
- cluster
- instance_name
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receivers:
- name: noop
启动:
ssh ops-01 'cd /opt/ch-monitoring/monitoring && docker compose up -d'
校验:
ssh ops-01 'docker exec ch-monitor-prometheus promtool check config /etc/prometheus/prometheus.yml'
ssh ops-01 'docker exec ch-monitor-alertmanager amtool check-config /etc/alertmanager/alertmanager.yml'
ssh ops-01 'curl -sS http://127.0.0.1:19090/api/v1/alertmanagers'
ssh ops-01 'curl -sS http://127.0.0.1:19090/api/v1/alerts'
后续要接邮件、飞书、企业微信时,不需要改 Prometheus 采集,只需要在 Alertmanager receiver 里新增 email 或 webhook。
Grafana 面板
这次面板覆盖:
- ClickHouse 三节点状态。
- HTTP
/ping探测。 - HAProxy HTTP / Native 入口探测。
- node_exporter 机器 CPU、内存、磁盘。
- 所有机器入站/出站带宽。
- 当前运行查询数。
- 只读副本数。
- 最小活跃副本数。
- 最大复制延迟。
- 复制队列数量。
- active parts 总数。
- 单分区最大 parts。
- 当前 merge 数。
- 未完成 mutation 数。
网络带宽 PromQL 示例:
sum by (instance_name, device) (
rate(node_network_receive_bytes_total{job="node",device!~"lo|docker.*|br.*|veth.*"}[5m])
) * 8
出站把 receive 换成 transmit:
sum by (instance_name, device) (
rate(node_network_transmit_bytes_total{job="node",device!~"lo|docker.*|br.*|veth.*"}[5m])
) * 8
几个阈值建议:
当前运行查询数: 0-19 绿色,20-99 黄色,>=100 红色
只读副本数: 0 绿色,>=1 红色
最小活跃副本数: >=3 绿色,<3 红色
/www 使用率: 70% 开始关注,80% 制定方案,85% 尽快处理,90% 高风险
数据包迁移与下载排查
业务历史数据不要直接搬 Docker ClickHouse 数据目录到新集群。更稳妥流程是:
- 准备压缩包或导出文件。
- 放到某一台 ClickHouse 数据节点的数据盘。
- 校验压缩包完整性。
- 解压并核对文件数量、大小和导入清单。
- 抽取旧库 DDL。
- 调整为 ReplicatedMergeTree / Distributed 表结构。
- 使用具备写权限的管理账号导入。
建议目录:
/www/import/business_data/packages
/www/import/business_data/logs
/www/import/business_data/extracted
/www/import/business_data/ddl
用 tmux 挂下载:
ssh ch-01 'tmux new -s data_download'
HTTP 下载模板:
cd /www/import/business_data/packages
curl -L --fail --connect-timeout 20 --retry 3 --retry-delay 5 \
--speed-time 60 --speed-limit 1024 \
-o dataset.tar.gz \
'https://download.example.com/path/to/dataset.tar.gz'
如果下载源支持 Range,可以用 aria2 多连接:
aria2c -x 8 -s 8 -k 4M \
-d /www/import/business_data/packages \
-o dataset.tar.gz \
'https://download.example.com/path/to/dataset.tar.gz'
检查是否支持断点续传:
curl -I 'https://download.example.com/path/to/dataset.tar.gz' | grep -i 'Accept-Ranges'
校验压缩包:
gzip -t /www/import/business_data/packages/dataset.tar.gz
tar -tzf /www/import/business_data/packages/dataset.tar.gz | head
如果 ops 节点下载更快,可以先下载到 ops,再走内网传到数据节点:
rsync -avP /tmp/dataset.tar.gz ch-01:/www/import/business_data/packages/
内网速度可以用 iperf3 确认:
ssh ops-01 'yum install -y iperf3'
ssh ch-01 'yum install -y iperf3'
ssh ops-01 'iperf3 -s'
ssh ch-01 'iperf3 -c ops-01 -P 4 -t 20'
如果公网下载慢但内网很快,瓶颈通常不是 ClickHouse 机器或磁盘,而是公网出口、NAT、运营商链路或下载源限速。
后续容量扩展
当前拓扑是高可用优先:
1 shard x 3 replicas
扩容思路:
- 最简单的是三台一起扩
/www数据盘,然后扩文件系统。 - 如果加新磁盘,可以规划 ClickHouse storage policy,把新盘挂到
/www2之类路径后纳入策略。 - 单独加第 4 台做副本,只会多一份完整数据,提高读能力和冗余,不增加有效写入容量。
- 真正增加有效容量,需要增加分片。例如从
1 shard x 3 replicas变成2 shards x 3 replicas,通常意味着再增加 3 台数据节点。
新增分片不是简单加机器。需要重新评估:
- Distributed 表。
- sharding key。
- 历史数据重分布。
- 写入路由。
- 查询是否会跨分片放大。
- 回滚方案。
陕公网安备61011302002223号