部署运维

三台机器部署 ClickHouse 高可用集群实战记录

ladydd

09 Jun 2026 — 24 min read

本文是一份可发布版部署记录。真实 IP、域名、账号、密码、下载链接、业务目录名、机器唯一标识等敏感信息已经替换为占位符。命令中的 <...> 需要按自己的环境替换。

目标与拓扑

这次目标是用三台数据节点部署一套 ClickHouse 高可用集群，拓扑采用：

1 shard x 3 replicas

含义是：集群只有一个逻辑分片，三台机器都保存同一份数据的完整副本。任意一台数据节点宕机时，只要 ClickHouse Keeper 仍然有多数派，剩余节点仍可继续提供读写服务。

规划节点如下：

主机名	示例地址	角色
ch-01	`<ch-01-ip>`	ClickHouse Server + ClickHouse Keeper
ch-02	`<ch-02-ip>`	ClickHouse Server + ClickHouse Keeper
ch-03	`<ch-03-ip>`	ClickHouse Server + ClickHouse Keeper
ops-01	`<ops-01-ip>`	监控、统一查询入口、后续业务服务

机器规格：

ClickHouse 数据节点：16 核、64GB 内存。
每台数据节点有一块约 3TB 的数据盘，挂载到 /www。
ClickHouse 程序、配置、数据、日志、临时目录和 Keeper 数据都尽量放到 /www/clickhouse。
ops 节点用于放 Prometheus、Grafana、Alertmanager、HAProxy 和后续业务服务。

容量口径要提前说清楚：

1 shard x 3 replicas 的目标是高可用，不是扩大有效容量。
三台机器各 3TB，裸盘总量看起来是 9TB。
但因为三台都保存同一份数据，所以有效 ClickHouse 物理数据容量接近单台 3TB。
商业运行不建议把数据盘打满，通常 70%-80% 就要开始规划扩容。

初始连接与 SSH 互信

先从本地机器检查三台服务器 SSH 是否可达：

for h in ch-01 ch-02 ch-03; do
  ssh -o ConnectTimeout=5 "$h" 'hostname; uname -r; cat /etc/redhat-release'
done

如果还没有配置本地 SSH alias，可以在本地 ~/.ssh/config 加：

Host ch-01
  HostName <ch-01-ip>
  User <ssh-user>
  IdentityFile <path-to-private-key>

Host ch-02
  HostName <ch-02-ip>
  User <ssh-user>
  IdentityFile <path-to-private-key>

Host ch-03
  HostName <ch-03-ip>
  User <ssh-user>
  IdentityFile <path-to-private-key>

Host ops-01
  HostName <ops-01-ip>
  User <ssh-user>
  IdentityFile <path-to-private-key>

为了后续批量部署，让三台 ClickHouse 节点之间可以 SSH 免密互访。下面命令以有 root 权限的运维用户为例，如果使用普通用户，需要把路径和权限改成对应用户并配合 sudo。

每台机器先生成密钥：

ssh ch-01 'test -f /root/.ssh/id_ed25519 || ssh-keygen -t ed25519 -N "" -f /root/.ssh/id_ed25519'
ssh ch-02 'test -f /root/.ssh/id_ed25519 || ssh-keygen -t ed25519 -N "" -f /root/.ssh/id_ed25519'
ssh ch-03 'test -f /root/.ssh/id_ed25519 || ssh-keygen -t ed25519 -N "" -f /root/.ssh/id_ed25519'

收集三台公钥，再写入每台机器的 authorized_keys：

tmpfile="$(mktemp)"
for h in ch-01 ch-02 ch-03; do
  ssh "$h" 'cat /root/.ssh/id_ed25519.pub'
done | sort -u > "$tmpfile"

for h in ch-01 ch-02 ch-03; do
  scp "$tmpfile" "$h:/tmp/cluster-authorized-keys"
  ssh "$h" '
    mkdir -p /root/.ssh
    touch /root/.ssh/authorized_keys
    cat /tmp/cluster-authorized-keys >> /root/.ssh/authorized_keys
    sort -u /root/.ssh/authorized_keys -o /root/.ssh/authorized_keys
    chmod 700 /root/.ssh
    chmod 600 /root/.ssh/authorized_keys
  '
done
rm -f "$tmpfile"

验证 3x3 互访：

for src in ch-01 ch-02 ch-03; do
  for dst in ch-01 ch-02 ch-03; do
    ssh "$src" "ssh -o BatchMode=yes -o ConnectTimeout=5 $dst hostname"
  done
done

主机名与 hosts

三台机器如果都是模板克隆出来的，初始 hostname 可能相同。先改成清晰的名字：

ssh ch-01 'hostnamectl set-hostname ch-01'
ssh ch-02 'hostnamectl set-hostname ch-02'
ssh ch-03 'hostnamectl set-hostname ch-03'

在三台数据节点和 ops 节点都写入 /etc/hosts：

for h in ch-01 ch-02 ch-03 ops-01; do
  ssh "$h" 'cat >>/etc/hosts <<EOF
<ch-01-ip> ch-01
<ch-02-ip> ch-02
<ch-03-ip> ch-03
<ops-01-ip> ops-01
EOF'
done

验证解析：

for h in ch-01 ch-02 ch-03 ops-01; do
  ssh "$h" 'getent hosts ch-01 ch-02 ch-03 ops-01'
done

磁盘检查

确认 /www 是否挂载到数据盘：

for h in ch-01 ch-02 ch-03; do
  echo "===== $h ====="
  ssh "$h" 'df -h /www; findmnt /www; lsblk -f'
done

检查系统日志中是否有明显磁盘错误：

for h in ch-01 ch-02 ch-03; do
  echo "===== $h ====="
  ssh "$h" "dmesg | egrep -i 'i/o error|medium error|buffer i/o|blk_update_request|reset|ata' | tail -50 || true"
done

做一次轻量 direct IO 测试，确认数据盘没有离谱瓶颈：

for h in ch-01 ch-02 ch-03; do
  echo "===== $h write ====="
  ssh "$h" 'dd if=/dev/zero of=/www/io-test.bin bs=1G count=2 oflag=direct status=progress'
  echo "===== $h read ====="
  ssh "$h" 'dd if=/www/io-test.bin of=/dev/null bs=1G iflag=direct status=progress'
  ssh "$h" 'rm -f /www/io-test.bin'
done

如果机器是虚拟机，客机内通常看不到真实物理盘 SMART。虚拟盘健康还需要在云平台、虚拟化平台或底层存储侧确认。

网络与下载能力

安装前先测试外部软件源是否可访问：

for h in ch-01 ch-02 ch-03; do
  echo "===== $h ====="
  ssh "$h" '
    curl -I --connect-timeout 10 https://packages.clickhouse.com/tgz/lts/ | head
    curl -I --connect-timeout 10 https://github.com/ClickHouse/ClickHouse | head
  '
done

如果服务器外网速度不稳定，可以先在本地下载 ClickHouse 安装包，再通过 scp 或 rsync 分发到三台机器：

VERSION=26.3.13.31
mkdir -p "packages/clickhouse-$VERSION"
cd "packages/clickhouse-$VERSION"

for pkg in common-static server client; do
  curl -fLO "https://packages.clickhouse.com/tgz/lts/clickhouse-${pkg}-${VERSION}-amd64.tgz"
  curl -fLO "https://packages.clickhouse.com/tgz/lts/clickhouse-${pkg}-${VERSION}-amd64.tgz.sha512"
done

sha512sum -c *.sha512

for h in ch-01 ch-02 ch-03; do
  ssh "$h" "mkdir -p /www/clickhouse/packages/$VERSION"
  rsync -av ./ "$h:/www/clickhouse/packages/$VERSION/"
done

部署前风险项

部署前重点检查几项：

for h in ch-01 ch-02 ch-03; do
  echo "===== $h ====="
  ssh "$h" '
    echo -n "THP enabled: "; cat /sys/kernel/mm/transparent_hugepage/enabled
    echo -n "THP defrag:  "; cat /sys/kernel/mm/transparent_hugepage/defrag
    ulimit -n
    systemctl is-active firewalld || true
  '
done

这里重点处理：

关闭 THP，减少数据库服务延迟抖动。
提高 nofile 和 nproc。
只对集群内网开放 ClickHouse、Keeper、监控端口。
修复克隆机器可能重复的 machine-id。

修复 machine-id

模板克隆机器常见问题是 /etc/machine-id 完全相同。重复的 machine-id 会影响日志、监控、服务注册和部分系统组件识别。

检查：

for h in ch-01 ch-02 ch-03; do
  ssh "$h" 'echo "$(hostname) $(cat /etc/machine-id)"'
done

修复命令：

for h in ch-01 ch-02 ch-03; do
  ssh "$h" '
    set -e
    ts=$(date +%Y%m%d%H%M%S)
    cp -f /etc/machine-id /etc/machine-id.bak.$ts 2>/dev/null || true
    cp -f /var/lib/dbus/machine-id /var/lib/dbus/machine-id.bak.$ts 2>/dev/null || true
    rm -f /etc/machine-id /var/lib/dbus/machine-id
    systemd-machine-id-setup
    cp -f /etc/machine-id /var/lib/dbus/machine-id
    cat /etc/machine-id
  '
done

修复后重新检查，确保三台输出不同。发布文章时不建议贴真实 machine-id。

ClickHouse 版本选择

生产环境第一套集群不建议追最新版本。我的选择是 LTS：

ClickHouse 26.3.13.31 LTS

选择理由：

LTS 更适合生产起步。
不追最新 stable，减少新版本变化带来的风险。
三台机器 CPU 支持 ClickHouse 当前 x86_64 二进制需要的指令集。
使用官方 tgz/lts 包，便于把程序主体放在 /www。

检查 CPU 指令：

for h in ch-01 ch-02 ch-03; do
  ssh "$h" "lscpu | egrep 'Model name|Flags' | head -2"
done

关闭 THP

创建 systemd 服务，确保现在和开机后都关闭 THP：

for h in ch-01 ch-02 ch-03; do
  ssh "$h" 'cat >/etc/systemd/system/disable-thp.service <<EOF
[Unit]
Description=Disable Transparent Huge Pages
After=multi-user.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c "echo never > /sys/kernel/mm/transparent_hugepage/enabled"
ExecStart=/bin/sh -c "echo never > /sys/kernel/mm/transparent_hugepage/defrag"
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now disable-thp.service
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag'
done

看到 [never] 即可。

ClickHouse 基础安装

目录规划：

/www/clickhouse/current -> /www/clickhouse/releases/<version>
/www/clickhouse/releases/<version>
/www/clickhouse/etc/clickhouse-server
/www/clickhouse/etc/clickhouse-keeper
/www/clickhouse/data
/www/clickhouse/tmp
/www/clickhouse/logs/clickhouse-server
/www/clickhouse/logs/clickhouse-keeper
/www/clickhouse/keeper

创建系统用户：

for h in ch-01 ch-02 ch-03; do
  ssh "$h" 'id clickhouse >/dev/null 2>&1 || useradd --system --home-dir /www/clickhouse --shell /sbin/nologin clickhouse'
done

解压安装包：

VERSION=26.3.13.31
for h in ch-01 ch-02 ch-03; do
  ssh "$h" "
    set -e
    BASE=/www/clickhouse
    PKG_DIR=\$BASE/packages/$VERSION
    RELEASE=\$BASE/releases/$VERSION
    EXTRACT=\$RELEASE/.extract

    mkdir -p \$RELEASE/bin \$EXTRACT
    cd \$PKG_DIR
    sha512sum -c *.sha512

    rm -rf \$EXTRACT
    mkdir -p \$EXTRACT
    tar -xzf clickhouse-common-static-$VERSION-amd64.tgz -C \$EXTRACT
    tar -xzf clickhouse-server-$VERSION-amd64.tgz -C \$EXTRACT
    tar -xzf clickhouse-client-$VERSION-amd64.tgz -C \$EXTRACT

    install -m 0755 \$EXTRACT/clickhouse-common-static-$VERSION/usr/bin/clickhouse \$RELEASE/bin/clickhouse
    for tool in clickhouse-server clickhouse-client clickhouse-local clickhouse-keeper clickhouse-keeper-client clickhouse-benchmark clickhouse-format; do
      ln -sfn clickhouse \$RELEASE/bin/\$tool
    done
    ln -sfn \$RELEASE \$BASE/current
    \$BASE/current/bin/clickhouse local --version
  "
done

创建基础目录：

for h in ch-01 ch-02 ch-03; do
  ssh "$h" '
    BASE=/www/clickhouse
    mkdir -p \
      $BASE/etc/clickhouse-server \
      $BASE/etc/clickhouse-keeper \
      $BASE/data \
      $BASE/tmp \
      $BASE/user_files \
      $BASE/format_schemas \
      $BASE/access \
      $BASE/logs/clickhouse-server \
      $BASE/logs/clickhouse-keeper \
      $BASE/keeper/coordination/log \
      $BASE/keeper/coordination/snapshots
    chown -R clickhouse:clickhouse $BASE
  '
done

提高 limits：

for h in ch-01 ch-02 ch-03; do
  ssh "$h" 'cat >/etc/security/limits.d/clickhouse.conf <<EOF
clickhouse soft nofile 1048576
clickhouse hard nofile 1048576
clickhouse soft nproc 65536
clickhouse hard nproc 65536
EOF'
done

ClickHouse Keeper 配置

Keeper 是 ClickHouse 的协调服务，负责复制元数据、分布式 DDL 队列、锁和副本协作。它不是业务主节点，也不直接搬运大数据。

三节点 Keeper 的关键点：

server_id 三台必须不同。
Raft 端口用于 Keeper 节点间通信。
Client 端口供 ClickHouse Server 连接 Keeper。
三台 Keeper 允许挂 1 台；挂 2 台时多数派丢失。

示例 keeper_config.xml：

<clickhouse>
    <logger>
        <level>information</level>
        <log>/www/clickhouse/logs/clickhouse-keeper/clickhouse-keeper.log</log>
        <errorlog>/www/clickhouse/logs/clickhouse-keeper/clickhouse-keeper.err.log</errorlog>
        <size>1000000000</size>
        <count>10</count>
    </logger>

    <keeper_server>
        <tcp_port>9181</tcp_port>
        <server_id>1</server_id>
        <log_storage_path>/www/clickhouse/keeper/coordination/log</log_storage_path>
        <snapshot_storage_path>/www/clickhouse/keeper/coordination/snapshots</snapshot_storage_path>

        <coordination_settings>
            <operation_timeout_ms>10000</operation_timeout_ms>
            <session_timeout_ms>30000</session_timeout_ms>
            <raft_logs_level>information</raft_logs_level>
        </coordination_settings>

        <raft_configuration>
            <server>
                <id>1</id>
                <hostname>ch-01</hostname>
                <port>9234</port>
            </server>
            <server>
                <id>2</id>
                <hostname>ch-02</hostname>
                <port>9234</port>
            </server>
            <server>
                <id>3</id>
                <hostname>ch-03</hostname>
                <port>9234</port>
            </server>
        </raft_configuration>
    </keeper_server>
</clickhouse>

在 ch-02、ch-03 上分别把 <server_id> 改成 2、3。

systemd 服务：

[Unit]
Description=ClickHouse Keeper
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=clickhouse
Group=clickhouse
ExecStart=/www/clickhouse/current/bin/clickhouse-keeper --config-file=/www/clickhouse/etc/clickhouse-keeper/keeper_config.xml
Restart=always
RestartSec=5
LimitNOFILE=1048576
LimitNPROC=65536

[Install]
WantedBy=multi-user.target

启动：

for h in ch-01 ch-02 ch-03; do
  ssh "$h" 'systemctl daemon-reload && systemctl enable --now clickhouse-keeper.service'
done

内网端口开放

只对集群内网 CIDR 放通必要端口：

端口	用途
8123	ClickHouse HTTP
9000	ClickHouse Native TCP
9009	ClickHouse interserver
9181	Keeper client
9234	Keeper Raft
9363	ClickHouse Prometheus metrics
9100	node_exporter

firewalld 示例：

CLUSTER_CIDR="<cluster-cidr>"
for h in ch-01 ch-02 ch-03; do
  for port in 8123 9000 9009 9181 9234 9363 9100; do
    ssh "$h" "firewall-cmd --permanent --add-rich-rule='rule family=\"ipv4\" source address=\"$CLUSTER_CIDR\" port port=\"$port\" protocol=\"tcp\" accept'"
  done
  ssh "$h" 'firewall-cmd --reload'
done

不建议把 ClickHouse 和 Keeper 端口直接暴露到公网。

Keeper Quorum 验证

检查 Keeper 端口：

for h in ch-01 ch-02 ch-03; do
  echo "===== $h ====="
  ssh "$h" 'ss -lntp | egrep ":9181|:9234"'
done

检查服务：

for h in ch-01 ch-02 ch-03; do
  ssh "$h" 'systemctl is-active clickhouse-keeper.service'
done

用 keeper-client 验证：

for h in ch-01 ch-02 ch-03; do
  ssh "$h" '/www/clickhouse/current/bin/clickhouse keeper-client --host 127.0.0.1 --port 9181 --query "ls /"'
done

看 leader/follower：

for h in ch-01 ch-02 ch-03; do
  echo "===== $h ====="
  ssh "$h" "grep -E 'leader|follower|elected|term' /www/clickhouse/logs/clickhouse-keeper/clickhouse-keeper.log | tail -20"
done

ClickHouse Server 配置

Server 的核心配置片段如下：

<clickhouse>
    <display_name>ch-01</display_name>
    <listen_host>0.0.0.0</listen_host>
    <http_port>8123</http_port>
    <tcp_port>9000</tcp_port>
    <interserver_http_port>9009</interserver_http_port>
    <interserver_http_host>ch-01</interserver_http_host>
    <timezone>Asia/Shanghai</timezone>

    <path>/www/clickhouse/data/</path>
    <tmp_path>/www/clickhouse/tmp/</tmp_path>
    <user_files_path>/www/clickhouse/user_files/</user_files_path>
    <format_schema_path>/www/clickhouse/format_schemas/</format_schema_path>
    <access_control_path>/www/clickhouse/access/</access_control_path>

    <user_directories>
        <users_xml>
            <path>/www/clickhouse/etc/clickhouse-server/users.xml</path>
        </users_xml>
        <local_directory>
            <path>/www/clickhouse/access/</path>
        </local_directory>
    </user_directories>

    <prometheus>
        <endpoint>/metrics</endpoint>
        <port>9363</port>
        <metrics>true</metrics>
        <events>true</events>
        <asynchronous_metrics>true</asynchronous_metrics>
        <errors>true</errors>
    </prometheus>

    <remote_servers>
        <cluster_1s3r>
            <shard>
                <internal_replication>true</internal_replication>
                <replica><host>ch-01</host><port>9000</port><user>cluster</user><password>cluster_password_placeholder</password></replica>
                <replica><host>ch-02</host><port>9000</port><user>cluster</user><password>cluster_password_placeholder</password></replica>
                <replica><host>ch-03</host><port>9000</port><user>cluster</user><password>cluster_password_placeholder</password></replica>
            </shard>
        </cluster_1s3r>
    </remote_servers>

    <zookeeper>
        <node index="1"><host>ch-01</host><port>9181</port></node>
        <node index="2"><host>ch-02</host><port>9181</port></node>
        <node index="3"><host>ch-03</host><port>9181</port></node>
    </zookeeper>

    <macros>
        <cluster>cluster_1s3r</cluster>
        <shard>01</shard>
        <replica>ch-01</replica>
    </macros>

    <distributed_ddl>
        <path>/clickhouse/task_queue/ddl</path>
    </distributed_ddl>
</clickhouse>

注意：

每台的 display_name、interserver_http_host 和 <macros><replica> 要改成本机名。
user_directories 很重要，漏掉后 users.xml 不会被加载，可能出现 Settings profile default not found。
密码不要明文写在公开文章里，生产环境也建议用独立密钥文件或部署系统注入。

Server systemd：

[Unit]
Description=ClickHouse Server
After=network-online.target clickhouse-keeper.service
Wants=network-online.target

[Service]
Type=simple
User=clickhouse
Group=clickhouse
ExecStart=/www/clickhouse/current/bin/clickhouse-server --config-file=/www/clickhouse/etc/clickhouse-server/config.xml
Restart=always
RestartSec=5
LimitNOFILE=1048576
LimitNPROC=65536

[Install]
WantedBy=multi-user.target

启动：

for h in ch-01 ch-02 ch-03; do
  ssh "$h" 'systemctl daemon-reload && systemctl enable --now clickhouse-server.service'
done

用户、权限和日志

生产环境不建议保留内网无密码 default。这次处理思路：

default 只允许 localhost 和集群节点访问，并设置密码。
cluster 用户用于节点间 remote_servers。
app_readonly 用户用于业务查询，只读，限制来源为 ops 节点。
monitor 用户用于监控查询，只读，资源限制更低。

生成密码 hash 示例：

printf '%s' '<strong-password>' | sha256sum | awk '{print $1}'

users.xml 关键片段：

<clickhouse>
    <profiles>
        <default>
            <max_memory_usage>17179869184</max_memory_usage>
            <max_execution_time>600</max_execution_time>
            <use_uncompressed_cache>0</use_uncompressed_cache>
        </default>

        <app_readonly>
            <readonly>1</readonly>
            <max_memory_usage>8589934592</max_memory_usage>
            <max_execution_time>120</max_execution_time>
            <max_threads>8</max_threads>
        </app_readonly>

        <monitor>
            <readonly>1</readonly>
            <max_memory_usage>536870912</max_memory_usage>
            <max_execution_time>10</max_execution_time>
        </monitor>
    </profiles>

    <users>
        <default>
            <password_sha256_hex>default_password_sha256_placeholder</password_sha256_hex>
            <networks>
                <ip>127.0.0.1</ip>
                <ip>::1</ip>
                <ip><ch-01-ip></ip>
                <ip><ch-02-ip></ip>
                <ip><ch-03-ip></ip>
            </networks>
            <profile>default</profile>
            <quota>default</quota>
        </default>

        <cluster>
            <password_sha256_hex>cluster_password_sha256_placeholder</password_sha256_hex>
            <networks>
                <ip><ch-01-ip></ip>
                <ip><ch-02-ip></ip>
                <ip><ch-03-ip></ip>
            </networks>
            <profile>default</profile>
            <quota>default</quota>
        </cluster>

        <app_readonly>
            <password_sha256_hex>app_password_sha256_placeholder</password_sha256_hex>
            <networks>
                <ip>127.0.0.1</ip>
                <ip>::1</ip>
                <ip><ops-01-ip></ip>
            </networks>
            <profile>app_readonly</profile>
            <quota>default</quota>
        </app_readonly>

        <monitor>
            <password_sha256_hex>monitor_password_sha256_placeholder</password_sha256_hex>
            <networks>
                <ip>127.0.0.1</ip>
                <ip>::1</ip>
                <ip><ops-01-ip></ip>
            </networks>
            <profile>monitor</profile>
            <quota>default</quota>
        </monitor>
    </users>
</clickhouse>

查询日志和 part 日志开启，并设置 TTL：

<query_log>
    <database>system</database>
    <table>query_log</table>
    <partition_by>toYYYYMM(event_date)</partition_by>
    <ttl>event_date + INTERVAL 14 DAY DELETE</ttl>
    <flush_interval_milliseconds>7500</flush_interval_milliseconds>
</query_log>

<part_log>
    <database>system</database>
    <table>part_log</table>
    <partition_by>toYYYYMM(event_date)</partition_by>
    <ttl>event_date + INTERVAL 14 DAY DELETE</ttl>
    <flush_interval_milliseconds>7500</flush_interval_milliseconds>
</part_log>

重启并验证：

for h in ch-01 ch-02 ch-03; do
  ssh "$h" 'systemctl restart clickhouse-server.service && systemctl is-active clickhouse-server.service'
done

ssh ops-01 'curl -sS "http://127.0.0.1:8123/" --data-binary "SELECT 1"'

集群启动验证

检查端口：

for h in ch-01 ch-02 ch-03; do
  echo "===== $h ====="
  ssh "$h" 'ss -lntp | egrep ":8123|:9000|:9009|:9363"'
done

基础查询：

for h in ch-01 ch-02 ch-03; do
  ssh "$h" '/www/clickhouse/current/bin/clickhouse-client --query "SELECT hostName(), version()"'
done

集群拓扑：

SELECT cluster, shard_num, replica_num, host_name, port
FROM system.clusters
WHERE cluster = 'cluster_1s3r'
ORDER BY shard_num, replica_num;

三副本应看到：

cluster_1s3r  1  1  ch-01  9000
cluster_1s3r  1  2  ch-02  9000
cluster_1s3r  1  3  ch-03  9000

分片、副本和 Keeper 的理解

副本是同一份数据复制多份。当前三台机器各保存一份完整数据：

ch-01 = replica
ch-02 = replica
ch-03 = replica

分片是把不同数据拆到不同机器组。当前只有 1 个分片，所以没有把数据拆开，主要目标是高可用。

Keeper 负责协调复制：

某台 ClickHouse 写入本地 part。
它把 part 信息登记到 Keeper。
其他副本从 Keeper 看到复制任务。
其他 ClickHouse Server 通过 9009 等节点间链路拉取 part。

Keeper 内部有 leader/follower，但这不是业务主库。ClickHouse 查询层没有固定主节点，三台健康副本都可以查。

复制表验证

创建测试库：

CREATE DATABASE IF NOT EXISTS ch_test ON CLUSTER cluster_1s3r;

创建复制表：

CREATE TABLE IF NOT EXISTS ch_test.replica_test ON CLUSTER cluster_1s3r
(
    id UInt64,
    source_node String,
    created_at DateTime DEFAULT now()
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/ch_test/replica_test', '{replica}')
ORDER BY id;

创建 Distributed 表：

CREATE TABLE IF NOT EXISTS ch_test.replica_test_all ON CLUSTER cluster_1s3r
AS ch_test.replica_test
ENGINE = Distributed(cluster_1s3r, ch_test, replica_test, rand());

写入测试：

INSERT INTO ch_test.replica_test (id, source_node) VALUES (1, hostName());

查询三副本：

SELECT hostName(), *
FROM clusterAllReplicas(cluster_1s3r, ch_test.replica_test)
ORDER BY hostName(), id;

检查复制状态：

SELECT
    hostName(),
    database,
    table,
    is_readonly,
    absolute_delay,
    total_replicas,
    active_replicas,
    queue_size
FROM clusterAllReplicas(cluster_1s3r, system.replicas)
WHERE database = 'ch_test';

期望：

is_readonly = 0
absolute_delay = 0
total_replicas = 3
active_replicas = 3
queue_size = 0

单节点下线测试

模拟下线 ch-01：

ssh ch-01 'systemctl stop clickhouse-server.service clickhouse-keeper.service'

从 ch-02 验证 Keeper 仍可访问、表仍可查询：

ssh ch-02 '/www/clickhouse/current/bin/clickhouse keeper-client --host 127.0.0.1 --port 9181 --query "ls /"'
ssh ch-02 '/www/clickhouse/current/bin/clickhouse-client --query "SELECT count() FROM ch_test.replica_test_all"'

下线期间写入第二行：

INSERT INTO ch_test.replica_test (id, source_node) VALUES (2, hostName());

检查剩余副本：

SELECT hostName(), active_replicas, total_replicas, is_readonly, absolute_delay, queue_size
FROM clusterAllReplicas(cluster_1s3r, system.replicas)
WHERE database = 'ch_test';

恢复 ch-01：

ssh ch-01 'systemctl start clickhouse-keeper.service clickhouse-server.service'

如果需要主动等待追平：

SYSTEM SYNC REPLICA ch_test.replica_test;

最终应恢复到 active_replicas=3。

HAProxy 统一查询入口

业务不建议直接写死某一台 ClickHouse。可以在 ops 节点部署 HAProxy：

业务服务 -> ops-01:8123 -> ch-01/ch-02/ch-03:8123
业务服务 -> ops-01:9000 -> ch-01/ch-02/ch-03:9000

安装：

ssh ops-01 'yum install -y haproxy'

示例配置：

global
    log 127.0.0.1 local0
    maxconn 4096

defaults
    log global
    mode tcp
    option tcplog
    timeout connect 5s
    timeout client  10m
    timeout server  10m

listen clickhouse_http
    bind 0.0.0.0:8123
    mode http
    option httplog
    option httpchk GET /ping
    http-check expect string Ok.
    balance roundrobin
    server ch-01 ch-01:8123 check
    server ch-02 ch-02:8123 check
    server ch-03 ch-03:8123 check

listen clickhouse_native
    bind 0.0.0.0:9000
    mode tcp
    balance roundrobin
    server ch-01 ch-01:9000 check
    server ch-02 ch-02:9000 check
    server ch-03 ch-03:9000 check

listen stats
    bind 127.0.0.1:8404
    mode http
    stats enable
    stats uri /

启动：

ssh ops-01 'systemctl enable --now haproxy.service && systemctl status haproxy.service --no-pager'

验证 HTTP 入口：

for i in $(seq 1 10); do
  ssh ops-01 'curl -sS http://127.0.0.1:8123 --data-binary "SELECT hostName()"'
done

验证 Native 入口：

ssh ops-01 'clickhouse-client --host 127.0.0.1 --port 9000 --query "SELECT hostName()"'

业务部署在 ops 本机时，优先连：

HTTP:   http://127.0.0.1:8123
Native: 127.0.0.1:9000

公网业务入口占位

ClickHouse 不直接暴露公网。公网入口只进入业务服务，业务服务再通过内网访问 HAProxy。

示例链路：

用户 -> https://api.example.com -> 前置网关/反代 -> ops-01:443 -> 业务服务 -> HAProxy -> ClickHouse

如果先用 Caddy 占位：

:443 {
    handle /healthz {
        respond "ok"
    }

    handle {
        reverse_proxy 127.0.0.1:8080
    }
}

验证：

curl -I https://api.example.com/healthz

高并发业务场景下，Caddy、Nginx、OpenResty 都能做入口。真正选择建议基于压测，而不是凭感觉。这里 ClickHouse 查询入口已经由 HAProxy 处理，公网反代只负责业务 HTTP 层。

ops 节点和 Docker

ops 节点安装 Docker：

ssh ops-01 'bash <(curl -fsSL https://get.docker.com)'
ssh ops-01 'systemctl enable --now docker'
ssh ops-01 'docker version'
ssh ops-01 'docker compose version'

如果企业环境有内部镜像源，可以写入 /etc/docker/daemon.json：

{
  "registry-mirrors": ["https://<your-docker-mirror>"]
}

重启：

ssh ops-01 'systemctl restart docker && docker info | grep -A3 "Registry Mirrors"'

安装 node_exporter：

NODE_EXPORTER_VERSION=1.9.1
curl -fLO "https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz"

for h in ch-01 ch-02 ch-03 ops-01; do
  scp "node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz" "$h:/tmp/"
  ssh "$h" "
    cd /tmp
    tar -xzf node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
    mkdir -p /opt/node_exporter/releases
    rm -rf /opt/node_exporter/releases/${NODE_EXPORTER_VERSION}
    mv node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64 /opt/node_exporter/releases/${NODE_EXPORTER_VERSION}
    ln -sfn /opt/node_exporter/releases/${NODE_EXPORTER_VERSION} /opt/node_exporter/current
    useradd --system --shell /sbin/nologin node_exporter 2>/dev/null || true
    chown -R node_exporter:node_exporter /opt/node_exporter
  "
done

systemd：

[Unit]
Description=Prometheus Node Exporter
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
ExecStart=/opt/node_exporter/current/node_exporter
Restart=always

[Install]
WantedBy=multi-user.target

验证：

for h in ch-01 ch-02 ch-03 ops-01; do
  ssh "$h" 'systemctl enable --now node_exporter.service; curl -sS http://127.0.0.1:9100/metrics | head'
done

Prometheus、Grafana 和 Alertmanager

监控链路：

ClickHouse /metrics -> Prometheus
node_exporter -> Prometheus
blackbox_exporter -> Prometheus
light SQL exporter -> Prometheus
Prometheus -> Grafana
Prometheus -> Alertmanager

/opt/ch-monitoring/monitoring
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   └── rules/clickhouse-basic.yml
├── alertmanager/alertmanager.yml
├── blackbox/blackbox.yml
└── grafana/provisioning/

docker-compose.yml 核心：

services:
  prometheus:
    image: prom/prometheus:v3.5.0
    container_name: ch-monitor-prometheus
    restart: unless-stopped
    ports:
      - "127.0.0.1:19090:9090"
    volumes:
      - ./prometheus:/etc/prometheus:ro
      - prometheus_data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=30d
      - --web.enable-lifecycle

  alertmanager:
    image: prom/alertmanager:v0.28.1
    container_name: ch-monitor-alertmanager
    restart: unless-stopped
    ports:
      - "127.0.0.1:19093:9093"
    volumes:
      - ./alertmanager:/etc/alertmanager:ro
      - alertmanager_data:/alertmanager
    command:
      - --config.file=/etc/alertmanager/alertmanager.yml
      - --storage.path=/alertmanager

  grafana:
    image: grafana/grafana-oss:12.0.2
    container_name: ch-monitor-grafana
    restart: unless-stopped
    ports:
      - "0.0.0.0:13000:3000"

volumes:
  prometheus_data:
  alertmanager_data:
  grafana_data:

Prometheus 配置重点：

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - /etc/prometheus/rules/*.yml

基础告警规则：

groups:
  - name: clickhouse-basic
    rules:
      - alert: ClickHouseMetricsDown
        expr: up{job="clickhouse"} == 0
        for: 2m
        labels:
          severity: critical

      - alert: ClickHouseHTTPPingDown
        expr: probe_success{job="clickhouse-http-ping"} == 0
        for: 2m
        labels:
          severity: critical

      - alert: ClickHouseReplicaReadonly
        expr: clickhouse_sql_readonly_replicas > 0
        for: 1m
        labels:
          severity: critical

      - alert: ClickHouseActiveReplicasLow
        expr: clickhouse_sql_replicated_tables > 0 and clickhouse_sql_min_active_replicas < 3
        for: 2m
        labels:
          severity: critical

      - alert: ClickHouseReplicaDelayHigh
        expr: clickhouse_sql_max_replica_absolute_delay_seconds > 60
        for: 5m
        labels:
          severity: warning

      - alert: ClickHouseDiskAlmostFull
        expr: |
          (node_filesystem_avail_bytes{job="node",mountpoint="/www",fstype!~"tmpfs|overlay"} /
          node_filesystem_size_bytes{job="node",mountpoint="/www",fstype!~"tmpfs|overlay"}) < 0.15
        for: 5m
        labels:
          severity: warning

      - alert: NodeRootDiskAlmostFull
        expr: |
          (node_filesystem_avail_bytes{job="node",mountpoint="/",fstype!~"tmpfs|overlay"} /
          node_filesystem_size_bytes{job="node",mountpoint="/",fstype!~"tmpfs|overlay"}) < 0.15
        for: 5m
        labels:
          severity: warning

Alertmanager 先只接收不外发：

global:
  resolve_timeout: 5m

route:
  receiver: noop
  group_by:
    - alertname
    - cluster
    - instance_name
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: noop

启动：

ssh ops-01 'cd /opt/ch-monitoring/monitoring && docker compose up -d'

校验：

ssh ops-01 'docker exec ch-monitor-prometheus promtool check config /etc/prometheus/prometheus.yml'
ssh ops-01 'docker exec ch-monitor-alertmanager amtool check-config /etc/alertmanager/alertmanager.yml'
ssh ops-01 'curl -sS http://127.0.0.1:19090/api/v1/alertmanagers'
ssh ops-01 'curl -sS http://127.0.0.1:19090/api/v1/alerts'

后续要接邮件、飞书、企业微信时，不需要改 Prometheus 采集，只需要在 Alertmanager receiver 里新增 email 或 webhook。

Grafana 面板

这次面板覆盖：

ClickHouse 三节点状态。
HTTP /ping 探测。
HAProxy HTTP / Native 入口探测。
node_exporter 机器 CPU、内存、磁盘。
所有机器入站/出站带宽。
当前运行查询数。
只读副本数。
最小活跃副本数。
最大复制延迟。
复制队列数量。
active parts 总数。
单分区最大 parts。
当前 merge 数。
未完成 mutation 数。

网络带宽 PromQL 示例：

sum by (instance_name, device) (
  rate(node_network_receive_bytes_total{job="node",device!~"lo|docker.*|br.*|veth.*"}[5m])
) * 8

出站把 receive 换成 transmit：

sum by (instance_name, device) (
  rate(node_network_transmit_bytes_total{job="node",device!~"lo|docker.*|br.*|veth.*"}[5m])
) * 8

几个阈值建议：

当前运行查询数: 0-19 绿色，20-99 黄色，>=100 红色
只读副本数: 0 绿色，>=1 红色
最小活跃副本数: >=3 绿色，<3 红色
/www 使用率: 70% 开始关注，80% 制定方案，85% 尽快处理，90% 高风险

数据包迁移与下载排查

业务历史数据不要直接搬 Docker ClickHouse 数据目录到新集群。更稳妥流程是：

准备压缩包或导出文件。
放到某一台 ClickHouse 数据节点的数据盘。
校验压缩包完整性。
解压并核对文件数量、大小和导入清单。
抽取旧库 DDL。
调整为 ReplicatedMergeTree / Distributed 表结构。
使用具备写权限的管理账号导入。

建议目录：

/www/import/business_data/packages
/www/import/business_data/logs
/www/import/business_data/extracted
/www/import/business_data/ddl

用 tmux 挂下载：

ssh ch-01 'tmux new -s data_download'

HTTP 下载模板：

cd /www/import/business_data/packages
curl -L --fail --connect-timeout 20 --retry 3 --retry-delay 5 \
  --speed-time 60 --speed-limit 1024 \
  -o dataset.tar.gz \
  'https://download.example.com/path/to/dataset.tar.gz'

如果下载源支持 Range，可以用 aria2 多连接：

aria2c -x 8 -s 8 -k 4M \
  -d /www/import/business_data/packages \
  -o dataset.tar.gz \
  'https://download.example.com/path/to/dataset.tar.gz'

检查是否支持断点续传：

curl -I 'https://download.example.com/path/to/dataset.tar.gz' | grep -i 'Accept-Ranges'

校验压缩包：

gzip -t /www/import/business_data/packages/dataset.tar.gz
tar -tzf /www/import/business_data/packages/dataset.tar.gz | head

如果 ops 节点下载更快，可以先下载到 ops，再走内网传到数据节点：

rsync -avP /tmp/dataset.tar.gz ch-01:/www/import/business_data/packages/

内网速度可以用 iperf3 确认：

ssh ops-01 'yum install -y iperf3'
ssh ch-01 'yum install -y iperf3'
ssh ops-01 'iperf3 -s'
ssh ch-01 'iperf3 -c ops-01 -P 4 -t 20'

如果公网下载慢但内网很快，瓶颈通常不是 ClickHouse 机器或磁盘，而是公网出口、NAT、运营商链路或下载源限速。

后续容量扩展

当前拓扑是高可用优先：

1 shard x 3 replicas

扩容思路：

最简单的是三台一起扩 /www 数据盘，然后扩文件系统。
如果加新磁盘，可以规划 ClickHouse storage policy，把新盘挂到 /www2 之类路径后纳入策略。
单独加第 4 台做副本，只会多一份完整数据，提高读能力和冗余，不增加有效写入容量。
真正增加有效容量，需要增加分片。例如从 1 shard x 3 replicas 变成 2 shards x 3 replicas，通常意味着再增加 3 台数据节点。

新增分片不是简单加机器。需要重新评估：

Distributed 表。
sharding key。
历史数据重分布。
写入路由。
查询是否会跨分片放大。
回滚方案。