Hardware recommendations



etcd usually runs well with limited resources for development or testing purposes; it’s common to develop with etcd on a laptop or a cheap cloud machine. However, when running etcd clusters in production, some hardware guidelines are useful for proper administration. These suggestions are not hard rules; they serve as a good starting point for a robust production deployment. As always, deployments should be tested with simulated workloads before running in production.

CPUs

Few etcd deployments require a lot of CPU capacity. Typical clusters need two to four cores to run smoothly. Heavily loaded etcd deployments, serving thousands of clients or tens of thousands of requests per second, tend to be CPU bound since etcd can serve requests from memory. Such heavy deployments usually need eight to sixteen dedicated cores.

Memory

etcd has a relatively small memory footprint but its performance still depends on having enough memory. An etcd server will aggressively cache key-value data and spends most of the rest of its memory tracking watchers. Typically 8GB is enough. For heavy deployments with thousands of watchers and millions of keys, allocate 16GB to 64GB memory accordingly.

Disks

Fast disks are the most critical factor for etcd deployment performance and stability.

A slow disk will increase etcd request latency and potentially hurt cluster stability. Since etcd’s consensus protocol depends on persistently storing metadata to a log, a majority of etcd cluster members must write every request down to disk. Additionally, etcd will also incrementally checkpoint its state to disk so it can truncate this log. If these writes take too long, heartbeats may time out and trigger an election, undermining the stability of the cluster.

etcd is very sensitive to disk write latency. Typically 50 sequential IOPS (e.g., a 7200 RPM disk) is required. For heavily loaded clusters, 500 sequential IOPS (e.g., a typical local SSD or a high performance virtualized block device) is recommended. Note that most cloud providers publish concurrent IOPS rather than sequential IOPS; the published concurrent IOPS can be 10x greater than the sequential IOPS. To measure actual sequential IOPS, we suggest using a disk benchmarking tool such as diskbench or fio.
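
As a quick sanity check before reaching for fio, the sequential synced-write rate of the disk that will hold the etcd data directory can be approximated with a few lines of Go. The sketch below is illustrative only, not part of etcd: it appends small records to a temporary file in the current directory (assumed to sit on the disk under test) and syncs after every write, which roughly mimics etcd's write-ahead-log pattern. The file name, record size, and write count are arbitrary assumptions; use a real benchmarking tool for authoritative numbers.

```go
// Rough sequential fsync benchmark for the disk that will hold the etcd
// data directory. Minimal sketch for illustration only.
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	const (
		path   = "wal-bench.tmp" // hypothetical test file on the disk under test
		size   = 2048            // bytes per write, roughly a small WAL entry
		writes = 500             // number of synced sequential writes
	)

	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0600)
	if err != nil {
		panic(err)
	}
	defer os.Remove(path)
	defer f.Close()

	buf := make([]byte, size)
	start := time.Now()
	for i := 0; i < writes; i++ {
		if _, err := f.Write(buf); err != nil {
			panic(err)
		}
		// Sync flushes the write to stable storage, analogous to the fsync
		// etcd issues after appending entries to its write-ahead log.
		if err := f.Sync(); err != nil {
			panic(err)
		}
	}
	elapsed := time.Since(start)
	fmt.Printf("%d synced sequential writes in %v (~%.0f IOPS)\n",
		writes, elapsed, float64(writes)/elapsed.Seconds())
}
```

A result well below the 50 sequential IOPS guideline above suggests the disk is too slow for even a lightly loaded cluster.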

etcd requires only modest disk bandwidth but more disk bandwidth buys faster recovery times when a failed member has to catch up with the cluster. Typically 10MB/s will recover 100MB data within 15 seconds. For large clusters, 100MB/s or higher is suggested for recovering 1GB data within 15 seconds.

When possible, back etcd's storage with an SSD. An SSD usually provides lower write latencies with less variance than a spinning disk, thus improving the stability and reliability of etcd. If using spinning disks, get the fastest disks possible (15,000 RPM). Using RAID 0 is also an effective way to increase disk speed, for both spinning disks and SSDs. With at least three cluster members, mirroring and/or parity variants of RAID are unnecessary; etcd's consistent replication already provides high availability.

Network

Multi-member etcd deployments benefit from a fast and reliable network. Because etcd is both consistent and partition tolerant, an unreliable network with partitioning outages will lead to poor availability. Low latency ensures etcd members can communicate quickly; high bandwidth reduces the time to recover a failed etcd member. 1GbE is sufficient for common etcd deployments. For large etcd clusters, a 10GbE network will reduce mean time to recovery.

Example hardware configurations

Here are a few example hardware setups in AWS and GCE environments. As mentioned before, and worth stressing again, administrators should test an etcd deployment with a simulated workload before putting it into production.
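
The benchmark tool that ships with etcd is the usual way to run such a workload. Purely as an illustration of the idea, the sketch below drives a simple put loop against a test cluster with the Go client (assuming the go.etcd.io/etcd/client/v3 module); the endpoint, key count, and value size are assumptions for illustration, not tuned benchmark parameters, and this should never be pointed at a production cluster.

```go
// Drive a simple put workload against a *test* etcd cluster and report a
// rough request rate. Minimal sketch, not a substitute for the official
// benchmark tool.
package main

import (
	"context"
	"fmt"
	"strings"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"}, // test cluster endpoint (assumption)
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	value := strings.Repeat("x", 256) // ~256B values, an arbitrary choice
	const puts = 1000
	start := time.Now()
	for i := 0; i < puts; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		_, err := cli.Put(ctx, fmt.Sprintf("/bench/key-%d", i), value)
		cancel()
		if err != nil {
			panic(err)
		}
	}
	elapsed := time.Since(start)
	fmt.Printf("%d puts in %v (~%.0f req/s)\n", puts, elapsed, float64(puts)/elapsed.Seconds())
}
```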

Note that these configurations assume the machines are totally dedicated to etcd. Running other applications along with etcd on these machines may cause resource contention and lead to cluster instability.

Small cluster

A small cluster serves fewer than 100 clients, fewer than 200 requests per second, and stores no more than 100MB of data.

Example application workload: A 50-node Kubernetes cluster

| Provider | Type | vCPUs | Memory (GB) | Max concurrent IOPS | Disk bandwidth (MB/s) |
| --- | --- | --- | --- | --- | --- |
| AWS | m4.large | 2 | 8 | 3600 | 56.25 |
| GCE | n1-standard-1 + 50GB PD SSD | 2 | 7.5 | 1500 | 25 |

Medium cluster

A medium cluster serves fewer than 500 clients, fewer than 1,000 requests per second, and stores no more than 500MB of data.

Example application workload: A 250-node Kubernetes cluster

| Provider | Type | vCPUs | Memory (GB) | Max concurrent IOPS | Disk bandwidth (MB/s) |
| --- | --- | --- | --- | --- | --- |
| AWS | m4.xlarge | 4 | 16 | 6000 | 93.75 |
| GCE | n1-standard-4 + 150GB PD SSD | 4 | 15 | 4500 | 75 |

Large cluster

A large cluster serves fewer than 1,500 clients, fewer than 10,000 requests per second, and stores no more than 1GB of data.

Example application workload: A 1,000-node Kubernetes cluster

| Provider | Type | vCPUs | Memory (GB) | Max concurrent IOPS | Disk bandwidth (MB/s) |
| --- | --- | --- | --- | --- | --- |
| AWS | m4.2xlarge | 8 | 32 | 8000 | 125 |
| GCE | n1-standard-8 + 250GB PD SSD | 8 | 30 | 7500 | 125 |

xLarge cluster

An xLarge cluster serves more than 1,500 clients, more than 10,000 requests per second, and stores more than 1GB of data.

Example application workload: A 3,000-node Kubernetes cluster

| Provider | Type | vCPUs | Memory (GB) | Max concurrent IOPS | Disk bandwidth (MB/s) |
| --- | --- | --- | --- | --- | --- |
| AWS | m4.4xlarge | 16 | 64 | 16,000 | 250 |
| GCE | n1-standard-16 + 500GB PD SSD | 16 | 60 | 15,000 | 250 |

Deploy etcd members within a single data center when possible to avoid latency overheads and lessen the possibility of partitioning events. If a failure domain in another data center is required, choose a data center closer to the existing one. Please also read the tuning documentation for more information on cross data center deployment.
