
Frequently Asked Questions (TBD)


Last updated 6 years ago

etcd, general

Do clients have to send requests to the etcd leader?

Raft is leader-based; the leader handles all client requests that need cluster consensus. However, the client does not need to know which node is the leader. Any request that requires consensus and is sent to a follower is automatically forwarded to the leader. Requests that do not require consensus (e.g., serialized reads) can be processed by any cluster member.

Configuration

What is the difference between advertise-urls and listen-urls?

listen-urls specifies the local addresses etcd server binds to for accepting incoming connections. To listen on a port for all interfaces, specify 0.0.0.0 as the listen IP address.

advertise-urls specifies the addresses etcd clients or other etcd members should use to contact the etcd server. The advertise addresses must be reachable from the remote machines. Do not advertise addresses like localhost or 0.0.0.0 for a production setup since these addresses are unreachable from remote machines.
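The distinction can be checked mechanically before deployment. The following is a minimal sketch, not part of etcd: the helper name `unreachable_advertise_urls`, the host list, and the example address 10.0.1.10 are all assumptions made for illustration. It flags advertise URLs that a remote machine could not reach, while leaving a wildcard listen address alone.

```python
from urllib.parse import urlsplit

# Hosts that only make sense on the local machine (illustrative list, not exhaustive).
UNROUTABLE_HOSTS = {"localhost", "127.0.0.1", "::1", "0.0.0.0", "::"}


def unreachable_advertise_urls(urls):
    """Return the advertise URLs whose host cannot be reached from a remote machine."""
    return [u for u in urls if urlsplit(u).hostname in UNROUTABLE_HOSTS]


# Binding to 0.0.0.0 is fine for listen-urls: it means "all local interfaces".
listen = ["http://0.0.0.0:2379"]

# Advertise URLs are handed to remote peers and clients, so they must be routable.
# 10.0.1.10 is a made-up address for this example.
advertise = ["http://10.0.1.10:2379", "http://localhost:2379"]

bad = unreachable_advertise_urls(advertise)
print(bad)  # ['http://localhost:2379']
```

A check like this only catches the obvious mistakes named above; it cannot tell whether a routable-looking address is actually reachable through the network in question.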

Deployment

System requirements

Since etcd writes data to disk, SSD is highly recommended. To prevent performance degradation or unintentionally overloading the key-value store, etcd enforces a 2GB default storage size quota, configurable up to 8GB. To avoid swapping or running out of memory, the machine should have at least as much RAM as the quota. At CoreOS, an etcd cluster is usually deployed on dedicated CoreOS Container Linux machines with dual-core processors, 2GB of RAM, and 80GB of SSD at the very least. Note that performance is intrinsically workload dependent; please test before production deployment. See the hardware documentation for more recommendations.

The most stable production environment is the Linux operating system on the amd64 architecture; see the supported platform documentation for more.

Why an odd number of cluster members?

An etcd cluster needs a majority of nodes, a quorum, to agree on updates to the cluster state. For a cluster with n members, quorum is (n/2)+1, with n/2 rounded down. For any odd-sized cluster, adding one node will always increase the number of nodes necessary for quorum. Although adding a node to an odd-sized cluster appears better since there are more machines, the fault tolerance is worse: exactly the same number of nodes may fail without losing quorum, but there are more nodes that can fail. If the cluster is in a state where it can't tolerate any more failures, adding a node before removing nodes is dangerous because if the new node fails to register with the cluster (e.g., the address is misconfigured), quorum will be permanently lost.

What is the maximum cluster size?

What is failure tolerance?

It is recommended to have an odd number of members in a cluster. An odd-size cluster tolerates the same number of failures as an even-size cluster but with fewer nodes. The difference can be seen by comparing even and odd sized clusters:

Cluster Size    Majority    Failure Tolerance
1               1           0
2               2           0
3               2           1
4               3           1
5               3           2
6               4           2
7               4           3
8               5           3
9               5           4

Adding a member to bring the size of cluster up to an even number doesn't buy additional fault tolerance. Likewise, during a network partition, an odd number of members guarantees that there will always be a majority partition that can continue to operate and be the source of truth when the partition ends.
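The figures above follow directly from the quorum rule. As a minimal sketch (the function names `majority` and `failure_tolerance` are chosen here for illustration and are not etcd APIs), the majority is the integer quotient of the cluster size by two, plus one, and the failure tolerance is whatever remains:

```python
def majority(n):
    """Smallest number of members that forms a quorum in an n-member cluster."""
    return n // 2 + 1


def failure_tolerance(n):
    """Number of members that may fail while a quorum can still be formed."""
    return n - majority(n)


# Print cluster size, majority, and failure tolerance for sizes 1 through 9.
for n in range(1, 10):
    print(n, majority(n), failure_tolerance(n))
```

Going from an odd size to the next even size raises the majority by one but leaves the failure tolerance unchanged, which is the arithmetic behind the recommendation for odd-sized clusters.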

Does etcd work in cross-region or cross data center deployments?

Deploying etcd across regions improves etcd's fault tolerance since members are in separate failure domains. The cost is higher consensus request latency from crossing data center boundaries. Since etcd relies on a member quorum for consensus, the latency from crossing data centers will be somewhat pronounced because at least a majority of cluster members must respond to consensus requests. Additionally, cluster data must be replicated across all peers, so there will be bandwidth cost as well.

Operation

How to back up an etcd cluster?

Should I add a member before removing an unhealthy member?

When replacing an etcd node, it's important to remove the member first and then add its replacement.

etcd employs distributed consensus based on a quorum model; (n/2)+1 members (with n/2 rounded down), a majority, must agree on a proposal before it can be committed to the cluster. These proposals include key-value updates and membership changes. This model avoids any possibility of split-brain inconsistency. The downside is that permanent quorum loss is catastrophic.

How this applies to membership: If a 3-member cluster has 1 downed member, it can still make forward progress because the quorum is 2 and 2 members are still live. However, adding a new member to a 3-member cluster will increase the quorum to 3 because 3 votes are required for a majority of 4 members. Since the quorum increased, this extra member buys nothing in terms of fault tolerance; the cluster is still one node failure away from being unrecoverable.

Additionally, that new member is risky because it may turn out to be misconfigured or incapable of joining the cluster. In that case, there's no way to recover quorum because the cluster has two members down and two members up, but needs three votes to change membership to undo the botched membership addition. etcd will by default reject member add attempts that could take down the cluster in this manner.

On the other hand, if the downed member is removed from cluster membership first, the number of members becomes 2 and the quorum remains at 2. Following that removal by adding a new member will also keep the quorum steady at 2. So, even if the new node can't be brought up, it's still possible to remove the new member through quorum on the remaining live members.
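The two orderings can be compared with a few lines of arithmetic. This is an illustrative sketch of the reasoning only, assuming the 3-member cluster with one downed member described above; `quorum` and `can_commit` are hypothetical helper names, and nothing here talks to a real cluster.

```python
def quorum(members):
    """Votes needed to commit a proposal, including a membership change."""
    return members // 2 + 1


def can_commit(members, live):
    """True if the live members are enough to reach quorum."""
    return live >= quorum(members)


# Starting point: 3 members, 1 down, so 2 are live.

# Path A: add the replacement first, and the new member fails to join.
a_members, a_live = 3 + 1, 2
print(can_commit(a_members, a_live))  # False: quorum is 3, only 2 are live

# Path B: remove the downed member first, then add; the new member fails to join.
b_members, b_live = 3 - 1 + 1, 2
print(can_commit(b_members, b_live))  # True: quorum is 2, and 2 are live
```

In path B the cluster can still commit a membership change to remove the failed addition, whereas in path A it can no longer commit anything.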

Why won't etcd accept my membership changes?

etcd sets strict-reconfig-check in order to reject reconfiguration requests that would cause quorum loss. Abandoning quorum is really risky (especially when the cluster is already unhealthy). Although it may be tempting to disable quorum checking in order to add a new member when quorum has been lost, this could lead to full-fledged cluster inconsistency. For many applications, this will make the problem even worse ("disk geometry corruption" being a candidate for most terrifying).

Performance

How should I benchmark etcd?

What does the etcd warning "apply entries took too long" mean?

After a majority of etcd members agree to commit a request, each etcd server applies the request to its data store and persists the result to disk. Even with a slow mechanical disk or a virtualized network disk, such as Amazon’s EBS or Google’s PD, applying a request should normally take fewer than 50 milliseconds. If the average apply duration exceeds 100 milliseconds, etcd will warn that entries are taking too long to apply.

The second most common cause is CPU starvation. If monitoring of the machine’s CPU usage shows heavy utilization, there may not be enough compute capacity for etcd. Moving etcd to a dedicated machine, increasing process resource isolation with cgroups, or renicing the etcd server process to a higher priority can usually solve the problem.

Expensive user requests which access too many keys (e.g., fetching the entire keyspace) can also cause long apply latencies. Accessing fewer than several hundred keys per request, however, should always be performant.

What does the etcd warning "failed to send out heartbeat on time" mean?

etcd uses a leader-based consensus protocol for consistent data replication and log execution. Cluster members elect a single leader; all other members become followers. The elected leader must periodically send heartbeats to its followers to maintain its leadership. Followers infer leader failure if no heartbeats are received within an election interval and trigger an election. If a leader doesn’t send its heartbeats in time but is still running, the election is spurious and likely caused by insufficient resources. To catch these soft failures, if the leader skips two heartbeat intervals, etcd will warn that it failed to send a heartbeat on time.

The second most common cause is CPU starvation. If monitoring of the machine’s CPU usage shows heavy utilization, there may not be enough compute capacity for etcd. Moving etcd to a dedicated machine, increasing process resource isolation with cgroups, or renicing the etcd server process to a higher priority can usually solve the problem.

What does the etcd warning "request ignored (cluster ID mismatch)" mean?

Every new etcd cluster generates a new cluster ID based on the initial cluster configuration and a user-provided unique initial-cluster-token value. By having unique cluster IDs, etcd is protected from cross-cluster interaction which could corrupt the cluster.

Usually this warning happens after tearing down an old cluster, then reusing some of the peer addresses for the new cluster. If any etcd process from the old cluster is still running, it will try to contact the new cluster. The new cluster will recognize a cluster ID mismatch, then ignore the request and emit this warning. This warning is often cleared by ensuring peer addresses among distinct clusters are disjoint.

On maximum cluster size: theoretically, there is no hard limit. However, an etcd cluster probably should have no more than seven nodes. The Google Chubby lock service, similar to etcd and widely deployed within Google for many years, suggests running five nodes. A 5-member etcd cluster can tolerate two member failures, which is enough in most cases. Although larger clusters provide better fault tolerance, the write performance suffers because data must be replicated across more machines.

On failure tolerance: an etcd cluster operates so long as a member quorum can be established. If quorum is lost through transient network failures (e.g., partitions), etcd automatically and safely resumes once the network recovers and restores quorum; Raft enforces cluster consistency. For power loss, etcd persists the Raft log to disk; etcd replays the log to the point of failure and resumes cluster participation. For permanent hardware failure, the node may be removed from the cluster through runtime reconfiguration.

On cross-region or cross data center deployments: with longer latencies, the default etcd configuration may cause frequent elections or heartbeat timeouts. See the tuning documentation for adjusting timeouts for high latency deployments.

On backing up a cluster: etcdctl provides a snapshot command to create backups. See the backup documentation for more details.

On benchmarking: try the benchmark tool. Current benchmark results are available for comparison.

On the "apply entries took too long" warning: usually this issue is caused by a slow disk. The disk could be experiencing contention among etcd and other applications, or the disk is simply too slow (e.g., a shared virtualized disk). To rule out a slow disk as the cause of this warning, monitor backend_commit_duration_seconds (p99 duration should be less than 25ms) to confirm the disk is reasonably fast. If the disk is too slow, assigning a dedicated disk to etcd or using a faster disk will typically solve the problem.

If none of the above suggestions clear the "apply entries took too long" warnings, please open an issue with detailed logging, monitoring, metrics and optionally workload information.

On the "failed to send out heartbeat on time" warning: usually this issue is caused by a slow disk. Before the leader sends heartbeats attached with metadata, it may need to persist the metadata to disk. The disk could be experiencing contention among etcd and other applications, or the disk is simply too slow (e.g., a shared virtualized disk). To rule out a slow disk as the cause of this warning, monitor wal_fsync_duration_seconds (p99 duration should be less than 10ms) to confirm the disk is reasonably fast. If the disk is too slow, assigning a dedicated disk to etcd or using a faster disk will typically solve the problem.
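The p99 threshold can be illustrated with a small calculation. This sketch assumes the fsync durations are already available as a plain list of seconds, which is a simplification: the wal_fsync_duration_seconds metric is normally exposed as a histogram and queried through a monitoring system. The helper names `p99` and `disk_is_fast_enough` are hypothetical.

```python
def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of durations."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) using integer math
    return ordered[rank - 1]


def disk_is_fast_enough(fsync_seconds, limit_seconds=0.010):
    """Compare the p99 fsync duration against the 10ms guideline."""
    return p99(fsync_seconds) < limit_seconds


# One slow outlier in 100 samples does not move the p99 ...
mostly_fast = [0.002] * 99 + [0.050]
print(disk_is_fast_enough(mostly_fast))  # True

# ... but two slow samples in 100 do.
too_slow = [0.002] * 98 + [0.050] * 2
print(disk_is_fast_enough(too_slow))  # False
```

The same check applies to backend_commit_duration_seconds by passing `limit_seconds=0.025`.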

A slow network can also cause this issue. If network metrics among the etcd machines show long latencies or a high drop rate, there may not be enough network capacity for etcd. Moving etcd members to a less congested network will typically solve the problem. However, if the etcd cluster is deployed across data centers, long latency between members is expected. For such deployments, tune the heartbeat-interval configuration to roughly match the round trip time between the machines, and the election-timeout configuration to be at least 5 * heartbeat-interval. See the tuning documentation for detailed information.
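As a rough illustration of that rule of thumb, the sketch below derives both settings from a measured round trip time. The function name `suggest_timeouts` and the rounding choice are assumptions for this example; the factor of 5 is the minimum stated above, not a recommended value, and real deployments should follow the tuning documentation.

```python
import math


def suggest_timeouts(rtt_ms, election_factor=5):
    """Return (heartbeat_interval_ms, election_timeout_ms) for a given round trip time.

    The heartbeat interval roughly matches the RTT (rounded up to a whole
    millisecond); the election timeout is election_factor times the heartbeat.
    """
    heartbeat_ms = max(1, math.ceil(rtt_ms))
    election_ms = election_factor * heartbeat_ms
    return heartbeat_ms, election_ms


# Example: members in two data centers with an 80ms round trip time.
heartbeat, election = suggest_timeouts(80)
print(f"--heartbeat-interval={heartbeat} --election-timeout={election}")
# --heartbeat-interval=80 --election-timeout=400
```

Passing a larger `election_factor` gives more headroom against spurious elections at the cost of slower detection of a genuinely failed leader.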

If none of the above suggestions clear the "failed to send out heartbeat on time" warnings, please open an issue with detailed logging, monitoring, metrics and optionally workload information.

Raft
hardware
supported platform
Google Chubby lock service
runtime reconfiguration
tuning
backup
benchmark
benchmark results
backend_commit_duration_seconds
open an issue
wal_fsync_duration_seconds
tuning documentation
open an issue