Troubleshooting Alert: Component etcd is unhealthy in Kubernetes / Rancher 2.x

TLDR

To help you quickly decide whether this article is for you: it deals with three errors/alerts related to the etcd component of a Kubernetes cluster:

# in the Rancher web interface: alert "Component etcd is unhealthy"

# docker logs etcd
panic: freepages: failed to get all reachable pages (page 6987: multiple references)

# docker logs etcd
rafthttp: request cluster ID mismatch ...
rafthttp: request sent was ignored (cluster ID mismatch: ...

 

Rancher etcd alert

We use Rancher 2.x to manage some of our Kubernetes clusters, and it has a nice web interface. One day we got this alert/error about 'etcd is unhealthy'.

Etcd is the Kubernetes database: it holds basically all the cluster information, and you can back up and restore a full cluster using only the etcd data. Fortunately, etcd (usually) runs as an HA system with 3 nodes. But now one of our etcd nodes is down, as can be confirmed with

$ kubectl get componentstatus
Warning: v1 ComponentStatus is deprecated in v1.19+
NAME                 STATUS      MESSAGE                                                                                             ERROR
etcd-1               Unhealthy   Get "https://192.168.60.48:2379/health": dial tcp 192.168.60.48:2379: connect: connection refused
scheduler            Healthy     ok
controller-manager   Healthy     ok
etcd-2               Healthy     {"health":"true"}
etcd-0               Healthy     {"health":"true"}
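
As an aside, since the etcd data alone is enough to restore a full cluster, it is worth having snapshots around (Rancher/RKE also takes its own recurring snapshots). A minimal sketch of taking one manually on a healthy etcd node, assuming the ETCDCTL_* environment variables are pre-set in the container and point at a single local endpoint (adjust --endpoints otherwise):

docker exec etcd etcdctl snapshot save /tmp/etcd-snapshot.db          # snapshot save wants exactly one endpoint
docker cp etcd:/tmp/etcd-snapshot.db ./etcd-snapshot-$(date +%F).db   # copy it out of the container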

Docker logs error

This is a cluster installed directly from the Rancher interface, and by SSH-ing into that VM we can see that etcd runs inside a Docker container:

$ docker ps | grep etcd
5fb87452ddff   rancher/coreos-etcd:v3.4.13-rancher1   "/usr/local/bin/etcd…"   6 months ago   Restarting (2) 21 seconds ago             etcd

So, the container is not really running ("Restarting"...). Looking at the logs:

 

$ docker logs etcd
[...] 
panic: freepages: failed to get all reachable pages (page 6987: multiple references)
goroutine 126 [running]:
go.etcd.io/bbolt.(*DB).freepages.func2(0xc0001c64e0)
/home/ANT.AMAZON.COM/leegyuho/go/pkg/mod/go.etcd.io/bbolt@v1.3.3/db.go:1003 +0xe5
created by go.etcd.io/bbolt.(*DB).freepages
/home/ANT.AMAZON.COM/leegyuho/go/pkg/mod/go.etcd.io/bbolt@v1.3.3/db.go:1001 +0x1b5

The container ends up in the same state when we try to restart it (or do docker container stop ...; docker container start ...).

A few online searches later, the general opinion was that the etcd database is corrupt, probably because of an unclean restart/shutdown, and that the etcd node must be recreated. It will start empty and will sync the database from the other nodes.

Yay, that seems simple, and it is also confirmed by the official Rancher documentation at https://rancher.com/docs/rancher/v2.5/en/troubleshooting/kubernetes-components/etcd/#replacing-unhealthy-etcd-nodes where we find this:

"When a node in your etcd cluster becomes unhealthy, the recommended approach is to fix or remove the failed or unhealthy node before adding a new etcd node to the cluster."

and then the page ends. Wait! Wait!! How? How do we do that?? Well, back to online searches...

 

Etcd - removing and adding nodes

In the same Rancher documentation, at least, we find that we can run etcdctl commands like this on a working etcd node:

$ docker exec etcd etcdctl member list
b5f3d11f27cbc784, started, etcd-k1c1, https://192.168.60.47:2380, https://192.168.60.47:2379, false
d8c8687da4a99f58, started, etcd-k1c2, https://192.168.60.48:2380, https://192.168.60.48:2379, false
dd6d795b336a5dc7, started, etcd-k1c3, https://192.168.60.49:2380, https://192.168.60.49:2379, false
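
Besides member list, it is worth double-checking the health of the members from the same working node. A quick sketch, assuming the ETCDCTL_* environment variables (endpoints, TLS certificates) are pre-set inside the Rancher etcd container, which is usually the case; otherwise pass --endpoints and --cacert/--cert/--key yourself:

docker exec etcd etcdctl endpoint health --cluster            # asks every member; the broken one shows up as unhealthy
docker exec etcd etcdctl endpoint status --cluster -w table   # leader, DB size and raft term per member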

 

OK, that member list output will be useful later. We read more etcd docs to find out how to remove and re-add a node. I have not been able to find information for our exact case, but I think the process should be like this:

etcdctl member remove ...       # on a working etcd node
docker container stop ...       # on the unhealthy etcd node
# remove the data directory so the container will start with an empty database
etcdctl member add ...          # on a working node; we need to check the parameters
docker container start ...      # on the unhealthy etcd node

 

After some experiments and after filling in the blanks, as an example, our commands were (the 1$ prompt means a working etcd node, 2$ the unhealthy one):

1$ docker exec etcd etcdctl member list        # on a working etcd node
1$ docker exec etcd etcdctl member remove c6d99083f92f060c
2$ docker container stop etcd                  # on the unhealthy etcd node
2$ rm -rf /var/lib/etcd/member/                # this is the host directory mounted in the etcd container
1$ docker exec etcd etcdctl member add etcd-vm5 --peer-urls=https://64.225.105.45:2380
   # exactly as it was before, when we saw it in etcdctl member list
2$ docker container start etcd

Success! Drinks!

 

From docker inspect to docker run

Um, not really, not yet... Looking at the container logs, we see a new error:

2021-07-30 06:56:50.766417 E | rafthttp: request cluster ID mismatch (got 1df875ec16cdadc0 want 2aafb0f5f954d734)
2021-07-30 06:56:50.795741 E | rafthttp: request cluster ID mismatch (got 1df875ec16cdadc0 want 2aafb0f5f954d734)
2021-07-30 06:56:50.798215 E | rafthttp: request sent was ignored (cluster ID mismatch: peer[55981658a5324937]=1df875ec16cdadc0, local=2aafb0f5f954d734)
2021-07-30 06:56:50.798893 E | rafthttp: request sent was ignored (cluster ID mismatch: peer

 

So our trick with docker stop; docker start was not enough. After some more online searching, we found out that we need to recreate this container from scratch, with one command-line argument changed! More exactly, we must change this argument:

 

from
"--initial-cluster-state=new"
to
"--initial-cluster-state=existing"

 

which kinda makes sense and seems logical.

We don't have the full docker run command used to start this container, but we can use docker inspect to find out all the details.
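
For example, to peek at just the command line the container was created with and at its volume mounts (a quick sketch; the output is raw JSON and the full docker inspect dump contains much more):

docker inspect --format '{{json .Config.Entrypoint}} {{json .Config.Cmd}}' etcd   # entrypoint + arguments
docker inspect --format '{{json .Mounts}}' etcd                                   # host directories mounted inside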

So, how can we create a docker run command from this? Either with hard work... or by searching online and finding this solution: https://gist.github.com/efrecon/8ce9c75d518b6eb863f667442d7bc679. The run.tpl file is now also saved here: https://github.com/viorel-anghel/alert-component-etcd-is-unhealthy.

We save that file as run.tpl on the unhealthy node and finally we run:

docker inspect etcd >docker-inspect-etcd-save.txt    # save the info, just in case
docker inspect --format "$(<run.tpl)" etcd >docker-run-etcd   # create the file with the docker run command

 

The resulting file will look something like this:
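
(What follows is only an illustrative, abbreviated sketch with placeholder node names and IPs; the real file is generated from your own docker inspect output and will also carry the exact environment variables, volume options and TLS certificate flags of your cluster, with paths that may differ from the typical RKE ones shown here.)

# illustrative sketch only -- your generated docker-run-etcd will differ
docker run \
  --name=etcd \
  --network=host \
  --restart=always \
  --detach=true \
  --volume=/var/lib/etcd:/var/lib/rancher/etcd/ \
  --volume=/etc/kubernetes:/etc/kubernetes \
  rancher/coreos-etcd:v3.4.13-rancher1 \
  /usr/local/bin/etcd \
  --name=etcd-NODE_NAME \
  --data-dir=/var/lib/rancher/etcd/ \
  --initial-cluster=etcd-NODE1=https://IP1:2380,etcd-NODE2=https://IP2:2380,etcd-NODE3=https://IP3:2380 \
  --initial-cluster-state=new \
  --initial-advertise-peer-urls=https://THIS_NODE_IP:2380 \
  --listen-peer-urls=https://0.0.0.0:2380 \
  --advertise-client-urls=https://THIS_NODE_IP:2379 \
  --listen-client-urls=https://0.0.0.0:2379
  # ...plus the --cert-file / --key-file / --peer-* TLS flags, taken verbatim from docker inspect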

Now we just need to create the new container:

docker rm -f etcd  # remove it now, we'll start fresh
rm -rf  /var/lib/etcd/member/
# edit the file docker-run-etcd and replace 
# --initial-cluster-state=new
# with 
# --initial-cluster-state=existing
# then simply run the resulting command:
bash docker-run-etcd
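
If you don't want to edit the file by hand, a sed one-liner can do the flag substitution (just a sketch, assuming the generated file is named docker-run-etcd as above; the .bak suffix keeps a backup copy):

sed -i.bak 's/--initial-cluster-state=new/--initial-cluster-state=existing/' docker-run-etcd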

 

Checking the etcd status with docker logs etcd, etcdctl member list, kubectl get componentstatus: yes, all good!
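
For reference, those checks as plain commands (the first two on the recovered etcd node, the last one from anywhere kubectl is configured; componentstatus is deprecated since v1.19 but still answers):

docker logs --tail 20 etcd              # no more panics or cluster ID mismatch errors
docker exec etcd etcdctl member list    # all three members listed as started
kubectl get componentstatus             # etcd-0, etcd-1, etcd-2 all Healthy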

Lessons learned? I have to study the official etcd documentation at https://etcd.io/docs/v3.5/op-guide/.

As a final piece of advice for my fellow DevOps engineers and sysadmins, please note that I did the etcd member remove/add experiments on a test cluster first, not directly on the problematic one, which is in production.

    

About the Author 


Viorel Anghel has 20+ years of experience as an IT Professional, taking on various roles, such as Systems Architect, Sysadmin, Network Engineer, SRE, DevOps, and Tech Lead. He has a background in Unix/Linux systems administration, high availability, scalability, change, and config management. Also, Viorel is a RedHat Certified Engineer and AWS Certified Solutions Architect, working with Docker, Kubernetes, Xen, AWS, GCP, Cassandra, Kafka, and many other technologies. He is the Head of Cloud and Infrastructure at eSolutions.
