Node Monitoring Device
Prerequisites
We recommend the following for running a node monitoring device:
2 or more CPU cores
At least 40 GB of disk storage
At least 4 GB of memory
At least 10 Mbps of network bandwidth
Must be set up in an environment separate from the validator nodes and sentry nodes
Before setting up a node monitoring device, see the PundiX installation guide to set up the PundiX CLI.
Prometheus metrics
PundiX also supports Prometheus metrics. The monitoring device lets you keep track of your validator nodes, in particular their status and performance.
More information on the list of available metrics and useful queries can be found here.
Deploy and Configure Monitoring Services
Before deploying the monitoring services, install Docker by following the official documentation.
Configure node services
First, clone the pundix repository from GitHub.
The config.toml file is located under $HOME/.pundix/config, and the prometheus.yml file is located under ./pundix/develop/prometheus in the cloned repository.
To enable the Prometheus metrics, set `prometheus=true` in your config file `$HOME/.pundix/config/config.toml`. By setting `prometheus_listen_addr` in the same file, you can choose the port used to monitor your node. It defaults to port `26660`.
In the file `./pundix/develop/prometheus/prometheus.yml` you can configure the IP address of the target node(s). Multiple nodes can be added in the following format.
For example:
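A minimal sketch of such a scrape job, assuming the standard Prometheus `scrape_configs` / `static_configs` layout. The job name and IP addresses are placeholders for illustration, not values taken from the repository; the port matches the default `prometheus_listen_addr` mentioned above.

```yaml
# Sketch of a scrape job in prometheus.yml (standard Prometheus layout).
# The job name and the IP addresses are placeholders, not repository values.
scrape_configs:
  - job_name: "pundix-nodes"          # hypothetical job name
    static_configs:
      - targets:
          - "203.0.113.10:26660"      # first node: <node_IP>:<metrics port>
          - "203.0.113.11:26660"      # add one entry per additional node
```

Each entry under `targets` is one node; the file shipped in the repository may group or label targets differently.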
Telegram Administrator and Bot Configuration
In the file `./pundix/develop/docker-compose.yaml`, under `alertmanager-bot` - `environment`, are the variables `TELEGRAM_ADMIN` and `TELEGRAM_TOKEN`:
For example:
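A minimal sketch of that part of the compose file. Only the service name and the two variable names come from this guide; the values are placeholders, and the real service definition will contain further keys that are omitted here.

```yaml
# Sketch of the alertmanager-bot service in docker-compose.yaml.
# Values are placeholders; other keys of the real service are omitted.
services:
  alertmanager-bot:
    environment:
      TELEGRAM_ADMIN: "123456789"                # your numeric Telegram user ID (placeholder)
      TELEGRAM_TOKEN: "<token_from_botfather>"   # bot token (placeholder)
```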
`TELEGRAM_ADMIN`: The Telegram user ID of the admin (you, the user, not the bot itself). The bot only replies to messages sent by an admin. All other messages are dropped and logged on the bot's console. You can get your user ID from @userinfobot.
`TELEGRAM_TOKEN`: The token you get from @botfather.
Access the monitoring services
If you do not want to change the default username and password, you can start the monitoring service with the following command:
Open port `:9095` (for example `http://<your_IP_address>:9095`) and you will see the Prometheus page. Here you can see all the defined alarm rules. You can change these rules in the file `./pundix/develop/prometheus/rules/pundix-chain-alerts.yml`. The default username and password are `px` and `pundix` respectively.
Open port `:9093` (for example `http://<your_IP_address>:9093`) and you will see the Alertmanager page. You can manage alarm notifications here. The default username and password are `px` and `pundix` respectively.
Open port `:3000` (for example `http://<your_IP_address>:3000`) and you will see the Grafana page. The default username and password are both `admin`; once you have logged in, you will be asked to set a new password. After setting a new password, go to Dashboards > Manage and select 'PundiX Chain Dashboard'. Here you can see a dashboard with various indicators and information for a selected node.
You may find out the details of `<your_IP_address>`.
Authorise inbound traffic on ports 9091, 9093 and 3000 for `<your_IP_address>` on the node monitoring device. You can also allow port 26660 for `<node_monitoring_public_ip>` on the validator instance.
Changing The Default Passwords for Prometheus and Alertmanager
DO NOT use `$` in any of your passwords, as it will not work with the `alertmanager.url`.
You can change the default username and password in the file `./pundix/develop/prometheus/web-config.yml` with the following format:
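A minimal sketch of the format, based on the Prometheus web configuration file, where `basic_auth_users` maps each username to a bcrypt hash of its password. The username shown is the default from this guide and the hash is a placeholder.

```yaml
# Prometheus web-config.yml: each key is a username, each value a bcrypt hash.
basic_auth_users:
  px: "<bcrypt_hash_of_your_password>"   # placeholder; paste the hash you generate
```

The value must be the hash, not the plain-text password.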
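How to hash with bcrypt:
Install apache2 (on Debian/Ubuntu the htpasswd tool is provided by the apache2-utils package).
Run this command:
htpasswd -nBC 10 "" | tr -d ':\n'
Type in your password when prompted.
Copy the resulting hash.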
For more info on prometheus web-configuration see this link.
If you changed the username and password in the `web-config.yml` file, there are three other places that you need to update as well.
Grafana
For Grafana to be able to get data from Prometheus, you will need to update the username and password in the file `./pundix/develop/grafana/provisioning/datasources/datasource.yml`.
The password here is in plain text and does not need to be hashed.
For example:
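A minimal sketch of a Grafana datasource provisioning file with basic auth, assuming a recent Grafana version that reads the password from `secureJsonData`. The datasource name and the URL are placeholders; the credentials are the defaults from this guide.

```yaml
# Sketch of grafana/provisioning/datasources/datasource.yml.
# Name and URL are placeholders; keep whatever the repository file already uses.
apiVersion: 1
datasources:
  - name: Prometheus                              # hypothetical datasource name
    type: prometheus
    access: proxy
    url: "http://<prometheus_host>:<port>"        # placeholder
    basicAuth: true
    basicAuthUser: px                             # same username as in web-config.yml
    secureJsonData:
      basicAuthPassword: pundix                   # plain text, not hashed
```

Older Grafana versions may expect the password under a different key, so check the existing file before changing its structure.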
Prometheus
For Prometheus to be able to send alerts to Alertmanager, you will need to update the username and password in the file `./pundix/develop/prometheus/prometheus.yml`.
For example:
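A minimal sketch of the relevant `alerting` section, using the standard Prometheus `basic_auth` block. The Alertmanager host is a placeholder; the port and the credentials are the defaults mentioned in this guide.

```yaml
# Sketch of the alerting section in prometheus.yml.
alerting:
  alertmanagers:
    - basic_auth:
        username: px          # same username as in web-config.yml
        password: pundix      # plain text, not hashed
      static_configs:
        - targets:
            - "<alertmanager_host>:9093"   # placeholder host
```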
The password here is in plain text and does not need to be hashed.
Alertmanager-bot
For the Telegram bot to be able to obtain information from Alertmanager, you will need to update the username and password within the `--alertmanager.url` in the `./pundix/develop/docker-compose.yaml` file. You may also update the alertmanager-bot.
For example:
The password here is in plain text and does not need to be hashed.
Under `alertmanager-bot`, `command`, you will find `--alertmanager.url`:
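A minimal sketch of that entry. It assumes the credentials are embedded in the URL in the usual `user:password@host` form, which this guide does not confirm; the host is a placeholder and any other flags in the real `command` list are omitted.

```yaml
# Sketch of the alertmanager-bot command in docker-compose.yaml.
# Assumption: credentials are passed as user:password@ inside the URL.
services:
  alertmanager-bot:
    command:
      - "--alertmanager.url=http://px:pundix@<alertmanager_host>:9093"
```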
DO NOT use `$` in any of your passwords, as it will not work with the `alertmanager.url`.
Commands
Start monitoring service:
Restart monitoring service:
Stop monitoring service:
Updating Node Monitoring Services
Update by pulling the latest code with the command below whenever you change the Telegram configuration under `./pundix/develop/docker-compose.yaml`.
Ensure that you have changed your passwords and that your data source is configured correctly.
Prometheus Rules
Metric | Rule | Threshold | Explanation |
---|---|---|---|
| tendermint_consensus_height - (tendermint_consensus_height offset 1m) == 0 | 0 | The node did not produce blocks in 1 minute |
| avg((tendermint_consensus_validators{kind="val-node"} - tendermint_consensus_validators{kind="val-node"} offset 1m) > 0) by (chain_id) | 0 | The number of validators has increased compared to the number of validators a minute ago |
| avg((tendermint_consensus_validators{kind="val-node"} offset 1m - tendermint_consensus_validators{kind="val-node"}) > 0) by (chain_id) | 0 | The number of validators is reduced compared to the number of validators one minute ago |
| tendermint_consensus_latest_block_height - (tendermint_consensus_latest_block_height offset 2m) | 0 | The height of the node does not increase in 2 minutes |
| tendermint_consensus_validator_last_signed_height - (tendermint_consensus_validator_last_signed_height offset 2m) == 0 | 0 | The validator did not sign a block in 2 minutes |
| tendermint_consensus_validator_missed_blocks - (tendermint_consensus_validator_missed_blocks offset 2m) >= 3 | 3 | The validator missed signing at least 3 blocks in 2 minutes |
| tendermint_consensus_missing_validators > 10 | 10 | The number of validators not participating in signing exceeds the threshold of 10 |
| tendermint_consensus_byzantine_validators > 0 | 0 | The number of Byzantine validators exceeds the threshold 0 |
| tendermint_consensus_block_interval_seconds_sum / tendermint_consensus_block_interval_seconds_count > 7 | 7 | The block generation interval exceeds 7 seconds |
| tendermint_consensus_rounds != 0 | 0 | Consensus round is not equal to 0 |
| tendermint_consensus_num_txs > 100 | 100 | The number of transactions in a block exceeds the threshold of 100 |
| tendermint_mempool_size > 100 | 100 | The number of unconfirmed transactions in the mempool exceeds the threshold of 100 |
| tendermint_mempool_failed_txs - (tendermint_mempool_failed_txs offset 1m) > 10 | 10 | The number of failed transactions in the memory pool has increased by more than 10 in 1 minute |
| tendermint_consensus_fast_syncing - (tendermint_consensus_fast_syncing offset 5m) != 0 | 0 | The fast-sync status of the node changed within the last 5 minutes |
| tendermint_p2p_peers < 5 | 5 | The number of connected nodes is below the threshold 5 |
| (tendermint_p2p_peers offset 30s) - tendermint_p2p_peers > 1 | 1 | The number of connected peers has dropped by more than 1 compared with 30 seconds earlier |