Bug #56672
open
'ceph zabbix send' can block (mon) ceph commands and messages
0%
Description
It is possible to DoS the MGR by repeatedly running `ceph zabbix send` manually while the zabbix server is unresponsive.
This can stall important MON messages and commands until the zabbix send returns. If the sends are scheduled (e.g. as a cron job) and keep stacking, MON commands can be blocked indefinitely.
When the mon message throttler is full, `ceph status` shows stale and inaccurate info (presumably it falls back to cached info when it cannot get the latest from the MGR), and other important commands hang, e.g. `ceph osd df`, `ceph pg dump`, etc.
How to reproduce:
1. To make the problem easier to trigger, set the MON message throttler to a very low value in ceph.conf for the MGR:
mgr mon messages = 2
and restart the MGRs.
2. Monitor the mon message throttler on the active MGR using `ceph daemon mgr.{id} perf dump | jq '."throttle-mgr_mon_messsages"'` (the counter name is spelled with three s's, as in the output below).
This looks something like this:
root@soz-mon2:/home/debian# ceph daemon mgr.`hostname` perf dump | jq '."throttle-mgr_mon_messsages"'
{
  "val": 0,
  "max": 2,
  "get_started": 0,
  "get": 9781,
  "get_sum": 9781,
  "get_or_fail_fail": 297888,
  "get_or_fail_success": 9781,
  "take": 0,
  "take_sum": 0,
  "put": 9781,
  "put_sum": 9781,
  "wait": {
    "avgcount": 0,
    "sum": 0,
    "avgtime": 0
  }
}
3. Set up a fake non responsive zabbix server (or real zabbix if you can make it unresponsive).
apt install netcat
nc -k -l -v -p 10051
4. Configure your ceph zabbix module to that server
ceph zabbix config-set zabbix_host {your netcat server}
5. Set up the ceph zabbix host config
ceph zabbix config-set zabbix_host {your fake/unresponsive zabbix server}
6. Run `ceph zabbix send` a few times in the background, probably 5-10 is enough.
for i in `seq 1 5`; do ceph zabbix send & done
7. Check the throttler from perf dump again. It should show "val" equal to "max", with get_or_fail_fail increasing.
Check commands such as `ceph osd df`, `ceph pg dump`, `ceph fs status`, etc.; they will hang.
{
  "val": 2,
  "max": 2,
  "get_started": 0,
  "get": 13085,
  "get_sum": 13085,
  "get_or_fail_fail": 518045,
  "get_or_fail_success": 13085,
  "take": 0,
  "take_sum": 0,
  "put": 13083,
  "put_sum": 13083,
  "wait": {
    "avgcount": 0,
    "sum": 0,
    "avgtime": 0
  }
}
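The checks in steps 2 and 7 can be combined into a small helper that reports both the throttler state and whether the usual commands hang. This is only a sketch: it assumes `jq` and coreutils `timeout` are installed, that the counter key is spelled exactly as in the perf dump output above, and that a "max" of 0 never counts as full. The function names and the 10-second limit are made up for illustration.

```shell
#!/bin/sh
# Illustrative helpers; the function names are not part of Ceph.

# Return 0 (true) when the throttler has no free slots.
# A max of 0 is treated as "never full" (an assumption of this sketch).
throttle_is_full() {
    [ "$2" -gt 0 ] && [ "$1" -ge "$2" ]
}

# Print "val max" for the mon message throttler of MGR id $1.
# Uses the counter key exactly as it appears in the perf dump output.
read_throttle() {
    ceph daemon "mgr.$1" perf dump |
        jq -r '."throttle-mgr_mon_messsages" | "\(.val) \(.max)"'
}

# Return 0 if the command is still running after $1 seconds (it hung).
# Relies on timeout(1) exiting with status 124 when the limit expires.
hangs_within() {
    limit=$1
    shift
    status=0
    timeout "$limit" "$@" >/dev/null 2>&1 || status=$?
    [ "$status" -eq 124 ]
}

# Report the throttler state, then probe the commands from step 7.
check_mgr() {
    read_throttle "$1" | {
        # If nothing could be read, give up on this sample
        # (exit only leaves the pipeline subshell).
        read -r val max || exit 1
        if throttle_is_full "$val" "$max"; then
            echo "throttler FULL (val=$val max=$max)"
        else
            echo "throttler ok (val=$val max=$max)"
        fi
    }
    for cmd in "ceph osd df" "ceph pg dump" "ceph fs status"; do
        # $cmd is deliberately unquoted so it splits into command and arguments.
        if hangs_within 10 $cmd; then
            echo "HUNG: $cmd"
        else
            echo "ok:   $cmd"
        fi
    done
}
```

On the active MGR host this would be run as `check_mgr "$(hostname)"`, before and after step 6. The throttler is read through the admin socket (`ceph daemon`), which is why it can still be queried while the other commands are stuck.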
In this test, once all the zabbix commands time out (60s), the throttler is released and everything comes back. If `ceph zabbix send` keeps being run indefinitely (we discovered this behaviour from a `ceph zabbix send` cron job), the throttler never drains and the MON commands stay blocked.
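For the cron case, one possible stopgap is to keep scheduled runs from stacking. The wrapper below is a sketch of that idea and not the change proposed upstream: it assumes `flock` (util-linux) and `timeout` (coreutils) are installed, and the function name, lock path and 90-second limit are illustrative. The limit is set above the 60s timeout observed in this test so the lock is held for as long as the MGR may still be working on the request. Whether the MGR frees the throttle slot when the client goes away early has not been checked here.

```shell
#!/bin/sh
# Hypothetical wrapper; name and arguments are illustrative.
#
# Run a command only if no earlier run still holds the lock, and bound its
# runtime. Exit status: the command's own status, 124 if the limit expired,
# or 1 if the lock was busy (the flock -n default, which is indistinguishable
# from a command that itself exits 1).
run_exclusive() {
    lockfile=$1
    limit=$2
    shift 2
    flock -n "$lockfile" timeout "$limit" "$@"
}

# Equivalent one-line crontab entry (schedule and lock path are placeholders):
# * * * * * flock -n /var/lock/ceph-zabbix-send.lock timeout 90 ceph zabbix send
```

With this in place at most one `ceph zabbix send` from cron is in flight at a time, so an unresponsive zabbix server should tie up one throttler slot rather than an ever-growing number.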
Updated by Rafael Lopez almost 2 years ago
submitted PR https://github.com/ceph/ceph/pull/47225
Updated by Konstantin Shalygin almost 2 years ago
- Status changed from New to Fix Under Review
- Assignee set to Rafael Lopez
- Source set to Community (user)
- Pull request ID set to 47225
Updated by Patrick Donnelly almost 2 years ago
- Target version changed from v17.2.2 to v18.0.0