Bug #56672: 'ceph zabbix send' can block (mon) ceph commands and messages - mgr - Ceph

Added by Rafael Lopez almost 2 years ago. Updated almost 2 years ago.

Status: Fix Under Review
Priority: Normal
Assignee: Rafael Lopez
Category: zabbix module
Target version: Ceph - v18.0.0
Source: Community (user)
Backport: quincy
Severity: 3 - minor
Pull request ID: 47225

Description

It is possible to DoS the MGR by running `ceph zabbix send` repeatedly by hand while the zabbix server is unresponsive.

This can block important MON messages and commands until the zabbix send returns. If the sends are configured as (e.g.) a cron job and keep stacking up, this can block MON commands indefinitely.

When the mon message throttler is full, `ceph status` shows stale and inaccurate info (presumably it falls back to cached info if it can't get the latest from the MGR), and other important commands hang, e.g. `ceph osd df`, `ceph pg dump`, etc.

How to reproduce:

1. To make the problem easier to trigger, set the MON message throttler to a very low value in ceph.conf for the MGR:

mgr mon messages = 2

and restart the MGRs.

2. Monitor the mon message throttler on the active MGR using `ceph daemon mgr.{id} perf dump | jq '."throttle-mgr_mon_messsages"'` (the counter name is spelled with three s's, as in the session below).
The output looks something like this:

root@soz-mon2:/home/debian# ceph daemon mgr.`hostname` perf dump | jq '."throttle-mgr_mon_messsages"'
{
  "val": 0,
  "max": 2,
  "get_started": 0,
  "get": 9781,
  "get_sum": 9781,
  "get_or_fail_fail": 297888,
  "get_or_fail_success": 9781,
  "take": 0,
  "take_sum": 0,
  "put": 9781,
  "put_sum": 9781,
  "wait": {
    "avgcount": 0,
    "sum": 0,
    "avgtime": 0
  }
}
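
The same check can be scripted. The following is a minimal sketch in Python, assuming the perf dump JSON has already been captured as a string (for example from the `ceph daemon ... perf dump` command above). The function name `throttle_is_full` is hypothetical, the counter section name is copied from the output above, and treating a `max` of 0 as "no limit" is an assumption.

```python
import json

# Counter section name as it appears in the perf dump output above.
THROTTLE_KEY = "throttle-mgr_mon_messsages"


def throttle_is_full(perf_dump_json, key=THROTTLE_KEY):
    """Return True when the throttle's current value has reached its maximum."""
    counters = json.loads(perf_dump_json)[key]
    # Assumption: a max of 0 means the throttle is unlimited, so never "full".
    return counters["max"] > 0 and counters["val"] >= counters["max"]


# Trimmed-down sample with the same shape as the dump above.
sample = '{"throttle-mgr_mon_messsages": {"val": 2, "max": 2}}'
print(throttle_is_full(sample))
```

With `"val": 0` as in the dump above, the function returns False; it only returns True once `val` has reached `max`.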

3. Set up a fake unresponsive zabbix server (or use a real zabbix server if you can make it unresponsive).

apt install netcat
nc -k -l -v -p 10051
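
If netcat is not available, a listener that accepts connections and never answers can be sketched in Python. This is only an illustration of the same idea, not part of the original reproduction; the function name `start_silent_listener` is hypothetical, and the port matches the `nc` command above.

```python
import socket
import threading


def start_silent_listener(host="127.0.0.1", port=10051):
    """Accept TCP connections and never reply, so clients wait for their own timeout."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen()
    held = []  # keep references so accepted connections stay open

    def accept_forever():
        while True:
            try:
                conn, _addr = server.accept()
            except OSError:
                return  # listening socket was closed
            held.append(conn)  # never read from or write to the connection

    threading.Thread(target=accept_forever, daemon=True).start()
    return server
```

To serve other hosts, call `start_silent_listener(host="0.0.0.0")` and keep the process alive afterwards, for example with `threading.Event().wait()`.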

4. Point your ceph zabbix module at that server

ceph zabbix config-set zabbix_host {your netcat server}

5. Set up the ceph zabbix host config

ceph zabbix config-set zabbix_host {your fake/unresponsive zabbix server}

6. Run `ceph zabbix send` a few times in the background, probably 5-10 is enough.

for i in `seq 1 5`; do ceph zabbix send & done

7. Check the throttler from perf dump again. It should show "val" at max and get_or_fail_fail increasing.
Check commands such as `ceph osd df`, `ceph pg dump`, `ceph fs status`, etc. They will hang.

{
  "val": 2,
  "max": 2,
  "get_started": 0,
  "get": 13085,
  "get_sum": 13085,
  "get_or_fail_fail": 518045,
  "get_or_fail_success": 13085,
  "take": 0,
  "take_sum": 0,
  "put": 13083,
  "put_sum": 13083,
  "wait": {
    "avgcount": 0,
    "sum": 0,
    "avgtime": 0
  }
}
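
To make the change between the two dumps explicit, the counters can be compared directly. The sketch below is illustrative only: `throttle_delta` is a hypothetical helper, and reading `get - put` as the number of slots still held is an assumption (it does match `val` in both dumps above).

```python
def throttle_delta(before, after):
    """Summarise how a throttle's counters changed between two perf dump snapshots."""
    return {
        # Assumption: slots acquired but not yet released.
        "in_flight": after["get"] - after["put"],
        # Failed non-blocking acquisitions that happened between the snapshots.
        "new_failures": after["get_or_fail_fail"] - before["get_or_fail_fail"],
    }


# Values copied from the two dumps in this report.
before = {"val": 0, "max": 2, "get": 9781, "put": 9781, "get_or_fail_fail": 297888}
after = {"val": 2, "max": 2, "get": 13085, "put": 13083, "get_or_fail_fail": 518045}
print(throttle_delta(before, after))  # {'in_flight': 2, 'new_failures': 220157}
```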

In this test, once the zabbix commands have all timed out (60s), the throttler is released and everything comes back. If `ceph zabbix send` keeps being run indefinitely (we discovered this behaviour from a `ceph zabbix send` cron job), the sends keep stacking and MON commands stay blocked indefinitely.