Bug #65768: rados/verify: "Health check failed: 1 osds down (OSD_DOWN)" in cluster log - RADOS - Ceph

Added by Sridhar Seshasayee 15 days ago. Updated 1 day ago.

Status: New
Priority: Normal
Assignee: Sridhar Seshasayee
Severity: 3 - minor
Pull request ID: 57485

Description

This was observed on squid. I couldn't find a tracker on main related to this test.
Further analysis is needed to determine whether this also needs to be fixed on the main branch.
If that analysis shows the fix is needed on main as well, this tracker can probably be
grouped with https://tracker.ceph.com/issues/65521, which tracks a number of other
trackers related to adding cluster log warnings to the ignorelist.

/a/yuriw-2024-04-30_03:21:19-rados-wip-yuri4-testing-2024-04-29-0642-distro-default-smithi/7680387

Description:
rados/verify/{centos_latest ceph clusters/{fixed-2 openstack} d-thrash/none mon_election/classic msgr-failures/few msgr/async objectstore/bluestore-low-osd-mem-target rados tasks/rados_api_tests validater/valgrind}

The OSD_DOWN warning is expected since the OSD is taken down as part of the thrasher. The warning is eventually cleared.
This warning must therefore be added to the ignorelist.

2024-04-30T11:35:49.898+0000 11fae640 10 mon.a@0(leader).log v419  logging 2024-04-30T11:35:49.821422+0000 mon.a (mon.0) 1015 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)

...

2024-04-30T11:35:55.298+0000 d3a4640  0 log_channel(cluster) log [INF] : Health check cleared: OSD_DOWN (was: 1 osds down)
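
To illustrate what adding the warning to the ignorelist would achieve, here is a minimal sketch of a log scan that flags warning lines unless they match an ignorelist regex. This is not teuthology's actual code: the function name `first_unexpected_line`, the set of flagged severities, and the pattern `\(OSD_DOWN\)` are assumptions made for illustration. The two log lines are the ones quoted above.

```python
import re

# Severities that the scan treats as failures (an assumption for this sketch).
FLAGGED = re.compile(r"\[(ERR|WRN|SEC)\]")


def first_unexpected_line(log_lines, ignorelist):
    """Return the first flagged line not matched by any ignorelist regex, else None."""
    ignore = [re.compile(pattern) for pattern in ignorelist]
    for line in log_lines:
        if FLAGGED.search(line) and not any(r.search(line) for r in ignore):
            return line
    return None


# Log lines copied from this tracker.
log_lines = [
    "2024-04-30T11:35:49.821422+0000 mon.a (mon.0) 1015 : cluster [WRN] "
    "Health check failed: 1 osds down (OSD_DOWN)",
    "2024-04-30T11:35:55.298+0000 d3a4640  0 log_channel(cluster) log [INF] : "
    "Health check cleared: OSD_DOWN (was: 1 osds down)",
]

# Without an ignorelist entry, the [WRN] line is reported.
without_entry = first_unexpected_line(log_lines, [])

# With a hypothetical ignorelist entry for OSD_DOWN, nothing is reported.
with_entry = first_unexpected_line(log_lines, [r"\(OSD_DOWN\)"])
```

The parentheses in the pattern are escaped because ignorelist entries are treated as regular expressions here; an unescaped `(OSD_DOWN)` would also match the "Health check cleared" line, which is harmless in this sketch since [INF] lines are never flagged.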

Related issues 1 (1 open — 0 closed)

Updated by Laura Flores 11 days ago

Related to Cleanup #65521: Add expected warnings in cluster log to ignorelists added

Updated by Radoslaw Zarzynski 11 days ago

Sridhar, are you working on this?

Updated by Sridhar Seshasayee 10 days ago

Assignee set to Sridhar Seshasayee

@Radoslaw Zarzynski I found this during a review of a squid run that included a couple of my PRs.
I wasn't working on this, but to help out I can take it up and come up with a fix.

Updated by Laura Flores 4 days ago

@Sridhar Seshasayee you can add me as a reviewer if you raise a PR to add this to the ignorelist, or otherwise.

Updated by Sridhar Seshasayee 1 day ago · Edited

Further analysis of the logs shows that the OSD_DOWN warning was generated because osd.1 exceeded the heartbeat grace timeout period,
as shown below, and the warning was cleared a few seconds later:

2024-04-30T11:35:48.878+0000 fba9640 10 mon.a@0(leader).log v418  logging 2024-04-30T11:35:48.875903+0000 mon.a (mon.0) 1013 : cluster [INF] osd.1 failed (root=default,host=smithi005) (2 reporters from different osd after 87.150472 >= grace 80.000000)

...

2024-04-30T11:35:49.809+0000 d3a4640  2 mon.a@0(leader).osd e308  osd.1 DOWN
2024-04-30T11:35:49.810+0000 d3a4640 10 mon.a@0(leader).osd e308 encode_pending encoding full map with squid features 1080873258835847684
2024-04-30T11:35:49.813+0000 d3a4640 20 mon.a@0(leader).osd e308 encode_pending mon is running version: 19.0.0-2455-g09dbd6bb
2024-04-30T11:35:49.813+0000 d3a4640 20 mon.a@0(leader).osd e308  full_crc 1068048625 inc_crc 3555704646
2024-04-30T11:35:49.819+0000 d3a4640 10 mon.a@0(leader) e1 log_health updated 1 previous 0
2024-04-30T11:35:49.819+0000 d3a4640  0 log_channel(cluster) log [WRN] : Health check failed: 1 osds down (OSD_DOWN)

...

2024-04-30T11:35:55.298+0000 d3a4640  0 log_channel(cluster) log [INF] : Health check cleared: OSD_DOWN (was: 1 osds down)

The test sets osd_heartbeat_grace to 80 seconds. Historically, this value was 40 seconds; it was
increased to 80 seconds a while ago as part of the following commit:
https://github.com/ceph/ceph/pull/34011/commits/4fda9d50f09d527262fd65eab9b9cff3fd700aad

Considering the nature of the test, the osd_heartbeat_grace timeout can be further increased to 90 seconds
on main and, to begin with, backported only to squid.
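
As a quick check of the numbers, the sketch below parses the "osd.1 failed" line quoted above and compares the reported elapsed time against both the current grace and the proposed 90 seconds. It only replays the logged values; it does not model how the monitor computes the effective grace. The regex and the helper name `parse_failure` are assumptions for illustration.

```python
import re

# Matches the "osd.N failed ... after <elapsed> >= grace <grace>" message.
FAILED_RE = re.compile(r"osd\.(\d+) failed .* after ([\d.]+) >= grace ([\d.]+)")


def parse_failure(line):
    """Return (osd_id, elapsed_secs, grace_secs) from a failure line, or None."""
    match = FAILED_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2)), float(match.group(3))


# Log line copied from this tracker.
line = (
    "2024-04-30T11:35:48.875903+0000 mon.a (mon.0) 1013 : cluster [INF] "
    "osd.1 failed (root=default,host=smithi005) "
    "(2 reporters from different osd after 87.150472 >= grace 80.000000)"
)

osd_id, elapsed, grace = parse_failure(line)

proposed_grace = 90.0  # value proposed in this comment
exceeded_current = elapsed >= grace            # True: 87.15 >= 80
exceeds_proposed = elapsed >= proposed_grace   # False: 87.15 < 90
headroom = proposed_grace - elapsed            # about 2.85 seconds

print(f"osd.{osd_id}: elapsed={elapsed:.2f}s grace={grace:.0f}s "
      f"proposed={proposed_grace:.0f}s headroom={headroom:.2f}s")
```

For this particular event, a grace of 90 seconds would have left roughly 2.85 seconds of headroom, so the margin is small.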

@Radoslaw Zarzynski, what do you think?
