Bug #65768

open

rados/verify: "Health check failed: 1 osds down (OSD_DOWN)" in cluster log

Added by Sridhar Seshasayee 15 days ago. Updated 1 day ago.

Status: New
Priority: Normal
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This was observed on squid. I couldn't find a tracker on main related to this test.
A fuller analysis is also needed to determine whether this must be fixed on the main branch.
If it must, this tracker can probably be combined with https://tracker.ceph.com/issues/65521,
which tracks several other trackers about adding expected cluster log warnings to the ignorelist.

/a/yuriw-2024-04-30_03:21:19-rados-wip-yuri4-testing-2024-04-29-0642-distro-default-smithi/7680387

Description:
rados/verify/{centos_latest ceph clusters/{fixed-2 openstack} d-thrash/none mon_election/classic msgr-failures/few msgr/async objectstore/bluestore-low-osd-mem-target rados tasks/rados_api_tests validater/valgrind}

The OSD_DOWN warning is expected since the OSD is taken down as part of the thrasher. The warning is eventually cleared.
This warning must therefore be added to the ignorelist.

2024-04-30T11:35:49.898+0000 11fae640 10 mon.a@0(leader).log v419  logging 2024-04-30T11:35:49.821422+0000 mon.a (mon.0) 1015 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)

...

2024-04-30T11:35:55.298+0000 d3a4640  0 log_channel(cluster) log [INF] : Health check cleared: OSD_DOWN (was: 1 osds down)
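To illustrate what adding the warning to the ignorelist would achieve, here is a minimal sketch. It assumes ignorelist entries are regular expressions searched against each cluster log line; the `IGNORELIST` name, the `is_ignored` helper, and the pattern itself are hypothetical and only for illustration, not the actual QA suite change.

```python
import re

# Hypothetical ignorelist entry (an assumption for illustration): match the
# parenthesised health check code as it appears in the cluster log warning.
IGNORELIST = [r"\(OSD_DOWN\)"]


def is_ignored(line, patterns=IGNORELIST):
    """Return True if any ignorelist pattern is found in the log line."""
    return any(re.search(pattern, line) for pattern in patterns)


# The warning line quoted in this tracker.
warn_line = (
    "2024-04-30T11:35:49.821422+0000 mon.a (mon.0) 1015 : cluster [WRN] "
    "Health check failed: 1 osds down (OSD_DOWN)"
)
# A different warning, made up here to show a non-matching case.
other_line = "cluster [WRN] Health check failed: 1 slow ops (SLOW_OPS)"

print(is_ignored(warn_line))   # the OSD_DOWN warning matches the pattern
print(is_ignored(other_line))  # an unrelated warning does not
```

With such an entry in place, the OSD_DOWN warning would no longer fail the job, while unrelated warnings would still be reported.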

Related issues 1 (1 open, 0 closed)

Related to RADOS - Cleanup #65521: Add expected warnings in cluster log to ignorelists (New)

Actions #1

Updated by Laura Flores 11 days ago

  • Related to Cleanup #65521: Add expected warnings in cluster log to ignorelists added
Actions #2

Updated by Radoslaw Zarzynski 11 days ago

Sridhar, are you working on this?

Actions #3

Updated by Sridhar Seshasayee 10 days ago

  • Assignee set to Sridhar Seshasayee

@Radoslaw Zarzynski I found this during a review of a squid run that included a couple of my PRs.
I wasn't working on this, but to help out I can take it up and come up with a fix.

Actions #4

Updated by Laura Flores 4 days ago

@Sridhar Seshasayee you can add me as a reviewer if you raise a PR to add this to the ignorelist, or otherwise.

Actions #5

Updated by Sridhar Seshasayee 1 day ago · Edited

Further analysis of the logs shows that the OSD_DOWN warning was generated because osd.1 exceeded the heartbeat grace period,
as shown below, and the warning was cleared a few seconds later:

2024-04-30T11:35:48.878+0000 fba9640 10 mon.a@0(leader).log v418  logging 2024-04-30T11:35:48.875903+0000 mon.a (mon.0) 1013 : cluster [INF] osd.1 failed (root=default,host=smithi005) (2 reporters from different osd after 87.150472 >= grace 80.000000)

...

2024-04-30T11:35:49.809+0000 d3a4640  2 mon.a@0(leader).osd e308  osd.1 DOWN
2024-04-30T11:35:49.810+0000 d3a4640 10 mon.a@0(leader).osd e308 encode_pending encoding full map with squid features 1080873258835847684
2024-04-30T11:35:49.813+0000 d3a4640 20 mon.a@0(leader).osd e308 encode_pending mon is running version: 19.0.0-2455-g09dbd6bb
2024-04-30T11:35:49.813+0000 d3a4640 20 mon.a@0(leader).osd e308  full_crc 1068048625 inc_crc 3555704646
2024-04-30T11:35:49.819+0000 d3a4640 10 mon.a@0(leader) e1 log_health updated 1 previous 0
2024-04-30T11:35:49.819+0000 d3a4640  0 log_channel(cluster) log [WRN] : Health check failed: 1 osds down (OSD_DOWN)

...

2024-04-30T11:35:55.298+0000 d3a4640  0 log_channel(cluster) log [INF] : Health check cleared: OSD_DOWN (was: 1 osds down)
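As a quick check of how long the warning was active, the two timestamps from the excerpts above can be subtracted. This is a minimal sketch; the `parse_ts` helper is a hypothetical name, and the format string is an assumption based on the timestamps as they appear in these log lines.

```python
from datetime import datetime


def parse_ts(ts):
    """Parse a timestamp such as 2024-04-30T11:35:49.819+0000."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f%z")


# Timestamps copied from the "Health check failed" and
# "Health check cleared" lines above.
raised = parse_ts("2024-04-30T11:35:49.819+0000")
cleared = parse_ts("2024-04-30T11:35:55.298+0000")

active_seconds = (cleared - raised).total_seconds()
print(f"OSD_DOWN was active for {active_seconds:.3f} s")
```

This gives roughly 5.5 seconds between the warning being raised and cleared.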

The test sets osd_heartbeat_grace to 80 seconds. Historically, this value was 40 seconds; it was
increased to 80 seconds a while ago as part of the following commit:
https://github.com/ceph/ceph/pull/34011/commits/4fda9d50f09d527262fd65eab9b9cff3fd700aad

Given the nature of the test, osd_heartbeat_grace can be increased further to 90 seconds
on main and backported only to squid to begin with.
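The effect of the proposed value on this particular failure can be sketched as follows. The regular expression, `parse_failure`, and `would_mark_down` are hypothetical helpers written for this note; the comparison simply mirrors the "after X >= grace Y" wording in the monitor log line and is not the monitor's actual implementation.

```python
import re

# Assumed shape of the failure report, based on the log line quoted above.
FAILED_RE = re.compile(r"(osd\.\d+) failed .*after ([\d.]+) >= grace ([\d.]+)")


def parse_failure(line):
    """Return (osd, elapsed_seconds, grace_seconds), or None if no match."""
    match = FAILED_RE.search(line)
    if match is None:
        return None
    return match.group(1), float(match.group(2)), float(match.group(3))


def would_mark_down(elapsed, grace):
    """Simplified check: a failure is reported once elapsed reaches grace."""
    return elapsed >= grace


line = (
    "cluster [INF] osd.1 failed (root=default,host=smithi005) "
    "(2 reporters from different osd after 87.150472 >= grace 80.000000)"
)
osd, elapsed, grace = parse_failure(line)

print(osd, would_mark_down(elapsed, grace))  # current grace from the log
print(osd, would_mark_down(elapsed, 90.0))   # proposed grace of 90 seconds
```

For this one sample, the 87.15 seconds of elapsed time falls under a 90 second grace, so osd.1 would not have been marked down. A single run does not show how much margin other runs would need.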

@Radoslaw Zarzynski, what do you think?

Actions #6

Updated by Sridhar Seshasayee 1 day ago

  • Pull request ID set to 57485
Actions #7

Updated by Laura Flores 1 day ago

/a/yuriw-2024-04-11_17:03:54-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7652461

Actions #8

Updated by Laura Flores 1 day ago

Laura Flores wrote in #note-7:

/a/yuriw-2024-04-11_17:03:54-rados-wip-yuri6-testing-2024-04-02-1310-distro-default-smithi/7652461

Oops, this is not rados/verify; ignore
