Bug #64988: qa: fs:workloads mgr client evicted indicated by "cluster [WRN] evicting unresponsive client smithi042:x (15288), after 303.306 seconds" - CephFS - Ceph

Bug #64988

closed

qa: fs:workloads mgr client evicted indicated by "cluster [WRN] evicting unresponsive client smithi042:x (15288), after 303.306 seconds"

Added by Patrick Donnelly 2 months ago. Updated about 1 month ago.

Status:

Resolved

Priority:

High

Assignee:

Patrick Donnelly

Category:

Testing

Target version:

Ceph - v20.0.0

Source:

Q/A

Tags:

backport_processed

Backport:

squid,reef

Severity:

3 - minor

Labels (FS):

qa-failure

Pull request ID:

56354

Description

https://pulpito.ceph.com/pdonnell-2024-03-19_04:56:42-fs-wip-batrick-testing-20240318.181317-distro-default-smithi/7610533/

and many others in that run
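
When scanning many jobs for the same failure, it can help to pull the fields out of the warning quoted in the ticket title. The following is a minimal sketch, not part of the Ceph QA tooling; the function name `parse_eviction` and the returned dictionary keys are hypothetical, and the pattern is written only against the single message shown in this ticket.

```python
import re

# Pattern written against the warning quoted in this ticket's title.
# Other Ceph versions may word the message differently (assumption).
EVICT_RE = re.compile(
    r"evicting unresponsive client (?P<name>\S+) \((?P<id>\d+)\), "
    r"after (?P<secs>\d+(?:\.\d+)?) seconds"
)


def parse_eviction(line):
    """Return the client name, id and idle time from a warning line, or None."""
    m = EVICT_RE.search(line)
    if m is None:
        return None
    return {
        "name": m.group("name"),
        "id": int(m.group("id")),
        "seconds": float(m.group("secs")),
    }


sample = (
    "cluster [WRN] evicting unresponsive client smithi042:x (15288), "
    "after 303.306 seconds"
)
print(parse_eviction(sample))
```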

Related issues 3 (1 open — 2 closed)

Related to CephFS - Bug #64985: qa: mgr logs do not include client debugging

Pending Backport

Patrick Donnelly

Copied to CephFS - Backport #65092: reef: qa: fs:workloads mgr client evicted indicated by "cluster [WRN] evicting unresponsive client smithi042:x (15288), after 303.306 seconds"

Resolved

Patrick Donnelly

Copied to CephFS - Backport #65093: squid: qa: fs:workloads mgr client evicted indicated by "cluster [WRN] evicting unresponsive client smithi042:x (15288), after 303.306 seconds"

Resolved

Patrick Donnelly

Updated by Patrick Donnelly 2 months ago

Related to Bug #64985: qa: mgr logs do not include client debugging added

Updated by Patrick Donnelly about 2 months ago

Status changed from New to In Progress
Assignee set to Patrick Donnelly

Okay, so as expected this is a non-issue:

2024-03-20T18:59:44.324+0000 7ff1adba6700  1 -- 172.21.15.42:0/4057698876 <== mon.0 v2:172.21.15.42:3300/0 2621 ==== mgrmap(e 19) ==== 137871+0+0 (secure 0 0 0) 0x55bdef6bef00 con 0x55bdec7ec400
2024-03-20T18:59:44.324+0000 7ff1adba6700 10 mgr ms_dispatch2 active mgrmap(e 19)
2024-03-20T18:59:44.324+0000 7ff1adba6700  4 mgr handle_mgr_map received map epoch 19
2024-03-20T18:59:44.324+0000 7ff1adba6700  4 mgr handle_mgr_map active in map: 1 active is 14150
2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr handle_mgr_map respawning because set of enabled modules changed!
2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr respawn  e: '/usr/bin/ceph-mgr'
2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr respawn  0: '/usr/bin/ceph-mgr'
2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr respawn  1: '-n'
2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr respawn  2: 'mgr.x'
2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr respawn  3: '-f'
2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr respawn  4: '--setuser'
2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr respawn  5: 'ceph'
2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr respawn  6: '--setgroup'
2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr respawn  7: 'ceph'
2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr respawn  8: '--default-log-to-file=false'
2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr respawn  9: '--default-log-to-journald=true'
2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr respawn  10: '--default-log-to-stderr=false'
2024-03-20T18:59:44.325+0000 7ff1adba6700  1 mgr respawn respawning with exe /usr/bin/ceph-mgr
2024-03-20T18:59:44.325+0000 7ff1adba6700  1 mgr respawn  exe_path /proc/self/exe

/teuthology/pdonnell-2024-03-20_18:16:52-fs-wip-batrick-testing-20240320.145742-distro-default-smithi/7612921/remote/smithi042/log/6efffee4-e6ea-11ee-95c9-87774f69a715/ceph-mgr.x.log.gz

The mgr modules changed so it rebooted and the client instance got evicted.
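
To check whether other failed jobs show the same cause, the respawn reason can be extracted from the mgr log. This is a rough sketch that assumes the line layout seen in the excerpt above (timestamp, thread id, debug level, message); `find_respawns` is a hypothetical helper, not an existing teuthology or Ceph function.

```python
import re

# Layout assumed from the excerpt: "<timestamp> <thread id> <level> <message>".
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<thread>[0-9a-f]+)\s+(?P<level>-?\d+)\s+(?P<msg>.*)$"
)
# Matches the handle_mgr_map respawn message and captures the stated reason.
RESPAWN_RE = re.compile(r"mgr handle_mgr_map respawning because (?P<reason>.+?)!?$")


def find_respawns(lines):
    """Return (timestamp, reason) for every respawn announcement in lines."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue
        r = RESPAWN_RE.search(m.group("msg"))
        if r is not None:
            events.append((m.group("ts"), r.group("reason")))
    return events


excerpt = [
    "2024-03-20T18:59:44.324+0000 7ff1adba6700  4 mgr handle_mgr_map received map epoch 19",
    "2024-03-20T18:59:44.324+0000 7ff1adba6700  1 mgr handle_mgr_map respawning because set of enabled modules changed!",
    "2024-03-20T18:59:44.325+0000 7ff1adba6700  1 mgr respawn respawning with exe /usr/bin/ceph-mgr",
]
for ts, reason in find_respawns(excerpt):
    print(ts, reason)
```

For a compressed log such as the `ceph-mgr.x.log.gz` file referenced above, the lines could be read with `gzip.open(path, "rt")` from the Python standard library and passed to the same function.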

I'll work on a fix.

Updated by Patrick Donnelly about 2 months ago

Status changed from In Progress to Fix Under Review
Pull request ID set to 56354

Updated by Greg Farnum about 2 months ago

The mgr modules changed so it rebooted and the client instance got evicted.

o_0

Shouldn’t we do a polite unmount when rebooting? Leaving a hanging client session from the manager seems real bad…
I guess when the monitor fails it over, it does a blocklist entry so the mds cleans up faster? Otherwise there’d be disasters there, too.

Updated by Patrick Donnelly about 2 months ago

Greg Farnum wrote:

The mgr modules changed so it rebooted and the client instance got evicted.

o_0

Shouldn’t we do a polite unmount when rebooting? Leaving a hanging client session from the manager seems real bad…
I guess when the monitor fails it over, it does a blocklist entry so the mds cleans up faster? Otherwise there’d be disasters there, too.

It's not really a big deal, and it's unlikely to happen in production. Again, it only happens when a failover occurs between the time the session is established and the time the beacon carrying the client addr is sent to the mons. The mgr doesn't do anything with the mount until it has the acknowledgement [*].

[*] Actually, that only holds once https://github.com/ceph/ceph/pull/51169 is merged. See:

https://github.com/ceph/ceph/pull/51169/files#diff-50ab66411d9293d402a15e00ed6843a4d37889c616873e69534e609c210f72ec
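
The race window described in this comment can be illustrated with a toy model. This is only a sketch of the reasoning in the thread, not Ceph code: the class, field names and outcome strings are all hypothetical, and it assumes (as Greg Farnum guesses above) that the mons blocklist a failed mgr's registered client addrs, while an unregistered session lingers until the MDS evicts it as unresponsive.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MgrClientTimeline:
    """Event times in seconds for one mgr CephFS client (toy model)."""
    session_established: float
    addr_beacon_sent: Optional[float]  # None if the beacon was never sent
    failover: float


def outcome(t: MgrClientTimeline) -> str:
    """Classify what happens to the client session after the mgr fails over."""
    if t.failover < t.session_established:
        # The mgr went away before it ever opened a session with the MDS.
        return "no session"
    if t.addr_beacon_sent is not None and t.addr_beacon_sent <= t.failover:
        # The mons knew the client addr, so they can blocklist it right away.
        return "blocklisted on failover"
    # Failover hit the window between session setup and the beacon:
    # nothing cleans up the session until the MDS times it out.
    return "evicted after timeout"


print(outcome(MgrClientTimeline(1.0, 2.0, 5.0)))   # beacon sent before failover
print(outcome(MgrClientTimeline(1.0, None, 1.5)))  # failover inside the window
```

In this model only the second timeline produces the "evicting unresponsive client" warning, which matches the claim that the problem needs a failover inside a narrow window.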

Updated by Patrick Donnelly about 2 months ago

Status changed from Fix Under Review to Pending Backport

Updated by Backport Bot about 2 months ago

Copied to Backport #65092: reef: qa: fs:workloads mgr client evicted indicated by "cluster [WRN] evicting unresponsive client smithi042:x (15288), after 303.306 seconds" added

Updated by Backport Bot about 2 months ago

Copied to Backport #65093: squid: qa: fs:workloads mgr client evicted indicated by "cluster [WRN] evicting unresponsive client smithi042:x (15288), after 303.306 seconds" added
