Bug #42213
Status: Closed
test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active"
Description
The MDS reached the reject state ("up:active") rather than the expected "up:reconnect":
2019-10-02T22:28:57.689 INFO:tasks.cephfs_test_runner:ERROR: test_reconnect_eviction (tasks.cephfs.test_client_recovery.TestClientRecovery)
2019-10-02T22:28:57.689 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2019-10-02T22:28:57.689 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2019-10-02T22:28:57.689 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri6-testing-2019-10-01-1605-nautilus/qa/tasks/cephfs/test_client_recovery.py", line 193, in test_reconnect_eviction
2019-10-02T22:28:57.689 INFO:tasks.cephfs_test_runner:    self.fs.wait_for_state('up:reconnect', reject='up:active', timeout=MDS_RESTART_GRACE)
2019-10-02T22:28:57.690 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri6-testing-2019-10-01-1605-nautilus/qa/tasks/cephfs/filesystem.py", line 1016, in wait_for_state
2019-10-02T22:28:57.690 INFO:tasks.cephfs_test_runner:    raise RuntimeError("MDS in reject state {0}".format(current_state))
2019-10-02T22:28:57.690 INFO:tasks.cephfs_test_runner:RuntimeError: MDS in reject state up:active
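For context, the `wait_for_state` helper in `qa/tasks/cephfs/filesystem.py` polls the MDS state until the goal state appears, and raises as soon as the optional reject state is observed instead. A minimal sketch of that polling pattern (names and signature simplified; the real helper lives in the teuthology CephFS test harness):

```python
import time

def wait_for_state(get_state, goal, reject=None, timeout=30, interval=1):
    # Poll get_state() until it returns `goal`. If the `reject` state is
    # seen first, fail immediately -- this mirrors the RuntimeError in the
    # traceback above. Illustrative sketch, not the real implementation.
    elapsed = 0
    while True:
        current = get_state()
        if current == goal:
            return elapsed
        if reject is not None and current == reject:
            raise RuntimeError("MDS in reject state {0}".format(current))
        if elapsed >= timeout:
            raise RuntimeError("Timed out waiting for {0}".format(goal))
        time.sleep(interval)
        elapsed += interval
```

With `reject='up:active'`, an MDS that goes straight back to up:active (because the supposedly dead client reconnected in time) fails the wait immediately, which is exactly the error reported here.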
Updated by Patrick Donnelly over 4 years ago
- Related to Bug #40999: qa: AssertionError: u'open' != 'stale' added
Updated by Patrick Donnelly over 4 years ago
- Subject changed from nautilus: test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active" to test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active"
- Assignee set to Venky Shankar
- Priority changed from Normal to High
This looks like the same problem as #40999. Can't verify because there are no mds logs. The issue is that the hard reset of the kernel client machine is not immediate. The MDS comes back fast enough (<5 seconds) that the kernel client still has time to reconnect before its node is powered off.
Venky, can you quickly check any other places this race occurs in this test file and correct? The race will also exist on master.
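A sketch of one way to close this kind of race (the actual fix is in PR 30986, which is not reproduced here): after issuing the hard reset, poll until the client is actually unreachable before failing/restarting the MDS, rather than assuming the power-off is instantaneous. All names below are illustrative stand-ins, not the real test-harness API:

```python
import time

def reset_client_and_wait(power_off, client_is_responsive, timeout=60, interval=1):
    # Illustrative only: power_off() and client_is_responsive() stand in for
    # the real teuthology helpers (hard reset of the node, reachability check).
    power_off()
    waited = 0
    while client_is_responsive():
        if waited >= timeout:
            raise RuntimeError("client still responsive after hard reset")
        time.sleep(interval)
        waited += interval
    # Only now is it safe to restart the MDS: the client is confirmed down,
    # so it cannot sneak a reconnect in before its node finishes powering off.
```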
Updated by Patrick Donnelly over 4 years ago
- Target version changed from v14.2.5 to v15.0.0
- Backport set to nautilus,mimic
Updated by Venky Shankar over 4 years ago
Patrick Donnelly wrote:
This looks like the same problem as #40999. Can't verify because there are no mds logs. The issue is that the hard reset of the kernel client machine is not immediate. The MDS comes back fast enough (<5 seconds) that the kernel client still has time to reconnect before its node is powered off.
Venky, can you quickly check any other places this race occurs in this test file and correct? The race will also exist on master.
ACK -- I'll take a look.
Updated by Venky Shankar over 4 years ago
There's one more instance of this in test_reconnect_eviction() -- need to fix that too. I'll push a PR.
Updated by Venky Shankar over 4 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 30986
Updated by Patrick Donnelly over 4 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Nathan Cutler over 4 years ago
- Copied to Backport #42421: mimic: test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active" added
Updated by Nathan Cutler over 4 years ago
- Copied to Backport #42422: nautilus: test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active" added
Updated by Nathan Cutler about 4 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".