Bug #42213
Status: Closed
test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active"
Description
The MDS reached the reject state ("up:active") rather than the expected "up:reconnect":
2019-10-02T22:28:57.689 INFO:tasks.cephfs_test_runner:ERROR: test_reconnect_eviction (tasks.cephfs.test_client_recovery.TestClientRecovery)
2019-10-02T22:28:57.689 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2019-10-02T22:28:57.689 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2019-10-02T22:28:57.689 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri6-testing-2019-10-01-1605-nautilus/qa/tasks/cephfs/test_client_recovery.py", line 193, in test_reconnect_eviction
2019-10-02T22:28:57.689 INFO:tasks.cephfs_test_runner:    self.fs.wait_for_state('up:reconnect', reject='up:active', timeout=MDS_RESTART_GRACE)
2019-10-02T22:28:57.690 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_wip-yuri6-testing-2019-10-01-1605-nautilus/qa/tasks/cephfs/filesystem.py", line 1016, in wait_for_state
2019-10-02T22:28:57.690 INFO:tasks.cephfs_test_runner:    raise RuntimeError("MDS in reject state {0}".format(current_state))
2019-10-02T22:28:57.690 INFO:tasks.cephfs_test_runner:RuntimeError: MDS in reject state up:active
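For context, the `wait_for_state` helper in `qa/tasks/cephfs/filesystem.py` polls the MDS state until the goal state appears, and raises as soon as the optional reject state is observed instead. A minimal sketch of that polling pattern (names and signature simplified; the real helper lives in the teuthology CephFS test harness):

```python
import time

def wait_for_state(get_state, goal, reject=None, timeout=30, interval=1):
    # Poll get_state() until it returns `goal`. If the `reject` state is
    # seen first, fail immediately -- this mirrors the RuntimeError in the
    # traceback above. Illustrative sketch, not the real implementation.
    elapsed = 0
    while True:
        current = get_state()
        if current == goal:
            return elapsed
        if reject is not None and current == reject:
            raise RuntimeError("MDS in reject state {0}".format(current))
        if elapsed >= timeout:
            raise RuntimeError("Timed out waiting for {0}".format(goal))
        time.sleep(interval)
        elapsed += interval
```

With `reject='up:active'`, an MDS that goes straight back to up:active (because the supposedly dead client reconnected in time) fails the wait immediately, which is exactly the error reported here.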
Updated by Patrick Donnelly over 4 years ago
- Related to Bug #40999: qa: AssertionError: u'open' != 'stale' added
Updated by Patrick Donnelly over 4 years ago
- Subject changed from nautilus: test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active" to test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active"
- Assignee set to Venky Shankar
- Priority changed from Normal to High
This looks like the same problem as #40999. Can't verify because there are no mds logs. The issue is that the hard reset of the kernel client machine is not immediate. The MDS comes back fast enough (<5 seconds) that the kernel client still has time to reconnect before its node is powered off.
Venky, can you quickly check any other places this race occurs in this test file and correct? The race will also exist on master.
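A sketch of one way to close this kind of race (the actual fix is in PR 30986, which is not reproduced here): after issuing the hard reset, poll until the client is actually unreachable before failing/restarting the MDS, rather than assuming the power-off is instantaneous. All names below are illustrative stand-ins, not the real test-harness API:

```python
import time

def reset_client_and_wait(power_off, client_is_responsive, timeout=60, interval=1):
    # Illustrative only: power_off() and client_is_responsive() stand in for
    # the real teuthology helpers (hard reset of the node, reachability check).
    power_off()
    waited = 0
    while client_is_responsive():
        if waited >= timeout:
            raise RuntimeError("client still responsive after hard reset")
        time.sleep(interval)
        waited += interval
    # Only now is it safe to restart the MDS: the client is confirmed down,
    # so it cannot sneak a reconnect in before its node finishes powering off.
```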
Updated by Patrick Donnelly over 4 years ago
- Target version changed from v14.2.5 to v15.0.0
- Backport set to nautilus,mimic
Updated by Venky Shankar over 4 years ago
Patrick Donnelly wrote:
This looks like the same problem as #40999. Can't verify because there are no mds logs. The issue is that the hard reset of the kernel client machine is not immediate. The MDS comes back fast enough (<5 seconds) that the kernel client still has time to reconnect before its node is powered off.
Venky, can you quickly check any other places this race occurs in this test file and correct? The race will also exist on master.
ACK -- I'll take a look.
Updated by Venky Shankar over 4 years ago
There's one more instance of this in test_reconnect_eviction() -- need to fix that too. I'll push a PR.
Updated by Venky Shankar over 4 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 30986
Updated by Patrick Donnelly over 4 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Nathan Cutler over 4 years ago
- Copied to Backport #42421: mimic: test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active" added
Updated by Nathan Cutler over 4 years ago
- Copied to Backport #42422: nautilus: test_reconnect_eviction fails with "RuntimeError: MDS in reject state up:active" added
Updated by Nathan Cutler about 4 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".