Bug #50821
openqa: untar_snap_rm failure during mds thrashing
Description
2021-05-14T22:51:46.078 INFO:tasks.workunit.client.0.smithi094.stderr:tar: linux-2.6.33/arch/microblaze: Cannot stat: Permission denied
2021-05-14T22:51:46.078 INFO:tasks.workunit.client.0.smithi094.stderr:tar: linux-2.6.33/arch: Cannot stat: Permission denied
2021-05-14T22:51:46.078 INFO:tasks.workunit.client.0.smithi094.stderr:tar: linux-2.6.33: Cannot stat: Permission denied
2021-05-14T22:51:46.078 INFO:tasks.workunit.client.0.smithi094.stderr:tar: Error is not recoverable: exiting now
2021-05-14T22:51:46.079 DEBUG:teuthology.orchestra.run:got remote process result: 2
2021-05-14T22:51:46.080 INFO:tasks.workunit:Stopping ['fs/snaps'] on client.0...
2021-05-14T22:51:46.080 DEBUG:teuthology.orchestra.run.smithi094:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2021-05-14T22:51:46.264 ERROR:teuthology.run_tasks:Saw exception from tasks.
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_19220a3bd6e252c6e8260827019668a766d85490/teuthology/run_tasks.py", line 91, in run_tasks
    manager = run_one_task(taskname, ctx=ctx, config=config)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_19220a3bd6e252c6e8260827019668a766d85490/teuthology/run_tasks.py", line 70, in run_one_task
    return task(**kwargs)
  File "/home/teuthworker/src/github.com_batrick_ceph_e78e41c7f45263bfc3d22dafa953b7e485aac84d/qa/tasks/workunit.py", line 147, in task
    cleanup=cleanup)
  File "/home/teuthworker/src/github.com_batrick_ceph_e78e41c7f45263bfc3d22dafa953b7e485aac84d/qa/tasks/workunit.py", line 297, in _spawn_on_all_clients
    timeout=timeout)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_19220a3bd6e252c6e8260827019668a766d85490/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_19220a3bd6e252c6e8260827019668a766d85490/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_19220a3bd6e252c6e8260827019668a766d85490/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_19220a3bd6e252c6e8260827019668a766d85490/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
  File "/home/teuthworker/src/github.com_batrick_ceph_e78e41c7f45263bfc3d22dafa953b7e485aac84d/qa/tasks/workunit.py", line 425, in _run_tests
    label="workunit test {workunit}".format(workunit=workunit)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_19220a3bd6e252c6e8260827019668a766d85490/teuthology/orchestra/remote.py", line 509, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_19220a3bd6e252c6e8260827019668a766d85490/teuthology/orchestra/run.py", line 455, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_19220a3bd6e252c6e8260827019668a766d85490/teuthology/orchestra/run.py", line 161, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_19220a3bd6e252c6e8260827019668a766d85490/teuthology/orchestra/run.py", line 183, in _raise_for_status
    node=self.hostname, label=self.label
teuthology.exceptions.CommandFailedError: Command failed (workunit test fs/snaps/untar_snap_rm.sh) on smithi094 with status 2: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=e78e41c7f45263bfc3d22dafa953b7e485aac84d TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/fs/snaps/untar_snap_rm.sh'
From: /ceph/teuthology-archive/pdonnell-2021-05-14_21:45:42-fs-master-distro-basic-smithi/6115751/teuthology.log
This was with the stock RHEL kernel. It might be related to some other issues I've suddenly been seeing with the stock RHEL kernel.
Updated by Patrick Donnelly almost 3 years ago
I don't think this is related to #50281, but it may be.
Updated by Patrick Donnelly almost 3 years ago
- Related to Bug #50823: qa: RuntimeError: timeout waiting for cluster to stabilize added
Updated by Patrick Donnelly almost 3 years ago
- Related to Bug #50824: qa: snaptest-git-ceph bus error added
Updated by Patrick Donnelly almost 3 years ago
- Related to Bug #51278: mds: "FAILED ceph_assert(!segments.empty())" added
Updated by Venky Shankar about 2 years ago
Similar failure here: https://pulpito.ceph.com/vshankar-2022-04-11_12:24:06-fs-wip-vshankar-testing1-20220411-144044-testing-default-smithi/6786336/
although in this instance, we see ESTALE/EIO.
2022-04-11T15:56:23.599 INFO:teuthology.orchestra.run.smithi141.stderr:2022-04-11T15:56:23.590+0000 7f3cba9ff700 1 -- 172.21.15.141:0/3624046670 --> [v2:172.21.15.153:6808/205989,v1:172.21.15.153:6809/205989] -- command(tid 11: {"prefix": "get_command_descriptions"}) v1 -- 0x7f3c90018dc0 con 0x7f3c90011730
2022-04-11T15:56:23.599 INFO:teuthology.orchestra.run.smithi141.stderr:2022-04-11T15:56:23.590+0000 7f3cb37fe700 1 --2- 172.21.15.141:0/3624046670 >> [v2:172.21.15.153:6808/205989,v1:172.21.15.153:6809/205989] conn(0x7f3c90011730 0x7f3c90011b60 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0)._handle_peer_banner_payload supported=3 required=0
2022-04-11T15:56:23.628 INFO:tasks.ceph.osd.7.smithi153.stderr:2022-04-11T15:56:23.619+0000 7f22a0340700 -1 received signal: Hangup from /usr/bin/python3 /bin/daemon-helper kill ceph-osd -f --cluster ceph -i 7 (PID: 27672) UID: 0
2022-04-11T15:56:23.644 INFO:tasks.workunit.client.0.smithi141.stdout:'.snap/k' -> './k'
2022-04-11T15:56:23.644 INFO:tasks.workunit.client.0.smithi141.stdout:'.snap/k/linux-2.6.33.tar.bz2' -> './k/linux-2.6.33.tar.bz2'
2022-04-11T15:56:23.645 INFO:tasks.workunit.client.0.smithi141.stderr:cp: error writing './k/linux-2.6.33.tar.bz2': Stale file handle
2022-04-11T15:56:23.645 INFO:teuthology.orchestra.run.smithi141.stderr:umount: /home/ubuntu/cephtest/mnt.0: target is busy.
2022-04-11T15:56:23.646 INFO:tasks.workunit.client.0.smithi141.stderr:cp: cannot stat '.snap/k/linux-2.6.33': Input/output error
2022-04-11T15:56:23.646 INFO:tasks.workunit.client.0.smithi141.stderr:cp: preserving times for './k': Input/output error
2022-04-11T15:56:23.647 INFO:teuthology.orchestra.run.smithi141.stderr:2022-04-11T15:56:23.639+0000 7f3cb37fe700 1 --2- 172.21.15.141:0/3624046670 >> [v2:172.21.15.153:6808/205989,v1:172.21.15.153:6809/205989] conn(0x7f3c90011730 0x7f3c90011b60 crc :-1 s=READY pgs=222 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).ready entity=osd.5 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
2022-04-11T15:56:23.647 DEBUG:teuthology.orchestra.run:got remote process result: 1
2022-04-11T15:56:23.648 INFO:tasks.workunit:Stopping ['fs/snaps'] on client.0...
2022-04-11T15:56:23.648 DEBUG:teuthology.orchestra.run.smithi141:> sudo rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
2022-04-11T15:56:23.658 DEBUG:teuthology.orchestra.run:got remote process result: 32
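As a quick aside for triage (not part of the test logs): the two error strings in the excerpt map to the Linux errno values ESTALE and EIO. ESTALE means the file handle no longer refers to a live inode, while EIO is the generic error the client falls back to once the mount/session has been torn down. A small Python check of the values:

```python
import errno
import os

# The two errno values behind the failures above (Linux values):
#   EIO    = 5    -> "cp: cannot stat ...: Input/output error"
#   ESTALE = 116  -> "cp: error writing ...: Stale file handle"
for code in (errno.EIO, errno.ESTALE):
    # errno.errorcode maps the numeric value back to its symbolic name.
    print(errno.errorcode[code], code, os.strerror(code))
```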
Updated by Venky Shankar 9 months ago
This popped up again with centos 9.stream, but I don't think it has anything to do with the distro. ref: /a/yuriw-2023-07-26_14:28:57-fs-wip-yuri-testing-2023-07-25-0833-reef-distro-default-smithi/7353025
The failures are the usual -EIO errno:
2023-07-26T21:51:59.533 INFO:tasks.workunit.client.0.smithi043.stdout:'.snap/k/linux-2.6.33/drivers/isdn/mISDN/dsp_dtmf.c' -> './k/linux-2.6.33/drivers/isdn/mISDN/dsp_dtmf.c'
2023-07-26T21:51:59.534 INFO:tasks.workunit.client.0.smithi043.stdout:'.snap/k/linux-2.6.33/drivers/isdn/mISDN/dsp_ecdis.h' -> './k/linux-2.6.33/drivers/isdn/mISDN/dsp_ecdis.h'
2023-07-26T21:51:59.534 INFO:tasks.workunit.client.0.smithi043.stdout:'.snap/k/linux-2.6.33/drivers/isdn/mISDN/dsp_hwec.c' -> './k/linux-2.6.33/drivers/isdn/mISDN/dsp_hwec.c'
2023-07-26T21:51:59.534 DEBUG:teuthology.orchestra.run:got remote process result: 1
2023-07-26T21:51:59.535 INFO:tasks.workunit.client.0.smithi043.stderr:cp: cannot stat '.snap/k/linux-2.6.33/drivers/isdn/mISDN/dsp_hwec.h': Input/output error
2023-07-26T21:51:59.535 INFO:tasks.workunit.client.0.smithi043.stderr:cp: cannot stat '.snap/k/linux-2.6.33/drivers/isdn/mISDN/dsp_pipeline.c': Input/output error
2023-07-26T21:51:59.535 INFO:tasks.workunit.client.0.smithi043.stderr:cp: cannot stat '.snap/k/linux-2.6.33/drivers/isdn/mISDN/dsp_tones.c': Input/output error
2023-07-26T21:51:59.535 INFO:tasks.workunit.client.0.smithi043.stderr:cp: cannot stat '.snap/k/linux-2.6.33/drivers/isdn/mISDN/fsm.c': Input/output error
2023-07-26T21:51:59.535 INFO:tasks.workunit.client.0.smithi043.stderr:cp: cannot stat '.snap/k/linux-2.6.33/drivers/isdn/mISDN/fsm.h': Input/output error
No MDS core dumps and/or anything in the kernel ring buffer.
Updated by Venky Shankar about 2 months ago
- Category set to Correctness/Safety
- Assignee set to Xiubo Li
- Target version set to v20.0.0
This is the latest instance - https://pulpito.ceph.com/pdonnell-2024-03-20_18:16:52-fs-wip-batrick-testing-20240320.145742-distro-default-smithi/7612983/
Nothing in the kernel ring buffer.
Updated by Venky Shankar about 2 months ago
- Related to Bug #64707: suites/fsstress.sh hangs on one client - test times out added
Updated by Patrick Donnelly 13 days ago
Apr 21 02:55:25 smithi043 kernel: ceph: dropping unsafe request 381041
Apr 21 02:55:25 smithi043 kernel: ceph: dropping unsafe request 381043
Apr 21 02:55:25 smithi043 kernel: ceph: dropping unsafe request 381045
Apr 21 02:55:25 smithi043 kernel: ceph: dropping unsafe request 381047
Apr 21 02:55:25 smithi043 kernel: ceph: dropping unsafe request 381049
Apr 21 02:55:25 smithi043 kernel: ceph: dropping unsafe request 381051
Apr 21 02:55:25 smithi043 kernel: ceph: dropping unsafe request 381053
Apr 21 02:55:25 smithi043 kernel: ceph: dropping unsafe request 381055
Apr 21 02:55:25 smithi043 kernel: ceph: dropping unsafe request 381057
Apr 21 02:55:25 smithi043 kernel: ceph: ceph_do_invalidate_pages: inode 1000000b5c4.fffffffffffffffe is shut down
Apr 21 02:55:25 smithi043 kernel: ceph: ceph_do_invalidate_pages: inode 1000000bc24.fffffffffffffffe is shut down
Apr 21 02:55:25 smithi043 kernel: ceph: ceph_do_invalidate_pages: inode 1000000bd57.fffffffffffffffe is shut down
Apr 21 02:55:25 smithi043 kernel: ceph: ceph_do_invalidate_pages: inode 1000000c4dd.fffffffffffffffe is shut down
Apr 21 02:55:25 smithi043 kernel: ceph: ceph_do_invalidate_pages: inode 1000000c4e7.fffffffffffffffe is shut down
Apr 21 02:55:25 smithi043 kernel: ceph: ceph_do_invalidate_pages: inode 1000000c4e6.fffffffffffffffe is shut down
Apr 21 02:55:26 smithi043 sudo[71057]: ubuntu : PWD=/home/ubuntu ; USER=root ; COMMAND=/bin/rm -rf -- /home/ubuntu/cephtest/workunits.list.client.0 /home/ubuntu/cephtest/clone.client.0
Apr 21 02:55:26 smithi043 sudo[71057]: pam_unix(sudo:session): session opened for user root(uid=0) by ubuntu(uid=1000)
Apr 21 02:55:26 smithi043 sudo[71089]: ubuntu : PWD=/home/ubuntu ; USER=root ; ENV=PATH=/usr/sbin:/home/ubuntu/.local/bin:/home/ubuntu/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin ; COMMAND=/bin/lsof
Apr 21 02:55:26 smithi043 sudo[71089]: pam_unix(sudo:session): session opened for user root(uid=0) by ubuntu(uid=1000)
Apr 21 02:55:26 smithi043 sudo[71057]: pam_unix(sudo:session): session closed for user root
Apr 21 02:55:27 smithi043 kernel: libceph: mds0 (1)172.21.15.73:6839 socket closed (con state V1_BANNER)
From: /teuthology/pdonnell-2024-04-20_23:33:17-fs-wip-pdonnell-testing-20240420.180737-debug-distro-default-smithi/7665863/remote/smithi043/syslog/journalctl-b0.gz
Updated by Xiubo Li 8 days ago
Patrick Donnelly wrote in #note-10:
[...]
From: /teuthology/pdonnell-2024-04-20_23:33:17-fs-wip-pdonnell-testing-20240420.180737-debug-distro-default-smithi/7665863/remote/smithi043/syslog/journalctl-b0.gz
Client client.4607 closed its session at 2024-04-21T02:55:25.991:
2024-04-21T02:55:25.773+0000 7f1135472640 10 mds.2.log trim 2 / 128 segments, 8 / -1 events, 0 (0) expiring, 0 (0) expired
2024-04-21T02:55:25.773+0000 7f1135472640 10 mds.2.log trim: new_expiring_segments=0, num_remaining_segments=2, max_segments=128
2024-04-21T02:55:25.773+0000 7f1135472640 10 mds.2.log trim: breaking out of trim loop - segments/events fell below ceiling max_segments/max_ev
2024-04-21T02:55:25.773+0000 7f1135472640 20 mds.2.log _trim_expired_segments: examining LogSegment(1/0x400000 events=1)
2024-04-21T02:55:25.773+0000 7f1135472640 10 mds.2.log _trim_expired_segments waiting for expiry LogSegment(1/0x400000 events=1)
2024-04-21T02:55:25.991+0000 7f1139c7b640 1 -- [v2:172.21.15.43:6832/3227946235,v1:172.21.15.43:6834/3227946235] <== client.4607 v1:172.21.15.43:0/3153587376 12 ==== client_session(request_close seq 2) ==== 28+0+0 (unknown 95654502 0 0) 0x556cf7d41200 con 0x556cf7f8bb00
2024-04-21T02:55:25.991+0000 7f1139c7b640 20 mds.2.177 get_session have 0x556cf7f5e800 client.4607 v1:172.21.15.43:0/3153587376 state open
2024-04-21T02:55:25.991+0000 7f1139c7b640 3 mds.2.server handle_client_session client_session(request_close seq 2) from client.4607
2024-04-21T02:55:25.991+0000 7f1139c7b640 10 mds.2.server journal_close_session : client.4607 v1:172.21.15.43:0/3153587376 pending_prealloc_inos [] free_prealloc_inos [] delegated_inos []
2024-04-21T02:55:25.991+0000 7f1139c7b640 20 mds.2.sessionmap mark_projected s=0x556cf7f5e800 name=client.4607 pv=6 -> 7
2024-04-21T02:55:25.991+0000 7f1139c7b640 20 mds.2.log _submit_entry ESession client.4607 v1:172.21.15.43:0/3153587376 close cmapv 7
2024-04-21T02:55:25.991+0000 7f113246c640 5 mds.2.log _submit_thread 4195728~121 : ESession client.4607 v1:172.21.15.43:0/3153587376 close cmapv 7
2024-04-21T02:55:25.991+0000 7f113246c640 1 -- [v2:172.21.15.43:6832/3227946235,v1:172.21.15.43:6834/3227946235] --> [v2:172.21.15.73:6800/1594527196,v1:172.21.15.73:6801/1594527196] -- osd_op(unknown.0.177:43 2.11 2:8975f766:::202.00000001:head [write 1424~141 [fadvise_dontneed] in=141b] snapc 0=[] ondisk+write+known_if_redirected+full_force+supports_pool_eio e132) -- 0x556cf7f57800 con 0x556cf7db7680
2024-04-21T02:55:25.991+0000 7f113246c640 1 -- [v2:172.21.15.43:6832/3227946235,v1:172.21.15.43:6834/3227946235] --> [v2:172.21.15.43:6808/2211093047,v1:172.21.15.43:6810/2211093047] -- osd_op(unknown.0.177:44 2.1 2:85bbe569:::202.00000000:head [writefull 0~90 [fadvise_dontneed] in=90b] snapc 0=[] ondisk+write+known_if_redirected+full_force+supports_pool_eio e132) -- 0x556cf80ccc00 con 0x556cf7db7f80
2024-04-21T02:55:25.995+0000 7f113c480640 1 -- [v2:172.21.15.43:6832/3227946235,v1:172.21.15.43:6834/3227946235] <== osd.2 v2:172.21.15.43:6808/2211093047 13 ==== osd_op_reply(44 202.00000000 [writefull 0~90 [fadvise_dontneed]] v132'2411 uv2411 ondisk = 0) ==== 156+0+0 (crc 0 0 0) 0x556cf708ca00 con 0x556cf7db7f80
2024-04-21T02:55:25.995+0000 7f113cc81640 1 -- [v2:172.21.15.43:6832/3227946235,v1:172.21.15.43:6834/3227946235] <== osd.6 v2:172.21.15.73:6800/1594527196 9 ==== osd_op_reply(43 202.00000001 [write 1424~141 [fadvise_dontneed]] v132'1035 uv1035 ondisk = 0) ==== 156+0+0 (crc 0 0 0) 0x556cf708c280 con 0x556cf7db7680
2024-04-21T02:55:25.995+0000 7f113346e640 10 MDSIOContextBase::complete: 20C_MDS_session_finish
2024-04-21T02:55:25.995+0000 7f113346e640 10 MDSContext::complete: 20C_MDS_session_finish
2024-04-21T02:55:25.995+0000 7f113346e640 10 mds.2.server _session_logged client.4607 v1:172.21.15.43:0/3153587376 state_seq 2 close 7 inos_to_free [] inotablev 0 inos_to_purge []
2024-04-21T02:55:25.995+0000 7f113346e640 20 mds.2.sessionmap mark_dirty s=0x556cf7f5e800 name=client.4607 v=6
2024-04-21T02:55:25.995+0000 7f113346e640 10 mds.2.177 send_message_client client.4607 v1:172.21.15.43:0/3153587376 client_session(close)
2024-04-21T02:55:25.995+0000 7f113346e640 1 -- [v2:172.21.15.43:6832/3227946235,v1:172.21.15.43:6834/3227946235] --> v1:172.21.15.43:0/3153587376 -- client_session(close) -- 0x556cf8156000 con 0x556cf7f8bb00
2024-04-21T02:55:25.995+0000 7f113346e640 10 remove_session: mds.metrics: session=0x556cf7f5e800, client=client.4607 v1:172.21.15.43:0/3153587376
Meanwhile, the unmount appears to have been issued just before the files were to be removed:
2024-04-21T02:55:25.942 DEBUG:tasks.cephfs.kernel_mount:Unmounting client client.0...
2024-04-21T02:55:25.942 INFO:teuthology.orchestra.run:Running command with timeout 300
2024-04-21T02:55:25.942 DEBUG:teuthology.orchestra.run.smithi043:> sudo umount /home/ubuntu/cephtest/mnt.0 -f
2024-04-21T02:55:25.991 INFO:tasks.workunit.client.0.smithi043.stdout:removed 'linux-2.6.33/arch/arm/mach-pnx4008/include/mach/platform.h'
2024-04-21T02:55:25.991 INFO:tasks.workunit.client.0.smithi043.stdout:removed 'linux-2.6.33/arch/arm/mach-pnx4008/include/mach/irqs.h'
2024-04-21T02:55:25.991 INFO:tasks.workunit.client.0.smithi043.stdout:removed 'linux-2.6.33/arch/arm/mach-pnx4008/include/mach/param.h'
2024-04-21T02:55:25.991 INFO:tasks.workunit.client.0.smithi043.stdout:removed 'linux-2.6.33/arch/arm/mach-pnx4008/include/mach/i2c.h'
2024-04-21T02:55:25.991 INFO:tasks.workunit.client.0.smithi043.stdout:removed 'linux-2.6.33/arch/arm/mach-pnx4008/include/mach/entry-macro.S'
2024-04-21T02:55:25.991 INFO:tasks.workunit.client.0.smithi043.stdout:removed 'linux-2.6.33/arch/arm/mach-pnx4008/include/mach/memory.h'
2024-04-21T02:55:25.991 INFO:tasks.workunit.client.0.smithi043.stdout:removed 'linux-2.6.33/arch/arm/mach-pnx4008/include/mach/system.h'
2024-04-21T02:55:25.991 INFO:tasks.workunit.client.0.smithi043.stdout:removed 'linux-2.6.33/arch/arm/mach-pnx4008/include/mach/timex.h'
2024-04-21T02:55:25.991 INFO:tasks.workunit.client.0.smithi043.stdout:removed 'linux-2.6.33/arch/arm/mach-pnx4008/include/mach/io.h'
2024-04-21T02:55:25.991 INFO:tasks.workunit.client.0.smithi043.stdout:removed 'linux-2.6.33/arch/arm/mach-pnx4008/include/mach/vmalloc.h'
2024-04-21T02:55:25.992 INFO:tasks.workunit.client.0.smithi043.stderr:rm: cannot remove 'linux-2.6.33/arch/arm/mach-pnx4008/include/mach': Input/output error
2024-04-21T02:55:25.992 INFO:teuthology.orchestra.run.smithi043.stderr:umount: /home/ubuntu/cephtest/mnt.0: target is busy.
2024-04-21T02:55:25.994 DEBUG:teuthology.orchestra.run:got remote process result: 1
2024-04-21T02:55:25.995 INFO:tasks.workunit.client.0.smithi043.stderr:rm: cannot remove 'linux-2.6.33/arch/arm/mach-pnx4008/Makefile.boot': Input/output error
2024-04-21T02:55:25.995 INFO:tasks.workunit.client.0.smithi043.stderr:rm: cannot remove 'linux-2.6.33/arch/arm/mach-pnx4008/Makefile': Input/output error
2024-04-21T02:55:25.995 INFO:tasks.workunit.client.0.smithi043.stderr:rm: cannot remove 'linux-2.6.33/arch/arm/mach-pnx4008/gpio.c': Input/output error
2024-04-21T02:55:25.995 INFO:tasks.workunit.client.0.smithi043.stderr:rm: cannot remove 'linux-2.6.33/arch/arm/mach-pnx4008/serial.c': Input/output error
2024-04-21T02:55:25.995 INFO:tasks.workunit.client.0.smithi043.stderr:rm: cannot remove 'linux-2.6.33/arch/arm/mach-pnx4008/dma.c': Input/output error
2024-04-21T02:55:25.995 INFO:tasks.workunit.client.0.smithi043.stderr:rm: cannot remove 'linux-2.6.33/arch/arm/mach-pnx4008/i2c.c': Input/output error
2024-04-21T02:55:25.995 INFO:tasks.workunit.client.0.smithi043.stderr:rm: cannot remove 'linux-2.6.33/arch/arm/mach-pnx4008/core.c': Input/output error
2024-04-21T02:55:25.995 INFO:tasks.workunit.client.0.smithi043.stderr:rm: cannot remove 'linux-2.6.33/arch/arm/mach-pnx4008/time.c': Input/output error
2024-04-21T02:55:25.995 INFO:tasks.workunit.client.0.smithi043.stderr:rm: cannot remove 'linux-2.6.33/arch/arm/mach-ebsa110': Input/output error
2024-04-21T02:55:25.995 INFO:tasks.workunit.client.0.smithi043.stderr:rm: cannot remove 'linux-2.6.33/arch/arm/plat-stmp3xxx': Input/output error
2024-04-21T02:55:25.995 INFO:tasks.workunit.client.0.smithi043.stderr:rm: cannot remove 'linux-2.6.33/arch/arm/mach-davinci': Input/output error
2024-04-21T02:55:25.996 INFO:tasks.workunit.client.0.smithi043.stderr:rm: cannot remove 'linux-2.6.33/arch/arm/mach-s3c6400': Input/output error
2024-04-21T02:55:25.996 INFO:tasks.workunit.client.0.smithi043.stderr:rm: cannot remove 'linux-2.6.33/arch/arm/mach-at91': Input/output error
2024-04-21T02:55:25.996 INFO:tasks.workunit.client.0.smithi043.stderr:rm: cannot remove 'linux-2.6.33/arch/arm/boot': Input/output error
2024-04-21T02:55:25.996 INFO:tasks.workunit.client.0.smithi043.stderr:rm: cannot remove 'linux-2.6.33/arch/arm/mach-s3c2410': Input/output error
2024-04-21T02:55:25.996 INFO:tasks.workunit.client.0.smithi043.stderr:rm: cannot remove 'linux-2.6.33/arch/arm/mach-mx25': Input/output error
2024-04-21T02:55:25.996 INFO:tasks.workunit.client.0.smithi043.stderr:rm: cannot remove 'linux-2.6.33/arch/arm/plat-omap': Input/output error
2024-04-21T02:55:25.996 INFO:tasks.workunit.client.0.smithi043.stderr:rm: cannot remove 'linux-2.6.33/arch/arm/plat-orion': Input/output error
2024-04-21T02:55:25.996 INFO:tasks.workunit.client.0.smithi043.stderr:rm: cannot remove 'linux-2.6.33/arch/arm/mach-clps711x': Input/output error
That means some files were still being removed just after the mountpoint had been unmounted.
Updated by Xiubo Li 8 days ago
It seems the mds.b daemon wasn't brought back up within 300s, so the watchdog barked, killed all the daemons, and unmounted all the mountpoints while the test was still running:
2024-04-21T02:55:23.499 INFO:tasks.mds_thrash.fs.[cephfs]:no change
2024-04-21T02:55:24.142 INFO:tasks.daemonwatchdog.daemon_watchdog:daemon ceph.mds.b is failed for ~304s
2024-04-21T02:55:24.142 INFO:tasks.daemonwatchdog.daemon_watchdog:BARK! unmounting mounts and killing all daemons
2024-04-21T02:55:24.142 DEBUG:teuthology.orchestra.run.smithi043:> set -ex
2024-04-21T02:55:24.142 DEBUG:teuthology.orchestra.run.smithi043:> dd if=/proc/self/mounts of=/dev/stdout
2024-04-21T02:55:24.172 DEBUG:teuthology.orchestra.run.smithi043:> set -ex
2024-04-21T02:55:24.173 DEBUG:teuthology.orchestra.run.smithi043:> dd if=/proc/self/mounts of=/dev/stdout
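The watchdog behavior in that snippet (grace period elapses, then "BARK!" and teardown) can be sketched as follows. This is an illustrative model only, not teuthology's actual DaemonWatchdog class; the 300 s grace period and the daemon name are taken from the log above:

```python
class DaemonWatchdog:
    """Illustrative sketch -- not teuthology's real DaemonWatchdog.

    A daemon that stays "failed" longer than the grace period makes the
    watchdog bark, after which the harness unmounts all mounts and kills
    all daemons.
    """

    def __init__(self, grace=300.0):
        self.grace = grace
        self.failed_since = {}  # daemon name -> timestamp it entered "failed"

    def mark_failed(self, name, now):
        # Record only the first time we saw the daemon fail.
        self.failed_since.setdefault(name, now)

    def mark_ok(self, name):
        self.failed_since.pop(name, None)

    def barking(self, now):
        # Daemons failed for longer than the grace period trigger the bark.
        return [n for n, t in self.failed_since.items() if now - t > self.grace]

wd = DaemonWatchdog(grace=300.0)
wd.mark_failed("ceph.mds.b", now=0.0)
print(wd.barking(now=304.0))  # -> ['ceph.mds.b'], i.e. "failed for ~304s ... BARK!"
```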
Updated by Xiubo Li 8 days ago
Okay, it finally turns out that mds.b crashed, which is why it wasn't brought back up:
-7> 2024-04-21T02:50:17.623+0000 7f32c63a7640 10 mds.4.cache got inode locks [inode 0x608 [...301,head] ~mds0/stray8/ rep@0.2 fragtree_t(*^3) v146838 f(v4 m2024-04-21T02:49:06.802174+0000 303=189+114) n(v8 rc2024-04-21T02:49:06.802174+0000 115=0+115) old_inodes=5 (inest mix r) (ifile mix) 0x55cb01dc0c00]
-6> 2024-04-21T02:50:17.623+0000 7f32c63a7640 10 mds.4.cache got inode locks [inode 0x604 [...301,head] ~mds0/stray4/ rep@0.2 v147574 f(v4 m2024-04-21T02:49:25.648924+0000 225=153+72) n(v9 rc2024-04-21T02:49:25.648924+0000 73=0+73) old_inodes=11 (inest mix r) (ifile mix) 0x55cb01f7e580]
-5> 2024-04-21T02:50:17.623+0000 7f32c63a7640 10 mds.4.cache got inode locks [inode 0x10000000000 [...301,head] /client.0/ rep@0.2 v3184 f(v0 m2024-04-21T02:40:21.081153+0000 1=0+1) n(v276 rc2024-04-21T02:49:28.824882+0000 b216805083 rs1 18657=17439+1218) old_inodes=226 (inest mix r) | dirfrag=1 0x55cb01eac580]
-4> 2024-04-21T02:50:17.623+0000 7f32c63a7640 10 mds.4.cache got inode locks [inode 0x1000000a2ed [...301,head] /client.0/tmp/ rep@0.2 v6175 snaprealm=0x55cb01e69440 f(v0 m2024-04-21T02:42:45.574234+0000 2=1+1) n(v35 rc2024-04-21T02:49:28.824882+0000 b216805083 rs1 18656=17439+1217) old_inodes=2 (inest mix r) 0x55cb01eac000]
-3> 2024-04-21T02:50:17.623+0000 7f32c63a7640 10 mds.4.cache got inode locks [inode 0x1 [...301,head] / rep@0.2 v1162 snaprealm=0x55cb01e686c0 f(v0 m2024-04-21T01:57:20.110195+0000 1=0+1) n(v119 rc2024-04-21T02:49:28.467887+0000 b217004861 rs1 18688=17466+1222)/n(v0 rc2024-04-21T01:56:41.512966+0000 1=0+1) old_inodes=168 (inest mix r) | dirfrag=1 discoverbase=0 0x55cb01e70c00]
-2> 2024-04-21T02:50:17.623+0000 7f32c63a7640 10 mds.4.cache got inode locks [inode 0x100 [...301,head] ~mds0/ rep@0.2 v1051 snaprealm=0x55cb01e68240 f(v0 10=0+10) n(v193 rc2024-04-21T02:49:30.705857+0000 b4699681 1138=339+799)/n(v0 rc2024-04-21T01:56:41.514395+0000 11=0+11) old_inodes=136 (inest mix r) | dirfrag=1 discoverbase=0 0x55cb01dc0680]
-1> 2024-04-21T02:50:17.624+0000 7f32c63a7640 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.0.0-3201-g0b60fd01/rpm/el9/BUILD/ceph-19.0.0-3201-g0b60fd01/src/mds/MDCache.cc: In function 'void MDCache::handle_cache_rejoin_ack(ceph::cref_t<MMDSCacheRejoin>&)' thread 7f32c63a7640 time 2024-04-21T02:50:17.624573+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.0.0-3201-g0b60fd01/rpm/el9/BUILD/ceph-19.0.0-3201-g0b60fd01/src/mds/MDCache.cc: 5158: FAILED ceph_assert(isolated_inodes.empty())
ceph version 19.0.0-3201-g0b60fd01 (0b60fd01511511bc020e1a45638ede6ead9e38ec) squid (dev)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x127) [0x7f32cc04aa17]
2: /usr/lib64/ceph/libceph-common.so.2(+0x24ac24) [0x7f32cc04ac24]
3: (MDCache::handle_cache_rejoin_ack(boost::intrusive_ptr<MMDSCacheRejoin const> const&)+0x232f) [0x55cafbf9e7a3]
4: (MDCache::handle_cache_rejoin(boost::intrusive_ptr<MMDSCacheRejoin const> const&)+0x395) [0x55cafbf95f75]
5: (MDCache::dispatch(boost::intrusive_ptr<Message const> const&)+0xec) [0x55cafbfbd400]
6: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0x11d) [0x55cafbe09b8b]
7: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x28f) [0x55cafbe0e975]
8: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x94) [0x55cafbe0f29c]
9: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x2c4) [0x55cafbdf14aa]
10: /usr/lib64/ceph/libceph-common.so.2(+0x394187) [0x7f32cc194187]
11: (DispatchQueue::entry()+0x837) [0x7f32cc194cf1]
12: /usr/lib64/ceph/libceph-common.so.2(+0x474271) [0x7f32cc274271]
13: (Thread::entry_wrapper()+0x43) [0x7f32cc0247f5]
14: (Thread::_entry_func(void*)+0xd) [0x7f32cc024811]
15: /lib64/libc.so.6(+0x89c02) [0x7f32cb689c02]
16: /lib64/libc.so.6(+0x10ec40) [0x7f32cb70ec40]
0> 2024-04-21T02:50:17.625+0000 7f32c63a7640 -1 *** Caught signal (Aborted) ** in thread 7f32c63a7640 thread_name:ms_dispatch
ceph version 19.0.0-3201-g0b60fd01 (0b60fd01511511bc020e1a45638ede6ead9e38ec) squid (dev)
1: /lib64/libc.so.6(+0x3e6f0) [0x7f32cb63e6f0]
2: /lib64/libc.so.6(+0x8b94c) [0x7f32cb68b94c]
3: raise()
4: abort()
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x262) [0x7f32cc04ab52]
6: /usr/lib64/ceph/libceph-common.so.2(+0x24ac24) [0x7f32cc04ac24]
7: (MDCache::handle_cache_rejoin_ack(boost::intrusive_ptr<MMDSCacheRejoin const> const&)+0x232f) [0x55cafbf9e7a3]
8: (MDCache::handle_cache_rejoin(boost::intrusive_ptr<MMDSCacheRejoin const> const&)+0x395) [0x55cafbf95f75]
9: (MDCache::dispatch(boost::intrusive_ptr<Message const> const&)+0xec) [0x55cafbfbd400]
10: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0x11d) [0x55cafbe09b8b]
11: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x28f) [0x55cafbe0e975]
12: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x94) [0x55cafbe0f29c]
Updated by Xiubo Li 8 days ago
This is the same issue as https://tracker.ceph.com/issues/62036, which has already been fixed, yet it has hit again. Some other case, not the `subtree` one, must be able to trigger it.
I need to dig further to see what happened.
Updated by Xiubo Li 7 days ago
Xiubo Li wrote in #note-14:
This is the same issue as https://tracker.ceph.com/issues/62036, which has already been fixed, yet it has hit again. Some other case, not the `subtree` one, must be able to trigger it.
I need to dig further to see what happened.
This time, the rejoin had already completed successfully just before the crash:
2024-04-21T02:41:46.654+0000 7f32c63a7640 7 mds.4.cache handle_cache_rejoin cache_rejoin ack from mds.0 (130 bytes)
2024-04-21T02:41:46.654+0000 7f32c63a7640 7 mds.4.cache handle_cache_rejoin_ack from mds.0
2024-04-21T02:41:46.654+0000 7f32c63a7640 10 mds.4.cache exporting caps for client.4607 ino 0x10000000000
2024-04-21T02:41:46.654+0000 7f32c63a7640 10 mds.4.171 send_message_client_counted client.4607 seq 1 client_caps(export ino 0x10000000000 1 seq 0 caps=- dirty=- wanted=- follows 0 size 0/0 mtime 0.000000 ctime 0.000000 change_attr 0)
2024-04-21T02:41:46.654+0000 7f32c63a7640 1 -- [v2:172.21.15.73:6835/2782686517,v1:172.21.15.73:6837/2782686517] --> v1:172.21.15.43:0/3153587376 -- client_caps(export ino 0x10000000000 1 seq 0 caps=- dirty=- wanted=- follows 0 size 0/0 mtime 0.000000 ctime 0.000000 change_attr 0) -- 0x55cb01ddb500 con 0x55cb01e7ed80
2024-04-21T02:41:46.654+0000 7f32c63a7640 10 mds.4.cache open_snaprealms
2024-04-21T02:41:46.654+0000 7f32c63a7640 10 mds.4.cache send_snaps
2024-04-21T02:41:46.654+0000 7f32c63a7640 10 mds.4.cache.snaprealm(0x3 seq 1 0x55cb01e68480) build_snap_set on snaprealm(0x3 seq 1 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x55cb01e68480)
2024-04-21T02:41:46.654+0000 7f32c63a7640 10 mds.4.cache.snaprealm(0x3 seq 1 0x55cb01e68480) build_snap_trace my_snaps [2fe]
2024-04-21T02:41:46.654+0000 7f32c63a7640 10 mds.4.cache.snaprealm(0x3 seq 1 0x55cb01e68480) check_cache rebuilt 2fe seq 2fe cached_seq 2fe cached_last_created 2fe cached_last_destroyed 2fd)
2024-04-21T02:41:46.654+0000 7f32c63a7640 10 mds.4.cache finish_snaprealm_reconnect client.4607 up to date on snaprealm(0x3 seq 1 lc 0 cr 0 cps 1 snaps={} last_modified 0.000000 change_attr 0 0x55cb01e68480)
2024-04-21T02:41:46.654+0000 7f32c63a7640 10 mds.4.cache send_snaps
2024-04-21T02:41:46.654+0000 7f32c63a7640 10 mds.4.171 send_message_client_counted client.4607 seq 2 client_snap(update split=0x3 tracelen=56)
2024-04-21T02:41:46.654+0000 7f32c63a7640 1 -- [v2:172.21.15.73:6835/2782686517,v1:172.21.15.73:6837/2782686517] --> v1:172.21.15.43:0/3153587376 -- client_snap(update split=0x3 tracelen=56) -- 0x55cb01c410e0 con 0x55cb01e7ed80
2024-04-21T02:41:46.654+0000 7f32c63a7640 5 mds.4.cache open_snaprealms has unconnected snaprealm:
2024-04-21T02:41:46.654+0000 7f32c63a7640 5 mds.4.cache 0x1 {client.4607/1}
2024-04-21T02:41:46.654+0000 7f32c63a7640 5 mds.4.cache 0x1000000a2ed {client.4607/2fe}
2024-04-21T02:41:46.654+0000 7f32c63a7640 10 mds.4.cache open_snaprealms - all open
2024-04-21T02:41:46.654+0000 7f32c63a7640 10 mds.4.cache do_delayed_cap_imports
2024-04-21T02:41:46.654+0000 7f32c63a7640 10 MDSContext::complete: 12C_MDS_VoidFn
2024-04-21T02:41:46.654+0000 7f32c63a7640 1 mds.4.171 rejoin_done
2024-04-21T02:41:46.654+0000 7f32c63a7640 15 mds.4.cache show_subtrees
2024-04-21T02:41:46.654+0000 7f32c63a7640 10 mds.4.cache |__ 4 auth [dir 0x104 ~mds4/ [2,head] auth v=1 cv=0/0 dir_auth=4 state=1073741824 f(v0 10=0+10) n(v0 rc2024-04-21T01:57:04.622273+0000 10=0+10) hs=0+0,ss=0+0 | subtree=1 subtreetemp=0 0x55cb01c4fa80]
2024-04-21T02:41:46.654+0000 7f32c63a7640 7 mds.4.cache show_cache
2024-04-21T02:41:46.654+0000 7f32c63a7640 7 mds.4.cache unlinked [inode 0x3 [...2,head] #3/ auth v1 snaprealm=0x55cb01e68480 f() n(v0 rc2024-04-21T02:41:31.515383+0000 1=0+1) 0x55cb01e6e000]
2024-04-21T02:41:46.654+0000 7f32c63a7640 7 mds.4.cache unlinked [inode 0x104 [...2,head] ~mds4/ auth v1 snaprealm=0x55cb01c2dd40 f(v0 10=0+10) n(v0 rc2024-04-21T01:57:04.622273+0000 11=0+11) | dirfrag=1 openingsnapparents=0 0x55cb00ff3180]
2024-04-21T02:41:46.654+0000 7f32c63a7640 7 mds.4.cache dirfrag [dir 0x104 ~mds4/ [2,head] auth v=1 cv=0/0 dir_auth=4 state=1073741824 f(v0 10=0+10) n(v0 rc2024-04-21T01:57:04.622273+0000 10=0+10) hs=0+0,ss=0+0 | subtree=1 subtreetemp=0 0x55cb01c4fa80]
2024-04-21T02:41:46.654+0000 7f32c63a7640 3 mds.4.171 request_state up:active
2024-04-21T02:41:46.654+0000 7f32c63a7640 5 mds.beacon.b set_want_state: up:rejoin -> up:active
2024-04-21T02:41:46.654+0000 7f32c63a7640 5 mds.beacon.b Sending beacon up:active seq 32
...
-1> 2024-04-21T02:50:17.624+0000 7f32c63a7640 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.0.0-3201-g0b60fd01/rpm/el9/BUILD/ceph-19.0.0-3201-g0b60fd01/src/mds/MDCache.cc: In function 'void MDCache::handle_cache_rejoin_ack(ceph::cref_t<MMDSCacheRejoin>&)' thread 7f32c63a7640 time 2024-04-21T02:50:17.624573+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.0.0-3201-g0b60fd01/rpm/el9/BUILD/ceph-19.0.0-3201-g0b60fd01/src/mds/MDCache.cc: 5158: FAILED ceph_assert(isolated_inodes.empty())
ceph version 19.0.0-3201-g0b60fd01 (0b60fd01511511bc020e1a45638ede6ead9e38ec) squid (dev)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x127) [0x7f32cc04aa17]
2: /usr/lib64/ceph/libceph-common.so.2(+0x24ac24) [0x7f32cc04ac24]
3: (MDCache::handle_cache_rejoin_ack(boost::intrusive_ptr<MMDSCacheRejoin const> const&)+0x232f) [0x55cafbf9e7a3]
4: (MDCache::handle_cache_rejoin(boost::intrusive_ptr<MMDSCacheRejoin const> const&)+0x395) [0x55cafbf95f75]
5: (MDCache::dispatch(boost::intrusive_ptr<Message const> const&)+0xec) [0x55cafbfbd400]
6: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0x11d) [0x55cafbe09b8b]
Updated by Xiubo Li 7 days ago
Yeah, this time we really did hit another case. The local MDS was in the up:active state but the other ranks were not, so in this case the local MDS needs to start a rejoin too:
2024-04-21T02:50:11.680+0000 7f32c4ba4640 20 mds.4.171 updating export targets, currently 0 ranks are targets
2024-04-21T02:50:11.688+0000 7f32c63a7640 1 -- [v2:172.21.15.73:6835/2782686517,v1:172.21.15.73:6837/2782686517] <== mds.1 v2:172.21.15.43:6836/3580610477 10 ==== ==== 50+0+0 (crc 0 0 0) 0x55cb02051200 con 0x55cb01f78d00
2024-04-21T02:50:11.688+0000 7f32c63a7640 10 quiesce.mds.4 <quiesce_dispatch> got q-db[v:(198:0) sets:0/0] from 9054
2024-04-21T02:50:11.688+0000 7f32c63a7640 3 quiesce.mds.4 <quiesce_dispatch> error (-116) submitting q-db[v:(198:0) sets:0/0] from 9054
2024-04-21T02:50:11.690+0000 7f32c63a7640 1 -- [v2:172.21.15.73:6835/2782686517,v1:172.21.15.73:6837/2782686517] <== mon.1 v2:172.21.15.73:3300/0 205 ==== mdsmap(e 198) ==== 3684+0+0 (secure 0 0 0) 0x55cb01e05e00 con 0x55cb01019180
2024-04-21T02:50:11.690+0000 7f32c63a7640 1 mds.b Updating MDS map to version 198 from mon.1
2024-04-21T02:50:11.691+0000 7f32c63a7640 10 mds.b my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2,11=minor log segments,12=quiesce subvolumes}
2024-04-21T02:50:11.691+0000 7f32c63a7640 10 mds.b mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2,11=minor log segments,12=quiesce subvolumes}
2024-04-21T02:50:11.691+0000 7f32c63a7640 10 mds.b my gid is 8158
2024-04-21T02:50:11.691+0000 7f32c63a7640 10 mds.b map says I am mds.4.171 state up:active
2024-04-21T02:50:11.691+0000 7f32c63a7640 10 mds.b msgr says I am [v2:172.21.15.73:6835/2782686517,v1:172.21.15.73:6837/2782686517]
2024-04-21T02:50:11.691+0000 7f32c63a7640 10 mds.b handle_mds_map: handling map as rank 4
2024-04-21T02:50:11.691+0000 7f32c63a7640 1 mds.4.171 rejoin_joint_start
2024-04-21T02:50:11.691+0000 7f32c63a7640 10 mds.4.cache rejoin_send_rejoins with recovery_set 0,1,2,3
2024-04-21T02:50:11.691+0000 7f32c63a7640 10 mds.4.cache disambiguate_other_imports
2024-04-21T02:50:11.691+0000 7f32c63a7640 10 mds.4.cache rejoin_walk [dir 0x1 / [2,head] rep@0.1 dir_auth=0 state=0 f(v0 m2024-04-21T01:57:20.110195+0000 1=0+1) n(v95 rc2024-04-21T02:41:15.972424+0000 b19129474 rs1 438=425+13) hs=1+0,ss=0+0 | dnwaiter=0 child=1 subtree=1 0x55cb01c4d200]
2024-04-21T02:50:11.691+0000 7f32c63a7640 15 mds.4.cache add_strong_dirfrag [dir 0x1 / [2,head] rep@0.1 dir_auth=0 state=0 f(v0 m2024-04-21T01:57:20.110195+0000 1=0+1) n(v95 rc2024-04-21T02:41:15.972424+0000 b19129474 rs1 438=425+13) hs=1+0,ss=0+0 | dnwaiter=0 child=1 subtree=1 0x55cb01c4d200]
We should fix this case too.
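To make the condition concrete, here is a minimal standalone sketch of the check being described: a rank that is already up:active must also re-enter rejoin when any peer rank in its recovery set is back in up:rejoin. The names here (`needs_rejoin`, the `MDSState` enum) are illustrative only, not Ceph's actual MDSMap types or the proposed patch:

```cpp
#include <vector>

// Hypothetical model of the two daemon states that matter in this report.
enum class MDSState { Rejoin, Active };

// If the local rank stays up:active while a peer restarts its rejoin, the
// peer's later rejoin_ack can carry inodes the active rank never resolved,
// which is what trips ceph_assert(isolated_inodes.empty()) in
// MDCache::handle_cache_rejoin_ack. So an active rank must join the rejoin
// round whenever any peer in its recovery set is in up:rejoin.
bool needs_rejoin(MDSState local, const std::vector<MDSState>& peers) {
    if (local != MDSState::Active)
        return false;  // a recovering rank is already part of the rejoin
    for (MDSState p : peers)
        if (p == MDSState::Rejoin)
            return true;  // active rank must start a rejoin too
    return false;
}
```

In the log above, mds.4 is exactly this case: the map says it is up:active while other ranks are rejoining, and it correctly logs rejoin_joint_start; the crash comes later in handle_cache_rejoin_ack.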