Bug #51824
pacific: scrub ~mdsdir causes stray-related ceph_assert, abort, and OOM
Description
We are testing the scalability of pacific snapshots, snap trimming, and num_strays accounting, and we ran into some MDS crashes today.
We started this exercise with a cluster reporting num_strays = 22M. (To get to this point, we had created lots of files, snapshotted the directories, and then removed the snapshots.)
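(For context, the workload was roughly of the following shape; the mount point and names below are hypothetical, and we assume the files were unlinked between snapshotting and snapshot removal, since that is what turns them into strays:)

# mkdir -p /mnt/cephfs/dir1                        # hypothetical CephFS mount point and dir
# for i in $(seq 1 100000); do touch /mnt/cephfs/dir1/file$i; done
# mkdir /mnt/cephfs/dir1/.snap/snap1               # snapshot the dir
# find /mnt/cephfs/dir1 -type f -delete            # unlinked files become strays held by the snapshot
# rmdir /mnt/cephfs/dir1/.snap/snap1               # remove the snapshot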
Basically, we are trying to understand why num_strays is not decreasing. We tried to list the stray omap entries with `rados -p cephfs.meta listomapkeys 600.00000000`, but the output is empty.
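(A detail worth double-checking: rank 0 actually has ten stray directories, stray0 through stray9, backed by the metadata objects 600.00000000 through 609.00000000, and a large stray directory may be fragmented across further dirfrag objects. A sketch that checks all of them, assuming the cephfs.meta pool from above:)

# for i in $(seq 0 9); do echo "stray$i: $(rados -p cephfs.meta listomapkeys 60$i.00000000 | wc -l) entries"; done
# rados -p cephfs.meta ls | grep -E '^60[0-9]\.'   # shows any additional fragment objects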
So we decided to run a read-only recursive scrub of ~mdsdir to see what it would report about the stray dirstats.
1. # ceph tell mds.0 scrub start ~mdsdir recursive
This led to a ceph_assert(o->get_num_ref() == 0):
2021-07-23T09:46:24.969+0200 7fd57f258700  1 mds.cephdata20b-106a771b77 asok_command: scrub start {path=~mdsdir,prefix=scrub start,scrubops=[recursive]} (starting...)
2021-07-23T09:46:24.969+0200 7fd577248700  0 log_channel(cluster) log [INF] : scrub queued for path: ~mds0
2021-07-23T09:46:24.969+0200 7fd577248700  0 log_channel(cluster) log [INF] : scrub summary: idle+waiting paths [~mds0]
2021-07-23T09:46:24.969+0200 7fd577248700  0 log_channel(cluster) log [INF] : scrub summary: active paths [~mds0]
2021-07-23T09:46:26.420+0200 7fd577248700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mds/MDCache.cc: In function 'void MDCache::remove_inode(CInode*)' thread 7fd577248700 time 2021-07-23T09:46:26.414153+0200
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mds/MDCache.cc: 336: FAILED ceph_assert(o->get_num_ref() == 0)
 ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7fd586e2bb24]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x276d3e) [0x7fd586e2bd3e]
 3: (MDCache::remove_inode(CInode*)+0x598) [0x55e9b69bfcf8]
 4: (StrayManager::_purge_stray_logged(CDentry*, unsigned long, boost::intrusive_ptr<MutationImpl>&)+0x304) [0x55e9b6a7d354]
 5: (MDSContext::complete(int)+0x56) [0x55e9b6bd7b96]
 6: (MDSIOContextBase::complete(int)+0x5ac) [0x55e9b6bd83cc]
 7: (MDSLogContextBase::complete(int)+0x44) [0x55e9b6bd8644]
 8: (Finisher::finisher_thread_entry()+0x1a5) [0x7fd586ecab65]
 9: /lib64/libpthread.so.0(+0x814a) [0x7fd585bcb14a]
 10: clone()
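(The scrub summaries above come from the cluster log; progress can also be polled with `ceph tell mds.0 scrub status`.)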
2. The MDS restarted and reconnected; we noticed it was now decreasing num_strays and using a lot of CPU.
We started the scrub again on ~mdsdir and got an abort:
2021-07-23T09:52:22.804+0200 7fecdbda7700  1 mds.cephdata20b-106a771b77 asok_command: scrub start {path=~mdsdir,prefix=scrub start,scrubops=[recursive]} (starting...)
2021-07-23T09:52:22.804+0200 7fecd3d97700  0 log_channel(cluster) log [INF] : scrub queued for path: ~mds0
2021-07-23T09:52:22.804+0200 7fecd3d97700  0 log_channel(cluster) log [INF] : scrub summary: idle+waiting paths [~mds0]
2021-07-23T09:52:22.804+0200 7fecd3d97700  0 log_channel(cluster) log [INF] : scrub summary: active paths [~mds0]
2021-07-23T09:52:41.494+0200 7fecd4d99700 -1 mds.0.cache.strays Rogue reference after purge to [dentry #0x100/stray6/1002112a666 [62f,head] auth (dversion lock) v=38380708 ino=0x1002112a666 state=1073741860 | inodepin=1 purging=1 dirty=0 0x5621bc8d8780]
2021-07-23T09:52:41.495+0200 7fecd4d99700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mds/StrayManager.cc: In function 'void StrayManager::_purge_stray_purged(CDentry*, bool)' thread 7fecd4d99700 time 2021-07-23T09:52:41.495501+0200
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mds/StrayManager.cc: 218: ceph_abort_msg("rogue reference to purging inode")
 ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x7fece397c264]
 2: (StrayManager::_purge_stray_purged(CDentry*, bool)+0x2db) [0x5621af1667eb]
 3: (MDSContext::complete(int)+0x56) [0x5621af2c2b96]
 4: (MDSIOContextBase::complete(int)+0x5ac) [0x5621af2c33cc]
 5: (Finisher::finisher_thread_entry()+0x1a5) [0x7fece3a1ab65]
 6: /lib64/libpthread.so.0(+0x814a) [0x7fece271b14a]
 7: clone()
3. After this restart, we started a third scrub on ~mdsdir, which caused the MDS to miss heartbeats for around a minute. We also noticed that num_strays is now reported as 0, though we are not sure when that changed.
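(num_strays here refers to the mds_cache perf counter, which we sample roughly like this, assuming jq is available:)

# ceph tell mds.0 perf dump | jq .mds_cache.num_strays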
This time the MDS eventually went OOM:
kernel: Out of memory: Killed process 842857 (ceph-mds) total-vm:36420156kB, anon-rss:35940832>
As of now, the MDS is stable and we won't run another scrub.
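(For anyone else reproducing this: a recursive scrub can be stopped or paused before it gets this far, e.g.:)

# ceph tell mds.0 scrub abort
# ceph tell mds.0 scrub pause    # later: ceph tell mds.0 scrub resume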
The MDS log has been uploaded via ceph-post-file: be2e6b89-088c-4cbc-99ed-0a931a60f6a0
We also have the coredump from the first assert, in case it is useful:
# coredumpctl
TIME                          PID      UID  GID  SIG  COREFILE  EXE
Fri 2021-07-23 09:47:04 CEST  1939250  167  167  6    present   /usr/bin/ceph-mds
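(A sketch of extracting a backtrace from it, assuming the matching ceph debuginfo packages are installed:)

# coredumpctl gdb 1939250
(gdb) thread apply all bt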