Bug #51824

pacific scrub ~mds_dir causes stray related ceph_assert, abort and OOM

Added by Dan van der Ster almost 3 years ago. Updated about 1 year ago.

Status:
Pending Backport
Priority:
Normal
Category:
Correctness/Safety
Target version:
% Done: 0%
Source:
Tags:
backport_processed
Backport:
reef,quincy,pacific
Regression:
Yes
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are testing the scalability of pacific snapshots, snap trimming, and num_stray accounting, and we ran into some crashes today.

We started this exercise with a cluster having num_stray = 22M. (To get to this point, we had created lots of files, snapshotted the dirs, and then removed the snapshots).
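
One way to read num_stray is from the MDS perf counters, e.g. (assuming admin-socket access on the host running the active MDS; the daemon name here is taken from the logs below):

# ceph daemon mds.cephdata20b-106a771b77 perf dump mds_cache | grep num_strays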

Basically, we're trying to understand why num_stray is not decreasing. We tried to list the stray omap entries with `rados -p cephfs.meta listomapkeys 600.00000000`, but the output was empty.
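
To rule out the other stray directories, a quick check over all ten stray dir objects (inodes 0x600 through 0x609 for rank 0) looks roughly like the following; note this sketch ignores any extra dirfrag objects (60x.00000001, ...) that would exist if the stray dirs are fragmented:

# for i in $(seq 0 9); do echo "60${i}.00000000: $(rados -p cephfs.meta listomapkeys 60${i}.00000000 | wc -l) keys"; done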

So we decided to perform a read-only scrub of ~mdsdir to see what it would report about the stray dirstats.

1. # ceph tell mds.0 scrub start ~mdsdir recursive

This led to a ceph_assert(o->get_num_ref() == 0):

2021-07-23T09:46:24.969+0200 7fd57f258700  1 mds.cephdata20b-106a771b77 asok_command: scrub start {path=~mdsdir,prefix=scrub start,scrubops=[recursive]} (starting...)
2021-07-23T09:46:24.969+0200 7fd577248700  0 log_channel(cluster) log [INF] : scrub queued for path: ~mds0
2021-07-23T09:46:24.969+0200 7fd577248700  0 log_channel(cluster) log [INF] : scrub summary: idle+waiting paths [~mds0]
2021-07-23T09:46:24.969+0200 7fd577248700  0 log_channel(cluster) log [INF] : scrub summary: active paths [~mds0]

2021-07-23T09:46:26.420+0200 7fd577248700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mds/MDCache.cc: In function 'void MDCache::remove_inode(CInode*)' thread 7fd577248700 time 2021-07-23T09:46:26.414153+0200
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mds/MDCache.cc: 336: FAILED ceph_assert(o->get_num_ref() == 0)

 ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7fd586e2bb24]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x276d3e) [0x7fd586e2bd3e]
 3: (MDCache::remove_inode(CInode*)+0x598) [0x55e9b69bfcf8]
 4: (StrayManager::_purge_stray_logged(CDentry*, unsigned long, boost::intrusive_ptr<MutationImpl>&)+0x304) [0x55e9b6a7d354]
 5: (MDSContext::complete(int)+0x56) [0x55e9b6bd7b96]
 6: (MDSIOContextBase::complete(int)+0x5ac) [0x55e9b6bd83cc]
 7: (MDSLogContextBase::complete(int)+0x44) [0x55e9b6bd8644]
 8: (Finisher::finisher_thread_entry()+0x1a5) [0x7fd586ecab65]
 9: /lib64/libpthread.so.0(+0x814a) [0x7fd585bcb14a]
 10: clone()
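
In hindsight, the state of the scrub could have been checked (and the scrub stopped) via the tell interface before retrying, e.g.:

# ceph tell mds.0 scrub status
# ceph tell mds.0 scrub abort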

2. The MDS reconnected; we noticed it was now decreasing num_strays and using a lot of CPU.
We then started the scrub again on ~mdsdir and got an abort:

2021-07-23T09:52:22.804+0200 7fecdbda7700  1 mds.cephdata20b-106a771b77 asok_command: scrub start {path=~mdsdir,prefix=scrub start,scrubops=[recursive]} (starting...)
2021-07-23T09:52:22.804+0200 7fecd3d97700  0 log_channel(cluster) log [INF] : scrub queued for path: ~mds0
2021-07-23T09:52:22.804+0200 7fecd3d97700  0 log_channel(cluster) log [INF] : scrub summary: idle+waiting paths [~mds0]
2021-07-23T09:52:22.804+0200 7fecd3d97700  0 log_channel(cluster) log [INF] : scrub summary: active paths [~mds0]

2021-07-23T09:52:41.494+0200 7fecd4d99700 -1 mds.0.cache.strays Rogue reference after purge to [dentry #0x100/stray6/1002112a666 [62f,head] auth (dversion lock) v=38380708 ino=0x1002112a666 state=1073741860 | inodepin=1 purging=1 dirty=0 0x5621bc8d8780]
2021-07-23T09:52:41.495+0200 7fecd4d99700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mds/StrayManager.cc: In function 'void StrayManager::_purge_stray_purged(CDentry*, bool)' thread 7fecd4d99700 time 2021-07-23T09:52:41.495501+0200
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/mds/StrayManager.cc: 218: ceph_abort_msg("rogue reference to purging inode")

 ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x7fece397c264]
 2: (StrayManager::_purge_stray_purged(CDentry*, bool)+0x2db) [0x5621af1667eb]
 3: (MDSContext::complete(int)+0x56) [0x5621af2c2b96]
 4: (MDSIOContextBase::complete(int)+0x5ac) [0x5621af2c33cc]
 5: (Finisher::finisher_thread_entry()+0x1a5) [0x7fece3a1ab65]
 6: /lib64/libpthread.so.0(+0x814a) [0x7fece271b14a]
 7: clone()

3. After this restart, we started a third scrub on ~mdsdir, which caused the MDS to miss heartbeats for around a minute. We also noticed that num_strays is now reported as 0, but we're not sure when that changed.
This time the MDS eventually went OOM:

kernel: Out of memory: Killed process 842857 (ceph-mds) total-vm:36420156kB, anon-rss:35940832>
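
For context, the ~36 GB RSS at the time of the kill can be compared against the configured MDS cache memory limit, e.g.:

# ceph config get mds mds_cache_memory_limit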

As of now, the MDS is stable and we won't run another scrub.

The MDS log is at ceph-post-file: be2e6b89-088c-4cbc-99ed-0a931a60f6a0
We have the coredump from the first assert in case it's useful:

# coredumpctl 
TIME                            PID   UID   GID SIG COREFILE  EXE
Fri 2021-07-23 09:47:04 CEST 1939250   167   167   6 present   /usr/bin/ceph-mds
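
If a full backtrace from the core is useful, it should be extractable with coredumpctl and gdb (with the matching ceph-debuginfo packages installed), roughly:

# coredumpctl info 1939250
# coredumpctl gdb 1939250
(gdb) thread apply all bt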


Related issues 3 (0 open, 3 closed)

Copied to CephFS - Backport #59263: reef: pacific scrub ~mds_dir causes stray related ceph_assert, abort and OOM (Resolved, Dhairya Parmar)
Copied to CephFS - Backport #59264: pacific: pacific scrub ~mds_dir causes stray related ceph_assert, abort and OOM (Resolved, Dhairya Parmar)
Copied to CephFS - Backport #59265: quincy: pacific scrub ~mds_dir causes stray related ceph_assert, abort and OOM (Resolved, Dhairya Parmar)