Bug #64717

open

MDS stuck in replay/resolve loop

Added by Abhishek Lekshmanan 2 months ago. Updated 27 days ago.

Status:
New
Priority:
High
Category:
Performance/Resource Usage
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
quincy,reef,squid
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have a CephFS cluster where we ran a lot of metadata-intensive workloads with snapshots enabled. In our monitoring, we saw the number of MDLog segments growing to the order of millions, and the MDS daemon being killed after that. Since then the MDSes have been permanently stuck in an up:replay and up:resolve loop. The initial up:replay always ended up getting OOM-killed; this repeated even with 1 TiB of swap space. After finally bumping up to 5 TiB of swap, the MDS eventually completed up:replay, but after a couple of days in up:resolve it finally crashed. During the up:resolve stage we took a few stacktraces of the MDSes to understand whether they were making any progress; except for one, all the threads were waiting on a mutex, e.g. the log thread:

TID 761439:
#0  0x00007f455144345c pthread_cond_wait@@GLIBC_2.3.2
#1  0x00007f455086d870 std::condition_variable::wait(std::unique_lock<std::mutex>&)
#2  0x00005594a654ac76 void std::_V2::condition_variable_any::wait<std::unique_lock<ceph::fair_mutex> >(std::unique_lock<ceph::fair_mutex>&)
#3  0x00005594a6542001 MDLog::_submit_thread()
#4  0x00005594a61f8081 MDLog::SubmitThread::entry()
#5  0x00007f455143d1ca start_thread

The only thread making progress was this one:

#0  0x00007f455085f1e3 std::_Rb_tree_increment(std::_Rb_tree_node_base*)
#1  0x00005594a634803c MDCache::recalc_auth_bits(bool)
#2  0x00005594a635cd25 MDCache::maybe_resolve_finish()
#3  0x00005594a635ef0a MDCache::handle_resolve(boost::intrusive_ptr<MMDSResolve const> const&)
#4  0x00005594a636aaac MDCache::dispatch(boost::intrusive_ptr<Message const> const&)
#5  0x00005594a61de27b MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)
#6  0x00005594a61de8bc MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)
#7  0x00005594a61c7c2f MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)
#8  0x00007f45526cbf18 Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)
#9  0x00007f45526c935f DispatchQueue::entry()
#10 0x00007f45527933d1 DispatchQueue::DispatchThread::entry()
#11 0x00007f455143d1ca start_thread
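(For reference, backtraces like these can be captured from the live MDS by attaching gdb; the one-liner below is only an illustration, and the pid lookup and output path are placeholders.)

 # attach to the running ceph-mds, dump all thread backtraces, then detach (illustrative)
 gdb -p $(pidof ceph-mds) -batch -ex 'set pagination off' -ex 'thread apply all bt' > /tmp/mds-threads.txt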

Eventually the resolve stage also crashed, with the following stacktrace:

     0> 2024-03-04T07:01:46.040+0100 7f454ba21700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f454ba21700 thread_name:admin_socket

 ceph version 17.2.7-1 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f4551447cf0]
 2: (MDSDaemon::dump_status(ceph::Formatter*)+0x1e0) [0x5594a61c00a0]
 3: (MDSDaemon::asok_command(std::basic_string_view<char, std::char_traits<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_tra
its<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<
char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boo
st::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::
basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::Formatter*, ceph::buffer::v15_2_0::list const&, std::fu
nction<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&)>)+0x515) [0x5594a61c5e25]
 4: (MDSSocketHook::call_async(std::basic_string_view<char, std::char_traits<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_t
raits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocato
r<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, b
oost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11
::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::Formatter*, ceph::buffer::v15_2_0::list const&, std::
function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&)>)+0xa8) [0x5594a61cf258]
 5: (AdminSocket::execute_command(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, c
eph::buffer::v15_2_0::list const&, std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&)>)+0x889) [0x7f4552566459]
 6: (AdminSocket::execute_command(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, c
eph::buffer::v15_2_0::list const&, std::ostream&, ceph::buffer::v15_2_0::list*)+0x11e) [0x7f4552566cce]
 7: (AdminSocket::do_accept()+0x443) [0x7f4552567203]
 8: (AdminSocket::entry()+0x500) [0x7f4552568970]
 9: /lib64/libstdc++.so.6(+0xc2b23) [0x7f4550873b23]
 10: /lib64/libpthread.so.0(+0x81ca) [0x7f455143d1ca]
 11: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Before this, the MDS was mostly logging:

2024-02-28T22:47:44.966+0100 7f4549a1d700  1 mds.1.794194 handle_mds_map state change up:replay --> up:resolve
2024-02-28T22:47:44.966+0100 7f4549a1d700  1 mds.1.794194 resolve_start
2024-02-28T22:47:44.966+0100 7f4549a1d700  1 mds.1.794194 reopen_log
2024-02-28T22:47:44.966+0100 7f4549a1d700  1 mds.1.794194  recovery set is 0
2024-02-28T22:47:44.966+0100 7f4549a1d700  1 mds.1.794194  recovery set is 0
2024-02-28T22:47:44.986+0100 7f4549a1d700  1 mds.miniflax-mds-fee831e8fb parse_caps: cannot decode auth caps buffer of length 0
2024-02-28T22:48:01.820+0100 7f454ba21700  1 mds.miniflax-mds-fee831e8fb asok_command: status {prefix=status} (starting...)
2024-02-28T22:52:52.011+0100 7f4547a19700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s
2024-02-28T22:52:52.012+0100 7f4547a19700  0 mds.beacon.miniflax-mds-fee831e8fb Skipping beacon heartbeat to monitors (last acked 4s ago); MDS internal heartbeat is not healthy!
2024-02-28T22:52:52.512+0100 7f4547a19700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s

These messages continued with increasing timeouts, and eventually we also saw lines like the following:
2024-02-29T00:24:37.539+0100 7f4548a1b700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2024-02-28T23:24:37.541020+0100)

Towards the crash, the last few lines were:

   -6> 2024-03-04T07:01:46.007+0100 7f4549a1d700 10 monclient: _finish_auth 0
    -5> 2024-03-04T07:01:46.007+0100 7f4549a1d700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2024-03-04T07:01:16.007875+0100)
    -4> 2024-03-04T07:01:46.007+0100 7f4549a1d700 10 monclient: _finish_auth 0
    -3> 2024-03-04T07:01:46.007+0100 7f4549a1d700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2024-03-04T07:01:16.007888+0100)
    -2> 2024-03-04T07:01:46.007+0100 7f4549a1d700 10 monclient: _finish_auth 0
    -1> 2024-03-04T07:01:46.007+0100 7f4549a1d700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2024-03-04T07:01:16.007894+0100)


Related issues: 1 (1 open, 0 closed)

Related to CephFS - Bug #64348: mds: possible memory leak in up:rejoin when opening cap inodes (from OFT) (Triaged, assigned to Leonid Usov)

Actions #1

Updated by Abhishek Lekshmanan 2 months ago

  • Description updated (diff)
Actions #2

Updated by Venky Shankar about 2 months ago

  • Related to Bug #64348: mds: possible memory leak in up:rejoin when opening cap inodes (from OFT) added
Actions #3

Updated by Venky Shankar about 2 months ago

  • Category set to Performance/Resource Usage
  • Target version set to v20.0.0
  • Source set to Community (user)
  • Backport set to quincy,reef,squid
  • Severity changed from 3 - minor to 2 - major
  • Component(FS) MDS added
Actions #4

Updated by Venky Shankar about 2 months ago

Hi Abhishek,

Abhishek Lekshmanan wrote:

We have a cephfs cluster where we ran a lot of metadata intensive workloads with snapshots enabled. In our monitoring, we saw increasing MDLog segments growing to the order of millions and the MDS daemon being killed after that.

What configurations were tuned, if any? It would also help if you could share details such as the max_mds setting, etc. (or a sample `fs dump` output). I ask since the default balancer can behave notoriously badly at times (when directories are not [ephemerally] pinned).
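Something along these lines would capture that information (the filesystem name below is just a placeholder):

 # dump the full FSMap, which includes max_mds, ranks and standbys
 ceph fs dump
 # or just the MDSMap for a single filesystem
 ceph fs get <fs_name>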

Since then MDS's have been permanently stuck in up:replay and up:resolve loop. Initial up:replay always ended up getting oom-killed, this still repeated after 1 TiB of swap space. After finally bumping up to 5 TiB of swap we see that the MDS eventually completed up:replay but after a couple of days of up:resolve, it finally crashed.

With an insanely huge number of log segments, the MDS is going to OOM itself out. The root cause to figure out is why the log segments were not getting periodically trimmed. Do you (by chance) have something like a 5 minute debug mds log?
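As a rough recipe (assuming overrides via the config database are acceptable), temporarily bumping the MDS debug level and reverting it afterwards should be enough:

 # raise debug logging for the mds daemons
 ceph config set mds debug_mds 20
 ceph config set mds debug_ms 1
 # ... let the MDS run in the problematic state for ~5 minutes ...
 # revert to the defaults
 ceph config rm mds debug_mds
 ceph config rm mds debug_ms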

During the up:resolve stage we took a few stacktraces of MDSes to understand if it was doing anything, except one all the threads were waiting on a mutex... for eg. the log thread
[...]

The only thread making progress was this thread
[...]

This needs to be inspected closely. I also linked a tracker (#64348) that's tracking an MDS OOM kill in up:rejoin. I believe there is a memory leak (or some kind of faulty logic) that causes the MDS to go around in circles for a long time, eventually OOMing out. Apparently, we are not catching this in the qa suite.

Eventaully the resolve stage also crashed with the following stacktrace:
1: /lib64/libpthread.so.0(+0x12cf0) [0x7f4551447cf0]
2: (MDSDaemon::dump_status(ceph::Formatter*)+0x1e0) [0x5594a61c00a0]
3: (MDSDaemon::asok_command(std::basic_string_view<char, std::char_traits<char> >

That's a crash when handling an admin socket command. Do you have a core that can be debugged?

Before this what MDS was logging was mostly

2024-02-28T22:52:52.011+0100 7f4547a19700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s
2024-02-28T22:52:52.012+0100 7f4547a19700 0 mds.beacon.miniflax-mds-fee831e8fb Skipping beacon heartbeat to monitors (last acked 4s ago); MDS internal heartbeat is not healthy!
2024-02-28T22:52:52.512+0100 7f4547a19700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s

That's the internal MDS heartbeat timing out. The MDS periodically resets this when it's iterating over a large number of items (inodes, etc.) while holding the mds_lock. The fact that these log messages were seen soon after entering up:resolve makes me think it's the up:replay state that deteriorated operations.

which continued on with increasing timeouts and eventually we also saw the following lines
[...]

and towards the crash the last few lines were
[...]

Actions #5

Updated by Venky Shankar about 2 months ago

Hi Abhishek,

This tracker was discussed in yesterday's cephfs standup. In particular, the mds crash backtrace that you shared:

ceph version 17.2.7-1 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
1: /lib64/libpthread.so.0(+0x12cf0) [0x7f4551447cf0]
2: (MDSDaemon::dump_status(ceph::Formatter*)+0x1e0) [0x5594a61c00a0]

is a bit of a red herring; it's likely that the crash occurred in some other thread. So, if possible, please share the mds coredump (which could be huge!) or do a `thread apply all bt` and share the backtraces from all threads.
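For the coredump option, gcore can usually grab one from the live process without stopping it (the output path below is just an example):

 # dump a core of the running ceph-mds; the file ends up as /tmp/ceph-mds.core.<pid>
 gcore -o /tmp/ceph-mds.core $(pidof ceph-mds)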

Actions #6

Updated by Milind Changire about 2 months ago

  • Assignee set to Milind Changire
Actions #7

Updated by Milind Changire 27 days ago

Would introducing states like up:active-replay and up:active-standby-replay help?
These states would imply that the MDS has completed the replay of the journal and is now ready to do normal processing.

The existing states up:replay and up:standby-replay would imply that the MDS is replaying the journal and is not available for normal active work.

As long as the MDS is in up:replay or up:standby-replay, any beacons/heartbeats could then be deemed irrelevant, since the MDS wouldn't be in its normal functioning state.
