Bug #64717

open

MDS stuck in replay/resolve loop

Added by Abhishek Lekshmanan 2 months ago. Updated 27 days ago.

Status:
New
Priority:
High
Category:
Performance/Resource Usage
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
quincy,reef,squid
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have a CephFS cluster where we ran a lot of metadata-intensive workloads with snapshots enabled. In our monitoring, we saw the number of MDLog segments growing to the order of millions, and the MDS daemon being killed after that. Since then the MDSes have been permanently stuck in an up:replay and up:resolve loop. The initial up:replay always ended up getting OOM-killed; this repeated even with 1 TiB of swap space. After finally bumping up to 5 TiB of swap, the MDS eventually completed up:replay, but after a couple of days in up:resolve it finally crashed. During the up:resolve stage we took a few stacktraces of the MDSes to understand whether they were making any progress; except for one, all the threads were waiting on a mutex, e.g. the log thread:

TID 761439:
#0  0x00007f455144345c pthread_cond_wait@@GLIBC_2.3.2
#1  0x00007f455086d870 std::condition_variable::wait(std::unique_lock<std::mutex>&)
#2  0x00005594a654ac76 void std::_V2::condition_variable_any::wait<std::unique_lock<ceph::fair_mutex> >(std::unique_lock<ceph::fair_mutex>&)
#3  0x00005594a6542001 MDLog::_submit_thread()
#4  0x00005594a61f8081 MDLog::SubmitThread::entry()
#5  0x00007f455143d1ca start_thread

The only thread making progress was this one:

#0  0x00007f455085f1e3 std::_Rb_tree_increment(std::_Rb_tree_node_base*)
#1  0x00005594a634803c MDCache::recalc_auth_bits(bool)
#2  0x00005594a635cd25 MDCache::maybe_resolve_finish()
#3  0x00005594a635ef0a MDCache::handle_resolve(boost::intrusive_ptr<MMDSResolve const> const&)
#4  0x00005594a636aaac MDCache::dispatch(boost::intrusive_ptr<Message const> const&)
#5  0x00005594a61de27b MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)
#6  0x00005594a61de8bc MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)
#7  0x00005594a61c7c2f MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)
#8  0x00007f45526cbf18 Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)
#9  0x00007f45526c935f DispatchQueue::entry()
#10 0x00007f45527933d1 DispatchQueue::DispatchThread::entry()
#11 0x00007f455143d1ca start_thread
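(For reference, backtraces like these can be captured from the live MDS by attaching gdb; the one-liner below is only an illustration, and the pid lookup and output path are placeholders.)

 # attach to the running ceph-mds, dump all thread backtraces, then detach (illustrative)
 gdb -p $(pidof ceph-mds) -batch -ex 'set pagination off' -ex 'thread apply all bt' > /tmp/mds-threads.txt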

Eventually the resolve stage also crashed, with the following stacktrace:

     0> 2024-03-04T07:01:46.040+0100 7f454ba21700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f454ba21700 thread_name:admin_socket

 ceph version 17.2.7-1 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f4551447cf0]
 2: (MDSDaemon::dump_status(ceph::Formatter*)+0x1e0) [0x5594a61c00a0]
 3: (MDSDaemon::asok_command(std::basic_string_view<char, std::char_traits<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_tra
its<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<
char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boo
st::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::
basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::Formatter*, ceph::buffer::v15_2_0::list const&, std::fu
nction<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&)>)+0x515) [0x5594a61c5e25]
 4: (MDSSocketHook::call_async(std::basic_string_view<char, std::char_traits<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_t
raits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocato
r<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, b
oost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11
::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::Formatter*, ceph::buffer::v15_2_0::list const&, std::
function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&)>)+0xa8) [0x5594a61cf258]
 5: (AdminSocket::execute_command(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, c
eph::buffer::v15_2_0::list const&, std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&)>)+0x889) [0x7f4552566459]
 6: (AdminSocket::execute_command(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, c
eph::buffer::v15_2_0::list const&, std::ostream&, ceph::buffer::v15_2_0::list*)+0x11e) [0x7f4552566cce]
 7: (AdminSocket::do_accept()+0x443) [0x7f4552567203]
 8: (AdminSocket::entry()+0x500) [0x7f4552568970]
 9: /lib64/libstdc++.so.6(+0xc2b23) [0x7f4550873b23]
 10: /lib64/libpthread.so.0(+0x81ca) [0x7f455143d1ca]
 11: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Before this, the MDS was mostly logging:

2024-02-28T22:47:44.966+0100 7f4549a1d700  1 mds.1.794194 handle_mds_map state change up:replay --> up:resolve
2024-02-28T22:47:44.966+0100 7f4549a1d700  1 mds.1.794194 resolve_start
2024-02-28T22:47:44.966+0100 7f4549a1d700  1 mds.1.794194 reopen_log
2024-02-28T22:47:44.966+0100 7f4549a1d700  1 mds.1.794194  recovery set is 0
2024-02-28T22:47:44.966+0100 7f4549a1d700  1 mds.1.794194  recovery set is 0
2024-02-28T22:47:44.986+0100 7f4549a1d700  1 mds.miniflax-mds-fee831e8fb parse_caps: cannot decode auth caps buffer of length 0
2024-02-28T22:48:01.820+0100 7f454ba21700  1 mds.miniflax-mds-fee831e8fb asok_command: status {prefix=status} (starting...)
2024-02-28T22:52:52.011+0100 7f4547a19700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s
2024-02-28T22:52:52.012+0100 7f4547a19700  0 mds.beacon.miniflax-mds-fee831e8fb Skipping beacon heartbeat to monitors (last acked 4s ago); MDS internal heartbeat is not healthy!
2024-02-28T22:52:52.512+0100 7f4547a19700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s

These messages continued with increasing timeouts, and eventually we also saw lines like the following:
2024-02-29T00:24:37.539+0100 7f4548a1b700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2024-02-28T23:24:37.541020+0100)

Towards the crash, the last few lines were:

   -6> 2024-03-04T07:01:46.007+0100 7f4549a1d700 10 monclient: _finish_auth 0
    -5> 2024-03-04T07:01:46.007+0100 7f4549a1d700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2024-03-04T07:01:16.007875+0100)
    -4> 2024-03-04T07:01:46.007+0100 7f4549a1d700 10 monclient: _finish_auth 0
    -3> 2024-03-04T07:01:46.007+0100 7f4549a1d700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2024-03-04T07:01:16.007888+0100)
    -2> 2024-03-04T07:01:46.007+0100 7f4549a1d700 10 monclient: _finish_auth 0
    -1> 2024-03-04T07:01:46.007+0100 7f4549a1d700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2024-03-04T07:01:16.007894+0100)


Related issues: 1 (1 open, 0 closed)

Related to CephFS - Bug #64348: mds: possible memory leak in up:rejoin when opening cap inodes (from OFT) (Triaged, assigned to Leonid Usov)

Actions #1

Updated by Abhishek Lekshmanan 2 months ago

  • Description updated (diff)
Actions #2

Updated by Venky Shankar about 2 months ago

  • Related to Bug #64348: mds: possible memory leak in up:rejoin when opening cap inodes (from OFT) added
Actions #3

Updated by Venky Shankar about 2 months ago

  • Category set to Performance/Resource Usage
  • Target version set to v20.0.0
  • Source set to Community (user)
  • Backport set to quincy,reef,squid
  • Severity changed from 3 - minor to 2 - major
  • Component(FS) MDS added
Actions #4

Updated by Venky Shankar about 2 months ago

Hi Abhishek,

Abhishek Lekshmanan wrote:

We have a cephfs cluster where we ran a lot of metadata intensive workloads with snapshots enabled. In our monitoring, we saw increasing MDLog segments growing to the order of millions and the MDS daemon being killed after that.

What configurations were tuned, if any? It would also help if you could share details such as the max_mds setting, etc. (or a sample `fs dump` output). I ask since the default balancer can behave notoriously badly at times (when directories are not [ephemerally] pinned).
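Something along these lines would capture that information (the filesystem name below is just a placeholder):

 # dump the full FSMap, which includes max_mds, ranks and standbys
 ceph fs dump
 # or just the MDSMap for a single filesystem
 ceph fs get <fs_name>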

Since then MDS's have been permanently stuck in up:replay and up:resolve loop. Initial up:replay always ended up getting oom-killed, this still repeated after 1 TiB of swap space. After finally bumping up to 5 TiB of swap we see that the MDS eventually completed up:replay but after a couple of days of up:resolve, it finally crashed.

With an insanely huge number of log segments, the MDS is going to OOM itself out. The root cause to figure out is why the log segments were not getting periodically trimmed. Do you (by chance) have something like a 5 minute debug mds log?
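As a rough recipe (assuming overrides via the config database are acceptable), temporarily bumping the MDS debug level and reverting it afterwards should be enough:

 # raise debug logging for the mds daemons
 ceph config set mds debug_mds 20
 ceph config set mds debug_ms 1
 # ... let the MDS run in the problematic state for ~5 minutes ...
 # revert to the defaults
 ceph config rm mds debug_mds
 ceph config rm mds debug_ms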

During the up:resolve stage we took a few stacktraces of MDSes to understand if it was doing anything, except one all the threads were waiting on a mutex... for eg. the log thread
[...]

The only thread making progress was this thread
[...]

This needs to be inspected closely. I also linked a tracker (#64348) that's tracking an MDS OOM kill in up:rejoin. I believe there is a memory leak (or some kind of faulty logic) that causes the MDS to go around in circles for a long time, eventually OOMing out. Apparently, we are not catching this in the qa suite.

Eventaully the resolve stage also crashed with the following stacktrace:
1: /lib64/libpthread.so.0(+0x12cf0) [0x7f4551447cf0]
2: (MDSDaemon::dump_status(ceph::Formatter*)+0x1e0) [0x5594a61c00a0]
3: (MDSDaemon::asok_command(std::basic_string_view<char, std::char_traits<char> >

That's a crash when handling an admin socket command. Do you have a core that can be debugged?

Before this what MDS was logging was mostly

2024-02-28T22:52:52.011+0100 7f4547a19700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s
2024-02-28T22:52:52.012+0100 7f4547a19700 0 mds.beacon.miniflax-mds-fee831e8fb Skipping beacon heartbeat to monitors (last acked 4s ago); MDS internal heartbeat is not healthy!
2024-02-28T22:52:52.512+0100 7f4547a19700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s

That's the internal MDS heartbeat timing out. The MDS periodically resets this when it's iterating over a large number of items (inodes, etc.) while holding the mds_lock. The fact that these log messages were seen soon after entering up:resolve makes me think it's the up:replay state that deteriorated operations.

which continued on with increasing timeouts and eventually we also saw the following lines
[...]

and towards the crash the last few lines were
[...]

Actions #5

Updated by Venky Shankar about 2 months ago

Hi Abhishek,

This tracker was discussed in yesterday's cephfs standup. In particular, the mds crash backtrace that you shared:

ceph version 17.2.7-1 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
1: /lib64/libpthread.so.0(+0x12cf0) [0x7f4551447cf0]
2: (MDSDaemon::dump_status(ceph::Formatter*)+0x1e0) [0x5594a61c00a0]

is a bit of a red herring; it's likely that the crash occurred in some other thread. So, if possible, please share the mds coredump (which could be huge!) or do a `thread apply all bt` and share the backtraces from all threads.
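For the coredump option, gcore can usually grab one from the live process without stopping it (the output path below is just an example):

 # dump a core of the running ceph-mds; the file ends up as /tmp/ceph-mds.core.<pid>
 gcore -o /tmp/ceph-mds.core $(pidof ceph-mds)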

Actions #6

Updated by Milind Changire about 2 months ago

  • Assignee set to Milind Changire
Actions #7

Updated by Milind Changire 27 days ago

Would introducing states like up:active-replay and up:active-standby-replay help?
These states would imply that the MDS has completed the replay of the journal and is now ready to do normal processing.

The existing states up:replay and up:standby-replay would imply that the MDS is replaying the journal and is not available for normal active work.

As long as the MDS is in up:replay or up:standby-replay, any beacons/heartbeats could then be deemed irrelevant, since the MDS wouldn't be in its normal functioning state.
