Bug #64717
openMDS stuck in replay/resolve use
0%
Description
We have a cephfs cluster where we ran a lot of metadata intensive workloads with snapshots enabled. In our monitoring, we saw increasing MDLog segments growing to the order of millions and the MDS daemon being killed after that. Since then MDS's have been permanently stuck in up:replay and up:resolve loop. Initial up:replay always ended up getting oom-killed, this still repeated after 1 TiB of swap space. After finally bumping up to 5 TiB of swap we see that the MDS eventually completed up:replay but after a couple of days of up:resolve, it finally crashed. During the up:resolve stage we took a few stacktraces of MDSes to understand if it was doing anything, except one all the threads were waiting on a mutex... for eg. the log thread
TID 761439: #0 0x00007f455144345c pthread_cond_wait@@GLIBC_2.3.2 #1 0x00007f455086d870 std::condition_variable::wait(std::unique_lock<std::mutex>&) #2 0x00005594a654ac76 void std::_V2::condition_variable_any::wait<std::unique_lock<ceph::fair_mutex> >(std::unique_lock<ceph::fair_mutex>&) #3 0x00005594a6542001 MDLog::_submit_thread() #4 0x00005594a61f8081 MDLog::SubmitThread::entry() #5 0x00007f455143d1ca start_thread
The only thread making progress was this thread
#0 0x00007f455085f1e3 std::_Rb_tree_increment(std::_Rb_tree_node_base*) #1 0x00005594a634803c MDCache::recalc_auth_bits(bool) #2 0x00005594a635cd25 MDCache::maybe_resolve_finish() #3 0x00005594a635ef0a MDCache::handle_resolve(boost::intrusive_ptr<MMDSResolve const> const&) #4 0x00005594a636aaac MDCache::dispatch(boost::intrusive_ptr<Message const> const&) #5 0x00005594a61de27b MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool) #6 0x00005594a61de8bc MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&) #7 0x00005594a61c7c2f MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&) #8 0x00007f45526cbf18 Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&) #9 0x00007f45526c935f DispatchQueue::entry() #10 0x00007f45527933d1 DispatchQueue::DispatchThread::entry() #11 0x00007f455143d1ca start_thread
Eventaully the resolve stage also crashed with the following stacktrace:
0> 2024-03-04T07:01:46.040+0100 7f454ba21700 -1 *** Caught signal (Segmentation fault) ** in thread 7f454ba21700 thread_name:admin_socket ceph version 17.2.7-1 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable) 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f4551447cf0] 2: (MDSDaemon::dump_status(ceph::Formatter*)+0x1e0) [0x5594a61c00a0] 3: (MDSDaemon::asok_command(std::basic_string_view<char, std::char_traits<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_tra its<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator< char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boo st::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11:: basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::Formatter*, ceph::buffer::v15_2_0::list const&, std::fu nction<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&)>)+0x515) [0x5594a61c5e25] 4: (MDSSocketHook::call_async(std::basic_string_view<char, std::char_traits<char> >, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_t raits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocato r<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, b oost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11 ::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, ceph::Formatter*, ceph::buffer::v15_2_0::list const&, std:: function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&)>)+0xa8) [0x5594a61cf258] 5: (AdminSocket::execute_command(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, c eph::buffer::v15_2_0::list const&, std::function<void (int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list&)>)+0x889) [0x7f4552566459] 6: (AdminSocket::execute_command(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, c eph::buffer::v15_2_0::list const&, std::ostream&, ceph::buffer::v15_2_0::list*)+0x11e) [0x7f4552566cce] 7: (AdminSocket::do_accept()+0x443) [0x7f4552567203] 8: (AdminSocket::entry()+0x500) [0x7f4552568970] 9: /lib64/libstdc++.so.6(+0xc2b23) [0x7f4550873b23] 10: /lib64/libpthread.so.0(+0x81ca) [0x7f455143d1ca] 11: clone() NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Before this what MDS was logging was mostly
2024-02-28T22:47:44.966+0100 7f4549a1d700 1 mds.1.794194 handle_mds_map state change up:replay --> up:resolve 2024-02-28T22:47:44.966+0100 7f4549a1d700 1 mds.1.794194 resolve_start 2024-02-28T22:47:44.966+0100 7f4549a1d700 1 mds.1.794194 reopen_log 2024-02-28T22:47:44.966+0100 7f4549a1d700 1 mds.1.794194 recovery set is 0 2024-02-28T22:47:44.966+0100 7f4549a1d700 1 mds.1.794194 recovery set is 0 2024-02-28T22:47:44.986+0100 7f4549a1d700 1 mds.miniflax-mds-fee831e8fb parse_caps: cannot decode auth caps buffer of length 0 2024-02-28T22:48:01.820+0100 7f454ba21700 1 mds.miniflax-mds-fee831e8fb asok_command: status {prefix=status} (starting...) 2024-02-28T22:52:52.011+0100 7f4547a19700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s 2024-02-28T22:52:52.012+0100 7f4547a19700 0 mds.beacon.miniflax-mds-fee831e8fb Skipping beacon heartbeat to monitors (last acked 4s ago); MDS internal heartbeat is not healthy! 2024-02-28T22:52:52.512+0100 7f4547a19700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15.000000954s
which continued on with increasing timeouts and eventually we also saw the following lines
2024-02-29T00:24:37.539+0100 7f4548a1b700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2024-02-28T23:24:37.541020+0100)
and towards the crash the last few lines were
-6> 2024-03-04T07:01:46.007+0100 7f4549a1d700 10 monclient: _finish_auth 0 -5> 2024-03-04T07:01:46.007+0100 7f4549a1d700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2024-03-04T07:01:16.007875+0100) -4> 2024-03-04T07:01:46.007+0100 7f4549a1d700 10 monclient: _finish_auth 0 -3> 2024-03-04T07:01:46.007+0100 7f4549a1d700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2024-03-04T07:01:16.007888+0100) -2> 2024-03-04T07:01:46.007+0100 7f4549a1d700 10 monclient: _finish_auth 0 -1> 2024-03-04T07:01:46.007+0100 7f4549a1d700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2024-03-04T07:01:16.007894+0100)