Project

General

Profile

Actions

Bug #58239

closed

pacific: src/mon/Monitor.cc: FAILED ceph_assert(osdmon()->is_writeable())

Added by Kamoltat (Junior) Sirivadhna over 1 year ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This is not deterministic, but when ``run_osd`` in qa/standalone/ceph-helpers.sh can result in timeout when trying to spin up an OSD. Usually when the script spins up >= 8 OSDs.

Example of this: qa/standalone/mon/mon-stretched-cluster.sh

First encountered in Teuthology:

/a/yuriw-2022-11-30_15:10:52-rados-wip-yuri3-testing-2022-11-28-0750-pacific-distro-default-smithi/7098562

2022-11-30T16:20:43.717 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/mon/mon-stretched-cluster.sh:138: TEST_stretched_cluster_failover_add_three_osds:  run_osd td/mon-stretched-cluster 8
2022-11-30T16:20:43.718 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:633: run_osd:  local dir=td/mon-stretched-cluster
2022-11-30T16:20:43.718 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:634: run_osd:  shift
2022-11-30T16:20:43.719 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:635: run_osd:  local id=8
2022-11-30T16:20:43.719 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:636: run_osd:  shift
2022-11-30T16:20:43.719 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:637: run_osd:  local osd_data=td/mon-stretched-cluster/8
2022-11-30T16:20:43.719 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:639: run_osd:  local 'ceph_args=--fsid=dedcc9bc-5228-4881-99b7-42ec5a2680d8 --auth-supported=none  --mon-host=127.0.0.1:7139,127.0.0.1:7141,127.0.0.1:7142,127.0.0.1:7143,127.0.0.1:7144'
2022-11-30T16:20:43.720 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:640: run_osd:  ceph_args+=' --osd-failsafe-full-ratio=.99'
2022-11-30T16:20:43.720 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:641: run_osd:  ceph_args+=' --osd-journal-size=100'
2022-11-30T16:20:43.720 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:642: run_osd:  ceph_args+=' --osd-scrub-load-threshold=2000'
2022-11-30T16:20:43.720 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:643: run_osd:  ceph_args+=' --osd-data=td/mon-stretched-cluster/8'
2022-11-30T16:20:43.721 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:644: run_osd:  ceph_args+=' --osd-journal=td/mon-stretched-cluster/8/journal'
2022-11-30T16:20:43.721 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:645: run_osd:  ceph_args+=' --chdir='
2022-11-30T16:20:43.721 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:646: run_osd:  ceph_args+=
2022-11-30T16:20:43.721 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:647: run_osd:  ceph_args+=' --run-dir=td/mon-stretched-cluster'
2022-11-30T16:20:43.722 INFO:tasks.workunit.client.0.smithi038.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:648: run_osd:  get_asok_path
2022-11-30T16:20:43.722 INFO:tasks.workunit.client.0.smithi038.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:116: get_asok_path:  local name=
2022-11-30T16:20:43.722 INFO:tasks.workunit.client.0.smithi038.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:117: get_asok_path:  '[' -n '' ']'
2022-11-30T16:20:43.725 INFO:tasks.workunit.client.0.smithi038.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:120: get_asok_path:  get_asok_dir
2022-11-30T16:20:43.725 INFO:tasks.workunit.client.0.smithi038.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:108: get_asok_dir:  '[' -n '' ']'
2022-11-30T16:20:43.725 INFO:tasks.workunit.client.0.smithi038.stderr:///home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:111: get_asok_dir:  echo /tmp/ceph-asok.99133
2022-11-30T16:20:43.726 INFO:tasks.workunit.client.0.smithi038.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:120: get_asok_path:  echo '/tmp/ceph-asok.99133/$cluster-$name.asok'
2022-11-30T16:20:43.726 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:648: run_osd:  ceph_args+=' --admin-socket=/tmp/ceph-asok.99133/$cluster-$name.asok'
2022-11-30T16:20:43.726 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:649: run_osd:  ceph_args+=' --debug-osd=20'
2022-11-30T16:20:43.726 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:650: run_osd:  ceph_args+=' --debug-ms=1'
2022-11-30T16:20:43.727 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:651: run_osd:  ceph_args+=' --debug-monc=20'
2022-11-30T16:20:43.727 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:652: run_osd:  ceph_args+=' --log-file=td/mon-stretched-cluster/$name.log'

...

2022-12-01T03:36:43.128 INFO:tasks.workunit.client.0.smithi038.stderr:[errno 110] RADOS timed out (error connecting to the cluster)
2022-12-01T03:36:43.128 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:929: wait_for_osd:  sleep 1
2022-12-01T03:36:44.129 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:926: wait_for_osd:  (( i++ ))
2022-12-01T03:36:44.130 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:926: wait_for_osd:  (( i < 300 ))
2022-12-01T03:36:44.131 INFO:tasks.workunit.client.0.smithi038.stdout:7
2022-12-01T03:36:44.132 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:927: wait_for_osd:  echo 7
2022-12-01T03:36:44.133 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:928: wait_for_osd:  ceph osd dump
2022-12-01T03:36:44.133 INFO:tasks.workunit.client.0.smithi038.stderr:/home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:928: wait_for_osd:  grep 'osd.8 up'
2022-12-01T03:38:58.027 DEBUG:teuthology.exit:Got signal 15; running 1 handler...
2022-12-01T03:38:58.034 DEBUG:teuthology.task.console_log:Killing console logger for smithi038
2022-12-01T03:38:58.038 DEBUG:teuthology.exit:Finished running handlers

This happened due to a crash.

2022-11-30T19:16:36.111 INFO:tasks.workunit.client.0.smithi038.stderr://home/ubuntu/cephtest/clone.client.0/qa/standalone/ceph-helpers.sh:201: teardown:  echo 'ERROR: Failure due to cores found'
2022-11-30T19:16:36.111 INFO:tasks.workunit.client.0.smithi038.stdout:ERROR: Failure due to cores found
2022-11-30T16:20:55.309+0000 7f1c58e03700 -1 /build/ceph-16.2.10-1291-g8dc57b56/src/mon/Monitor.cc: In function 'void Monitor::trigger_degraded_stretch_mode(const std::set<std::__cxx11::basic_string<char> >&, const std::set<int>&)' thread 7f1c58e03700 time 2022-11-30T16:20:55.304971+0000
/build/ceph-16.2.10-1291-g8dc57b56/src/mon/Monitor.cc: 6760: FAILED ceph_assert(osdmon()->is_writeable())

 ceph version 16.2.10-1291-g8dc57b56 (8dc57b5624d909c1714a82ba84ff24e49431ea76) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f1c637f5d55]
 2: /usr/lib/ceph/libceph-common.so.2(+0x267f5d) [0x7f1c637f5f5d]
 3: (Monitor::trigger_degraded_stretch_mode(std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::set<int, std::less<int>, std::allocator<int> > const&)+0x20c) [0x5607602aa84c]
 4: (Monitor::maybe_go_degraded_stretch_mode()+0x2b6) [0x5607602aac06]
 5: (Context::complete(int)+0xd) [0x5607602d1b8d]
 6: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(ceph::common::CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xb0) [0x5607603068f0]
 7: (PaxosService::_active()+0xb3) [0x560760414543]
 8: (Context::complete(int)+0xd) [0x5607602d1b8d]
 9: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(ceph::common::CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xb0) [0x5607603068f0]
 10: (Paxos::finish_round()+0x8c) [0x56076040918c]
 11: (Paxos::commit_finish()+0x8be) [0x56076040bb9e]
 12: (C_Committed::finish(int)+0x4a) [0x56076041123a]
 13: (Context::complete(int)+0xd) [0x5607602d1b8d]
 14: (MonitorDBStore::C_DoTransaction::finish(int)+0xba) [0x560760410efa]
 15: (Context::complete(int)+0xd) [0x5607602d1b8d]
 16: (Finisher::finisher_thread_entry()+0x19d) [0x7f1c638aaa9d]
 17: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f1c632c8609]
 18: clone()

Related issues 1 (0 open1 closed)

Related to RADOS - Backport #58337: pacific: mon-stretched_cluster: degraded stretched mode lead to Monitor crashRejectedKamoltat (Junior) SirivadhnaActions
Actions #1

Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

  • Assignee set to Kamoltat (Junior) Sirivadhna
Actions #2

Updated by Neha Ojha over 1 year ago

  • Subject changed from qa/standalone/ceph-helpers.sh:652: run_osd: RADOS timed out (error connecting to the cluster) to src/mon/Monitor.cc: FAILED ceph_assert(osdmon()->is_writeable())
  • Description updated (diff)
Actions #4

Updated by Laura Flores over 1 year ago

The failure reproduced 7 times out of 50.

Actions #5

Updated by Laura Flores over 1 year ago

Analyzing the coredump:

Looking at the backtrace (same as above, but here, the frames are numbered)...

(gdb) bt
#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x0000560760571e50 in reraise_fatal (signum=6) at ./src/global/signal_handler.cc:332
#2  handle_fatal_signal (signum=6) at ./src/global/signal_handler.cc:332
#3  <signal handler called>
#4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#5  0x00007f1c62d9e859 in __GI_abort () at abort.c:79
#6  0x00007f1c637f5db0 in ceph::__ceph_assert_fail(char const*, char const*, int, char const*) () from /usr/lib/ceph/libceph-common.so.2
#7  0x00007f1c637f5f5d in ceph::__ceph_assert_fail(ceph::assert_data const&) () from /usr/lib/ceph/libceph-common.so.2
#8  0x00005607602aa84c in Monitor::trigger_degraded_stretch_mode (this=0x56076432fa00, Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer': 
dead_mons=std::set with 2 elements, dead_buckets=std::set with 1 element = {...}) at /usr/include/c++/9/ext/new_allocator.h:89
#9  0x00005607602aac06 in Monitor::maybe_go_degraded_stretch_mode (this=0x56076432fa00) at ./src/mon/Monitor.cc:6752
#10 0x00005607602d1b8d in Context::complete (this=0x5607634e8cf0, r=<optimized out>) at ./src/include/Context.h:99
#11 0x00005607603068f0 in finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > > (cct=0x560763596000, Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer': 
finished=std::__cxx11::list, result=result@entry=0) at ./src/log/SubsystemMap.h:72
#12 0x0000560760414543 in PaxosService::_active (this=0x5607635f0800) at ./src/mon/PaxosService.cc:349
#13 0x00005607602d1b8d in Context::complete (this=0x5607634e8b70, r=<optimized out>) at ./src/include/Context.h:99
#14 0x00005607603068f0 in finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > > (cct=0x560763596000, Python Exception <class 'AttributeError'> 'NoneType' object has no attribute 'pointer': 
finished=empty std::__cxx11::list, result=result@entry=0) at ./src/log/SubsystemMap.h:72
#15 0x000056076040918c in Paxos::finish_round (this=<optimized out>) at ./src/mon/Paxos.cc:1083
#16 0x000056076040bb9e in Paxos::commit_finish (this=<optimized out>) at ./src/mon/Paxos.cc:949
#17 0x000056076041123a in C_Committed::finish (this=0x5607634e9640, r=<optimized out>) at ./src/mon/Paxos.cc:835
#18 0x00005607602d1b8d in Context::complete (this=0x5607634e9640, r=<optimized out>) at ./src/include/Context.h:99
#19 0x0000560760410efa in MonitorDBStore::C_DoTransaction::finish (this=0x5607663e3710, r=<optimized out>) at ./src/mon/MonitorDBStore.h:388
#20 0x00005607602d1b8d in Context::complete (this=0x5607663e3710, r=<optimized out>) at ./src/include/Context.h:99
#21 0x00007f1c638aaa9d in Finisher::finisher_thread_entry() () from /usr/lib/ceph/libceph-common.so.2
#22 0x00007f1c632c8609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#23 0x00007f1c62e9b133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Going to frame 9 shows the state of things right before we entered `trigger_degraded_stretch_mode`:

(gdb) f 9
#9  0x00005607602aac06 in Monitor::maybe_go_degraded_stretch_mode (this=0x56076432fa00) at ./src/mon/Monitor.cc:6752
6752        trigger_degraded_stretch_mode(matched_down_mons, matched_down_buckets);

(gdb) info locals
__func__ = "maybe_go_degraded_stretch_mode" 
__PRETTY_FUNCTION__ = <optimized out>
mi = <optimized out>
ci = {first = "zone", second = "arbiter"}
filtered_dead_buckets = std::map with 1 element = {["pze"] = std::set with 2 elements = {[0] = "c", [1] = "d"}}
matched_down_buckets = std::set with 1 element = {[0] = -7}
matched_down_mons = std::set with 2 elements = {[0] = "c", [1] = "d"}
dead = <optimized out>

Actions #6

Updated by Neha Ojha over 1 year ago

  • Subject changed from src/mon/Monitor.cc: FAILED ceph_assert(osdmon()->is_writeable()) to pacific: src/mon/Monitor.cc: FAILED ceph_assert(osdmon()->is_writeable())
Actions #7

Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from New to Resolved
  • Pull request ID set to 49412
Actions #8

Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

  • Status changed from Resolved to In Progress
  • Pull request ID deleted (49412)

This bug is not yet resolved, also removing PR number since
49412 is a revert PR.

Actions #9

Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

  • Status changed from In Progress to Resolved
  • Pull request ID set to 49412

My mistake, this issue is resolved because we have reverted https://github.com/ceph/ceph/pull/48803

Revert PR: https://github.com/ceph/ceph/pull/49412.

We still need to check main and quincy whether this issue is caused by https://github.com/ceph/ceph/pull/48803 or not.

We can do this by running teuthology test on both main and quincy.

Actions #10

Updated by Kamoltat (Junior) Sirivadhna over 1 year ago

  • Related to Backport #58337: pacific: mon-stretched_cluster: degraded stretched mode lead to Monitor crash added
Actions

Also available in: Atom PDF