Bug #1694
monitor crash: FAILED assert(get_max_osd() >= crush.get_max_devices())
Status: Closed
Description
I just did a fresh install of my cluster, and after starting it I saw my monitors go down with:
Nov 8 14:40:35 monitor-sec mon.sec[1611]: ./osd/OSDMap.h: In function 'int OSDMap::_pg_to_osds(const pg_pool_t&, pg_t, std::vector<int>&)', in thread '7f9c04742700'
./osd/OSDMap.h: 454: FAILED assert(get_max_osd() >= crush.get_max_devices())
Nov 8 14:40:35 monitor-sec mon.sec[1611]: ceph version 0.37-314-g40843eb (commit:40843eb36c3c029925d62f35aa8a4dee2876381c)
 1: /usr/bin/ceph-mon() [0x45c27f]
 2: (PGMonitor::send_pg_creates()+0x15b4) [0x4c6974]
 3: (PGMonitor::update_from_paxos()+0x4ef) [0x4c959f]
 4: (PaxosService::_active()+0x39) [0x4861e9]
 5: (Context::complete(int)+0xa) [0x472aca]
 6: (finish_contexts(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xca) [0x482e2a]
 7: (Paxos::handle_lease(MMonPaxos*)+0x36b) [0x47d35b]
 8: (Paxos::dispatch(PaxosServiceMessage*)+0x21b) [0x481d3b]
 9: (Monitor::_ms_dispatch(Message*)+0x84e) [0x47062e]
 10: (Monitor::ms_dispatch(Message*)+0x35) [0x479485]
 11: (SimpleMessenger::dispatch_entry()+0x84b) [0x56d05b]
 12: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45fd5c]
 13: (()+0x7efc) [0x7f9c08007efc]
 14: (clone()+0x6d) [0x7f9c06a4189d]
This seems to be due to commit 885d71481bf06915569fadb938a0245097f2a9e0.
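For context on the assert itself: it enforces that the osdmap's OSD id space is large enough to address every device the CRUSH map can reference. A minimal C++ sketch of that invariant, using hypothetical stand-in types rather than the actual Ceph classes:

    #include <cassert>

    // Hypothetical stand-ins whose names mirror the failing assert;
    // this is an illustration, not the Ceph source.
    struct CrushMap {
      int max_devices;  // bound on CRUSH device ids
      int get_max_devices() const { return max_devices; }
    };

    struct OSDMapSketch {
      int max_osd;      // size of the OSD id space in the osdmap
      CrushMap crush;

      void check() const {
        // Every CRUSH device id must be a valid OSD id, so the OSD id
        // space must cover the whole CRUSH device space.
        assert(max_osd >= crush.get_max_devices());
      }
    };

    int main() {
      OSDMapSketch m{39, {40}};  // the values from this report
      m.check();                 // fails: 39 < 40
    }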
On that specific monitor I checked the OSD maps, which showed:
root@monitor-sec:/var/lib/ceph/mon.sec/osdmap_full# osdmaptool --print 3
osdmaptool: osdmap file '3'
epoch 3
fsid 4bd06b88-1d07-53de-ea22-73f1fb4fe0c4
created 2011-11-08 14:09:40.450317
modifed 2011-11-08 14:39:01.498778
flags

pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 2496 pgp_num 2496 lpg_num 2 lpgp_num 2 last_change 1 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 2496 pgp_num 2496 lpg_num 2 lpgp_num 2 last_change 1 owner 0
pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 2496 pgp_num 2496 lpg_num 2 lpgp_num 2 last_change 1 owner 0

max_osd 39
osd.17 up in weight 1 up_from 2 up_thru 0 down_at 0 last_clean_interval [0,0) [2a00:f10:11b:cef0:225:90ff:fe33:49cc]:6803/5177 [2a00:f10:11b:cef0:225:90ff:fe33:49cc]:6804/5177 [2a00:f10:11b:cef0:225:90ff:fe33:49cc]:6805/5177
osd.20 up in weight 1 up_from 3 up_thru 0 down_at 0 last_clean_interval [0,0) [2a00:f10:11b:cef0:225:90ff:fe33:497c]:6800/8492 [2a00:f10:11b:cef0:225:90ff:fe33:497c]:6801/8492 [2a00:f10:11b:cef0:225:90ff:fe33:497c]:6802/8492
osd.22 up in weight 1 up_from 3 up_thru 0 down_at 0 last_clean_interval [0,0) [2a00:f10:11b:cef0:225:90ff:fe33:497c]:6806/8693 [2a00:f10:11b:cef0:225:90ff:fe33:497c]:6807/8693 [2a00:f10:11b:cef0:225:90ff:fe33:497c]:6808/8693
osd.29 up in weight 1 up_from 3 up_thru 0 down_at 0 last_clean_interval [0,0) [2a00:f10:11b:cef0:225:90ff:fe33:499c]:6803/5478 [2a00:f10:11b:cef0:225:90ff:fe33:499c]:6804/5478 [2a00:f10:11b:cef0:225:90ff:fe33:499c]:6805/5478
osd.30 up in weight 1 up_from 3 up_thru 0 down_at 0 last_clean_interval [0,0) [2a00:f10:11b:cef0:225:90ff:fe33:499c]:6806/5600 [2a00:f10:11b:cef0:225:90ff:fe33:499c]:6807/5600 [2a00:f10:11b:cef0:225:90ff:fe33:499c]:6808/5600
osd.31 up in weight 1 up_from 3 up_thru 0 down_at 0 last_clean_interval [0,0) [2a00:f10:11b:cef0:225:90ff:fe33:499c]:6809/5711 [2a00:f10:11b:cef0:225:90ff:fe33:499c]:6810/5711 [2a00:f10:11b:cef0:225:90ff:fe33:499c]:6811/5711
osd.36 up in weight 1 up_from 3 up_thru 0 down_at 0 last_clean_interval [0,0) [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:6800/5663 [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:6801/5663 [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:6802/5663
osd.37 up in weight 1 up_from 3 up_thru 0 down_at 0 last_clean_interval [0,0) [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:6803/5753 [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:6804/5753 [2a00:f10:11b:cef0:225:90ff:fe33:49a4]:6805/5753
root@monitor-sec:/var/lib/ceph/mon.sec/osdmap_full#
I extracted the crushmap (attached) out of the osdmap, and it shows:
root@monitor-sec:~# cat /root/crushmap.txt | grep item | grep osd | wc -l
40
root@monitor-sec:~#
I don't see why this assert should come up: 39 (max_osd) is less than 40 (max devices).
Could it be a problem that the "devices" were renamed to "items" in the crushmap? I haven't dumped max_devices out of the crushmap to test that, though.
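For what it's worth, if max_devices follows the common id-space convention of "highest device id + 1" rather than counting the listed devices, then 40 devices numbered osd.0 through osd.39 give exactly 40. A small sketch of that assumed convention (illustrative, not taken from the CRUSH source):

    #include <algorithm>
    #include <vector>

    // Assumed convention: max_devices is one past the highest device id,
    // i.e. a bound on the id space, not a count of entries.
    int max_devices_from_ids(const std::vector<int>& ids) {
      if (ids.empty())
        return 0;
      return *std::max_element(ids.begin(), ids.end()) + 1;
    }
    // osd.0 .. osd.39 -> 40, one more than the max_osd 39 in the osdmap above.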
Updated by Wido den Hollander over 12 years ago
I just made a small adjustment to crushtool so it would print max_devices:
root@monitor-sec:~# ./crushtool -d crushmap -o crushmap.txt
max_devices 40
root@monitor-sec:~#
That seems OK?
Updated by Sage Weil over 12 years ago
- Target version set to v0.39
max_osd in the osdmap needs to be >= the max_devices in the crush map. How did you set up the cluster? Did mkcephfs generate the crush map, or did you feed one in manually?
Updated by Wido den Hollander over 12 years ago
Aha! I read that wrong, thanks.
I used mkcephfs to generate the crushmap; I did not write my own.
Updated by Sage Weil over 12 years ago
Can you try this and see if there is a mismatch?
osdmaptool --create-from-conf -c your.ceph.conf osdmap
osdmaptool -p osdmap | grep max
osdmaptool --export-crush crushmap osdmap
crushtool -d crushmap | grep device | tail -1
Updated by Sage Weil over 12 years ago
- Status changed from New to Need More Info
Updated by Wido den Hollander over 12 years ago
OK, I've run those commands and they give me:
root@monitor:~# osdmaptool -p osdmap | grep max
max_osd 39
root@monitor:~#
So, that is one short of the 40 I have.
root@monitor:~# crushtool -d crushmap | grep device | tail -1
device 39 osd.39
root@monitor:~#
That is correct. Also, if I count the devices in the crushmap:
root@monitor:~# crushtool -d crushmap | grep device | grep osd | wc -l
40
root@monitor:~#
So, max_osd is set one short in the osdmap.
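That symptom would be consistent with the map generator deriving max_osd from the highest OSD id without the +1 needed to bound the id space. A hypothetical reconstruction of such an off-by-one (not the actual mkcephfs/osdmaptool code):

    // With OSD ids 0..39, the buggy form yields 39 while the id-space
    // bound the assert expects is highest id + 1 = 40.
    int buggy_max_osd(int highest_osd_id) { return highest_osd_id; }      // -> 39
    int fixed_max_osd(int highest_osd_id) { return highest_osd_id + 1; }  // -> 40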
Updated by Wido den Hollander over 12 years ago
The monitor that generated the osdmap was running commit 5bd029ef01fcb59bea9170af563c3499cce1e8c4, and that failed.
I just ran it with the latest master and that gave me a map with max_osd = 40, but I don't see a change in the last 24 hours, or did I miss that?
I'll update ASAP with a test against the latest master.
Updated by Sage Weil over 12 years ago
- Assignee set to Sage Weil
Great. Can you attach (or email) the ceph.conf you're using?
Thanks!
Updated by Sage Weil over 12 years ago
Oh, never mind, I didn't see that second comment. The fix is 0bcdd4f3b2a2dba405639122b84f7aad978f347b, which comes after 5bd029ef01fcb59bea9170af563c3499cce1e8c4.
Updated by Sage Weil over 12 years ago
- Status changed from Need More Info to Resolved