Bug #3715

Crash during 0.55 -> 0.56 upgrade

Added by Faidon Liambotis about 11 years ago. Updated about 11 years ago.

Status: Duplicate
Priority: High
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I started upgrading my 0.55.1 cluster to 0.56, and partway through the upgrade all of the 0.55.1 OSDs started crashing at the same time. Restarting them didn't help, but upgrading them to 0.56 as well did. I didn't get a chance to capture debug logs, but I do have backtraces. The platform is Ubuntu 12.04 LTS with the ceph.com binary packages.

4 6900.3__shadow__x20Y7D5BxFlJ-prC9UGtn-T1fwKU9j1_1 [??? refcount.put] 3.4863bd4b) v4
-15> 2013-01-02 15:05:33.036824 7ffa4301b700 -1 ./messages/MOSDOp.h: In function 'bool MOSDOp::check_rmw(int)' thread 7ffa4301b700 time 2013-01-02 15:05:33.035794
./messages/MOSDOp.h: 57: FAILED assert(rmw_flags)
ceph version 0.55.1 (8e25c8d984f9258644389a18997ec6bdef8e056b)
1: (OSD::handle_op(std::tr1::shared_ptr<OpRequest>)+0x12aa) [0x6174da]
2: (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0xe9) [0x61ef69]
3: (OSD::_dispatch(Message*)+0x26e) [0x626fbe]
4: (OSD::ms_dispatch(Message*)+0x1ba) [0x62772a]
5: (DispatchQueue::entry()+0x349) [0x8ae079]
6: (DispatchQueue::DispatchThread::entry()+0xd) [0x8071fd]
7: (()+0x7e9a) [0x7ffa4f625e9a]
8: (clone()+0x6d) [0x7ffa4e0a9cbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
-14> 2013-01-02 15:05:33.039943 7ffa42019700  1 -- 10.64.0.176:6866/16116 <== osd.11 10.64.0.173:6834/10428 119 ==== pg_info(1 pgs e10603:3.2eea) v3 ==== 512+0+0 (1285696962 0 0) 0x1afa2c40 con 0x1aabb160
-13> 2013-01-02 15:05:33.262252 7ffa40816700  1 -- 10.64.0.176:6867/16116 <== osd.16 10.64.0.174:0/3734 231 ==== osd_ping(ping e10603 stamp 2013-01-02 15:05:33.261347) v2 ==== 47+0+0 (1605449094 0 0) 0x27f67c40 con 0x19b5f840
[...]
--- end dump of recent events ---
2013-01-02 15:05:33.743016 7ffa4301b700 -1 *** Caught signal (Aborted) **
 in thread 7ffa4301b700
ceph version 0.55.1 (8e25c8d984f9258644389a18997ec6bdef8e056b)
1: /usr/bin/ceph-osd() [0x771c2a]
2: (()+0xfcb0) [0x7ffa4f62dcb0]
3: (gsignal()+0x35) [0x7ffa4dfec425]
4: (abort()+0x17b) [0x7ffa4dfefb8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7ffa4e93e69d]
6: (()+0xb5846) [0x7ffa4e93c846]
7: (()+0xb5873) [0x7ffa4e93c873]
8: (()+0xb596e) [0x7ffa4e93c96e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x81bbbf]
10: (OSD::handle_op(std::tr1::shared_ptr<OpRequest>)+0x12aa) [0x6174da]
11: (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0xe9) [0x61ef69]
12: (OSD::_dispatch(Message*)+0x26e) [0x626fbe]
13: (OSD::ms_dispatch(Message*)+0x1ba) [0x62772a]
14: (DispatchQueue::entry()+0x349) [0x8ae079]
15: (DispatchQueue::DispatchThread::entry()+0xd) [0x8071fd]
16: (()+0x7e9a) [0x7ffa4f625e9a]
17: (clone()+0x6d) [0x7ffa4e0a9cbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
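
As the NOTE above says, the raw frames need the matching executable to be interpreted. As a first step, the frames can be split into symbol, offset and address so they can be looked up later. The following is only a rough sketch: `parse_frame` and the regular expressions are hypothetical helpers written for the frame layout seen in this dump, not part of any Ceph tooling.

```python
import re

# Frame layout assumed from the dump above, e.g.
#   "1: (OSD::handle_op(...)+0x12aa) [0x6174da]"
#   "1: /usr/bin/ceph-osd() [0x771c2a]"
FRAME_RE = re.compile(
    r"^\s*(?P<index>\d+):\s+(?P<location>.*?)\s+\[(?P<address>0x[0-9a-fA-F]+)\]\s*$"
)
# A location of the form "(symbol+0xoffset)".
OFFSET_RE = re.compile(r"^\((?P<symbol>.*)\+(?P<offset>0x[0-9a-fA-F]+)\)$")


def parse_frame(line):
    """Return a dict describing one backtrace frame, or None for other lines."""
    match = FRAME_RE.match(line)
    if match is None:
        return None
    location = match.group("location")
    symbol, offset = location, None
    offset_match = OFFSET_RE.match(location)
    if offset_match:
        symbol = offset_match.group("symbol")
        offset = int(offset_match.group("offset"), 16)
        if symbol == "()":
            # "(()+0x7e9a)" carries an offset but no resolved symbol name.
            symbol = None
    return {
        "index": int(match.group("index")),
        "symbol": symbol,
        "offset": offset,
        "address": int(match.group("address"), 16),
    }


if __name__ == "__main__":
    sample = "1: (OSD::handle_op(std::tr1::shared_ptr<OpRequest>)+0x12aa) [0x6174da]"
    print(parse_frame(sample))
```

Lines that are not frames (timestamps, the version banner, the NOTE) simply return `None`, so the whole dump can be fed through the function line by line.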

Related issues

Duplicates Ceph - Bug #3731: rados.h: recent change to CEPH_OSD_OP_CALL constitutes an incompatible protocol change Resolved 01/04/2013

History

#1 Updated by Sage Weil about 11 years ago

  • Status changed from New to 12

is someone sending an MOSDOp that has no ops? init_op_flags() is called before can_*(), so this sounds like an empty message.

(11:06:25 PM) paravoid: I think the crashes started when I upgraded radosgw
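
The reasoning in the comment above can be illustrated with a toy model. This is a sketch under stated assumptions, not the Ceph implementation: the class `ToyOp`, the flag values and the opcode names are all hypothetical. It only shows why a message whose ops contribute no flags would trip an `assert(rmw_flags)`-style check even though `init_op_flags()` ran first.

```python
# Hypothetical flag values and opcode names; not taken from the Ceph source.
RMW_FLAG_READ = 1
RMW_FLAG_WRITE = 2
KNOWN_OPS = {"read": RMW_FLAG_READ, "write": RMW_FLAG_WRITE}


class ToyOp:
    """Toy stand-in for a message carrying a list of ops."""

    def __init__(self, ops):
        self.ops = list(ops)
        self.rmw_flags = 0

    def init_op_flags(self):
        # Each recognised op contributes its flag; with no ops, nothing is set.
        # Assumption: an unrecognised op name also contributes nothing here.
        for name in self.ops:
            self.rmw_flags |= KNOWN_OPS.get(name, 0)

    def check_rmw(self, flag):
        # Mirrors the shape of the failed assertion in the backtrace.
        if not self.rmw_flags:
            raise AssertionError("FAILED assert(rmw_flags)")
        return bool(self.rmw_flags & flag)

    def may_read(self):
        return self.check_rmw(RMW_FLAG_READ)


if __name__ == "__main__":
    ok = ToyOp(["read"])
    ok.init_op_flags()
    print(ok.may_read())

    empty = ToyOp([])
    empty.init_op_flags()
    try:
        empty.may_read()
    except AssertionError as exc:
        print("assertion hit:", exc)
```

In this toy, an op name the receiver does not recognise behaves the same way as an empty op list. Whether that matches what actually happened in this crash is not established in this ticket.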

#2 Updated by Ian Colle about 11 years ago

  • Assignee set to caleb miles

#3 Updated by Sage Weil about 11 years ago

  • Status changed from 12 to Duplicate

this was #3731
