Bug #3459 (closed)

osd crash in CephXAuthorizer::verify_reply

Added by Tamilarasi muthamizhan over 11 years ago. Updated over 11 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Log: ubuntu@teuthology:/a/teuthology-2012-11-05_19:00:02-regression-master-testing-gcov/10108

2012-11-06 21:13:39.757184 1cd7e700 -1 *** Caught signal (Aborted) ***
in thread 1cd7e700

ceph version 0.53-618-g15b3d98 (15b3d98fc4d9d1371253edf0c4c77f7e8932ecf3)
1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x7486ba]
2: (()+0xfcb0) [0x5043cb0]
3: (gsignal()+0x35) [0x69a4445]
4: (abort()+0x17b) [0x69a7bab]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x621569d]
6: (()+0xb5846) [0x6213846]
7: (()+0xb5873) [0x6213873]
8: (()+0xb596e) [0x621396e]
9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x127) [0x7f3fd7]
10: (CephXAuthorizer::verify_reply(ceph::buffer::list::iterator&)+0xeb) [0x74d69b]
11: (Pipe::connect()+0x18a3) [0x896dc3]
12: (Pipe::writer()+0x4cd) [0x8a051d]
13: (Pipe::Writer::entry()+0xd) [0x8a1f6d]
14: (()+0x7e9a) [0x503be9a]
15: (clone()+0x6d) [0x6a604bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

config file:
ubuntu@teuthology:/a/teuthology-2012-11-05_19:00:02-regression-master-testing-gcov/10108$ cat config.yaml
kernel: &id001
  kdb: true
  sha1: 22cddde104d715600a4c218bf9224923208afe90
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
    coverage: true
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 15b3d98fc4d9d1371253edf0c4c77f7e8932ecf3
    valgrind:
      mds:
      - --tool=memcheck
      mon:
      - --tool=memcheck
      osd:
      - --tool=memcheck
  s3tests:
    branch: master
  workunit:
    sha1: 15b3d98fc4d9d1371253edf0c4c77f7e8932ecf3
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  : ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDYE0eu9E8TQwtUy89Wldp54VbNBEoO9XQf77eXXzzmNwYUFRrNX0mZV/I8GqyRJuMrPG8V4aZBthBHTtnEmQ6RAS7fVdthi/hEgwnM9cAqY3KX9mR5xJnHBc/fa5KLrnSr3Wrztf42PpQNEN5Tk55K6wWUlZOTHU3vE0j3kF+YQ5FeBhQbghztHPKFR8bOmZJp9TpbXgbvEM2RWr9bYtro1KuQOgrairyVVNWdAuwZuxSQT4soyHoSkY9JmeXKsNRAOamxH9w57mDC3PXui7r6Fp8OCWSK+GmlLTtPaZtulSCcucaZtpVae7F4s9JNxaRl5RxuUtwMRfgAHGlL2BZv
  : ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCdrzGTR0Fbl6sedYlwlX+FlmF6fuE3l/RTu2kzOkmG47rPEn5CI37Injb7Epc50RXCbUIfzmDqtEY6uZT3YssYrE4jvhQlynPndbn1KmiTbgxTyuumGXv7O4OOntezighA1W49phUNZys1DhdEEO8VSQAIdHrBgBLhY9DDgC4LAhrP4BSbDTN0rUXtYYHBj4aa3sJV0o3sKjpsyjjlieEQnto6JkjK6EGZCSuY+AyMZyLJjFTgMwJ9i4aC5eZoWZAWSDfDsxo8PtFR+kjUmz5uiheyn5lAzKBxmd4ZNojf7wOhSGia0ghbtUeQkdoRZXZhP2ourNn3uAguf1xt43kX
  : ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDOTCMIScDTmD9NkfsWU7xeyZ+WOXai5izYeliiXDSjJC3bT6r8Fp+rhPfcHCVHiw++VsbvKZtkhjCSnJTVPWCdpRDghzJ3nZUBImWRo3PmHo1etQpCeimaOrIJ2q0ChN5jmSOqy5B+Z4om2vXBtBY6nkdTxDOr2+MH3NrSPkQSFB0zO+VPuwKXsemeUC6urb2IZZpxY3cxNq4fafTF9PROpgOnIA+o3igyU4duKEjnCzTHZjw/PL7Eph/7p6+UQgrUwe7pgVzT+2MM0zcBtBSXNqs3dCGmpvUapOkBlDoIX02EkWRNpkM3vfeFt1EFC17B5vd61Kg40bYUG8qWGR0T
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph: null
- mon_recovery: null

ubuntu@teuthology:/a/teuthology-2012-11-05_19:00:02-regression-master-testing-gcov/10108$ cat summary.yaml
ceph-sha1: 15b3d98fc4d9d1371253edf0c4c77f7e8932ecf3
client.0-kernel-sha1: 22cddde104d715600a4c218bf9224923208afe90
description: collection:rados-verify clusters:fixed-3.yaml fs:btrfs.yaml msgr-failures:few.yaml
tasks:mon_recovery.yaml validater:valgrind.yaml
duration: 749.3446509838104
failure_reason: 'Command failed with status 1: ''/tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage
/tmp/cephtest/archive/coverage /tmp/cephtest/daemon-helper term /tmp/cephtest/chdir-coredump
valgrind --suppressions=/tmp/cephtest/valgrind.supp --xml=yes --xml-file=/tmp/cephtest/archive/log/valgrind/osd.1.log
--tool=memcheck /tmp/cephtest/binary/usr/local/bin/ceph-osd -f -i 1 -c /tmp/cephtest/ceph.conf'''
flavor: notcmalloc
mon.a-kernel-sha1: 22cddde104d715600a4c218bf9224923208afe90
mon.b-kernel-sha1: 22cddde104d715600a4c218bf9224923208afe90
owner: scheduled_teuthology@teuthology
success: false

Actions #1

Updated by Sage Weil over 11 years ago

  • Priority changed from Normal to Urgent

ubuntu@teuthology:/a/sage-2012-11-12_16:44:02-regression-master-wip-3.4-basic/13948

Actions #2

Updated by Sage Weil over 11 years ago

  • Subject changed from osd crash in the nightly run to osd crash in CephXAuthorizer::verify_reply
Actions #3

Updated by Sage Weil over 11 years ago

  • Status changed from New to Resolved

this should be fixed by the new guards around decrypt_decode().

Actions #4

Updated by Dan Mick over 11 years ago

A user reports this same crash today in IRC with 0.55:

https://pastee.org/f4dgd

Actions #5

Updated by Tamilarasi muthamizhan over 11 years ago

  • Status changed from Resolved to In Progress
Actions #6

Updated by Sage Weil over 11 years ago

wth, i could have sworn i pushed something that added a try/catch block around the decode, but now i don't see it. pushed wip-3459 that does just that. which means there is probably a dup bug in the tracker somewhere with the same crash...

Actions #7

Updated by Sage Weil over 11 years ago

  • Status changed from In Progress to Fix Under Review

the try/catch may be treating the symptom, but it's definitely correct, and the binary for the qa run is long gone so i can't get anything else useful out of the failure. i think we merge the patch and wait for this to strike again (or not!)

Actions #8

Updated by Greg Farnum over 11 years ago

  • Assignee set to Greg Farnum
Actions #9

Updated by Greg Farnum over 11 years ago

The patch looks fine on its face but several tests in the suite failed. I need to track down if they're familiar errors to anybody and look a little more closely into a couple of them. If you're interested, here they are....

failed tests:

11922: rados test workunit failed
CommandFailedError: Command failed with status 1: 'mkdir -p -- /tmp/cephtest/mnt.0/client.0/tmp && cd -- /tmp/cephtest/mnt.0/client.0/tmp && CEPH_REF=f957cd57c513d7f45b0d0ab1c3db6c4ccbbc110b PATH="$PATH:/tmp/cephtest/binary/usr/local/bin" LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/tmp/cephtest/binary/usr/local/lib" CEPH_CONF="/tmp/cephtest/ceph.conf" CEPH_SECRET_FILE="/tmp/cephtest/data/client.0.secret" CEPH_ID="0" PYTHONPATH="$PYTHONPATH:/tmp/cephtest/binary/usr/local/lib/python2.7/dist-packages:/tmp/cephtest/binary/usr/local/lib/python2.6/dist-packages" /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/workunit.client.0/rados/test.sh'

11963: still running...

11964: rados command failed
CommandFailedError: Command failed with status 1: "/bin/sh -c 'LD_LIBRARY_PATH=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/rados -c /tmp/cephtest/ceph.conf -k /tmp/cephtest/data/client.0.keyring --name client.0 -p data bench 1200 write'"

11970: rados command failed
CommandFailedError: Command failed with status 1: "/bin/sh -c 'LD_LIBRARY_PATH=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/rados -c /tmp/cephtest/ceph.conf -k /tmp/cephtest/data/client.0.keyring --name client.0 -p data bench 1200 write'"

11975: osd crashed on a startup (thrashing, I assume?)
osd/OSD.cc: 2434: FAILED assert(pg)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x95) [0x11a4ac1]
2: (OSD::disconnect_session_watches(OSD::Session*)+0x2a7) [0xea6ef5]
3: (OSD::ms_handle_reset(Connection*)+0x155) [0xea761d]
4: (Messenger::ms_deliver_handle_reset(Connection*)+0x4b) [0x126b419]
5: (DispatchQueue::entry()+0x176) [0x126a4be]
6: (DispatchQueue::DispatchThread::entry()+0x1c) [0x118ac14]
7: (Thread::_entry_func(void*)+0x23) [0x11932ad]
8: (()+0x7e9a) [0x7fc73f45de9a]
9: (clone()+0x6d) [0x7fc73d5e84bd]

11976: both a ceph and a rados command failed. How did they both get the chance to do so?
2012-12-11T23:10:44.121 DEBUG:teuthology.orchestra.run:Running: 'LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise osd in 4'
2012-12-11T23:10:50.039 INFO:teuthology.task.radosbench.radosbench.2.err:error during benchmark: -2
2012-12-11T23:10:50.040 INFO:teuthology.task.radosbench.radosbench.2.err:error 2: (2) No such file or directory

CommandFailedError: Command failed with status 1: "/bin/sh -c 'LD_LIBRARY_PATH=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/rados -c /tmp/cephtest/ceph.conf -k /tmp/cephtest/data/client.1.keyring --name client.1 -p data bench 1200 write'"

Actions #10

Updated by Sage Weil over 11 years ago

these all appear to be unrelated. i had broken tests in my lock teuthology repo, or they were other bugs.

except one new one, opening that now

Actions #11

Updated by Sage Weil over 11 years ago

  • Status changed from Fix Under Review to Resolved

other bug is #3414, but it doesn't appear related.

going to merge this change in.
