Bug #3459 (closed)

osd crash in CephXAuthorizer::verify_reply

Added by Tamilarasi muthamizhan over 11 years ago. Updated over 11 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Log: ubuntu@teuthology:/a/teuthology-2012-11-05_19:00:02-regression-master-testing-gcov/10108

2012-11-06 21:13:39.757184 1cd7e700 -1 *** Caught signal (Aborted) ***
in thread 1cd7e700

ceph version 0.53-618-g15b3d98 (15b3d98fc4d9d1371253edf0c4c77f7e8932ecf3)
1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x7486ba]
2: (()+0xfcb0) [0x5043cb0]
3: (gsignal()+0x35) [0x69a4445]
4: (abort()+0x17b) [0x69a7bab]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x621569d]
6: (()+0xb5846) [0x6213846]
7: (()+0xb5873) [0x6213873]
8: (()+0xb596e) [0x621396e]
9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x127) [0x7f3fd7]
10: (CephXAuthorizer::verify_reply(ceph::buffer::list::iterator&)+0xeb) [0x74d69b]
11: (Pipe::connect()+0x18a3) [0x896dc3]
12: (Pipe::writer()+0x4cd) [0x8a051d]
13: (Pipe::Writer::entry()+0xd) [0x8a1f6d]
14: (()+0x7e9a) [0x503be9a]
15: (clone()+0x6d) [0x6a604bd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

config file:
ubuntu@teuthology:/a/teuthology-2012-11-05_19:00:02-regression-master-testing-gcov/10108$ cat config.yaml
kernel: &id001
  kdb: true
  sha1: 22cddde104d715600a4c218bf9224923208afe90
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
    coverage: true
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 15b3d98fc4d9d1371253edf0c4c77f7e8932ecf3
    valgrind:
      mds:
      - --tool=memcheck
      mon:
      - --tool=memcheck
      osd:
      - --tool=memcheck
  s3tests:
    branch: master
  workunit:
    sha1: 15b3d98fc4d9d1371253edf0c4c77f7e8932ecf3
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  : ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDYE0eu9E8TQwtUy89Wldp54VbNBEoO9XQf77eXXzzmNwYUFRrNX0mZV/I8GqyRJuMrPG8V4aZBthBHTtnEmQ6RAS7fVdthi/hEgwnM9cAqY3KX9mR5xJnHBc/fa5KLrnSr3Wrztf42PpQNEN5Tk55K6wWUlZOTHU3vE0j3kF+YQ5FeBhQbghztHPKFR8bOmZJp9TpbXgbvEM2RWr9bYtro1KuQOgrairyVVNWdAuwZuxSQT4soyHoSkY9JmeXKsNRAOamxH9w57mDC3PXui7r6Fp8OCWSK+GmlLTtPaZtulSCcucaZtpVae7F4s9JNxaRl5RxuUtwMRfgAHGlL2BZv
  : ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCdrzGTR0Fbl6sedYlwlX+FlmF6fuE3l/RTu2kzOkmG47rPEn5CI37Injb7Epc50RXCbUIfzmDqtEY6uZT3YssYrE4jvhQlynPndbn1KmiTbgxTyuumGXv7O4OOntezighA1W49phUNZys1DhdEEO8VSQAIdHrBgBLhY9DDgC4LAhrP4BSbDTN0rUXtYYHBj4aa3sJV0o3sKjpsyjjlieEQnto6JkjK6EGZCSuY+AyMZyLJjFTgMwJ9i4aC5eZoWZAWSDfDsxo8PtFR+kjUmz5uiheyn5lAzKBxmd4ZNojf7wOhSGia0ghbtUeQkdoRZXZhP2ourNn3uAguf1xt43kX
  : ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDOTCMIScDTmD9NkfsWU7xeyZ+WOXai5izYeliiXDSjJC3bT6r8Fp+rhPfcHCVHiw++VsbvKZtkhjCSnJTVPWCdpRDghzJ3nZUBImWRo3PmHo1etQpCeimaOrIJ2q0ChN5jmSOqy5B+Z4om2vXBtBY6nkdTxDOr2+MH3NrSPkQSFB0zO+VPuwKXsemeUC6urb2IZZpxY3cxNq4fafTF9PROpgOnIA+o3igyU4duKEjnCzTHZjw/PL7Eph/7p6+UQgrUwe7pgVzT+2MM0zcBtBSXNqs3dCGmpvUapOkBlDoIX02EkWRNpkM3vfeFt1EFC17B5vd61Kg40bYUG8qWGR0T
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph: null
- mon_recovery: null

ubuntu@teuthology:/a/teuthology-2012-11-05_19:00:02-regression-master-testing-gcov/10108$ cat summary.yaml
ceph-sha1: 15b3d98fc4d9d1371253edf0c4c77f7e8932ecf3
client.0-kernel-sha1: 22cddde104d715600a4c218bf9224923208afe90
description: collection:rados-verify clusters:fixed-3.yaml fs:btrfs.yaml msgr-failures:few.yaml
tasks:mon_recovery.yaml validater:valgrind.yaml
duration: 749.3446509838104
failure_reason: 'Command failed with status 1: ''/tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage
/tmp/cephtest/archive/coverage /tmp/cephtest/daemon-helper term /tmp/cephtest/chdir-coredump
valgrind --suppressions=/tmp/cephtest/valgrind.supp --xml=yes --xml-file=/tmp/cephtest/archive/log/valgrind/osd.1.log
--tool=memcheck /tmp/cephtest/binary/usr/local/bin/ceph-osd -f -i 1 -c /tmp/cephtest/ceph.conf'''
flavor: notcmalloc
mon.a-kernel-sha1: 22cddde104d715600a4c218bf9224923208afe90
mon.b-kernel-sha1: 22cddde104d715600a4c218bf9224923208afe90
owner: scheduled_teuthology@teuthology
success: false

Actions #1

Updated by Sage Weil over 11 years ago

  • Priority changed from Normal to Urgent

ubuntu@teuthology:/a/sage-2012-11-12_16:44:02-regression-master-wip-3.4-basic/13948

Actions #2

Updated by Sage Weil over 11 years ago

  • Subject changed from osd crash in the nightly run to osd crash in CephXAuthorizer::verify_reply
Actions #3

Updated by Sage Weil over 11 years ago

  • Status changed from New to Resolved

this should be fixed by the new guards around decrypt_decode().

Actions #4

Updated by Dan Mick over 11 years ago

A user reports this same crash today in IRC with 0.55:

https://pastee.org/f4dgd

Actions #5

Updated by Tamilarasi muthamizhan over 11 years ago

  • Status changed from Resolved to In Progress
Actions #6

Updated by Sage Weil over 11 years ago

wth, i could have sworn i pushed something that added a try/catch block around the decode, but now i don't see it. pushed wip-3459 that does just that. which means there is probably a dup bug in the tracker somewhere with the same crash...

Actions #7

Updated by Sage Weil over 11 years ago

  • Status changed from In Progress to Fix Under Review

the try/catch may be treating the symptom, but it's definitely correct, and the binary for the qa run is long gone so i can't get anything else useful out of the failure. i think we merge the patch and wait for this to strike again (or not!)

Actions #8

Updated by Greg Farnum over 11 years ago

  • Assignee set to Greg Farnum
Actions #9

Updated by Greg Farnum over 11 years ago

The patch looks fine on its face but several tests in the suite failed. I need to track down if they're familiar errors to anybody and look a little more closely into a couple of them. If you're interested, here they are....

failed tests:

11922: rados test workunit failed
CommandFailedError: Command failed with status 1: 'mkdir -p -- /tmp/cephtest/mnt.0/client.0/tmp && cd -- /tmp/cephtest/mnt.0/client.0/tmp && CEPH_REF=f957cd57c513d7f45b0d0ab1c3db6c4ccbbc110b PATH="$PATH:/tmp/cephtest/binary/usr/local/bin" LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/tmp/cephtest/binary/usr/local/lib" CEPH_CONF="/tmp/cephtest/ceph.conf" CEPH_SECRET_FILE="/tmp/cephtest/data/client.0.secret" CEPH_ID="0" PYTHONPATH="$PYTHONPATH:/tmp/cephtest/binary/usr/local/lib/python2.7/dist-packages:/tmp/cephtest/binary/usr/local/lib/python2.6/dist-packages" /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/workunit.client.0/rados/test.sh'

11963: still running...

11964: rados command failed
CommandFailedError: Command failed with status 1: "/bin/sh -c 'LD_LIBRARY_PATH=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/rados -c /tmp/cephtest/ceph.conf -k /tmp/cephtest/data/client.0.keyring --name client.0 -p data bench 1200 write'"

11970: rados command failed
CommandFailedError: Command failed with status 1: "/bin/sh -c 'LD_LIBRARY_PATH=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/rados -c /tmp/cephtest/ceph.conf -k /tmp/cephtest/data/client.0.keyring --name client.0 -p data bench 1200 write'"

11975: osd crashed on a startup (thrashing, I assume?)
osd/OSD.cc: 2434: FAILED assert(pg)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x95) [0x11a4ac1]
2: (OSD::disconnect_session_watches(OSD::Session*)+0x2a7) [0xea6ef5]
3: (OSD::ms_handle_reset(Connection*)+0x155) [0xea761d]
4: (Messenger::ms_deliver_handle_reset(Connection*)+0x4b) [0x126b419]
5: (DispatchQueue::entry()+0x176) [0x126a4be]
6: (DispatchQueue::DispatchThread::entry()+0x1c) [0x118ac14]
7: (Thread::_entry_func(void*)+0x23) [0x11932ad]
8: (()+0x7e9a) [0x7fc73f45de9a]
9: (clone()+0x6d) [0x7fc73d5e84bd]

11976: both a ceph and a rados command failed. How did they both get the chance to do so?
2012-12-11T23:10:44.121 DEBUG:teuthology.orchestra.run:Running: 'LD_LIBRARY_PRELOAD=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/ceph -k /tmp/cephtest/ceph.keyring -c /tmp/cephtest/ceph.conf --concise osd in 4'
2012-12-11T23:10:50.039 INFO:teuthology.task.radosbench.radosbench.2.err:error during benchmark: -2
2012-12-11T23:10:50.040 INFO:teuthology.task.radosbench.radosbench.2.err:error 2: (2) No such file or directory

CommandFailedError: Command failed with status 1: "/bin/sh -c 'LD_LIBRARY_PATH=/tmp/cephtest/binary/usr/local/lib /tmp/cephtest/enable-coredump /tmp/cephtest/binary/usr/local/bin/ceph-coverage /tmp/cephtest/archive/coverage /tmp/cephtest/binary/usr/local/bin/rados -c /tmp/cephtest/ceph.conf -k /tmp/cephtest/data/client.1.keyring --name client.1 -p data bench 1200 write'"

Actions #10

Updated by Sage Weil over 11 years ago

these all appear to be unrelated. i had broken tests in my lock teuthology repo, or they were other bugs.

except one new one, opening that now

Actions #11

Updated by Sage Weil over 11 years ago

  • Status changed from Fix Under Review to Resolved

other bug is #3414, but it doesn't appear related.

going to merge this change in.
