Bug #3142

osd: crash induced by fsx workload

Added by Sage Weil over 11 years ago. Updated over 11 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

kernel: &id001
  kdb: true
  branch: testing
nuke-on-error: true
overrides:
  ceph:
    conf:
      client:
        rbd cache: true
      global:
        ms inject socket failures: 5000
      osd:
        debug osd: 20
        debug ms: 1
    fs: ext4
    log-whitelist:
    - slow request
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    timeout: 1200
- rbd_fsx:
    clients:
    - client.0
    ops: 2000

osd.2.log.gz (223 KB) Sage Weil, 10/20/2012 01:32 PM

History

#1 Updated by Sage Weil over 11 years ago

ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2012-09-11_02:00:03-regression-testing-testing-basic/20743

#2 Updated by Dan Mick over 11 years ago

  • Assignee set to Dan Mick

#3 Updated by Dan Mick over 11 years ago

Attempting a bisect from master to stable. Using

cd /src/ceph/ceph
git describe
make distclean && ./do_autogen.sh && make -j 16
/src/ceph/teuthology/virtualenv/bin/teuthology --lock ~/src/ceph/teuthology/fsx.yaml || exit 127

as the command passed to git bisect run.
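One possible refinement of the bisect command above, as a minimal sketch: git bisect run treats exit status 0 as good, 125 as "cannot test this commit, skip it", and any other status from 1 to 127 as bad. Returning 125 on a build failure keeps an unbuildable commit from being blamed for the crash. The function name bisect_step and its two arguments are hypothetical and not part of the original setup.

```shell
# Hypothetical helper for a `git bisect run` script.
# $1: build command, $2: test command (both run through sh -c).
bisect_step() {
    build_cmd=$1
    test_cmd=$2
    # A commit that does not build says nothing about the crash: skip it.
    sh -c "$build_cmd" || return 125
    # A failing test run marks the commit as bad.
    sh -c "$test_cmd" || return 1
    # Build and test both succeeded: the commit is good.
    return 0
}
```

In a bisect script this might be called with the build line and the teuthology line from the comment above as the two arguments, followed by `exit $?` so that git sees the status.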

#4 Updated by Sage Weil over 11 years ago

ubuntu@teuthology:/a/teuthology-2012-09-21_19:00:08-regression-master-testing-gcov/27383

#5 Updated by Sage Weil over 11 years ago

  • Status changed from New to 12

Heap corruption? This hardly narrows it down, but from ubuntu@teuthology:/a/teuthology-2012-09-22_19:00:05-regression-master-testing-gcov/27938:

2012-09-22 23:17:46.933546 7f0d9171f700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f0d9171f700

 ceph version 0.51-690-g720a301 (commit:720a30173dc73b4e696ba4b8e0c977dd4f4db858)
 1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x82865a]
 2: (()+0xfcb0) [0x7f0da27aecb0]
 3: (tcmalloc::CentralFreeList::FetchFromSpans()+0x27) [0x7f0da1854df7]
 4: (tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)+0x107) [0x7f0da1855167]
 5: (tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long)+0x5d) [0x7f0da1857cad]
 6: (tc_new()+0x486) [0x7f0da1866c76]
 7: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x3579) [0x5ba899]
 8: (PG::do_request(std::tr1::shared_ptr<OpRequest>)+0x482) [0x6f45a2]
 9: (OSD::dequeue_op(PG*)+0x40f) [0x614c7f]
 10: (OSD::OpWQ::_process(PG*)+0x15) [0x67bf45]
 11: (ThreadPool::WorkQueue<PG>::_void_process(void*)+0x12) [0x672292]
 12: (ThreadPool::worker(ThreadPool::WorkThread*)+0x73d) [0x9017bd]
 13: (ThreadPool::WorkThread::entry()+0x18) [0x9051d8]
 14: (Thread::_entry_func(void*)+0x12) [0x8f3492]
 15: (()+0x7e9a) [0x7f0da27a6e9a]
 16: (clone()+0x6d) [0x7f0da0b4a4bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

maybe we should reproduce with the notcmalloc gitbuilder and see if we get a more usable core file
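As the NOTE at the end of the trace says, the raw addresses only become meaningful against a copy of the executable. A minimal sketch for pulling the bracketed frame addresses out of a saved trace, assuming each frame line ends in an address of the form [0x...]; the function name frame_addrs is hypothetical:

```shell
# Hypothetical helper: print the trailing [0x...] address of each frame line
# read from standard input, one address per line. Lines without a trailing
# bracketed address (such as the NOTE line) are dropped.
frame_addrs() {
    sed -n 's/.*\[\(0x[0-9a-f]*\)\][[:space:]]*$/\1/p'
}
```

With a matching unstripped binary at hand, the output could be fed to binutils, for example `frame_addrs < trace.txt | addr2line -C -f -e ceph-osd`, where trace.txt is an assumed file holding the trace. Frames that live in shared libraries (the 0x7f... addresses) would not resolve against the ceph-osd binary this way.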

#6 Updated by Sage Weil over 11 years ago

  • Assignee changed from Dan Mick to Sage Weil

#7 Updated by Sage Weil over 11 years ago

I got a log for:

   -19> 2012-10-19 14:24:20.776308 7f63fbdcd700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::do_osd_op_effects(ReplicatedPG::OpContext*)' thread 7f63fbdcd700 time 2012-10-19 14:24:20.774375
osd/ReplicatedPG.cc: 3268: FAILED assert(obc->unconnected_watchers.count(entity))

 ceph version 0.53-393-g50bb659 (commit:50bb65963c16bcf892157bd19a308ae593215f84)
 1: (ReplicatedPG::do_osd_op_effects(ReplicatedPG::OpContext*)+0x2200) [0x56cb80]
 2: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x623) [0x590f03]
 3: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1dc5) [0x5942e5]
 4: (PG::do_request(std::tr1::shared_ptr<OpRequest>)+0x325) [0x66e2a5]
 5: (OSD::dequeue_op(PG*)+0x2fd) [0x5ceb0d]
 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x542) [0x7f32a2]
 7: (ThreadPool::WorkThread::entry()+0x10) [0x7f5240]
 8: (()+0x7e9a) [0x7f640d157e9a]
 9: (clone()+0x6d) [0x7f640b4fb4bd]

#8 Updated by Sage Weil over 11 years ago

  • Status changed from 12 to 7

fix for the watcher thing merged to next branch, yay! hopefully that was the root cause for the mysterious nightly failures with bogus core files too.

#9 Updated by Sage Weil over 11 years ago

  • Status changed from 7 to Resolved
