Bug #385

Failed assertion in Locker::scatter_nudge

Added by Wido den Hollander over 13 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Immediate
Assignee:
Category: -
Target version: -
% Done: 0%

Description

I updated issue #312, but Gregory told me that this is a different issue.

19:47 < gregaf> wido: your recent MDS crash is actually a different issue from #312, involving the distributed lock manager
19:48 < gregaf> are your MDSes just refusing to come up now, or is your cluster working again?
19:50 < gregaf> and what version of the code were you running when it crashed the first time?

The last log lines:

10.08.27_08:33:54.023625 7f33ea334710 mds0.journal try_to_expire waiting for nest flush on [inode 10000058e7b [...2,head] /static/kernel/linux/kernel/people/lenb/acpi/ auth v358 f(v5 m10.08.06_21:49:46.000183 3=0+3) n(v47 rc10.08.09_13:39:19.000312 b75920353 3260=3160+100) (inest sync dirty) (ifile sync dirty) (iversion lock) | dirtyscattered dirfrag dirty 0x7631e40]
10.08.27_08:33:54.023664 7f33ea334710 mds0.locker scatter_nudge auth, scatter/unscattering (inest sync dirty) on [inode 10000058e7b [...2,head] /static/kernel/linux/kernel/people/lenb/acpi/ auth v358 f(v5 m10.08.06_21:49:46.000183 3=0+3) n(v47 rc10.08.09_13:39:19.000312 b75920353 3260=3160+100) (inest sync dirty) (ifile sync dirty) (iversion lock) | dirtyscattered dirfrag dirty 0x7631e40]
10.08.27_08:33:54.023690 7f33ea334710 mds0.locker simple_lock on (inest sync dirty) on [inode 10000058e7b [...2,head] /static/kernel/linux/kernel/people/lenb/acpi/ auth v358 f(v5 m10.08.06_21:49:46.000183 3=0+3) n(v47 rc10.08.09_13:39:19.000312 b75920353 3260=3160+100) (inest sync dirty) (ifile sync dirty) (iversion lock) | dirtyscattered dirfrag dirty 0x7631e40]
10.08.27_08:33:54.023716 7f33ea334710 mds0.locker scatter_nudge oh, stable again already.
mds/Locker.cc: In function 'void Locker::scatter_nudge(ScatterLock*, Context*, bool)':
mds/Locker.cc:3290: FAILED assert(!c)
 1: (LogSegment::try_to_expire(MDS*)+0x10f0) [0x636770]
 2: (MDLog::try_expire(LogSegment*)+0x1d) [0x62ec2d]
 3: (MDLog::trim(int)+0x628) [0x62f598]
 4: (MDS::tick()+0x552) [0x498372]
 5: (SafeTimer::EventWrapper::finish(int)+0x269) [0x6b27d9]
 6: (Timer::timer_entry()+0x7bc) [0x6b4bac]
 7: (Timer::TimerThread::entry()+0xd) [0x4777cd]
 8: (Thread::_entry_func(void*)+0xa) [0x48a73a]
 9: (()+0x69ca) [0x7f33edc9c9ca]
 10: (clone()+0x6d) [0x7f33ecc546fd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

The core dumps, binaries and log files have been uploaded to logger.ceph.widodh.nl:/srv/ceph/issues/mds_crash_locker_scatter_nudge

The timestamps of all the files were preserved.

Actions #1

Updated by Sage Weil over 13 years ago

Wido, can you let me know if this works?

diff --git a/src/mds/CInode.h b/src/mds/CInode.h
index 77a768d..4cbcf05 100644
--- a/src/mds/CInode.h
+++ b/src/mds/CInode.h
@@ -701,7 +701,12 @@ public:
        lock->set_state(LOCK_EXCL);
       else if (issued & CEPH_CAP_GWR)
        lock->set_state(LOCK_MIX);
-      else
+      else if (lock->is_dirty()) {
+       if (is_replicated())
+         lock->set_state(LOCK_MIX);
+       else
+         lock->set_state(LOCK_LOCK);
+      } else
        lock->set_state(LOCK_SYNC);
     } else {
       if (lock->is_xlocked())

Actions #2

Updated by Wido den Hollander over 13 years ago

No, it doesn't.

I had to apply the patch manually; please confirm that what I did is correct:

    if (is_auth()) {
      if (issued & CEPH_CAP_GEXCL)
        lock->set_state(LOCK_EXCL);
      else if (issued & CEPH_CAP_GWR)
        lock->set_state(LOCK_MIX);
      else if (lock->is_dirty()) {
       if (is_replicated())
         lock->set_state(LOCK_MIX);
       else
         lock->set_state(LOCK_LOCK);
      } else
        lock->set_state(LOCK_SYNC);
    } else {
      if (lock->is_xlocked())
        lock->set_state(LOCK_LOCK);
      else
        lock->set_state(LOCK_SYNC);  // might have been lock, previously
    }

The MDS crashed again; I placed the new core dump on logger.ceph.widodh.nl (core.cmds.node13.32754).

Actions #3

Updated by Sage Weil over 13 years ago

Hi Wido,

Sorry, I don't have time to really focus on this (I'm on vacation this week), but I pushed something that may take care of it to the mds_replay_lock_states branch. Can you let me know if that does the trick?

commit:0857fecbea00092251d28bc2e7625fd65bea3953

Thanks-

Actions #4

Updated by Wido den Hollander over 13 years ago

I tried this branch today; no luck, both MDSes still crashed.

Uploaded two new core files to logger.ceph.widodh.nl:/srv/ceph/issues/mds_crash_locker_scatter_nudge

Actions #5

Updated by Sage Weil over 13 years ago

  • Assignee set to Sage Weil
  • Priority changed from Normal to Immediate

Actions #6

Updated by Sage Weil over 13 years ago

OK, this was a case of bad C++ method overloading (the parent's method was const, the child's was not, so the child's version hid the parent's method instead of overriding it). Bah. Fixed by commit:ca048fb92c79cab0c0d0e6ee1cee11a037a20931.
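
As an illustration of that failure mode, here is a minimal, hypothetical C++ sketch (invented names, not the actual Ceph code): a virtual method that is const in the parent but not in the child does not override it; it declares a new method that hides the parent's, so virtual calls through a parent pointer still run the parent's implementation.

#include <iostream>

struct Parent {
  virtual bool is_dirty() const { return false; }  // const in the parent
  virtual ~Parent() {}
};

struct Child : public Parent {
  bool is_dirty() { return true; }  // non-const: hides, does not override
};

int main() {
  Child c;
  Parent *p = &c;
  std::cout << p->is_dirty() << std::endl;  // prints 0: Parent::is_dirty runs
  std::cout << c.is_dirty() << std::endl;   // prints 1: the hiding Child method
  return 0;
}

Building with GCC/Clang's -Woverloaded-virtual (or, on later compilers, marking the child's method with the C++11 override specifier) turns this silent hiding into a diagnostic.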

Actions #7

Updated by Sage Weil over 13 years ago

  • Status changed from New to Resolved

Actions #8

Updated by Sage Weil over 13 years ago

rebased to commit:86986925fc10cf1632df41997d929547866109c5

Actions #9

Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (1)

Bulk updating project=ceph category=mds bugs so that I can remove the MDS category from the Ceph project to avoid confusion.
