Bug #1001 (closed): dead mds remains up, won't let others take over

Added by Alexandre Oliva about 13 years ago. Updated about 13 years ago.

Status: Resolved
Priority: Normal
Category: Monitor
% Done: 0%

Description

3-node cluster, with 3 mons, 3 mdses (all configured for standby-replay) and 3 osds (though the osd on node 0 is down, because I'm using a kernel ceph mount on that node, and that tends to deadlock when uploading lots of data).
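
For reference, "all configured for standby-replay" means each mds has something along the lines of the ceph.conf snippet below; the option names are my recollection of the 0.2x-era syntax rather than a copy of the actual config, so treat it as an approximation:

; approximate standby-replay configuration (option names from memory)
[mds]
        mds standby replay = true
        mds standby for rank = 0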

While loading up data from node 0, I started a highly parallel build on node 1, on the btrfs filesystem that also holds the data for mon1 (but not osd1). Shortly thereafter, the ceph filesystem came to a halt and mdses started to disappear from the mds dump output, although ceph -w didn't report any changes for a few minutes.

The mdses eventually came back into standby or standby-replay, but they wouldn't be activated.

The mdsmap history for mon0, and the mon.0.log starting some 90 minutes before the pause, are attached. The problem occurred between 15:35 and 15:41.


Files

mon0-mdsmap.tar.xz (1.83 MB) - mon0 logs and mdsmap history - Alexandre Oliva, 04/12/2011 01:09 PM
#1 - Updated by Alexandre Oliva about 13 years ago

All 3 nodes were running 0.26 with the stable patches (c494689062c9), plus a patch that relaxes the journaler _trim_finish assertion from “to >” to “to >=”.
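
For clarity, a minimal standalone sketch of that relaxation is below; the real Journaler::_trim_finish has much more surrounding context, so this is only the shape of the check, paraphrased rather than quoted:

// Minimal sketch of the relaxed trim check; not the actual Journaler code.
// trimmed_pos: how far the journal has already been trimmed.
// to: the offset the just-finished trim operation reached.
#include <cassert>
#include <cstdint>

struct JournalerSketch {
  uint64_t trimmed_pos = 0;

  void trim_finish(uint64_t to) {
    // Stock assertion (as described above): assert(to > trimmed_pos);
    // Relaxed form: a trim that lands exactly on trimmed_pos is tolerated.
    assert(to >= trimmed_pos);
    trimmed_pos = to;
  }
};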

The reason I mentioned that a dead mds remained up is that Sage noticed that the “up” set in the mds dump output still listed mds 4477 as active, even though 4477 was no longer listed as one of the mdses.
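
To make that symptom concrete, here's a rough sketch (with made-up type and field names, not the actual MDSMonitor code) of the consistency condition that appears to be violated: every gid in the rank-to-gid “up” map should refer to a daemon that is actually in a rank-holding state, and any rank whose gid isn't should be failed so a standby can take over.

// Illustrative only: types and names below are assumptions, not Ceph's.
#include <cstdint>
#include <map>

enum class MDSState { standby, standby_replay, replay, active };

struct MDSInfo {
  MDSState state;
};

// Returns the ranks whose "up" entry no longer points at a daemon in a
// rank-holding state (e.g. rank 0 -> gid 4477 while 4477 sits in
// standby-replay); these are the ranks that should be declared failed.
std::map<int32_t, uint64_t> find_stuck_ranks(
    const std::map<int32_t, uint64_t>& up,          // rank -> gid
    const std::map<uint64_t, MDSInfo>& mds_info)    // gid -> daemon info
{
  std::map<int32_t, uint64_t> stuck;
  for (const auto& [rank, gid] : up) {
    auto it = mds_info.find(gid);
    bool holds_rank = it != mds_info.end() &&
                      (it->second.state == MDSState::replay ||
                       it->second.state == MDSState::active);
    if (!holds_rank)
      stuck[rank] = gid;
  }
  return stuck;
}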

#2 - Updated by Sage Weil about 13 years ago

  • Category set to Monitor
  • Status changed from New to In Progress
  • Assignee set to Sage Weil
  • Target version set to v0.27
  • Story points set to 2
#3 - Updated by Sage Weil about 13 years ago

  • Status changed from In Progress to 7

mds went from up:replay to up:standby-replay:

vapre:src 01:08 PM $ ./mdsmaptool -p tmp/mon0/current/mdsmap/236
./mdsmaptool: mdsmap file 'tmp/mon0/current/mdsmap/236'
epoch   236
flags   0
created 2011-04-08 12:05:20.886321
modified        2011-04-12 11:36:52.231101
tableserver     0
root    0
session_timeout 60
session_autoclose       300
last_failure    236
last_failure_osd_epoch  102
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object}
max_mds 1
in      0
up      {0=4477}
failed
stopped
4477:   172.31.160.7:6804/7910 '2' mds0.39 up:replay seq 2 laggy since 2011-04-12 11:36:52.221977 (standby for rank -2)
vapre:src 01:09 PM $ ./mdsmaptool -p tmp/mon0/current/mdsmap/237
./mdsmaptool: mdsmap file 'tmp/mon0/current/mdsmap/237'
epoch   237
flags   0
created 2011-04-08 12:05:20.886321
modified        2011-04-12 11:36:53.932304
tableserver     0
root    0
session_timeout 60
session_autoclose       300
last_failure    236
last_failure_osd_epoch  102
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object}
max_mds 1
in      0
up      {0=4477}
failed
stopped
4477:   172.31.160.7:6804/7910 '2' mds0.39 up:standby-replay seq 2 laggy since 2011-04-12 11:36:52.221977 (standby for rank 0)

The log level is low, but I'm 90% sure this is fixed by 5e27a079e8cfb4c90de1d36bfef0065d9a5cbb14.

#4 - Updated by Sage Weil about 13 years ago

  • Status changed from 7 to Resolved