Bug #1001 (closed): dead mds remains up, won't let others take over

Added by Alexandre Oliva about 13 years ago. Updated about 13 years ago.

Status: Resolved
Priority: Normal
Category: Monitor
% Done: 0%

Description

3-node cluster, with 3 mons, 3 mdses (all configured for standby-replay) and 3 osds (though the osd on node 0 is down, because I'm using a kernel ceph mount on that node, and that tends to deadlock when uploading lots of data).
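
For reference, "all configured for standby-replay" means each mds has something along the lines of the ceph.conf snippet below; the option names are my recollection of the 0.2x-era syntax rather than a copy of the actual config, so treat it as an approximation:

; approximate standby-replay configuration (option names from memory)
[mds]
        mds standby replay = true
        mds standby for rank = 0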

While loading up data from node 0, I started a highly parallel build on node 1, on the btrfs filesystem that also holds the data for mon1 (but not osd1). Shortly thereafter, the ceph filesystem came to a halt and mdses started to disappear from the mds dump output, although ceph -w didn't report any changes for a few minutes.

The mdses eventually came back into standby or standby-replay, but they wouldn't be activated.

The mdsmap history for mon0, and the mon.0.log starting some 90 minutes before the pause, are attached. The problem occurred between 15:35 and 15:41.


Files

mon0-mdsmap.tar.xz (1.83 MB) - mon0 logs and mdsmap history - Alexandre Oliva, 04/12/2011 01:09 PM
#1 - Updated by Alexandre Oliva about 13 years ago

All 3 nodes were running 0.26 with the stable patches (c494689062c9), plus a patch that relaxes the journaler _trim_finish assertion from “to >” to “to >=”.
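
For clarity, a minimal standalone sketch of that relaxation is below; the real Journaler::_trim_finish has much more surrounding context, so this is only the shape of the check, paraphrased rather than quoted:

// Minimal sketch of the relaxed trim check; not the actual Journaler code.
// trimmed_pos: how far the journal has already been trimmed.
// to: the offset the just-finished trim operation reached.
#include <cassert>
#include <cstdint>

struct JournalerSketch {
  uint64_t trimmed_pos = 0;

  void trim_finish(uint64_t to) {
    // Stock assertion (as described above): assert(to > trimmed_pos);
    // Relaxed form: a trim that lands exactly on trimmed_pos is tolerated.
    assert(to >= trimmed_pos);
    trimmed_pos = to;
  }
};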

The reason I mentioned that a dead mds remained up is that Sage noticed that the “up” set in the mds dump output still listed mds 4477 as active, even though 4477 was no longer listed as one of the mdses.
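
To make that symptom concrete, here's a rough sketch (with made-up type and field names, not the actual MDSMonitor code) of the consistency condition that appears to be violated: every gid in the rank-to-gid “up” map should refer to a daemon that is actually in a rank-holding state, and any rank whose gid isn't should be failed so a standby can take over.

// Illustrative only: types and names below are assumptions, not Ceph's.
#include <cstdint>
#include <map>

enum class MDSState { standby, standby_replay, replay, active };

struct MDSInfo {
  MDSState state;
};

// Returns the ranks whose "up" entry no longer points at a daemon in a
// rank-holding state (e.g. rank 0 -> gid 4477 while 4477 sits in
// standby-replay); these are the ranks that should be declared failed.
std::map<int32_t, uint64_t> find_stuck_ranks(
    const std::map<int32_t, uint64_t>& up,          // rank -> gid
    const std::map<uint64_t, MDSInfo>& mds_info)    // gid -> daemon info
{
  std::map<int32_t, uint64_t> stuck;
  for (const auto& [rank, gid] : up) {
    auto it = mds_info.find(gid);
    bool holds_rank = it != mds_info.end() &&
                      (it->second.state == MDSState::replay ||
                       it->second.state == MDSState::active);
    if (!holds_rank)
      stuck[rank] = gid;
  }
  return stuck;
}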

#2 - Updated by Sage Weil about 13 years ago

  • Category set to Monitor
  • Status changed from New to In Progress
  • Assignee set to Sage Weil
  • Target version set to v0.27
  • Story points set to 2
#3 - Updated by Sage Weil about 13 years ago

  • Status changed from In Progress to 7

mds went from up:replay to up:standby-replay:

vapre:src 01:08 PM $ ./mdsmaptool -p tmp/mon0/current/mdsmap/236
./mdsmaptool: mdsmap file 'tmp/mon0/current/mdsmap/236'
epoch   236
flags   0
created 2011-04-08 12:05:20.886321
modified        2011-04-12 11:36:52.231101
tableserver     0
root    0
session_timeout 60
session_autoclose       300
last_failure    236
last_failure_osd_epoch  102
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object}
max_mds 1
in      0
up      {0=4477}
failed
stopped
4477:   172.31.160.7:6804/7910 '2' mds0.39 up:replay seq 2 laggy since 2011-04-12 11:36:52.221977 (standby for rank -2)
vapre:src 01:09 PM $ ./mdsmaptool -p tmp/mon0/current/mdsmap/237
./mdsmaptool: mdsmap file 'tmp/mon0/current/mdsmap/237'
epoch   237
flags   0
created 2011-04-08 12:05:20.886321
modified        2011-04-12 11:36:53.932304
tableserver     0
root    0
session_timeout 60
session_autoclose       300
last_failure    236
last_failure_osd_epoch  102
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object}
max_mds 1
in      0
up      {0=4477}
failed
stopped
4477:   172.31.160.7:6804/7910 '2' mds0.39 up:standby-replay seq 2 laggy since 2011-04-12 11:36:52.221977 (standby for rank 0)

The log level is low, but I'm 90% sure this is fixed by 5e27a079e8cfb4c90de1d36bfef0065d9a5cbb14.

#4 - Updated by Sage Weil about 13 years ago

  • Status changed from 7 to Resolved