Bug #23503 (closed)

mds: crash during pressure test

Added by wei jin about 6 years ago. Updated about 6 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ceph version: 12.2.4
10 MDS daemons: 9 active + 1 standby
directory fragmentation disabled

We created 9 directories and pinned each of them to an active MDS (one directory per active MDS). Then we ran our script in each directory (it decompresses an archive containing 100,000 small files into different subdirectories).
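
For reference, pinning a directory to an MDS rank is typically done by setting the CephFS "ceph.dir.pin" extended attribute; a minimal Python sketch of that step (the mount point and loop below are illustrative, not the exact script used):

```python
import os

# Illustrative mount point containing the nine base directories.
MOUNT = "/mnt/cephfs/tmp"

# Pin each base directory to its own active MDS rank (0..8) by setting
# the "ceph.dir.pin" extended attribute to the target rank number.
base_dirs = sorted(d for d in os.listdir(MOUNT)
                   if os.path.isdir(os.path.join(MOUNT, d)))
for rank, name in enumerate(base_dirs):
    os.setxattr(os.path.join(MOUNT, name), "ceph.dir.pin", str(rank).encode())
```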

mds.A crash log:

2018-03-29 10:11:19.451099 7faf33d69700 -1 /build/ceph-12.2.4/src/mds/MDCache.cc: In function 'MDRequestRef MDCache::request_get(metareqid_t)' thread 7faf33d69700 time 2018-03-29 10:11:19.439198
/build/ceph-12.2.4/src/mds/MDCache.cc: 9043: FAILED assert(p != active_requests.end())

ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x555e64b178d2]
2: (MDCache::request_get(metareqid_t)+0x24f) [0x555e648c735f]
3: (Server::handle_slave_request_reply(MMDSSlaveRequest*)+0x2ca) [0x555e6487d9ea]
4: (Server::handle_slave_request(MMDSSlaveRequest*)+0x94f) [0x555e6487f01f]
5: (Server::dispatch(Message*)+0x383) [0x555e6487faa3]
6: (MDSRank::handle_deferrable_message(Message*)+0x7fc) [0x555e647f510c]
7: (MDSRank::_dispatch(Message*, bool)+0x1db) [0x555e6480258b]
8: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x555e64803355]
9: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x555e647ecb13]
10: (DispatchQueue::entry()+0x7ca) [0x555e64e16eda]
11: (DispatchQueue::DispatchThread::entry()+0xd) [0x555e64b9c5ad]
12: (()+0x8064) [0x7faf38b41064]
13: (clone()+0x6d) [0x7faf37c2c62d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Before the crash, we observed subdir migration:
mds.1.migrator nicely exporting to mds.0 [dir 0x20013442763 /tmp/n20-064-085/n20-064-085_9275/
......
mds.1.migrator nicely exporting to mds.0 [dir 0x200134734a7 /tmp/n20-064-085/n20-064-085_9274/

The 'base' directory, such as n20-064-085, is pinned; however, its subdirectories can still be migrated to other ranks. Is this expected behavior? Can we disable the migration completely?

The migration does not seem stable enough; it is very easy to stall the whole filesystem. I first tested without pinning, then chose to pin the directories so that I could use multiple active MDS daemons.


Related issues: 2 (0 open, 2 closed)

Related to CephFS - Bug #23518: mds: crash when failover (Resolved, Zheng Yan, 03/30/2018)

Is duplicate of CephFS - Bug #23059: mds: FAILED assert (p != active_requests.end()) in MDRequestRef MDCache::request_get(metareqid_t) (Resolved, Zheng Yan, 02/21/2018)

#1

Updated by wei jin about 6 years ago

After the crash, the standby MDS took over; however, we observed another crash:

2018-03-29 10:25:04.719502 7f5ae5ad2700 -1 /build/ceph-12.2.4/src/mds/MDCache.cc: In function 'void MDCache::handle_cache_rejoin_ack(MMDSCacheRejoin*)' thread 7f5ae5ad2700 time 2018-03-29 10:25:04.716917
/build/ceph-12.2.4/src/mds/MDCache.cc: 5087: FAILED assert(session)

ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55ba1428d8d2]
2: (MDCache::handle_cache_rejoin_ack(MMDSCacheRejoin*)+0x2422) [0x55ba14071542]
3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x233) [0x55ba1407def3]
4: (MDCache::dispatch(Message*)+0xa5) [0x55ba1407e045]
5: (MDSRank::handle_deferrable_message(Message*)+0x5bc) [0x55ba13f6aecc]
6: (MDSRank::_dispatch(Message*, bool)+0x1db) [0x55ba13f7858b]
7: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x55ba13f79355]
8: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x55ba13f62b13]
9: (DispatchQueue::entry()+0x7ca) [0x55ba1458ceda]
10: (DispatchQueue::DispatchThread::entry()+0xd) [0x55ba143125ad]
11: (()+0x8064) [0x7f5aea8aa064]
12: (clone()+0x6d) [0x7f5ae999562d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
#2

Updated by Patrick Donnelly about 6 years ago

  • Subject changed from 'luminous: mds crash during pressure test' to 'mds: crash during pressure test'
  • Status changed from New to Duplicate
#3

Updated by Patrick Donnelly about 6 years ago

  • Is duplicate of Bug #23059: mds: FAILED assert (p != active_requests.end()) in MDRequestRef MDCache::request_get(metareqid_t) added
#4

Updated by Patrick Donnelly about 6 years ago

wei jin wrote:

After the crash, the standby MDS took over; however, we observed another crash:

This smells like a different bug. Please open a separate issue.

#5

Updated by wei jin about 6 years ago

Patrick Donnelly wrote:

wei jin wrote:

After the crash, the standby MDS took over; however, we observed another crash:

This smells like a different bug. Please open a separate issue.

Done. https://tracker.ceph.com/issues/23518

Hi Patrick, I have a question: after pinning the base dir, will subdirs still be migrated to other active MDSs under heavy load?

#6

Updated by Patrick Donnelly about 6 years ago

wei jin wrote:

Hi Patrick, I have a question: after pinning the base dir, will subdirs still be migrated to other active MDSs under heavy load?

No. Export pins are applied recursively.
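
To illustrate the recursive behavior (a sketch only; the paths are hypothetical, and in CephFS a pin value of -1 means the directory has no pin of its own and inherits from the nearest pinned ancestor):

```python
import os

# Hypothetical pinned base directory on a CephFS mount.
BASE = "/mnt/cephfs/tmp/n20-064-085"

# Pinning only the base directory is sufficient: subdirectories beneath
# it inherit the pin, so no per-subdirectory xattr is needed.
os.setxattr(BASE, "ceph.dir.pin", b"1")

# A subdirectory may set its own pin to override the inherited one,
# or -1 to drop its own pin and fall back to the parent's.
sub = os.path.join(BASE, "n20-064-085_9275")
os.setxattr(sub, "ceph.dir.pin", b"-1")
```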

#7

Updated by wei jin about 6 years ago

Patrick Donnelly wrote:

wei jin wrote:

Hi Patrick, I have a question: after pinning the base dir, will subdirs still be migrated to other active MDSs under heavy load?

No. Export pins are applied recursively.

Thanks. I saw your mail on the mailing list, which mentioned the patch https://github.com/ceph/ceph/pull/19220/commits/fb7a4cf2aaf68dc5e16733d8daf2e1bf716f183a.

It seems to be just a logging issue.

#8

Updated by Zheng Yan about 6 years ago

  • Related to Bug #23518: mds: crash when failover added