Bug #23503 (closed)

mds: crash during pressure test

Added by wei jin about 6 years ago. Updated about 6 years ago.

Status: Duplicate
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ceph version: 12.2.4
10 MDS daemons: 9 active + 1 standby
directory fragmentation disabled

We created 9 directories and pinned each of them to an active MDS (one directory per active MDS). Then we ran our script in each directory (it decompresses an archive containing 100,000 small files into different subdirectories).
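
For reference, pinning a directory to an MDS rank is typically done by setting the CephFS "ceph.dir.pin" extended attribute; a minimal Python sketch of that step (the mount point and loop below are illustrative, not the exact script used):

```python
import os

# Illustrative mount point containing the nine base directories.
MOUNT = "/mnt/cephfs/tmp"

# Pin each base directory to its own active MDS rank (0..8) by setting
# the "ceph.dir.pin" extended attribute to the target rank number.
base_dirs = sorted(d for d in os.listdir(MOUNT)
                   if os.path.isdir(os.path.join(MOUNT, d)))
for rank, name in enumerate(base_dirs):
    os.setxattr(os.path.join(MOUNT, name), "ceph.dir.pin", str(rank).encode())
```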

mds.A crash log:

2018-03-29 10:11:19.451099 7faf33d69700 -1 /build/ceph-12.2.4/src/mds/MDCache.cc: In function 'MDRequestRef MDCache::request_get(metareqid_t)' thread 7faf33d69700 time 2018-03-29 10:11:19.439198
/build/ceph-12.2.4/src/mds/MDCache.cc: 9043: FAILED assert(p != active_requests.end())

ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x555e64b178d2]
2: (MDCache::request_get(metareqid_t)+0x24f) [0x555e648c735f]
3: (Server::handle_slave_request_reply(MMDSSlaveRequest*)+0x2ca) [0x555e6487d9ea]
4: (Server::handle_slave_request(MMDSSlaveRequest*)+0x94f) [0x555e6487f01f]
5: (Server::dispatch(Message*)+0x383) [0x555e6487faa3]
6: (MDSRank::handle_deferrable_message(Message*)+0x7fc) [0x555e647f510c]
7: (MDSRank::_dispatch(Message*, bool)+0x1db) [0x555e6480258b]
8: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x555e64803355]
9: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x555e647ecb13]
10: (DispatchQueue::entry()+0x7ca) [0x555e64e16eda]
11: (DispatchQueue::DispatchThread::entry()+0xd) [0x555e64b9c5ad]
12: (()+0x8064) [0x7faf38b41064]
13: (clone()+0x6d) [0x7faf37c2c62d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Before the crash, we observed subdir migration:
mds.1.migrator nicely exporting to mds.0 [dir 0x20013442763 /tmp/n20-064-085/n20-064-085_9275/
......
mds.1.migrator nicely exporting to mds.0 [dir 0x200134734a7 /tmp/n20-064-085/n20-064-085_9274/

The 'base' directory, such as n20-064-085, is pinned; however, its subdirectories can still be migrated to other ranks. Is this expected behavior? Can we disable the migration completely?

The migration does not seem stable enough; it is very easy to stall the whole filesystem. I first tested without pinning, then chose to pin the directories so that I could use multiple active MDS daemons.


Related issues: 2 (0 open, 2 closed)

Related to CephFS - Bug #23518: mds: crash when failover (Resolved, Zheng Yan, 03/30/2018)

Is duplicate of CephFS - Bug #23059: mds: FAILED assert (p != active_requests.end()) in MDRequestRef MDCache::request_get(metareqid_t) (Resolved, Zheng Yan, 02/21/2018)

#1

Updated by wei jin about 6 years ago

After the crash, the standby MDS took over; however, we observed another crash:

2018-03-29 10:25:04.719502 7f5ae5ad2700 -1 /build/ceph-12.2.4/src/mds/MDCache.cc: In function 'void MDCache::handle_cache_rejoin_ack(MMDSCacheRejoin*)' thread 7f5ae5ad2700 time 2018-03-29 10:25:04.716917
/build/ceph-12.2.4/src/mds/MDCache.cc: 5087: FAILED assert(session)

ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55ba1428d8d2]
2: (MDCache::handle_cache_rejoin_ack(MMDSCacheRejoin*)+0x2422) [0x55ba14071542]
3: (MDCache::handle_cache_rejoin(MMDSCacheRejoin*)+0x233) [0x55ba1407def3]
4: (MDCache::dispatch(Message*)+0xa5) [0x55ba1407e045]
5: (MDSRank::handle_deferrable_message(Message*)+0x5bc) [0x55ba13f6aecc]
6: (MDSRank::_dispatch(Message*, bool)+0x1db) [0x55ba13f7858b]
7: (MDSRankDispatcher::ms_dispatch(Message*)+0x15) [0x55ba13f79355]
8: (MDSDaemon::ms_dispatch(Message*)+0xf3) [0x55ba13f62b13]
9: (DispatchQueue::entry()+0x7ca) [0x55ba1458ceda]
10: (DispatchQueue::DispatchThread::entry()+0xd) [0x55ba143125ad]
11: (()+0x8064) [0x7f5aea8aa064]
12: (clone()+0x6d) [0x7f5ae999562d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
#2

Updated by Patrick Donnelly about 6 years ago

  • Subject changed from 'luminous: mds crash during pressure test' to 'mds: crash during pressure test'
  • Status changed from New to Duplicate
#3

Updated by Patrick Donnelly about 6 years ago

  • Is duplicate of Bug #23059: mds: FAILED assert (p != active_requests.end()) in MDRequestRef MDCache::request_get(metareqid_t) added
#4

Updated by Patrick Donnelly about 6 years ago

wei jin wrote:

After the crash, the standby MDS took over; however, we observed another crash:

This smells like a different bug. Please open a separate issue.

#5

Updated by wei jin about 6 years ago

Patrick Donnelly wrote:

wei jin wrote:

After the crash, the standby MDS took over; however, we observed another crash:

This smells like a different bug. Please open a separate issue.

Done. https://tracker.ceph.com/issues/23518

Hi Patrick, I have a question: after pinning the base dir, will subdirs still be migrated to other active MDSs under heavy load?

#6

Updated by Patrick Donnelly about 6 years ago

wei jin wrote:

Hi Patrick, I have a question: after pinning the base dir, will subdirs still be migrated to other active MDSs under heavy load?

No. Export pins are applied recursively.
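
To illustrate the recursive behavior (a sketch only; the paths are hypothetical, and in CephFS a pin value of -1 means the directory has no pin of its own and inherits from the nearest pinned ancestor):

```python
import os

# Hypothetical pinned base directory on a CephFS mount.
BASE = "/mnt/cephfs/tmp/n20-064-085"

# Pinning only the base directory is sufficient: subdirectories beneath
# it inherit the pin, so no per-subdirectory xattr is needed.
os.setxattr(BASE, "ceph.dir.pin", b"1")

# A subdirectory may set its own pin to override the inherited one,
# or -1 to drop its own pin and fall back to the parent's.
sub = os.path.join(BASE, "n20-064-085_9275")
os.setxattr(sub, "ceph.dir.pin", b"-1")
```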

#7

Updated by wei jin about 6 years ago

Patrick Donnelly wrote:

wei jin wrote:

Hi Patrick, I have a question: after pinning the base dir, will subdirs still be migrated to other active MDSs under heavy load?

No. Export pins are applied recursively.

Thanks. I saw your mail on the mailing list, which mentioned the patch https://github.com/ceph/ceph/pull/19220/commits/fb7a4cf2aaf68dc5e16733d8daf2e1bf716f183a.

It seems to be just a logging issue.

#8

Updated by Zheng Yan about 6 years ago

  • Related to Bug #23518: mds: crash when failover added