Bug #53192


High cephfs MDS latency and CPU load with snapshots and unlink operations

Added by Andras Pataki over 2 years ago. Updated about 2 months ago.

Status:
Pending Backport
Priority:
Normal
Category:
Performance/Resource Usage
Target version:
% Done:

0%

Source:
Community (user)
Tags:
backport_processed
Backport:
squid,reef,quincy
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
snapshots
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have recently enabled snapshots on our large Nautilus cluster (running 14.2.20), and our fairly smooth-running cephfs installation became quite unhappy, with latencies increasing to such an extent that listing directories on clients takes many seconds, or even minutes. We need some help figuring out where the problem lies and whether there are any triage steps we could take to bring our cluster back to a more usable performance level.

It looks like the performance problems relate to removing lots of files, like running rm -rf on a subtree or several subtrees in parallel. Before enabling snapshots, rm -rf of a tree typically ran at about 400 unlinks/second on a single client (single-threaded remove), and the MDS load was relatively light.

After snapshots were enabled, we created one snapshot per day in the root of cephfs, but kept only one active snapshot (as an initial step), so each day we created one new snapshot and removed the previous day's snapshot. What we noticed was that 'rm -rf' dropped to about 20-30 unlinks/second, and while these removes were running, the latency on the MDS for all operations (even lookups and getattrs) increased by several orders of magnitude (100 to 1000 times). As soon as the rm processes were killed, latency came back to normal within a few seconds.
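
For context, CephFS snapshots are created and removed by making and removing directories under the .snap directory, so the daily rotation amounts to something like the following sketch (the mount point and the snapshot naming scheme are illustrative assumptions, not our exact script):

    #!/usr/bin/env python3
    # Sketch of a daily snapshot rotation on the cephfs root: create today's
    # snapshot, then remove the previous day's so only one stays active.
    # CephFS creates a snapshot when a directory is made under <dir>/.snap
    # and removes it on rmdir. Mount point and naming below are assumptions.
    import os
    import datetime

    SNAP_DIR = "/mnt/cephfs/.snap"   # .snap of the cephfs root (assumed mount point)

    today = datetime.date.today()
    yesterday = today - datetime.timedelta(days=1)

    # Create today's snapshot.
    os.mkdir(os.path.join(SNAP_DIR, f"daily-{today:%Y-%m-%d}"))

    # Remove the previous day's snapshot, if present.
    old = os.path.join(SNAP_DIR, f"daily-{yesterday:%Y-%m-%d}")
    if os.path.isdir(old):
        os.rmdir(old)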

Stracing the rm process shows that most removes complete quite fast (1-2ms), but some get stuck for several hundred milliseconds. An example strace:

    09:23:54.854668 unlinkat(7, "ebdyc.p", 0) = 0 <0.001106>
    09:23:54.855864 unlinkat(7, "Dyy.p", 0) = 0 <0.352603>
    09:23:55.208547 unlinkat(7, "u.p", 0)   = 0 <0.002474>
    09:23:55.211098 unlinkat(7, "Dxy.p", 0) = 0 <0.001317>
    09:23:55.212491 unlinkat(7, "v.p", 0)   = 0 <0.001437>
    09:23:55.214005 unlinkat(7, "Dxx.p", 0) = 0 <0.001506>
    09:23:55.215622 unlinkat(6, "timestep_35050", AT_REMOVEDIR) = 0 <0.001874>
    09:23:55.221117 unlinkat(7, "ebdyc.p", 0) = 0 <0.335875>
    09:23:55.557076 unlinkat(7, "Dyy.p", 0) = 0 <0.001657>
    09:23:55.558809 unlinkat(7, "u.p", 0)   = 0 <0.001450>
    09:23:55.560349 unlinkat(7, "Dxy.p", 0) = 0 <0.001507>
    09:23:55.561940 unlinkat(7, "v.p", 0)   = 0 <0.001349>
    09:23:55.563365 unlinkat(7, "Dxx.p", 0) = 0 <0.001182>
    09:23:55.564668 unlinkat(6, "timestep_45250", AT_REMOVEDIR) = 0 <0.002633>
    09:23:55.885885 unlinkat(7, "ebdyc.p", 0) = 0 <0.005142>
    09:23:55.891111 unlinkat(7, "Dyy.p", 0) = 0 <0.001656>
    09:23:55.892844 unlinkat(7, "u.p", 0)   = 0 <0.001685>
    09:23:55.894589 unlinkat(7, "Dxy.p", 0) = 0 <0.001407>
    09:23:55.896056 unlinkat(7, "v.p", 0)   = 0 <0.001520>
    09:23:55.897669 unlinkat(7, "Dxx.p", 0) = 0 <0.001601>

On the MDS side, there is a thread called 'fn_anonymous' that goes to 100% CPU utilization when these removes are running. Attaching a debugger shows stack traces that look like this:

#0  0x0000555555787d4c in CInode::get_oldest_parent_dn (this=0x555843e14700) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.h:780
#1  CInode::is_ancestor_of (this=0x55561ebbee00, other=other@entry=0x555843e14700) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.cc:917
#2  0x00005555557e4147 in SnapRealm::split_at (this=this@entry=0x555557c3a500, child=0x5555c1ad7900) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/SnapRealm.cc:593
#3  0x00005555557a1a5f in CInode::open_snaprealm (this=0x55561ebbee00, nosplit=<optimized out>) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.cc:2764
#4  0x00005555557a1e96 in CInode::pop_projected_snaprealm (this=0x55561ebbee00, next_snaprealm=0x555571b24000, early=<optimized out>) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.cc:586
#5  0x00005555557a2462 in CInode::early_pop_projected_snaprealm (this=this@entry=0x55561ebbee00) at /usr/include/c++/8/bits/stl_list.h:1122
#6  0x00005555555d1dc0 in Server::_unlink_local_finish (this=0x555557055680, mdr=..., dn=0x5558e50845a0, straydn=0x55582c1981e0, dnpv=513) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/Server.cc:7025
#7  0x00005555558067df in Context::complete (r=0, this=0x555715b043c0) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/include/Context.h:77
#8  MDSContext::complete (this=0x555715b043c0, r=0) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/MDSContext.cc:29
#9  0x0000555555806a8f in MDSIOContextBase::complete (this=0x555715b043c0, r=0) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/MDSContext.cc:114
#10 0x0000555555806d54 in MDSLogContextBase::complete (this=<optimized out>, r=0) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/MDSContext.cc:123
#11 0x00007fffef02062d in Finisher::finisher_thread_entry() () from target:/usr/lib64/ceph/libceph-common.so.0
#12 0x00007fffecd7d14a in start_thread () from target:/lib64/libpthread.so.0
#13 0x00007fffeb896dc3 in clone () from target:/lib64/libc.so.6

#0  0x0000555555787d37 in CInode::is_ancestor_of (this=0x55580f021800, other=0x5557c1b1c300, other@entry=0x5557c9146700) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CDentry.h:150
#1  0x00005555557e4147 in SnapRealm::split_at (this=this@entry=0x555557ea0280, child=0x55584e1c1b80) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/SnapRealm.cc:593
#2  0x00005555557a1a5f in CInode::open_snaprealm (this=0x55580f021800, nosplit=<optimized out>) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.cc:2764
#3  0x00005555557a1e96 in CInode::pop_projected_snaprealm (this=0x55580f021800, next_snaprealm=0x55556425f140, early=<optimized out>) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.cc:586
#4  0x00005555557a2462 in CInode::early_pop_projected_snaprealm (this=this@entry=0x55580f021800) at /usr/include/c++/8/bits/stl_list.h:1122
#5  0x00005555555d1dc0 in Server::_unlink_local_finish (this=0x555556438500, mdr=..., dn=0x5558541e8f00, straydn=0x555946731860, dnpv=77222) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/Server.cc:7025
#6  0x00005555558067df in Context::complete (r=0, this=0x555943b55f90) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/include/Context.h:77
#7  MDSContext::complete (this=0x555943b55f90, r=0) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/MDSContext.cc:29
#8  0x0000555555806a8f in MDSIOContextBase::complete (this=0x555943b55f90, r=0) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/MDSContext.cc:114
#9  0x0000555555806d54 in MDSLogContextBase::complete (this=<optimized out>, r=0) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/MDSContext.cc:123
#10 0x00007fffef02062d in Finisher::finisher_thread_entry() () from target:/usr/lib64/ceph/libceph-common.so.0
#11 0x00007fffecd7d14a in start_thread () from target:/lib64/libpthread.so.0
#12 0x00007fffeb896dc3 in clone () from target:/lib64/libc.so.6

The total CPU utilization of the MDS around these times is perhaps 120%, out of which 100% is this fn_anonymous thread; the rest of the threads are quite idle, presumably waiting for this thread to do its work.

I'm measuring the latency of the MDS using the perf counters: taking a snapshot of the counters every 5 seconds and looking at the latency counters like:

        "reply_latency": {
            "avgcount": 862377798,
            "sum": 11712188.751232780,
            "avgtime": 0.013581273
        },

and doing (sum2 - sum1) / (avgcount2 - avgcount1) over the interval to find the interval latency.
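
Concretely, the per-interval calculation looks roughly like the sketch below (the MDS daemon name passed to the admin socket, and the perf-dump section name, are placeholders/assumptions based on the counter shown above):

    #!/usr/bin/env python3
    # Sketch of the interval-latency calculation described above: take two
    # perf dumps 5 seconds apart and compute the average reply latency over
    # just that interval. "mds.mds1" is a placeholder daemon name.
    import json
    import subprocess
    import time

    def perf_dump():
        out = subprocess.check_output(["ceph", "daemon", "mds.mds1", "perf", "dump"])
        return json.loads(out)

    def interval_latency(prev, curr, section="mds", counter="reply_latency"):
        c1, c2 = prev[section][counter], curr[section][counter]
        dcount = c2["avgcount"] - c1["avgcount"]
        dsum = c2["sum"] - c1["sum"]
        # (sum2 - sum1) / (avgcount2 - avgcount1)
        return dsum / dcount if dcount else 0.0

    prev = perf_dump()
    time.sleep(5)
    curr = perf_dump()
    print(f"reply_latency over interval: {interval_latency(prev, curr):.4f} s")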

When not doing removes, typical reply latencies are under 1ms, even while processing over a thousand lookups per second. The MDS CPU utilization is also below 50%.

MDS:
    req/s:  3101.7,  reply/s:  3101.7,  lat:  0.0001
Latencies:
    req_create      :  req/s:   12.2,  lat:  0.0008
    req_getattr     :  req/s:   83.4,  lat:  0.0000
    req_lookup      :  req/s: 1102.1,  lat:  0.0000
    req_open        :  req/s:    8.3,  lat:  0.0105
    req_readdir     :  req/s:    1.0,  lat:  0.0010
    req_rename      :  req/s:    1.8,  lat:  0.0001
    req_setfilelock :  req/s: 1890.4,  lat:  0.0000
    req_unlink      :  req/s:    2.6,  lat:  0.0001

When the removes run, both the overall latency and all the per-operation latencies increase by 2 to 3 orders of magnitude, and the fn_anonymous thread gets really busy. Since lookup latencies are now tenths of a second, cephfs is pretty much unusable (listing a directory sometimes takes a minute or more). Also, unlinks run at only about 20-30 per second:

MDS:
    req/s:  3916.6,  reply/s:  3917.0,  lat:  0.3086
Latencies:
    req_create      :  req/s:  426.4,  lat:  0.5452
    req_getattr     :  req/s:   58.3,  lat:  0.0269
    req_lookup      :  req/s: 3340.6,  lat:  0.2881
    req_mkdir       :  req/s:    0.6,  lat:  0.0417
    req_open        :  req/s:   51.1,  lat:  0.1990
    req_readdir     :  req/s:    3.1,  lat:  0.0338
    req_rename      :  req/s:    2.2,  lat:  0.3481
    req_rmdir       :  req/s:    3.3,  lat:  0.0545
    req_rmxattr     :  req/s:    0.2,  lat:  0.3395
    req_setattr     :  req/s:    0.4,  lat:  0.1154
    req_setfilelock :  req/s:    8.8,  lat:  0.0672
    req_unlink      :  req/s:   21.5,  lat:  0.0113

Since this problem started, I've disabled snapshot creation and removed all snapshots. Unfortunately, this has not helped. Removes of trees that were created before snapshots were in use are still painfully slow, while removes of trees created after snapshots were stopped run fast without any issues.

Also, all other parts of the cluster are healthy. The metadata pool is stored on separate nodes on NVMe drives, and those OSDs are quite lightly loaded, so the entire performance issue seems to be within the MDS.
I'm very happy to provide any additional details/logs or try potential solutions. Any help would be much appreciated.


Related issues (1 open, 3 closed)

Has duplicate: CephFS - Bug #66248: Performance degradation after creating snapshots (Duplicate, Patrick Donnelly)

Copied to: CephFS - Backport #65293: quincy: High cephfs MDS latency and CPU load with snapshots and unlink operations (Fix Under Review, Patrick Donnelly)
Copied to: CephFS - Backport #65294: reef: High cephfs MDS latency and CPU load with snapshots and unlink operations (Resolved, Patrick Donnelly)
Copied to: CephFS - Backport #65295: squid: High cephfs MDS latency and CPU load with snapshots and unlink operations (Resolved, Patrick Donnelly)