Bug #53192


High cephfs MDS latency and CPU load with snapshots and unlink operations

Added by Andras Pataki over 2 years ago. Updated about 2 months ago.

Status: Pending Backport
Priority: Normal
Category: Performance/Resource Usage
Target version:
% Done: 0%
Source: Community (user)
Tags: backport_processed
Backport: squid,reef,quincy
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): snapshots
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have recently enabled snapshots on our large Nautilus cluster (running 14.2.20), and our fairly smooth-running cephfs installation became quite unhappy - latencies increased to the point that listing directories on clients takes many seconds, or even minutes. We need some help figuring out where the problem lies and whether there are any triage steps we could take to bring the cluster back to a usable performance level.

It looks like the performance problems relate to removing lots of files, e.g. running rm -rf on a subtree, or on several subtrees in parallel. Before enabling snapshots, rm -rf of a tree typically ran at about 400 unlinks/second from a single client (single-threaded remove), and the MDS load was relatively light.

After snapshots were enabled, we created one snapshot per day in the root of cephfs, but kept only one active snapshot (as an initial step), so each day we created one new snapshot and removed the previous day's older one. What we noticed was that 'rm -rf' dropped to about 20-30 unlinks/second, and while these removes were running, the latency on the MDS for all operations (even lookups and getattrs) increased by several orders of magnitude (100 to 1000 times). As soon as the rm processes were killed, latency returned to normal within a few seconds.
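
Creating and removing cephfs snapshots is just mkdir/rmdir inside a .snap directory, so the daily rotation described above amounts to something like the following minimal sketch (the mount point and naming scheme are illustrative assumptions, not our exact tooling):

    import os
    from datetime import date, timedelta

    CEPHFS_ROOT = "/mnt/cephfs"   # assumed mount point of the cephfs root

    def rotate_daily_snapshot(root=CEPHFS_ROOT):
        """Create today's snapshot and remove yesterday's, keeping one active."""
        snap_dir = os.path.join(root, ".snap")
        today = date.today().isoformat()
        yesterday = (date.today() - timedelta(days=1)).isoformat()

        # mkdir inside .snap creates a snapshot of the whole subtree
        os.mkdir(os.path.join(snap_dir, "daily-" + today))

        # rmdir inside .snap removes the snapshot
        old = os.path.join(snap_dir, "daily-" + yesterday)
        if os.path.isdir(old):
            os.rmdir(old)

    if __name__ == "__main__":
        rotate_daily_snapshot()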

Stracing the rm process shows that most removes proceed quite fast (1-2 ms), but some get stuck for several hundred milliseconds. An example strace:

    09:23:54.854668 unlinkat(7, "ebdyc.p", 0) = 0 <0.001106>
    09:23:54.855864 unlinkat(7, "Dyy.p", 0) = 0 <0.352603>
    09:23:55.208547 unlinkat(7, "u.p", 0)   = 0 <0.002474>
    09:23:55.211098 unlinkat(7, "Dxy.p", 0) = 0 <0.001317>
    09:23:55.212491 unlinkat(7, "v.p", 0)   = 0 <0.001437>
    09:23:55.214005 unlinkat(7, "Dxx.p", 0) = 0 <0.001506>
    09:23:55.215622 unlinkat(6, "timestep_35050", AT_REMOVEDIR) = 0 <0.001874>
    09:23:55.221117 unlinkat(7, "ebdyc.p", 0) = 0 <0.335875>
    09:23:55.557076 unlinkat(7, "Dyy.p", 0) = 0 <0.001657>
    09:23:55.558809 unlinkat(7, "u.p", 0)   = 0 <0.001450>
    09:23:55.560349 unlinkat(7, "Dxy.p", 0) = 0 <0.001507>
    09:23:55.561940 unlinkat(7, "v.p", 0)   = 0 <0.001349>
    09:23:55.563365 unlinkat(7, "Dxx.p", 0) = 0 <0.001182>
    09:23:55.564668 unlinkat(6, "timestep_45250", AT_REMOVEDIR) = 0 <0.002633>
    09:23:55.885885 unlinkat(7, "ebdyc.p", 0) = 0 <0.005142>
    09:23:55.891111 unlinkat(7, "Dyy.p", 0) = 0 <0.001656>
    09:23:55.892844 unlinkat(7, "u.p", 0)   = 0 <0.001685>
    09:23:55.894589 unlinkat(7, "Dxy.p", 0) = 0 <0.001407>
    09:23:55.896056 unlinkat(7, "v.p", 0)   = 0 <0.001520>
    09:23:55.897669 unlinkat(7, "Dxx.p", 0) = 0 <0.001601>

On the MDS side, there is a thread called 'fn_anonymous' that goes to 100% CPU utilization when these removes are running. Attaching a debugger shows stack traces that look like this:

#0  0x0000555555787d4c in CInode::get_oldest_parent_dn (this=0x555843e14700) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.h:780
#1  CInode::is_ancestor_of (this=0x55561ebbee00, other=other@entry=0x555843e14700) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.cc:917
#2  0x00005555557e4147 in SnapRealm::split_at (this=this@entry=0x555557c3a500, child=0x5555c1ad7900) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/SnapRealm.cc:593
#3  0x00005555557a1a5f in CInode::open_snaprealm (this=0x55561ebbee00, nosplit=<optimized out>) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.cc:2764
#4  0x00005555557a1e96 in CInode::pop_projected_snaprealm (this=0x55561ebbee00, next_snaprealm=0x555571b24000, early=<optimized out>) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.cc:586
#5  0x00005555557a2462 in CInode::early_pop_projected_snaprealm (this=this@entry=0x55561ebbee00) at /usr/include/c++/8/bits/stl_list.h:1122
#6  0x00005555555d1dc0 in Server::_unlink_local_finish (this=0x555557055680, mdr=..., dn=0x5558e50845a0, straydn=0x55582c1981e0, dnpv=513) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/Server.cc:7025
#7  0x00005555558067df in Context::complete (r=0, this=0x555715b043c0) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/include/Context.h:77
#8  MDSContext::complete (this=0x555715b043c0, r=0) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/MDSContext.cc:29
#9  0x0000555555806a8f in MDSIOContextBase::complete (this=0x555715b043c0, r=0) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/MDSContext.cc:114
#10 0x0000555555806d54 in MDSLogContextBase::complete (this=<optimized out>, r=0) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/MDSContext.cc:123
#11 0x00007fffef02062d in Finisher::finisher_thread_entry() () from target:/usr/lib64/ceph/libceph-common.so.0
#12 0x00007fffecd7d14a in start_thread () from target:/lib64/libpthread.so.0
#13 0x00007fffeb896dc3 in clone () from target:/lib64/libc.so.6

#0  0x0000555555787d37 in CInode::is_ancestor_of (this=0x55580f021800, other=0x5557c1b1c300, other@entry=0x5557c9146700) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CDentry.h:150
#1  0x00005555557e4147 in SnapRealm::split_at (this=this@entry=0x555557ea0280, child=0x55584e1c1b80) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/SnapRealm.cc:593
#2  0x00005555557a1a5f in CInode::open_snaprealm (this=0x55580f021800, nosplit=<optimized out>) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.cc:2764
#3  0x00005555557a1e96 in CInode::pop_projected_snaprealm (this=0x55580f021800, next_snaprealm=0x55556425f140, early=<optimized out>) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.cc:586
#4  0x00005555557a2462 in CInode::early_pop_projected_snaprealm (this=this@entry=0x55580f021800) at /usr/include/c++/8/bits/stl_list.h:1122
#5  0x00005555555d1dc0 in Server::_unlink_local_finish (this=0x555556438500, mdr=..., dn=0x5558541e8f00, straydn=0x555946731860, dnpv=77222) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/Server.cc:7025
#6  0x00005555558067df in Context::complete (r=0, this=0x555943b55f90) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/include/Context.h:77
#7  MDSContext::complete (this=0x555943b55f90, r=0) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/MDSContext.cc:29
#8  0x0000555555806a8f in MDSIOContextBase::complete (this=0x555943b55f90, r=0) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/MDSContext.cc:114
#9  0x0000555555806d54 in MDSLogContextBase::complete (this=<optimized out>, r=0) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/MDSContext.cc:123
#10 0x00007fffef02062d in Finisher::finisher_thread_entry() () from target:/usr/lib64/ceph/libceph-common.so.0
#11 0x00007fffecd7d14a in start_thread () from target:/lib64/libpthread.so.0
#12 0x00007fffeb896dc3 in clone () from target:/lib64/libc.so.6

The total CPU utilization of the MDS around these times is perhaps 120%, out of which 100% is this fn_anonymous thread; the rest of the threads are quite idle - presumably waiting for this thread to do its work.

I'm measuring latency of the MDS using the perf stat counters - taking a snapshot of the counters every 5 seconds and looking at the latency counters like:

        "reply_latency": {
            "avgcount": 862377798,
            "sum": 11712188.751232780,
            "avgtime": 0.013581273
        },

and doing (sum2 - sum1) / (avgcount2 - avgcount1) over the interval to find the interval latency.
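
A minimal sketch of that calculation (assuming the counters are sampled with `ceph daemon mds.<name> perf dump`; the daemon name, counter path, and 5-second interval are placeholders and may differ by version):

    import json
    import subprocess
    import time

    MDS_DAEMON = "mds.a"   # placeholder daemon name
    INTERVAL = 5           # seconds between samples

    def sample_reply_latency(daemon=MDS_DAEMON):
        """Return (avgcount, sum) of the MDS reply_latency counter."""
        out = subprocess.check_output(["ceph", "daemon", daemon, "perf", "dump"])
        # counter path ("mds" section) may vary across releases
        counter = json.loads(out)["mds"]["reply_latency"]
        return counter["avgcount"], counter["sum"]

    def interval_latency(daemon=MDS_DAEMON, interval=INTERVAL):
        """Average reply latency over one interval:
        (sum2 - sum1) / (avgcount2 - avgcount1)."""
        count1, sum1 = sample_reply_latency(daemon)
        time.sleep(interval)
        count2, sum2 = sample_reply_latency(daemon)
        replies = count2 - count1
        return (sum2 - sum1) / replies if replies else 0.0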

When not doing removes, typical reply latencies are under 1 ms - even when processing over a thousand lookups per second. The MDS CPU utilization is also below 50%.

MDS:
    req/s:  3101.7,  reply/s:  3101.7,  lat:  0.0001
Latencies:
    req_create      :  req/s:   12.2,  lat:  0.0008
    req_getattr     :  req/s:   83.4,  lat:  0.0000
    req_lookup      :  req/s: 1102.1,  lat:  0.0000
    req_open        :  req/s:    8.3,  lat:  0.0105
    req_readdir     :  req/s:    1.0,  lat:  0.0010
    req_rename      :  req/s:    1.8,  lat:  0.0001
    req_setfilelock :  req/s: 1890.4,  lat:  0.0000
    req_unlink      :  req/s:    2.6,  lat:  0.0001

When the removes run, both the overall latency and the per-operation latencies increase by 2 to 3 orders of magnitude - and this fn_anonymous thread gets really busy. Since lookup latencies are in the tenths of a second, cephfs is pretty much unusable (listing directories sometimes takes a minute or more). Also, unlinks run at only about 20-30 per second:

MDS:
    req/s:  3916.6,  reply/s:  3917.0,  lat:  0.3086
Latencies:
    req_create      :  req/s:  426.4,  lat:  0.5452
    req_getattr     :  req/s:   58.3,  lat:  0.0269
    req_lookup      :  req/s: 3340.6,  lat:  0.2881
    req_mkdir       :  req/s:    0.6,  lat:  0.0417
    req_open        :  req/s:   51.1,  lat:  0.1990
    req_readdir     :  req/s:    3.1,  lat:  0.0338
    req_rename      :  req/s:    2.2,  lat:  0.3481
    req_rmdir       :  req/s:    3.3,  lat:  0.0545
    req_rmxattr     :  req/s:    0.2,  lat:  0.3395
    req_setattr     :  req/s:    0.4,  lat:  0.1154
    req_setfilelock :  req/s:    8.8,  lat:  0.0672
    req_unlink      :  req/s:   21.5,  lat:  0.0113

Since this problem started, I've disabled snapshot creation and removed all snapshots. Unfortunately this has not helped. Removes of trees that were created before snapshots existed are still painfully slow, while removes of trees created after snapshots were stopped run fast without any issues.

Also, all other parts of the cluster are healthy. The metadata pool is stored on separate nodes on NVMes, and those OSDs are quite lightly loaded - so the entire performance issue seems to be within the MDS.
I'm very happy to provide any additional details/logs or try potential solutions. Any help would be much appreciated.


Related issues 3 (2 open, 1 closed)

Copied to CephFS - Backport #65293: quincy: High cephfs MDS latency and CPU load with snapshots and unlink operations (In Progress, Patrick Donnelly)
Copied to CephFS - Backport #65294: reef: High cephfs MDS latency and CPU load with snapshots and unlink operations (Resolved, Patrick Donnelly)
Copied to CephFS - Backport #65295: squid: High cephfs MDS latency and CPU load with snapshots and unlink operations (In Progress, Patrick Donnelly)
Actions #1

Updated by Andras Pataki over 2 years ago

I've done some further debugging to understand the MDS performance problem that has been impacting us. The finisher thread (fn_anonymous) uses 100% CPU on remove operations - directory remove operations, it turns out. Stack traces look like this:

#0  0x0000555555787d37 in CInode::is_ancestor_of (this=0x55580f021800, other=0x5557c1b1c300, other@entry=0x5557c9146700) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CDentry.h:150
#1  0x00005555557e4147 in SnapRealm::split_at (this=this@entry=0x555557ea0280, child=0x55584e1c1b80) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/SnapRealm.cc:593
#2  0x00005555557a1a5f in CInode::open_snaprealm (this=0x55580f021800, nosplit=<optimized out>) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.cc:2764
#3  0x00005555557a1e96 in CInode::pop_projected_snaprealm (this=0x55580f021800, next_snaprealm=0x55556425f140, early=<optimized out>) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/CInode.cc:586
#4  0x00005555557a2462 in CInode::early_pop_projected_snaprealm (this=this@entry=0x55580f021800) at /usr/include/c++/8/bits/stl_list.h:1122
#5  0x00005555555d1dc0 in Server::_unlink_local_finish (this=0x555556438500, mdr=..., dn=0x5558541e8f00, straydn=0x555946731860, dnpv=77222) at /usr/src/debug/ceph-14.2.20-0.el8.x86_64/src/mds/Server.cc:7025

In SnapRealm::split_at, there are two loops that seem to be the root cause of the MDS performance problem:

  dout(10) << " open_children are " << open_children << dendl;
  for (set<SnapRealm*>::iterator p = open_children.begin();
       p != open_children.end(); ) {
   ...

and

  // split inodes_with_caps
  for (elist<CInode*>::iterator p = inodes_with_caps.begin(member_offset(CInode, item_caps));
       !p.end(); ) {
    CInode *in = *p;
    ++p;
    // does inode fall within the child realm?
    if (child->inode->is_ancestor_of(in)) {
      dout(20) << " child gets " << *in << dendl;
      in->move_to_realm(child);
    } else {
      dout(20) << "    keeping " << *in << dendl;
    }
  }

The second loop goes through hundreds of thousands of inodes and does nothing (it ends in the else branch of the if, since the inodes aren't related to the directory being removed). For the trace cases I looked at, the first loop over open_children also iterates over more than ten thousand entries. It looks like these two loops are responsible for the tens of milliseconds of time spent.

This is done in the finisher thread, of which there is only one, and it looks like many MDS operations go through it to be committed. As a consequence, high latency on directory removes makes the whole MDS quite unresponsive.

Questions I still don't understand:
  • Do we have to call open_snaprealm/split_at for each directory that is removed? If not, when is it called? For our use case it seems to be called really a lot (especially given how heavy a loop it needs to run).
  • Do these calls have to hold up all other operations in the MDS?
  • How would this work with a relatively large MDS cache with hundreds of clients?

And most importantly - what are practical ways we could mitigate/triage this problem?

Actions #2

Updated by Patrick Donnelly over 2 years ago

Thanks for opening a ticket and doing some digging, Andras.

Andras Pataki wrote:

I've done some further debugging to understand the MDS performance problem that has been impacting us. The finisher thread (fn_anonymous) uses 100% CPU on remove operations - directory remove operations, it turns out. Stack traces look like this:

[...]

In SnapRealm::split_at, there are two loops that seem to be the root cause of the MDS performance problem:

[...]

and

[...]

The second loop goes through hundreds of thousands of inodes and does nothing (it ends in the else branch of the if, since the inodes aren't related to the directory being removed). For the trace cases I looked at, the first loop over open_children also iterates over more than ten thousand entries. It looks like these two loops are responsible for the tens of milliseconds of time spent.

Yes, it seems so.

This is done in the finisher thread, which there is only one of, and looks like many MDS operations go through it to be committed. As a consequence, high latency on directory removes makes the whole MDS quite unresponsive.

Questions I still don't understand:
  • Do we have to call open_snaprealm/split_at for each directory that is removed? If not, when is it called? For our use case it seems to be called really a lot (especially given how heavy a loop it needs to run).

I'm not sure but it seems at least we could optimize the trivial case of an empty directory.

  • Do these calls have to hold up all other operations in the MDS?

The fair_mutex I mentioned in the mailing list thread would help but you'd still have a long queue of operations in the finisher thread which would slow down other things.

  • How would this work with a relatively large MDS cache with hundreds of clients?

A large cache adds to the work because, for a snapshot that covers the whole file system, the MDS will look at every inode in cache.

And most importantly - what are practical ways we could mitigate/triage this problem?

Well, you can ask users to move the subtrees to a global trash directory rather than `rm -rf` (yuck!). Then write a script that gently deletes it.
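
For illustration, a minimal sketch of such a gentle-delete script (depth-first removal with a simple throttle; the trash path and rate are assumptions, not values recommended in this thread):

    import os
    import time

    TRASH_DIR = "/mnt/cephfs/.trash"   # assumed global trash directory
    UNLINKS_PER_SEC = 50               # assumed throttle, tune for your MDS

    def gentle_delete(root=TRASH_DIR, rate=UNLINKS_PER_SEC):
        """Remove a tree bottom-up, pausing between operations to limit MDS load."""
        delay = 1.0 / rate
        for dirpath, dirnames, filenames in os.walk(root, topdown=False):
            for name in filenames:
                os.unlink(os.path.join(dirpath, name))
                time.sleep(delay)
            for name in dirnames:
                os.rmdir(os.path.join(dirpath, name))
                time.sleep(delay)

    if __name__ == "__main__":
        gentle_delete()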

I will try to reproduce this and create a fix.

Thanks Andras.

Actions #3

Updated by Patrick Donnelly over 2 years ago

Just want to share an update that I was able to reproduce the problem. I'll test out a fix soon...

Actions #4

Updated by Andras Pataki over 2 years ago

That's great news! I'm also trying to set up a test environment to reproduce the issue outside our main cluster (with an eye to testing a potential fix there). Thanks Patrick.

Actions #5

Updated by Patrick Donnelly over 2 years ago

  • Status changed from New to Fix Under Review
  • Assignee set to Patrick Donnelly
  • Target version set to v17.0.0
  • Source set to Community (user)
  • Backport set to pacific,octopus
  • Pull request ID set to 43911
  • Component(FS) MDS added
  • Labels (FS) snapshots added
Actions #6

Updated by Patrick Donnelly over 2 years ago

Andras, I think there are other inefficiencies I've not yet identified but this fix addresses your specific problem.

Actions #7

Updated by Andras Pataki over 2 years ago

Thanks Patrick! How safe do you feel this patch is? Does it need a lot of testing or is it safe to deploy?

Actions #8

Updated by Patrick Donnelly over 2 years ago

Andras Pataki wrote:

Thanks Patrick! How safe do you feel this patch is? Does it need a lot of testing or is it safe to deploy?

It is safe to deploy in my opinion.

Actions #9

Updated by Greg Farnum about 2 years ago

Hmm, the stuff in split_at looks like we can and should just swap the logic -- instead of iterating over all inodes in the parent snaprealm, we should look at all the open inodes underneath the newly-created snaprealm's inode. That logic is slightly more involved (we have to descend the hierarchy instead of iterating through a list) but should not be a big deal.

There are some other uses of the inodes_with_caps list in MDCache which may not all be so easy, but I didn't look at them in as much detail. We're doing notifies to the clients of all inodes in do_realm_invalidate_and_update_notify, for instance, but it looks like that only applies to children of things we're renaming et al -- which is both necessary and also generally a smaller set.
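
To make the proposed swap of the split_at iteration concrete, here is a toy sketch in plain Python (not MDS code; the class and field names are invented for illustration): instead of scanning the parent realm's full inodes_with_caps list and testing ancestry for each inode, we descend only the cached subtree under the new realm's inode.

    class Inode:
        """Toy stand-in for CInode: a cached inode with parent/children links."""
        def __init__(self, parent=None, has_caps=False):
            self.parent = parent
            self.children = []
            self.has_caps = has_caps
            if parent:
                parent.children.append(self)

        def is_ancestor_of(self, other):
            while other:
                if other is self:
                    return True
                other = other.parent
            return False

    # Roughly what SnapRealm::split_at does today: walk the parent realm's
    # entire inodes_with_caps list and test ancestry for every inode.
    def split_current(inodes_with_caps, child_realm_root):
        return [i for i in inodes_with_caps if child_realm_root.is_ancestor_of(i)]

    # The suggested swap: descend only the subtree under the new realm's
    # inode and collect the cached inodes that hold caps.
    def split_proposed(child_realm_root):
        moved, stack = [], [child_realm_root]
        while stack:
            node = stack.pop()
            if node.has_caps:
                moved.append(node)
            stack.extend(node.children)
        return moved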

Actions #10

Updated by Laura Flores over 1 year ago

This topic was discussed during the User + Dev meeting on Aug. 8th, 2022. One takeaway from the meeting (as per Greg) was that unmounting clients at night could help.

Greg also recommended that Andras try out the methods mentioned in note#9, and update the Tracker with any additional findings.

I also looked into Patrick's PR, and I see it's currently in draft form. The last comment there was:

From Patrick:

So this doesn't work because directories may be "empty" because there are no HEAD dentries but some snapshotted dentries exist, i.e. CDir::get_num_snap_items() > 0. So there will be some inodes with caps, potentially, that refer to the snapshotted inodes associated with those dentries.

The solution, I believe, is to create a hierarchical data structure part of CDir which allows tracking child inodes with caps or something that allows us to optimize the common cases (small directories with few or 0 snapshotted dentries).

So it looks like the final solution may just need some more reviews/refactoring.

Actions #11

Updated by Florian Pritz about 1 year ago

Has there been any progress on this in the last year? We believe we are hitting the same issue on a production cluster (all NVMe OSDs, v16.2.11), but so far we are unable to reproduce it with an artificial test (see below) on the same cluster.

The symptoms we see are very bad `rm -rf` performance (a handful of ops/s) and also bad access performance for other clients on the same cluster. At the same time we see 100% CPU usage on the active MDS node. Prometheus metrics show that `rmdir` operations that usually complete in 1-2 ms instead take 20 ms, 60 ms, or in one case even 350 ms (for many minutes without change).

One single example from an MDS log dump:

2023-04-18T16:45:18.130+0200 7f8d6e33b700  4 mds.0.server handle_client_request client_request(client.188008615:5997200 readdir #0x1000344befc 2023-04-18T16:45:18.129945+0200 caller_uid=0, caller_gid=0{}) v2
2023-04-18T16:45:18.214+0200 7f8d6732d700  5 mds.0.log _submit_thread 29859233665684~2795 : EUpdate scatter_writebehind [metablob 0x1000344befc, 2 dirs]
2023-04-18T16:45:18.290+0200 7f8d6732d700  5 mds.0.log _submit_thread 29859233668499~5378 : EUpdate unlink_local [metablob 0x1000344befc, 4 dirs]
2023-04-18T16:45:18.294+0200 7f8d6e33b700  4 mds.0.server handle_client_request client_request(client.188008615:5997207 rmdir #0x1000344befc/818 2023-04-18T16:45:18.289945+0200 caller_uid=0, caller_gid=0{}) v2

For the artificial test we create 400k directories (4000 first level, 15 next level, then 3 more levels with 1 directory each, then a file with some random data in each leaf directory) and then run `rm -rf` on the top level directory. However, even if I create and remove a snapshot at the root level of the filesystem and run the `rm -rf` after that, I cannot seem to trigger the bug.
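
For reference, a minimal sketch of the test tree described above (base path, directory names, and file size are assumptions):

    import os

    BASE = "/mnt/cephfs/rm-test"   # assumed test path
    TOP, SECOND, EXTRA_LEVELS = 4000, 15, 3
    FILE_SIZE = 4096               # assumed size of the random data per leaf

    def build_tree(base=BASE):
        for i in range(TOP):
            for j in range(SECOND):
                path = os.path.join(base, f"d{i:04d}", f"s{j:02d}")
                # three more levels with a single directory each
                for k in range(EXTRA_LEVELS):
                    path = os.path.join(path, f"l{k}")
                os.makedirs(path, exist_ok=True)
                with open(os.path.join(path, "data.bin"), "wb") as f:
                    f.write(os.urandom(FILE_SIZE))

    if __name__ == "__main__":
        build_tree()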

A failover to a standby MDS seems to temporarily fix the problem, but I assume that's mostly because the cache starts out empty, since the problem comes back eventually.

Does anyone have a reliable way to reproduce this (Patrick mentioned he can repro) or some insight into which parameters affect the likelihood of this happening? Or, even better, has any of the mentioned refactoring happened in other places, and would this bug be fixed in newer versions? We are currently running Ceph 16.2.11, but it appears that an upgrade to the latest 17.2.x would not yet fix this, correct?

Actions #12

Updated by Florian Pritz about 1 year ago

The log I copied is not actually showing the rmdir latency since I filtered incorrectly, sorry about that. This one is just a sample from a grep for `client.188008615` showing consecutive operations from a single client:

2023-04-18T16:45:20.922+0200 7f8d6e33b700  4 mds.0.server handle_client_request client_request(client.188008615:5997269 rmdir #0x10003671e40/d9e 2023-04-18T16:45:20.917957+0200 caller_uid=0, caller_gid=0{}) v2
2023-04-18T16:45:21.014+0200 7f8d6e33b700  4 mds.0.server handle_client_request client_request(client.188008615:5997270 rmdir #0x10003671e3f/daa 2023-04-18T16:45:21.013957+0200 caller_uid=0, caller_gid=0{}) v2
2023-04-18T16:45:21.106+0200 7f8d6e33b700  4 mds.0.server handle_client_request client_request(client.188008615:5997271 rmdir #0x10003671e3e/3f4 2023-04-18T16:45:21.101958+0200 caller_uid=0, caller_gid=0{}) v2
2023-04-18T16:45:21.194+0200 7f8d6e33b700  4 mds.0.server handle_client_request client_request(client.188008615:5997272 rmdir #0x1000353cc7f/0e3 2023-04-18T16:45:21.193958+0200 caller_uid=0, caller_gid=0{}) v2
2023-04-18T16:45:21.282+0200 7f8d6e33b700  4 mds.0.server handle_client_request client_request(client.188008615:5997273 rmdir #0x1000343178d/371 2023-04-18T16:45:21.281959+0200 caller_uid=0, caller_gid=0{}) v2
2023-04-18T16:45:21.370+0200 7f8d6e33b700  4 mds.0.server handle_client_request client_request(client.188008615:5997274 readdir #0x1000401b907 2023-04-18T16:45:21.369959+0200 caller_uid=0, caller_gid=0{}) v2
2023-04-18T16:45:21.374+0200 7f8d6e33b700  4 mds.0.server handle_client_request client_request(client.188008615:5997275 readdir #0x1000401b908 2023-04-18T16:45:21.369959+0200 caller_uid=0, caller_gid=0{}) v2
2023-04-18T16:45:21.374+0200 7f8d6e33b700  4 mds.0.server handle_client_request client_request(client.188008615:5997276 readdir #0x1000401b909 2023-04-18T16:45:21.373959+0200 caller_uid=0, caller_gid=0{}) v2
2023-04-18T16:45:21.378+0200 7f8d6e33b700  4 mds.0.server handle_client_request client_request(client.188008615:5997277 readdir #0x1000401b90a 2023-04-18T16:45:21.373959+0200 caller_uid=0, caller_gid=0{}) v2
2023-04-18T16:45:21.378+0200 7f8d6e33b700  4 mds.0.server handle_client_request client_request(client.188008615:5997278 readdir #0x1000401b90b 2023-04-18T16:45:21.377959+0200 caller_uid=0, caller_gid=0{}) v2
Actions #13

Updated by Konstantin Shalygin 5 months ago

  • Target version changed from v17.0.0 to v19.0.0
  • Backport changed from pacific,octopus to reef quincy pacific
Actions #14

Updated by Venky Shankar 3 months ago

Hi Andras,

Nice to hear from you in the User/Dev monthly meet-up. You had a question about what exactly the SnapRealm::split_at() function does (which is the source of latency when snapshots are taken in cephfs) - I replied to it in the PR itself: https://github.com/ceph/ceph/pull/43911#pullrequestreview-883184649

Copying it here for convenience:

--
OK, so the "parent" snaprealm tracks all the inodes under its tree which have caps, and the root directory always has a snaprealm, which implies that every inode under it used by a client is in the inodes_with_caps list for the root directory. And then, when a child snapshot is created, split_at iterates through the entire list - probably millions of entries - and "moves" those inodes which now fall under the child snap realm.
--

As you can see, even when no snapshots exist, the root snaprealm tracks all inodes which have (client) caps (SnapRealm::inodes_with_caps), and when a snapshot is taken anywhere in the tree, this (potentially huge) list is iterated to move the inodes which now fall under the child snaprealm.

If these inodes (which have caps) were tracked at the level of the directory under which they exist, the iteration when splitting a snaprealm could be avoided; however, there is also the case where these scattered lists would need to be merged (gathered).

Actions #15

Updated by Patrick Donnelly 3 months ago

  • Backport changed from reef quincy pacific to squid,reef,quincy
  • Pull request ID changed from 43911 to 55768
Actions #16

Updated by Abhishek Lekshmanan 2 months ago

Hi Venky, Patrick

Further to our talk, we saw the MDS accumulating a lot of log segments and crashing in an up:replay/resolve loop. I reported what we saw as a new issue; let us know if this is somehow related and what we can do to move things along:
https://tracker.ceph.com/issues/64717

Actions #17

Updated by Venky Shankar 2 months ago

Hi Andras,

Patrick has a proposed fix that optimizes the iteration - https://github.com/ceph/ceph/pull/55768

I ran it through the fs suite and the qa run looks good. Patrick did a sample benchmark with this change, and the results show an improvement in the split path. See the commit details in a0ccb79fa0806792c7ee666c667328a8aeb09e97 in the above PR.

I'll be merging the change sometime this week, once we get some of the unrelated qa failures sorted out. If you are up for it, feel free to cherry-pick the changes and run them through the workload (or something similar to it) which initially prompted the creation of this tracker :)

Actions #18

Updated by Venky Shankar 2 months ago

Abhishek Lekshmanan wrote:

Hi Venky, Patrick

Further to our talk, we saw the MDS accumulating a lot of log segments and crashing in an up:replay/resolve loop. I reported what we saw as a new issue; let us know if this is somehow related and what we can do to move things along:
https://tracker.ceph.com/issues/64717

Thanks for the report, Abhishek. I've left some comments on the above tracker. Let's continue the discussion there.

Actions #19

Updated by Venky Shankar about 2 months ago

  • Status changed from Fix Under Review to Pending Backport

PR https://github.com/ceph/ceph/pull/55768 has been merged into main. We will be backporting it to the supported releases, since the change optimizes the problematic loop (SnapRealm::split_at) and improvements were observed in test runs.

Actions #20

Updated by Backport Bot about 2 months ago

  • Copied to Backport #65293: quincy: High cephfs MDS latency and CPU load with snapshots and unlink operations added
Actions #21

Updated by Backport Bot about 2 months ago

  • Copied to Backport #65294: reef: High cephfs MDS latency and CPU load with snapshots and unlink operations added
Actions #22

Updated by Backport Bot about 2 months ago

  • Copied to Backport #65295: squid: High cephfs MDS latency and CPU load with snapshots and unlink operations added
Actions #23

Updated by Backport Bot about 2 months ago

  • Tags set to backport_processed