Bug #40700: memory usage of: radosgw-admin bucket rm - rgw - Ceph

Bug #40700

closed

memory usage of: radosgw-admin bucket rm

Added by Harald Staub almost 5 years ago. Updated over 3 years ago.

Status: Resolved
Priority: High
Assignee: Mark Kogan
Target version: Ceph - v15.0.0
Backport: nautilus,mimic,luminous
Severity: 3 - minor
Pull request ID: 30174

Description

The cluster runs Nautilus 14.2.1 with 500 BlueStore OSDs. Both RadosGW pools involved here (data and index) are replicated and use no SSDs.

Steps that led to the problem:

1. There is a bucket $BIG_BUCKET with about 60 M objects, with 1024 shards.

2. radosgw-admin bucket rm --bucket=$BIG_BUCKET --bypass-gc --purge-objects

3. After several hours, the removal command was killed by the out-of-memory killer. The graphs show a continuous increase in this process's memory usage of about 24 GB per day, at a removal rate of about 3 M objects per day.

So for this bucket with 60 M objects, the removal would take about 20 days (60 M / 3 M per day) and need about 480 GB of RAM (20 days x 24 GB per day) to complete.
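
For reference, the estimate above can be reproduced with a small sketch. The function names are made up for illustration, 1 GB is taken as 10^9 bytes, and growth is assumed to stay linear until the bucket is empty, which nothing in this report confirms.

```cpp
// Back-of-the-envelope projection from the approximate figures in the
// description. Assumption: memory growth stays linear for the whole run.

// Projected memory (GB) accumulated by the time the bucket is empty.
constexpr double projected_memory_gb(double total_objects,
                                     double objects_removed_per_day,
                                     double memory_growth_gb_per_day) {
  const double days_to_finish = total_objects / objects_removed_per_day;
  return days_to_finish * memory_growth_gb_per_day;
}

// Bytes retained per removed object (assuming 1 GB = 1e9 bytes).
constexpr double bytes_retained_per_object(double objects_removed_per_day,
                                           double memory_growth_gb_per_day) {
  return memory_growth_gb_per_day * 1e9 / objects_removed_per_day;
}
```

Calling projected_memory_gb(60e6, 3e6, 24.0) gives the 480 GB quoted above, and bytes_retained_per_object(3e6, 24.0) works out to roughly 8 KB per removed object, which hints that some per-object state is being kept rather than freed.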

Expected behaviour:

Bucket removal with radosgw-admin should run within a bounded amount of memory, even for buckets with very many objects.

Some additional information:

The killed removal command can simply be run again, but it is killed again before it finishes. Each rerun also has to run for some time before it resumes actually removing objects, and this "wait time" keeps increasing: last time, with about 16 M objects already removed, it was nearly 9 hours. Memory also ramps up during the wait, though less steeply.

Harry

Related issues 3 (0 open — 3 closed)

Updated by Casey Bodley almost 5 years ago

Priority changed from Normal to High

Updated by Paul Emmerich almost 5 years ago

I've also got two clusters here with this problem, one is running 14.2.1 (50M objects in a bucket) and one 13.2.5 (450M objects in a bucket).

It looks like radosgw-admin uses libc malloc, so it's hard to say what the memory is being used for.

Updated by Casey Bodley almost 5 years ago

Status changed from New to 12
Assignee set to J. Eric Ivancich

Updated by Casey Bodley almost 5 years ago

Assignee changed from J. Eric Ivancich to Mark Kogan

Updated by Mark Kogan almost 5 years ago

While investigating this issue, I found that it is possible to reduce the "wait time" that grows after each run of

radosgw-admin bucket rm --bucket=$BIG_BUCKET --bypass-gc --purge-objects

by running

radosgw-admin bucket check --bucket=$BIG_BUCKET --fix

between successive radosgw-admin bucket rm runs.

Updated by J. Eric Ivancich almost 5 years ago

That's interesting, Mark!

So the bucket index is left in an unsynchronized state (i.e., original state) when bucket removal is terminated part-way through. And then when bucket removal is restarted, it begins by trying to re-remove those same objects at the head of the bucket index all over again, causing a delay before forward progress is made.

Since the bucket removal is generally expected to complete, there "should" be no need to update the bucket index at "check-points" during the bucket removal process.

If terminating bucket removal is semi-expected (possibly through manual admin intervention), it seems that updating the index after every 100,000 to 1,000,000 objects are removed would mitigate this without creating a lot of overhead.

And would there be any benefit to removing the objects from back to front in the bucket index? In other words, is there an easy way to truncate the index of its tail members, making the update of the bucket index quick?

Updated by Mark Kogan over 4 years ago

Pull request ID set to 30174

Updated by Mark Kogan over 4 years ago

Update: I found the source of the memory growth:

src/rgw/rgw_rados.cc

RGWObjState *RGWObjectCtx::get_state(const rgw_obj& obj) {
  RGWObjState *result;
  typename std::map<rgw_obj, RGWObjState>::iterator iter;
  lock.lock_shared();
  assert (!obj.empty());
  iter = objs_state.find(obj);
  if (iter != objs_state.end()) {
    result = &iter->second;
    lock.unlock_shared();
  } else {
    lock.unlock_shared();
    lock.lock();
    result = &objs_state[obj];    // <-- source of the memory growth: operator[] inserts a new entry
    lock.unlock();
  }
  return result;
}

Submitted a PR with a proposed fix.
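
To illustrate the pattern in the marked line, here is a minimal, self-contained sketch (C++17). ObjState, ObjectCtx, invalidate() and entries_after_removing() are simplified stand-ins invented for this example, not the actual RGW types. Erasing the entry after each object is just one way to keep the map bounded; it is not necessarily what the PR does.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

// Hypothetical stand-in for RGWObjState; the real struct carries far more.
struct ObjState {
  bool exists = false;
  std::string attrs;
};

// Hypothetical stand-in for RGWObjectCtx, keyed by object name.
class ObjectCtx {
 public:
  ObjState* get_state(const std::string& obj) {
    assert(!obj.empty());
    {
      std::shared_lock<std::shared_mutex> rl(lock_);
      auto it = objs_state_.find(obj);
      if (it != objs_state_.end()) return &it->second;
    }
    std::unique_lock<std::shared_mutex> wl(lock_);
    // operator[] default-constructs and inserts an entry when the key is
    // absent, so every object seen for the first time adds one map node.
    return &objs_state_[obj];
  }

  // Drops the cached state for one object (assumed mitigation).
  void invalidate(const std::string& obj) {
    std::unique_lock<std::shared_mutex> wl(lock_);
    objs_state_.erase(obj);
  }

  std::size_t cached() const {
    std::shared_lock<std::shared_mutex> rl(lock_);
    return objs_state_.size();
  }

 private:
  mutable std::shared_mutex lock_;
  std::map<std::string, ObjState> objs_state_;
};

// Simulates touching n distinct objects through one long-lived context and
// returns how many entries are still cached afterwards.
std::size_t entries_after_removing(std::size_t n, bool invalidate_each) {
  ObjectCtx ctx;
  for (std::size_t i = 0; i < n; ++i) {
    const std::string name = "obj-" + std::to_string(i);
    ctx.get_state(name)->exists = true;
    if (invalidate_each) ctx.invalidate(name);
  }
  return ctx.cached();
}
```

With invalidate_each set to false, the context ends up holding one entry per object visited, which matches the linear growth seen in the graphs. With it set to true, the map stays empty between objects.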

Updated by J. Eric Ivancich over 4 years ago

Status changed from 12 to 17
Target version set to v15.0.0
Backport set to nautilus,mimic,luminous

Updated by Abhishek Lekshmanan over 4 years ago

Status changed from 17 to 7

Updated by J. Eric Ivancich over 4 years ago

Status changed from 7 to Pending Backport

Updated by Nathan Cutler over 4 years ago

Copied to Backport #41858: nautilus: memory usage of: radosgw-admin bucket rm added

Updated by Nathan Cutler over 4 years ago

Copied to Backport #41859: mimic: memory usage of: radosgw-admin bucket rm added

Updated by Nathan Cutler over 4 years ago

Copied to Backport #41860: luminous: memory usage of: radosgw-admin bucket rm added

Updated by Nathan Cutler over 3 years ago

Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
