Bug #44660


Multipart re-uploads cause orphan data

Added by Chris Jones about 4 years ago. Updated 4 months ago.

Status: Pending Backport
Priority: Normal
Assignee: -
Target version:
% Done: 0%
Source:
Tags: multipart gc backport_processed
Backport: pacific quincy reef
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The impact of this issue is the accumulation of large amounts (hundreds of TB) of orphan data, which we are unable to clean up because the orphans find tool is impractical to run on a 2PB cluster: it takes too long and it leaks memory. After several weeks of running, the orphans find tool consumes up to 1TB of RAM and then terminates with an out-of-memory error.

For background information, this issue is reproducible on ANY known version of Ceph from Jewel onward, including recent versions of Nautilus we tested as of a few months ago.

It is related to the following item, which was opened around 3 years ago, but it has not received any attention in quite some time.

https://tracker.ceph.com/issues/16767

In a nutshell, the one certain condition under which this occurs is when re-uploading one or more parts of a multipart upload as documented in the issue above.

There is an attached bash script (ceph-leaked-mp-populater.sh) that should be able to recreate the issue on any Ceph version from Jewel onward, including the ceph-daemon Docker containers.
The results below are from running the ceph-leaked-mp-populater.sh script.
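
The attached script is not reproduced here, but a minimal sketch of the triggering sequence, using the AWS CLI against an RGW endpoint (the mybucket/mymp1 names match the output below; $RGW_URL and the configured credentials are assumptions), would look roughly like this:

# create the bucket and a 5MB part file
aws --endpoint-url "$RGW_URL" s3 mb s3://mybucket
dd if=/dev/urandom of=part.bin bs=1M count=5

# start ONE multipart upload and keep its upload id
UPLOAD_ID=$(aws --endpoint-url "$RGW_URL" s3api create-multipart-upload \
  --bucket mybucket --key mymp1 --query UploadId --output text)

PARTS=""
for n in 1 2 3 4 5; do
  # upload each part once ...
  aws --endpoint-url "$RGW_URL" s3api upload-part --bucket mybucket --key mymp1 \
    --part-number "$n" --upload-id "$UPLOAD_ID" --body part.bin >/dev/null
  # ... then re-upload the SAME part number with the SAME upload id; the first copy
  # of each part is what ends up leaked in the data pool
  ETAG=$(aws --endpoint-url "$RGW_URL" s3api upload-part --bucket mybucket --key mymp1 \
    --part-number "$n" --upload-id "$UPLOAD_ID" --body part.bin --query ETag --output text)
  PARTS="$PARTS{\"ETag\":$ETAG,\"PartNumber\":$n},"
done

# complete the upload with the ETags from the second round of uploads
aws --endpoint-url "$RGW_URL" s3api complete-multipart-upload --bucket mybucket --key mymp1 \
  --upload-id "$UPLOAD_ID" --multipart-upload "{\"Parts\":[${PARTS%,}]}"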

  1. radosgw-admin bucket stats
    NOTES:
    This bucket was empty prior to uploading ONE multipart file of 25MB, with 5 parts of 5MB each, and with each of the 5 parts re-uploaded using the same upload ID before completing the multipart upload.
    This bucket should contain only ONE object (the completed file) and should be only 25MB in size; however, it also reflects the leaked parts and shows an additional 25MB.

root@rgw09:~# radosgw-admin bucket stats -b mybucket {
"bucket": "mybucket",
"zonegroup": "1d0e4456-f8d2-4e4c-abc1-1db8e834507d",
"placement_rule": "default-placement",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
},
"id": "3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2",
"marker": "3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2",
"index_type": "Normal",
"owner": "admin",
"ver": "0#1,1#1,2#1,3#1,4#27,5#1,6#1,7#1,8#1,9#1,10#1,11#1,12#1",
"master_ver": "0#0,1#0,2#0,3#0,4#0,5#0,6#0,7#0,8#0,9#0,10#0,11#0,12#0",
"mtime": "2020-03-17 14:35:44.895102",
"max_marker": "0#,1#,2#,3#,4#,5#,6#,7#,8#,9#,10#,11#,12#",
"usage": {
"rgw.main": {
"size": 52428800,
"size_actual": 52428800,
"size_utilized": 52428800,
"size_kb": 51200,
"size_kb_actual": 51200,
"size_kb_utilized": 51200,
"num_objects": 6
},
"rgw.multimeta": {
"size": 0,
"size_actual": 0,
"size_utilized": 0,
"size_kb": 0,
"size_kb_actual": 0,
"size_kb_utilized": 0,
"num_objects": 0
}
},
"bucket_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
}
}

  2. radosgw-admin bucket list
    NOTE:
    This output illustrates that there are invalid objects remaining in the bucket after the multipart upload is completed with re-uploaded parts.
    The objects with the "_multipart" prefix are entries in the index pool that are added when the parts are re-uploaded. Note that the SAME UPLOAD ID was used for the re-upload; however, the created index entry reflects a new/different upload ID for each of the parts. This is the source of the incorrect bucket stats.
    There is a correlating set of objects in the data pool representing the ORIGINALLY UPLOADED PARTS (with the ORIGINAL UPLOAD ID) that are not removed. This is the source of the orphan data in the data pool.
    To restate:
    The index pool contains entries with A DIFFERENT UPLOAD ID than the one used to actually upload the object, and these are not cleaned up on completion of the upload.
    The data pool retains the ORIGINAL parts from the first uploads of those parts, and these are not deleted when the multipart upload is completed.

root@rgw09:~# radosgw-admin bucket list -b mybucket
[ {
"name": "_multipart_mymp1.7CZU8PpPYSF8lfuu6iyVXkMqHGowZ_v.3",
"instance": "",
"ver": {
"pool": 24,
"epoch": 537
},
"locator": "",
"exists": "true",
"meta": {
"category": 1,
"size": 5242880,
"mtime": "2020-03-17 14:36:16.286084Z",
"etag": "9e0372262b7b4b72a47e019b4bd1c890",
"owner": "admin",
"owner_display_name": "administrator",
"content_type": "",
"accounted_size": 5242880,
"user_data": ""
},
"tag": "_yD_noQmPhQ_mHaPhTpsC2A2H6czMaCo",
"flags": 0,
"pending_map": [],
"versioned_epoch": 0
}, {
"name": "_multipart_mymp1.7QV1hCCj4Zx12G6Cz8CmaQGx4_Uvd97.5",
"instance": "",
"ver": {
"pool": 24,
"epoch": 739
},
"locator": "",
"exists": "true",
"meta": {
"category": 1,
"size": 5242880,
"mtime": "2020-03-17 14:36:24.108514Z",
"etag": "9572a6537c6d5e14edffa1b1e6a34b72",
"owner": "admin",
"owner_display_name": "administrator",
"content_type": "",
"accounted_size": 5242880,
"user_data": ""
},
"tag": "_rMiNuSVe5_8hsMB-IQ5O9ecU25BXgVX",
"flags": 0,
"pending_map": [],
"versioned_epoch": 0
}, {
"name": "_multipart_mymp1.XGyHz_JrQaqlwhDIHfLgfcY7ms7ppe4.4",
"instance": "",
"ver": {
"pool": 24,
"epoch": 542
},
"locator": "",
"exists": "true",
"meta": {
"category": 1,
"size": 5242880,
"mtime": "2020-03-17 14:36:20.180951Z",
"etag": "30a9025baa371b42241a453d183f98d2",
"owner": "admin",
"owner_display_name": "administrator",
"content_type": "",
"accounted_size": 5242880,
"user_data": ""
},
"tag": "_HirjTlD2ccPbmMru_zQ1kDW2KpyFUX1",
"flags": 0,
"pending_map": [],
"versioned_epoch": 0
}, {
"name": "_multipart_mymp1.Yf--dgWYbAirt3Vm1RblGLOdZe8__bZ.1",
"instance": "",
"ver": {
"pool": 24,
"epoch": 234
},
"locator": "",
"exists": "true",
"meta": {
"category": 1,
"size": 5242880,
"mtime": "2020-03-17 14:36:08.470327Z",
"etag": "bbae3bd3fa90f2df80d29eccf57635bc",
"owner": "admin",
"owner_display_name": "administrator",
"content_type": "",
"accounted_size": 5242880,
"user_data": ""
},
"tag": "_SynsNOIYTQOvg7czESjjJhqSt9Mp2um",
"flags": 0,
"pending_map": [],
"versioned_epoch": 0
}, {
"name": "_multipart_mymp1.iNL-YL9K92asvy2L6OX0QdNNncTvnW6.2",
"instance": "",
"ver": {
"pool": 24,
"epoch": 212
},
"locator": "",
"exists": "true",
"meta": {
"category": 1,
"size": 5242880,
"mtime": "2020-03-17 14:36:12.382395Z",
"etag": "8de31bf63197b779138d4eb661b3047d",
"owner": "admin",
"owner_display_name": "administrator",
"content_type": "",
"accounted_size": 5242880,
"user_data": ""
},
"tag": "_NIk9FcgAcX4sIV2Jpe949wztMNKwAEL",
"flags": 0,
"pending_map": [],
"versioned_epoch": 0
}, {
"name": "mymp1",
"instance": "",
"ver": {
"pool": 24,
"epoch": 223
},
"locator": "",
"exists": "true",
"meta": {
"category": 1,
"size": 26214400,
"mtime": "2020-03-17 14:36:25.239287Z",
"etag": "37f7055208439684f087af5d7746ccad-5",
"owner": "admin",
"owner_display_name": "administrator",
"content_type": "",
"accounted_size": 26214400,
"user_data": ""
},
"tag": "3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.6857325",
"flags": 0,
"pending_map": [],
"versioned_epoch": 0
}
]

  3. Showing that all of the original (should-have-been-replaced) parts are still in the data pool and will never be removed by Ceph. These are the objects that are incorrectly bloating our data pool.
    In the following listing, the data pool objects that have the ORIGINAL UPLOAD ID are the invalid pieces. These are the originally uploaded parts of the multipart upload; they were replaced by the items having the different/unique upload IDs.

The original invalid objects are noted by me in the output below:

root@rgw09:~# rados -p default.rgw.buckets.data ls | grep '3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2' | sort
3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2_mymp1
3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__multipart_mymp1.7CZU8PpPYSF8lfuu6iyVXkMqHGowZ_v.3 <--- These are the replacement parts that are now the valid ones
3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__multipart_mymp1.7QV1hCCj4Zx12G6Cz8CmaQGx4_Uvd97.5 <--- These are the replacement parts that are now the valid ones
3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__multipart_mymp1.iNL-YL9K92asvy2L6OX0QdNNncTvnW6.2 <--- These are the replacement parts that are now the valid ones
3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__multipart_mymp1.XGyHz_JrQaqlwhDIHfLgfcY7ms7ppe4.4 <--- These are the replacement parts that are now the valid ones
3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__multipart_mymp1.Yf--dgWYbAirt3Vm1RblGLOdZe8__bZ.1 <--- These are the replacement parts that are now the valid ones
3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__shadow_mymp1.7CZU8PpPYSF8lfuu6iyVXkMqHGowZ_v.3_1 <--- These are the replacement parts that are now the valid ones
3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__shadow_mymp1.7QV1hCCj4Zx12G6Cz8CmaQGx4_Uvd97.5_1 <--- These are the replacement parts that are now the valid ones
3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__shadow_mymp1.iNL-YL9K92asvy2L6OX0QdNNncTvnW6.2_1 <--- These are the replacement parts that are now the valid ones
3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__shadow_mymp1.XGyHz_JrQaqlwhDIHfLgfcY7ms7ppe4.4_1 <--- These are the replacement parts that are now the valid ones
3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__shadow_mymp1.Yf--dgWYbAirt3Vm1RblGLOdZe8__bZ.1_1 <--- These are the replacement parts that are now the valid ones
3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__multipart_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.1 <--- original part that was replaced (now orphan data)
3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__multipart_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.2 <--- original part that was replaced (now orphan data)
3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__multipart_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.3 <--- original part that was replaced (now orphan data)
3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__multipart_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.4 <--- original part that was replaced (now orphan data)
3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__multipart_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.5 <--- original part that was replaced (now orphan data)
3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__shadow_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.1_1 <--- original part that was replaced (now orphan data)
3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__shadow_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.2_1 <--- original part that was replaced (now orphan data)
3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__shadow_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.3_1 <--- original part that was replaced (now orphan data)
3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__shadow_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.4_1 <--- original part that was replaced (now orphan data)
3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__shadow_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.5_1 <--- original part that was replaced (now orphan data)
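
A compact filter for isolating just the leaked originals in a listing like the one above: the orphaned pieces are the only entries whose part prefix still carries the original `2~...` upload ID, so (at least in this example) a grep on that marker separates them from the regenerated replacement parts:

rados -p default.rgw.buckets.data ls | grep '3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2_' | grep '\.2~'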

  4. radosgw-admin object stat
    NOTE: You can see in the object stat manifest section below that Ceph records the replacement parts as the valid ones.
    root@rgw09:~# radosgw-admin object stat -b mybucket -o mymp1 {
    "name": "mymp1",
    "size": 26214400,
    "policy": {
    "acl": {
    "acl_user_map": [ {
    "user": "admin",
    "acl": 15
    }
    ],
    "acl_group_map": [],
    "grant_map": [ {
    "id": "admin",
    "grant": {
    "type": {
    "type": 0
    },
    "id": "admin",
    "email": "",
    "permission": {
    "flags": 15
    },
    "name": "administrator",
    "group": 0,
    "url_spec": ""
    }
    }
    ]
    },
    "owner": {
    "id": "admin",
    "display_name": "administrator"
    }
    },
    "etag": "37f7055208439684f087af5d7746ccad-5",
    "tag": "3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.6857325",
    "manifest": {
    "objs": [],
    "obj_size": 26214400,
    "explicit_objs": "false",
    "head_size": 0,
    "max_head_size": 0,
    "prefix": "mymp1.Yf--dgWYbAirt3Vm1RblGLOdZe8__bZ",
    "rules": [ {
    "key": 0,
    "val": {
    "start_part_num": 1,
    "start_ofs": 0,
    "part_size": 5242880,
    "stripe_max_size": 4194304,
    "override_prefix": ""
    }
    }, {
    "key": 5242880,
    "val": {
    "start_part_num": 2,
    "start_ofs": 5242880,
    "part_size": 5242880,
    "stripe_max_size": 4194304,
    "override_prefix": "mymp1.iNL-YL9K92asvy2L6OX0QdNNncTvnW6"
    }
    }, {
    "key": 10485760,
    "val": {
    "start_part_num": 3,
    "start_ofs": 10485760,
    "part_size": 5242880,
    "stripe_max_size": 4194304,
    "override_prefix": "mymp1.7CZU8PpPYSF8lfuu6iyVXkMqHGowZ_v"
    }
    }, {
    "key": 15728640,
    "val": {
    "start_part_num": 4,
    "start_ofs": 15728640,
    "part_size": 5242880,
    "stripe_max_size": 4194304,
    "override_prefix": "mymp1.XGyHz_JrQaqlwhDIHfLgfcY7ms7ppe4"
    }
    }, {
    "key": 20971520,
    "val": {
    "start_part_num": 5,
    "start_ofs": 20971520,
    "part_size": 5242880,
    "stripe_max_size": 4194304,
    "override_prefix": "mymp1.7QV1hCCj4Zx12G6Cz8CmaQGx4_Uvd97"
    }
    }
    ],
    "tail_instance": "",
    "tail_placement": {
    "bucket": {
    "name": "mybucket",
    "marker": "3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2",
    "bucket_id": "3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2",
    "tenant": "",
    "explicit_placement": {
    "data_pool": "",
    "data_extra_pool": "",
    "index_pool": ""
    }
    },
    "placement_rule": "default-placement"
    }
    },
    "attrs": {
    "user.rgw.pg_ver": "",
    "user.rgw.source_zone": ".!�\u0011",
    "user.rgw.tail_tag": "3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.6857325"
    }
    }
  5. radosgw-admin gc list (after running radosgw-admin gc process)
    root@rgw09:~# radosgw-admin gc list
    []
  6. radosgw-admin gc list --include-all (after running radosgw-admin gc process)
    []
  7. Showing that the invalid items still remain after garbage collection
    root@rgw09:~# rados -p default.rgw.buckets.data ls | grep '3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2' | sort
    3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__multipart_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.1
    3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__multipart_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.2
    3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__multipart_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.3
    3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__multipart_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.4
    3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__multipart_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.5
    3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__multipart_mymp1.7CZU8PpPYSF8lfuu6iyVXkMqHGowZ_v.3
    3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__multipart_mymp1.7QV1hCCj4Zx12G6Cz8CmaQGx4_Uvd97.5
    3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__multipart_mymp1.iNL-YL9K92asvy2L6OX0QdNNncTvnW6.2
    3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__multipart_mymp1.XGyHz_JrQaqlwhDIHfLgfcY7ms7ppe4.4
    3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__multipart_mymp1.Yf--dgWYbAirt3Vm1RblGLOdZe8__bZ.1
    3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2_mymp1
    3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__shadow_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.1_1
    3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__shadow_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.2_1
    3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__shadow_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.3_1
    3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__shadow_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.4_1
    3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__shadow_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.5_1
    3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__shadow_mymp1.7CZU8PpPYSF8lfuu6iyVXkMqHGowZ_v.3_1
    3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__shadow_mymp1.7QV1hCCj4Zx12G6Cz8CmaQGx4_Uvd97.5_1
    3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__shadow_mymp1.iNL-YL9K92asvy2L6OX0QdNNncTvnW6.2_1
    3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__shadow_mymp1.XGyHz_JrQaqlwhDIHfLgfcY7ms7ppe4.4_1
    3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__shadow_mymp1.Yf--dgWYbAirt3Vm1RblGLOdZe8__bZ.1_1
  8. Result of radosgw-admin orphans find
    NOTE:
    On a small cluster it's very fast and feasible to run without memory leak issues.
    On large clusters it takes literally MONTHS to run, and it must run without interruption or you lose a significant portion of your progress, particularly during the data pool dump phase.
    It resumes at checkpoints, but in some cases it will restart from the beginning.
    Also, it cannot complete successfully on large clusters due to a memory leak: on our 2PB cluster (total data pool size approx. 1500TB, with around 800TB of valid data based on summing the individual bucket totals and around 700TB of suspected orphan data), it runs out of memory after a few weeks of continuous running, even on a VM with 1TB of RAM (yes, 1TB).
    Restarting just starts over at the checkpoint, and it continues to fail at about the same point on each restart.
    Note that (amongst other things) it identifies the original parts mentioned above as leaked data.
root@rgw09:~# radosgw-admin orphans find -p default.rgw.buckets.data --orphan-stale-secs 60 --job-id mybucket-2 --yes-i-really-mean-it
    (some logging omitted)
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.1__multipart_mymp1.2~Z60OreS-MgiJXv4-9uCBsf5EWhuHPrr.1
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.1__multipart_mymp1.2~Z60OreS-MgiJXv4-9uCBsf5EWhuHPrr.2
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.1__multipart_mymp1.2~Z60OreS-MgiJXv4-9uCBsf5EWhuHPrr.3
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.1__multipart_mymp1.2~Z60OreS-MgiJXv4-9uCBsf5EWhuHPrr.4
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.1__multipart_mymp1.2~Z60OreS-MgiJXv4-9uCBsf5EWhuHPrr.5
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.1__shadow_mymp1.2~Z60OreS-MgiJXv4-9uCBsf5EWhuHPrr.1_1
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.1__shadow_mymp1.2~Z60OreS-MgiJXv4-9uCBsf5EWhuHPrr.2_1
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.1__shadow_mymp1.2~Z60OreS-MgiJXv4-9uCBsf5EWhuHPrr.3_1
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.1__shadow_mymp1.2~Z60OreS-MgiJXv4-9uCBsf5EWhuHPrr.4_1
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.1__shadow_mymp1.2~Z60OreS-MgiJXv4-9uCBsf5EWhuHPrr.5_1
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.1__multipart_mymp1.2~TpAvTseKBOv1aX1e1E8Tid19BMJUSKL.1
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.1__multipart_mymp1.2~TpAvTseKBOv1aX1e1E8Tid19BMJUSKL.2
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.1__multipart_mymp1.2~TpAvTseKBOv1aX1e1E8Tid19BMJUSKL.3
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.1__multipart_mymp1.2~TpAvTseKBOv1aX1e1E8Tid19BMJUSKL.4
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.1__multipart_mymp1.2~TpAvTseKBOv1aX1e1E8Tid19BMJUSKL.5
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.1__shadow_mymp1.2~TpAvTseKBOv1aX1e1E8Tid19BMJUSKL.1_1
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.1__shadow_mymp1.2~TpAvTseKBOv1aX1e1E8Tid19BMJUSKL.2_1
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.1__shadow_mymp1.2~TpAvTseKBOv1aX1e1E8Tid19BMJUSKL.3_1
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.1__shadow_mymp1.2~TpAvTseKBOv1aX1e1E8Tid19BMJUSKL.4_1
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.1__shadow_mymp1.2~TpAvTseKBOv1aX1e1E8Tid19BMJUSKL.5_1
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__multipart_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.1 <--- leaked original multipart part
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__multipart_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.2 <--- leaked original multipart part
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__multipart_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.3 <--- leaked original multipart part
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__multipart_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.4 <--- leaked original multipart part
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__multipart_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.5 <--- leaked original multipart part
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__shadow_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.1_1 <--- leaked original multipart part
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__shadow_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.2_1 <--- leaked original multipart part
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__shadow_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.3_1 <--- leaked original multipart part
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__shadow_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.4_1 <--- leaked original multipart part
    leaked: 3647aa76-2877-4c76-8ef6-c56377ee1ae1.6963188.2__shadow_mymp1.2~v7BGHFNfjFxvPEPYd-wwgT72aNwAdw6.5_1 <--- leaked original multipart part

Files

ceph-leaked-mp-populater.sh (3.08 KB), Chris Jones, 03/17/2020 06:20 PM

Related issues 4 (0 open, 4 closed)

Related to rgw - Bug #16767: RadosGW Multipart Cleanup Failure (Resolved)
Copied to rgw - Backport #59566: reef: Multipart re-uploads cause orphan data (Resolved)
Copied to rgw - Backport #59567: quincy: Multipart re-uploads cause orphan data (Duplicate)
Copied to rgw - Backport #59568: pacific: Multipart re-uploads cause orphan data (Rejected)
Actions #1

Updated by Chris Jones about 4 years ago

What I know is that Ceph intentionally creates a new random oid for the re-uploaded parts. My understanding is that, among other things, this is to prevent race conditions and/or issues with simultaneous uploads of the same part (two instances of the client uploading the same part at the same time).

When the multipart complete is issued, Ceph knows how to look up the valid pieces via some stored metadata (the manifest, maybe?); however, it appears the original (no longer needed) parts are NOT tracked by any mechanism and are simply abandoned in the data pool.

I am not yet clear on why the index entries for the replacement parts remain in the index. I could be way off here, but it looks like it is only referencing "src_obj" at the end of the following function for removal, and not the new replacement part index entries:

rgw_op.cc function void RGWCompleteMultipart::execute()
<--- code omitted --->
rgw_obj_index_key remove_key;
src_obj.key.get_index_key(&remove_key);
remove_objs.push_back(remove_key);

Actions #2

Updated by Casey Bodley about 4 years ago

  • Status changed from New to Triaged
  • Assignee set to Matt Benjamin
  • Tags set to multipart gc
Actions #3

Updated by Casey Bodley about 3 years ago

  • Related to Bug #16767: RadosGW Multipart Cleanup Failure added
Actions #4

Updated by Gavin Chen about 3 years ago

Just to add onto this with our own findings:

The orphan issue seems to be related to bucket sharding. Running

# radosgw-admin bucket check --check-objects --fix --bucket <bucket name>

removed the orphaned data from multipart uploads, but only when bucket shards are set to 0. It fails to clean data on buckets with shards > 0.

The ticket points out that the remaining orphans have a different ID than the pieces which are recombined. The fact that these pieces aren't visible from a client service accessing Ceph metadata suggests it's an issue with the way Ceph tracks the pieces internally. The method for cataloging multipart objects seems not to take shards into account, or forgets the system "shadow tags" added onto the upload ID to differentiate them.

Setting bucket shards to a lower number doesn't change anything. After running this command and then a bucket check, the orphaned data remains.

[root@os1-sin1 ~]# radosgw-admin reshard add --bucket test --num-shards 7 --yes-i-really-mean-it
[root@os1-sin1 ~]# radosgw-admin reshard list
[ {
"time": "2020-09-24T17:14:42.189517Z",
"tenant": "",
"bucket_name": "test",
"bucket_id": "d8c6ebd1-2bab-414d-9d6b-73bf9bc8fc5a.12045805.1",
"new_instance_id": "",
"old_num_shards": 11,
"new_num_shards": 7
}
]

However setting bucket shards to 0 then running the bucket check command removed the orphan data.

[root@os1-sin1 ~]# radosgw-admin reshard add --bucket test --num-shards 0 --yes-i-really-mean-it
[root@os1-sin1 ~]# radosgw-admin reshard list
[ {
"time": "2020-09-24T17:23:34.843021Z",
"tenant": "",
"bucket_name": "test",
"bucket_id": "d8c6ebd1-2bab-414d-9d6b-73bf9bc8fc5a.14335315.1",
"new_instance_id": "",
"old_num_shards": 7,
"new_num_shards": 0
}
]
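
For reference, a condensed sketch of the workaround sequence described above (the reshard can be driven immediately with radosgw-admin reshard process instead of waiting for the background resharder; treat this as a sketch of what worked here, not a general fix):

radosgw-admin reshard add --bucket <bucket name> --num-shards 0 --yes-i-really-mean-it
radosgw-admin reshard process
radosgw-admin bucket check --check-objects --fix --bucket <bucket name>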
Actions #5

Updated by Tino Lehnig almost 2 years ago

Gavin Chen wrote:

Removed the orphaned data from multipart uploads, but only when bucket shards are set to 0. It fails to clean data on buckets with shards > 0.

This doesn't work anymore for us since version 17.2.1. The command runs without error on an affected bucket and shows a lot of "removing manifest part from index" messages, but doesn't actually do anything. The number of objects inside the bucket doesn't change.

I'm certain that this workaround did work in version 17.2.0 and before. Is this a new bug? And is there anything else we can do to get rid of the orphaned data? So far the only working method we have found is to sync the bucket with aws cli and just delete the bucket with orphan data, which is not always feasible.

Note: The orphaned data was most likely created before the upgrade to 17.2.1 in our case. Maybe that is causing an issue? I haven't seen a case with new orphaned data after the upgrade yet.

Actions #6

Updated by Dhairya Parmar over 1 year ago

Writing on behalf of Ulrich Klein <>, who wanted to add some info to this tracker. Below is the data he provided me:

TLDR:
Repeated multipart uploads via S3/RGW create orphaned multipart upload objects which can't be deleted.
The fragments become part of the "used space" for a user and thus screw up space accounting.
In addition these fragments eat up space on the system, and there is no way to tell how much is lost over time.
Not nice.

Searching for orphans via rgw-orphan-list and deleting the found orphaned objects doesn't help much for accounting and is dangerous (as it says) for the health of the system.

This has been the case (for us) since early Pacific releases and has not changed one bit up to my current 17.2.3.
Yes, I'll update to the latest version eventually. I've had that suggestion a few times before, without any change in behavior.

How I can easily reproduce the problem:
---------------------------------------
My small test system. It's the primary of a multi-site setup, i.e. no resharding. But the system and version (at least since early Pacific) don't matter.

#ceph -s
cluster:
id: c05843a6-f154-11ec-a618-0de3c7490b0c
health: HEALTH_OK

services:
mon: 5 daemons, quorum maxvm1,maxvm2,maxvm3,maxvm5,maxvm6 (age 5h)
mgr: maxvm1.gykqvn(active, since 5h), standbys: maxvm2.uieuyz, maxvm4.xqlsdy
osd: 12 osds: 12 up (since 5h), 12 in (since 8w)
rgw: 3 daemons active (3 hosts, 1 zones)

data:
pools: 10 pools, 289 pgs
objects: 38.82k objects, 124 GiB
usage: 252 GiB used, 3.3 TiB / 3.5 TiB avail
pgs: 289 active+clean

#ceph orch ps
NAME HOST PORTS STATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID
alertmanager.maxvm3 maxvm3 *:9093,9094 running (5h) 7m ago 3M 35.7M - 44a71f29f42b 6bc13919d589
crash.grafavm grafavm running (7d) 7m ago 2w 21.4M - 17.2.3 7080ad88bed7 ac97a757922b
crash.maxvm1 maxvm1 running (5h) 7m ago 3M 7071k - 17.2.3 7080ad88bed7 f075f517a8e1
crash.maxvm2 maxvm2 running (5h) 7m ago 3M 19.4M - 17.2.3 7080ad88bed7 370095c77547
crash.maxvm3 maxvm3 running (5h) 7m ago 3M 7629k - 17.2.3 7080ad88bed7 6aac212190a3
crash.maxvm4 maxvm4 running (5h) 7m ago 3M 7982k - 17.2.3 7080ad88bed7 57d017d9bbc2
crash.maxvm5 maxvm5 running (5h) 7m ago 3M 7944k - 17.2.3 7080ad88bed7 b7faea6ea28c
crash.maxvm6 maxvm6 running (5h) 7m ago 3M 9235k - 17.2.3 7080ad88bed7 0d156c89fad7
grafana.maxvm1 maxvm1 *:3000 running (5h) 7m ago 3M 115M - 8.3.5 046209d1c628 2d478303af2b
mgr.maxvm1.gykqvn maxvm1 *:8443,9283 running (5h) 7m ago 3M 630M - 17.2.3 7080ad88bed7 82c614421f4b
mgr.maxvm2.uieuyz maxvm2 *:8443,9283 running (5h) 7m ago 3M 509M - 17.2.3 7080ad88bed7 50649fb04ee6
mgr.maxvm4.xqlsdy maxvm4 *:8443,9283 running (5h) 7m ago 3M 522M - 17.2.3 7080ad88bed7 9f89cb894c23
mon.maxvm1 maxvm1 running (5h) 7m ago 3M 478M 2048M 17.2.3 7080ad88bed7 676a2bff6dfc
mon.maxvm2 maxvm2 running (5h) 7m ago 3M 466M 2048M 17.2.3 7080ad88bed7 0bc966a58462
mon.maxvm3 maxvm3 running (5h) 7m ago 3M 472M 2048M 17.2.3 7080ad88bed7 9415094eb85a
mon.maxvm5 maxvm5 running (5h) 7m ago 3M 469M 2048M 17.2.3 7080ad88bed7 43174ecc265f
mon.maxvm6 maxvm6 running (5h) 7m ago 3M 465M 2048M 17.2.3 7080ad88bed7 f407c9f1347f
node-exporter.grafavm grafavm *:9100 running (7d) 7m ago 2w 17.8M - bb203ba967a8 64a51468e763
node-exporter.maxvm1 maxvm1 *:9100 running (5h) 7m ago 3M 22.4M - bb203ba967a8 31d90265f164
node-exporter.maxvm2 maxvm2 *:9100 running (5h) 7m ago 3M 25.0M - bb203ba967a8 2bd9fa6feb24
node-exporter.maxvm3 maxvm3 *:9100 running (5h) 7m ago 3M 24.8M - bb203ba967a8 189c93f71251
node-exporter.maxvm4 maxvm4 *:9100 running (5h) 7m ago 3M 22.6M - bb203ba967a8 69166f53968a
node-exporter.maxvm5 maxvm5 *:9100 running (5h) 7m ago 3M 22.7M - bb203ba967a8 10c9eaf15107
node-exporter.maxvm6 maxvm6 *:9100 running (5h) 7m ago 3M 22.4M - bb203ba967a8 d7381d02b817
osd.0 maxvm1 running (5h) 7m ago 8w 836M 4096M 17.2.3 7080ad88bed7 9217317e6d50
osd.1 maxvm2 running (5h) 7m ago 8w 903M 4096M 17.2.3 7080ad88bed7 84f3accce498
osd.2 maxvm3 running (5h) 7m ago 8w 916M 4096M 17.2.3 7080ad88bed7 715d2c0751c0
osd.3 maxvm4 running (5h) 7m ago 8w 1085M 4096M 17.2.3 7080ad88bed7 79a0d68f2cf0
osd.4 maxvm5 running (5h) 7m ago 8w 843M 4096M 17.2.3 7080ad88bed7 1db0ad36a5fd
osd.5 maxvm6 running (5h) 7m ago 8w 1035M 4096M 17.2.3 7080ad88bed7 4d908daa7057
osd.6 maxvm1 running (5h) 7m ago 8w 897M 4096M 17.2.3 7080ad88bed7 262c0b437132
osd.7 maxvm2 running (5h) 7m ago 8w 978M 4096M 17.2.3 7080ad88bed7 a6f8994da77c
osd.8 maxvm3 running (5h) 7m ago 8w 900M 4096M 17.2.3 7080ad88bed7 7c93e034f1cb
osd.9 maxvm4 running (5h) 7m ago 8w 788M 4096M 17.2.3 7080ad88bed7 ce9554342946
osd.10 maxvm5 running (5h) 7m ago 8w 961M 4096M 17.2.3 7080ad88bed7 07fe69fb59b7
osd.11 maxvm6 running (5h) 7m ago 8w 850M 4096M 17.2.3 7080ad88bed7 8cb1fd47f5e1
prometheus.maxvm2 maxvm2 *:9095 running (5h) 7m ago 3M 367M - 49058af74c32 ebb8034d22d0
rgw.max.maxvm4.kokduw maxvm4 *:8080 running (5h) 7m ago 10w 273M - 17.2.3 7080ad88bed7 1882fe5f9ccb
rgw.max.maxvm5.aunemj maxvm5 *:8080 running (5h) 7m ago 10w 268M - 17.2.3 7080ad88bed7 f4feebceda1a
rgw.max.maxvm6.lrgyyj maxvm6 *:8080 running (5h) 7m ago 10w 270M - 17.2.3 7080ad88bed7 a1246877cef8

All arm64 VMs running Ubuntu 22.04, set up via cephadm, working just fine otherwise, except for the missing ingress, for which ceph always pulls an x86_64 container image for keepalived, so I had to set up haproxy/keepalived manually.
================================================================================================================================

My test execution which works every single time:
------------------------------------------------
#create ec21, not strictly necessary, but as we use EC-Data pools it makes the pool "similar"
ceph osd erasure-code-profile set ec21 k=2 m=1

#Create pools for new placement target, so only these pools get screwed up
ceph osd pool create max.rgw.mptest.data erasure ec21 --autoscale-mode=on
ceph osd pool application enable max.rgw.mptest.data rgw
ceph osd pool create max.rgw.mptest.index replicated --autoscale-mode=on
ceph osd pool application enable max.rgw.mptest.index rgw
ceph osd pool create max.rgw.mptest.non-ec replicated --autoscale-mode=on
ceph osd pool application enable max.rgw.mptest.non-ec rgw

#create zonegroup placement target
radosgw-admin zonegroup placement add --rgw-zonegroup maxzg --placement-id mptest

#create zone placement target
radosgw-admin zone placement add --rgw-zone max --placement-id mptest --data-pool max.rgw.mptest.data --index-pool max.rgw.mptest.index --data-extra-pool max.rgw.mptest.non-ec

radosgw-admin period update --commit

#Create user mptester
radosgw-admin user create --uid=mptester --display-name="MP Tester" --email=

#make that user use the new placement target, the only one using this placement target
radosgw-admin user modify --uid=mptester --placement-id mptest

rados df
POOL_NAME USED OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS RD WR_OPS WR USED COMPR UNDER COMPR
.mgr 4.7 MiB 2 0 6 0 0 0 15207 37 MiB 10455 176 MiB 0 B 0 B
.rgw.root 672 KiB 57 0 171 0 0 0 13466 16 MiB 616 422 KiB 0 B 0 B
max.rgw.control 0 B 8 0 24 0 0 0 0 0 B 0 0 B 0 B 0 B
...
max.rgw.log 5.9 MiB 826 0 2478 0 0 0 6611719 4.9 GiB 1418196 191 MiB 0 B 0 B
max.rgw.meta 793 KiB 75 0 225 0 0 0 30367 25 MiB 3068 1.2 MiB 0 B 0 B
max.rgw.mptest.data 0 B 0 0 0 0 0 0 0 0 B 0 0 B 0 B 0 B
max.rgw.mptest.index 0 B 0 0 0 0 0 0 0 0 B 0 0 B 0 B 0 B
max.rgw.mptest.non-ec 0 B 0 0 0 0 0 0 0 0 B 0 0 B 0 B 0 B
max.rgw.otp 0 B 0 0 0 0 0 0 0 0 B 0 0 B 0 B 0 B

total_objects 38814
total_used 251 GiB
total_avail 3.3 TiB
total_space 3.5 TiB

#Go to client machine #====================
#use rclone ...
#Create a bucket for mptester
rclone mkdir mptester:/mptest

#On Ceph #=======
radosgw-admin bucket stats --bucket=mptest {
"bucket": "mptest",
"num_shards": 11,
"tenant": "",
"zonegroup": "d8ea45b1-d527-427d-ba1e-fd9cdfe526f8",
"placement_rule": "mptest",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
},
"id": "47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1",
"marker": "47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1",
"index_type": "Normal",
"owner": "mptester",
"ver": "0#1,1#1,2#1,3#1,4#1,5#1,6#1,7#1,8#1,9#1,10#1",
"master_ver": "0#0,1#0,2#0,3#0,4#0,5#0,6#0,7#0,8#0,9#0,10#0",
"mtime": "0.000000",
"creation_time": "2022-10-11T11:27:45.247514Z",
"max_marker": "0#,1#,2#,3#,4#,5#,6#,7#,8#,9#,10#",
"usage": {},
"bucket_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
}
}

#on Client machine ... #=====================
#have a dir mptestfiles with just one 8G file:
ls -l mptestfiles/
total 16797696
-rw-r---- 1 mptester staff 8589934592 Oct 11 12:23 file1

#rclone sync the dir to the bucket, interrupting the transfer 5 times at about 50%
rclone -P -v sync mptestfiles mptester:/mptest
^C

#then let one finish
rclone -P -v sync mptestfiles mptester:/mptest
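
#a scripted version of the interrupt-and-retry sequence above; timeout's SIGTERM stands in
#for the manual ^C, and the 40s value is an assumption -- pick whatever interrupts the 8G
#transfer at roughly 50% on your link
for i in 1 2 3 4 5; do timeout 40s rclone -P -v sync mptestfiles mptester:/mptest; done
rclone -P -v sync mptestfiles mptester:/mptest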

#Let's check for unfinished MP uploads:
rclone backend list-multipart-uploads mptester:/mptest
"mptest": [ {
"Initiated": "2022-10-11T10:32:54.116Z",
"Initiator": {
"DisplayName": "MP Tester",
"ID": "mptester"
},
"Key": "file1",
"Owner": {
"DisplayName": "MP Tester",
"ID": "mptester"
},
"StorageClass": "STANDARD",
"UploadId": "2~zHESigB_NAcofmgz7VZ592IYykGChha"
}
]
}

#.. and clean up the visible one(s) using s3cmd
s3cmd abortmp s3://mptest/file1 "2~zHESigB_NAcofmgz7VZ592IYykGChha"
s3://mptest/file1

#check again ...
rclone backend list-multipart-uploads mptester:/mptest {
"mptest": []
}

#.... So, LOOKS clean

#Back to ceph system: #====================
radosgw-admin bucket stats --bucket=mptest {
"bucket": "mptest",
"num_shards": 11,
"tenant": "",
"zonegroup": "d8ea45b1-d527-427d-ba1e-fd9cdfe526f8",
"placement_rule": "mptest",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
},
"id": "47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1",
"marker": "47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1",
"index_type": "Normal",
"owner": "mptester",
"ver": "0#1,1#1,2#12007,3#1,4#1,5#1,6#1,7#1,8#1,9#1,10#1",
"master_ver": "0#0,1#0,2#0,3#0,4#0,5#0,6#0,7#0,8#0,9#0,10#0",
"mtime": "0.000000",
"creation_time": "2022-10-11T11:27:45.247514Z",
"max_marker": "0#,1#,2#00000012006.12038.5,3#,4#,5#,6#,7#,8#,9#,10#",
"usage": {
"rgw.main": {
"size": 8689549312,
"size_actual": 8689549312,
"size_utilized": 8689549312,
"size_kb": 8485888,
"size_kb_actual": 8485888,
"size_kb_utilized": 8485888,
"num_objects": 20
},
"rgw.multimeta": {
"size": 0,
"size_actual": 0,
"size_utilized": 0,
"size_kb": 0,
"size_kb_actual": 0,
"size_kb_utilized": 0,
"num_objects": 0
}
},
"bucket_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
}
}

#Hmmm, 19 objects for a single file, and size 8689549312 instead of 8589934592
#(Actually pretty "good" numbers. In other test runs it was much worse)

#Back to the client ... #======================
#Delete the object
rclone delete mptester:/mptest/file1
rclone lsl mptester:/mptest
<nothing there>

#Back on Ceph #============
radosgw-admin bucket stats --bucket=mptest {
"bucket": "mptest",
"num_shards": 11,
"tenant": "",
"zonegroup": "d8ea45b1-d527-427d-ba1e-fd9cdfe526f8",
"placement_rule": "mptest",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
},
"id": "47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1",
"marker": "47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1",
"index_type": "Normal",
"owner": "mptester",
"ver": "0#1,1#1,2#12008,3#1,4#1,5#1,6#1,7#1,8#1,9#1,10#1",
"master_ver": "0#0,1#0,2#0,3#0,4#0,5#0,6#0,7#0,8#0,9#0,10#0",
"mtime": "0.000000",
"creation_time": "2022-10-11T11:27:45.247514Z",
"max_marker": "0#,1#,2#00000012007.12040.5,3#,4#,5#,6#,7#,8#,9#,10#",
"usage": {
"rgw.main": {
"size": 99614720,
"size_actual": 99614720,
"size_utilized": 99614720,
"size_kb": 97280,
"size_kb_actual": 97280,
"size_kb_utilized": 97280,
"num_objects": 19
},
"rgw.multimeta": {
"size": 0,
"size_actual": 0,
"size_utilized": 0,
"size_kb": 0,
"size_kb_actual": 0,
"size_kb_utilized": 0,
"num_objects": 0
}
},
"bucket_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
}
}

#Now the "empty" bucket contains 19 objects and 99614720 bytes (95MB) of data ?
#That's already a really bad problem, as the user would be billed for 95MB instead of 0

radosgw-admin gc process --include-all

rados df
POOL_NAME USED OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS RD WR_OPS WR USED COMPR UNDER COMPR
.mgr 4.7 MiB 2 0 6 0 0 0 15207 37 MiB 10455 176 MiB 0 B 0 B
.rgw.root 672 KiB 57 0 171 0 0 0 13539 16 MiB 616 422 KiB 0 B 0 B
max.rgw.control 0 B 8 0 24 0 0 0 0 0 B 0 0 B 0 B 0 B
...
max.rgw.log 6.8 MiB 826 0 2478 0 0 0 6618895 5.0 GiB 1419457 194 MiB 0 B 0 B
max.rgw.meta 817 KiB 78 0 234 0 0 0 30516 25 MiB 3093 1.2 MiB 0 B 0 B
max.rgw.mptest.data 1.1 GiB 274 0 822 0 0 0 11727 9 KiB 59798 29 GiB 0 B 0 B
max.rgw.mptest.index 0 B 11 0 33 0 0 0 36468 36 MiB 36065 23 MiB 0 B 0 B
max.rgw.mptest.non-ec 0 B 0 0 0 0 0 0 24057 14 MiB 6070 5.9 MiB 0 B 0 B
max.rgw.otp 0 B 0 0 0 0 0 0 0 0 B 0 0 B 0 B 0 B

total_objects 39102
total_used 253 GiB
total_avail 3.3 TiB
total_space 3.5 TiB
#... and the data pool 1.1 GB in 274 object, index 11 objects

#On the client delete the bucket #===============================
rclone rmdir mptester:/mptest

#On Ceph test: #=============
radosgw-admin bucket stats --bucket=mptest
failure: (2002) Unknown error 2002:
#Okaay, I guess that means "bucket not found" #
radosgw-admin gc process --include-all
rados df
POOL_NAME USED OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS RD WR_OPS WR USED COMPR UNDER COMPR
.mgr 4.7 MiB 2 0 6 0 0 0 15207 37 MiB 10455 176 MiB 0 B 0 B
.rgw.root 672 KiB 57 0 171 0 0 0 13565 16 MiB 616 422 KiB 0 B 0 B
max.rgw.control 0 B 8 0 24 0 0 0 0 0 B 0 0 B 0 B 0 B
...
max.rgw.log 5.9 MiB 826 0 2478 0 0 0 6619613 5.0 GiB 1419613 194 MiB 0 B 0 B
max.rgw.meta 805 KiB 77 0 231 0 0 0 30666 25 MiB 3107 1.2 MiB 0 B 0 B
max.rgw.mptest.data 75 MiB 20 0 60 0 0 0 11981 9 KiB 60052 29 GiB 0 B 0 B
max.rgw.mptest.index 0 B 11 0 33 0 0 0 36592 37 MiB 36065 23 MiB 0 B 0 B
max.rgw.mptest.non-ec 0 B 0 0 0 0 0 0 24057 14 MiB 6070 5.9 MiB 0 B 0 B
max.rgw.otp 0 B 0 0 0 0 0 0 0 0 B 0 0 B 0 B 0 B

total_objects 38847
total_used 251 GiB
total_avail 3.3 TiB
total_space 3.5 TiB

#So, now we have a pool with nothing in it which uses 75MB in 20 objects
#What's in it:
rados ls --pool max.rgw.mptest.data | sort
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__multipart_file1.2~4fhdFoPsGu4sUwa0HiOj6z4TQteBJB9.863
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__multipart_file1.2~FYaScn-NiPEdPk6sEY5tJKXx569tO-R.867
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__multipart_file1.2~FYaScn-NiPEdPk6sEY5tJKXx569tO-R.869
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__multipart_file1.2~FYaScn-NiPEdPk6sEY5tJKXx569tO-R.870
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__multipart_file1.2~U_1f1XG0Y-Qcka8Rg9BkZqSPXSbxGpm.875
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__multipart_file1.2~U_1f1XG0Y-Qcka8Rg9BkZqSPXSbxGpm.877
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__multipart_file1.2~aKcRL17Bs3m3R3LB5EEfX3HTiw1SArT.874
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__multipart_file1.2~tyFuo0ItA0IuZh2Q335M6-Y7FNDLsrh.874
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__multipart_file1.2~tyFuo0ItA0IuZh2Q335M6-Y7FNDLsrh.875
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__multipart_file1.2~tyFuo0ItA0IuZh2Q335M6-Y7FNDLsrh.876
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__shadow_file1.2~4fhdFoPsGu4sUwa0HiOj6z4TQteBJB9.863_1
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__shadow_file1.2~FYaScn-NiPEdPk6sEY5tJKXx569tO-R.867_1
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__shadow_file1.2~FYaScn-NiPEdPk6sEY5tJKXx569tO-R.869_1
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__shadow_file1.2~FYaScn-NiPEdPk6sEY5tJKXx569tO-R.870_1
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__shadow_file1.2~U_1f1XG0Y-Qcka8Rg9BkZqSPXSbxGpm.875_1
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__shadow_file1.2~U_1f1XG0Y-Qcka8Rg9BkZqSPXSbxGpm.877_1
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__shadow_file1.2~aKcRL17Bs3m3R3LB5EEfX3HTiw1SArT.874_1
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__shadow_file1.2~tyFuo0ItA0IuZh2Q335M6-Y7FNDLsrh.874_1
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__shadow_file1.2~tyFuo0ItA0IuZh2Q335M6-Y7FNDLsrh.875_1
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__shadow_file1.2~tyFuo0ItA0IuZh2Q335M6-Y7FNDLsrh.876_1

#So far:
#RGW somehow completely forgets about MP upload "fragments" if a "failed" MP upload is repeated.
#Not only do the fragments remain in the corresponding buckets and are used in accounting,
#but even deleting the bucket does not free the space. It is lost forever, and accumulating

#A suggested fix was to run rgw-orphan-list and delete the orphans from the pool:
rgw-orphan-list max.rgw.mptest.data
Pool is "max.rgw.mptest.data".
Note: output files produced will be tagged with the current timestamp -- 20221011115953.
running 'rados ls' at Tue Oct 11 13:59:53 CEST 2022
running 'rados ls' on pool max.rgw.mptest.data.
running 'radosgw-admin bucket radoslist' at Tue Oct 11 13:59:53 CEST 2022
computing delta at Tue Oct 11 13:59:53 CEST 2022
20 potential orphans found out of a possible 20 (100%).
The results can be found in './orphan-list-20221011115953.out'.
Intermediate files are './rados-20221011115953.intermediate' and './radosgw-admin-20221011115953.intermediate'.
*** WARNING: This is EXPERIMENTAL code and the results should be used
*** only with CAUTION! ***
Done at Tue Oct 11 13:59:53 CEST 2022.

cat orphan-list-20221011115953.out
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__multipart_file1.2~4fhdFoPsGu4sUwa0HiOj6z4TQteBJB9.863
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__multipart_file1.2~FYaScn-NiPEdPk6sEY5tJKXx569tO-R.867
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__multipart_file1.2~FYaScn-NiPEdPk6sEY5tJKXx569tO-R.869
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__multipart_file1.2~FYaScn-NiPEdPk6sEY5tJKXx569tO-R.870
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__multipart_file1.2~U_1f1XG0Y-Qcka8Rg9BkZqSPXSbxGpm.875
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__multipart_file1.2~U_1f1XG0Y-Qcka8Rg9BkZqSPXSbxGpm.877
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__multipart_file1.2~aKcRL17Bs3m3R3LB5EEfX3HTiw1SArT.874
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__multipart_file1.2~tyFuo0ItA0IuZh2Q335M6-Y7FNDLsrh.874
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__multipart_file1.2~tyFuo0ItA0IuZh2Q335M6-Y7FNDLsrh.875
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__multipart_file1.2~tyFuo0ItA0IuZh2Q335M6-Y7FNDLsrh.876
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__shadow_file1.2~4fhdFoPsGu4sUwa0HiOj6z4TQteBJB9.863_1
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__shadow_file1.2~FYaScn-NiPEdPk6sEY5tJKXx569tO-R.867_1
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__shadow_file1.2~FYaScn-NiPEdPk6sEY5tJKXx569tO-R.869_1
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__shadow_file1.2~FYaScn-NiPEdPk6sEY5tJKXx569tO-R.870_1
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__shadow_file1.2~U_1f1XG0Y-Qcka8Rg9BkZqSPXSbxGpm.875_1
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__shadow_file1.2~U_1f1XG0Y-Qcka8Rg9BkZqSPXSbxGpm.877_1
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__shadow_file1.2~aKcRL17Bs3m3R3LB5EEfX3HTiw1SArT.874_1
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__shadow_file1.2~tyFuo0ItA0IuZh2Q335M6-Y7FNDLsrh.874_1
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__shadow_file1.2~tyFuo0ItA0IuZh2Q335M6-Y7FNDLsrh.875_1
47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__shadow_file1.2~tyFuo0ItA0IuZh2Q335M6-Y7FNDLsrh.876_1

#Yo, as expected
#So delete them ...
rados -p max.rgw.mptest.data rm 47bae180-d58b-4438-a057-a89cc2d403f2.1934912.1__multipart_file1.2~4fhdFoPsGu4sUwa0HiOj6z4TQteBJB9.863
... for all
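
#rather than one rados rm per object by hand, a simple loop over the orphan list file
#produced above does the same thing:
while read -r obj; do rados -p max.rgw.mptest.data rm "$obj"; done < orphan-list-20221011115953.out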

#So, that does seem to work for data that is not in any bucket

#Now back to the client again #============================
#Again create the bucket, do 5 interrupted and one full upload, find and delete visible mp upload fragments

#Afterwards we have on Ceph/RGW: #===============================

radosgw-admin bucket stats --bucket=mptest {
"bucket": "mptest",
"num_shards": 11,
"tenant": "",
"zonegroup": "d8ea45b1-d527-427d-ba1e-fd9cdfe526f8",
"placement_rule": "mptest",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
},
"id": "47bae180-d58b-4438-a057-a89cc2d403f2.1935023.1",
"marker": "47bae180-d58b-4438-a057-a89cc2d403f2.1935023.1",
"index_type": "Normal",
"owner": "mptester",
"ver": "0#1,1#1,2#12361,3#1,4#1,5#1,6#1,7#1,8#1,9#1,10#1",
"master_ver": "0#0,1#0,2#0,3#0,4#0,5#0,6#0,7#0,8#0,9#0,10#0",
"mtime": "0.000000",
"creation_time": "2022-10-11T12:09:42.119166Z",
"max_marker": "0#,1#,2#00000012360.12395.5,3#,4#,5#,6#,7#,8#,9#,10#",
"usage": {
"rgw.main": {
"size": 94371840,
"size_actual": 94371840,
"size_utilized": 94371840,
"size_kb": 92160,
"size_kb_actual": 92160,
"size_kb_utilized": 92160,
"num_objects": 18
},
"rgw.multimeta": {
"size": 0,
"size_actual": 0,
"size_utilized": 0,
"size_kb": 0,
"size_kb_actual": 0,
"size_kb_utilized": 0,
"num_objects": 0
}
},
"bucket_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
}
}

rados df
POOL_NAME USED OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS RD WR_OPS WR USED COMPR UNDER COMPR
.mgr 4.7 MiB 2 0 6 0 0 0 15207 37 MiB 10455 176 MiB 0 B 0 B
.rgw.root 672 KiB 57 0 171 0 0 0 13609 16 MiB 616 422 KiB 0 B 0 B
max.rgw.control 0 B 8 0 24 0 0 0 0 0 B 0 0 B 0 B 0 B
...
max.rgw.log 15 MiB 826 0 2478 0 0 0 6623585 5.0 GiB 1422665 198 MiB 0 B 0 B
max.rgw.meta 829 KiB 79 0 237 0 0 0 31224 26 MiB 3134 1.2 MiB 0 B 0 B
max.rgw.mptest.data 45 GiB 12333 0 36999 0 0 0 11989 18 KiB 109571 60 GiB 0 B 0 B
max.rgw.mptest.index 0 B 22 0 66 0 0 0 74073 74 MiB 73188 48 MiB 0 B 0 B
max.rgw.mptest.non-ec 0 B 0 0 0 0 0 0 48814 29 MiB 12310 12 MiB 0 B 0 B
max.rgw.otp 0 B 0 0 0 0 0 0 0 0 B 0 0 B 0 B 0 B

total_objects 51173
total_used 297 GiB
total_avail 3.2 TiB
total_space 3.5 TiB
#Even a little better than last time. Only 45GB lost. But 12333 objects for a single empty bucket?? :) #
rgw-orphan-list max.rgw.mptest.data
Pool is "max.rgw.mptest.data".
Note: output files produced will be tagged with the current timestamp -- 20221011122035.
running 'rados ls' at Tue Oct 11 14:20:35 CEST 2022
running 'rados ls' on pool max.rgw.mptest.data.
running 'radosgw-admin bucket radoslist' at Tue Oct 11 14:20:35 CEST 2022
computing delta at Tue Oct 11 14:20:35 CEST 2022
12330 potential orphans found out of a possible 12333 (99%).
The results can be found in './orphan-list-20221011122035.out'.
Intermediate files are './rados-20221011122035.intermediate' and './radosgw-admin-20221011122035.intermediate'.
*** WARNING: This is EXPERIMENTAL code and the results should be used
*** only with CAUTION! ***
Done at Tue Oct 11 14:20:35 CEST 2022.
    #Wow, that's a lot ...
    #Delete them all ... then
    rados df
    POOL_NAME USED OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS RD WR_OPS WR USED COMPR UNDER COMPR
    .mgr 4.7 MiB 2 0 6 0 0 0 15207 37 MiB 10455 176 MiB 0 B 0 B
    .rgw.root 672 KiB 57 0 171 0 0 0 13620 16 MiB 616 422 KiB 0 B 0 B
    max.rgw.control 0 B 8 0 24 0 0 0 0 0 B 0 0 B 0 B 0 B
    ...
    max.rgw.log 15 MiB 826 0 2478 0 0 0 6633903 5.0 GiB 1423537 198 MiB 0 B 0 B
    max.rgw.meta 829 KiB 79 0 237 0 0 0 31594 26 MiB 3134 1.2 MiB 0 B 0 B
    max.rgw.mptest.data 18 MiB 3 0 9 0 0 0 11995 21 KiB 121901 60 GiB 0 B 0 B
    max.rgw.mptest.index 0 B 22 0 66 0 0 0 74140 74 MiB 73188 48 MiB 0 B 0 B
    max.rgw.mptest.non-ec 0 B 0 0 0 0 0 0 48814 29 MiB 12310 12 MiB 0 B 0 B
    max.rgw.otp 0 B 0 0 0 0 0 0 0 0 B 0 0 B 0 B 0 B

total_objects 38843
total_used 252 GiB
total_avail 3.3 TiB
total_space 3.5 TiB

radosgw-admin bucket stats --bucket=mptest {
"bucket": "mptest",
"num_shards": 11,
"tenant": "",
"zonegroup": "d8ea45b1-d527-427d-ba1e-fd9cdfe526f8",
"placement_rule": "mptest",
"explicit_placement": {
"data_pool": "",
"data_extra_pool": "",
"index_pool": ""
},
"id": "47bae180-d58b-4438-a057-a89cc2d403f2.1935023.1",
"marker": "47bae180-d58b-4438-a057-a89cc2d403f2.1935023.1",
"index_type": "Normal",
"owner": "mptester",
"ver": "0#1,1#1,2#12361,3#1,4#1,5#1,6#1,7#1,8#1,9#1,10#1",
"master_ver": "0#0,1#0,2#0,3#0,4#0,5#0,6#0,7#0,8#0,9#0,10#0",
"mtime": "0.000000",
"creation_time": "2022-10-11T12:09:42.119166Z",
"max_marker": "0#,1#,2#00000012360.12395.5,3#,4#,5#,6#,7#,8#,9#,10#",
"usage": {
"rgw.main": {
"size": 94371840,
"size_actual": 94371840,
"size_utilized": 94371840,
"size_kb": 92160,
"size_kb_actual": 92160,
"size_kb_utilized": 92160,
"num_objects": 18
},
"rgw.multimeta": {
"size": 0,
"size_actual": 0,
"size_utilized": 0,
"size_kb": 0,
"size_kb_actual": 0,
"size_kb_utilized": 0,
"num_objects": 0
}
},
"bucket_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
}
}

#So bucket stats/accounting not fixed ... At least the client still works ok at this time
rgw-orphan-list max.rgw.mptest.index
Pool is "max.rgw.mptest.index".
Note: output files produced will be tagged with the current timestamp -- 20221011123001.
running 'rados ls' at Tue Oct 11 14:30:01 CEST 2022
running 'rados ls' on pool max.rgw.mptest.index.
running 'radosgw-admin bucket radoslist' at Tue Oct 11 14:30:01 CEST 2022
computing delta at Tue Oct 11 14:30:02 CEST 2022
22 potential orphans found out of a possible 22 (100%).
The results can be found in './orphan-list-20221011123001.out'.
Intermediate files are './rados-20221011123001.intermediate' and './radosgw-admin-20221011123001.intermediate'.
*** WARNING: This is EXPERIMENTAL code and the results should be used
*** only with CAUTION! ***
Done at Tue Oct 11 14:30:02 CEST 2022.

#... delete them all and the bucket is screwed:
radosgw-admin bucket stats --bucket=mptest
error getting bucket stats bucket=mptest ret=-2
failure: (2) No such file or directory:

radosgw-admin user stats --uid=mptester {
"stats": {
"size": 94371840,
"size_actual": 94371840,
"size_kb": 92160,
"size_kb_actual": 92160,
"num_objects": 18
},
"last_stats_sync": "0.000000",
"last_stats_update": "2022-10-11T12:34:01.393017Z"
}
#And the stats are still wrong #

#on the client: #==============
rclone lsl mptester:/mptest
<nothing>

rclone mkdir mptester:/mptest

rclone lsl mptester:/mptest
2022/10/11 14:37:14 Failed to lsl: directory not found

rclone -P -v sync mptestfiles mptester:/mptest
2022-10-11 14:39:04 INFO : S3 bucket mptest: Bucket "mptest" created with ACL "private"
2022-10-11 14:40:07 INFO : file1: Copied (new)
Transferred: 8 GiB / 8 GiB, 100%, 161.382 MiB/s, ETA 0s
Transferred: 1 / 1, 100%
Elapsed time: 1m3.0s
2022/10/11 14:40:07 INFO :
Transferred: 8 GiB / 8 GiB, 100%, 161.382 MiB/s, ETA 0s
Transferred: 1 / 1, 100%
Elapsed time: 1m3.0s

rclone lsl mptester:/mptest
2022/10/11 14:40:21 Failed to lsl: directory not found

#Looks like the pool/placement group is screwed now. Better get rid of it.

#On Ceph: #========
radosgw-admin user rm --uid=mptester --purge-data
could not remove user: unable to remove user, unable to delete user data
#Oh, well ...
radosgw-admin user rm --uid=mptester
could not remove user: unable to remove user, must specify purge data to remove user with buckets
#So, we're left with a ghost user :)
#At least deleting placement group and pools does still work
#But syncing also doesn't work anymore, always 1 shard behind on meta and data sync

#
#So, the "delete orphans" only works if the orphans are not "associated" with an existing bucket, or so ...
#Otherwise the system gets screwed in mysterious ways.

Actions #7

Updated by Mykola Golub over 1 year ago

Looking at the code: in `MultipartObjectProcessor::process_first_chunk`, if writing the first chunk of the multipart object returns that it already exists, we generate a new random name for this part and write with this name [1]. Then, in `MultipartObjectProcessor::complete`, we just set the "partX -> new_part_name" omap key/value in the upload meta object [2], and the previously uploaded part becomes an orphan.

One option to resolve this that I can imagine is to provide and use something like `omap_exchange_val_by_key` (instead of `omap_set_val_by_key`), so we could atomically set the part link and check whether it had already been set; if it had, we could initiate removal of that previous part. A complication is that the rados backend does not seem to provide anything like `omap_exchange_val_by_key`, so it would have to be implemented, and we additionally have several new backends where this would need to be implemented too.

Another option could be to store "part_name -> partNum" omap values instead of "partNum -> part_name", so the links would not be overwritten. Then, when completing the upload (in `RGWCompleteMultipart::execute`), process the list of all uploaded parts and remove the duplicates.

[1] https://github.com/ceph/ceph/blob/v17.2.5/src/rgw/rgw_putobj_processor.cc#L352
[2] https://github.com/ceph/ceph/blob/v17.2.5/src/rgw/rgw_putobj_processor.cc#L495
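
For anyone who wants to observe this mapping from the command line, the omap of the multipart upload meta object can be inspected while an upload is still in progress. This is only a rough sketch: the assumption (based on the pool listings earlier in this ticket) is that the meta object lives in the bucket's data_extra_pool (the *.non-ec pool), and its omap values are binary-encoded, so only the keys are directly readable:

# locate the meta object of the in-progress multipart upload
rados -p max.rgw.mptest.non-ec ls | grep multipart
# one omap entry per uploaded part; re-uploading a part rewrites the entry to point at
# the newly named part object, leaving the first copy unreferenced
rados -p max.rgw.mptest.non-ec listomapkeys <meta object name from the listing above>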

Actions #8

Updated by Mykola Golub over 1 year ago

  • Status changed from Triaged to Fix Under Review
  • Pull request ID set to 48695

Actually, it looks like there is a simpler solution to this problem, which uses the meta object lock when checking whether the part is already uploaded, and just skips the newly uploaded part if it is. See the PR https://github.com/ceph/ceph/pull/48695 for details.

Actions #9

Updated by Mykola Golub over 1 year ago

  • Pull request ID changed from 48695 to 37260

As discussed in [1], there is already a WIP PR with a more generic solution [2].

[1] https://github.com/ceph/ceph/pull/48695
[2] https://github.com/ceph/ceph/pull/37260

Actions #10

Updated by Aleksandr Rudenko over 1 year ago

This is a very big problem for us.

We have a lot of big buckets with orphaned parts which use hundreds of TBs of space.

A second problem is that bucket check can't fix it if the bucket is sharded.
We have to reshard big buckets to 0 shards and then fix them. But we can't reshard very big buckets (200-500M objects) to 0 shards, because it can lead to other problems such as OSD crashes...
and the fix will eat a lot of memory.

Actions #11

Updated by Ulrich Klein over 1 year ago

Will this ever get fixed? Or remain in the state that solutions are discussed and discarded?

Actions #12

Updated by Matt Benjamin over 1 year ago

  • Pull request ID changed from 37260 to 49709

Hi Ulrich,

The plan is to merge https://github.com/ceph/ceph/pull/49709. Apparently this tracker duplicates https://tracker.ceph.com/issues/16767. I've updated this one.

thanks,

Matt

Actions #13

Updated by Ulrich Klein about 1 year ago

I just repeated my little test from comment #6 on a single-site 17.2.6.
The result was exactly the same as for all tests before, so I guess there has been no fix for this problem in 17.2.6.
Will there ever be...?

Actions #14

Updated by Konstantin Shalygin about 1 year ago

  • Status changed from Fix Under Review to Pending Backport
  • Assignee deleted (Matt Benjamin)
  • Target version set to v19.0.0
  • Backport set to pacific quincy reef
Actions #15

Updated by Konstantin Shalygin about 1 year ago

Hi Ulrich, this was merged to main.
Thanks for pointing to this. I have updated the status; the bot will create backport tickets soon.

Actions #16

Updated by Backport Bot about 1 year ago

  • Copied to Backport #59566: reef: Multipart re-uploads cause orphan data added
Actions #17

Updated by Backport Bot about 1 year ago

  • Copied to Backport #59567: quincy: Multipart re-uploads cause orphan data added
Actions #18

Updated by Backport Bot about 1 year ago

  • Copied to Backport #59568: pacific: Multipart re-uploads cause orphan data added
Actions #19

Updated by Backport Bot about 1 year ago

  • Tags changed from multipart gc to multipart gc backport_processed
Actions #20

Updated by Ulrich Klein 9 months ago

I just repeated my little test from comment #6 on a single-site 18.2.0
The uploads were noticeably faster, but the result was exactly the same as for all tests before. So I guess there has been no fix for this problem in 18.2.0.

Actions #21

Updated by Nathan Hoad 5 months ago

Hi all, we had a user trigger this behavior, and we were able to come up with a cleanup process using only radosgw-admin commands. This allowed us to delete the data so that the user could actually reuse the bucket.

1. Have the user backup all of their data.
2. `radosgw-admin bucket check --fix --bucket=$BUCKET` (bucket stats showed a bad rgw.none num_objects count)
3. `radosgw-admin bucket reshard --bucket=$BUCKET --num-shards=29`
4. `radosgw-admin bucket rm --bucket=$BUCKET --purge-objects` to purge everything. Note that I had to run this a couple of times as it still appeared to get stuck.
5. Now that the bucket is gone, the user was able to recreate the bucket and put their data back.

We found that without the reshard and bucket check --fix, the bucket rm would get stuck and make no progress. In that situation it would infinitely loop over the same object.
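
A condensed sketch of the steps above ($BUCKET is a placeholder; the retry loop stands in for re-running the purge a couple of times):

BUCKET="<bucket name>"
radosgw-admin bucket check --fix --bucket="$BUCKET"
radosgw-admin bucket reshard --bucket="$BUCKET" --num-shards=29
# repeat the purge until the bucket is actually gone
while radosgw-admin bucket stats --bucket="$BUCKET" >/dev/null 2>&1; do
  radosgw-admin bucket rm --bucket="$BUCKET" --purge-objects
done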

Actions #22

Updated by Jacques Heunis 4 months ago

We discovered this behaviour in one of our users' buckets and tried the solution given by Nathan above. Unfortunately it didn't work: `radosgw-admin bucket rm --bucket=$BUCKET --purge-objects` got stuck in a loop (it ran for ages and judging by debug logs it was spinning on a subset of objects without managing to actually delete them).

In our case the issue showed up because the user was trying to delete the bucket. We therefore didn't need to fix the bucket, just get rid of it.
Our solution was (a consolidated sketch follows this list):
  1. List all the rados objects storing data for our bucket with:
    radosgw-admin bucket radoslist --bucket=$BUCKET
  2. Get the number of shards in our bucket with:
    radosgw-admin bucket stats --bucket=$BUCKET
  3. Loop over each shard, deleting all metadata/omaps for our bucket by running:
    rados -p $POOL.rgw.buckets.index clearomap $BUCKET_ID.$SHARD_ID
  4. At this point `bucket stats` reports our bucket as empty
  5. Now that the bucket thinks it's empty it can be deleted as usual with:
    radosgw-admin bucket rm --bucket=$BUCKET
  6. Finally, clear up the data objects for our former-bucket by looping over each object listed by `radoslist` above and running:
    rados -p $POOL.rgw.buckets rm $OBJ_NAME
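
A consolidated sketch of these steps ($POOL, $BUCKET and $BUCKET_ID are placeholders for the values from bucket stats; the index shard objects are discovered with rados ls here rather than constructed by name, since their exact naming can vary):

# 1. capture the data objects BEFORE touching the index, since radoslist walks the index
radosgw-admin bucket radoslist --bucket="$BUCKET" > bucket-objects.txt
# 2./3. clear the omaps of every index shard object belonging to this bucket
rados -p "$POOL.rgw.buckets.index" ls | grep "$BUCKET_ID" | while read -r idx; do
  rados -p "$POOL.rgw.buckets.index" clearomap "$idx"
done
# 4./5. the bucket now reports as empty and can be removed normally
radosgw-admin bucket rm --bucket="$BUCKET"
# 6. finally remove the now-unreferenced data objects
while read -r obj; do
  rados -p "$POOL.rgw.buckets" rm "$obj"
done < bucket-objects.txt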