Bug #2446

closed

libceph: corrupt inc osdmap epoch 24630 off 702 (ffff88001e5d876c of ffff88001e5d84ae-ffff88001e5d888c)

Added by Karol Jurak almost 12 years ago. Updated almost 12 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Community (user)

Description

Ceph version: 0.46
Kernel client version: Debian's linux-image-3.3.0-trunk-amd64=3.3.4-1~experimental.1 patched so that the ceph client is 29275a2c2ac9dc5a8b9095304c649871baf43f4f (tip of the wip-crush branch at the time I built the kernel)

Running, for example, 'ceph osd in <id>' causes the following messages to appear on the clients' consoles:

[87625.522288] libceph: corrupt inc osdmap epoch 24630 off 702 (ffff88001e5d876c of ffff88001e5d84ae-ffff88001e5d888c)
[87625.522546] libceph: osdc handle_map corrupt msg

Some clients seem to handle this situation without problems, but others freeze when accessing rbd block devices and log I/O error messages:

[87749.580008] end_request: I/O error, dev rbd1, sector 75761648
[87749.580008] Buffer I/O error on device rbd1, logical block 9470206
[87749.580008] Buffer I/O error on device rbd1, logical block 9470207
[87749.580008] Buffer I/O error on device rbd1, logical block 9470208
[87749.580008] EXT4-fs warning (device rbd1): ext4_end_bio:250: I/O error writing to inode 2359814 (offset 0 size 12288 starting block 9470209)

Files

24630 (1.35 KB): the incremental osdmap 24630 that the error message in the bug report refers to. Karol Jurak, 05/18/2012 02:14 AM
osdmaps.tgz (299 KB). Karol Jurak, 05/21/2012 02:45 AM

Related issues: 1 (0 open, 1 closed)

Is duplicate of Ceph - Bug #2091: corrupt v5 inc osdmap (Can't reproduce, 02/22/2012)

Actions #1

Updated by Sage Weil almost 12 years ago

  • Status changed from New to Need More Info
  • Priority changed from Normal to High

Is the osdmap/24630 present on all monitors? Is it identical on all of them?

The attachment is 1386 bytes.
The difference between those offsets is only 990 bytes. Maybe the monitor or osd sent a truncated inc map?

Is this reproducible?
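
The offset arithmetic above can be checked mechanically. The following is a minimal sketch in C, not part of libceph: the struct and function names are hypothetical, and the message layout is assumed to be exactly the one shown in the log lines in this report. It parses a "corrupt inc osdmap" line and derives the failure offset and the buffer length from the three addresses.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the fields of a
 * "libceph: corrupt inc osdmap" kernel log line. */
struct corrupt_report {
    unsigned epoch;   /* osdmap epoch named in the message */
    unsigned off;     /* offset as printed by the kernel */
    uint64_t pos;     /* address where decoding stopped */
    uint64_t start;   /* address of the first byte of the buffer */
    uint64_t end;     /* address just past the end of the buffer */
};

/* Parse one log line; an optional "[timestamp]" prefix is skipped.
 * Returns 0 on success, -1 if the line does not match the layout. */
static int parse_corrupt_report(const char *line, struct corrupt_report *r)
{
    const char *p = strstr(line, "libceph:");
    if (!p)
        return -1;

    int n = sscanf(p,
                   "libceph: corrupt inc osdmap epoch %u off %u "
                   "(%" SCNx64 " of %" SCNx64 "-%" SCNx64 ")",
                   &r->epoch, &r->off, &r->pos, &r->start, &r->end);
    return n == 5 ? 0 : -1;
}

/* Number of bytes the client had available to decode. */
static uint64_t buffer_len(const struct corrupt_report *r)
{
    return r->end - r->start;
}

/* Distance from the start of the buffer to the failure point. */
static uint64_t failure_offset(const struct corrupt_report *r)
{
    return r->pos - r->start;
}
```

For the line quoted in the description, the addresses give a failure offset of 702 bytes (matching the printed "off 702") and a buffer length of 990 bytes, which is the figure compared against the 1386-byte attachment.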

Actions #2

Updated by Karol Jurak almost 12 years ago

The monitors deleted older osdmaps from their mondata directories over the weekend, but I managed to reproduce the bug. I executed a couple of 'ceph osd in|out' and 'ceph osd crush reweight' commands; the 'corrupt inc osdmap' messages showed up only after I subsequently shut down one of the OSDs.

[ 1145.443663] libceph: corrupt inc osdmap epoch 25315 off 98 (ffff880045ef90d4 of ffff880045ef9072-ffff880045ef90fc)

[ 1168.018671] libceph: corrupt inc osdmap epoch 25316 off 98 (ffff8800467eb07e of ffff8800467eb01c-ffff8800467eb2ae)

[ 1168.028042] libceph: corrupt inc osdmap epoch 25317 off 98 (ffff8800033e5318 of ffff8800033e52b6-ffff8800033e56d8)

I took a snapshot of the contents of $mondata/osdmap and $mondata/osdmap_full directories on one of the monitors (attached) and verified that the files {osdmap,osdmap_full}/{25315..25317} are identical on all monitors.

Actions #3

Updated by Sage Weil almost 12 years ago

  • Status changed from Need More Info to 7

Aha, I see the bug. You can apply the following patch and the problem should go away:

diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index 2592f3c..011a3a9 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -883,8 +883,12 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
                pglen = ceph_decode_32(p);

                if (pglen) {
-                       /* insert */
                        ceph_decode_need(p, end, pglen*sizeof(u32), bad);
+
+                       /* removing existing (if any) */
+                       __remove_pg_mapping(&map->pg_temp, pgid);
+
+                       /* insert */
                        pg = kmalloc(sizeof(*pg) + sizeof(u32)*pglen, GFP_NOFS);
                        if (!pg) {
                                err = -ENOMEM;
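
The ordering the patch introduces (remove any existing entry, then insert) can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the list-based map and the names insert_pg, remove_pg and update_pg are hypothetical stand-ins, not the kernel's pg_temp code, and the idea that the old failure came from an insert rejecting an already-present pgid is a reading of the patch rather than something the diff itself shows.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for one pg_temp mapping entry. */
struct pg_entry {
    uint64_t pgid;
    struct pg_entry *next;
};

/* Insert that refuses duplicates: returns -EEXIST if pgid is present. */
static int insert_pg(struct pg_entry **head, uint64_t pgid)
{
    for (struct pg_entry *e = *head; e; e = e->next)
        if (e->pgid == pgid)
            return -EEXIST;

    struct pg_entry *e = malloc(sizeof(*e));
    if (!e)
        return -ENOMEM;
    e->pgid = pgid;
    e->next = *head;
    *head = e;
    return 0;
}

/* Remove the entry for pgid if present; -ENOENT if there was none. */
static int remove_pg(struct pg_entry **head, uint64_t pgid)
{
    for (struct pg_entry **pp = head; *pp; pp = &(*pp)->next) {
        if ((*pp)->pgid == pgid) {
            struct pg_entry *dead = *pp;
            *pp = dead->next;
            free(dead);
            return 0;
        }
    }
    return -ENOENT;
}

/* Apply an update in the order the patch uses: drop any existing
 * mapping first (a missing one is not an error), then insert. */
static int update_pg(struct pg_entry **head, uint64_t pgid)
{
    (void)remove_pg(head, pgid);
    return insert_pg(head, pgid);
}
```

In this model, calling insert_pg twice with the same pgid fails the second time with -EEXIST, while update_pg succeeds whether or not the pgid was already mapped.
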
Actions #4

Updated by Karol Jurak almost 12 years ago

I tested this patch for a couple of hours today and saw no 'corrupt inc osdmap' messages. Thanks.

Actions #5

Updated by Sage Weil almost 12 years ago

  • Status changed from 7 to Resolved

Thanks for testing!
