Bug #2563

closed

leveldb corruption

Added by Samuel Just almost 12 years ago. Updated almost 11 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Development

Description

This was also mentioned once on the mailing list.

ceph version 0.47.2 (8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
1: /usr/bin/ceph-osd() [0x6eb32a]
2: (()+0xfcb0) [0x7f160bfa0cb0]
3: (gsignal()+0x35) [0x7f160a491445]
4: (abort()+0x17b) [0x7f160a494bab]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f160addf69d]
6: (()+0xb5846) [0x7f160addd846]
7: (()+0xb5873) [0x7f160addd873]
8: (()+0xb596e) [0x7f160addd96e]
9: (std::__throw_length_error(char const*)+0x57) [0x7f160ad8a907]
10: (()+0x9eaa2) [0x7f160adc6aa2]
11: (char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag)+0x35) [0x7f160adc8495]
12: (std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, unsigned long, std::allocator<char> const&)+0x1d) [0x7f160adc861d]
13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*, leveldb::Slice const&) const+0x47) [0x6d1ce7]
14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice const&)+0x92) [0x6e0712]
15: (leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*)+0x482) [0x6cc552]
16: (leveldb::DBImpl::BackgroundCompaction()+0x2b0) [0x6ccd50]
17: (leveldb::DBImpl::BackgroundCall()+0x68) [0x6cd7f8]
18: /usr/bin/ceph-osd() [0x6e679f]
19: (()+0x7e9a) [0x7f160bf98e9a]
20: (clone()+0x6d) [0x7f160a54d4bd]


Files

omap.tgz (5.12 MB) - Omap archive - Samuel Just, 06/12/2012 02:55 PM
omap-20120917.tgz (9.2 MB) - OMAP Tarball - Matt Garner, 09/17/2012 02:04 PM
Actions #1

Updated by Samuel Just almost 12 years ago

It's triggerable without ceph. I've filed a bug with leveldb (link below) and I'm continuing to look into it.

http://code.google.com/p/leveldb/issues/detail?id=97

Actions #2

Updated by Samuel Just almost 12 years ago

  • Status changed from New to Can't reproduce

It looks like one of the leveldb store files was corrupted, possibly by the filesystem. It may be possible to recover using the instructions in the leveldb tracker link above.

Actions #3

Updated by Matt Garner over 11 years ago

Experiencing the same issue on a production ceph cluster.

ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
1: /usr/bin/ceph-osd() [0x6edaba]
2: (()+0xfcb0) [0x7f5a09b47cb0]
3: (gsignal()+0x35) [0x7f5a08723445]
4: (abort()+0x17b) [0x7f5a08726bab]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f5a0907169d]
6: (()+0xb5846) [0x7f5a0906f846]
7: (()+0xb5873) [0x7f5a0906f873]
8: (()+0xb596e) [0x7f5a0906f96e]
9: (std::__throw_length_error(char const*)+0x57) [0x7f5a0901c907]
10: (()+0x9eaa2) [0x7f5a09058aa2]
11: (char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag)+0x35) [0x7f5a0905a495]
12: (std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, unsigned long, std::allocator<char> const&)+0x1d) [0x7f5a0905a61d]
13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*, leveldb::Slice const&) const+0x47) [0x6d43d7]
14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice const&)+0x92) [0x6e2e02]
15: (leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*)+0x482) [0x6cec42]
16: (leveldb::DBImpl::BackgroundCompaction()+0x2b0) [0x6cf440]
17: (leveldb::DBImpl::BackgroundCall()+0x68) [0x6cfee8]
18: /usr/bin/ceph-osd() [0x6e8e8f]
19: (()+0x7e9a) [0x7f5a09b3fe9a]
20: (clone()+0x6d) [0x7f5a087df4bd]
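Both backtraces in this report differ only in load addresses, which makes it tedious to confirm by eye that they are the same crash. A minimal sketch of one way to compare them follows: it keeps only the named symbol of each frame and drops offsets and addresses. The function name `symbol_frames` and the regular expression are illustrative assumptions, not part of any ceph tooling, and the pattern only targets the frame format shown here.

```python
import html
import re

# Matches frames such as "13: (leveldb::Foo(...) const+0x47) [0x6d1ce7]".
# Group 2 is everything inside the outer parentheses before "+0x<offset>".
FRAME_RE = re.compile(r"^\s*(\d+):\s+\((.*)\+0x[0-9a-f]+\)\s+\[0x[0-9a-f]+\]\s*$")


def symbol_frames(trace_text):
    """Return the symbol of every frame that has one, in trace order.

    Frames without a resolved symbol, e.g. "(()+0xfcb0)" or a bare binary
    path, are skipped because they carry only addresses.
    """
    symbols = []
    for line in trace_text.splitlines():
        # Undo HTML escaping such as "&lt;" that some pastes carry.
        match = FRAME_RE.match(html.unescape(line))
        if match and match.group(2) != "()":
            symbols.append(match.group(2))
    return symbols
```

Two traces can then be treated as the same crash when `symbol_frames(first) == symbol_frames(second)`.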

osd.7 is one of eight identical PowerEdge 850 units, each with an mdadm raid0 across 2x 2TB or 3TB drives, running btrfs.
All machines are running 12.04 and 0.48.1argonaut from deb packages.

This osd had just been added to the existing cluster and was in the process of its initial population of pgs from other osds in the cluster.

The only unusual thing about this osd was that I had enabled btrfs compression=zlib on the partition housing the osd data.

I did a btrfsck of the volume containing the omap and found no errors.

df -h:
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/md0     19G  3.0G    14G   18%  /
udev        2.0G  4.0K   2.0G    1%  /dev
tmpfs       791M  268K   791M    1%  /run
none        5.0M     0   5.0M    0%  /run/lock
none        2.0G     0   2.0G    0%  /run/shm
/dev/md0     19G  3.0G    14G   18%  /home
/dev/sdc1    93M   31M    57M   36%  /boot
/dev/md1    5.5T  655G   4.8T   12%  /data

ceph.conf:
[osd]
osd data = /data/ceph/osd/ceph-7
keyring = /data/ceph/osd/ceph-7/keyring
osd journal = /data/ceph/osd/ceph-7/journal
osd journal size = 2000
filestore xattr use omap = true
debug optracker = 20
debug journal = 20
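
For anyone trying to locate the leveldb store on a similarly configured OSD, the [osd] section above can be read programmatically. This is a rough sketch only: the helper name `omap_settings` is made up for illustration, and the `current/omap` subdirectory under `osd data` is an assumption about the filestore layout that should be checked on the actual machine.

```python
import configparser
import posixpath


def omap_settings(conf_text, section="osd"):
    """Report whether xattrs go to omap and where the omap store probably is."""
    # Interpolation is disabled so a literal "%" in a value is not an error.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    osd = parser[section]
    return {
        "uses_omap_for_xattrs": osd.getboolean(
            "filestore xattr use omap", fallback=False
        ),
        # Assumed layout: <osd data>/current/omap holds the leveldb files.
        "omap_dir": posixpath.join(osd["osd data"], "current", "omap"),
    }
```

With the configuration shown above, the sketch would point at /data/ceph/osd/ceph-7/current/omap, provided the layout assumption holds.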

Ceph log dump is here:
http://www.mattgarner.com/ceph/ceph-osd.7-20120917.tgz

Actions #4

Updated by Greg Farnum over 11 years ago

  • Status changed from Can't reproduce to 12

Just got another report of this on the list.
This user has enabled btrfs' lzo compression, and I believe btrfs compression has been a common thread across everybody who's reported this problem.
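
To check whether an affected machine shares that common thread, the mount table can be scanned for btrfs compression options. A minimal sketch, assuming the usual /proc/mounts layout (device, mount point, filesystem type, options) and that compression appears as a `compress` or `compress-force` option; the function name `btrfs_compression` is hypothetical.

```python
def btrfs_compression(mounts_text):
    """Map each btrfs mount point to its compression option, or None."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expected fields: device, mount point, fs type, options, dump, pass.
        if len(fields) < 4 or fields[2] != "btrfs":
            continue
        options = fields[3].split(",")
        compress = [
            opt for opt in options
            if opt.split("=")[0] in ("compress", "compress-force")
        ]
        result[fields[1]] = compress[0] if compress else None
    return result
```

On a live system this could be called as `btrfs_compression(open("/proc/mounts").read())`; any mount point holding OSD data that maps to a non-None value has compression enabled.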

Actions #5

Updated by Samuel Just almost 11 years ago

  • Status changed from 12 to Resolved