Bug #2563
leveldb corruption
Status: Closed
Description
This was also mentioned once in the mailing list.
ceph version 0.47.2 (8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
1: /usr/bin/ceph-osd() [0x6eb32a]
2: (()+0xfcb0) [0x7f160bfa0cb0]
3: (gsignal()+0x35) [0x7f160a491445]
4: (abort()+0x17b) [0x7f160a494bab]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f160addf69d]
6: (()+0xb5846) [0x7f160addd846]
7: (()+0xb5873) [0x7f160addd873]
8: (()+0xb596e) [0x7f160addd96e]
9: (std::__throw_length_error(char const*)+0x57) [0x7f160ad8a907]
10: (()+0x9eaa2) [0x7f160adc6aa2]
11: (char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag)+0x35) [0x7f160adc8495]
12: (std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, unsigned long, std::allocator<char> const&)+0x1d) [0x7f160adc861d]
13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*, leveldb::Slice const&) const+0x47) [0x6d1ce7]
14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice const&)+0x92) [0x6e0712]
15: (leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*)+0x482) [0x6cc552]
16: (leveldb::DBImpl::BackgroundCompaction()+0x2b0) [0x6ccd50]
17: (leveldb::DBImpl::BackgroundCall()+0x68) [0x6cd7f8]
18: /usr/bin/ceph-osd() [0x6e679f]
19: (()+0x7e9a) [0x7f160bf98e9a]
20: (clone()+0x6d) [0x7f160a54d4bd]
Updated by Samuel Just almost 12 years ago
It's triggerable without ceph; I've filed a bug with leveldb (linked below) and I'm continuing to look into it.
Updated by Samuel Just almost 12 years ago
- Status changed from New to Can't reproduce
It looks like one of the leveldb store files was corrupted, possibly by the filesystem. It may be possible to recover using the instructions in the leveldb tracker link above.
Updated by Matt Garner over 11 years ago
- File omap-20120917.tgz omap-20120917.tgz added
Experiencing the same issue on a production ceph cluster.
ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
1: /usr/bin/ceph-osd() [0x6edaba]
2: (()+0xfcb0) [0x7f5a09b47cb0]
3: (gsignal()+0x35) [0x7f5a08723445]
4: (abort()+0x17b) [0x7f5a08726bab]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f5a0907169d]
6: (()+0xb5846) [0x7f5a0906f846]
7: (()+0xb5873) [0x7f5a0906f873]
8: (()+0xb596e) [0x7f5a0906f96e]
9: (std::__throw_length_error(char const*)+0x57) [0x7f5a0901c907]
10: (()+0x9eaa2) [0x7f5a09058aa2]
11: (char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag)+0x35) [0x7f5a0905a495]
12: (std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, unsigned long, std::allocator<char> const&)+0x1d) [0x7f5a0905a61d]
13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*, leveldb::Slice const&) const+0x47) [0x6d43d7]
14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice const&)+0x92) [0x6e2e02]
15: (leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*)+0x482) [0x6cec42]
16: (leveldb::DBImpl::BackgroundCompaction()+0x2b0) [0x6cf440]
17: (leveldb::DBImpl::BackgroundCall()+0x68) [0x6cfee8]
18: /usr/bin/ceph-osd() [0x6e8e8f]
19: (()+0x7e9a) [0x7f5a09b3fe9a]
20: (clone()+0x6d) [0x7f5a087df4bd]
osd.7 is one of eight identical PowerEdge 850 units, each with an mdadm raid0 across 2x 2TB or 3TB drives, running btrfs.
All machines run Ubuntu 12.04 and 0.48.1argonaut from deb packages.
This osd had just been added to the existing cluster and was in the process of its initial population of pgs from the other osds in the cluster.
The only unusual thing about this osd was that I had enabled btrfs compression=zlib on the partition housing the osd data.
I did a btrfsck of the volume containing the omap and found no errors.
df -h:
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0         19G  3.0G   14G  18% /
udev            2.0G  4.0K  2.0G   1% /dev
tmpfs           791M  268K  791M   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            2.0G     0  2.0G   0% /run/shm
/dev/md0         19G  3.0G   14G  18% /home
/dev/sdc1        93M   31M   57M  36% /boot
/dev/md1        5.5T  655G  4.8T  12% /data
ceph.conf:
[osd]
osd data = /data/ceph/osd/ceph-7
keyring = /data/ceph/osd/ceph-7/keyring
osd journal = /data/ceph/osd/ceph-7/journal
osd journal size = 2000
filestore xattr use omap = true
debug optracker = 20
debug journal = 20
Ceph log dump is here:
http://www.mattgarner.com/ceph/ceph-osd.7-20120917.tgz
Updated by Greg Farnum over 11 years ago
- Status changed from Can't reproduce to 12
Just got another report of this on the list.
This user has enabled btrfs' lzo compression, and I believe btrfs compression has been a common thread across everybody who's reported this problem.