Bug #38358
closed
short pg log + cache tier ceph_test_rados out of order reply
Description
the combination of
- 1-pg-log-overrides/short_pg_log.yaml
and
- workloads/cache-agent-small.yaml
and any msgr failure injection
results in a ceph_test_rados crash like
2019-02-16T12:48:16.152 INFO:tasks.rados.rados.0.smithi139.stdout:3323: finishing write tid 3 to smithi13913891-294
2019-02-16T12:48:16.152 INFO:tasks.rados.rados.0.smithi139.stdout:3323: finishing write tid 2 to smithi13913891-294
2019-02-16T12:48:16.152 INFO:tasks.rados.rados.0.smithi139.stderr:Error: finished tid 2 when last_acked_tid was 3
2019-02-16T12:48:16.152 INFO:tasks.rados.rados.0.smithi139.stderr:/build/ceph-14.0.1-3796-g597cd08/src/test/osd/RadosModel.h: In function 'virtual void WriteOp::_finish(TestOp::CallbackInfo*)' thread 7fdcb4ff9700 time 2019-02-16 12:48:16.152554
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr:/build/ceph-14.0.1-3796-g597cd08/src/test/osd/RadosModel.h: 905: abort()
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: ceph version 14.0.1-3796-g597cd08 (597cd0800d5525c39d588f536bfb01afed545bdb) nautilus (dev)
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xda) [0x7fdccd2799b7]
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: 2: (WriteOp::_finish(TestOp::CallbackInfo*)+0x5eb) [0x55d3145cacfb]
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: 3: (write_callback(void*, void*)+0x19) [0x55d3145e6899]
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: 4: (()+0x537d6) [0x7fdcd5ea57d6]
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: 5: (Context::complete(int)+0x9) [0x7fdcd5e89739]
2019-02-16T12:48:16.154 INFO:tasks.rados.rados.0.smithi139.stderr: 6: (Finisher::finisher_thread_entry()+0x16e) [0x7fdccd2be79e]
2019-02-16T12:48:16.154 INFO:tasks.rados.rados.0.smithi139.stderr: 7: (()+0x76db) [0x7fdcccdf86db]
2019-02-16T12:48:16.154 INFO:tasks.rados.rados.0.smithi139.stderr: 8: (clone()+0x3f) [0x7fdccc57b88f]
/a/kchai-2019-02-16_11:36:29-rados-wip-sage-testing-2019-02-16-1748-distro-basic-smithi/3601272
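For readers unfamiliar with the check that fires here, the following is a minimal Python sketch of an in-order acknowledgement check. It is inferred only from the error text in the log ("finished tid 2 when last_acked_tid was 3") and is not the RadosModel.h code. The names `AckTracker`, `finish`, and `OutOfOrderAck` are hypothetical, and the use of `<=` rather than `<` is an assumption.

```python
class OutOfOrderAck(Exception):
    """Raised when a write completes after a later write was already acked."""


class AckTracker:
    """Hypothetical helper: tracks the last acknowledged write tid for one object."""

    def __init__(self):
        self.last_acked_tid = 0

    def finish(self, tid):
        # Assumption: a completion is valid only if its tid is strictly
        # greater than every tid acknowledged so far.
        if tid <= self.last_acked_tid:
            raise OutOfOrderAck(
                f"finished tid {tid} when last_acked_tid was {self.last_acked_tid}"
            )
        self.last_acked_tid = tid


# Replay the completion order seen in the log: tid 3 finishes, then tid 2.
tracker = AckTracker()
tracker.finish(3)
try:
    tracker.finish(2)
    message = None
except OutOfOrderAck as exc:
    message = str(exc)
```

In this sketch the second completion is rejected, which corresponds to the abort() in the trace above.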
The short pg log in the base tier means that reqids aren't reliably propagated back to the cache tier, which breaks ordering when client ops are resent.
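To illustrate why the log length matters, here is a toy Python model. It is not Ceph code. It assumes that a resent op is recognized as a duplicate by looking up its reqid in a bounded log, and that an entry trimmed from the log can no longer be matched. `BoundedLog`, `submit`, `resend_is_detected`, and the reqid strings are all hypothetical.

```python
from collections import OrderedDict


class BoundedLog:
    """Toy log that remembers only the most recent max_entries reqids."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.entries = OrderedDict()  # reqid -> version assigned when first applied
        self.version = 0

    def submit(self, reqid):
        """Apply an op; return (version, was_dup)."""
        if reqid in self.entries:
            # Known reqid: answer with the original version, do not reapply.
            return self.entries[reqid], True
        self.version += 1
        self.entries[reqid] = self.version
        # Trim the oldest entries once the log exceeds its limit.
        while len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)
        return self.version, False


def resend_is_detected(max_entries):
    """Submit three ops, then resend the first; report whether it was seen as a dup."""
    log = BoundedLog(max_entries)
    log.submit("client.1:2")
    log.submit("client.1:3")
    log.submit("client.1:4")
    _, was_dup = log.submit("client.1:2")  # simulated resend after a msgr failure
    return was_dup


long_log_detects = resend_is_detected(max_entries=10)
short_log_detects = resend_is_detected(max_entries=2)
```

With the longer log the resend is matched to its original entry. With the short log the entry has already been trimmed, so the resent op is treated as new and applied after later ops, which is the kind of reordering the test client flags.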
Updated by Sage Weil over 5 years ago
- Related to Bug #24320: out of order reply and/or osd assert with set-chunks-read.yaml added
Updated by Sage Weil over 5 years ago
/a/sage-2019-02-21_06:38:51-rados-wip-sage-testing-2019-02-20-2138-distro-basic-smithi/3620775
Updated by Sage Weil over 5 years ago
/a/sage-2019-02-23_23:02:18-rados-wip-sage2-testing-2019-02-23-1354-distro-basic-smithi/3631889
Updated by Neha Ojha over 5 years ago
This is on luminous:
/a/teuthology-2019-02-23_01:30:03-rados-luminous-distro-basic-smithi/3627561/
We recently changed the pg log limits for short_pg_log.yaml, which may be the reason why these failures are popping up more.
Updated by Neha Ojha about 5 years ago
/a/yuriw-2019-03-07_00:04:47-rados-wip_yuri_nautilus_3.6.19-distro-basic-smithi/3675857/
Updated by Sage Weil almost 5 years ago
avoiding this in the qa suite as of this pr: https://github.com/ceph/ceph/pull/28658
Updated by Neha Ojha over 4 years ago
- Status changed from New to Pending Backport
- Backport set to nautilus
Seen in nautilus: /a/yuriw-2019-12-15_16:25:11-rados-wip-yuri-nautilus-baseline_12.13.19-distro-basic-smithi/4605500/
Updated by Nathan Cutler over 4 years ago
- Copied to Backport #43346: nautilus: short pg log + cache tier ceph_test_rados out of order reply added
Updated by Nathan Cutler over 4 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".