Bug #38358

closed

short pg log + cache tier ceph_test_rados out of order reply

Added by Sage Weil over 5 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: nautilus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

the combination of

- 1-pg-log-overrides/short_pg_log.yaml

and

- workloads/cache-agent-small.yaml

and any msgr failure injection

results in a ceph_test_rados crash like

2019-02-16T12:48:16.152 INFO:tasks.rados.rados.0.smithi139.stdout:3323:  finishing write tid 3 to smithi13913891-294
2019-02-16T12:48:16.152 INFO:tasks.rados.rados.0.smithi139.stdout:3323:  finishing write tid 2 to smithi13913891-294
2019-02-16T12:48:16.152 INFO:tasks.rados.rados.0.smithi139.stderr:Error: finished tid 2 when last_acked_tid was 3
2019-02-16T12:48:16.152 INFO:tasks.rados.rados.0.smithi139.stderr:/build/ceph-14.0.1-3796-g597cd08/src/test/osd/RadosModel.h: In function 'virtual void WriteOp::_finish(TestOp::CallbackInfo*)' thread 7fdcb4ff9700 time 2019-02-16 12:48:16.152554
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr:/build/ceph-14.0.1-3796-g597cd08/src/test/osd/RadosModel.h: 905: abort()
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: ceph version 14.0.1-3796-g597cd08 (597cd0800d5525c39d588f536bfb01afed545bdb) nautilus (dev)
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xda) [0x7fdccd2799b7]
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: 2: (WriteOp::_finish(TestOp::CallbackInfo*)+0x5eb) [0x55d3145cacfb]
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: 3: (write_callback(void*, void*)+0x19) [0x55d3145e6899]
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: 4: (()+0x537d6) [0x7fdcd5ea57d6]
2019-02-16T12:48:16.153 INFO:tasks.rados.rados.0.smithi139.stderr: 5: (Context::complete(int)+0x9) [0x7fdcd5e89739]
2019-02-16T12:48:16.154 INFO:tasks.rados.rados.0.smithi139.stderr: 6: (Finisher::finisher_thread_entry()+0x16e) [0x7fdccd2be79e]
2019-02-16T12:48:16.154 INFO:tasks.rados.rados.0.smithi139.stderr: 7: (()+0x76db) [0x7fdcccdf86db]
2019-02-16T12:48:16.154 INFO:tasks.rados.rados.0.smithi139.stderr: 8: (clone()+0x3f) [0x7fdccc57b88f]

/a/kchai-2019-02-16_11:36:29-rados-wip-sage-testing-2019-02-16-1748-distro-basic-smithi/3601272
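The abort in the log above comes from an ordering check on write completions. As a rough illustration only (not the code in RadosModel.h), the invariant can be sketched as follows. The class and exception names are hypothetical, and the `<=` comparison is an assumption inferred from the error message.

```python
class OrderingViolation(Exception):
    """Raised where the real test client would abort (hypothetical name)."""


class WriteTracker:
    """Hypothetical stand-in for per-object completion tracking in a test client."""

    def __init__(self):
        self.last_acked_tid = 0

    def finish(self, tid):
        # Assumption: completions for one object must arrive with
        # strictly increasing tids; anything else is an out-of-order reply.
        if tid <= self.last_acked_tid:
            raise OrderingViolation(
                f"finished tid {tid} when last_acked_tid was {self.last_acked_tid}"
            )
        self.last_acked_tid = tid


# Replay the completion order seen in the log: tid 3 finishes, then tid 2.
tracker = WriteTracker()
tracker.finish(3)
try:
    tracker.finish(2)
    message = None
except OrderingViolation as e:
    message = str(e)
```

In this sketch the second call raises, mirroring the "finished tid 2 when last_acked_tid was 3" line in the log.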

The short pg log in the base tier means that reqids aren't reliably propagated back to the cache tier, which breaks ordering when client ops are resent.
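To make the general idea concrete, here is a toy model of duplicate detection backed by a bounded log. It is a sketch under loud assumptions: `ToyPGLog`, `submit`, and the reqid strings are all hypothetical, and it does not model the cache tier, the base tier, or how Ceph actually propagates reqids. It only shows that once a log is trimmed short enough, a resent request is no longer recognised as a duplicate.

```python
from collections import OrderedDict


class ToyPGLog:
    """Toy bounded log of request ids (hypothetical, not Ceph's pg log)."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.entries = OrderedDict()  # reqid -> version at which it was applied
        self.version = 0

    def submit(self, reqid):
        # A known reqid is answered from the log instead of being re-applied.
        if reqid in self.entries:
            return ("dup", self.entries[reqid])
        # Otherwise apply it as a new op and record it.
        self.version += 1
        self.entries[reqid] = self.version
        # Trim the oldest entries once the log exceeds its limit.
        while len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)
        return ("applied", self.version)


def replay(max_entries):
    """Submit tid 2 and tid 3, then resend tid 2 (as after a msgr failure)."""
    log = ToyPGLog(max_entries)
    log.submit("client.1:2")
    log.submit("client.1:3")
    return log.submit("client.1:2")


long_result = replay(max_entries=10)  # resend is detected as a duplicate
short_result = replay(max_entries=1)  # resend is applied again, after tid 3
```

With the longer log the resend of tid 2 is matched to its original version; with a log of one entry it is treated as a new op ordered after tid 3, which is the kind of reordering the test client rejects.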


Related issues: 2 (0 open, 2 closed)

Related to RADOS - Bug #24320: out of order reply and/or osd assert with set-chunks-read.yaml (Resolved, 05/26/2018)
Copied to RADOS - Backport #43346: nautilus: short pg log + cache tier ceph_test_rados out of order reply (Resolved, Nathan Cutler)

Actions #1

Updated by Sage Weil over 5 years ago

  • Related to Bug #24320: out of order reply and/or osd assert with set-chunks-read.yaml added
Actions #2

Updated by Sage Weil over 5 years ago

/a/sage-2019-02-21_06:38:51-rados-wip-sage-testing-2019-02-20-2138-distro-basic-smithi/3620775

Actions #3

Updated by Sage Weil over 5 years ago

/a/sage-2019-02-23_23:02:18-rados-wip-sage2-testing-2019-02-23-1354-distro-basic-smithi/3631889

Actions #4

Updated by Neha Ojha over 5 years ago

This is on luminous:

/a/teuthology-2019-02-23_01:30:03-rados-luminous-distro-basic-smithi/3627561/

We recently changed the pg log limits for short_pg_log.yaml, which may be the reason why these failures are popping up more.

Actions #5

Updated by Neha Ojha about 5 years ago

/a/yuriw-2019-03-07_00:04:47-rados-wip_yuri_nautilus_3.6.19-distro-basic-smithi/3675857/

Actions #6

Updated by Sage Weil almost 5 years ago

Avoiding this in the qa suite as of this PR: https://github.com/ceph/ceph/pull/28658

Actions #7

Updated by Patrick Donnelly over 4 years ago

  • Status changed from 12 to New
Actions #8

Updated by Neha Ojha over 4 years ago

  • Status changed from New to Pending Backport
  • Backport set to nautilus

Seen in nautilus: /a/yuriw-2019-12-15_16:25:11-rados-wip-yuri-nautilus-baseline_12.13.19-distro-basic-smithi/4605500/

Actions #9

Updated by Nathan Cutler over 4 years ago

  • Copied to Backport #43346: nautilus: short pg log + cache tier ceph_test_rados out of order reply added
Actions #10

Updated by Nathan Cutler over 4 years ago

  • Pull request ID set to 28658
Actions #11

Updated by Nathan Cutler over 4 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
