Bug #213: non-idempotent transactions (clone) under ext3 may not replay correct result - Ceph - Ceph

Bug #213

closed

non-idempotent transactions (clone) under ext3 may not replay correct result

Added by Sage Weil almost 14 years ago. Updated over 12 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Sage Weil

Category:

OSD

Target version:

v0.39

Description

The writeahead journaling will restore the store to a known state regardless of which operations have committed, but only if the transactions are idempotent, i.e. they can be repeated and still end up at a known result. This is true of all the normal operations (remove, write, truncate, setxattr, etc.), but not of clone and clone_range. E.g.,

clone foo_head -> foo_2
 write foo_head

will put the old head content in foo_2, but if it is replayed, foo_2 will contain the new foo_head content. Meh!
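
A minimal sketch of the replay problem, using a plain dict as a stand-in for the object store. The `apply` and `run` helpers and the tuple encoding of operations are assumptions made for illustration, not Ceph code.

```python
# Toy model: a journal replayed on top of a store that already holds
# the journaled changes. All names here are illustrative.

def apply(store, op):
    kind = op[0]
    if kind == "write":        # ("write", name, data): overwrite the object
        _, name, data = op
        store[name] = data
    elif kind == "clone":      # ("clone", src, dst): copy current src to dst
        _, src, dst = op
        store[dst] = store[src]
    else:
        raise ValueError(kind)

def run(store, journal):
    for op in journal:
        apply(store, op)
    return store

journal = [("clone", "foo_head", "foo_2"), ("write", "foo_head", "new")]

# Normal execution: foo_2 captures the old head content.
once = run({"foo_head": "old"}, journal)

# Crash after both ops reached disk but before they were marked committed:
# the whole journal is replayed on top of the already-updated store.
replayed = run(dict(once), journal)

print(once["foo_2"], replayed["foo_2"])
```

Replaying the write alone would be harmless; it is the clone reading the already-modified foo_head that changes the outcome.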

How to fix? We could mark which transactions are non-idempotent, and sync the store before applying them. That's expensive, but possibly the price you pay for not using btrfs! :)

Related issues 1 (0 open — 1 closed)

Related to Ceph - Bug #2098: xfs/ext4 non-idempotent transaction

Resolved

Sage Weil

02/23/2012

Updated by Sage Weil almost 14 years ago

Target version changed from v0.21 to v0.22

Updated by Sage Weil over 13 years ago

Target version changed from v0.22 to v0.23

Updated by Sage Weil over 13 years ago

Target version changed from v0.23 to v0.24

Updated by Sage Weil over 13 years ago

Target version deleted (~~v0.24~~)

Updated by Sage Weil over 12 years ago

Priority changed from High to Normal

Updated by Anonymous over 12 years ago

Isn't the idempotency in that case "clone foo_head -> foo_2 IFF foo_2 does not exist" ?

Updated by Sage Weil over 12 years ago

Tommi Virtanen wrote:

Isn't the idempotency in that case "clone foo_head -> foo_2 IFF foo_2 does not exist" ?

That's almost enough for clone() (if we add O_EXCL and whitelist EEXIST for non-btrfs). It wouldn't catch something like

1 clone A->B
2 modify A
  ...
3 delete B
4 <crash>
  <replay from 1>

That trick also wouldn't work for clone_range(), which doesn't create a file.
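
A rough sketch of the O_EXCL idea and where it stops helping, using ordinary files in a temporary directory. The `clone_excl` helper is hypothetical (it copies file contents rather than doing a real filesystem clone) and the file names are placeholders.

```python
import os
import tempfile

def clone_excl(src, dst):
    """Copy src to dst only if dst does not exist yet; return True if copied."""
    try:
        fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    except FileExistsError:      # EEXIST: treat as "already done" on replay
        return False
    with os.fdopen(fd, "wb") as out, open(src, "rb") as inp:
        out.write(inp.read())
    return True

def write_file(path, data):
    with open(path, "wb") as f:
        f.write(data)

def read_file(path):
    with open(path, "rb") as f:
        return f.read()

with tempfile.TemporaryDirectory() as d:
    path_a, path_b = os.path.join(d, "A"), os.path.join(d, "B")
    write_file(path_a, b"old")

    # B still exists at replay time, so the guard skips the clone.
    clone_excl(path_a, path_b)                 # 1 clone A->B
    write_file(path_a, b"new")                 # 2 modify A
    skipped = not clone_excl(path_a, path_b)   # replay of 1
    b_after_guarded_replay = read_file(path_b)

    # B was deleted before the crash, so the guard cannot help.
    os.remove(path_b)                          # 3 delete B
    recloned = clone_excl(path_a, path_b)      # replay of 1
    b_after_unguarded_replay = read_file(path_b)
```

In the second case the replayed clone recreates B from the modified A. Whether that matters depends on what the elided steps between 2 and 3 did with B.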

It may be that we need to make transactions idempotent at a higher level, but that would depend on what you clone to, and whether it is ever modified/removed. It would depend on the particulars, though, and be hard to analyze/verify.

Updated by Sage Weil over 12 years ago

FWIW even if we know what not to replay, we could still be screwed with ext4 (which does not commit everything in order):

clone A->B
modify A
<fs commits A (before B)>
<crash>

On replay, we don't actually have the old A to clone to B. :(

#10

Updated by Sage Weil over 12 years ago

I think the simplest solution would be:

- for all operations, set an xattr on the file recording the op_seq of the last operation to write to it.
- for any operation that is potentially non-idempotent, fsync(2) after doing it.
- on replay, skip an operation if the file's xattr is equal to or newer than its op_seq, to avoid re-doing it.

Those operations would be:

- create collection
- clone
- clone range

We need to set the attr on all operations to avoid something like

1 truncate B
2 clone A->B
3 modify A
<crash>
<replay 1>
<skip 2 due to xattr>
<replay 3>
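
The trace above can be sketched with a toy model in which a `last_seq` dict stands in for the per-file xattr. The helper names and the `stamp_all` switch are assumptions for illustration, and the fsync(2) step is not modeled. With stamping only on clone, the replayed truncate empties B and the skipped clone never refills it; with stamping on every operation, the truncate is skipped too.

```python
NON_IDEMPOTENT = {"clone"}

def target_of(op):
    # The object an operation writes to: dst for clone, else the named object.
    return op[2] if op[0] == "clone" else op[1]

def apply(store, op):
    if op[0] == "truncate":
        store[op[1]] = ""
    elif op[0] == "write":
        store[op[1]] = op[2]
    elif op[0] == "clone":
        store[op[2]] = store[op[1]]

def run(store, last_seq, journal, stamp_all):
    """Apply journal entries, skipping those the per-object stamp says are done."""
    for seq, op in journal:
        tracked = stamp_all or op[0] in NON_IDEMPOTENT
        target = target_of(op)
        if tracked and last_seq.get(target, 0) >= seq:
            continue                      # already applied before the crash
        apply(store, op)
        if tracked:
            last_seq[target] = seq        # stands in for the xattr update
    return store

journal = [
    (1, ("truncate", "B")),
    (2, ("clone", "A", "B")),
    (3, ("write", "A", "new")),
]

def crash_and_replay(stamp_all):
    store, last_seq = {"A": "old", "B": "stale"}, {}
    run(store, last_seq, journal, stamp_all)   # first pass, all ops hit disk
    run(store, last_seq, journal, stamp_all)   # replay after the crash
    return store

partial = crash_and_replay(stamp_all=False)
full = crash_and_replay(stamp_all=True)
print(partial["B"], full["B"])
```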

#15

Updated by Sage Weil over 12 years ago

Target version set to v0.39

#16

Updated by Sage Weil over 12 years ago

Update: the current first-pass plan is to initiate a FileStore sync after any non-idempotent operation. This updates commit_op_seq on disk and ensures that the operation won't be replayed.

It's also heavyweight as it calls sync(2). So it's a big hammer, but at least it's correct.

We can set a non-idempotent bool in the do_transaction method on CLONE or anything similar and then do the commit at the end (before any other operations occur under the current OpSequencer).
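
A toy sketch of that plan, not the actual FileStore code: `ToyStore`, its `sync` method, and the `committed_seq` field (standing in for commit_op_seq on disk) are hypothetical names, and the real sync(2) call is reduced to bumping a counter.

```python
class ToyStore:
    def __init__(self, objects, sync_on_non_idempotent):
        self.objects = dict(objects)
        self.committed_seq = 0          # stands in for commit_op_seq on disk
        self.sync_on_non_idempotent = sync_on_non_idempotent
        self.syncs = 0

    def sync(self, seq):
        # Pretend everything up to seq is now durable and marked committed.
        self.committed_seq = seq
        self.syncs += 1

    def do_transaction(self, seq, ops):
        non_idempotent = False
        for op in ops:
            if op[0] == "clone":
                self.objects[op[2]] = self.objects[op[1]]
                non_idempotent = True   # remember to commit at the end
            elif op[0] == "write":
                self.objects[op[1]] = op[2]
        if non_idempotent and self.sync_on_non_idempotent:
            self.sync(seq)

    def replay(self, journal):
        # Only transactions newer than the committed point are re-applied.
        for seq, ops in journal:
            if seq > self.committed_seq:
                self.do_transaction(seq, ops)

journal = [
    (1, [("clone", "foo_head", "foo_2")]),
    (2, [("write", "foo_head", "new")]),
]

def crash_and_replay(sync_on_non_idempotent):
    store = ToyStore({"foo_head": "old"}, sync_on_non_idempotent)
    for seq, ops in journal:
        store.do_transaction(seq, ops)
    store.replay(journal)               # crash, then journal replay
    return store

without_sync = crash_and_replay(False)
with_sync = crash_and_replay(True)
print(without_sync.objects["foo_2"], with_sync.objects["foo_2"])
```

The cost shows up in `with_sync.syncs`: every transaction containing a clone pays for one full sync.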

#17

Updated by Sage Weil over 12 years ago

Status changed from New to 7
Assignee set to Sage Weil

#18

Updated by Sage Weil over 12 years ago

Status changed from 7 to Resolved

Commit: dae6c956543276e103a272eb1e897db17b840348
