Bug #2174

rbd: iozone thrashing failure

Added by Sage Weil about 12 years ago. Updated almost 12 years ago.

Status: Can't reproduce
Priority: High
Assignee: -
Category: rbd
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

consistently failing

- ceph:
    log-whitelist:
    - wrongly marked me down or wrong addr
    - objects unfound and apparently lost
- thrashosds: null
- rbd:
    all:
      image_size: 20480
- workunit:
    all:
    - suites/iozone.sh

with

SSHException: SSH session not active

need to check console...

ubuntu@teuthology:/a/nightly_coverage_2012-03-14-a/1208

plana54.crash.jpg (108 KB) Alex Elder, 03/29/2012 07:59 AM

History

#1 Updated by Alex Elder about 12 years ago

I tried reproducing the problem, and although I'm not sure I would
know how to recognize it, my test did end in failure.

While the test was ending (I think cleaning up, after I exited
the "interactive" directive at the end of my yaml file) I logged
into the three systems I had. I ran "dmesg" on each, and while
the osd and mon systems were "clean", the client showed this:

[ 50.094605] eth0: no IPv6 routers present
[ 263.339473] libceph: loaded (mon/osd proto 15/24, osdmap 5/6 5/6)
[ 263.364083] rbd: loaded rbd (rados block device)
[ 264.834765] libceph: client0 fsid 18b3462e-4bfc-4397-9587-7696c6f120d8
[ 265.117136] libceph: mon1 10.214.131.7:6789 session established
[ 265.117432] INFO: trying to register non-static key.
[ 265.149933] the code is fine but needs lockdep annotation.
[ 265.182486] turning off the locking correctness validator.
[ 265.182489] Pid: 1435, comm: rbd Not tainted 3.2.0-ceph-00182-g147ad9e #1
[ 265.182492] Call Trace:
[ 265.182500] [<ffffffff8109cebf>] register_lock_class+0x24f/0x4b0
[ 265.182505] [<ffffffff8109f5b9>] __lock_acquire+0x99/0x15d0
[ 265.182511] [<ffffffff81123b28>] ? __alloc_pages_nodemask+0x128/0x7d0
[ 265.182515] [<ffffffff8109e94f>] ? check_irq_usage+0x9f/0xf0
[ 265.182521] [<ffffffffa016edfe>] ? rbd_do_request+0xae/0x3a0 [rbd]
[ 265.182524] [<ffffffff810a10e2>] lock_acquire+0xa2/0x120
[ 265.182529] [<ffffffffa016edfe>] ? rbd_do_request+0xae/0x3a0 [rbd]
[ 265.182532] [<ffffffff8109ffac>] ? __lock_acquire+0xa8c/0x15d0
[ 265.182537] [<ffffffff816048a9>] down_read+0x39/0x50
[ 265.182541] [<ffffffffa016edfe>] ? rbd_do_request+0xae/0x3a0 [rbd]
[ 265.182546] [<ffffffffa016edfe>] rbd_do_request+0xae/0x3a0 [rbd]
[ 265.182551] [<ffffffffa016f901>] T.966+0xe1/0x1b0 [rbd]
[ 265.182556] [<ffffffffa016fbc1>] rbd_read_header+0x91/0x320 [rbd]
[ 265.182561] [<ffffffffa01710c0>] rbd_add+0x6c0/0xc28 [rbd]
[ 265.182566] [<ffffffff813e0447>] bus_attr_store+0x27/0x30
[ 265.182573] [<ffffffff811eb5d6>] sysfs_write_file+0xe6/0x170
[ 265.182579] [<ffffffff811763f8>] vfs_write+0xc8/0x190
[ 265.182582] [<ffffffff811765b1>] sys_write+0x51/0x90
[ 265.182587] [<ffffffff8160e182>] system_call_fastpath+0x16/0x1b
[ 265.186440] rbd1: unknown partition table
[ 265.760748] rbd: rbd1: added with size 0x500000000
[ 301.976182] kjournald starting. Commit interval 5 seconds
[ 302.065620] EXT3-fs (rbd1): using internal journal
[ 302.095064] EXT3-fs (rbd1): mounted filesystem with ordered data mode
ubuntu@plana34:~$

So it looks like there was a lock being acquired, and that the
lock was not properly initialized.

This was using the path where a sync read was being called to read
the header. I will review the logic in rbd_do_request() and its
ancestors tomorrow morning; I may have broken something when I
rearranged rbd_read_header().

#3 Updated by Alex Elder about 12 years ago

I think I can explain this:

  [ 265.117432] INFO: trying to register non-static key.
  [ 265.149933] the code is fine but needs lockdep annotation.
  [ 265.182486] turning off the locking correctness validator.

And although I can't be 100% certain that it also explains the RCU
error that got logged, I lay out a scenario below that I think
could explain it.

The lock in question is the rw_semaphore at rbd_dev->header.snap_rwsem.

In the kernel client RBD code, the rbd_dev and therefore its header
sub-structure are dynamically allocated in rbd_add(), and are
initially zero-filled.  The header structure is otherwise not
initialized in this function.  It is not until rbd_init_disk() is
called, which immediately calls rbd_read_header(), that the header
gets operated on.

rbd_read_header() reads in the on-disk header data, then calls
rbd_header_from_disk() to translate that on-disk data into the
in-core format in the header structure.  It is not until here, in
rbd_header_from_disk(), that the header.snap_rwsem field gets
initialized.

The problem is that we have this call chain, *before* we call
rbd_header_from_disk():
  rbd_req_sync_read()
    rbd_req_sync_op()
      rbd_do_request()
        down_read(&header->snap_rwsem);

In other words, reading in the header block so that the header
structure can be initialized requires the semaphore, which therefore
needs to have already been initialized.

So that's why the warning message from lockdep showed up.
Now, could this cause the hang?

The semaphore ought to work properly when it is initialized to
zero.  So the first time down_read() is used it should properly
acquire the lock and block out writers.

However, once the rbd_req_sync_read() to get the on-disk
header data completes, we call rbd_header_from_disk(), which
will initialize the snap_rwsem field.

The problem arises if there's a concurrent caller of
rbd_read_header().  If so I think the following scenario
can occur, and could lead to the hang (reported as an
RCU waiting too long) that we observe:

Time        Process 1                       Process 2
----        ---------                       ---------
 T1     rbd_read_header()
 T2       rbd_req_sync_read()
 T3                                     rbd_read_header()
 T4                                       rbd_req_sync_read()
 T5         down_read(snap_rwsem)
 T6         ... (osd request) ...
 T7         up_read(snap_rwsem)
 T8             ...                         down_read(snap_rwsem)
 T9       rbd_header_from_disk()            ... (osd request) ...
T10         init_rwsem(snap_rwsem)
                        --> WHOOPS <--
T11                                         up_read(snap_rwsem)
T12                                         ...
T13                                       rbd_header_from_disk()
T14                                         init_rwsem(snap_rwsem)

The RW semaphore will block readers if there are any writers, will
keep a count of the readers it has allowed in, and will block writers
until no other writer or readers hold the lock.  The implementation is
roughly that a positive count is the number of readers holding the
lock, while a negative count means there is a writer, and 0 means no
lock is held.

So let's assume the semaphore is initially unlocked, so it has a
count of 0.  At time T5, Process 1 acquires the semaphore and the
reader count becomes 1, then at time T7 the count drops back down to
0 again.  At time T8, Process 2 acquires the semaphore so the reader
count becomes 1.  At time T10, Process 1 initializes the semaphore,
resetting its reader count to 0--which is a problem.  At time T11,
Process 2 releases its read lock, decrementing the count--to -1.
Thus, the lock looks like it's held by a writer, but there is no
such writer.
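
The same arithmetic can be traced with a tiny stand-alone model of the
count (this is only an illustration of the sequence above, not the
kernel's actual rw_semaphore code; the model_* helper names are made up
for the example):

/*
 * Stand-alone model of the count arithmetic traced above.
 * Positive count: readers holding the lock; negative: a writer; 0: unlocked.
 */
#include <stdio.h>

static int count;                          /* models the semaphore's count field */

static void model_init_rwsem(void) { count = 0; }
static void model_down_read(void)  { count++; }
static void model_up_read(void)    { count--; }

int main(void)
{
        model_init_rwsem();   /*      starts out unlocked:                count == 0  */
        model_down_read();    /* T5:  Process 1 takes a read lock:        count == 1  */
        model_up_read();      /* T7:  Process 1 drops it:                 count == 0  */
        model_down_read();    /* T8:  Process 2 takes a read lock:        count == 1  */
        model_init_rwsem();   /* T10: Process 1 re-initializes the lock:  count == 0  */
        model_up_read();      /* T11: Process 2 drops its read lock:      count == -1 */

        /* -1 looks like a writer holds the lock, but no such writer exists. */
        printf("final count = %d\n", count);
        return 0;
}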

I think this scenario can explain the hang, and even if it can't, it
needs to be fixed.  The fix is to move the semaphore initialization
earlier, more appropriately to when the basic header structure gets
allocated rather than when the on-disk header data that was read in
gets translated.
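
For what it's worth, here is a minimal sketch of what that fix looks
like, assuming the function and field names used in this comment
(rbd_add(), rbd_header_from_disk(), header.snap_rwsem) and simplified
signatures; the actual patch may of course differ in detail:

static ssize_t rbd_add(struct bus_type *bus, const char *buf, size_t count)
{
        struct rbd_device *rbd_dev;

        rbd_dev = kzalloc(sizeof(*rbd_dev), GFP_KERNEL);
        if (!rbd_dev)
                return -ENOMEM;

        /*
         * Set up the semaphore while nothing else can reach the device,
         * so rbd_do_request() never sees an uninitialized lock when it
         * is used to read in the header.
         */
        init_rwsem(&rbd_dev->header.snap_rwsem);

        /* ... the rest of the setup, ending with rbd_init_disk() ... */
        return count;
}

static int rbd_header_from_disk(struct rbd_image_header *header,
                                struct rbd_image_header_ondisk *ondisk)
{
        /*
         * No init_rwsem() here any more: re-initializing a semaphore
         * that a concurrent rbd_read_header() caller may already hold
         * for reading is exactly what corrupts its count.
         */
        /* ... translate the on-disk fields into the in-core header ... */
        return 0;
}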

I am fairly sure my recent changes did not make this problem appear.
I think it existed before, but was not exercised.  One explanation
is that my code changes made the problem more likely, but I think
a more likely one is that the new plana hardware is able to provide
the kind of concurrency that can lead to it, while the older sepia
systems would not.

#4 Updated by Josh Durgin about 12 years ago

That analysis of the race looks correct to me. The first unapplied patch in wip-rbd would have fixed this (9a3e22a0ced9d74f97efb714eaa2d1e6bf132389), although now it would need rebasing.

#5 Updated by Alex Elder about 12 years ago

That's excellent, Josh. I'll use it; it's basically what I was
thinking of doing anyway. Did you anticipate this race, or did
you just recognize that the lock in the header was misplaced?

And I'll put more of a priority on integrating your patches from
the wip-rbd branch... (Actually I'm planning to switch over to
working on RBD again shortly, and that was what I was first
going to do anyway.)

#6 Updated by Josh Durgin about 12 years ago

Thanks Alex. I remember thinking it fixed a race initially, but then going back later and being unable to find the race it fixed, so that's why it wasn't applied earlier.

#7 Updated by Alex Elder about 12 years ago

I have further verified that this problem occurs with the current
for-linus branch: 
    d23a4b3fd6ef70b80411b3  rbd: fix safety of rbd_put_client() 

There was obviously some recent rbd work there, but I am fairly
convinced that Josh's fix will get rid of this problem.  About
to test that...

#8 Updated by Alex Elder about 12 years ago

The test that reproduced the problem has now run once to completion
without hitting it. Therefore it's ready to ship.

I kheeeed!

I'll keep testing for a bit but I'm pretty convinced. I'll give
it another couple of runs and if all ends well I'll plan to commit
Josh's change to the testing branch, and from there to the master
branch.

#9 Updated by Alex Elder about 12 years ago

I've been off on other things, but this problem apparently recurred
even with the latest check-in (Josh's change) in place.

I am going to kick off a run to try to reproduce it again today, after
I get my locked-up machines sorted out.

#10 Updated by Alex Elder almost 12 years ago

I have been trying to reproduce this using the latest testing/master/for-linus
branch (they're the same right now) and have not seen it happen after 20+
iterations of the suites/iozone.sh workunit. I had previously hit an
unrelated problem at the end of the thrashosds task, so I commented that out.

I will keep testing though.

#11 Updated by Alex Elder almost 12 years ago

I don't know whether we've adequately captured the signature or symptoms
of this problem. I believe, though, that it is indicated by a message
such as this one, seen as part of a system crash:

INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 4, t=492980 jiffies)

And this occurs somewhere under a cpu_idle() call.

Based on that assumption I've attached a jpg of the console screen
following a crash of a system that was running the kernel_untar_build.sh
workunit (not suites/iozone.sh); it shows this signature.

This may be the same problem as originally reported here, or it could be
something different. In any case, below is a synopsis of the YAML file
(the original in fact repeated the kernel_untar_build.sh workunit many
times).

roles:
- [mon.a, mon.c, osd.0]
- [mon.b, mds.a, osd.1]
- [client.0]
kernel:
  mon:
    branch: master
  client:
    branch: master
overrides:
  ceph:
    conf:
      osd:
        osd op complaint time: 120
    coverage: true
    fs: btrfs
    log-whitelist:
    - clocks not synchronized
    - old request
tasks:
- ceph:
- rbd:
    all:
- workunit:
    all:
    - kernel_untar_build.sh

#12 Updated by Alex Elder almost 12 years ago

Another 12 iterations of suites/iozone.sh using the current
master branch (c666601a935b94cc0f3310339411b6940de751ba) completed
without error. This was something like 12 hours of continuously running
iozone.sh. I'm going to quit testing this now and wait to see if
our regular automated testing produces it again.

#13 Updated by Alex Elder almost 12 years ago

Status was Verified. Changing it to Need More Info because I can't even
seem to reproduce it at this point. (I suppose I could use Can't Reproduce.)

I just want to mark this as something I don't expect to be actively
paying attention to, but I'm ready to look again should it recur.

#14 Updated by Alex Elder almost 12 years ago

  • Status changed from Need More Info to Can't reproduce

OK, I'll go ahead and state that I can't reproduce this...
