Bug #3370

closed

All nfsd hung trying to lock page(s) on export of kclient ceph

Added by David Zafman over 11 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The bonnie workunit hung on the NFS client, which kept retransmitting an NFS read:

ubuntu 2667 2572 0 Oct18 ? 00:00:00 bash c mkdir - /tmp/cephtest/mnt.1/client.1/tmp && cd -- /tmp/cephtest/mnt.1/cli
ubuntu 2669 2667 0 Oct18 ? 00:00:00 /bin/bash /tmp/cephtest/workunit.client.1/suites/bonnie.sh
ubuntu 2672 2669 0 Oct18 ? 00:01:09 /usr/sbin/bonnie++ -n 100

In the syslog the kernel noticed nfsd not making progress:

INFO: task nfsd:1181 blocked for more than 120 seconds.

All 8 nfsd processes show this stack:
[<ffffffff8112a20e>] sleep_on_page+0xe/0x20
[<ffffffff8112a1f7>] __lock_page+0x67/0x70
[<ffffffff811aaa2f>] __generic_file_splice_read+0x59f/0x5d0
[<ffffffff811aaa9e>] generic_file_splice_read+0x3e/0x80
[<ffffffff811a921b>] do_splice_to+0x7b/0xa0
[<ffffffff811a94d7>] splice_direct_to_actor+0xa7/0x1c0
[<ffffffffa036b762>] nfsd_vfs_read.isra.13+0x112/0x160 [nfsd]
[<ffffffffa036dc98>] nfsd_read_file+0x88/0xb0 [nfsd]
[<ffffffffa037c7a2>] nfsd4_encode_read+0x132/0x1f0 [nfsd]
[<ffffffffa03815dd>] nfsd4_encode_operation+0x5d/0xa0 [nfsd]
[<ffffffffa037851a>] nfsd4_proc_compound+0x25a/0x630 [nfsd]
[<ffffffffa0367b4e>] nfsd_dispatch+0xbe/0x1c0 [nfsd]
[<ffffffffa025ab19>] svc_process+0x489/0x7a0 [sunrpc]
[<ffffffffa036718d>] nfsd+0xbd/0x1a0 [nfsd]
[<ffffffff810791fe>] kthread+0xae/0xc0
[<ffffffff8163f3c4>] kernel_thread_helper+0x4/0x10
[<ffffffffffffffff>] 0xffffffffffffffff

A direct read attempt through the ceph client:

dd if=/tmp/cephtest/mnt.0/client.1/tmp/Bonnie.2672 of=/dev/null

It hung here:
[<ffffffff8112a22e>] sleep_on_page_killable+0xe/0x40
[<ffffffff8112a187>] __lock_page_killable+0x67/0x70
[<ffffffff8112c63e>] generic_file_aio_read+0x48e/0x730
[<ffffffffa03f1d54>] ceph_aio_read+0x654/0x880 [ceph]
[<ffffffff8117b703>] do_sync_read+0xa3/0xe0
[<ffffffff8117c060>] vfs_read+0xb0/0x180
[<ffffffff8117c17a>] sys_read+0x4a/0x90
[<ffffffff8163e1e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

I'm categorizing this as a ceph client issue, although it is likely an interaction with the kernel NFS server.

Actions #1

Updated by David Zafman over 11 years ago

  • Description updated (diff)

I verified that PG_locked was set in the flags field of struct page. I suspected that ceph_readpages() was leaving pages locked, so I ran my test case with that function disabled. That function is not called in a ceph kernel client read, but it is part of the readahead that ends up in the code path the kernel NFS server uses to read files.

My Bonnie run with that function disabled got past the I/O portion of the test without hanging. During some earlier testing I didn't see finish_read() getting called at all. I presume that's where the unlock_page() on I/O completion is supposed to occur.
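
The suspected failure mode can be illustrated with a small userspace model in C. This is only a sketch under stated assumptions, not the ceph kernel code: model_readpages(), model_finish_read(), struct read_req, and count_locked() are hypothetical names, the bit positions are arbitrary, and struct page, trylock_page(), and unlock_page() are simplified stand-ins that merely borrow the kernel's names. It shows why pages locked at submit time stay locked if the completion callback never runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only. Bit positions are arbitrary for this sketch. */
enum { PG_locked = 0, PG_uptodate = 1 };

struct page {
    unsigned long flags;
};

static bool page_locked(const struct page *p)
{
    return (p->flags >> PG_locked) & 1UL;
}

/* Take the page lock if it is free; return false if it is already held. */
static bool trylock_page(struct page *p)
{
    if (page_locked(p))
        return false;
    p->flags |= 1UL << PG_locked;
    return true;
}

/* Release the page lock; unlocking a page that is not locked is a bug. */
static void unlock_page(struct page *p)
{
    assert(page_locked(p));
    p->flags &= ~(1UL << PG_locked);
}

/* Hypothetical in-flight asynchronous read covering a run of pages. */
struct read_req {
    struct page *pages;
    size_t nr_pages;
};

/* Model of the submit side: lock pages in order, stop at the first page
 * that is already locked, and remember how many this request owns. */
static size_t model_readpages(struct read_req *req, struct page *pages, size_t n)
{
    size_t locked = 0;

    while (locked < n && trylock_page(&pages[locked]))
        locked++;
    req->pages = pages;
    req->nr_pages = locked;
    return locked;
}

/* Model of the completion side: mark each page up to date and unlock it.
 * If this never runs, the pages owned by the request stay locked. */
static void model_finish_read(struct read_req *req)
{
    for (size_t i = 0; i < req->nr_pages; i++) {
        req->pages[i].flags |= 1UL << PG_uptodate;
        unlock_page(&req->pages[i]);
    }
    req->nr_pages = 0;
}

/* Count pages on which a later lock attempt would have to wait. */
static size_t count_locked(const struct page *pages, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        if (page_locked(&pages[i]))
            count++;
    return count;
}
```

In this model, calling model_readpages() on four unlocked pages leaves all four locked, and trylock_page() on any of them fails until model_finish_read() runs. If the completion never fires, every later reader waits on the page lock, which is consistent with the nfsd and dd stacks in the description.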

Actions #2

Updated by Sage Weil over 11 years ago

It might be that leaving the pages locked for the duration of the read is the wrong thing. My recollection is vague, but I think we've switched this behavior around a few different times. In 7c272194e66e91830b90f6202e61c69f8590f1eb we switched from a blocking implementation (which sucked for obvious reasons, but left the pages locked for the duration of the read) to an async one, which still left them locked. I suggest checking other file systems to see what their readpages behavior is...

Actions #3

Updated by David Zafman over 11 years ago

  • Status changed from New to Fix Under Review

Actions #4

Updated by David Zafman over 11 years ago

  • Status changed from Fix Under Review to Resolved

commit: 2978257c56935878f8a756c6cb169b569e99bb91

Actions #5

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (26)

Actions #6

Updated by zhou wei over 6 years ago

David Zafman wrote:

commit: 2978257c56935878f8a756c6cb169b569e99bb91

I can't find this commit. Can somebody give me a reference?
