Bug #2573 (closed): libceph: many "socket closed" messages

Added by Alex Elder almost 12 years ago. Updated over 11 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Development

Description

While trying to reproduce a null-pointer problem in the client
messenger code, I was running xfstests #049 over RBD devices.
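(For context, a rough sketch of what such a run looks like; the image
names, sizes, device paths, and mount points below are hypothetical and
the actual test setup may have differed:)

    # Create and map a couple of RBD images for xfstests
    rbd create test0 --size 10240
    rbd create test1 --size 10240
    rbd map test0            # e.g. appears as /dev/rbd0
    rbd map test1            # e.g. appears as /dev/rbd1
    mkfs.xfs /dev/rbd0
    mkdir -p /mnt/test /mnt/scratch
    mount /dev/rbd0 /mnt/test

    # Point xfstests at the mapped devices and run one pass of test 049
    export TEST_DEV=/dev/rbd0 TEST_DIR=/mnt/test
    export SCRATCH_DEV=/dev/rbd1 SCRATCH_MNT=/mnt/scratch
    cd xfstests && ./check 049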

In the process, I'd get easily 10-15 messages like this on the
console of the RBD client for each iteration of test 049:

[  884.475733] libceph: osd1 10.214.131.33:6800 socket closed

I think this frequent socket closing was in fact what led to
the conditions under which the other problems I was chasing
could occur.

But stepping back from that, I don't believe there's any reason
those sockets should be closing, certainly not at that frequency.
These channels use TCP connections, so there should be no data
loss. I think the point of these ceph connections is to have
the connection survive loss of the TCP connection, but that
by no means suggests we should be suffering TCP connection losses
frequently. Only some sort of hardware event should lead to that,
it seems to me.

I believe these messages indicate that the other end of the socket
connection is closing the socket. And it's possible that the
server side of these connections is doing this for a reason. If
that's the case I question a bit of that design. And if it's not
the case then we should find out what's going on.

Actions #1

Updated by Greg Farnum almost 12 years ago

The sockets have a default timeout of 15 minutes, after which they will close. The idea is that a socket that is actually in use won't close, but an idle one gets shut down after a reasonable period of time to prevent unbounded resource buildup.

If the test doesn't take that long, then there probably is a problem that we want to fix.
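(If I recall correctly, the 15-minute figure corresponds to the messenger's
"ms tcp read timeout" option, which defaults to 900 seconds. Assuming that
is the option in play here, it can be adjusted in ceph.conf, e.g. to make
the idle-close behaviour easier to observe:)

    # ceph.conf sketch -- assumes the timeout described above is the
    # "ms tcp read timeout" messenger option (default 900 seconds);
    # restart the daemons for the change to take effect.
    [global]
        ms tcp read timeout = 60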

Actions #2

Updated by Alex Elder almost 12 years ago

The test takes on the order of a minute to complete one pass
of test 049. During that time I typically see 10-20 socket
closure messages.

Actions #3

Updated by Greg Farnum almost 12 years ago

In that case, if you want to run this with the osd messenger debug at 5 and can gather logs next time, I'll be happy to take a look and see why it's closing connections so often.
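(For anyone hitting this later, a rough sketch of how to raise the
messenger debug level on the OSDs; the runtime syntax shown is the
argonaut-era form and may differ slightly in other releases:)

    # Raise messenger debugging to 5 on all OSDs at runtime
    # (or give a single OSD id instead of \*)
    ceph osd tell \* injectargs '--debug-ms 5'

    # Or persistently, in the [osd] section of ceph.conf (restart needed),
    # then collect the resulting ceph-osd logs:
    #   debug ms = 5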

Actions #4

Updated by Sage Weil over 11 years ago

  • Status changed from New to Resolved

I was seeing this too, but with the latest code and all (knock wood) the races closed I'm not anymore. Going to optimistically call it resolved.

Actions #5

Updated by Oliver Francke over 11 years ago

Hi Alex,

while chasing some annoying log messages I became aware of this ticket... we are currently running:
Linux fcmsnode1 3.6.0 #1 SMP Tue Oct 2 09:54:37 CEST 2012 x86_64 GNU/Linux
+
ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)

ceph-osd-13.log shows repeatedly:

--- 8-< ---
2012-10-11 10:51:25.878280 7f25c8c83700 0 -- 10.10.10.22:6812/1604 >> 10.10.10.22:0/715655332 pipe(0x6a21800 sd=76 pgs=0 cs=0 l=0).accept peer addr is really 10.10.10.22:0/715655332 (socket is 10.10.10.22:48610/0)
2012-10-11 10:51:25.879084 7f25d08dc700 0 osd.13 1353 pg[6.5( v 1353'2567562 (1353'2566561,1353'2567562] n=1857 ec=390 les/c 1347/1349 1340/1347/1333) [13,33] r=0 lpr=1347 mlcod 1353'2567561 active+clean] watch: ctx->obc=0x6381000 cookie=1 oi.version=2301953 ctx->at_version=1353'2567563
2012-10-11 10:51:25.879133 7f25d08dc700 0 osd.13 1353 pg[6.5( v 1353'2567562 (1353'2566561,1353'2567562] n=1857 ec=390 les/c 1347/1349 1340/1347/1333) [13,33] r=0 lpr=1347 mlcod 1353'2567561 active+clean] watch: oi.user_version=2301951


It would be great if you could come back to this, even if it means raising the debug level in a production environment.

Thanks in advance,

Oliver.
