Bug #1744

teuthology: race with daemon shutdown?

Added by Sage Weil over 12 years ago. Updated about 12 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
teuthology
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description


INFO:teuthology.task.cfuse:Unmounting ceph-fuse clients...
INFO:teuthology.task.cfuse.cfuse.0.err:ceph-fuse[24778]: fuse finished with error 0
INFO:teuthology.task.ceph:Shutting down mds daemons...
INFO:teuthology.task.ceph.mds.0:mds.0: Stopped
INFO:teuthology.task.ceph:Shutting down osd daemons...
INFO:teuthology.task.ceph:Shutting down mon daemons...
INFO:teuthology.task.ceph.mon.1:mon.1: Stopped
INFO:teuthology.task.ceph.mon.0:mon.0: Stopped
INFO:teuthology.task.ceph.mon.2:mon.2: Stopped
INFO:teuthology.task.ceph:Checking cluster ceph.log for badness...
INFO:teuthology.task.ceph:Cleaning ceph cluster...
INFO:teuthology.orchestra.run.err:rm: cannot remove `/tmp/cephtest/data/osd.0.data/current/meta': Directory not empty
INFO:teuthology.task.ceph:Removing ceph binaries...
INFO:teuthology.task.ceph:Removing shipped files: daemon-helper enable-coredump...
ERROR:teuthology.run_tasks:Manager failed: <contextlib.GeneratorContextManager object at 0x2fa96d0>
Traceback (most recent call last):
  File "/home/qa/teuthology/teuthology/run_tasks.py", line 45, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.6/contextlib.py", line 23, in __exit__
    self.gen.next()
  File "/home/qa/teuthology/teuthology/task/ceph.py", line 1066, in task
    yield
  File "/usr/lib/python2.6/contextlib.py", line 23, in __exit__
    self.gen.next()
  File "/home/qa/teuthology/teuthology/contextutil.py", line 35, in nested
    if exit(*exc):
  File "/usr/lib/python2.6/contextlib.py", line 34, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/qa/teuthology/teuthology/task/ceph.py", line 718, in cluster
    wait=False,
  File "/home/qa/teuthology/teuthology/orchestra/run.py", line 272, in wait
    proc.exitstatus.get()
  File "/home/qa/teuthology/virtualenv/lib/python2.6/site-packages/gevent/event.py", line 223, in get
    raise self._exception
CommandFailedError: Command failed with status 1: 'rm -rf -- /tmp/cephtest/ceph.conf /tmp/cephtest/ceph.keyring /tmp/cephtest/data /tmp/cephtest/monmap /tmp/cephtest/asok.*'
INFO:teuthology.orchestra.run.out:kernel.core_pattern = core
INFO:teuthology.orchestra.run.out:kernel.core_pattern = core
INFO:teuthology.orchestra.run.out:kernel.core_pattern = core
WARNING:teuthology.task.internal:Found coredumps on ubuntu@sepia69.ceph.dreamhost.com, flagging run as failed
INFO:teuthology.task.internal:Removing archive directory...
INFO:teuthology.task.internal:Tidying up after the test...
INFO:teuthology.orchestra.run.err:rmdir: failed to remove `/tmp/cephtest': Directory not empty
ERROR:teuthology.run_tasks:Manager failed: <contextlib.GeneratorContextManager object at 0x2f08bd0>
Traceback (most recent call last):
  File "/home/qa/teuthology/teuthology/run_tasks.py", line 45, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.6/contextlib.py", line 23, in __exit__
    self.gen.next()
  File "/home/qa/teuthology/teuthology/task/internal.py", line 43, in base
    wait=False,
  File "/home/qa/teuthology/teuthology/orchestra/run.py", line 272, in wait
    proc.exitstatus.get()
  File "/home/qa/teuthology/virtualenv/lib/python2.6/site-packages/gevent/event.py", line 207, in get
    raise self._exception
CommandFailedError: Command failed with status 1: 'rmdir -- /tmp/cephtest'

Note that the osds should already be stopped, but the rm -rf failed because a racing process was creating files while it ran. /tmp/cephtest is left with:

ubuntu@sepia68:/tmp/cephtest$ find . -ls
4194311    4 drwxr-xr-x   3 ubuntu   ubuntu       4096 Nov 20 19:13 .
4194522    4 drwxr-xr-x   3 ubuntu   ubuntu       4096 Nov 20 19:13 ./data
4194530    4 drwxr-xr-x   3 ubuntu   ubuntu       4096 Nov 20 19:13 ./data/osd.0.data
4194542    4 drwxr-xr-x   3 ubuntu   ubuntu       4096 Nov 20 19:13 ./data/osd.0.data/current
4194548    8 drwxr-xr-x   2 ubuntu   ubuntu       4096 Nov 20 19:13 ./data/osd.0.data/current/meta
4194528    4 -rw-r--r--   1 ubuntu   ubuntu          8 Nov 20 19:13 ./data/osd.0.data/current/meta/pginfo\\u0.2__0_DAE19372
4194527    4 -rw-r--r--   1 ubuntu   ubuntu          8 Nov 20 19:13 ./data/osd.0.data/current/meta/pginfo\\u0.6__0_DAE19632
4194532    4 -rw-r--r--   1 ubuntu   ubuntu          8 Nov 20 19:13 ./data/osd.0.data/current/meta/pginfo\\u0.0__0_DAE19212
4194526    8 -rw-r--r--   1 ubuntu   ubuntu      61668 Nov 20 19:13 ./data/osd.0.data/current/meta/pglog\\u0.4__0_103B1A2E
4194524    4 -rw-r--r--   1 ubuntu   ubuntu      60696 Nov 20 19:13 ./data/osd.0.data/current/meta/pglog\\u0.1__0_103B043E
4194531    4 -rw-r--r--   1 ubuntu   ubuntu      68904 Nov 20 19:13 ./data/osd.0.data/current/meta/pglog\\u0.0__0_103B076E
4194523    4 -rw-r--r--   1 ubuntu   ubuntu          8 Nov 20 19:13 ./data/osd.0.data/current/meta/pginfo\\u0.1__0_DAE193A2
4194536    8 -rw-r--r--   1 ubuntu   ubuntu      74844 Nov 20 19:13 ./data/osd.0.data/current/meta/pglog\\u0.7__0_103B185E
4194533    4 -rw-r--r--   1 ubuntu   ubuntu          8 Nov 20 19:13 ./data/osd.0.data/current/meta/pginfo\\u0.5__0_DAE19162
4194534    4 -rw-r--r--   1 ubuntu   ubuntu          8 Nov 20 19:13 ./data/osd.0.data/current/meta/pginfo\\u0.3__0_DAE19002
4194537    4 -rw-r--r--   1 ubuntu   ubuntu          8 Nov 20 19:13 ./data/osd.0.data/current/meta/pginfo\\u0.7__0_DAE197C2
4194535    8 -rw-r--r--   1 ubuntu   ubuntu      74628 Nov 20 19:13 ./data/osd.0.data/current/meta/pglog\\u0.2__0_103B05CE
4194549    4 -rw-r--r--   1 ubuntu   ubuntu      59184 Nov 20 19:13 ./data/osd.0.data/current/meta/pglog\\u0.6__0_103B188E
4194541    4 -rw-r--r--   1 ubuntu   ubuntu          8 Nov 20 19:13 ./data/osd.0.data/current/meta/pginfo\\u0.4__0_DAE191D2
4194551    4 -rw-r--r--   1 ubuntu   ubuntu      61020 Nov 20 19:13 ./data/osd.0.data/current/meta/pglog\\u0.3__0_103B1A9E
4194553    4 -rw-r--r--   1 ubuntu   ubuntu      65016 Nov 20 19:13 ./data/osd.0.data/current/meta/pglog\\u0.5__0_103B1BFE

Is this because 'killall ceph-osd' was run but the process didn't actually stop immediately? Or some other reason? (Why don't we see a

INFO:teuthology.task.ceph.osd.0:osd.0: Stopped

line?)
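
For reference, a minimal sketch (not teuthology's actual code; the helper name and usage are hypothetical) of the kind of check that would avoid the race: poll until the daemon process has really exited before removing its data directory, so it cannot recreate files underneath the rm.

import errno
import os
import time

def wait_for_exit(pid, timeout=30, interval=0.5):
    """Poll until the process with the given pid has exited.

    Returns True if it exited within the timeout, False otherwise.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            os.kill(pid, 0)          # signal 0 only checks for existence
        except OSError as e:
            if e.errno == errno.ESRCH:
                return True          # no such process: it has exited
            raise
        time.sleep(interval)
    return False

# Hypothetical usage before cleanup: only remove the osd data dir once
# ceph-osd is really gone, so it cannot recreate current/meta/pglog* files
# underneath the rm -rf.
# if not wait_for_exit(osd_pid):
#     raise RuntimeError('osd.0 did not stop; leaving data dir in place')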


Related issues

Related to Ceph - Bug #1997: teuthology: wait for clean osd shutdown before umount (Resolved, 01/27/2012)

Associated revisions

Revision 1364b882 (diff)
Added by Josh Durgin about 12 years ago

ceph: delay raising exceptions until all daemons are stopped

If a daemon crashes, the exception is raised when we stop it. This
caused some daemons to continue running during cleanup, since the rest
of the daemons of the same type would not be shut down. Also log each
daemon that crashed, for easier debugging.

Fixes: #1744
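
Roughly, as a hedged sketch rather than the actual diff, the fix amounts to collecting the exception from each crashed daemon while still stopping the rest, and only re-raising afterwards (the daemon objects with .name and .stop() used here are assumptions, not teuthology's real API):

import logging

log = logging.getLogger(__name__)

def stop_all(daemons):
    # 'daemons' is assumed to be a list of objects with .name and .stop();
    # teuthology's real ctx.daemons structure differs in detail.
    saved = None
    for daemon in daemons:
        try:
            daemon.stop()
        except Exception as e:
            # Log each daemon that crashed, but keep going so the rest of
            # the daemons of the same type are still shut down.
            log.exception('Saw exception while stopping %s', daemon.name)
            if saved is None:
                saved = e
    if saved is not None:
        # Re-raise the first failure only after everything is stopped.
        raise saved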

History

#1 Updated by Sage Weil over 12 years ago

  • Position set to 11

#2 Updated by Anonymous over 12 years ago

Was this using any one of the following?

teuthology/task/lost_unfound.py
teuthology/task/mon_recovery.py
teuthology/task/thrashosds.py

Can I see the full config and log?

#3 Updated by Sage Weil over 12 years ago

Tommi Virtanen wrote:

Was this using any one of the following?

teuthology/task/lost_unfound.py
teuthology/task/mon_recovery.py
teuthology/task/thrashosds.py

Can I see the full config and log?

Sadly no, it was only in my terminal. I'm 90% sure it wasn't any of the above, though! yaml below:

interactive-on-error: true
roles:
- - mon.0
  - mds.0
  - osd.0
- - mon.1
  - osd.1
- - mon.2
  - client.0
tasks:
- ceph:
    conf:
      client.0:
        debug ms: 1
        debug client: 20
- cfuse: null
- workunit:
    all:
    - suites/dbench.sh

targets:
  ubuntu@sepia68.ceph.dreamhost.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC4tEUrS50JgpGleCROSvvJmvFwbQScDxZwo5qqY9G/8+Gw92v7PMeBjWXA7Q+LTIq+pnzNKQxUeYTSpxYHDMAr1ZqaC9rfTvDMS6DQgMn+O0O23+mXDok8a3NW/S4id2bZQ3lORol4xRzkLqF7HF/YYzNlRv/RxwZifQE9iiIEt0vio17Y5Nw6c1wJBWbTjhI+dM4tHfIG9dDdOL/HmTveF1TqIRIiIJM+rtYeqj1P7eeyEU/0kwhwFgUU1JUCnv2Szf8ZtF4MWFewVl5G1sbowDMXk0NJfuOf4saFJ9gF2SdvUxXbuMUgQ1Edjgtek2HCOnBWTTw3t5+zn51S08wt
  ubuntu@sepia69.ceph.dreamhost.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC9bW2x5nGRrB0hNpG2Cd5/94b3uheKlfjax0aF3uHFm93wLmpB6U2EcrAEATMAjV0lauR6CPppitoxgnUZX+8KyJ8yDu7TOYtUZvAyxDO09245FuxFRytGI9tyFOgfvYN7fSg2URpkLT1wWMVQ6NgzAV+oS/jsrKgAW66wrmLv61MMghKN6u7hLQnTF6iSod7RMMyvurCdxKu1gU37GEdp8wJH6y8f173MkfDOEVoojafUPycRB2X3dFn63raGYkhkXofwZB1CSEJEVU3Rnz/b+Hkc374Ij+vPd97Q7fIRaI4YE0ovEDDcG0WU++9VN7vI/wW0pc7DAiPun9SVpqM1
  ubuntu@sepia80.ceph.dreamhost.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCts9+jqfNcV/Be7XXh87aCtjWMf2tzSv4FFH1QuQ9QJnpwTQ9Nwk5dBnZKYGOPO0WYlY//3Be/IoCnCubH7+rCjsfluZ0o+wJVdKLazJG3KeiLP1LgJPB8J+emv4cCshTJbuZoFgeNzADdwF8C239ImnPzJqQdh2Av7aomnv8TcZFYwxjyMSBI/F4VQX4Ml8cWq2U1jnk6ft7cs1lZSe+74/glVjWpxBDcgMDFNkGUlLyPdWXxYUMdWHkayTvu8j4/AV8HQaFkbmLks/FTCjpYawCuETflkGl/WbTl00hARGHrPPJpDHmHIf6rEuwMMZ2Eh2Ggui3s9jhmRqlAyGVH

#4 Updated by Anonymous over 12 years ago

Josh saw something similar; it seems the ctx.daemons data structure loses entries, or they never get added, or something. So far, reading the source & logs has just led to paradoxes.

#5 Updated by Sage Weil over 12 years ago

  • Target version changed from v0.39 to v0.40

#6 Updated by Sage Weil over 12 years ago

Is there any additional teuthology logging we can add to sort out what is happening?
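
One possibility, purely as a sketch (the layout of ctx.daemons assumed here, a role type mapping to a dict of id -> daemon handle, is a guess rather than teuthology's actual structure): dump what the daemon registry thinks is running right before shutdown, so a missing osd.0 entry would show up in the log.

import logging

log = logging.getLogger(__name__)

def log_registered_daemons(ctx):
    # Assumes ctx.daemons maps a role type (e.g. 'osd') to a dict of
    # id -> daemon handle; adjust to the real structure as needed.
    for role, procs in sorted(ctx.daemons.items()):
        log.info('registered %s daemons before shutdown: %s',
                 role, sorted(procs.keys()))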

#7 Updated by Sage Weil about 12 years ago

  • Target version deleted (v0.40)
  • Position deleted (70)
  • Position set to 98

#8 Updated by Josh Durgin about 12 years ago

  • Status changed from New to Resolved
  • Assignee set to Josh Durgin

This turned out to be exceptions from crashed daemons that weren't raised until we tried to stop them, which left the other daemons of the same type running during cleanup. Fixed by 1364b8826f3f654fbbdddc8ba53dcec8ee038cf8.
