Bug #2022

osd: misdirected request

Added by Sage Weil about 12 years ago. Updated over 11 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
argonaut
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

from rados_api_tests.yaml:
"[WRN] client.4292 10.3.14.128:0/3016298 misdirected client.4292.0:4 0.0 to osd.1 not [0,1] in e261/261" in cluster log

  roles:
  - - mon.a
    - mon.c
    - osd.0
  - - mon.b
    - mds.a
    - osd.1
  - - client.0
  tasks:
  - chef: null
  - ceph: null
  - cfuse: null
  - workunit:
      client.0:
      - rados/test.sh

/a/nightly_coverage_2012-02-03-b/10553

Associated revisions

Revision ff67210e (diff)
Added by Sage Weil almost 12 years ago

objecter: do not feed session to op_submit()

The linger_send() method was doing this, but it is problematic because the
new Op doesn't get its pgid or acting vector set correctly. The result is
that the request goes to the right OSD, but has the wrong pgid, and makes
the OSD complain about misdirected requests and drop it on the floor. It
didn't affect the test results because we weren't testing whether the
watch was working in that case.

Instead, we'll just recalculate and get the same value the parent linger
op did. Which is fine, and goes through all the usual code paths so
nothing is missed.

Also, increment num_homeless_ops before we recalc_op_target(), so that we
don't (harmlessly, but confusingly) underflow.

Fixes: #2022
Signed-off-by: Sage Weil <>

Revision ef6beec9 (diff)
Added by Sage Weil almost 12 years ago

objecter: do not feed session to op_submit()

The linger_send() method was doing this, but it is problematic because the
new Op doesn't get its pgid or acting vector set correctly. The result is
that the request goes to the right OSD, but has the wrong pgid, and makes
the OSD complain about misdirected requests and drop it on the floor. It
didn't affect the test results because we weren't testing whether the
watch was working in that case.

Instead, we'll just recalculate and get the same value the parent linger
op did. Which is fine, and goes through all the usual code paths so
nothing is missed.

Also, increment num_homeless_ops before we recalc_op_target(), so that we
don't (harmlessly, but confusingly) underflow.

Fixes: #2022
Signed-off-by: Sage Weil <>
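
The two commits above (master plus the backport) describe the same change. As a rough illustration only -- hypothetical names, not the actual Objecter code -- the sketch below shows the ordering problem they describe: if a session/OSD is handed to op_submit() up front, target recalculation is skipped and the new Op keeps its default pgid, so it reaches the right OSD but carries pg 0.0.

// Minimal sketch with hypothetical names (not the actual Objecter code).
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

struct Op {
  std::string oid;
  unsigned pgid = 0;            // stays at the "0.0"-like default unless recalculated
  std::vector<int> acting;      // empty until recalculated
  int target_osd = -1;
};

// Stand-ins for the osdmap/CRUSH lookup.
static unsigned object_to_pg(const std::string &oid) {
  return std::hash<std::string>{}(oid) % 512;
}
static std::vector<int> pg_to_acting(unsigned pg) {
  return { int(pg % 6), int((pg + 1) % 6) };
}

static int recalc_op_target(Op &op) {
  op.pgid = object_to_pg(op.oid);
  op.acting = pg_to_acting(op.pgid);
  return op.acting.front();                // acting primary
}

// presumed_osd >= 0 models "feeding a session to op_submit()".
static void op_submit(Op &op, int presumed_osd = -1) {
  if (presumed_osd >= 0)
    op.target_osd = presumed_osd;          // buggy path: pgid/acting never set
  else
    op.target_osd = recalc_op_target(op);  // fixed path: recompute, same OSD
  std::printf("send %s to osd.%d as pg %u\n", op.oid.c_str(), op.target_osd, op.pgid);
}

int main() {
  Op parent;  parent.oid = "foo";
  int primary = recalc_op_target(parent);  // the parent linger op knows the right target

  Op resend;  resend.oid = "foo";
  op_submit(resend, primary);              // before the fix: right OSD, wrong pg (0)
  op_submit(resend);                       // after the fix: right OSD, right pg
}

Built with any C++14 compiler, the first op_submit() call prints pg 0 while the second prints the recalculated pg, mirroring the 0.0 requests visible in the client log in comment #14 below.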

Revision 38962abd (diff)
Added by Sage Weil over 11 years ago

osd: base misdirected op role calc on acting set

We want to look at the acting set here, nothing else. This was causing us
to erroneously queue ops for later (wasting memory) and to erroneously
print out a 'misdirected op' message in the cluster log (confusion and
incorrect [but ignored] -ENXIO reply).

Fixes: #2022
Signed-off-by: Sage Weil <>

Revision b7814dbe (diff)
Added by Sage Weil over 11 years ago

osd: base misdirected op role calc on acting set

We want to look at the acting set here, nothing else. This was causing us
to erroneously queue ops for later (wasting memory) and to erroneously
print out a 'misdirected op' message in the cluster log (confusion and
incorrect [but ignored] -ENXIO reply).

Fixes: #2022
Signed-off-by: Sage Weil <>

Revision 41a57077 (diff)
Added by Sage Weil over 11 years ago

osd: base misdirected op role calc on acting set

We want to look at the acting set here, nothing else. This was causing us
to erroneously queue ops for later (wasting memory) and to erroneously
print out a 'misdirected op' message in the cluster log (confusion and
incorrect [but ignored] -ENXIO reply).

Fixes: #2022
Signed-off-by: Sage Weil <>
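
The three commits above are the same change on master and the stable branches. A simplified, self-contained sketch (hypothetical names, not the actual OSD code) of the rule they state -- an OSD's role is its index in the acting set, and a client op counts as misdirected only if this OSD is not the acting primary:

// Simplified sketch (hypothetical names, not the actual OSD code).
#include <algorithm>
#include <cassert>
#include <vector>

// Role of an OSD for a PG: 0 for the acting primary, >0 for a replica,
// -1 if this OSD is not in the acting set at all.
static int get_role(const std::vector<int> &acting, int whoami) {
  auto it = std::find(acting.begin(), acting.end(), whoami);
  return it == acting.end() ? -1 : int(it - acting.begin());
}

// Client ops must be handled by the acting primary; anything else gets the
// "misdirected" warning. Only the acting set matters here, not the up set.
static bool op_is_misdirected(const std::vector<int> &acting, int whoami) {
  return get_role(acting, whoami) != 0;
}

int main() {
  std::vector<int> acting{0, 1};
  assert(op_is_misdirected(acting, 1));    // matches "misdirected ... to osd.1 not [0,1]"
  assert(!op_is_misdirected(acting, 0));   // osd.0 is the acting primary
}

With acting [0,1] as in the failures above, an op arriving at osd.1 gets flagged; basing the check on anything other than the acting set is what produced the spurious warnings.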

History

#1 Updated by Sage Weil about 12 years ago

saw this again on rados_api_tests:


ubuntu@teuthology:/a/nightly_coverage_2012-02-12-b/11713$ cat summary.yaml 
ceph-sha1: bd1a956757d57f6cb693d8660d1ce48018563df1
client.0-kernel-sha1: eda84b58922928516e6e62af85430b7c9705b6cf
description: collection:basic btrfs:with-btrfs.yaml clusters:fixed-3.yaml tasks:rados_api_tests.yaml
duration: 434.6066517829895
failure_reason: '"2012-02-12 13:20:30.674204 osd.1 10.3.14.131:6800/6103 45 : [WRN]
  client.4123 10.3.14.132:0/3007642 misdirected client.4123.0:4 0.0 to osd.1 not [0,1]
  in e258/258" in cluster log'
flavor: gcov
mon.a-kernel-sha1: eda84b58922928516e6e62af85430b7c9705b6cf
mon.b-kernel-sha1: eda84b58922928516e6e62af85430b7c9705b6cf
owner: scheduled_teuthology@teuthology
success: false

#2 Updated by Sage Weil about 12 years ago

weird, saw this twice in a few days (maybe 18 runs apart), but wasn't able to reproduce after several hundred iterations.
will try with all logs off.

#3 Updated by Sage Weil about 12 years ago

hit this again:

ubuntu@teuthology:/a/nightly_coverage_2012-02-22-a/13282$ grep WRN ceph.log 
...
2012-02-22 00:45:47.084012 osd.1 10.3.14.166:6800/25258 55 : [WRN] client.4205 10.3.14.182:0/3010848 misdirected client.4205.0:4 0.0 to osd.1 not [0,1] in e264/264

kernel: &id001
  sha1: cc050a5dab7beb893d72dbd41e8aa5c07f1d0427
nuke-on-error: true
overrides:
  ceph:
    conf:
      osd:
        osd op complaint time: 120
    coverage: true
    fs: btrfs
    log-whitelist:
    - clocks not synchronized
    - old request
    sha1: 761ecc69c24856b15531c92b69b1c73c5cc81bfc
roles:
- - mon.a
  - mon.c
  - osd.0
- - mon.b
  - mds.a
  - osd.1
- - client.0
targets:
  ubuntu@sepia39.ceph.dreamhost.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDjS/F+14GNG9FDE+IEUxOHppkQkrheg5+XeJy06es1rrEgHIrNVI4TTDsprfP1oT3AJm9A+IaQ5nrZ9DK8M4407+pao+MBBLZ8CTcNrFIKGhwljWwTS0Bi0y8WYfmhVdUOUBEXQPeyGFTFCetnHGxJ5U57v340bQeTqrULioFQDQgeungqhBwoCNWbs7/u7waKNHoYPlPqswSWYJ3JRMjHO9EaFkCip6M0jfFgsdRpCfZ4duzOR/V6+H/7o/o+uXuBl2kT/BxbEg031s73yty2OnfjpsX8np6kUwYRKDU9gOy+C8yB1RHccqdeYJGjxBeSrefhOJkpt79s7/pj9QBt
  ubuntu@sepia55.ceph.dreamhost.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDZ0HauoMdA5zgEmINOhrUYdVO1V3PDt4dp/dfm+AwB3iunb6Y/BXifGRDznlrqgIZ8+Sxl7U6lPsMsj0+MOysoWA1GPmpQqwQlDLF3SMOKvRXnGALB2x2spotMgQstoSYQWUAWK+6tC41oZq2R6ht0bljU0jiS8TP3qCZ+I5izt/p31vUjZUAvtCZBQkGr8vLZDrc+/R5dCPkGn/D+IoNA0P+l75WO1nHOpAfuYN7q0xCUN42NBJl13lCN9xZBaQZyDC4OyFp7sCsSEO16SkHpiLgALITUREksvWmS/OAcIk6ayvQNWr8mBCYpiiGgr8Arvdn826UB6PDCZ4Ym9J3H
  ubuntu@sepia56.ceph.dreamhost.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDUTeKEAoW2ZB3Rl0YEdWLBCfL4PuA3QA6KC2BiPsR3UXfI92jFNnZ8Yr4Qz7pJQkI9ELdC1p4T9ro1zHv/fY7U5jtFeT4aJ1daTY2fw5R5dFuMTaBnvCnWczJshoFEaEgvwh8CHXyykPtpl9l329kWOK8VjNEyN9xCork9i7NLkrPXj6Mq44baZWw32IaFWnpcTfhL1NprMeu7pSLmZooRWElIXrT2WD8FJGp8t1FlSpvOwou5pIcW1EzBLNoR8YG8lVAl5r9xKsRI1zlOtmkKeKosasoNNx8PYmSGXR/ToiMlCpxVtqMxtd7bT1VQLqDEu7W/MMHnj2T9M8ZTF759
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- ceph: null
- ceph-fuse: null
- workunit:
    client.0:
    - rados/test.sh

#4 Updated by Sage Weil about 12 years ago

  • Source set to Q/A

ubuntu@teuthology:/a/nightly_coverage_2012-02-26-a/13876$ grep WRN ceph.log
2012-02-26 01:18:03.166529 osd.1 10.3.14.155:6800/6904 40 : [WRN] client.4141 10.3.14.154:0/3028499 misdirected client.4141.0:4 0.0 to osd.1 not [0,1] in e260/260

#5 Updated by Sage Weil about 12 years ago

ubuntu@teuthology:/a/nightly_coverage_2012-02-29-b/14479$ grep WRN ceph.log 
2012-02-29 15:22:10.491773 osd.1 10.3.14.152:6800/14237 48 : [WRN] client.4235 10.3.14.140:0/3022280 misdirected client.4235.0:4 0.0 to osd.1 not [0,1] in e257/257

#6 Updated by Sage Weil about 12 years ago

  • Target version changed from v0.42 to v0.44

#7 Updated by Sage Weil about 12 years ago

  • Status changed from New to 12

Just saw this with a different workload:

kernel: &id001
  sha1: 147ad9e3a993733ed1adb91829dcb40f0431a3b4
nuke-on-error: true
overrides:
  ceph:
    conf:
      osd:
        osd op complaint time: 120
    coverage: true
    fs: xfs
    log-whitelist:
    - clocks not synchronized
    - old request
    sha1: 5912312c14a6214f4318fd7bfb6fd08714458b6f
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  ubuntu@plana31.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC5J4n7rTsH+IMjGAu+EfhukuK5+zScoSaPIfXDOUU8LfvuI/3x8Luiyv9eRVwZgwuLBWZ/zorBbGZ+G2Iaxy3632AG/XE7cRZA9AxzZT+Qvm9D+BW+Uletgf92cttKMk7qwK3DetQwRKKl6AMv0SDpUff+nzqnJH6LMS8zoBPVXDHFM3Lup8h9H6DYEs1F/Zn8LVSw8hNiD279rg1n1hqWdItmnKBPKyC/qkRoPa6h7gDU6FPaBiNhuhBd0016XGrVwL7Y8gqoDBiArP+NDt1lcnbeiK43bFhqW+pYovOdIA2MJC6z+bkZDlOJdxoz9mDP0cJZBdB43v3UdbS1R+WT
  ubuntu@plana56.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDGhdi9iPGn6WABEvQEvL6j7rFt9GYPDzl7MiAG3dwXAawWEL4V4udCe/7qM/ueICoUWj5ROXAuaA4OebocZ4jMgPemUkL4+6vjqiBtW05n8RqRZmyQRPgW5f4P5mpbU2whK6YENZOm1Mh+bKRl4VVqaWbqPUeKelIeFiI8SKzE9/pN3Jo5ZFPa7+5ngOr0wRhHvwYSX1YRYgvJOxmhKPFDr1BPyLgbKkBV+2HgtSDfy74KYZCj7k72ljrzBQ0q7IV4lRn9dHmbsKR5aQ3qcIG8p/MX/Ej1u2NlZQzAWdT6/yhfQ2vDCnL6fDNvEXg770zmnMwXJg+g8Pic4+0ZqfJ7
  ubuntu@plana65.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDh632PE613HcpBSxtjZXzV0ir4wNikuHXgRHDOVBqbBGE31fCV3CY0Ohb8ow/wVnmbvhu9UblBgy+ahQZNHe3XHK87Tb3Mp1zCkW6ypmMCw6pUr2nm4av6saqk/aGitqbp/P8vC1eoNyrJNPcpIN3tkHNe/hLKmEDX7v9A3E73jArXwtZhaPdfpE9FQ04NxhuucRe2Ruuasa2H+8Gv3cp2tEiaj8CNetcXRcM41yakLyyLnsuis3kZlMzsRGoj56rzjn+KyRDq3E5wRROQfmXkAikFEuDhd167Kg6X8Io00zAjwMWmG6Zub2q+dU1r9ij0iVxOABqMmBjytBdB1Mp/
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- ceph:
    log-whitelist:
    - wrongly marked me down or wrong addr
    - objects unfound and apparently lost
- thrashosds: null
- kclient: null
- workunit:
    all:
    - suites/ffsb.sh

2012-03-14 01:03:24.328808 osd.5 10.214.132.13:6806/6585 42203 : [WRN] client.4120 10.214.132.22:0/1249107828 misdirected client.4120.1:364084 0.15 to osd.5 not [3,5] in e252/269
2012-03-14 01:03:24.328870 osd.5 10.214.132.13:6806/6585 42204 : [WRN] client.4120 10.214.132.22:0/1249107828 misdirected client.4120.1:364085 0.15 to osd.5 not [3,5] in e252/269
2012-03-14 01:03:24.328928 osd.5 10.214.132.13:6806/6585 42205 : [WRN] client.4120 10.214.132.22:0/1249107828 misdirected client.4120.1:364086 0.15 to osd.5 not [3,5] in e252/269
2012-03-14 01:03:24.328968 osd.5 10.214.132.13:6806/6585 42206 : [WRN] client.4120 10.214.132.22:0/1249107828 misdirected client.4120.1:364087 0.15 to osd.5 not [3,5] in e252/269
2012-03-14 01:03:24.329015 osd.5 10.214.132.13:6806/6585 42207 : [WRN] client.4120 10.214.132.22:0/1249107828 misdirected client.4120.1:364088 0.15 to osd.5 not [3,5] in e252/269
2012-03-14 01:03:24.329057 osd.5 10.214.132.13:6806/6585 42208 : [WRN] client.4120 10.214.132.22:0/1249107828 misdirected client.4120.1:364089 0.15 to osd.5 not [3,5] in e252/269
2012-03-14 01:03:24.329098 osd.5 10.214.132.13:6806/6585 42209 : [WRN] client.4120 10.214.132.22:0/1249107828 misdirected client.4120.1:364090 0.15 to osd.5 not [3,5] in e252/269
2012-03-14 01:03:24.329256 osd.5 10.214.132.13:6806/6585 42210 : [WRN] client.4120 10.214.132.22:0/1249107828 misdirected client.4120.1:364091 0.15 to osd.5 not [3,5] in e252/269
2012-03-14 01:03:24.329304 osd.5 10.214.132.13:6806/6585 42211 : [WRN] client.4120 10.214.132.22:0/1249107828 misdirected client.4120.1:364092 0.15 to osd.5 not [3,5] in e252/269
2012-03-14 01:03:24.329343 osd.5 10.214.132.13:6806/6585 42212 : [WRN] client.4120 10.214.132.22:0/1249107828 misdirected client.4120.1:364093 0.15 to osd.5 not [3,5] in e252/269
2012-03-14 01:03:24.329385 osd.5 10.214.132.13:6806/6585 42213 : [WRN] client.4120 10.214.132.22:0/1249107828 misdirected client.4120.1:364094 0.15 to osd.5 not [3,5] in e252/269
ubuntu@teuthology:/a/nightly_coverage_2012-03-14-a/1213$ 

#8 Updated by Sage Weil almost 12 years ago

  • Status changed from 12 to Fix Under Review
  • Priority changed from Normal to High
  • Target version changed from v0.44 to v0.47

i just found a crush algorithm fix/change that was in ceph.git but not in the kernel... i bet that is the problem. it is now sitting in the wip-crush branch awaiting review!

#9 Updated by Sage Weil almost 12 years ago

running this in a loop against the wip-crush kernel branch and the problem seems to be gone!

#10 Updated by Sage Weil almost 12 years ago

  • Status changed from Fix Under Review to Resolved

#11 Updated by Sage Weil almost 12 years ago

  • Status changed from Resolved to Need More Info

not fixed after all, failed again on

ubuntu@teuthology:/a/nightly_coverage_2012-05-13-b/1268

#12 Updated by Sage Weil almost 12 years ago

Last two failures were the rados api tests:


ubuntu@teuthology:/a$ cat nightly_coverage_2012-06-10-b/6919/config.yaml 
kernel: &id001
  sha1: ae386c7bacf279c8b02342b2ce8f4cab0bf616f6
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        lockdep: true
      osd:
        osd op complaint time: 120
    coverage: true
    fs: btrfs
    log-whitelist:
    - clocks not synchronized
    - slow request
    sha1: b25f27705b6dc63a1b8b565afc3aeb9fe0b7f1a4
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  ubuntu@plana59.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC2ijQopwlP92QQaUVB/Dx8HLmRokvfKoN4+2xGBvej1Qb/P67By9WNF92tQspMpLSpvqodRaXxcKixvrlLEJJQ16lY9xje2jNDWRjJ/AjnAg79qRA7N6L6BC7rccZ8oZWXqOpNKIKRNUUyn+XwtVgkbbcUr5NjQG2CfsLPtoNkdX98a/UyvhefomanU6oStJHFIptLXe2WjgW6DoXofu9qIpyZrO75XFOIlEpC2QTXU6eTVGRxhL2qtVVR42z7bcrNuVFr4TTTNvqaabyvW2mPZNz3a8ojh/pb8gIZGFYydiG8aOsS9dapltOysmdwfBPXHsAoD/49vbPEHIdyjarp
  ubuntu@plana60.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC5sa91O7pjxDijkXWrqS3slUWFtQJZtC3hfN5cKt+oISIVRZoy4CNxFrOWhmXwKOZaIV7fS2PzmhZQe9jfA5iRP7G/opl/5sMImzTIUUz8Qrlqb7epmwYHkoE196uN45y8iW/IDwsD7o69+rOGIj0SgdT0nGBYXB27OhdECdumxHcST/vM8dFfIwwMienvhKKB+ghrS2bYa33DcfYHBQWdIPXrUf2brb/XRYDwVMg8W5uPyW8dx3GIONVxSWequnTO1m0FuMKB7BrVpH19+/H4DFOwj6vqSvvOYUWm3Go3FsQTvparxrg7/Rliee0RBWcBIE3D0gWk6r74HV0C4pnT
  ubuntu@plana61.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDOTCMIScDTmD9NkfsWU7xeyZ+WOXai5izYeliiXDSjJC3bT6r8Fp+rhPfcHCVHiw++VsbvKZtkhjCSnJTVPWCdpRDghzJ3nZUBImWRo3PmHo1etQpCeimaOrIJ2q0ChN5jmSOqy5B+Z4om2vXBtBY6nkdTxDOr2+MH3NrSPkQSFB0zO+VPuwKXsemeUC6urb2IZZpxY3cxNq4fafTF9PROpgOnIA+o3igyU4duKEjnCzTHZjw/PL7Eph/7p6+UQgrUwe7pgVzT+2MM0zcBtBSXNqs3dCGmpvUapOkBlDoIX02EkWRNpkM3vfeFt1EFC17B5vd61Kg40bYUG8qWGR0T
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- ceph: null
- ceph-fuse: null
- workunit:
    client.0:
    - rados/test.sh
ubuntu@teuthology:/a$ cat teuthology-2012-06-15_19:00:09-master-testing-gcov/97/config.yaml
kernel: &id001
  branch: testing
nuke-on-error: true
overrides:
  ceph:
    branch: master
    conf:
      global:
        lockdep: true
    coverage: true
    fs: btrfs
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  ubuntu@plana40.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDEwyNlwC9Utqf3PCjL2JR4wwDkzpdEJuW93DOW82vYVisYEGod454JwXeNkjqzTUk6tXeRoUM9f/C6sZS3LFgHcMYt6m0sxP8DC4qU+q0YxCw9zLY8bXKe4DDjijM62h/SnyqyOWIh9amGT7wRwZEHBV1BKvZbNxQIJ7ESkuKsk/tJfWKhq7dSw6E/+MZ4yQtXvTyaJ3pK96Hq2uoUkawv+FxXBrzG3FtTTYA8gqA1SIiV3erEIQuBK/WD74i5yK4rwpfGTo7jNc0V6wrwO1BKFj/OGjSC+2LSAkBgf8WLe6UL/dHr3bBEyzm0V4xMf5Iqb8JGvkaXNEfbFqzKC2Wv
  ubuntu@plana54.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDPP6P/C+Ldv/43F9hTYiCRCuVUg01F2Vk/8UiasSHCfY563uxAVGywiFHfbt4nJ3ZcShMVViFXdKTjR+jFOeK2Gs3GHBnkHvn4y3OxM0jqnvEVTDB6BWMxpRNhzmiJFRcYmuaJC9qxv6Ls7Eig6kfYr4tvnG3IApzXZu++ZxQDRwHuG3cg8q+huNj9bmgB3EW/LGgGMLMWGShcqdXYGqURuptleq1qgMH8qhDR+ldqi8kZzYKmeiII0JHtH9YWnMSoMSNcapg8RxTYrHwIhyA/biZLBRdSRwG6e2YieExq6STQXAIBuvLE0uNRG57NMPX9+fraeNFFo6gWTcSuN0Zf
  ubuntu@plana60.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC5sa91O7pjxDijkXWrqS3slUWFtQJZtC3hfN5cKt+oISIVRZoy4CNxFrOWhmXwKOZaIV7fS2PzmhZQe9jfA5iRP7G/opl/5sMImzTIUUz8Qrlqb7epmwYHkoE196uN45y8iW/IDwsD7o69+rOGIj0SgdT0nGBYXB27OhdECdumxHcST/vM8dFfIwwMienvhKKB+ghrS2bYa33DcfYHBQWdIPXrUf2brb/XRYDwVMg8W5uPyW8dx3GIONVxSWequnTO1m0FuMKB7BrVpH19+/H4DFOwj6vqSvvOYUWm3Go3FsQTvparxrg7/Rliee0RBWcBIE3D0gWk6r74HV0C4pnT
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- ceph: null
- ceph-fuse: null
- workunit:
    clients:
      client.0:
      - rados/test.sh

#13 Updated by Tamilarasi muthamizhan almost 12 years ago

latest run log: ubuntu@teuthology:/a/teuthology-2012-06-18_19:00:05-regression-master-testing-gcov/1586

#14 Updated by Sage Weil almost 12 years ago

  • Status changed from Need More Info to 7
  • Assignee set to Sage Weil

here is the smoking gun. note that the pgid goes to 0.0 when linger tid 1 is resending the watch op 4:


2012-06-18 15:32:23.343694 7f27b064f780 10 client.4832.objecter maybe_request_map subscribing (onetime) to next osd map
2012-06-18 15:32:23.343711 7f27b064f780  1 -- 10.214.133.29:0/3025626 --> 10.214.132.9:6790/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 0x1e41b90 con 0x1e49eb0
2012-06-18 15:32:23.343731 7f27b064f780  1 -- 10.214.133.29:0/3025626 --> 10.214.132.9:6790/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 0x1e39390 con 0x1e49eb0
2012-06-18 15:32:23.344175 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 5 ==== mon_map v1 ==== 473+0+0 (803067403 0 0) 0x7f278c002580 con 0x1e49eb0
2012-06-18 15:32:23.344219 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 6 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (4115709547 0 0) 0x7f278c002760 con 0x1e49eb0
2012-06-18 15:32:23.344442 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 7 ==== osd_map(1465..1465 src has 6..1465) v3 ==== 4497+0+0 (3241515889 0 0) 0x7f278c004ed0 con 0x1e49eb0
2012-06-18 15:32:23.344473 7f27aa4dc700  3 client.4832.objecter handle_osd_map got epochs [1465,1465] > 0
2012-06-18 15:32:23.344477 7f27aa4dc700  3 client.4832.objecter handle_osd_map decoding full epoch 1465
2012-06-18 15:32:23.344539 7f27aa4dc700 20 client.4832.objecter dump_active .. 0 homeless
2012-06-18 15:32:23.344553 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 8 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (4115709547 0 0) 0x7f278c004380 con 0x1e49eb0
2012-06-18 15:32:23.344566 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 9 ==== osd_map(1465..1465 src has 6..1465) v3 ==== 4497+0+0 (3241515889 0 0) 0x7f278c004520 con 0x1e49eb0
2012-06-18 15:32:23.344620 7f27b064f780 10 client.4832.objecter create_pool name=test-rados-api-plana87-25626-3
2012-06-18 15:32:23.344629 7f27b064f780 10 client.4832.objecter pool_op_submit 1
2012-06-18 15:32:23.344632 7f27b064f780  1 -- 10.214.133.29:0/3025626 --> 10.214.132.9:6790/0 -- pool_op(create pool 0 auid 0 tid 1 name test-rados-api-plana87-25626-3 v0) v4 -- ?+0 0x1e408c0 con 0x1e49eb0
2012-06-18 15:32:23.344664 7f27aa4dc700  3 client.4832.objecter handle_osd_map ignoring epochs [1465,1465] <= 1465
2012-06-18 15:32:23.344673 7f27aa4dc700 20 client.4832.objecter dump_active .. 0 homeless
2012-06-18 15:32:23.344686 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 10 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (4115709547 0 0) 0x7f278c004870 con 0x1e49eb0
2012-06-18 15:32:24.119145 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 11 ==== pool_op_reply(tid 1 (0) Success v1466) v1 ==== 43+0+0 (3129229325 0 0) 0x7f278c004ed0 con 0x1e49eb0
2012-06-18 15:32:24.119183 7f27aa4dc700 10 client.4832.objecter handle_pool_op_reply pool_op_reply(tid 1 (0) Success v1466) v1
2012-06-18 15:32:24.119194 7f27aa4dc700 10 client.4832.objecter have request 1 at 0x1e392e0 Op: create
2012-06-18 15:32:24.119196 7f27aa4dc700 20 client.4832.objecter waiting for client to reach epoch 1466 before calling back
2012-06-18 15:32:24.119198 7f27aa4dc700 10 client.4832.objecter maybe_request_map subscribing (onetime) to next osd map
2012-06-18 15:32:24.119204 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 --> 10.214.132.9:6790/0 -- mon_subscribe({monmap=2+,osdmap=1466}) v2 -- ?+0 0x7f2798005e90 con 0x1e49eb0
2012-06-18 15:32:24.119222 7f27aa4dc700 10 client.4832.objecter done
2012-06-18 15:32:24.490340 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 12 ==== osd_map(1466..1466 src has 6..1466) v3 ==== 280+0+0 (1318621838 0 0) 0x7f278c004ed0 con 0x1e49eb0
2012-06-18 15:32:24.490373 7f27aa4dc700  3 client.4832.objecter handle_osd_map got epochs [1466,1466] > 1465
2012-06-18 15:32:24.490379 7f27aa4dc700  3 client.4832.objecter handle_osd_map decoding incremental epoch 1466
2012-06-18 15:32:24.490397 7f27aa4dc700 20 client.4832.objecter dump_active .. 0 homeless
2012-06-18 15:32:24.490415 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 13 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (4115709547 0 0) 0x7f278c004870 con 0x1e49eb0
2012-06-18 15:32:24.490442 7f27b064f780 15 client.4832.objecter send_linger 1
2012-06-18 15:32:24.490459 7f27b064f780 10 client.4832.objecter recalc_op_target tid 2 pgid 460.7fc1f406 acting [2,5]
2012-06-18 15:32:24.490497 7f27b064f780 20 client.4832.objecter  note: not requesting ack
2012-06-18 15:32:24.490502 7f27b064f780 10 client.4832.objecter op_submit oid foo @460 [watch 1~0] tid 2 osd.2
2012-06-18 15:32:24.490506 7f27b064f780 15 client.4832.objecter send_op 2 to osd.2
2012-06-18 15:32:24.490510 7f27b064f780  1 -- 10.214.133.29:0/3025626 --> 10.214.132.9:6806/25783 -- osd_op(client.4832.0:2 foo [watch 1~0] 460.7fc1f406) v4 -- ?+0 0x1e351b0 con 0x1e67d00
2012-06-18 15:32:24.490524 7f27b064f780  5 client.4832.objecter 0 unacked, 1 uncommitted
2012-06-18 15:32:25.282374 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== osd.2 10.214.132.9:6806/25783 1 ==== osd_op_reply(2 foo [watch 1~0] = 0) v4 ==== 102+0+0 (4054106005 0 0) 0x7f2788000a30 con 0x1e67d00
2012-06-18 15:32:25.282407 7f27aa4dc700 10 client.4832.objecter in handle_osd_op_reply
2012-06-18 15:32:25.282410 7f27aa4dc700  7 client.4832.objecter handle_osd_op_reply 2 ondisk v 0'0 in 460.7fc1f406 attempt 0
2012-06-18 15:32:25.282415 7f27aa4dc700 10 client.4832.objecter  op 0 rval 0 len 0
2012-06-18 15:32:25.282417 7f27aa4dc700 15 client.4832.objecter handle_osd_op_reply safe
2012-06-18 15:32:25.282418 7f27aa4dc700 15 client.4832.objecter handle_osd_op_reply completed tid 2
2012-06-18 15:32:25.282423 7f27aa4dc700  5 client.4832.objecter 0 unacked, 0 uncommitted
2012-06-18 15:32:25.282424 7f27aa4dc700 10 client.4832.objecter _linger_commit 1
2012-06-18 15:32:25.282451 7f27b064f780 10 client.4832.objecter delete_pool 460
2012-06-18 15:32:25.282460 7f27b064f780 10 client.4832.objecter pool_op_submit 3
2012-06-18 15:32:25.282466 7f27b064f780  1 -- 10.214.133.29:0/3025626 --> 10.214.132.9:6790/0 -- pool_op(delete pool 460 auid 0 tid 3 name delete v1466) v4 -- ?+0 0x1e351b0 con 0x1e49eb0
2012-06-18 15:32:26.482302 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 14 ==== pool_op_reply(tid 3 (0) Success v1468) v1 ==== 43+0+0 (1225606032 0 0) 0x7f278c004a30 con 0x1e49eb0
2012-06-18 15:32:26.482341 7f27aa4dc700 10 client.4832.objecter handle_pool_op_reply pool_op_reply(tid 3 (0) Success v1468) v1
2012-06-18 15:32:26.482351 7f27aa4dc700 10 client.4832.objecter have request 3 at 0x1e392e0 Op: delete
2012-06-18 15:32:26.482354 7f27aa4dc700 20 client.4832.objecter waiting for client to reach epoch 1468 before calling back
2012-06-18 15:32:26.482355 7f27aa4dc700 10 client.4832.objecter maybe_request_map subscribing (onetime) to next osd map
2012-06-18 15:32:26.482361 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 --> 10.214.132.9:6790/0 -- mon_subscribe({monmap=2+,osdmap=1468}) v2 -- ?+0 0x7f27980075f0 con 0x1e49eb0
2012-06-18 15:32:26.482380 7f27aa4dc700 10 client.4832.objecter done
2012-06-18 15:32:27.129004 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 15 ==== osd_map(1468..1468 src has 6..1468) v3 ==== 168+0+0 (3345943677 0 0) 0x7f278c004ed0 con 0x1e49eb0
2012-06-18 15:32:27.129036 7f27aa4dc700  3 client.4832.objecter handle_osd_map got epochs [1468,1468] > 1466
2012-06-18 15:32:27.129041 7f27aa4dc700  3 client.4832.objecter handle_osd_map requesting missing epoch 1467
2012-06-18 15:32:27.129043 7f27aa4dc700 10 client.4832.objecter maybe_request_map subscribing (onetime) to next osd map
2012-06-18 15:32:27.129048 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 --> 10.214.132.9:6790/0 -- mon_subscribe({monmap=2+,osdmap=1467}) v2 -- ?+0 0x7f2798007d90 con 0x1e49eb0
2012-06-18 15:32:27.129065 7f27aa4dc700 20 client.4832.objecter dump_active .. 0 homeless
2012-06-18 15:32:27.129074 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 16 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (4115709547 0 0) 0x7f278c004870 con 0x1e49eb0
2012-06-18 15:32:27.129860 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 17 ==== osd_map(1467..1468 src has 6..1468) v3 ==== 328+0+0 (1304961584 0 0) 0x7f278c004ed0 con 0x1e49eb0
2012-06-18 15:32:27.129899 7f27aa4dc700  3 client.4832.objecter handle_osd_map got epochs [1467,1468] > 1466
2012-06-18 15:32:27.129905 7f27aa4dc700  3 client.4832.objecter handle_osd_map decoding incremental epoch 1467
2012-06-18 15:32:27.129924 7f27aa4dc700 10 client.4832.objecter recalc_linger_op_target tid 1 pgid 460.7fc1f406 acting [2,5]
2012-06-18 15:32:27.129930 7f27aa4dc700  3 client.4832.objecter handle_osd_map decoding incremental epoch 1468
2012-06-18 15:32:27.129939 7f27aa4dc700 15 client.4832.objecter send_linger 1
2012-06-18 15:32:27.129959 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 --> 10.214.132.9:6790/0 -- mon_get_version(what=osdmap handle=1) v1 -- ?+0 0x7f2798008940 con 0x1e49eb0
2012-06-18 15:32:27.129977 7f27aa4dc700 20 client.4832.objecter  note: not requesting ack
2012-06-18 15:32:27.129980 7f27aa4dc700 10 client.4832.objecter op_submit oid foo @460 [watch 1~0] tid 4 osd.2
2012-06-18 15:32:27.129984 7f27aa4dc700 15 client.4832.objecter send_op 4 to osd.2
2012-06-18 15:32:27.129987 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 --> 10.214.132.9:6806/25783 -- osd_op(client.4832.0:4 foo [watch 1~0] 0.0) v4 -- ?+0 0x7f2798008f00 con 0x1e67d00
2012-06-18 15:32:27.130001 7f27aa4dc700  5 client.4832.objecter 0 unacked, 1 uncommitted
2012-06-18 15:32:27.130003 7f27aa4dc700 15 client.4832.objecter send_linger 1 already (re)registering
2012-06-18 15:32:27.130005 7f27aa4dc700 20 client.4832.objecter dump_active .. 0 homeless
2012-06-18 15:32:27.130006 7f27aa4dc700 20 client.4832.objecter 4       0.0     osd.2   foo     [watch 1~0]
2012-06-18 15:32:27.130021 7f27aa4dc700  1 -- 10.214.133.29:0/3025626 <== mon.2 10.214.132.9:6790/0 18 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (4115709547 0 0) 0x7f278c004870 con 0x1e49eb0
2012-06-18 15:32:27.130305 7f27b064f780 10 client.4832.objecter close_session for osd.2

see wip-2022 for fix

#15 Updated by Sage Weil almost 12 years ago

  • Status changed from 7 to Resolved

#16 Updated by Sage Weil over 11 years ago

  • Status changed from Resolved to Need More Info
  • Target version deleted (v0.47)

apparently there is a different cause for this:

ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2012-07-01_21:52:14-regression-next-testing-basic/4394$ zcat remote/ubuntu@plana52.front.sepia.ceph.com/log/cluster.mon.a.log.gz| grep WRN
2012-07-02 01:51:02.081488 osd.2 10.214.132.26:6806/7570 23 : [WRN] client.4119 10.214.132.11:0/1009797 misdirected client.4119.0:179854 pg 3.8724b63c to osd.2 in e179
ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2012-07-01_21:52:14-regression-next-testing-basic/4394$ cat config.yaml 
kernel: &id001
  branch: testing
  kdb: true
nuke-on-error: true
overrides:
  ceph:
    branch: next
    conf:
      client:
        rbd cache: false
    fs: btrfs
    log-whitelist:
    - slow request
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
  ubuntu@plana52.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC9kswBp2g5ZV1Qrvlee8MvUOCNdubQFqUBr5WSsmFBODqEuiitWbhuBu2Ucz0lBMf41DpMKLeYDN0lIC94GZmGaiCN+Ak9Ia05d/uRvesT2nDgHB3Z9J/zEFlY8RVxL3xhD+hq4u8dbASlqqoMDiBP+7efZMxt4Ndnzr/yOxge3KenxyQImBUS+OV+BqnfCOHf6BqM33U1leXz2kng7ocxoE91DAMslKD/2DPRSYEhfucUJZk6IYevr/g0JVhbfvjSlZzwUEfTyVmPeqNyls/U+azhKlvQbqpb+ttc02RNydQ1YgOgHFCaqd9Vm8XjUU6vYGlkFHZ+BMJuEwA9AH/D
  ubuntu@plana55.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDXY+KFAAzWoJq5vLwy6PJHxNeqz3fHCisJDAbtdrnjhhxVyUQtQLlhIPqiQHi6PADNYNUS/4um0TNmDFYxJLJU9SxqmBQ3QTM9F56YQa9F/+98o4LyPLS5TXqq+nCDbU1vhMbpu0mv2MDZ9BVZAgdT/yYgYGErIQz2MnaCAbgp0SRSZOxq0/3KgMz4W0KxkagiNglZV3RvarYASdqZheYeQYtnIyEw+Hk/ZLHoxUirBthAuCu5RvYYTDptQDuOR0tjRaMS81kapD5VZhFbetSxJ9rJ21oepmLSY+0UoIufZS4CNJ/sP2HDDc1Pw1mjJhqClScxTOP1yUnNWhW1d0sP
  ubuntu@plana67.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDqiZsiT7h3fNR9yzwK2WToaotO4olIxmVdh+aSf3ILwEpHjFYbWXymL0C77hn0MdGRbaOzWOSMMng3MAHKy9xR3/CGNXXqO7iEK1fJOvSfmypkvJDyrMY/RuSvdifcXJyREvFsSK6cdmRpO235ODhfui4FC5BLmgv/VvasH/1Ur4ALfe7UE9L+cU4VeoJdl082oYeo1nn1beERgaypX67MXepG2NKbEY77jG5FXbGVpKWmsgIEWiiX8p6+afTOP+8cGsM3vsAG7nTJeFVKkEHc7A8cPkT4l/iXKjSiwWAtU5NV0QmRC/1ad78+xTOWNzJaTrIxoKuuGpB+DjdvrJgN
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down or wrong addr
    - objects unfound and apparently lost
- thrashosds: null
- rbd_fsx:
    clients:
    - client.0
    ops: 20000

#17 Updated by Sage Weil over 11 years ago

  • Status changed from Need More Info to In Progress
  • Assignee changed from Sage Weil to Samuel Just

2012-07-05 13:55:58.671142 7fa0c4aab700 15 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'30 lcod 11'30 mlcod 11'30 active+clean] enqueue_op 0x1c04780 osd_sub_op_reply(client.4121.0:215 0.8 13046878/10000000035.00000000/head//0 [] ack, result = 0) v1
2012-07-05 13:55:58.671248 7fa0c20a4700 10 osd.2 11 dequeue_op osd_sub_op_reply(client.4121.0:215 0.8 13046878/10000000035.00000000/head//0 [] ack, result = 0) v1 pg pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'30 lcod 11'30 mlcod 11'30 active+clean]
2012-07-05 13:55:58.671281 7fa0c20a4700 20 osd.2 11 _share_map_outgoing osd.5 10.214.132.17:6814/7611 already has epoch 11
2012-07-05 13:55:58.671293 7fa0c20a4700  7 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'30 lcod 11'30 mlcod 11'30 active+clean] repop_ack rep_tid 83 op osd_op(client.4121.0:215 10000000035.00000000 [write 311~26] 0.13046878 snapc 1=[]) v4 result 0 ack_type 1 from osd.5
2012-07-05 13:55:58.671310 7fa0c20a4700 10 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'30 lcod 11'30 mlcod 11'30 active+clean] eval_repop repgather(0x1b8bea0 applied 11'33 rep_tid=83 wfack= wfdisk=2,5 op=osd_op(client.4121.0:215 10000000035.00000000 [write 311~26] 0.13046878 snapc 1=[]) v4) wants=d
2012-07-05 13:55:58.671325 7fa0c20a4700 10 osd.2 11 dequeue_op 0x1b5fe00 finish
2012-07-05 13:55:58.691811 7fa0c8ab3700 10 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'30 lcod 11'30 mlcod 11'30 active+clean] op_commit repgather(0x1e68d80 applied 11'31 rep_tid=81 wfack= wfdisk=2 op=osd_op(client.4121.0:201 10000000035.00000000 [write 255~29] 0.13046878 snapc 1=[]) v4)
2012-07-05 13:55:58.691843 7fa0c8ab3700 15 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'31 lcod 11'30 mlcod 11'30 active+clean] requeue_ops 0x208cb00
2012-07-05 13:55:58.691854 7fa0c8ab3700 15 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'31 lcod 11'30 mlcod 11'30 active+clean] enqueue_op 0x2035240 osd_op(client.4121.0:201 10000000035.00000000 [write 255~29] 0.13046878 RETRY snapc 1=[]) v4
2012-07-05 13:55:58.691866 7fa0c8ab3700 20 osd.2 11 op_has_sufficient_caps pool=0 (data) owner=0 may_read=0 may_write=1 may_exec=0 require_exec_caps=0 -> yes
2012-07-05 13:55:58.691877 7fa0c8ab3700 10 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'31 lcod 11'31 mlcod 11'30 active+clean] eval_repop repgather(0x1e68d80 applied 11'31 rep_tid=81 wfack= wfdisk= op=osd_op(client.4121.0:201 10000000035.00000000 [write 255~29] 0.13046878 snapc 1=[]) v4) wants=d
2012-07-05 13:55:58.691894 7fa0c8ab3700 15 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'31 lcod 11'31 mlcod 11'30 active+clean] log_op_stats osd_op(client.4121.0:201 10000000035.00000000 [write 255~29] 0.13046878 snapc 1=[]) v4 inb 158 outb 0 rlat 0.003825 lat 0.192428
2012-07-05 13:55:58.691908 7fa0c8ab3700 15 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'31 lcod 11'31 mlcod 11'30 active+clean] update_stats 3'46
2012-07-05 13:55:58.691919 7fa0c8ab3700 10 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'31 lcod 11'31 mlcod 11'30 active+clean]  sending commit on repgather(0x1e68d80 applied 11'31 rep_tid=81 wfack= wfdisk= op=osd_op(client.4121.0:201 10000000035.00000000 [write 255~29] 0.13046878 snapc 1=[]) v4) 0x1e90800
2012-07-05 13:55:58.691933 7fa0c8ab3700  1 -- 10.214.132.12:6813/6750 --> 10.214.133.24:0/15771 -- osd_op_reply(201 10000000035.00000000 [write 255~29] ondisk = 0) v4 -- ?+0 0x1e90800 con 0x1ec93c0
2012-07-05 13:55:58.691946 7fa0c8ab3700 10 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'31 lcod 11'31 mlcod 11'30 active+clean]  removing repgather(0x1e68d80 applied 11'31 rep_tid=81 wfack= wfdisk= op=osd_op(client.4121.0:201 10000000035.00000000 [write 255~29] 0.13046878 snapc 1=[]) v4)
2012-07-05 13:55:58.691959 7fa0c8ab3700 20 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'31 lcod 11'31 mlcod 11'30 active+clean]    q front is repgather(0x1e68d80 applied 11'31 rep_tid=81 wfack= wfdisk= op=osd_op(client.4121.0:201 10000000035.00000000 [write 255~29] 0.13046878 snapc 1=[]) v4)
2012-07-05 13:55:58.691989 7fa0c8ab3700 10 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'31 lcod 11'31 mlcod 11'30 active+clean] op_commit repgather(0x1e68900 applied 11'32 rep_tid=82 wfack= wfdisk=2,5 op=osd_op(client.4121.0:208 10000000035.00000000 [write 284~27] 0.13046878 snapc 1=[]) v4)
2012-07-05 13:55:58.692004 7fa0c8ab3700 15 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'32 lcod 11'31 mlcod 11'30 active+clean] requeue_ops 0x208cb80
2012-07-05 13:55:58.692013 7fa0c8ab3700 15 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'32 lcod 11'31 mlcod 11'30 active+clean] enqueue_op 0x1ce6900 osd_op(client.4121.0:208 10000000035.00000000 [write 284~27] 0.13046878 RETRY snapc 1=[]) v4
2012-07-05 13:55:58.692025 7fa0c8ab3700 20 osd.2 11 op_has_sufficient_caps pool=0 (data) owner=0 may_read=0 may_write=1 may_exec=0 require_exec_caps=0 -> yes
2012-07-05 13:55:58.692035 7fa0c8ab3700 10 osd.2 11 pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'32 lcod 11'32 mlcod 11'30 active+clean] eval_repop repgather(0x1e68900 applied 11'32 rep_tid=82 wfack= wfdisk=5 op=osd_op(client.4121.0:208 10000000035.00000000 [write 284~27] 0.13046878 snapc 1=[]) v4) wants=d
2012-07-05 13:55:58.692067 7fa0c18a3700 10 osd.2 11 dequeue_op osd_op(client.4121.0:208 10000000035.00000000 [write 284~27] 0.13046878 RETRY snapc 1=[]) v4 pg pg[0.8( v 11'33 (0'0,11'33] n=6 ec=1 les/c 6/6 4/4/3) [2,5] r=0 lpr=4 luod=11'32 lcod 11'32 mlcod 11'30 active+clean]
2012-07-05 13:55:58.692093 7fa0c18a3700 20 osd.2 11 _share_map_outgoing osd.5 10.214.132.17:6814/7611 already has epoch 11

requeue_ops is called twice, each call putting the request that was waiting for ondisk at the front of the queue, which is wrong because the replies get reordered.
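
A tiny, self-contained illustration of that ordering problem (plain std::list, not the OSD's actual op queue): requeueing the waiting ops one at a time at the front reverses their relative order, while splicing the batch at the front as a whole preserves it.

// Illustration only -- not OSD code.
#include <cassert>
#include <list>

int main() {
  std::list<int> queue   = {300};        // some op already in the queue
  std::list<int> waiting = {201, 208};   // ops that were waiting for ondisk, in order

  // Buggy pattern: one push_front() per op -> 208 ends up ahead of 201,
  // so the replies for 201 and 208 go out in the wrong order.
  std::list<int> buggy = queue;
  for (int op : waiting) buggy.push_front(op);
  assert((buggy == std::list<int>{208, 201, 300}));

  // Order-preserving pattern: move the whole batch to the front in one splice.
  std::list<int> fixed = queue;
  std::list<int> batch = waiting;
  fixed.splice(fixed.begin(), batch);
  assert((fixed == std::list<int>{201, 208, 300}));
}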

#18 Updated by Sage Weil over 11 years ago

this is easily reproduced with

interactive-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject socket failures: 5000
      client:
        debug ms: 20
        debug objecter: 20
        debug objectcacher: 20
      osd:
        debug ms: 1
        debug osd: 20
        debug filestore: 20
    fs: btrfs
    log-whitelist:
    - slow request
    branch: master
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
tasks:
- ceph: null
- ceph-fuse: null
- workunit:
    clients:
      all:
      - suites/blogbench.sh

#19 Updated by Sage Weil over 11 years ago

  • Backport set to argonaut

#20 Updated by Sage Weil over 11 years ago

ugh, wrong bug.. ignore last two messages!

#21 Updated by Sage Weil over 11 years ago

  • Assignee changed from Samuel Just to Sage Weil

#22 Updated by Tamilarasi muthamizhan over 11 years ago

latest log: ubuntu@teuthology:/a/teuthology-2012-07-08_21:46:11-regression-next-testing-basic/7815

and

ubuntu@teuthology:/a/teuthology-2012-07-08_21:46:11-regression-next-testing-basic/7824

#23 Updated by Tamilarasi muthamizhan over 11 years ago

Log location: ubuntu@teuthology:/a/teuthology-2012-07-08_21:46:11-regression-next-testing-basic/7824

Attaching the config.yaml, summary.yaml and cluster log for reference:

ubuntu@teuthology:/a/teuthology-2012-07-08_21:46:11-regression-next-testing-basic/7824$ zcat /log/cluster.mon.a.log.gz | grep WRN
2012-07-08 22:52:54.930706 osd.2 10.214.132.21:6806/8952 22 : [WRN] client.4119 10.214.132.23:0/1019058 misdirected client.4119.0:170290 pg 3.ef671d64 to osd.2 in e302

ubuntu@teuthology:/a/teuthology-2012-07-08_21:46:11-regression-next-testing-basic/7824$ cat summary.yaml
ceph-sha1: 97c1562dda2b0f3370bb186e7acbcd9c273210bc
description: collection:thrash clusters:6-osd-3-machine.yaml fs:xfs.yaml thrashers:default.yaml
  workloads:rbd_fsx_cache_writeback.yaml
duration: 1442.6627759933472
failure_reason: '"2012-07-08 22:52:54.930706 osd.2 10.214.132.21:6806/8952 22 : [WRN]
  client.4119 10.214.132.23:0/1019058 misdirected client.4119.0:170290 pg 3.ef671d64
  to osd.2 in e302" in cluster log'
flavor: basic
owner: scheduled_teuthology@teuthology
success: false

ubuntu@teuthology:/a/teuthology-2012-07-08_21:46:11-regression-next-testing-basic/7824$ cat config.yaml
kernel: &id001
  kdb: true
  sha1: null
nuke-on-error: true
overrides:
  ceph:
    conf:
      client:
        rbd cache: true
    fs: xfs
    log-whitelist:
    - slow request
    sha1: 97c1562dda2b0f3370bb186e7acbcd9c273210bc
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDXY+KFAAzWoJq5vLwy6PJHxNeqz3fHCisJDAbtdrnjhhxVyUQtQLlhIPqiQHi6PADNYNUS/4um0TNmDFYxJLJU9SxqmBQ3QTM9F56YQa9F/+98o4LyPLS5TXqq+nCDbU1vhMbpu0mv2MDZ9BVZAgdT/yYgYGErIQz2MnaCAbgp0SRSZOxq0/3KgMz4W0KxkagiNglZV3RvarYASdqZheYeQYtnIyEw+Hk/ZLHoxUirBthAuCu5RvYYTDptQDuOR0tjRaMS81kapD5VZhFbetSxJ9rJ21oepmLSY+0UoIufZS4CNJ/sP2HDDc1Pw1mjJhqClScxTOP1yUnNWhW1d0sP
: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCuMOcu2XPQovy/Qzmwyvc9tvGP9JZVJ6cqiJ3RPOSGgAifKLTxe2ramHpD8AKcdthu8VAfouFpZK4CtBWKJowurR+4yZKgEugzvYuZ/nK/np56vreBQmRBWD1vLPtxPsTT3YGu5qx+ixdSwrSxexxc0/7+EW9x1D6knL+OGUNWksoGIRlXxjh9qafbw/1XKeQQF28vxBXHofXUFY8USMUcq5HDuaFfmgKzufH6vk84oqyr/jtGej6b4g6tbGiHPYR+o5tmTQHyxpOxqLZP2RFFqHlQ/QaOmRvSNIoOo+1UbqdcWsLk16/lXIS1mI+BZsZouk1H+fGeMTEUDGktiPW7
: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC1atTvrZo1Sr3dY/WzNvE8gg/UeFV/U2NdmR9UlTkihYWK9nqdjirHhN/MNYVMkjY+yP3zWnbkQNx89J+XTvf1ROM+CsT3n885LCMxDtzzjwU2/x6vjwkLKSm43e8QMOfsCVl4jTniK3godrIJIg/pvUwD+dnkV/qmx3SpMnRTwzwgpfFwtFVENu7k519POG1jVrQ1tpksAke6s4/gWNsAeYxoDzP4tegWPmWIu7qHcpH8X5t/ClHWeMV6ur0yVRk/6rX0Jve98PFrsW5vFJvbZpMNMvV4ei/g8jyeCUS3OBqanRnSBgn3geLTngw4y32+squLtIKBeojNQ+8u/YNL
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down or wrong addr
    - objects unfound and apparently lost
- thrashosds:
    timeout: 1200
- rbd_fsx:
    clients:
    - client.0
    ops: 20000

#24 Updated by Tamilarasi muthamizhan over 11 years ago

Log location: ubuntu@teuthology:/a/teuthology-2012-07-08_21:46:11-regression-next-testing-basic/7815

Attaching the config.yaml, summary.yaml and cluster log for reference:

ubuntu@teuthology:/a/teuthology-2012-07-08_21:46:11-regression-next-testing-basic/7815$ cat config.yaml
kernel: &id001
  kdb: true
  sha1: null
nuke-on-error: true
overrides:
  ceph:
    conf:
      client:
        rbd cache: true
        rbd cache max dirty: 0
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 97c1562dda2b0f3370bb186e7acbcd9c273210bc
roles:
- - mon.a
  - osd.0
  - osd.1
  - osd.2
- - mds.a
  - osd.3
  - osd.4
  - osd.5
- - client.0
targets:
: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC9Ru6XkJBGiUQK9AtlFt82TzpaWuKams26i0FcItt3hbniR1yxpWVHM3dQI5Gft3liumnOD+cPZiZJzGYyj2KDBCZ8G9V65YqCbzO+moJmv5wDWKg1pEIIW040aLrlsOPbZlEL7htT14MHTTstyTQCOLkrySCpexwYrA2wQBhsHc7pxL+XLa+WM1zTXSQe6QrS8iYxITGRibEMSjcXlOuLFnst42O6o4WQHd31WS9pbniBmso7KVgTFxmcN5rvEo1YAJJYwVxGfmorWrXan1ULY6CksasatbCuohmVNNZfsnE8KdyYsPYCbKIPp9NnmBL3Pp/oPqqyPsj36Wgj5e4/
: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDrxOb9f5/SfItd83HOnLVyJRnfji0fbdvL+3T82akjV6J4s/nyR8Bu+rpXbyUwu2BRDoxK4pT2dBqw86meq1qbU5Q1ypWBSH41MYGd213fy0g8YibFiYVGmXFCSwtY8X2Pet9vtLDoYvtnsgNI8djy5GPkQyZFKSszJHznZvQU10NWfM6RfxxtsBKXC/aot4QXb3GIym2/EmeuTAAef6p98dd15P9l9HQkpwXZLwiDZ53IbU79CTINo5HTD/6+1XHUcjb1OUKzQMx1jU485gW6IlsR0G0jJKSv+YEu4zSxxva7gWt1AYxGo2jhNDffEGLsNurzXFf9yeYshCTAszLf
: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDo+Kh24vRxeTQ6/n5PIIGuxrPHPRO/xMQlwoLHi7mR01cIXJMG5wet7mp2om3/5SZSDcLBHduDKrdWL142Sg5fC0zZPUggbxS7nz/UCjYBzMsOtHEUAU5Gs0KFopOCHXNEveK95ezsroMAD5+jS/IEpiooYCkrR3H+NSvUU0Ae352PlXqV0vamkYzyQyEMmhFE50ALhUXbKMve3d2mxJee5sqVZSBmQTbze9RKUA96t9iiwiheflXbN1i9WHlbBOIue5pZ5fM3/vqPWgaShfFpa0pT56QKJfjyFcDeCLOislo23E5qKAJOi5vn5BoYVtG3niNQpt/YbYGfDEHVeqt9
tasks:
- internal.lock_machines: 3
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down or wrong addr
    - objects unfound and apparently lost
- thrashosds:
    timeout: 1200
- rbd_fsx:
    clients:
    - client.0
    ops: 20000

ubuntu@teuthology:/a/teuthology-2012-07-08_21:46:11-regression-next-testing-basic/7815$ cat summary.yaml
ceph-sha1: 97c1562dda2b0f3370bb186e7acbcd9c273210bc
description: collection:thrash clusters:6-osd-3-machine.yaml fs:btrfs.yaml thrashers:default.yaml
  workloads:rbd_fsx_cache_writethrough.yaml
duration: 1057.5641641616821
failure_reason: '"2012-07-08 22:45:46.910053 osd.2 10.214.132.39:6806/8420 46 : [WRN]
  client.4119 10.214.131.3:0/1009479 misdirected client.4119.0:212884 pg 3.fd47d096
  to osd.2 in e353" in cluster log'
flavor: basic
owner: scheduled_teuthology@teuthology
success: false

ubuntu@teuthology:/a/teuthology-2012-07-08_21:46:11-regression-next-testing-basic/7815$ zcat /log/cluster.mon.a.log.gz | grep WRN
2012-07-08 22:45:46.910053 osd.2 10.214.132.39:6806/8420 46 : [WRN] client.4119 10.214.131.3:0/1009479 misdirected client.4119.0:212884 pg 3.fd47d096 to osd.2 in e353

#25 Updated by Sage Weil over 11 years ago

  • Status changed from In Progress to Fix Under Review
  • Assignee deleted (Sage Weil)

FINALLY tracked this one down. tests now pass consistently. see bugfix-2022 for the fix.

#26 Updated by Sage Weil over 11 years ago

  • Status changed from Fix Under Review to Resolved

YAY

#27 Updated by Anonymous over 11 years ago

Writing down things discovered during a conversation on IRC:

The log message still happens with v0.48argonaut; it seems 41a570778a51fe9a36a5b67a177d173889e58363 is needed for the full fix. It is already in the stable branch, waiting for the next release.
