Bug #50389


mds: "cluster [ERR] Error recovering journal 0x203: (2) No such file or directory" in cluster log"

Added by Patrick Donnelly about 3 years ago. Updated about 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
% Done: 100%
Source: Q/A
Tags: backport_processed
Backport: pacific,octopus
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS, osdc
Labels (FS): qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Symptom:

2021-04-15T16:51:03.445+0000 7faaa4574700  1 mds.j Updating MDS map to version 16 from mon.0
2021-04-15T16:51:03.446+0000 7faaa4574700 10 mds.j      my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
2021-04-15T16:51:03.446+0000 7faaa4574700 10 mds.j  mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
2021-04-15T16:51:03.446+0000 7faaa4574700 10 mds.j my gid is 4329
2021-04-15T16:51:03.446+0000 7faaa4574700 10 mds.j map says I am mds.3.0 state up:standby-replay
2021-04-15T16:51:03.446+0000 7faaa4574700 10 mds.j msgr says I am [v2:172.21.15.39:6838/1945817279,v1:172.21.15.39:6839/1945817279]
2021-04-15T16:51:03.446+0000 7faaa4574700 10 mds.j handle_mds_map: handling map as rank 3
2021-04-15T16:51:03.446+0000 7faaa4574700 10 notify_mdsmap: mds.metrics
2021-04-15T16:51:03.446+0000 7faaa4574700 10 notify_mdsmap: mds.metrics: rank0 is mds.i
2021-04-15T16:51:03.446+0000 7faa9dd67700  1 -- [v2:172.21.15.39:6838/1945817279,v1:172.21.15.39:6839/1945817279] --> [v2:172.21.15.39:3300/0,v1:172.21.15.39:6789/0] -- mon_get_version(what=osdmap handle=4) v1 -- 0x5649ed383760 con 0x5649ed42e800
2021-04-15T16:51:03.446+0000 7faa9dd67700  4 mds.3.log Waiting for journal 0x203 to recover...
2021-04-15T16:51:03.448+0000 7faaa4574700  1 -- [v2:172.21.15.39:6838/1945817279,v1:172.21.15.39:6839/1945817279] <== mon.0 v2:172.21.15.39:3300/0 19 ==== mon_get_version_reply(handle=4 version=26) v2 ==== 24+0+0 (secure 0 0 0) 0x5649ed383760 con 0x5649ed42e800
2021-04-15T16:51:03.448+0000 7faa9e568700  0 mds.4329.journaler.mdlog(ro) error getting journal off disk
2021-04-15T16:51:03.448+0000 7faa9dd67700  4 mds.3.log Journal 0x203 recovered.
2021-04-15T16:51:03.448+0000 7faa9dd67700 -1 log_channel(cluster) log [ERR] : Error recovering journal 0x203: (2) No such file or directory

From: /ceph/teuthology-archive/pdonnell-2021-04-15_01:35:57-fs-wip-pdonnell-testing-20210414.230315-distro-basic-smithi/6047603/remote/smithi039/log/ceph-mds.j.log.gz

This one is weird. The standby-replay daemon has no traffic with the OSDs up to that point and yet concluded ENOENT. That usually means the objecter decided there is no such pool after getting the latest version of the OSDMap (the mon_get_version in the output above). I don't see how that could be the case here. Might be a bug in osdc.
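
To make that suspicion concrete, here is a minimal, self-contained C++ sketch of the kind of client-side short-circuit being described; the types and names (ToyOSDMap, submit_read) are illustrative only and are not the real Objecter/OSDMap API. If the target pool is missing from the OSDMap the client already holds, the read can complete with -ENOENT without any osd_op ever being sent, which would match the absence of OSD traffic:

#include <cerrno>
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

// Toy stand-in for the cached OSDMap; illustrative only.
struct ToyOSDMap {
  std::map<int64_t, std::string> pools;   // pool id -> pool name
  bool have_pool(int64_t id) const { return pools.count(id) > 0; }
};

// If the pool is absent from the (already up-to-date) map, the read is
// completed locally with -ENOENT and never reaches an OSD.
int submit_read(const ToyOSDMap& osdmap, int64_t pool, const std::string& oid) {
  if (!osdmap.have_pool(pool)) {
    std::cout << "pool " << pool << " dne; completing read of " << oid
              << " with -ENOENT, no osd_op sent\n";
    return -ENOENT;
  }
  std::cout << "would map " << oid << " to a PG and send the osd_op\n";
  return 0;
}

int main() {
  ToyOSDMap osdmap;
  osdmap.pools[3] = "cephfs_metadata";
  submit_read(osdmap, 3, "203.00000000");    // pool present: op would be sent
  submit_read(osdmap, -1, "203.00000000");   // invalid pool id: immediate ENOENT
}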


Related issues 2 (0 open, 2 closed)

Copied to CephFS - Backport #50848: pacific: mds: "cluster [ERR] Error recovering journal 0x203: (2) No such file or directory" in cluster log" (Resolved, Patrick Donnelly)
Copied to CephFS - Backport #50849: octopus: mds: "cluster [ERR] Error recovering journal 0x203: (2) No such file or directory" in cluster log" (Rejected)
#1

Updated by Xiubo Li about 3 years ago

  • Assignee set to Xiubo Li

I will take it. Thanks.

#2

Updated by Xiubo Li about 3 years ago

  • Status changed from New to In Progress
#3

Updated by Xiubo Li about 3 years ago

Checked the mds log:

2021-04-15T16:51:03.446+0000 7faa9dd67700  1 -- [v2:172.21.15.39:6838/1945817279,v1:172.21.15.39:6839/1945817279] --> [v2:172.21.15.39:3300/0,v1:172.21.15.39:6789/0] -- mon_get_version(what=osdmap handle=4) v1 -- 0x5649ed383760 con 0x5649ed42e800
2021-04-15T16:51:03.446+0000 7faa9dd67700  4 mds.3.log Waiting for journal 0x203 to recover...
2021-04-15T16:51:03.448+0000 7faaa4574700  1 -- [v2:172.21.15.39:6838/1945817279,v1:172.21.15.39:6839/1945817279] <== mon.0 v2:172.21.15.39:3300/0 19 ==== mon_get_version_reply(handle=4 version=26) v2 ==== 24+0+0 (secure 0 0 0) 0x5649ed383760 con 0x5649ed42e800
2021-04-15T16:51:03.448+0000 7faa9e568700  0 mds.4329.journaler.mdlog(ro) error getting journal off disk
2021-04-15T16:51:03.448+0000 7faa9dd67700  4 mds.3.log Journal 0x203 recovered.
2021-04-15T16:51:03.448+0000 7faa9dd67700 -1 log_channel(cluster) log [ERR] : Error recovering journal 0x203: (2) No such file or directory
2021-04-15T16:51:03.448+0000 7faa9dd67700  5 mds.beacon.j set_want_state: up:standby-replay -> down:damaged
2021-04-15T16:51:03.448+0000 7faa9dd67700  1 -- [v2:172.21.15.39:6838/1945817279,v1:172.21.15.39:6839/1945817279] --> [v2:172.21.15.39:3300/0,v1:172.21.15.39:6789/0] -- log(1 entries from seq 1 at 2021-04-15T16:51:03.449291+0000) v1 -- 0x5649ed368000 con 0x5649ed42e800
2021-04-15T16:51:03.448+0000 7faa9dd67700  5 mds.beacon.j Sending beacon down:damaged seq 5
2021-04-15T16:51:03.448+0000 7faa9dd67700  1 -- [v2:172.21.15.39:6838/1945817279,v1:172.21.15.39:6839/1945817279] --> [v2:172.21.15.39:3300/0,v1:172.21.15.39:6789/0] -- mdsbeacon(4329/j down:damaged seq=5 v16) v8 -- 0x5649ee1fe000 con 0x5649ed42e800
2021-04-15T16:51:03.448+0000 7faa9dd67700 20 mds.beacon.j send_and_wait: awaiting 5 for up to 5s
2021-04-15T16:51:03.448+0000 7faaa7d7b700  1 -- [v2:172.21.15.39:6838/1945817279,v1:172.21.15.39:6839/1945817279] <== mon.0 v2:172.21.15.39:3300/0 20 ==== mdsbeacon(4329/j down:damaged seq=5 v16) v8 ==== 130+0+0 (secure 0 0 0) 0x5649ee1fe000 con 0x5649ed42e800
2021-04-15T16:51:03.448+0000 7faaa7d7b700  5 mds.beacon.j received beacon reply down:damaged seq 5 rtt 0
2021-04-15T16:51:03.448+0000 7faa9dd67700  1 mds.j respawn!

When this standby-replay mds daemon was doing the journal recovery it tried to read the journal head object 0x203.00000000. That object is located in pg 3.7 [3, 11], but during that time there was no read op request for this object in either OSD.3's or OSD.11's log files.

I also checked all the other OSDs' log files and did not find any record of this op request during that time either.

#4

Updated by Xiubo Li about 3 years ago

The cephfs_metadata pool was created at osdmap v22:

2021-04-15T16:50:55.422+0000 7f79fc9d0700 10 mon.a@0(leader) e1 handle_subscribe mon_subscribe({osdmap=0}) v3
2021-04-15T16:50:55.422+0000 7f79fc9d0700 10 mon.a@0(leader).osd e22 check_osdmap_sub 0x562bd872d020 next 0 (onetime)
2021-04-15T16:50:55.585+0000 7f79fc9d0700 10 mon.a@0(leader).paxosservice(osdmap 1..22) dispatch 0x562bd89a5c80 mon_command({"prefix": "osd pool create", "pool": "cephfs_metadata", "pg_num": 8} v 0) v1 from client.? 172.21.15.39:0/1177237822 con 0x562bd822d800
2021-04-15T16:50:55.617+0000 7f79fc9d0700 10 mon.a@0(leader).paxosservice(osdmap 1..22)  setting proposal_timer 0x562bd8974f60 with delay of 0.533051
2021-04-15T16:50:56.150+0000 7f79ff1d5700 10 mon.a@0(leader).paxosservice(osdmap 1..22) propose_pending
2021-04-15T16:50:56.150+0000 7f79ff1d5700  1 mon.a@0(leader).osd e22 do_prune osdmap full prune enabled
2021-04-15T16:50:56.150+0000 7f79ff1d5700 10 mon.a@0(leader).osd e22 should_prune currently holding only 21 epochs (min osdmap epochs: 500); do not prune.
2021-04-15T16:50:56.152+0000 7f79fa1cb700 10 mon.a@0(leader).paxosservice(osdmap 1..22) refresh
2021-04-15T16:50:56.154+0000 7f79fa1cb700 10 mon.a@0(leader).paxosservice(osdmap 1..22) post_refresh
2021-04-15T16:50:56.156+0000 7f79fa1cb700 10 mon.a@0(leader).paxosservice(osdmap 1..23) refresh

And the standby-replay mds.j had already obtained osdmap version 26:

2021-04-15T16:51:03.446+0000 7faa9dd67700  1 -- [v2:172.21.15.39:6838/1945817279,v1:172.21.15.39:6839/1945817279] --> [v2:172.21.15.39:3300/0,v1:172.21.15.39:6789/0] -- mon_get_version(what=osdmap handle=4) v1 -- 0x5649ed383760 con 0x5649ed42e800
2021-04-15T16:51:03.446+0000 7faa9dd67700  4 mds.3.log Waiting for journal 0x203 to recover...
2021-04-15T16:51:03.448+0000 7faaa4574700  1 -- [v2:172.21.15.39:6838/1945817279,v1:172.21.15.39:6839/1945817279] <== mon.0 v2:172.21.15.39:3300/0 19 ==== mon_get_version_reply(handle=4 version=26) v2 ==== 24+0+0 (secure 0 0 0) 0x5649ed383760 con 0x5649ed42e800
2021-04-15T16:51:03.448+0000 7faa9e568700  0 mds.4329.journaler.mdlog(ro) error getting journal off disk
#5

Updated by Xiubo Li about 3 years ago

I checked all the relevant osd/mon/mds logs and the related code, and compared them with logs from a normal run; the sequences are almost the same. I have not yet found any log entry or code path that looks suspicious enough to explain why the pool could not be found in the osdmap.

Just before journal->recover(), which will try to read the 0x203.00000000 object from pg 3.7 [3,11], the object 0x403.00000000 was successfully read, also from the same pg 3.7.

2021-04-15T16:51:03.441+0000 7faaa4574700  1 -- [v2:172.21.15.39:6838/1945817279,v1:172.21.15.39:6839/1945817279] --> [v2:172.21.15.39:6800/37588,v1:172.21.15.39:6801/37588] -- osd_op(unknown.0.0:3 3.7 3:e4812e9a:::403.00000000:head [read 0~0] snapc 0=[] ondisk+read+known_if_redirected+full_force e26) v8 -- 0x5649ee1e4c00 con 0x5649ee1e4800
...
2021-04-15T16:51:03.445+0000 7faaa757a700  1 -- [v2:172.21.15.39:6838/1945817279,v1:172.21.15.39:6839/1945817279] <== osd.3 v2:172.21.15.39:6800/37588 1 ==== osd_op_reply(3 403.00000000 [read 0~22 out=22b] v0'0 uv20 ondisk = 0) v8 ==== 156+0+22 (crc 0 0 0) 0x5649ee026480 con 0x5649ee1e4800

A read tier was not enabled either; if one had been configured, reading the journal through the tier pool might have caused the pool-dne issue here:

"tier_of":-1,"read_tier":-1,"write_tier":-1,

Maybe we should enable 'debug objecter = 20/20' and try to reproduce it. I have tried locally but have not reproduced it yet.
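
For reference, a minimal ceph.conf override for such a reproduction attempt might look like the following (shown only as a reproduction aid; the exact log levels are a suggestion):

[mds]
    debug objecter = 20/20
    debug ms = 1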

#6

Updated by Xiubo Li about 3 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 41254

There is one rare case: when the mds daemon receives a new mdsmap, the in-memory map is reset before being decoded, so the metadata_pool is temporarily -1 while the decode is in progress. If another thread tries to get the metadata pool from the mdsmap during that window, it will get an invalid pool id.

#7

Updated by Xiubo Li about 3 years ago

More detail:

 654 void MDSDaemon::handle_mds_map(const cref_t<MMDSMap> &m)
 655 { 
 656   version_t epoch = m->get_epoch();
 657 
 658   // is it new?
 659   if (epoch <= mdsmap->get_epoch()) { 
 660     dout(5) << "handle_mds_map old map epoch " << epoch << " <= " 
 661             << mdsmap->get_epoch() << ", discarding" << dendl;
 662     return;
 663   }
 664   
 665   dout(1) << "Updating MDS map to version " << epoch << " from " << m->get_source() << dendl;
 666   
 667   // keep old map, for a moment
 668   std::unique_ptr<MDSMap> oldmap;
 669   oldmap.swap(mdsmap);
 670   
 671   // decode and process
 672   mdsmap.reset(new MDSMap);
 673   mdsmap->decode(m->get_encoded());
 674  

If the journal recovery thread tries to get the metadata pool from the mdsmap just after Line#672 has run and before Line#673 has finished, it will get -1. The metadata pool itself never changes, but mdsmap.reset(new MDSMap) leaves the map temporarily empty.

NOTE: the journal recovery thread mostly does not hold the mds_lock.
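
A minimal sketch of one way to close that window, using simplified stand-in types (ToyMDSMap, handle_new_map and get_metadata_pool are illustrative names, not the actual change in the pull request): decode the incoming map into a private object first and only publish it once decoding has finished, so no reader can ever observe a half-initialized map.

#include <cstdint>
#include <memory>
#include <mutex>
#include <vector>

// Simplified stand-in for MDSMap; illustrative only.
struct ToyMDSMap {
  int64_t metadata_pool = -1;                  // default until decode() runs
  void decode(const std::vector<uint8_t>&) { metadata_pool = 3; }
};

std::unique_ptr<ToyMDSMap> mdsmap = std::make_unique<ToyMDSMap>();
std::mutex mds_lock;

void handle_new_map(const std::vector<uint8_t>& encoded) {
  // Decode into a private map first; nothing can observe it yet, so there is
  // no window in which metadata_pool reads back as -1.
  auto newmap = std::make_unique<ToyMDSMap>();
  newmap->decode(encoded);

  std::lock_guard<std::mutex> l(mds_lock);
  mdsmap.swap(newmap);                         // publish only the fully decoded map
  // 'newmap' now holds the old map, if it is still needed for comparison.
}

int64_t get_metadata_pool() {
  std::lock_guard<std::mutex> l(mds_lock);
  return mdsmap->metadata_pool;                // never the transient -1 of a half-built map
}

int main() {
  handle_new_map({});                          // empty "encoded" bytes; toy decode ignores them
  return get_metadata_pool() == 3 ? 0 : 1;
}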

#8

Updated by Patrick Donnelly about 3 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to pacific,octopus
#9

Updated by Backport Bot about 3 years ago

  • Copied to Backport #50848: pacific: mds: "cluster [ERR] Error recovering journal 0x203: (2) No such file or directory" in cluster log" added
#10

Updated by Backport Bot about 3 years ago

  • Copied to Backport #50849: octopus: mds: "cluster [ERR] Error recovering journal 0x203: (2) No such file or directory" in cluster log" added
#11

Updated by Backport Bot almost 2 years ago

  • Tags set to backport_processed
#12

Updated by Konstantin Shalygin about 1 year ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100
