Bug #65494

open

ceph-mgr critical error: "Module 'devicehealth' has failed: table Device already exists"

Added by Nir Soffer about 1 month ago. Updated 6 days ago.

Status: Pending Backport
Priority: Normal
Target version:
% Done: 0%
Source: Community (dev)
Tags: backport_processed
Backport: squid,reef,quincy
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We see a random error (about 1 in 200 deploys): after successfully creating a rook
cephcluster and cephblockpool, configuring rbd mirroring, and adding a
cephrbdmirror, the cephrbdmirror never becomes ready (we waited a few hours).

The output of ceph status shows:

  cluster:
    id:     dbf6c8b8-dd8b-4117-933e-93778b1a7274
    health: HEALTH_ERR
            Module 'devicehealth' has failed: table Device already exists

In the rook-ceph-mgr-a pod logs we see:

debug 2024-04-09T13:05:48.947+0000 7f0632607700 -1 Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 524, in check
    return func(self, *args, **kwargs)
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 350, in _do_serve
    if self.db_ready() and self.enable_monitoring:
  File "/usr/share/ceph/mgr/mgr_module.py", line 1271, in db_ready
    return self.db is not None
  File "/usr/share/ceph/mgr/mgr_module.py", line 1283, in db
    self._db = self.open_db()
  File "/usr/share/ceph/mgr/mgr_module.py", line 1264, in open_db
    self.configure_db(db)
  File "/usr/share/ceph/mgr/mgr_module.py", line 1241, in configure_db
    self.load_schema(db)
  File "/usr/share/ceph/mgr/mgr_module.py", line 1230, in load_schema
    self.maybe_upgrade(db, int(row['value']))
  File "/usr/share/ceph/mgr/mgr_module.py", line 1207, in maybe_upgrade
    db.executescript(self.SCHEMA)
sqlite3.OperationalError: table Device already exists

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/devicehealth/module.py", line 394, in serve
    self._do_serve()
  File "/usr/share/ceph/mgr/mgr_module.py", line 532, in check
    self.open_db();
  File "/usr/share/ceph/mgr/mgr_module.py", line 1264, in open_db
    self.configure_db(db)
  File "/usr/share/ceph/mgr/mgr_module.py", line 1241, in configure_db
    self.load_schema(db)
  File "/usr/share/ceph/mgr/mgr_module.py", line 1230, in load_schema
    self.maybe_upgrade(db, int(row['value']))
  File "/usr/share/ceph/mgr/mgr_module.py", line 1207, in maybe_upgrade
    db.executescript(self.SCHEMA)
sqlite3.OperationalError: table Device already exists
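
For reference, the same sqlite3 message can be reproduced outside Ceph by running a CREATE TABLE script twice against one database. This is only a minimal sketch of how the error text arises, not a diagnosis of this bug: the SCHEMA below is a hypothetical stand-in (the real devicehealth schema is not shown in this report), and load_schema_twice is an illustrative helper, not a function from mgr_module.py.

```python
import sqlite3

# Hypothetical stand-in schema; the actual devicehealth SCHEMA is an assumption here.
SCHEMA = """
CREATE TABLE Device (
    devid TEXT PRIMARY KEY
);
"""


def load_schema_twice() -> str:
    """Run the schema script twice and return the error message from the second run."""
    db = sqlite3.connect(":memory:")
    try:
        db.executescript(SCHEMA)  # first run creates the table
        try:
            db.executescript(SCHEMA)  # second run hits the existing table
        except sqlite3.OperationalError as e:
            return str(e)
        return ""
    finally:
        db.close()


if __name__ == "__main__":
    print(load_schema_twice())
```

Under these assumptions the second executescript call raises sqlite3.OperationalError, the same exception type that maybe_upgrade raises in the traceback above.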

Restarting the ceph-mgr pod does not help. rbd-mirroring is broken and
we don't have any workaround.

For testing ramen this is not that bad, since we can delete the environment and
recreate it in 10 minutes, but for a real deployment this looks bad.

See also

- Upstream issue: https://github.com/RamenDR/ramen/issues/1298


Files

rook-ceph-mgr-a-7868f9cbdd-2cn62.log.gz (87.2 KB): rook-ceph-mgr-a-7868f9cbdd-2cn62 log, Nir Soffer, 04/15/2024 07:13 PM

Related issues 3 (3 open, 0 closed)

Copied to cephsqlite - Backport #65730: squid: ceph-mgr critical error: "Module 'devicehealth' has failed: table Device already exists" (In Progress, Patrick Donnelly)
Copied to cephsqlite - Backport #65731: reef: ceph-mgr critical error: "Module 'devicehealth' has failed: table Device already exists" (In Progress, Patrick Donnelly)
Copied to cephsqlite - Backport #65736: quincy: ceph-mgr critical error: "Module 'devicehealth' has failed: table Device already exists" (In Progress, Patrick Donnelly)