Discussion:
kdb5_util-1.15.1: Invalid argument while making newly loaded database live
rachit chokshi
2024-03-04 09:23:39 UTC
Hello,
We have a setup where the kerberos database (db2) is hosted on an NFS
server. There are multiple KDC servers each mounting the NFS share and
serving traffic.

To replicate data into the NFS-hosted database from an external master
KDC, we have a sync job that runs "kdb5_util load" against the NFS-hosted
database every few minutes (~5m).

Approximately once a month, the database becomes corrupted and
"kdb5_util load" starts failing with the errors below:
kdb5_util: Cannot open DB2 database
'/var/kerberos/krb5kdc_shared/principal': Invalid argument while making
newly loaded database live
kdb5_util: Cannot open DB2 database
'/var/kerberos/krb5kdc_shared/principal~': Invalid argument while deleting
bad database /var/kerberos/krb5kdc_shared/principal

Once the system enters this state, there is a complete outage: the
running KDC processes can no longer access the database (Cannot open DB2
database). The only way to recover is to delete the database and create a
new one from the dump.
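
Since the failure shows up as a recognizable error string, a sync job could check the stderr of "kdb5_util load" for it and raise an alert before the outage is noticed some other way. The following is only a minimal sketch: the pattern is derived from the error text quoted above and may not match other krb5 releases, and the function name corrupted_paths is hypothetical.

```python
import re
from typing import List

# Pattern derived from the error strings quoted in this thread; the exact
# wording is an assumption and may differ between krb5 releases.
CORRUPTION_RE = re.compile(
    r"Cannot open DB2 database\s+'(?P<path>[^']+)':\s+Invalid argument"
)


def corrupted_paths(stderr_text: str) -> List[str]:
    """Return the database paths that kdb5_util reported it could not open."""
    # Collapse line wraps so a message split across lines still matches.
    flat = " ".join(stderr_text.split())
    paths: List[str] = []
    for match in CORRUPTION_RE.finditer(flat):
        path = match.group("path")
        if path not in paths:  # keep first-seen order, drop repeats
            paths.append(path)
    return paths
```

A non-empty result would be the cue to stop further loads and start the rebuild from the dump described above.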


It would be a great help if anybody could help us understand where things
are going wrong and what can be done to avoid this situation. We tried
going through the code but have found no pointers so far.

Thank you,
Rachit
Ken Hornstein
2024-03-04 15:55:53 UTC
Post by rachit chokshi
We have a setup where the kerberos database (db2) is hosted on an NFS
server. There are multiple KDC servers each mounting the NFS share and
serving traffic.
I have to say up front that it is generally agreed that putting any database
file on an NFS filesystem is a bad idea. Also, it kind of sounds like
your multiple KDCs are serving the SAME database file? If so, THAT is
a huge problem!
Post by rachit chokshi
kdb5_util: Cannot open DB2 database
'/var/kerberos/krb5kdc_shared/principal~': Invalid argument while deleting
bad database /var/kerberos/krb5kdc_shared/principal
I am looking at newer Kerberos code, so perhaps this has changed, but
that error comes from krb5_db_destroy() failing. For DB2, that ends
up calling krb5_db2_destroy(). That function does a lot of things,
and it's hard at a glance to figure out which part of it is failing; I
suspect the only way to figure out what is going wrong there is to build
a version of Kerberos with full debugging symbols and set a breakpoint
on krb5_db2_destroy(). I have a strong suspicion that the database file
is getting corrupted in such a way that the other routines cannot
recover, and that's likely due to the use of NFS (especially if multiple
KDCs are using the same database file).
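
As a rough illustration of the breakpoint approach, the sketch below assembles a gdb command line that runs "kdb5_util load" and stops in krb5_db2_destroy(). It assumes a krb5 build with debugging symbols is already installed; the helper name gdb_argv and the dump path are hypothetical, and the database path is the one from the error message. The breakpoint is marked pending on the assumption that the DB2 back end is loaded as a module after startup, so the symbol may not be resolvable when gdb starts.

```python
import shlex
from typing import List


def gdb_argv(
    dump_file: str,
    db_path: str = "/var/kerberos/krb5kdc_shared/principal",
) -> List[str]:
    """Build a gdb invocation that breaks in krb5_db2_destroy() during a load."""
    return [
        "gdb",
        # Allow a breakpoint on a symbol that is not loaded yet.
        "-ex", "set breakpoint pending on",
        "-ex", "break krb5_db2_destroy",
        "-ex", "run",
        # Everything after --args is the program under test and its arguments.
        "--args", "kdb5_util", "-d", db_path, "load", dump_file,
    ]


if __name__ == "__main__":
    # Hypothetical dump path; print the command rather than running it.
    print(shlex.join(gdb_argv("/tmp/kdc.dump")))
```

It is probably safer to point db_path at a scratch copy of the broken database than at the live file.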

--Ken
Brent Kimberley
2024-03-04 17:01:05 UTC
A message queue is typically a better way to synchronize a cluster.
The bonus is that you can track adds, deletes, and modifies via historian.
Anchors in Relative Time!?

________________________________________________
Kerberos mailing list ***@mit.edu
https://mailman.mit.edu/mailman/listinfo/kerberos