A third party single sign-on application was installed on two nodes of db2 cluster having 2 CF and 4 members. The pureScale cluster started showing signs of trouble and naively, we did not think the impact of the software installed.
This single sign-on software (as we discovered much later) changed the gid and uid of the DB2 instance, fenced user to their own uid and gid for a single sign-on process.
So out of 6 nodes, the uid and gid were changed to a different number on 2 nodes by this software and we had no clue initially as that had happened.
We started looking things at wrong places and this led to more issues and cluster wide issues. In other words, we made things worst by deleting the TSA resources.
This also gave us the opportunity to document the recovery procedure as how to recover from this one of the rare situation.
Please note:
The uid and gid of all db2 related users should match on all machines and every DB2 DBA knows this by heart.
The second thing (we failed to notice early on) was the time difference on all 6 machines and that also led insult to injury. So make sure that ntp is running and the time is synchronized on all machines.
The procedure of successful recovery is listed here for my own future reference.
Note: Commands prefixed with # are run as root and $ as instance owner.
But, first symptoms.
- db2start does not work.
- lssam output shows resources offline or unable to monitor
- The cluster is dead after reboot on all machines.
How recovery was done:
- Make sure that /etc/hosts on each host has identical entries for all pureScale hosts.
- Make sure that nslookup <shortname> returns the same IP address which is in /etc/hosts for all pureScale hosts.
- Make sure that password less SSH worked from each machine to each machine for root and instance owner.
- Test ssh connection from each host to each host for 4 conditions to avoid any issue. Test ssh over short name, FQDN, IP address and the local host.
- For example, the host names are c0, c1, m0, m1, m2 and m3 (c0 and c1 are CFs and rests are all members.) The sample script is here.
File: ip.txt c0 c0.foo.com 10.10.10.10 c1 c1.foo.com 10.10.10.20 m0 m0.foo.com 10.10.10.30 m1 m1.foo.com 10.10.10.40 m2 m2.foo.com 10.10.10.50 m3 m3.foo.com 10.10.10.60
Script sshtest.sh #!/bin/bash if [ $# -lt 1 ] ; then echo "Usage: $0 command" exit 1 fi IPFILE=ip.txt CMD="$@" i=0 while read a b c do shortname[$i]=$a longname[$i]=$b ipaddr[$i]=$c ((i++)) done < $IPFILE for ((i = 0; i < ${#shortname[@]}; ++i)); do echo "Running $CMD in ${shortname[$i]}" ssh ${shortname[$i]} $CMD echo "Running $CMD in ${longname[$i]}" ssh ${longname[$i]} $CMD echo "Running $CMD in ${ipaddr[$i]}" ssh ${ipaddr[$i]} $CMD done echo "Running $CMD in localhost" ssh localhost $CMD
So, run above script from every host to make 100% sure that SSH communication is without any fail for root as well as db2 instance owner.
$ ./sshtest date
Create modified script that runs command on all hosts.
Script runall.sh #!/bin/bash if [ $# -lt 1 ] ; then echo "Usage: $0 command" exit 1 fi CMD="$@" for h in c0 c1 m0 m1 m2 m3 do echo "Running $CMD in ${h}" ssh $h $CMD done
Drop / Destroy TSA Resources. [Note: Before destroying TSA domain, note the domain name and the Tie Breaker information.]
$ lsrpdomain --> Find out name of the TSA domain name and save the domain name. $ db2cluster -cm -list -tiebreaker --> Note tie breaker information
# export CT_MANAGEMENT_SCOPE=2 # rmrpdomain -f <domainname>
Wait for 1-2 minute and run
# ./runall.sh lsrpdomain
–> It should show that there is no domain
# ./runall.sh lssam
–> It should show that there is no resources defined.
Assume that our shared instance mount name is /db2sd and the DB2 instance name is db2psc
Save file /db2sd/db2psc/sqllib_shared/cfg/db2instanceinfo.reg and /db2sd/db2psc/sqllib_shared/cfg/db2instanceinfo.reg.2 to a safe location and delete these files from the original location.
[This is an extreme case and you should never have a need to do this. We were adventurous so we did it and our resource creation was failing and we did not know at that time that the uid and gid were changed. So we were looking at wrong places to fix things.]
Save the output from
$ db2hareg -dump The example output:
A05000500000000,RU,32872,32736 A05000500000000,IN,100,2,0,1 A05000500000000,DN,m0.foo.com,,m0-r1,m0-r2 A05000500000000,DN,m1.foo.com,,m1-r1,m1-r2 A05000500000000,DN,m2.foo.com,,m2-r1,m2-r2 A05000500000000,DN,m3.foo.com,,m3-r1,m3-r2 A05000500000000,MO,/db2sd, ,0,8,0 A05000500000000,NL,128,c1.foo.com,0,c1-r1,c1-r2,-,CF A05000500000000,DN,c1.foo.com,,c1-r1,c1-r2 A05000500000000,NL,129,c0.foo.com,0,c0-r1,c0-r2,-,CF A05000500000000,DN,c0.foo.com,,c0-r1,c0-r2 A05000500000000,NL,0,m0.foo.com,0,m0-r1,m0-r2,-,MEMBER A05000500000000,NL,1,m1.foo.com,0,m1-r1,m1-r2,-,MEMBER A05000500000000,NL,2,m2.foo.com,0,m2-r1,m2-r2,-,MEMBER A05000500000000,NL,3,m3.foo.com,0,m3-r1,m3-r2,-,MEMBER A05000500000000,DB,TESTDB,1 A05000500000000,MO,/u00,TESTDB,0,12,0 A05000500000000,MO,/u01,TESTDB,0,1,0 A05000500000000,MO,/u02,TESTDB,0,1,0 A05000500000000,MO,/u03,TESTDB,0,1,0 A05000500000000,MO,/u04,TESTDB,0,1,0
Some explanation of the above output. A050005 is 10.5.005, which is DB2 release number shown in hex.
The RU record is for the rolling upgrade and this is created during the instance creation process. The IN record is the instance record. The MO record is the mount point.The DN, NL DB are for database nodes and database records.
Since we deleted db2instanceinfo.reg, all above entries were gone.
Run preprpnode command to exchange keys.
# ./runall.sh preprpnode c0 c1 m0 m1 m2 m3
From m0 host, create TSA domain and add all hosts one by one to the domain.
# db2cluster -cm -create -host m0 -domain <domainname>
Use same domain name that was found using lsrpdomain name. And add all other 5 hosts.
# db2clutser -cm -add -host m1 # db2clutser -cm -add -host m2 # db2clutser -cm -add -host m3 # db2clutser -cm -add -host c0 # db2clutser -cm -add -host c1
So, TSA domain is created. Check the domain is online on all nodes and the all nodes are online on all nodes.
# ./runall.sh lsrpdoamin # ./runall.sh lsrpnode
Then, create TSA resources.
# db2cluster -cm -create -resources
In our case, the above failed but when we deleted db2instnaceinfo.reg then it succeeded but we ran into other problem.
Run
# db2hareg -dump
The output is as shown below:
A05000500000000,IN,100,2,0,1 A05000500000000,DN,m0,,m0-r1,m0-r2 A05000500000000,DN,m1,,m1-r1,m1-r2 A05000500000000,DN,m2,,m2-r1,m2-r2 A05000500000000,DN,m3,,m3-r1,m3-r2 A05000500000000,NL,128,c1.foo.com,0,c1-r1,c1-r2,-,CF A05000500000000,NL,129,c0.foo.com,0,c0-r1,c0-r2,-,CF A05000500000000,NL,0,m0.foo.com,0,m0-r1,m0-r2,-,MEMBER A05000500000000,NL,1,m1.foo.com,0,m1-r1,m1-r2,-,MEMBER A05000500000000,NL,2,m2.foo.com,0,m2-r1,m2-r2,-,MEMBER A05000500000000,NL,3,m3.foo.com,0,m3-r1,m3-r2,-,MEMBER A05000500000000,MO,/db2sd, ,0,14,0 A05000500000000,RU,32872,32736
When we compare with the original output, there is no RU record. The DB and all mounts points are missing. We will see in the next sections how to add these records.
We have to call DB2 support to get the service password to run db2 internal command db2ALmanager to create the RU record in db2instanceinfo.reg file.
# export DB2SVCPW='xxxx'
Run these commands so that RU record is created.
# db2ALmanager -invalidate 0 -online -archlevel 0x0A05000500000000 -sectionlevel 0x0A05000500000000 -codeLevel 0x0A05000500000000 -instance db2psc # db2ALmanager -update 0 -online -archlevel 0x0A05000500000000 -sectionlevel 0x0A05000500000000 -codeLevel 0x0A05000500000000 -instance db2psc # db2ALmanager -revalidate 0 -instance db2psc
Now, we ran the following db2hareg commands to create database and all mount points records. [db2hareg does not need service password.]
# db2hareg -add Mount path=/u00,databasename=TESTDB,partitionnum=0,usecount=12,noncriticalflag=0 # db2hareg -add Mount path=/u01,databasename=TESTDB,partitionnum=0,usecount=1,noncriticalflag=0 # db2hareg -add Mount path=/u02,databasename=TESTDB,partitionnum=0,usecount=1,noncriticalflag=0 # db2hareg -add Mount path=/u03,databasename=TESTDB,partitionnum=0,usecount=1,noncriticalflag=0 # db2hareg -add Mount path=/u04,databasename=TESTDB,partitionnum=0,usecount=1,noncriticalflag=0
Note: Whenever db2hareg or internal db2HAmanager or db2greg commands are run (metadata for cluster is changed), it is necessary to repair the resources so that TSA knows what it needs to monitor.
# db2cluster -cm -repair -resources
Now, start db2 instance one by one for each hosts. DB2 instance can be started from any host. For example:
$ db2start 128 $ db2start 129 $ db2start 0 $ db2start 1 $ db2start 2 $ db2start 3
In our case, our instance start failed even though the resource creation was successful. We have to change the diag level to 4 and then we found out that file permission problem for the files that were owned by the db2 instance. At this time, we checked the gid and uid of the db2 instance owner on machines where this 3rd party single sign-on software was installed. Then, we noticed that the uid and gid are different than what we used initially at the time of the software installation. All above – what we did was totally unnecessary and we were looking things in the wrong places.
So, the next thing was to change the uid and gid to the proper numbers that we had used in all other machines. That was done by using smitty but this process does not go and change the uid and gid of the files to the new (and correct) values.
So, this was the manual process to change the uid and gid for the files. This is a treacherous process and no one should attempt to do this on global basis such as chown -R user:group on the top level folder as it will mess things and there are soft links that need to be updated properly.
For example: The new uid and gid were changed to 59705 and 906 by this 3rd party single sign-on software. We need to revert these to correct values.
In the home directory of instance directory on each host where sqllib directory is there.
# find . -type l | xargs chown -h db2psc # find . -user 59705 -group 906 | xargs chown db2psc:db2iadm1 # find . -user 59705 | xargs chown db2psc # find . -group 906 | xargs chgrp db2iadm1
The first command will look for all soft links and update the owner to the db2psc. The second command will look for all files owned by the uid 59705 and gid 906 and change the file ownership to db2psc. Run find command and you should not see any number showing in the uid and gid fields when seeing the listing of all files.
Then we need to run above same commands in /db2sd/db2psc for sqllib_shared directory.
Our database mount points were /u00, /u01, /u02, /u03, and /u04. The active logs and archive logs were on /au, /bu. So, repeat same process on all mount points.
Since the uid and gid were messed up, we need to make sure that we did it right; So run db2val command on each host to make sure we did not miss anything.
$ db2val --> on all hosts.
Since we modified things, it is a good idea to again repair the resources.
# db2cluster -cm -repair -resources
Then start the instance.
If some hosts are still in WAITING_FOR_FAILBACK state, follow the steps:
$ ./runall.sh db2cluster -cm -clear -alerts -member 0 $ ./runall.sh db2cluster -cm -clear -alerts -member 1 $ ./runall.sh db2cluster -cm -clear -alerts -member 2 $ ./runall.sh db2cluster -cm -clear -alerts -member 3
Then, wait for 1-2 minutes and the hosts should integrate automatically in the domain.
It is not enough to run above on one hosts only as the alerts scope is global (under sqllib_shared) and local (under sqllib).
Lessons learned: Ask if installing any software as root, if it is going to change uid and gid. As was our case, and we realized this much later in the process. The other lesson is to make sure that the time is synchronized on all hosts.
After instance is started successfully and the database is activated, we need to make sure that our lssam output matches with the original lssam output.
The output of db2hareg should also match.
Set the Tie Breaker back to the original disk using db2cluster -cm -set -tiebreaker command.