When you see errors like the following:

"SQL1517N db2start failed because the cluster manager resource states are inconsistent." or something like "Cluster manager resource states for the DB2 instance are inconsistent. Refer to the db2diag.log for more information on inconsistencies."

Or, when you try to repair resources using the db2cluster -cm -repair -resources command, you may receive an error stating that the resource repair failed: "Refer to the db2cluster command log file. This log file can be found in /tmp/ibm.db2.cluster.*."

Then you may need to repair or rebuild the TSA resources.

First, follow this procedure to save the basic information that we will need later to rebuild the resources (a combined sketch follows the list).

  1. Save the output of db2hareg -dump and db2greg -dump, run as the DB2 instance owner, in a file or notepad.
  2. Save the contents of ~/sqllib/db2nodes.cfg, if you are able to access GPFS. If not, don't worry about it now.
  3. Save the output of lsrpdomain as root to note the name of the RSCT domain. We want to keep the same name when we rebuild the resources.
  4. Save the output of db2cluster -cm -list -hostfailuredetectiontime as root to record the current host failure detection time set for your cluster. If you do not get any output, or if the command takes a very long time, we will set it to a default value of 4.
  5. Save the output of lscomg as root.
  6. Save the output of db2cluster -cm -list -tiebreaker to record the tie breaker. If on AIX, note the PVID value of the disk.
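
As a convenience, here is a minimal sketch that captures all of the above into files under /tmp. The file names are only examples; the $ commands run as the DB2 instance owner and the # commands run as root.

$ db2hareg -dump > /tmp/db2hareg.dump
$ db2greg -dump > /tmp/db2greg.dump
$ cp ~/sqllib/db2nodes.cfg /tmp/db2nodes.cfg.saved
# lsrpdomain > /tmp/lsrpdomain.out
# db2cluster -cm -list -hostfailuredetectiontime > /tmp/hfdt.out
# lscomg > /tmp/lscomg.out
# db2cluster -cm -list -tiebreaker > /tmp/tiebreaker.out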

Destroy the RSCT domain – Procedure 1

# export CT_MANAGEMENT_SCOPE=2
# rmrpdomain -f <domainName>

After the above runs successfully, run lsrpdomain as root on each host; you should see no output, which means the domain was destroyed successfully. If the rmrpdomain command succeeded, skip the next step.
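
To check all hosts at once, a small loop like the one below can help. It assumes passwordless root ssh between the hosts; node01 through node03 are placeholder hostnames, so adjust for your cluster.

# for h in node01 node02 node03; do echo "== $h =="; ssh $h lsrpdomain; done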

If rmrpdomain takes a very long time, does not complete, or simply hangs, use the following alternate procedure to destroy the domain by resetting RSCT completely.

Destroy the RSCT domain – Procedure 2

On each host, as root, run the following commands to back up netmon.cf and trace.conf:

# cp /var/ct/cfg/netmon.cf /tmp/netmon.cf.`hostname -s`
# cp /var/ct/cfg/trace.conf /tmp/trace.conf.`hostname -s`

Reset the RSCT domain by running recfgct on all hosts.

# /usr/sbin/rsct/install/bin/recfgct

After the RSCT domain is reset, restore netmon.cf and trace.conf:

# cp /tmp/netmon.cf.`hostname -s` /var/ct/cfg/netmon.cf
# cp /tmp/trace.conf.`hostname -s` /var/ct/cfg/trace.conf 
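
If you prefer to drive Procedure 2 from a single host, here is a hedged one-liner that backs up the files, runs recfgct, and restores the files on each host. It assumes passwordless root ssh and the same placeholder hostnames; treat it as a sketch, not a drop-in script.

# for h in node01 node02 node03; do ssh $h 'cp /var/ct/cfg/netmon.cf /tmp/netmon.cf.$(hostname -s); cp /var/ct/cfg/trace.conf /tmp/trace.conf.$(hostname -s); /usr/sbin/rsct/install/bin/recfgct; cp /tmp/netmon.cf.$(hostname -s) /var/ct/cfg/netmon.cf; cp /tmp/trace.conf.$(hostname -s) /var/ct/cfg/trace.conf'; done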

Reboot all hosts. Please do not skip this step.

Check SAM license

Run the samlicm -t command on each host to make sure that the SAM (Tivoli SA MP) license is applied successfully.

# samlicm -t
# echo $?

The output of echo $? should be zero. If the output is 1, reapply the SAM license using samlicm -i <lic file>.
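
To test the license on every host in one pass (again assuming passwordless root ssh and placeholder hostnames):

# for h in node01 node02 node03; do ssh $h 'samlicm -t; echo "$(hostname -s): rc=$?"'; done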

Exchange Keys

Run the preprpnode command on all hosts to exchange keys.

# preprpnode <host1> <host2> <host3> <host4> ... <hostn>

If you have three hosts named node01, node02, and node03, run the preprpnode node01 node02 node03 command on all hosts so that the keys are exchanged.
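
With passwordless root ssh between the hosts, you can push the same command from one place; node01 through node03 are placeholders for your hostnames.

# for h in node01 node02 node03; do ssh $h preprpnode node01 node02 node03; done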

Create RSCT Domain and add hosts

Go to your DB2 software bin directory and run the db2cluster command to create the RSCT domain. Use the same domain name that you saved from the lsrpdomain output.

# cd /opt/IBM/db2/V11.1/bin
# ./db2cluster -cm -create -host <firsthostName> -domain <domainname>
# ./db2cluster -cm -add -host <secondhostname>

Repeat the -add command above for all remaining hosts (a loop sketch follows).
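
For the example three-host cluster, the create-and-add sequence would look like the sketch below; db2domain is a placeholder, so use the domain name you saved from lsrpdomain.

# ./db2cluster -cm -create -host node01 -domain db2domain
# for h in node02 node03; do ./db2cluster -cm -add -host $h; done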

Stop GPFS domain to set host failure detection time

# ./db2cluster -cfs -stop -all
# ./db2cluster -cm -set -option hostfailuredetectiontime -value 4
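
You can confirm the new value with the same list option you used when saving the configuration.

# ./db2cluster -cm -list -hostfailuredetectiontime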

Set Tie Breaker Disk

Note the name of the tie-breaker disk from the output you saved earlier. On AIX, get the PVID of the tie-breaker disk.

# ./db2cluster -cm -set -tiebreaker -disk PVID=<pvid>
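
On AIX, the lspv command lists each disk along with its PVID, which is where the <pvid> value comes from. After setting the tie breaker, you can confirm it with the list option.

# lspv
# ./db2cluster -cm -list -tiebreaker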

Start GPFS domain and fix network

# ./db2cluster -cfs -start -all
# ./db2cluster -cfs -repair -network_resiliency -all

Rebuild Resources

Log in as the DB2 instance owner and run the repair command.

$ db2cluster -cm -repair -resources
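
Once the repair completes, you can inspect the rebuilt TSA resource groups as root with the SA MP lssam command; the exact output format varies by version.

# lssam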

Do a sanity check

After the resources are rebuilt successfully, do the following checks as root (a combined loop follows the list).

  1. Check the lsrpdomain output on all hosts; the domain should show as Online on every host.
  2. Check the lsrpnode output on all hosts; every host should show as Online.
  3. Check /usr/lpp/mmfs/bin/mmgetstate -a on all hosts; GPFS should show as active, not arbitrating, on every host.
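
A quick way to run all three checks across the cluster from one host (passwordless root ssh and placeholder hostnames assumed):

# for h in node01 node02 node03; do echo "== $h =="; ssh $h 'lsrpdomain; lsrpnode; /usr/lpp/mmfs/bin/mmgetstate -a'; done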

Then, as the DB2 instance owner, check the following.

  1. Check the contents of ~/sqllib/db2nodes.cfg; the entries should be correct. If a member had failed over to another host at the time of the cluster failure, fix that member's line so that it points back to its home host. For example:
0 node01 1 node02-r1,node02-r2 - MEMBER
1 node02 0 node02-r1,node02-r2 - MEMBER
2 node03 0 node03-r1,node03-r2 - MEMBER
128 node04 0 node04-r1,node04-r2 - CF
129 node05 0 node05-r1,node05-r2 - CF

In the above output, member 0 had failed over (its entry points at node02's netnames with logical port 1), and we need to fix this line since we rebuilt the resources from scratch. The correct db2nodes.cfg after the fix is as follows.

0 node01 0 node01-r1,node01-r2 - MEMBER
1 node02 0 node02-r1,node02-r2 - MEMBER
2 node03 0 node03-r1,node03-r2 - MEMBER
128 node04 0 node04-r1,node04-r2 - CF
129 node05 0 node05-r1,node05-r2 - CF

After db2nodes.cfg is corrected, run the db2start command to start the instance on all hosts.
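
After db2start completes, you can verify as the instance owner that every member and CF is started on its home host; if a member still shows on the wrong host, revisit the db2nodes.cfg entries above.

$ db2instance -list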