In DB2 pureScale, if a host goes down, the member that was running on it is restarted on another host in restart light mode, just enough to perform member crash recovery. When the home host becomes available again, the member should fail back to it seamlessly.
For example:
This is sample output from the db2instance -list command, showing a cluster where everything is working normally.
db2psc@node02:~> db2instance -list
ID  TYPE    STATE    HOME_HOST  CURRENT_HOST  ALERT
--  ----    -----    ---------  ------------  -----
0   MEMBER  STARTED  node02     node02        NO
1   MEMBER  STARTED  node03     node03        NO
2   MEMBER  STARTED  node04     node04        NO
128 CF      PRIMARY  node02     node02        NO
129 CF      PEER     node03     node03        NO

HOSTNAME  STATE   INSTANCE_STOPPED  ALERT
--------  -----   ----------------  -----
node04    ACTIVE  NO                NO
node03    ACTIVE  NO                NO
node02    ACTIVE  NO                NO
Now I rebooted node04, and this sample output from the db2instance -list command shows that member 2, whose home host is node04, has been restarted on node03 (look at the CURRENT_HOST column).
db2psc@node02:~> db2instance -list
ID  TYPE    STATE                 HOME_HOST  CURRENT_HOST  ALERT
--  ----    -----                 ---------  ------------  -----
0   MEMBER  STARTED               node02     node02        YES
1   MEMBER  STARTED               node03     node03        YES
2   MEMBER  WAITING_FOR_FAILBACK  node04     node03        YES
128 CF      PRIMARY               node02     node02        NO
129 CF      PEER                  node03     node03

HOSTNAME  STATE     INSTANCE_STOPPED  ALERT
--------  -----     ----------------  -----
node04    INACTIVE  NO                YES
node03    ACTIVE    NO                YES
node02    ACTIVE    NO                YES
Look at the third line in the db2nodes.cfg file and notice the number 2 that follows node03.local; this is the logical port (the logical node number on that host). Under normal circumstances this would be 0, unless you run multiple logical nodes on one host, for example with LPARs on AIX.
db2psc@node02:~> cat sqllib/db2nodes.cfg
0 node02.local 0 node02-ib.local - MEMBER
1 node03.local 0 node03-ib.local - MEMBER
2 node03.local 2 node03-ib.local - MEMBER
128 node02.local 0 node02-ib.local - CF
129 node03.local 0 node03-ib.local - CF
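For reference, this is how I read the fields of that third line (a sketch of the pureScale db2nodes.cfg layout; see the IBM documentation for the authoritative field definitions):

# <id> <hostname>    <logical port> <netname>       <resource set> <type>
   2   node03.local        2        node03-ib.local       -        MEMBER
# Member 2 is currently defined on node03 with logical port 2, i.e. it is
# occupying a second DB2 slot on that host: the restart light member.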
Meanwhile, node04 came back up and reintegrated with the cluster without any problem. The output from db2instance -list was clean again and db2nodes.cfg looked normal.
Now comes the real issue. I ran into this after doing several reboots on all hosts (actually to fix a slow POST problem on the servers), and I noticed that I was somehow stuck in a restart light loop: the members would not come out of the WAITING_FOR_FAILBACK state on their own.
The db2instance -list output.
db2psc@node02:~> db2instance -list
ID  TYPE    STATE                 HOME_HOST  CURRENT_HOST  ALERT
--  ----    -----                 ---------  ------------  -----
0   MEMBER  STARTED               node02     node02        YES
1   MEMBER  WAITING_FOR_FAILBACK  node03     node02        YES
2   MEMBER  WAITING_FOR_FAILBACK  node04     node02        YES
128 CF      PRIMARY               node02     node02        NO
129 CF      CATCHUP               node03     node03        NO

HOSTNAME  STATE   INSTANCE_STOPPED  ALERT
--------  -----   ----------------  -----
node04    ACTIVE  NO                YES
node03    ACTIVE  NO                YES
node02    ACTIVE  NO                NO
Notice that members 1 and 2, whose home hosts are node03 and node04, are waiting to fail back, and both are currently running on node02.
The alerts listed below appear to repeat in a cycle:
Alert: DB2 member '2' failed to start on its home host 'node04'. The cluster manager will attempt to restart the DB2 member in restart light mode on another host. Check the db2diag.log for messages concerning failures on host 'node04' for member '2'.
Alert: The DB2 member '1' could not be started in restart light mode on host 'node04'. Check the db2diag.log for messages concerning a restart light or database crash recovery failure on the indicated host for DB2 member '1'.
Alert: DB2 member '1' failed to start on its home host 'node03'. The cluster manager will attempt to restart the DB2 member in restart light mode on another host. Check the db2diag.log for messages concerning failures on host 'node03' for member '1'.
Alert: The DB2 member '0' could not be started in restart light mode on host 'node03'. Check the db2diag.log for messages concerning a restart light or database crash recovery failure on the indicated host for DB2 member '0'.
There are 2 solutions to this problem.
Method – 1 : Brute Force
Compare the current db2nodes.cfg file with your original copy; if anything is different, correct it and restart DB2. Note that hand-editing db2nodes.cfg is not the recommended procedure.
In my case, this is what my db2nodes.cfg file looked like:
db2psc@node03:~> cat sqllib/db2nodes.cfg
0 node04.local 1 node04-ib.local - MEMBER
1 node02.local 3 node02-ib.local - MEMBER
2 node03.local 1 node03-ib.local - MEMBER
128 node02.local 0 node02-ib.local - CF
129 node03.local 0 node03-ib.local - CF
Member 0 should be on host node02, member 1 on node03, and member 2 on node04. But due to the repeated reboots, the restart light landed the members on different hosts. Stop DB2, restore your original db2nodes.cfg, and clean up all alerts manually as shown below.
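For example, a minimal sketch of the stop-and-restore step, assuming you kept a known-good backup of the file (the path ~/db2nodes.cfg.orig is hypothetical; use wherever your backup actually lives):

$ db2stop force                                   # stop the instance
$ cp ~/sqllib/db2nodes.cfg ~/db2nodes.cfg.broken  # keep the broken copy for reference
$ cp ~/db2nodes.cfg.orig ~/sqllib/db2nodes.cfg    # restore the known-good copy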
Remove Alert Files Manually From Each Host
Go to each host and remove the alert file manually.
Remove Alert Files From node02, node03 and node04
$ ssh node02 "rm -fr ~/sqllib/ctrlha/.*"
$ ssh node03 "rm -fr ~/sqllib/ctrlha/.*"
$ ssh node04 "rm -fr ~/sqllib/ctrlha/.*"
$ ssh node02 "rm -fr ~/sqllib/ctrlhamirror/.*"
$ ssh node03 "rm -fr ~/sqllib/ctrlhamirror/.*"
$ ssh node04 "rm -fr ~/sqllib/ctrlhamirror/.*"
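The same cleanup can also be written as a small loop (a sketch; it assumes passwordless ssh between the hosts and the same directories as above):

$ for host in node02 node03 node04; do
>   ssh "$host" 'rm -fr ~/sqllib/ctrlha/.* ~/sqllib/ctrlhamirror/.*'
> done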
Start DB2 again and monitor the output of db2instance -list and lssam; it should come back clean.
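For example, something along these lines (a sketch; the use of watch and the 10-second interval are just a convenience, not a requirement):

$ db2start
$ watch -n 10 db2instance -list   # refresh until all members show STARTED on their home hosts
$ lssam                           # verify that no resources are left in a Failed Offline state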
Let me repeat that the above is not the recommended procedure; if you tell support folks about it, they will cringe, fringe and fry.
Method – 2 : Clear Alerts on Each Member Individually (Recommended and Best)
For example, in my case I have three hosts named node02, node03 and node04, and I used the following commands to clear alerts on each host.
Clear alerts on node02 for all members.
# ssh node02 db2cluster -cm -clear -alert -member 0
# ssh node02 db2cluster -cm -clear -alert -member 1
# ssh node02 db2cluster -cm -clear -alert -member 2
Clear alerts on node03 for all members.
# ssh node03 db2cluster -cm -clear -alert -member 0
# ssh node03 db2cluster -cm -clear -alert -member 1
# ssh node03 db2cluster -cm -clear -alert -member 2
Clear alerts on node04 for all members.
# ssh node04 db2cluster -cm -clear -alert -member 0
# ssh node04 db2cluster -cm -clear -alert -member 1
# ssh node04 db2cluster -cm -clear -alert -member 2
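If you prefer, those nine commands can be collapsed into a loop (a sketch; it assumes the same host names and member IDs as above):

# for host in node02 node03 node04; do
>   for member in 0 1 2; do
>     ssh "$host" db2cluster -cm -clear -alert -member "$member"
>   done
> done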
Check the output of db2instance -list and lssam, and give it a minimum of five minutes to settle down.
This should fix the issue. The above procedure should also fix the "Failed Offline" state seen in the lssam output for the offline resources. That is really just the restart light problem again: the idle process does not start on its own because an alert is still sitting there. TSA will not start that resource automatically, since the logic appears to require a DBA to clear the alert first. I wish this were more automatic, given that GPFS is healthy on all nodes, which indicates the network is working fine.
In DB2 10.1, there is an additional option to prevent automatic failback. It is useful when you need to reboot a server several times, or when you want to decide yourself when the failback should happen.
This is done with the command db2cluster -cm -set -option autofailback -value off for the host that you do not yet want the member failing back to, for example while you are still sorting out hardware issues.
While this option is turned off for a host, you will continue to see WAITING_FOR_FAILBACK in the db2instance -list output. Once you are sure the server is ready to rejoin the pureScale cluster, turn the option back on for that host.
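For example, something like this (a sketch based on the option name mentioned above; the -list -alert call is only there to keep an eye on outstanding alerts, and you should verify the exact db2cluster syntax for your fix pack before relying on it):

$ db2cluster -cm -set -option autofailback -value off   # defer failback while the host is being worked on
$ db2cluster -cm -list -alert                           # review any outstanding alerts in the meantime
$ db2cluster -cm -set -option autofailback -value on    # allow the member to fail back to its home host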
Final Note: I did not use db2cluster -cm -set -option autofailback -value off on any node. I rebooted all three of my servers many times today, and just by following the second procedure to clear the alerts, everything came back online in seconds. This is great, but I wish this intelligence were built into future releases or fix packs so that I do not have to clear the alerts manually. If GPFS is healthy on all nodes, there should be no reason for DB2 not to do it automatically.