In DB2 pureScale, if a member's home host goes down, the member is restarted on another host in restart light mode, just enough to perform crash recovery. When the home host becomes available again, the member should fail back to it seamlessly.

For example:

This is a sample output from the db2instance -list command which shows that everything is working normally.

db2psc@node02:~> db2instance -list
ID        TYPE             STATE                HOME_HOST               CURRENT_HOST            ALERT   
--        ----             -----                ---------               ------------            -----   
0       MEMBER           STARTED                   node02                     node02               NO
1       MEMBER           STARTED                   node03                     node03               NO
2       MEMBER           STARTED                   node04                     node03               NO
128     CF               PRIMARY                   node02                     node02               NO
129     CF                  PEER                   node03                     node03               NO

HOSTNAME                   STATE                INSTANCE_STOPPED        ALERT
--------                   -----                ----------------        -----
  node04                INACTIVE                              NO           NO
  node03                  ACTIVE                              NO           NO
  node02                  ACTIVE                              NO           NO

Now, I rebooted node04. This is a sample output from the db2instance -list command which shows that member 2 (whose home host is node04) has restarted on node03 (look at the CURRENT_HOST column).

db2psc@node02:~> db2instance -list
ID        TYPE             STATE                HOME_HOST               CURRENT_HOST            ALERT   
--        ----             -----                ---------               ------------            -----   
0       MEMBER           STARTED                   node02                     node02              YES
1       MEMBER           STARTED                   node03                     node03              YES
2       MEMBER  WAITING_FOR_FAILBACK               node04                     node03              YES
128     CF               PRIMARY                   node02                     node02               NO
129     CF                  PEER                   node03                     node03               

HOSTNAME                   STATE                INSTANCE_STOPPED        ALERT
--------                   -----                ----------------        -----
  node04                INACTIVE                              NO          YES
  node03                  ACTIVE                              NO          YES
  node02                  ACTIVE                              NO          YES

Look at the 3rd line in the db2nodes.cfg file and notice the number 2 after node03.local; this is the logical port. In normal circumstances, this would be 0 (unless using LPARs on AIX). The field layout is annotated after the listing below.

db2psc@node02:~> cat sqllib/db2nodes.cfg
0 node02.local 0 node02-ib.local - MEMBER
1 node03.local 0 node03-ib.local - MEMBER
2 node03.local 2 node03-ib.local - MEMBER
128 node02.local 0 node02-ib.local - CF
129 node03.local 0 node03-ib.local - CF
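For reference, this is how I read the fields in a pureScale db2nodes.cfg entry; the annotation is my own, so verify it against the DB2 documentation for your fix pack.

member ID   hostname       logical port   netname            resource set   type
2           node03.local   2              node03-ib.local    -              MEMBER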

Meanwhile, node04 came back up and reintegrated with the cluster. The output from db2instance -list is clean again and db2nodes.cfg looks correct.

Now comes the real part. I got into this after doing several reboots on all hosts (actually to fix a slow POST problem on the servers), and I noticed that the instance was somehow stuck in a restart light loop and would not come out of the WAITING_FOR_FAILBACK state on its own.

The db2instance -list output:

db2psc@node02:~> db2instance -list
ID        TYPE             STATE                HOME_HOST               CURRENT_HOST            ALERT
--        ----             -----                ---------               ------------            -----
0       MEMBER           STARTED                   node02                     node02              YES
1       MEMBER  WAITING_FOR_FAILBACK               node03                     node02              YES
2       MEMBER  WAITING_FOR_FAILBACK               node04                     node02              YES
128     CF               PRIMARY                   node02                     node02               NO
129     CF               CATCHUP                   node03                     node03               NO

HOSTNAME                   STATE                INSTANCE_STOPPED        ALERT
--------                   -----                ----------------        -----
  node04                  ACTIVE                              NO          YES
  node03                  ACTIVE                              NO          YES
  node02                  ACTIVE                              NO           NO

Notice that members 1 and 2, whose home hosts are node03 and node04, are waiting to fail back to their home hosts; their current host is node02.

The alerts shown below appear to form a cycle.

Alert: DB2 member '2' failed to start on its home host 'node04'. The cluster manager will attempt to restart the DB2 member in restart light mode on another host. Check the db2diag.log for messages concerning failures on host 'node04' for member '2'.

Alert: The DB2 member '1' could not be started in restart light mode on host 'node04'. Check the db2diag.log for messages concerning a restart light or database crash recovery failure on the indicated host for DB2 member '1'.

Alert: DB2 member '1' failed to start on its home host 'node03'. The cluster manager will attempt to restart the DB2 member in restart light mode on another host. Check the db2diag.log for messages concerning failures on host 'node03' for member '1'.

Alert: The DB2 member '0' could not be started in restart light mode on host 'node03'. Check the db2diag.log for messages concerning a restart light or database crash recovery failure on the indicated host for DB2 member '0'.
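For reference, this alert list can be displayed at any time with the db2cluster command; I believe the invocation below is the standard one, but confirm it with db2cluster -help on your fix pack.

$ db2cluster -cm -list -alert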

There are two solutions to this problem.

Method 1: Brute Force

Compare the current db2nodes.cfg file with your original copy and, if anything differs, correct it and restart DB2. Keep in mind that hand-editing db2nodes.cfg is not the recommended procedure.

In my case, this was my db2nodes.cfg file.

db2psc@node03:~> cat sqllib/db2nodes.cfg
0 node04.local 1 node04-ib.local - MEMBER
1 node02.local 3 node02-ib.local - MEMBER
2 node03.local 1 node03-ib.local - MEMBER
128 node02.local 0 node02-ib.local - CF
129 node03.local 0 node03-ib.local - CF

Member 0 should be on host node02, member 1 on node03, and member 2 on node04. But due to the repeated reboots, the restart light restarts landed on different hosts. Stop DB2, restore your original db2nodes.cfg, and clean up all alerts manually as shown below.
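Before the alert cleanup below, the stop-and-restore part looks roughly like this. This is only a sketch; ~/db2nodes.cfg.orig is my own assumed backup path, so keep a copy of the file before you start rebooting.

$ db2stop force
$ diff ~/db2nodes.cfg.orig ~/sqllib/db2nodes.cfg
$ cp ~/db2nodes.cfg.orig ~/sqllib/db2nodes.cfg

The diff shows at a glance which members drifted to other hosts.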

Remove Alert Files Manually From Each Host

Go to each host and remove the alert file manually.

Remove Alert Files From node02, node03 and node04

$ ssh node02 "rm -fr ~/sqllib/ctrlha/.*"
$ ssh node03 "rm -fr ~/sqllib/ctrlha/.*"
$ ssh node04 "rm -fr ~/sqllib/ctrlha/.*"
$ ssh node02 "rm -fr ~/sqllib/ctrlhamirror/.*"
$ ssh node03 "rm -fr ~/sqllib/ctrlhamirror/.*"
$ ssh node04 "rm -fr ~/sqllib/ctrlhamirror/.*"

Start DB2 again and monitor the db2instance -list and lssam output; it should come up clean.
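For example, the restart and the two monitoring commands look like this (nothing new here, just the commands from the sentence above):

$ db2start
$ db2instance -list
$ lssam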

Let me repeat that the above is not the recommended procedure, and if you tell this to support folks, they will cringe, fringe and fry.

Method 2: Clear Alerts on Each Member Individually (Recommended)

For example, in my case I have 3 hosts named node02, node03 and node04, and I used the following commands to clear alerts on each host.

Clear alerts on node02 for all members.

# ssh node02 db2cluster -cm -clear -alert -member 0
# ssh node02 db2cluster -cm -clear -alert -member 1
# ssh node02 db2cluster -cm -clear -alert -member 2

Clear alerts on node03 for all members.

# ssh node03 db2cluster -cm -clear -alert -member 0
# ssh node03 db2cluster -cm -clear -alert -member 1
# ssh node03 db2cluster -cm -clear -alert -member 2

Clear alerts on node04 for all members.

# ssh node04 db2cluster -cm -clear -alert -member 0
# ssh node04 db2cluster -cm -clear -alert -member 1
# ssh node04 db2cluster -cm -clear -alert -member 2
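If you have more hosts or members, the same clearing can be wrapped in a small loop; the host and member lists below reflect my 3-node setup, so adjust them for yours.

# for h in node02 node03 node04; do
>   for m in 0 1 2; do
>     ssh $h "db2cluster -cm -clear -alert -member $m"
>   done
> done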

Check the output from db2instance -list and lssam, and give it a minimum of 5 minutes to settle down.

This should fix the issue. The above procedure should also fix the "Failed Offline" state seen in the lssam output for the offline resources. That is nothing but the same restart light issue: the idle process does not start on its own because an alert is still sitting there. TSA will not start that resource automatically, since the code logic seems to say that there is an alert and a DBA needs to clear it. I wish it were more automatic, given that GPFS being healthy on all nodes indicates the network is working well.

In DB2 10.1, there is an additional option available to prevent automatic failback. It is useful when you need to do several reboots on a server or want to decide yourself when a failback should happen.

This can be done by running the command db2cluster -cm -set -option autofailback -value off on the host that you do not yet want to rejoin the cluster, for example while you are still checking hardware issues.

While this option is turned off on a host, you will continue to see WAITING_FOR_FAILBACK in the output from db2instance -list. Once you are sure that the server is ready to reintegrate with the pureScale cluster, turn the option back on for that host.
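A minimal sketch of the sequence; the -value on form simply mirrors the -value off command shown above, so confirm it with db2cluster -help on your version.

$ db2cluster -cm -set -option autofailback -value off
  (reboot the host, check the hardware, and so on)
$ db2cluster -cm -set -option autofailback -value on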

Final Note: I did not use db2cluster -cm -set -option autofailback -value off on any node. I booted and rebooted all 3 of my servers many times today, and just by following the second procedure to clear the alerts, everything came back online in seconds. This is great, but I wish this intelligence were built into future releases or fix packs so that I do not have to clear the alerts manually. If GPFS is healthy on all nodes, there should be no reason for DB2 not to do it automatically.