DB2 10.1 pureScale provides the db2cluster command for tasks related to RSCT.
For example:
You can use db2cluster -cm -list -hostfailuredetectiontime to find out the host failure detection time.
node02:~ # db2cluster -cm -list -hostfailuredetectiontime
The host failure detection time is 8 seconds.
This is actually governed by the network settings in RSCT. Run the command lscomg and notice the parameters Sensitivity, Period, Priority and Grace.
node02:~ # lscomg
Name Sensitivity Period Priority Broadcast SourceRouting NIMPathName NIMParameters Grace
CG1  4           0.8    1        Yes       Yes                                      60
If you want to change the host failure detection time to 4 seconds, use the command db2cluster -cm -set -hostfailuredetectiontime -value 4.
node02:~ # db2cluster -cm -set -hostfailuredetectiontime -value 4
The host failure detection time has been set to 4 seconds.
Check lscomg again and notice the Period and Grace columns: Period is now 0.4, corresponding to the 4-second detection time, and the grace period has been reduced from 60 to 30 seconds.
node02:~ # lscomg
Name Sensitivity Period Priority Broadcast SourceRouting NIMPathName NIMParameters Grace
CG1  4           0.4    1        Yes       Yes                                      30
From the RSCT documentation, the explanation of Sensitivity, Period, Priority and Grace are as follows:
- Sensitivity: The number of missed heartbeats that constitute a failure
- Period: The number of seconds between heartbeats
- Priority: The relative priority of the communication group
- Grace: The number of seconds for the grace period
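To pull the Sensitivity and Period values out of lscomg output programmatically, a minimal awk sketch could look like the following. The sample output is hard-coded here (using the values shown above); on a live cluster you would pipe lscomg straight into awk instead.

```shell
# Sample lscomg output (assumed; on a real node: lscomg | awk ...)
lscomg_output="Name Sensitivity Period Priority Broadcast SourceRouting NIMPathName NIMParameters Grace
CG1 4 0.8 1 Yes Yes 60"

# Skip the header line, then print the group name with its
# Sensitivity (column 2) and Period (column 3) values.
echo "$lscomg_output" | awk 'NR > 1 { print $1, "sensitivity="$2, "period="$3 }'
```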
As per the developerworks article, the host failure detection time is computed as:
Quote from the paper:
Time to detect node failure (in sec) = Sensitivity x (Period x 2) of the Communication Group (CommGroup) the nodes in the cluster belong to.
Notes:
- A NIC/IP can belong to only one Communication Group at a time per single node; i.e., if there are multiple NICs on one or more nodes in a cluster, they will be placed in different Communication Groups.
- Sensitivity: The number of missed heartbeats that constitute a failure
- Period: The number of seconds between heartbeats
If the cluster is already operational, you can query the Cluster CommGroups to find out these two values. For example, these are the specific values in our three-node multi-site TSA/RSCT cluster topology.
root> lscomg
Name Sensitivity Period Priority Broadcast SourceRouting NIMPathName NIMParameters Grace
CG1  10          3      1        Yes       Yes                                      30

Hence in our cluster, time to detect node failure = 10 x (3 x 2) = 60 sec.
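The arithmetic above can also be scripted. The following sketch plugs the Sensitivity and Period values into the formula; the values are hard-coded here from our cluster's lscomg output, and awk handles the multiplication so that fractional periods such as 0.8 also work.

```shell
# Sensitivity and Period as reported by lscomg for CG1 (assumed values)
sensitivity=10
period=3

# time to detect node failure = Sensitivity x (Period x 2)
detect=$(awk -v s="$sensitivity" -v p="$period" 'BEGIN { print s * p * 2 }')
echo "Time to detect node failure: ${detect} sec"
```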
Another quote from the above paper:
Whenever a node loses connection with the rest of the cluster nodes, the RSCT Topology Services subsystem will issue an ICMP echo to check whether the system is still reachable. If that node responds within the time period set by the Ping Grace Period, the cluster will not detect this as a node failure. Note that the Ping Grace Period is not really meant for network glitches, but for cases where daemons get blocked because of memory starvation or other factors. We set this value to 30 seconds in both of our cluster topologies.