DB2 10.1 pureScale provides the db2cluster command for tasks related to RSCT.
For example:
You can use db2cluster -cm -list -hostfailuredetectiontime to find out the host failure detection time.
node02:~ # db2cluster -cm -list -hostfailuredetectiontime
The host failure detection time is 8 seconds.
This is actually governed by the network settings in RSCT. Run the command lscomg and notice the parameters Sensitivity, Period, Priority and Grace.
node02:~ # lscomg
Name Sensitivity Period Priority Broadcast SourceRouting NIMPathName NIMParameters Grace
CG1  4           0.8    1        Yes       Yes                                      60
If you want to change the host failure detection time to 4 seconds, use the command db2cluster -cm -set -hostfailuredetectiontime -value 4.
node02:~ # db2cluster -cm -set -hostfailuredetectiontime -value 4
The host failure detection time has been set to 4 seconds.
Check lscomg again and notice the Period and Grace columns: Period is now 0.4, corresponding to the 4-second detection time, and the grace period has been reduced from 60 to 30 seconds.
node02:~ # lscomg
Name Sensitivity Period Priority Broadcast SourceRouting NIMPathName NIMParameters Grace
CG1  4           0.4    1        Yes       Yes                                      30
From the RSCT documentation, the explanation of Sensitivity, Period, Priority and Grace are as follows:
- Sensitivity: The number of missed heartbeats that constitute a failure
- Period: The number of seconds between heartbeats
- Priority: The relative priority of the communication group
- Grace: The number of seconds for the grace period
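To pull the Sensitivity and Period values out of lscomg output programmatically, a minimal awk sketch could look like the following. The sample output is hard-coded here (using the values shown above); on a live cluster you would pipe lscomg straight into awk instead.

```shell
# Sample lscomg output (assumed; on a real node: lscomg | awk ...)
lscomg_output="Name Sensitivity Period Priority Broadcast SourceRouting NIMPathName NIMParameters Grace
CG1 4 0.8 1 Yes Yes 60"

# Skip the header line, then print the group name with its
# Sensitivity (column 2) and Period (column 3) values.
echo "$lscomg_output" | awk 'NR > 1 { print $1, "sensitivity="$2, "period="$3 }'
```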
As per the developerworks article, the host failure detection time is computed as:
Quote from the paper:
Time to detect node failure (in sec) = Sensitivity x (Period x 2) of the Communication Group (CommGroup) the nodes in the cluster belong to.
Notes:
- A NIC/IP can belong to only one Communication Group at a time per single node; i.e., if there are multiple NICs on one or more nodes in a cluster, they will be placed in different Communication Groups.
- Sensitivity: The number of missed heartbeats that constitute a failure
- Period: The number of seconds between heartbeats
If the cluster is already operational, you can query the Cluster CommGroups to find out these two values. For example, these are the specific values in our three-node multi-site TSA/RSCT cluster topology.
root> lscomg
Name Sensitivity Period Priority Broadcast SourceRouting NIMPathName NIMParameters Grace
CG1  10          3      1        Yes       Yes                                      30

Hence in our cluster, time to detect node failure = 10 x (3 x 2) = 60 sec.
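The arithmetic above can also be scripted. The following sketch plugs the Sensitivity and Period values into the formula; the values are hard-coded here from our cluster's lscomg output, and awk handles the multiplication so that fractional periods such as 0.8 also work.

```shell
# Sensitivity and Period as reported by lscomg for CG1 (assumed values)
sensitivity=10
period=3

# time to detect node failure = Sensitivity x (Period x 2)
detect=$(awk -v s="$sensitivity" -v p="$period" 'BEGIN { print s * p * 2 }')
echo "Time to detect node failure: ${detect} sec"
```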
Another quote from the above paper:
Whenever a node loses connection with the rest of the cluster nodes, the RSCT Topology Services subsystem will issue an ICMP echo to check whether the system is still reachable. If that node responds within the time period set by the Ping Grace Period, the cluster will not detect this as a node failure. Note that the Ping Grace Period is not really meant for network glitches, but for cases where daemons get blocked because of memory starvation or other factors. We set this value to 30 seconds in both of our cluster topologies.