Best practices, or at least things that appear to work well, in a two-node cluster or a single-node cluster (just one DB2 member and a CF).
The cthats subsystem handles heartbeating between nodes. In steady state, the nodes form a ring in which each member sends heartbeats to its downstream neighbor and monitors for heartbeats arriving from its upstream neighbor (the neighbor with the higher IP address is your upstream neighbor). With only two nodes, however, both members of the heartbeat ring send to and receive from the same neighbor, and RSCT is not really aware of that fact.
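The ring described above can be illustrated with a small Python sketch. It is a simplified model for intuition only, not RSCT's actual implementation: it assumes the ring is ordered by IP address and that each member's upstream neighbor is the next higher address, wrapping around at the top. The function name `ring_neighbors` and the sample addresses are made up for illustration.

```python
import ipaddress

def ring_neighbors(addresses):
    """Return {ip: (upstream, downstream)} for a ring ordered by IP address.

    Simplified model (an assumption, not RSCT code): upstream is the next
    higher address, wrapping around; downstream is the next lower one.
    A single node has no neighbors at all.
    """
    ordered = sorted(addresses, key=ipaddress.ip_address)
    n = len(ordered)
    if n < 2:
        return {ip: (None, None) for ip in ordered}
    return {
        ip: (ordered[(i + 1) % n], ordered[(i - 1) % n])
        for i, ip in enumerate(ordered)
    }

# Hypothetical addresses for a two-node and a three-node cluster.
two = ring_neighbors(["192.168.100.11", "192.168.100.12"])
three = ring_neighbors(["192.168.100.11", "192.168.100.12", "192.168.100.13"])
print(two)
print(three)
```

In the two-node case both entries of each tuple are the same host, which is the situation the paragraph above describes; with a single node there is no neighbor to heartbeat with.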
Commands to use:
# lsrsrc -Ab IBM.NetworkInterface Name OpState --> OpState 2 is a problem
# lssrc -ls cthats
If you have only a two-node pureScale cluster and not many network adapters (in my case, pureScale on VMware with a single adapter), it is important to define an IP address in /var/ct/cfg/netmon.cf that RSCT can use for heartbeat purposes. In a two-node cluster, when only one node is alive, that node has nobody else to heartbeat with, and this is where the IP address defined in /var/ct/cfg/netmon.cf comes in handy. Generally, we use the switch IP address. If the switch IP address changes in the future, make sure you update this file to reflect the new address. There is no need to restart the cluster, as the file is picked up automatically when you modify it.
Another example of the importance of netmon.cf is a single-node DB2 pureScale cluster, in which one host runs both the CF and a DB2 member.
After installing DB2 pureScale on one node (CF and a member colocated), you may notice that the CF is not able to start. The DB2 installer shows the following DB2 SQL error.
ERROR: 04/22/2013 14:51:52 128 0 SQL1677N DB2START or DB2STOP processing failed due to a DB2 cluster services error.
04/22/2013 14:51:53 0 0 SQL1685N An error was encountered during DB2START processing of DB2 member with identifier "0" because the database manager failed to start one or more CFs.
SQL1677N DB2START or DB2STOP processing failed due to a DB2 cluster services error.
ERROR: An error occurred while trying to start the "db2inst1" instance. The return code is "55" and the SQL Message is: "PROCESS_ERROR"
The db2diag.log for the CF (128) may show the following error.
2013-04-22-220.127.116.115440-420 I3963E448 LEVEL: Severe
PID : 52505 TID : 47845314865664 PROC : ca-wdog 128 [db2inst1]
INSTANCE: db2inst1 NODE : 128
HOSTNAME: purescale1.chq.ei
FUNCTION: DB2 UDB, high avail services, rocmCAWatchDog, probe:2331
MESSAGE : ZRC=0x80050801=-2147153919=SQLE_RC_ADAPTER_NOT_FOUND
"Adapter not found"
DATA #1 : String, 40 bytes
netmon.cf validation fails on this host.

2013-04-22-14.51.52.087100-420 I4412E454 LEVEL: Severe
PID : 52416 TID : 47918334802432 PROC : db2rocme 128 [db2inst1]
INSTANCE: db2inst1 NODE : 128
HOSTNAME: purescale1.chq.ei
FUNCTION: DB2 UDB, high avail services, rocmCAStart, probe:3087
MESSAGE : ZRC=0x80050801=-2147153919=SQLE_RC_ADAPTER_NOT_FOUND
"Adapter not found"
DATA #1 : String, 48 bytes
Error starting up ca-server. Attempting cleanup.
All of the above is due to a heartbeat failure: because this is a single node, no ring is formed in which RSCT can heartbeat with other members of the cluster. The IP addresses defined in /var/ct/cfg/netmon.cf are used for heartbeating when there are not enough hosts or members to heartbeat with.
Follow these simple rules to add entries to your netmon.cf file.
- Find out the IP address of the 10GbE or InfiniBand switch. If two switches are used, find out both IP addresses. A command such as show interface ip may work, but this depends on the switch manufacturer.
- Use the /sbin/route command to find the subnet used by the 10GbE or InfiniBand adapters (like eth8 or eth9) and, for each interface, add the IP address of the switch to the netmon.cf file.
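The two steps above amount to matching each switch address to the interface whose subnet contains it. A minimal Python sketch of that matching follows. The interface names, addresses, and prefix lengths are hypothetical sample values (in practice you would take them from /sbin/route or your interface configuration), and `suggest_reqd_entries` is an illustrative helper, not part of DB2 or RSCT.

```python
import ipaddress

def suggest_reqd_entries(interfaces, switch_ips):
    """Build '!REQD <adapter> <ip>' lines for switches on each adapter's subnet.

    interfaces: dict mapping adapter name -> interface address in CIDR form.
    switch_ips: list of switch IP addresses as strings.
    """
    entries = []
    for adapter, cidr in interfaces.items():
        # ip_interface accepts a host address with a prefix, e.g. 192.168.100.11/24.
        network = ipaddress.ip_interface(cidr).network
        for ip in switch_ips:
            if ipaddress.ip_address(ip) in network:
                entries.append(f"!REQD {adapter} {ip}")
    return entries

# Hypothetical two-switch setup: one adapter port per subnet.
interfaces = {"eth8": "192.168.100.11/24", "eth9": "192.168.101.11/24"}
switches = ["192.168.100.1", "192.168.101.1"]
entries = suggest_reqd_entries(interfaces, switches)
print("\n".join(entries))
```

An adapter whose subnet contains no switch address simply produces no line, which is a hint that the switch list or the interface list is incomplete.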
The format of the netmon.cf file is:
!REQD <adapter> <ip>
For example, if the IP address of the switch is 192.168.100.1 and the IP subnet for the 10GbE adapter (eth8) is 192.168.100.0, you should add the entry
!REQD eth8 192.168.100.1
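To check what a netmon.cf file actually contains, the `!REQD` lines can be pulled out with a few lines of Python. This is a minimal sketch that only understands the `!REQD <adapter> <ip>` form shown above, assumes the target is a literal IP address, and skips everything else. `parse_reqd` is an illustrative name.

```python
import ipaddress

def parse_reqd(text):
    """Return a list of (adapter, ip) tuples for well-formed !REQD lines."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3 or fields[0] != "!REQD":
            continue  # other directives, blank lines, or malformed entries
        try:
            ipaddress.ip_address(fields[2])
        except ValueError:
            continue  # third field is not a literal IP address
        entries.append((fields[1], fields[2]))
    return entries

sample = "!REQD eth8 192.168.100.1\n"
print(parse_reqd(sample))
```

On a real host you would pass in the contents of /var/ct/cfg/netmon.cf instead of the sample string.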
If you have two switches, then you must be using both ports on the adapters, and they will be on different subnets. In that case, you should add two entries to the netmon.cf file, one for each adapter port, so that each can reach the right switch.
Now the big question: why did the installer not create the appropriate entries in the netmon.cf file? Well, you need to open a PMR for this. The installer is probably not able to find the IP interface addresses of the HCA, so it only added an entry for the eth0 adapter, for which it can easily find the gateway address.
Make sure that the adapter names used by the 10GbE or InfiniBand adapters (like eth8, eth9) are defined in netmon.cf so that RSCT knows which IP address to ping for the adapters used for CF communication. If the communication is uDAPL (RDMA), we need another entry in the netmon.cf file.
How do you know what type of communication the adapter uses? It is either sockets (non-RDMA) or uDAPL.
Use the command:
grep -iE 'PsOpen|PsConnect' db2diag.log
If you see PsOpen, the CF is using sockets; otherwise it is using RDMA through uDAPL.
If it is uDAPL, you have to make sure that an entry for the adapter exists in netmon.cf.
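The same check can be scripted. The sketch below is a rough Python equivalent of the grep above, applying the rule of thumb from the text (PsOpen means sockets, otherwise uDAPL). The sample log lines are invented placeholders rather than real db2diag.log output, and `transport_from_diag` is a hypothetical helper.

```python
import re

def transport_from_diag(lines):
    """Classify CF transport from db2diag.log lines (case-insensitive match).

    Returns "socket" if any line mentions PsOpen, "uDAPL" if only PsConnect
    is seen, and "unknown" if neither string appears.
    """
    saw_connect = False
    for line in lines:
        if re.search(r"PsOpen", line, re.IGNORECASE):
            return "socket"
        if re.search(r"PsConnect", line, re.IGNORECASE):
            saw_connect = True
    return "uDAPL" if saw_connect else "unknown"

# Placeholder lines; a real run would iterate over the db2diag.log file instead.
print(transport_from_diag(["placeholder line mentioning PsOpen"]))
print(transport_from_diag(["placeholder line mentioning PsConnect"]))
```

Returning "unknown" when neither string is found avoids reporting uDAPL just because the log has rotated or the CF has not started yet.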
Sample netmon.cf created by the installer:
!IBQPORTONLY !ALL
!REQD eth0 10.100.100.1
The problem is that the eth0 adapter is on the public network, and the entry added was for the gateway address. The entry for the HCA (the 10GbE adapter, say eth8) is missing. So we need to add the IP address of the switch that this adapter can reach:
!IBQPORTONLY !ALL
!REQD eth0 10.100.100.1 --> This is the gateway address for the host; eth0 is the public adapter (a 1 Gbps Ethernet card)
!REQD eth8 192.168.100.1 --> This is the switch address
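A quick way to catch this situation is to compare the adapters you expect against the `!REQD` entries that are present. The following Python sketch does that for the installer-generated sample above. `missing_adapters` is an illustrative helper, and the list of expected adapters (eth0, eth8) is an assumption taken from this example.

```python
def missing_adapters(netmon_text, expected):
    """Return the adapters from `expected` that have no !REQD entry."""
    covered = set()
    for line in netmon_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "!REQD":
            covered.add(fields[1])
    return [name for name in expected if name not in covered]

# File contents as created by the installer, then with the eth8 entry added.
installer_file = "!IBQPORTONLY !ALL\n!REQD eth0 10.100.100.1\n"
fixed_file = installer_file + "!REQD eth8 192.168.100.1\n"

print(missing_adapters(installer_file, ["eth0", "eth8"]))
print(missing_adapters(fixed_file, ["eth0", "eth8"]))
```

The first call reports eth8 as missing and the second reports nothing, matching the before and after states of the file described here.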