Before Install

The DB2 10.1 pureScale feature uses RSCT, GPFS, and Tivoli SA MP (which is itself an application of RSCT). A DB2 DBA can take care of DB2, but there are additional things to take care of, particularly around RSCT. With DB2 pureScale you are going from a standalone DB2 on a single host to a cluster of servers. The complexity increases, but it is not rocket science to learn.

Here are a couple of things a DBA must watch for or take care of.

  • Decide your host names, IP addresses and gateway IP addresses up front. You do not have the luxury of changing them after the install. There is a lengthy procedure to change the IP addresses or host names, and if I get time, I will document it some time later. This is not a DB2 limitation but an RSCT one: RSCT keeps the IP addresses and related details in its configuration files, and if they change, the cluster is not going to work.
  • Read through the DB2 pureScale prerequisites in the Information Center even if you are an experienced DB2 DBA. It is RSCT, Tivoli SA MP and the high-speed interconnect that you will be concerned with, not DB2. Check that page for the supported platforms and environments.
  • Stick to the supported OS and the documented release number even if a newer release is available. Several things can go wrong on a newer OS release, so avoid the temptation to go with the latest and greatest unless you are willing to open all the PMRs that come with the newer OS release.
  • Set up NTP before the install and do not ignore it even if you are 100% sure that your server clocks are dead accurate. My sample ntp.conf files for the server and for the clients are posted here for easy reference, and a quick verification sketch follows at the end of this list. There are any number of ways to set this up, and this is simply how I do it: I use a separate machine as an NTP server within the network and have all nodes of the DB2 pureScale cluster sync time with this internal server. Your sysadmin may have a better way.

    ## ntp.conf for the NTP server
    tinker panic 0
    restrict default kod nomodify notrap
    restrict 127.0.0.1 
    
    # -- CLIENT NETWORK -------
    restrict 192.168.142.0 mask 255.255.0.0 nomodify notrap
    
    # --- OUR TIMESERVERS ----- 
    server 0.pool.ntp.org iburst
    server 1.pool.ntp.org iburst
    server 2.pool.ntp.org iburst
    server 127.127.1.0
    
    # Undisciplined Local Clock.
    # fudge   127.127.1.0 stratum 9
    
    driftfile /var/lib/ntp/drift/ntp.drift # path for drift file
    broadcastdelay  0.008
    
    logfile   /var/log/ntp		# alternate log file
    keys /etc/ntp.keys		# path for keys file
    trustedkey 1			# define trusted keys
    requestkey 1			# key (7) for accessing server variables
    
    

    ## ntp.conf for the client (i.e. the pureScale nodes)
    tinker panic 0
    restrict default kod nomodify notrap
    restrict 127.0.0.1 
    
    # -- CLIENT NETWORK -------
    
    # --- OUR TIMESERVERS ----- 
    # This IP address below is the address of the local NTP server that you set up.
    server 192.168.142.101 iburst
    server 127.127.1.0
    
    # Undisciplined Local Clock.
    # fudge   127.127.1.0 stratum 9
    
    driftfile /var/lib/ntp/drift/ntp.drift # path for drift file
    broadcastdelay  0.008
    
    logfile   /var/log/ntp		# alternate log file
    keys /etc/ntp.keys		# path for keys file
    trustedkey 1			# define trusted keys
    requestkey 1			# key (7) for accessing server variables
    
    
  • Add the following entries to your /etc/sysctl.conf file so that Automatic Client Reroute works as expected (the sketch at the end of this list shows how to load and confirm them).

    # Configure all of the TCP/IP parameters for pureScale on Linux
    net.ipv4.tcp_keepalive_time = 10
    net.ipv4.tcp_keepalive_intvl = 3
    net.ipv4.tcp_keepalive_probes = 10
    
  • For Linux, some modules have to be blacklisted; do not skip this step before the install. For example, add the following entries for SUSE Linux in /etc/modprobe.d/blacklist.

    ## Entries required for DB2 pureScale
    blacklist iTCO_wdt
    blacklist iTCO_vendor_support
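
A quick pre-install sanity check that ties the NTP, sysctl and blacklist items above together. This is only a sketch assuming the files shown above are in place: ntpq -p confirms the node is syncing against your internal NTP server, sysctl -p loads the new kernel parameters and the second sysctl call echoes them back, and the lsmod pipe should return nothing once the watchdog modules are no longer loaded (typically after a reboot).

    ## Quick pre-install checks (run as root on each node)
    # ntpq -p
    # sysctl -p
    # sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
    # lsmod | egrep 'iTCO_wdt|iTCO_vendor_support'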

After Install

  • Chances are that your sysadmin will take away the root password from you after the product is installed. Before surrendering your root password to your control-freak sysadmins, add the following entry through visudo.

    # visudo
    # Add the following entry and replace db2psc with your DB2 pureScale instance name
    db2psc ALL=(ALL) NOPASSWD: ALL
    
    
  • Chances are that you have smart sysadmins who will open a trouble ticket on you if you do the above. Take your smart sysadmins to lunch and ask for sudo access to everything under the following directories (a sample sudoers sketch follows at the end of this list).

    /opt/ibm/db2/V10.1
    /usr/sbin/rsct
    /usr/lpp/mmfs
    /opt/IBM/tsamp/sam
    /var/ct
    /var/log
    
  • If, in the name of corporate policy, your sysadmin refuses to give you anything, chances are that your working relationship is not going well. In that case you are actually safe and protected: your life will be much simpler and easier, since you have a sysadmin to blame for every issue that comes out of RSCT.
  • After the install, the IP address of the gateway is kept in the file /var/ct/cfg/netmon.cf. This file may look as follows. If the gateway IP address ever changes, do not forget to update this file.

    ## Contents of /var/ct/cfg/netmon.cf file
    node02:/var/ct/cfg # cat netmon.cf 
    
    !IBQPORTONLY !ALL
    !REQD eth0 192.168.142.2
    
  • RSCT keeps information about the IP addresses internally. Try these commands after the install to verify.

    # lscomg
    # lscomg -i CG1
    
  • Use lsrpdomain to find the status of the peer domain.

    # lsrpdomain
    
    
  • Use lsrpnode to find the status of all nodes in the domain.

    # lsrpnode
  • Use lssam to view the RSCT resources and their status.

    # lssam
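
If the lunch goes well, a minimal sudoers sketch for that request might look like the following. This is only an illustration: it assumes the instance owner is db2psc and that the binaries you care about live under the bin and adm subdirectories of the paths listed above; the alias name is made up, and your security team will likely want something tighter.

    # visudo
    Cmnd_Alias DB2PS_CMDS = /opt/ibm/db2/V10.1/bin/*, /opt/ibm/db2/V10.1/adm/*, \
                            /usr/sbin/rsct/bin/*, /usr/lpp/mmfs/bin/*, /opt/IBM/tsamp/sam/bin/*
    db2psc ALL=(root) NOPASSWD: DB2PS_CMDS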
    

Troubleshooting – Part 1

Several things can go wrong when RSCT refuses to start. This is not a comprehensive guide; it is based on the experiences that I have had.

For example, when you type the following commands, you see unusual output like this:

# lsrpdomain
2610-412 A Resource Manager terminated while attempting to enumerate resources for this command.
2610-408 Resource selection could not be performed.
2610-412 A Resource Manager terminated while attempting to enumerate resources for this command.
2610-408 Resource selection could not be performed.

# lssam
lssam: No resource groups defined or cluster is offline!

The first thing you must do is use the db2cluster command to try to verify and repair the resources.

$ db2cluster -cm -verify -resources
$ db2cluster -cm -repair -resources

You should always keep a copy of ~/sqllib/db2nodes.cfg. Check the contents of the db2nodes.cfg file; if it is messed up or contains garbage, restore it from your copy and try to verify and repair the resources again before going any further.
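
Taking that copy is a one-liner; the backup name below is just an example.

$ cp ~/sqllib/db2nodes.cfg ~/db2nodes.cfg.$(date +%Y%m%d)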

The next command, lssrc -a, gives the status of the RSCT subsystems.

# lssrc -a
Subsystem         Group            PID     Status 
 ctrmc            rsct             6374    active
 IBM.ERRM         rsct_rm          6468    active
 ctcas            rsct             7000    active
 IBM.SensorRM     rsct_rm                  inoperative
 IBM.LPRM         rsct_rm                  inoperative
 cthats           cthats                   inoperative
 cthags           cthags                   inoperative
 cthagsglsm       cthags                   inoperative
 IBM.ConfigRM     rsct_rm                  inoperative
 IBM.HostRM       rsct_rm                  inoperative
 IBM.GblResRM     rsct_rm                  inoperative
 IBM.RecoveryRM   rsct_rm                  inoperative
 IBM.StorageRM    rsct_rm                  inoperative
 IBM.TestRM       rsct_rm                  inoperative
 IBM.AuditRM      rsct_rm                  inoperative

It looks like major subsystems such as IBM.ConfigRM and IBM.RecoveryRM are not running, so something has happened.

Type the command fcslogrpt /var/log/messages, which scans the syslog and picks up only the RSCT-related messages. I am showing only the messages that are relevant.

# fcslogrpt /var/log/messages

Jun 19 13:39:30 node02 RMCdaemon[12070]: (Recorded using libct_ffdc.a cv 2)
	Error ID: 824....GZ9sD/FLI..XeXa/...................
	Reference ID:  
	Template ID: 0
	Details File:  
	Location: RSCT,rmcd.c,1.80,230                          
	RMCD_INFO_0_ST The daemon is started.
Jun 19 13:39:41 node02 RMCdaemon[12070]: (Recorded using libct_ffdc.a cv 2)
	Error ID: 824....RZ9sD/ZIE/.XeXa/...................
	Reference ID:  
	Template ID: 0
	Details File:  
	Location: RSCT,rmcd.c,1.80,900                          
	RMCD_INFO_1_ST The daemon is stopped. Number of command that stopped the daemon 3
Jun 19 13:39:51 node02 RMCdaemon[12425]: (Recorded using libct_ffdc.a cv 2)
	Error ID: 824....bZ9sD/SCz..XeXa/...................
	Reference ID:  
	Template ID: 0
	Details File:  
	Location: RSCT,rmcd.c,1.80,230                          
	RMCD_INFO_0_ST The daemon is started.
Jun 19 13:39:58 node02 RMCdaemon[12425]: (Recorded using libct_ffdc.a cv 2)
	Error ID: 824....iZ9sD/E3O..XeXa/...................
	Reference ID:  
	Template ID: 0
	Details File:  
	Location: RSCT,rmcd.c,1.80,900                          
	RMCD_INFO_1_ST The daemon is stopped. Number of command that stopped the daemon 3
Jun 19 13:40:04 node02 RMCdaemon[12496]: (Recorded using libct_ffdc.a cv 2)
	Error ID: 824....oZ9sD/Yr61.XeXa/...................
	Reference ID:  
	Template ID: 0
	Details File:  
	Location: RSCT,rmcd.c,1.80,230                          
	RMCD_INFO_0_ST The daemon is started.
Jun 19 13:40:48 node02 RMCdaemon[12496]: (Recorded using libct_ffdc.a cv 2)
	Error ID: 824....Ua9sD/aMf0.XeXa/...................
	Reference ID:  
	Template ID: 0
	Details File:  
	Location: RSCT,rmcd.c,1.80,900                          
	RMCD_INFO_1_ST The daemon is stopped. Number of command that stopped the daemon 3
Jun 19 13:40:56 node02 RMCdaemon[13910]: (Recorded using libct_ffdc.a cv 2)
	Error ID: 824....ca9sD/jGs/.XeXa/...................
	Reference ID:  
	Template ID: 0
	Details File:  
	Location: RSCT,rmcd.c,1.80,230                          
	RMCD_INFO_0_ST The daemon is started.
Jun 19 13:54:12 node02 cthats[27562]: (Recorded using libct_ffdc.a cv 2)
	Error ID: 824....2n9sD/k2c0.XeXa/...................
	Reference ID:  
	Template ID: 0
	Details File:  
	Location: rsct,bootstrp.C,1.215.1.13,4956               
	TS_START_ST Topology Services daemon started Topology Services daemon started by: 
        SRC Topology Services daemon log file location /var/ct/db2domain_20120619135407/log/cthats/cthats.19.135412.C 
        Topology Services daemon run directory /var/ct/db2domain_20120619135407/run/cthats/
Jun 19 13:54:13 node02 cthags[27596]: (Recorded using libct_ffdc.a cv 2)
	Error ID: 824....3n9sD/6a31.XeXa/...................
	Reference ID:  
	Template ID: 0
	Details File:  
	Location: RSCT,pgsd.C,1.62.1.23,695                     
	GS_START_ST Group Services daemon started DIAGNOSTIC EXPLANATION HAGS daemon started by SRC. 
        Log file is /var/ct/2ZADmLsxxktfAP5xaAz4iW/log/cthags/trace.
Jun 19 13:54:14 node02 RMCdaemon[13910]: (Recorded using libct_ffdc.a cv 2)
	Error ID: 824....4n9sD/Fqo..XeXa/...................
	Reference ID:  
	Template ID: 0
	Details File:  
	Location: RSCT,rmcd.c,1.80,1102                         
	RMCD_INFO_1_ST The daemon is stopped. Number of command that stopped the daemon 1
Jun 19 13:54:14 node02 RMCdaemon[27668]: (Recorded using libct_ffdc.a cv 2)
	Error ID: 824....4n9sD/lD90.XeXa/...................
	Reference ID:  
	Template ID: 0
	Details File:  
	Location: RSCT,rmcd.c,1.80,230                          
	RMCD_INFO_0_ST The daemon is started.
Jun 19 13:54:17 node02 RecoveryRM[27728]: (Recorded using libct_ffdc.a cv 2)
	Error ID: 824....7n9sD/2OX..XeXa/...................
	Reference ID:  
	Template ID: 0
	Details File:  
	Location: RSCT,Protocol.C,1.54.1.59,442                 
	RECOVERYRM_INFO_7_ST This node has joined the IBM.RecoveryRM group. My node number =  1 ; 
        Master node number =  1
Jun 19 13:56:45 node02 ctcasd[13498]: (Recorded using libct_ffdc.a cv 2)
	Error ID: 824....Rp9sD/6kH1.XeXa/...................
	Reference ID:  
	Template ID: 532f32bf
	Details File:  
	Location: rsct.core.sec,ctcas_main.c,1.30,325           
	ctcasd Daemon Started
Jun 19 13:57:27 node02 ConfigRM[27445]: (Recorded using libct_ffdc.a cv 2)
	Error ID: 
	Reference ID:  
	Template ID: 0
	Details File:  
	Location: RSCT,PeerDomain.C,1.99.22.61,18346            
	CONFIGRM_HASQUORUM_ST The operational quorum state of the active peer domain has changed to HAS_QUORUM.  
        In this state, cluster resources may be recovered and controlled as needed by  management applications.
Jun 19 13:57:31 node02 RecoveryRM[27728]: (Recorded using libct_ffdc.a cv 2)
	Error ID: 825....9q9sD/B4N/.XeXa/...................
	Reference ID:  
	Template ID: 0
	Details File:  
	Location: RSCT,Protocol.C,1.54.1.59,2722                
	RECOVERYRM_INFO_3_ST A new member has joined. Node number =  2
Jun 19 13:58:34 node02 RecoveryRM[27728]: (Recorded using libct_ffdc.a cv 2)
	Error ID: 825....8r9sD/8Dm0.XeXa/...................
	Reference ID:  
	Template ID: 0
	Details File:  
	Location: RSCT,Protocol.C,1.54.1.59,2722                
	RECOVERYRM_INFO_3_ST A new member has joined. Node number =  3
Jun 19 14:40:26 node02 RMCdaemon[6374]: (Recorded using libct_ffdc.a cv 2)
	Error ID: 824....OSAsD/Yk6/.XeXa/...................
	Reference ID:  
	Template ID: 0
	Details File:  
	Location: RSCT,rmcd.c,1.80,230                          
	RMCD_INFO_0_ST The daemon is started.
Jun 19 14:40:36 node02 ctcasd[7000]: (Recorded using libct_ffdc.a cv 2)
	Error ID: 824....YSAsD/qYe0.XeXa/...................
	Reference ID:  
	Template ID: 532f32bf
	Details File:  
	Location: rsct.core.sec,ctcas_main.c,1.30,325           
	ctcasd Daemon Started
Jun 19 14:52:21 node02 RecoveryRM[15977]: (Recorded using libct_ffdc.a cv 2)
	Error ID: 824....ZdAsD/b5X1.XeXa/...................
	Reference ID:  
	Template ID: 0
	Details File:  
	Location: RSCT,IBM.RecoveryRMd.C,1.21.2.6,165           
	RECOVERYRM_INFO_0_ST IBM.RecoveryRM daemon has started. 
Jun 19 14:52:21 node02 RecoveryRM[15977]: (Recorded using libct_ffdc.a cv 2)
	Error ID: 822....ZdAsD/pPX1.XeXa/...................
	Reference ID:  
	Template ID: 0
	Details File:  
	Location: RSCT,RecoveryRMDaemon.C,1.15.2.22,400         
	RECOVERYRM_2621_402_ER 2621-402 IBM.RecoveryRM daemon stopped by SRC command or exiting due to an 
        error condition . Error id  0
Jun 19 14:52:22 node02 StorageRM[16036]: (Recorded using libct_ffdc.a cv 2)
	Error ID: 
	Reference ID:  
	Template ID: 0
	Details File:  
	Location: RSCT,IBM.StorageRMd.C,1.44,146                
	STORAGERM_STARTED_ST IBM.StorageRM daemon has started. 
Jun 19 14:52:22 node02 StorageRM[16036]: (Recorded using libct_ffdc.a cv 2)
	Error ID: 
	Reference ID:  
	Template ID: 0
	Details File:  
	Location: RSCT,StorageRMDaemon.C,1.57,324               
	STORAGERM_STOPPED_ST IBM.StorageRM daemon has been stopped.

In reality, this type of condition should not happen, and it is time to open a PMR and engage DB2 support; they will get the right folks from the RSCT team to look into it.

You should send the following output to support, along with the /var/log/messages file from each host. After the -host switch, include a comma-separated list of all hosts.

# db2support -pureScale -host node02,node03,node04

 

After you send the data, you may try to debug the problem yourself with the following commands. (Setting CT_MANAGEMENT_SCOPE=2 puts the RSCT commands in peer domain scope.)

# export CT_MANAGEMENT_SCOPE=2
# stopsrc -g rsct_rm
0513-044 The IBM.ERRM Subsystem was requested to stop.
# stopsrc -g rsct
0513-044 The ctrmc Subsystem was requested to stop.
0513-044 The ctcas Subsystem was requested to stop.
# lssrc -a
Subsystem         Group            PID     Status 
 ctcas            rsct                     inoperative
 ctrmc            rsct                     inoperative
 IBM.SensorRM     rsct_rm                  inoperative
 IBM.LPRM         rsct_rm                  inoperative
 cthats           cthats                   inoperative
 cthags           cthags                   inoperative
 cthagsglsm       cthags                   inoperative
 IBM.ConfigRM     rsct_rm                  inoperative
 IBM.ERRM         rsct_rm                  inoperative
 IBM.HostRM       rsct_rm                  inoperative
 IBM.GblResRM     rsct_rm                  inoperative
 IBM.RecoveryRM   rsct_rm                  inoperative
 IBM.StorageRM    rsct_rm                  inoperative
 IBM.TestRM       rsct_rm                  inoperative
 IBM.AuditRM      rsct_rm                  inoperative

 

All subsystems are now stopped. Try bringing them up one by one, sometimes by group, and sometimes more than once. This is not the recommended procedure; all of this is normally automatic, and RSCT is supposed to start the necessary subsystems itself.

# startsrc -g rsct
0513-059 The ctcas Subsystem has been started. Subsystem PID is 626.
0513-059 The ctrmc Subsystem has been started. Subsystem PID is 627.

# startsrc -g rsct_rm
0513-059 The IBM.SensorRM Subsystem has been started. Subsystem PID is 1785.
0513-059 The IBM.LPRM Subsystem has been started. Subsystem PID is 1786.
0513-059 The IBM.HostRM Subsystem has been started. Subsystem PID is 1788.
0513-059 The IBM.GblResRM Subsystem has been started. Subsystem PID is 1789.
0513-059 The IBM.RecoveryRM Subsystem has been started. Subsystem PID is 1790.
0513-059 The IBM.StorageRM Subsystem has been started. Subsystem PID is 1792.
0513-059 The IBM.TestRM Subsystem has been started. Subsystem PID is 1794.
0513-029 The IBM.ERRM Subsystem is already active.
Multiple instances are not supported.
0513-029 The IBM.AuditRM Subsystem is already active.
Multiple instances are not supported.
0513-029 The IBM.ConfigRM Subsystem is already active.
Multiple instances are not supported.

# startsrc -g cthats
0513-059 The cthats Subsystem has been started. Subsystem PID is 2463.

# startsrc -g cthags
0513-059 The cthags Subsystem has been started. Subsystem PID is 3639.
0513-059 The cthagsglsm Subsystem has been started. Subsystem PID is 3640.

 

lssrc -a should now show everything up and running again. If not, repeat the same commands.

node02:~ # lssrc -a
Subsystem         Group            PID     Status 
 ctcas            rsct             626     active
 ctrmc            rsct             627     active
 IBM.ERRM         rsct_rm          678     active
 IBM.AuditRM      rsct_rm          709     active
 IBM.ConfigRM     rsct_rm          725     active
 IBM.SensorRM     rsct_rm          1785    active
 IBM.LPRM         rsct_rm          1786    active
 IBM.HostRM       rsct_rm          1788    active
 cthats           cthats           2463    active
 cthags           cthags           3639    active
 cthagsglsm       cthags           3640    active
 IBM.GblResRM     rsct_rm                  inoperative
 IBM.RecoveryRM   rsct_rm                  inoperative
 IBM.StorageRM    rsct_rm                  inoperative
 IBM.TestRM       rsct_rm                  inoperative

 

Your output may vary. The important subsystem IBM.RecoveryRM is still inoperative. The output of fcslogrpt /var/log/messages shows that the daemon started but was stopped again by SRC due to some condition:

# fcslogrpt /var/log/messages
Jun 19 16:37:08 node02 RecoveryRM[1790]: (Recorded using libct_ffdc.a cv 2)
	Error ID: 824....o9CsD/nPT/.XeXa/...................
	Reference ID:  
	Template ID: 0
	Details File:  
	Location: RSCT,IBM.RecoveryRMd.C,1.21.2.6,165           
	RECOVERYRM_INFO_0_ST IBM.RecoveryRM daemon has started. 
Jun 19 16:37:08 node02 RecoveryRM[1790]: (Recorded using libct_ffdc.a cv 2)
	Error ID: 822....o9CsD/BMU/.XeXa/...................
	Reference ID:  
	Template ID: 0
	Details File:  
	Location: RSCT,RecoveryRMDaemon.C,1.15.2.22,400         
	RECOVERYRM_2621_402_ER 2621-402 IBM.RecoveryRM daemon stopped by SRC command 
        or exiting due to an error condition . Error id  0

The error IDs are 822 and 824.

Test whether we can see output from lsrpdomain.

# lsrpdomain
Name                     OpState RSCTActiveVersion MixedVersions TSPort GSPort 
db2domain_20120619135407 Offline 3.1.2.2           No            12347  12348  

 

The good news is that we are now able to see output from lsrpdomain.

Repeat the sequence of these commands on all other nodes.
# stopsrc -g rsct
# stopsrc -g rsct_rm
# lssrc -a
# startsrc -g rsct
# startsrc -g rsct_rm
# startsrc -g cthats
# startsrc -g cthags

 

RSCT may take the cluster offline again if some critical resource, such as the network, is missing. If the resources are up, RSCT may bring up IBM.RecoveryRM or IBM.ConfigRM automatically and bring the cluster online. It may also happen that RSCT reboots a node in order to recover.

One of the main reasons for cluster issues is the network. Check whether something has changed after you installed the product, such as /etc/hosts, the gateway address, the netmask, or the broadcast address.
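
A quick way to eyeball the current interface, netmask, default route, and the gateway RSCT is watching is shown below; eth0 is the interface used throughout this setup, so substitute your own.

# ip addr show eth0
# ip route show | grep default
# cat /var/ct/cfg/netmon.cf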

# lscomg 
Name Sensitivity Period Priority Broadcast SourceRouting NIMPathName NIMParameters Grace MediaType UseForNodeMembership 
CG1  4           0.8    1        Yes       Yes                                     60    1 (IP)    1                    
# lscomg -i CG1
Name NodeName                   IPAddress       Subnet      SubnetMask  
eth0 node04.purescale.ibm.local 192.168.142.104 192.168.0.0 255.255.0.0 
eth0 node02.purescale.ibm.local 192.168.142.102 192.168.0.0 255.255.0.0 
eth0 node03.purescale.ibm.local 192.168.142.103 192.168.0.0 255.255.0.0 

 

Match the IP addresses from the output of lscomg against /etc/hosts on all hosts and see whether something has changed. Run the command /usr/sbin/rsct/bin/ctsvhbal to see which addresses RSCT is using for the network.

# ctsvhbal
ctsvhbal: The Host Based Authentication (HBA) mechanism identities for
the local system are:

                Identity:  node02.purescale.ibm.local

                Identity:  fe80::20c:29ff:fe16:149f%eth0

                Identity:  fe80::20c:29ff:fe16:149f

                Identity:  192.168.142.102

ctsvhbal: In order for remote authentication to be successful, at least one
of the above identities for the local system must appear in the trusted host
list on the remote node where a service application resides.  Ensure that at
least one host name and one network address identity from the above list
appears in the trusted host list on any remote systems that act as servers
for applications executing on this local system.

 

No matter what people say, your /etc/hosts file should have the following format. Please note that the short name comes after the FQDN.

192.168.142.101 node01.purescale.ibm.local node01
192.168.142.102 node02.purescale.ibm.local node02
192.168.142.103 node03.purescale.ibm.local node03
192.168.142.104 node04.purescale.ibm.local node04
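
To confirm that each host resolves the way RSCT expects, a couple of quick checks can help: hostname -f should return the FQDN, and getent hosts (which honours /etc/hosts via nsswitch.conf) should return the line above with the FQDN before the short name.

# hostname -f
# getent hosts node02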

 

In the end, you should be able to see the domain up and all nodes online.

node02:~ # lsrpdomain
Name                     OpState RSCTActiveVersion MixedVersions TSPort GSPort 
db2domain_20120619135407 Online  3.1.2.2           No            12347  12348  
node02:~ # lsrpnode
Name   OpState RSCTVersion 
node03 Online  3.1.2.2     
node02 Online  3.1.2.2     
node04 Online  3.1.2.2     

This may not be necessary, but start GPFS.

# mmstartup -a
# mmmount all
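
To confirm that GPFS is active on every node and that the file systems are mounted, these two GPFS commands (also in /usr/lpp/mmfs/bin) give a quick overview.

# mmgetstate -a
# mmlsmount all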

 

Type lssam to see the resources.

node02:~ # lssam
Pending online IBM.ResourceGroup:ca_db2psc_0-rg Nominal=Online
        '- Offline IBM.Application:ca_db2psc_0-rs
                |- Offline IBM.Application:ca_db2psc_0-rs:node02
                '- Offline IBM.Application:ca_db2psc_0-rs:node03
Pending online IBM.ResourceGroup:db2_db2psc_0-rg Nominal=Online
        '- Offline IBM.Application:db2_db2psc_0-rs
                |- Offline IBM.Application:db2_db2psc_0-rs:node02
                |- Offline IBM.Application:db2_db2psc_0-rs:node03
                '- Offline IBM.Application:db2_db2psc_0-rs:node04
Pending online IBM.ResourceGroup:db2_db2psc_1-rg Nominal=Online
        '- Offline IBM.Application:db2_db2psc_1-rs
                |- Offline IBM.Application:db2_db2psc_1-rs:node02
                |- Offline IBM.Application:db2_db2psc_1-rs:node03
                '- Offline IBM.Application:db2_db2psc_1-rs:node04
Pending online IBM.ResourceGroup:db2_db2psc_2-rg Nominal=Online
        '- Offline IBM.Application:db2_db2psc_2-rs
                |- Offline IBM.Application:db2_db2psc_2-rs:node02
                |- Offline IBM.Application:db2_db2psc_2-rs:node03
                '- Offline IBM.Application:db2_db2psc_2-rs:node04
Online IBM.ResourceGroup:db2mnt-db2sd_20120619135527-rg Nominal=Online
        '- Online IBM.Application:db2mnt-db2sd_20120619135527-rs
                |- Online IBM.Application:db2mnt-db2sd_20120619135527-rs:node02
                |- Online IBM.Application:db2mnt-db2sd_20120619135527-rs:node03
                '- Online IBM.Application:db2mnt-db2sd_20120619135527-rs:node04
Failed offline IBM.ResourceGroup:idle_db2psc_997_node02-rg Control=MemberInProblemState Nominal=Online                                                          
        '- Failed offline IBM.Application:idle_db2psc_997_node02-rs
                '- Failed offline IBM.Application:idle_db2psc_997_node02-rs:node02                                                                              
Online IBM.ResourceGroup:idle_db2psc_997_node03-rg Nominal=Online
        '- Online IBM.Application:idle_db2psc_997_node03-rs
                '- Online IBM.Application:idle_db2psc_997_node03-rs:node03
Online IBM.ResourceGroup:idle_db2psc_997_node04-rg Nominal=Online
        '- Online IBM.Application:idle_db2psc_997_node04-rs
                '- Online IBM.Application:idle_db2psc_997_node04-rs:node04
Failed offline IBM.ResourceGroup:idle_db2psc_998_node02-rg Control=MemberInProblemState Nominal=Online                                                          
        '- Failed offline IBM.Application:idle_db2psc_998_node02-rs
                '- Failed offline IBM.Application:idle_db2psc_998_node02-rs:node02                                                                              
Online IBM.ResourceGroup:idle_db2psc_998_node03-rg Nominal=Online
        '- Online IBM.Application:idle_db2psc_998_node03-rs
                '- Online IBM.Application:idle_db2psc_998_node03-rs:node03
Online IBM.ResourceGroup:idle_db2psc_998_node04-rg Nominal=Online
        '- Online IBM.Application:idle_db2psc_998_node04-rs
                '- Online IBM.Application:idle_db2psc_998_node04-rs:node04
Failed offline IBM.ResourceGroup:idle_db2psc_999_node02-rg Control=MemberInProblemState Nominal=Online                                                          
        '- Failed offline IBM.Application:idle_db2psc_999_node02-rs
                '- Failed offline IBM.Application:idle_db2psc_999_node02-rs:node02                                                                              
Online IBM.ResourceGroup:idle_db2psc_999_node03-rg Nominal=Online
        '- Online IBM.Application:idle_db2psc_999_node03-rs
                '- Online IBM.Application:idle_db2psc_999_node03-rs:node03
Online IBM.ResourceGroup:idle_db2psc_999_node04-rg Nominal=Online
        '- Online IBM.Application:idle_db2psc_999_node04-rs
                '- Online IBM.Application:idle_db2psc_999_node04-rs:node04
Pending online IBM.ResourceGroup:primary_db2psc_900-rg Nominal=Online
        '- Offline IBM.Application:primary_db2psc_900-rs Control=StartInhibited
                |- Offline IBM.Application:primary_db2psc_900-rs:node02
                '- Offline IBM.Application:primary_db2psc_900-rs:node03
Online IBM.Equivalency:ca_db2psc_0-rg_group-equ
        |- Online IBM.PeerNode:node02:node02
        '- Online IBM.PeerNode:node03:node03
Online IBM.Equivalency:cacontrol_db2psc_equ
        |- Online IBM.Application:cacontrol_db2psc_128_node02:node02
        '- Online IBM.Application:cacontrol_db2psc_129_node03:node03
Online IBM.Equivalency:db2_db2psc_0-rg_group-equ
        |- Online IBM.PeerNode:node02:node02
        |- Online IBM.PeerNode:node03:node03
        '- Online IBM.PeerNode:node04:node04
Online IBM.Equivalency:db2_db2psc_1-rg_group-equ
        |- Online IBM.PeerNode:node03:node03
        |- Online IBM.PeerNode:node04:node04
        '- Online IBM.PeerNode:node02:node02
Online IBM.Equivalency:db2_db2psc_2-rg_group-equ
        |- Online IBM.PeerNode:node04:node04
        |- Online IBM.PeerNode:node02:node02
        '- Online IBM.PeerNode:node03:node03
Online IBM.Equivalency:db2_public_network_db2psc_0
        |- Online IBM.NetworkInterface:eth0:node02
        |- Online IBM.NetworkInterface:eth0:node03
        '- Online IBM.NetworkInterface:eth0:node04
Online IBM.Equivalency:db2mnt-db2sd_20120619135527-rg_group-equ
        |- Online IBM.PeerNode:node02:node02
        |- Online IBM.PeerNode:node03:node03
        '- Online IBM.PeerNode:node04:node04
Online IBM.Equivalency:idle_db2psc_997_node02-rg_group-equ
        '- Online IBM.PeerNode:node02:node02
Online IBM.Equivalency:idle_db2psc_997_node03-rg_group-equ
        '- Online IBM.PeerNode:node03:node03
Online IBM.Equivalency:idle_db2psc_997_node04-rg_group-equ
        '- Online IBM.PeerNode:node04:node04
Online IBM.Equivalency:idle_db2psc_998_node02-rg_group-equ
        '- Online IBM.PeerNode:node02:node02
Online IBM.Equivalency:idle_db2psc_998_node03-rg_group-equ
        '- Online IBM.PeerNode:node03:node03
Online IBM.Equivalency:idle_db2psc_998_node04-rg_group-equ
        '- Online IBM.PeerNode:node04:node04
Online IBM.Equivalency:idle_db2psc_999_node02-rg_group-equ
        '- Online IBM.PeerNode:node02:node02
Online IBM.Equivalency:idle_db2psc_999_node03-rg_group-equ
        '- Online IBM.PeerNode:node03:node03
Online IBM.Equivalency:idle_db2psc_999_node04-rg_group-equ
        '- Online IBM.PeerNode:node04:node04
Online IBM.Equivalency:instancehost_db2psc-equ
        |- Online IBM.Application:instancehost_db2psc_node03:node03
        |- Online IBM.Application:instancehost_db2psc_node02:node02
        '- Online IBM.Application:instancehost_db2psc_node04:node04
Online IBM.Equivalency:primary_db2psc_900-rg_group-equ
        |- Online IBM.PeerNode:node02:node02
        '- Online IBM.PeerNode:node03:node03
node02:~ # 

Troubleshooting – Part 2

The first part of troubleshooting was more drastic, and it is unlikely that you will be in that situation. This part assumes that you can at least see an online RSCT peer domain from one of the nodes and that you do not see the message:

A Resource Manager terminated while attempting to enumerate resources for this command

Again, the first thing you must do is use the db2cluster command to try to verify and repair the resources.

$ db2cluster -cm -verify -resources
$ db2cluster -cm -repair -resources

As noted earlier, keep a copy of ~/sqllib/db2nodes.cfg. Check the contents of the db2nodes.cfg file; if it is messed up or contains garbage, restore it from your copy and try to verify and repair the resources again before going any further.

The most common reason for a node or a peer domain being offline is that the GPFS mount points are not mounted. This can happen for several reasons. One of them is that the node has been fenced by RSCT because it was unreachable; once the node is fenced, GPFS will not be mounted. The underlying cause could be a tie-breaker issue, the network, or something else.

Run the lsrpnode command to see which node is offline.

# lsrpnode -B -Q -P
Name   OpState RSCTVersion Quorum Preferred Tiebreaker 
node04 Offline 3.1.2.2     Yes    Yes       Yes        
node02 Online  3.1.2.2     Yes    Yes       Yes        
node03 Online  3.1.2.2     Yes    Yes       Yes

Go to the offline node and try these commands. (Note: you should not run these commands unless the procedure mentioned below fails to yield results.)

# stopsrc -g rsct
# stopsrc -g rsct_rm
# stopsrc -g cthats
# stopsrc -g cthags
# mmstartup -a
# mmmount all
# startsrc -g rsct
# startsrc -g rsct_rm
# startsrc -g cthats
# startsrc -g cthags

Again, there should not normally be any need to run the above commands; run them only in extreme circumstances, when the following procedure fails.

Use this method first to bring a node online.

# db2cluster -cm -start -host <hostname>   --> Run from another online node

If you get the "Resource Manager terminated" error after running the lsrpdomain command, chances are that GPFS was not mounted through RSCT. Type the command db2cluster -cfs -mount -filesystem <fsname> and repeat it for all other file systems.

If you do not remember the file system names, run these GPFS commands directly; you need root access to do so. The mm commands are in /usr/lpp/mmfs/bin.

# mmstartup -a --> Run from the node that is offline.
# mmmount all

Wait a few minutes for RSCT to bring the nodes online. Type lsrpnode and lsrpdomain on each node to see whether the node and the domain come online.

The other most common problem is the network. Try this command to see which nodes are down from the network perspective.

# lssrc -ls cthats
Subsystem         Group            PID     Status
 cthats           cthats           19413   active
Network Name   Indx Defd  Mbrs  St   Adapter ID      Group ID
CG1            [ 0] 3     2     S    192.168.142.103 192.168.142.103
CG1            [ 0] eth0             0x87e116f9      0x87e11f13
HB Interval = 0.800 secs. Sensitivity = 4 missed beats
Ping Grace Period Interval = 60.000 secs.
Missed HBs: Total: 0 Current group: 0
Packets sent    : 5802 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 6376 ICMP 0 Dropped: 0
NIM's PID: 19466
  2 locally connected Clients with PIDs:
 rmcd( 20076) hagsd( 19454) 
  Dead Man Switch Enabled:
     reset interval = 1 seconds
     trip  interval = 67 seconds
     Watchdog module in use: softdog
  Client Heartbeating Enabled. Period: 6 secs. Timeout: 13 secs.
  Configuration Instance = 1340128716
  Daemon employs no security
  Segments pinned: Text Data Stack.
  Text segment size: 650 KB. Static data segment size: 1475 KB.
  Dynamic data segment size: 1190. Number of outstanding malloc: 97
  User time 0 sec. System time 3 sec.
  Number of page faults: 0. Process swapped out 0 times.
  Number of nodes up: 2. Number of nodes down: 1.
  Nodes down : 3 

 

Make sure that you can ping every node's IP address from every other node, using the addresses as RSCT knows them; see the sketch after the lscomg output below.

# lscomg
Name Sensitivity Period Priority Broadcast SourceRouting NIMPathName NIMParameters Grace MediaType UseForNodeMembership 
CG1  4           0.8    1        Yes       Yes                                     60    1 (IP)    1                    
# lscomg -i CG1
Name NodeName                   IPAddress       Subnet      SubnetMask  
eth0 node04.purescale.ibm.local 192.168.142.104 192.168.0.0 255.255.0.0 
eth0 node02.purescale.ibm.local 192.168.142.102 192.168.0.0 255.255.0.0 
eth0 node03.purescale.ibm.local 192.168.142.103 192.168.0.0 255.255.0.0 
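
A quick reachability sketch using the addresses reported by lscomg above; run it from each node and adjust the list to your own cluster.

# for ip in 192.168.142.102 192.168.142.103 192.168.142.104; do ping -c 2 $ip; done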

Check the status of the nodes using lsrsrc -Ab IBM.PeerNode and look for OpUsabilityState; it should be 1.

# lsrsrc -Ab IBM.PeerNode
Resource Persistent and Dynamic Attributes for IBM.PeerNode
resource 1:
	Name               = "node02"
	NodeList           = {1}
	RSCTVersion        = "3.1.2.2"
	ClassVersions      = {}
	CritRsrcProtMethod = 0
	IsQuorumNode       = 1
	IsPreferredGSGL    = 1
	NodeUUID           = ""
	ActivePeerDomain   = "db2domain_20120619135407"
	NodeNameList       = {"node02"}
	OpState            = 1
	ConfigChanged      = 0
	CritRsrcActive     = 1
	OpUsabilityState   = 1

If OpUsabilityState is not 1, use the command runact -s "Name like '%'" IBM.PeerNode SetOpUsabilityState StateValue=1.

If RSCT does not mount GPFS, it is most probably due to lost quorum. Run the fcslogrpt /var/log/messages command and look for a NO_QUORUM message.

The lsrpnode command may show the node as online in a peer domain that supports the VerifyQuorum state action, but the node may actually be in a subset of the cluster that does not have operational quorum (NO_QUORUM). As a result, the node is I/O fenced and expelled from the GPFS cluster, which prevents it from mounting the file system.

You may also find yourself in a situation where the peer domain is subdivided: for example, each node sees itself as online but sees the other nodes as offline.

Check your tie breaker.

# lsrsrc -c IBM.PeerNode
Resource Class Persistent Attributes for IBM.PeerNode
resource 1:
	CommittedRSCTVersion     = ""
	ActiveVersionChanging    = 0
	OpQuorumOverride         = 0
	CritRsrcProtMethod       = 1
	OpQuorumTieBreaker       = "Operator"
	QuorumType               = 0
	QuorumGroupName          = ""
	Fanout                   = 32
	OpFenceGroup             = "gpfs_grp"
	NodeCleanupCommand       = "/usr/sbin/rsct/sapolicies/db2/hostCleanupV10.ksh"
	NodeCleanupCriteria      = "Enable,RetryCount=10,RetryInterval=30000,
               Parms= 1 DB2 0 CLEANUP_ALL"
	QuorumLessStartupTimeout = 120

The tie-breaker used here is Operator, which means a poor DB2 DBA has to break ties by hand. You do not want to be in this situation. Your tie-breaker should be a SCSI-3 PR capable disk or an IP address. Please remember that the IP address tie-breaker is not officially supported, and you will not find it mentioned in the Information Center, mainly because it is not a recommended approach. But you can use an IP address as a tie-breaker when you do not have a SCSI-3 PR disk.

# db2cluster -cm -set -tiebreaker -ip 192.168.142.2

Generally you would use a highly available IP address, such as the router gateway address. Please remember that if the gateway goes down, you have a trouble ticket on your hands and are not in a good situation.

# lsrsrc -c IBM.PeerNode OpQuorumTieBreaker
Resource Class Persistent Attributes for IBM.PeerNode
resource 1:
	OpQuorumTieBreaker = "db2_Quorum_Network_192_168_142_2:21_51_17"

List all tie breakers.

# lsrsrc -Ab IBM.TieBreaker Name
Resource Persistent Attributes for IBM.TieBreaker
resource 1:
	Name = "db2_Quorum_Network_192_168_142_2:21_51_17"
resource 2:
	Name = "Operator"
resource 3:
	Name = "Fail"

Suppose you want to delete db2_Quorum_Network_192_168_142_2:21_51_17. You cannot, since it is the active tie-breaker. You have to set another tie-breaker first and then delete it.

# export CT_MANAGEMENT_SCOPE=2
# rmrsrc -s "Name == 'db2_Quorum_Network_192_168_142_2:21_51_17'" IBM.TieBreaker
2632-092 The active tie breaker cannot be removed.

You can add a majority tie-breaker and then delete the IP address tie-breaker. For example:

# db2cluster -cm -set -tiebreaker -majority
Configuring quorum device for domain 'db2domain_20120619135407' ...
Configuring quorum device for domain 'db2domain_20120619135407' was successful.
# rmrsrc -s "Name == 'db2_Quorum_Network_192_168_142_2:21_51_17'" IBM.TieBreaker
# lsrsrc -Ab IBM.TieBreaker Name
Resource Persistent Attributes for IBM.TieBreaker
resource 1:
	Name = "db2_Quorum_MNS:22_7_6"
resource 2:
	Name = "Fail"
resource 3:
	Name = "Operator"

If you want to change the tie-breaker back to Operator:

# lsrsrc -c IBM.PeerNode OpQuorumTieBreaker
Resource Class Persistent Attributes for IBM.PeerNode
resource 1:
	OpQuorumTieBreaker = "db2_Quorum_MNS:22_7_6"
# chrsrc -c IBM.PeerNode OpQuorumTieBreaker="Operator"

Please remember that you do not want to set the tie-breaker to Operator: a human being then has to resolve tie situations, and RSCT will sit and wait on you to resolve conflicts or problems. But you now have a clue about how to make yourself an important person as a DBA.

You can also set a disk as the tie-breaker. The disk that you want to use must pass the SCSI-3 PR Type 5 test; please see my other article for how to test the disk.

# /lib/udev/scsi_id -g -u /dev/sdh
1494554000000000031323334353637383930000000000000
# db2cluster -cm -set -tiebreaker -disk WWID=1494554000000000031323334353637383930000000000000
Configuring quorum device for domain 'db2domain_20120619135407' ...
Configuring quorum device for domain 'db2domain_20120619135407' was successful.

# lsrsrc -c IBM.PeerNode OpQuorumTieBreaker
Resource Class Persistent Attributes for IBM.PeerNode
resource 1:
	OpQuorumTieBreaker = "db2_Quorum_Disk:22_19_43"

Check who has a persistent reservation on this disk.

# sg_persist -d /dev/sdh --in --read-keys
  IET       VIRTUAL-DISK      0
  Peripheral device type: disk
  PR generation=0x2, there are NO registered reservation keys

 

Troubleshooting – Part 3

For example, if one node shows as offline in the lsrpnode output, follow these steps.

db2psc@node04:~> ssh node02 db2cluster -cm -verify -resources
Cluster manager resource states for the DB2 instance are consistent.
db2psc@node04:~> ssh node03 db2cluster -cm -verify -resources
Cluster manager resource states for the DB2 instance are consistent.
db2psc@node04:~> ssh node04 db2cluster -cm -verify -resources
Cluster manager resource states for the DB2 instance are inconsistent.  
Refer to the db2diag.log for more information on inconsistencies.

Our node04 has the problem.

$ db2cluster -cm -repair -resources
Query failed. Refer to db2diag.log and the DB2 Information Center for details.
There was an error with one of the issued cluster manager commands. 
Refer to db2diag.log and the DB2 Information Center for details.

The above command did not work, since the domain is not even online on this node.
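
Since the repair cannot run while the domain is offline on node04, a plausible next step, following the procedure from Part 2, is to start the host from one of the online nodes and then verify again (host names as used above):

db2psc@node04:~> ssh node02 db2cluster -cm -start -host node04
db2psc@node04:~> db2cluster -cm -verify -resources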

Disclaimer: Use whatever you get from here at your own risk. Like a smart DBA, practice these commands first in a test environment and prepare yourself to face and fix the issues that may come up later in real scenarios. That way, you will certainly increase your dollar value.