Before Install
The DB2 10.1 pureScale feature uses RSCT, GPFS, and Tivoli SA MP, which is itself an application built on RSCT. A DB2 DBA can take care of DB2, but there are additional things to watch, particularly around RSCT. With DB2 pureScale, you move from a standalone DB2 on a single host to a cluster of servers. The complexity increases, but it is not rocket science to learn.
A couple of things a DBA must watch for or take care of:
- Decide your host names, IP addresses, and gateway IP addresses up front. You do not have the luxury of changing them after the install. There is a lengthy procedure to change the IP addresses or host names, and if I get time, I will document it some time later. This is not DB2 but RSCT, which keeps the IP addresses and related settings in its configuration files; if they change, things stop working.
- Read through the DB2 pureScale prerequisites in the Information Center even if you are an experienced DB2 DBA. It is not DB2 but RSCT, Tivoli SA MP, and the high-speed interconnect that you will be concerned with. Check that page for the supported platforms and environments.
- Stick to the supported OS and the documented release number even if a newer release is available. Several things can go wrong on a newer OS release, so avoid the temptation to go with the latest and greatest unless you are willing to open every PMR that comes with the newer OS software.
- Set up NTP before the install and do not skip it even if you are 100% sure that your servers' clocks are dead accurate. My sample ntp.conf files for the server and for the client are posted here for easy reference. There are any number of ways to set this up; this is how I do it: I use a separate machine as an NTP server within the network and have all nodes of the DB2 pureScale cluster sync time with this internal server. Your sysadmin may have a better way.
## ntp.conf for the NTP server
tinker panic 0
restrict default kod nomodify notrap
restrict 127.0.0.1
# -- CLIENT NETWORK -------
restrict 192.168.142.0 mask 255.255.0.0 nomodify notrap
# --- OUR TIMESERVERS -----
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
server 2.pool.ntp.org iburst
server 127.127.1.0                       # Undisciplined Local Clock.
# fudge 127.127.1.0 stratum 9
driftfile /var/lib/ntp/drift/ntp.drift   # path for drift file
broadcastdelay 0.008
logfile /var/log/ntp                     # alternate log file
keys /etc/ntp.keys                       # path for keys file
trustedkey 1                             # define trusted keys
requestkey 1                             # key (7) for accessing server variables
## ntp.conf for the client (i.e. the pureScale nodes)
tinker panic 0
restrict default kod nomodify notrap
restrict 127.0.0.1
# -- CLIENT NETWORK -------
# --- OUR TIMESERVERS -----
# The IP address below is the address of the local NTP server that you set up.
server 192.168.142.101 iburst
server 127.127.1.0                       # Undisciplined Local Clock.
# fudge 127.127.1.0 stratum 9
driftfile /var/lib/ntp/drift/ntp.drift   # path for drift file
broadcastdelay 0.008
logfile /var/log/ntp                     # alternate log file
keys /etc/ntp.keys                       # path for keys file
trustedkey 1                             # define trusted keys
requestkey 1                             # key (7) for accessing server variables
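Once ntpd is running, a quick sanity check on each pureScale node (assuming the standard ntpq utility that ships with the ntp package) is to confirm it is actually syncing with the internal server:

# ntpq -p

Look for the internal server (192.168.142.101 in my setup) in the peer list, with an asterisk in front of the peer the node is currently synchronized to.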
- Add the following entries to your /etc/sysctl.conf file so that Automatic Client Reroute works as expected.
# Configure all of the TCP/IP parameters for pureScale on Linux
net.ipv4.tcp_keepalive_time = 10
net.ipv4.tcp_keepalive_intvl = 3
net.ipv4.tcp_keepalive_probes = 10
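As a side note, one way to apply and double-check these values without a reboot (standard sysctl usage on Linux) is:

# sysctl -p
# sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes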
- On Linux, some modules must be blacklisted; do not skip this before the install. For example, add the following entries for SUSE Linux in /etc/modprobe.d/blacklist.
## Entries required for DB2 pureScale
blacklist iTCO_wdt
blacklist iTCO_vendor_support
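As a quick check (my own habit, not an official install step), you can confirm the watchdog modules are not currently loaded, and unload them if they are:

# lsmod | grep -i itco
# rmmod iTCO_wdt iTCO_vendor_support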
After Install
- Chances are that your sysadmin will take away the root password from you after the product is installed. Before surrendering your root password to control-freak, inept sysadmins, add the following entry through visudo.
# visudo
# Add the following entry and replace db2psc with your DB2 pureScale instance name
db2psc ALL=(ALL) NOPASSWD: ALL
- Chances are that you have smart sysadmins who will open a trouble ticket against you if you do the above. Take your smart sysadmins to lunch and ask for sudo on everything under the following directories (a minimal sudoers sketch follows the list).
/opt/ibm/db2/V10.1
/usr/sbin/rsct
/usr/lpp/mmfs
/opt/IBM/tsamp/sam
/var/ct
/var/log
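If they agree, one possible sudoers sketch (edit it with visudo) is shown below. The alias name is made up, and the paths are narrowed to the bin/adm subdirectories because sudoers wildcards do not cross directory levels; adjust both to your environment.

Cmnd_Alias DB2PS_CMDS = /opt/ibm/db2/V10.1/bin/*, /opt/ibm/db2/V10.1/adm/*, \
                        /usr/sbin/rsct/bin/*, /usr/lpp/mmfs/bin/*, /opt/IBM/tsamp/sam/bin/*
db2psc ALL=(ALL) NOPASSWD: DB2PS_CMDS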
- If, in the name of corporate policy, your sysadmin refuses to give you anything, chances are that your working relationship is not going well. In that case you are actually safe and protected, since your life will be much simpler and easier: you have a sysadmin to blame for every issue that comes out of RSCT.
- After the install, the IP address of the gateway is stored in the file /var/ct/cfg/netmon.cf. This file may look like the following. If the gateway IP address changes in the future, do not forget to update this file.
## Contents of /var/ct/cfg/netmon.cf file
node02:/var/ct/cfg # cat netmon.cf
!IBQPORTONLY !ALL
!REQD eth0 192.168.142.2
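A quick way to cross-check the gateway actually configured on the host against what netmon.cf says (plain Linux tooling, nothing pureScale-specific):

# ip route show default
# netstat -rn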
- RSCT keeps information about the IP addresses internally. Try these commands after the install to verify.
# lscomg
# lscomg -i CG1
- Use lsrpdomain to find the status of the peer domain.
# lsrpdomain
- Use lsrpnode to find the status of all nodes in the domain.
# lsrpnode
- Use lssam to view the RSCT resources and their status.
# lssam
Troubleshooting – Part 1
Several things can go wrong when RSCT refuses to start. This is not a comprehensive guide; it is based on the experiences that I have had.
For example, when you type the following commands, you may see unusual output such as this.
# lsrpdomain
2610-412 A Resource Manager terminated while attempting to enumerate resources for this command.
2610-408 Resource selection could not be performed.
2610-412 A Resource Manager terminated while attempting to enumerate resources for this command.
2610-408 Resource selection could not be performed.

# lssam
lssam: No resource groups defined or cluster is offline!
The first thing you should do is use the db2cluster command to try to repair the resources.
$ db2cluster -cm -verify -resources
$ db2cluster -cm -repair -resources
You should also keep a copy of ~/sqllib/db2nodes.cfg. Check the contents of the db2nodes.cfg file; if it is messed up or contains garbage, restore it and try to verify and repair the resources again before going any further.
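For example, a simple way to keep that copy handy (the backup file name is just an illustration):

$ cp ~/sqllib/db2nodes.cfg ~/db2nodes.cfg.good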
Next, the command lssrc -a gives the status of the subsystems.
# lssrc -a
Subsystem         Group            PID     Status
 ctrmc            rsct             6374    active
 IBM.ERRM         rsct_rm          6468    active
 ctcas            rsct             7000    active
 IBM.SensorRM     rsct_rm                  inoperative
 IBM.LPRM         rsct_rm                  inoperative
 cthats           cthats                   inoperative
 cthags           cthags                   inoperative
 cthagsglsm       cthags                   inoperative
 IBM.ConfigRM     rsct_rm                  inoperative
 IBM.HostRM       rsct_rm                  inoperative
 IBM.GblResRM     rsct_rm                  inoperative
 IBM.RecoveryRM   rsct_rm                  inoperative
 IBM.StorageRM    rsct_rm                  inoperative
 IBM.TestRM       rsct_rm                  inoperative
 IBM.AuditRM      rsct_rm                  inoperative
It looks like the major subsystems such as IBM.ConfigRM and IBM.RecoveryRM are not running, so something has happened.
Type the command fcslogrpt /var/log/messages, which scans the syslog and picks up only the RSCT-related messages. Only the messages of relevance are shown below.
# fcslogrpt /var/log/messages
Jun 19 13:39:30 node02 RMCdaemon[12070]: (Recorded using libct_ffdc.a cv 2)
    Error ID: 824....GZ9sD/FLI..XeXa/...................   Reference ID:   Template ID: 0   Details File:   Location: RSCT,rmcd.c,1.80,230
    RMCD_INFO_0_ST The daemon is started.
Jun 19 13:39:41 node02 RMCdaemon[12070]: (Recorded using libct_ffdc.a cv 2)
    Error ID: 824....RZ9sD/ZIE/.XeXa/...................   Reference ID:   Template ID: 0   Details File:   Location: RSCT,rmcd.c,1.80,900
    RMCD_INFO_1_ST The daemon is stopped. Number of command that stopped the daemon 3
Jun 19 13:39:51 node02 RMCdaemon[12425]: (Recorded using libct_ffdc.a cv 2)
    Error ID: 824....bZ9sD/SCz..XeXa/...................   Reference ID:   Template ID: 0   Details File:   Location: RSCT,rmcd.c,1.80,230
    RMCD_INFO_0_ST The daemon is started.
Jun 19 13:39:58 node02 RMCdaemon[12425]: (Recorded using libct_ffdc.a cv 2)
    Error ID: 824....iZ9sD/E3O..XeXa/...................   Reference ID:   Template ID: 0   Details File:   Location: RSCT,rmcd.c,1.80,900
    RMCD_INFO_1_ST The daemon is stopped. Number of command that stopped the daemon 3
Jun 19 13:40:04 node02 RMCdaemon[12496]: (Recorded using libct_ffdc.a cv 2)
    Error ID: 824....oZ9sD/Yr61.XeXa/...................   Reference ID:   Template ID: 0   Details File:   Location: RSCT,rmcd.c,1.80,230
    RMCD_INFO_0_ST The daemon is started.
Jun 19 13:40:48 node02 RMCdaemon[12496]: (Recorded using libct_ffdc.a cv 2)
    Error ID: 824....Ua9sD/aMf0.XeXa/...................   Reference ID:   Template ID: 0   Details File:   Location: RSCT,rmcd.c,1.80,900
    RMCD_INFO_1_ST The daemon is stopped. Number of command that stopped the daemon 3
Jun 19 13:40:56 node02 RMCdaemon[13910]: (Recorded using libct_ffdc.a cv 2)
    Error ID: 824....ca9sD/jGs/.XeXa/...................   Reference ID:   Template ID: 0   Details File:   Location: RSCT,rmcd.c,1.80,230
    RMCD_INFO_0_ST The daemon is started.
Jun 19 13:54:12 node02 cthats[27562]: (Recorded using libct_ffdc.a cv 2)
    Error ID: 824....2n9sD/k2c0.XeXa/...................   Reference ID:   Template ID: 0   Details File:   Location: rsct,bootstrp.C,1.215.1.13,4956
    TS_START_ST Topology Services daemon started
    Topology Services daemon started by: SRC
    Topology Services daemon log file location /var/ct/db2domain_20120619135407/log/cthats/cthats.19.135412.C
    Topology Services daemon run directory /var/ct/db2domain_20120619135407/run/cthats/
Jun 19 13:54:13 node02 cthags[27596]: (Recorded using libct_ffdc.a cv 2)
    Error ID: 824....3n9sD/6a31.XeXa/...................   Reference ID:   Template ID: 0   Details File:   Location: RSCT,pgsd.C,1.62.1.23,695
    GS_START_ST Group Services daemon started
    DIAGNOSTIC EXPLANATION HAGS daemon started by SRC. Log file is /var/ct/2ZADmLsxxktfAP5xaAz4iW/log/cthags/trace.
Jun 19 13:54:14 node02 RMCdaemon[13910]: (Recorded using libct_ffdc.a cv 2)
    Error ID: 824....4n9sD/Fqo..XeXa/...................   Reference ID:   Template ID: 0   Details File:   Location: RSCT,rmcd.c,1.80,1102
    RMCD_INFO_1_ST The daemon is stopped. Number of command that stopped the daemon 1
Jun 19 13:54:14 node02 RMCdaemon[27668]: (Recorded using libct_ffdc.a cv 2)
    Error ID: 824....4n9sD/lD90.XeXa/...................   Reference ID:   Template ID: 0   Details File:   Location: RSCT,rmcd.c,1.80,230
    RMCD_INFO_0_ST The daemon is started.
Jun 19 13:54:17 node02 RecoveryRM[27728]: (Recorded using libct_ffdc.a cv 2)
    Error ID: 824....7n9sD/2OX..XeXa/...................   Reference ID:   Template ID: 0   Details File:   Location: RSCT,Protocol.C,1.54.1.59,442
    RECOVERYRM_INFO_7_ST This node has joined the IBM.RecoveryRM group.
    My node number = 1 ; Master node number = 1
Jun 19 13:56:45 node02 ctcasd[13498]: (Recorded using libct_ffdc.a cv 2)
    Error ID: 824....Rp9sD/6kH1.XeXa/...................   Reference ID:   Template ID: 532f32bf   Details File:   Location: rsct.core.sec,ctcas_main.c,1.30,325
    ctcasd Daemon Started
Jun 19 13:57:27 node02 ConfigRM[27445]: (Recorded using libct_ffdc.a cv 2)
    Error ID:   Reference ID:   Template ID: 0   Details File:   Location: RSCT,PeerDomain.C,1.99.22.61,18346
    CONFIGRM_HASQUORUM_ST The operational quorum state of the active peer domain has changed to HAS_QUORUM. In this state, cluster resources may be recovered and controlled as needed by management applications.
Jun 19 13:57:31 node02 RecoveryRM[27728]: (Recorded using libct_ffdc.a cv 2)
    Error ID: 825....9q9sD/B4N/.XeXa/...................   Reference ID:   Template ID: 0   Details File:   Location: RSCT,Protocol.C,1.54.1.59,2722
    RECOVERYRM_INFO_3_ST A new member has joined. Node number = 2
Jun 19 13:58:34 node02 RecoveryRM[27728]: (Recorded using libct_ffdc.a cv 2)
    Error ID: 825....8r9sD/8Dm0.XeXa/...................   Reference ID:   Template ID: 0   Details File:   Location: RSCT,Protocol.C,1.54.1.59,2722
    RECOVERYRM_INFO_3_ST A new member has joined. Node number = 3
Jun 19 14:40:26 node02 RMCdaemon[6374]: (Recorded using libct_ffdc.a cv 2)
    Error ID: 824....OSAsD/Yk6/.XeXa/...................   Reference ID:   Template ID: 0   Details File:   Location: RSCT,rmcd.c,1.80,230
    RMCD_INFO_0_ST The daemon is started.
Jun 19 14:40:36 node02 ctcasd[7000]: (Recorded using libct_ffdc.a cv 2)
    Error ID: 824....YSAsD/qYe0.XeXa/...................   Reference ID:   Template ID: 532f32bf   Details File:   Location: rsct.core.sec,ctcas_main.c,1.30,325
    ctcasd Daemon Started
Jun 19 14:52:21 node02 RecoveryRM[15977]: (Recorded using libct_ffdc.a cv 2)
    Error ID: 824....ZdAsD/b5X1.XeXa/...................   Reference ID:   Template ID: 0   Details File:   Location: RSCT,IBM.RecoveryRMd.C,1.21.2.6,165
    RECOVERYRM_INFO_0_ST IBM.RecoveryRM daemon has started.
Jun 19 14:52:21 node02 RecoveryRM[15977]: (Recorded using libct_ffdc.a cv 2)
    Error ID: 822....ZdAsD/pPX1.XeXa/...................   Reference ID:   Template ID: 0   Details File:   Location: RSCT,RecoveryRMDaemon.C,1.15.2.22,400
    RECOVERYRM_2621_402_ER 2621-402 IBM.RecoveryRM daemon stopped by SRC command or exiting due to an error condition. Error id 0
Jun 19 14:52:22 node02 StorageRM[16036]: (Recorded using libct_ffdc.a cv 2)
    Error ID:   Reference ID:   Template ID: 0   Details File:   Location: RSCT,IBM.StorageRMd.C,1.44,146
    STORAGERM_STARTED_ST IBM.StorageRM daemon has started.
Jun 19 14:52:22 node02 StorageRM[16036]: (Recorded using libct_ffdc.a cv 2)
    Error ID:   Reference ID:   Template ID: 0   Details File:   Location: RSCT,StorageRMDaemon.C,1.57,324
    STORAGERM_STOPPED_ST IBM.StorageRM daemon has been stopped.
In reality, this type of condition should not happen; it is time to open a PMR and engage DB2 support, and they will get the right folks from the RSCT team to look into it.
You should send the following output and file to support. After the -host switch, include a comma-separated list of all nodes.
# db2support -pureScale -host node02,node03,node04
# /var/log/messages file
After you send the data, you may try to debug this yourself using the following commands.
# export CT_MANAGEMENT_SCOPE=2
# stopsrc -g rsct_rm
0513-044 The IBM.ERRM Subsystem was requested to stop.
# stopsrc -g rsct
0513-044 The ctrmc Subsystem was requested to stop.
0513-044 The ctcas Subsystem was requested to stop.
# lssrc -a
Subsystem         Group            PID     Status
 ctcas            rsct                     inoperative
 ctrmc            rsct                     inoperative
 IBM.SensorRM     rsct_rm                  inoperative
 IBM.LPRM         rsct_rm                  inoperative
 cthats           cthats                   inoperative
 cthags           cthags                   inoperative
 cthagsglsm       cthags                   inoperative
 IBM.ConfigRM     rsct_rm                  inoperative
 IBM.ERRM         rsct_rm                  inoperative
 IBM.HostRM       rsct_rm                  inoperative
 IBM.GblResRM     rsct_rm                  inoperative
 IBM.RecoveryRM   rsct_rm                  inoperative
 IBM.StorageRM    rsct_rm                  inoperative
 IBM.TestRM       rsct_rm                  inoperative
 IBM.AuditRM      rsct_rm                  inoperative
All subsystems are now stopped. Try bringing them up one by one, sometimes by group, and sometimes more than once. This is not the recommended procedure, as this is all supposed to be automatic and RSCT should start the necessary subsystems on its own.
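For the "one by one" case, the SRC also lets you start an individual subsystem with startsrc -s; a small sketch using subsystem names from the lssrc -a output above (the group starts I actually ran are shown next):

# startsrc -s ctrmc
# startsrc -s IBM.ConfigRM
# startsrc -s IBM.RecoveryRM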
# startsrc -g rsct
0513-059 The ctcas Subsystem has been started. Subsystem PID is 626.
0513-059 The ctrmc Subsystem has been started. Subsystem PID is 627.
# startsrc -g rsct_rm
0513-059 The IBM.SensorRM Subsystem has been started. Subsystem PID is 1785.
0513-059 The IBM.LPRM Subsystem has been started. Subsystem PID is 1786.
0513-059 The IBM.HostRM Subsystem has been started. Subsystem PID is 1788.
0513-059 The IBM.GblResRM Subsystem has been started. Subsystem PID is 1789.
0513-059 The IBM.RecoveryRM Subsystem has been started. Subsystem PID is 1790.
0513-059 The IBM.StorageRM Subsystem has been started. Subsystem PID is 1792.
0513-059 The IBM.TestRM Subsystem has been started. Subsystem PID is 1794.
0513-029 The IBM.ERRM Subsystem is already active. Multiple instances are not supported.
0513-029 The IBM.AuditRM Subsystem is already active. Multiple instances are not supported.
0513-029 The IBM.ConfigRM Subsystem is already active. Multiple instances are not supported.
# startsrc -g cthats
0513-059 The cthats Subsystem has been started. Subsystem PID is 2463.
# startsrc -g cthags
0513-059 The cthags Subsystem has been started. Subsystem PID is 3639.
0513-059 The cthagsglsm Subsystem has been started. Subsystem PID is 3640.
lssrc -a should now show everything up and running again. If not, repeat the same commands.
node02:~ # lssrc -a
Subsystem         Group            PID     Status
 ctcas            rsct             626     active
 ctrmc            rsct             627     active
 IBM.ERRM         rsct_rm          678     active
 IBM.AuditRM      rsct_rm          709     active
 IBM.ConfigRM     rsct_rm          725     active
 IBM.SensorRM     rsct_rm          1785    active
 IBM.LPRM         rsct_rm          1786    active
 IBM.HostRM       rsct_rm          1788    active
 cthats           cthats           2463    active
 cthags           cthags           3639    active
 cthagsglsm       cthags           3640    active
 IBM.GblResRM     rsct_rm                  inoperative
 IBM.RecoveryRM   rsct_rm                  inoperative
 IBM.StorageRM    rsct_rm                  inoperative
 IBM.TestRM       rsct_rm                  inoperative
Your output may vary. The important subsystem IBM.RecoveryRM is still not operative. The output of fcslogrpt /var/log/messages shows that the daemon started but was stopped again due to some condition.
# fcslogrpt /var/log/messages
Jun 19 16:37:08 node02 RecoveryRM[1790]: (Recorded using libct_ffdc.a cv 2)
    Error ID: 824....o9CsD/nPT/.XeXa/...................   Reference ID:   Template ID: 0   Details File:   Location: RSCT,IBM.RecoveryRMd.C,1.21.2.6,165
    RECOVERYRM_INFO_0_ST IBM.RecoveryRM daemon has started.
Jun 19 16:37:08 node02 RecoveryRM[1790]: (Recorded using libct_ffdc.a cv 2)
    Error ID: 822....o9CsD/BMU/.XeXa/...................   Reference ID:   Template ID: 0   Details File:   Location: RSCT,RecoveryRMDaemon.C,1.15.2.22,400
    RECOVERYRM_2621_402_ER 2621-402 IBM.RecoveryRM daemon stopped by SRC command or exiting due to an error condition. Error id 0
The error IDs are 822 and 824.
Test whether we can see the output from lsrpdomain.
# lsrpdomain
Name                     OpState RSCTActiveVersion MixedVersions TSPort GSPort
db2domain_20120619135407 Offline 3.1.2.2           No            12347  12348
The good news is that we are able to see the output from lsrpdomain.
Repeat the sequence of these commands on all other nodes.

# stopsrc -g rsct
# stopsrc -g rsct_rm
# lssrc -a
# startsrc -g rsct
# startsrc -g rsct_rm
# startsrc -g cthats
# startsrc -g cthags
RSCT may bring the cluster offline again if some critical resources, such as the network, are missing. If the resources are up, RSCT may bring up IBM.RecoveryRM or IBM.ConfigRM automatically and may bring the cluster online. It may also happen that RSCT reboots a node in order to recover.
One of the main reasons for cluster issues is the network. Check whether something changed after you installed the product, such as /etc/hosts, the gateway address, the network mask, or the broadcast address.
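A quick way to eyeball the interface and routing settings on each host before digging into RSCT (plain Linux commands; eth0 is the interface name used in this setup):

# ip addr show eth0
# ip route show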
# lscomg
Name Sensitivity Period Priority Broadcast SourceRouting NIMPathName NIMParameters Grace MediaType UseForNodeMembership
CG1  4           0.8    1        Yes       Yes                                     60    1 (IP)    1

# lscomg -i CG1
Name NodeName                   IPAddress       Subnet      SubnetMask
eth0 node04.purescale.ibm.local 192.168.142.104 192.168.0.0 255.255.0.0
eth0 node02.purescale.ibm.local 192.168.142.102 192.168.0.0 255.255.0.0
eth0 node03.purescale.ibm.local 192.168.142.103 192.168.0.0 255.255.0.0
Match the IP addresses from the output of lscomg against /etc/hosts on all hosts and see if something changed. Run the command /usr/sbin/rsct/bin/ctsvhbal to see which addresses RSCT is using for the network.
# ctsvhbal
ctsvhbal: The Host Based Authentication (HBA) mechanism identities for the local system are:

        Identity: node02.purescale.ibm.local
        Identity: fe80::20c:29ff:fe16:149f%eth0
        Identity: fe80::20c:29ff:fe16:149f
        Identity: 192.168.142.102

ctsvhbal: In order for remote authentication to be successful, at least one of the above identities for the local system must appear in the trusted host list on the remote node where a service application resides. Ensure that at least one host name and one network address identity from the above list appears in the trusted host list on any remote systems that act as servers for applications executing on this local system.
No matter what people say, your /etc/hosts file should have the following format. Please note that the short name comes after the FQDN.
192.168.142.101   node01.purescale.ibm.local   node01
192.168.142.102   node02.purescale.ibm.local   node02
192.168.142.103   node03.purescale.ibm.local   node03
192.168.142.104   node04.purescale.ibm.local   node04
In the end, you should be able to see the domain up and all nodes online.
node02:~ # lsrpdomain
Name                     OpState RSCTActiveVersion MixedVersions TSPort GSPort
db2domain_20120619135407 Online  3.1.2.2           No            12347  12348

node02:~ # lsrpnode
Name   OpState RSCTVersion
node03 Online  3.1.2.2
node02 Online  3.1.2.2
node04 Online  3.1.2.2
This may not be necessary, but start GPFS.
# mmstartup -a
# mmmount all
Type lssam to see the resources.
node02:~ # lssam
Pending online IBM.ResourceGroup:ca_db2psc_0-rg Nominal=Online
        '- Offline IBM.Application:ca_db2psc_0-rs
                |- Offline IBM.Application:ca_db2psc_0-rs:node02
                '- Offline IBM.Application:ca_db2psc_0-rs:node03
Pending online IBM.ResourceGroup:db2_db2psc_0-rg Nominal=Online
        '- Offline IBM.Application:db2_db2psc_0-rs
                |- Offline IBM.Application:db2_db2psc_0-rs:node02
                |- Offline IBM.Application:db2_db2psc_0-rs:node03
                '- Offline IBM.Application:db2_db2psc_0-rs:node04
Pending online IBM.ResourceGroup:db2_db2psc_1-rg Nominal=Online
        '- Offline IBM.Application:db2_db2psc_1-rs
                |- Offline IBM.Application:db2_db2psc_1-rs:node02
                |- Offline IBM.Application:db2_db2psc_1-rs:node03
                '- Offline IBM.Application:db2_db2psc_1-rs:node04
Pending online IBM.ResourceGroup:db2_db2psc_2-rg Nominal=Online
        '- Offline IBM.Application:db2_db2psc_2-rs
                |- Offline IBM.Application:db2_db2psc_2-rs:node02
                |- Offline IBM.Application:db2_db2psc_2-rs:node03
                '- Offline IBM.Application:db2_db2psc_2-rs:node04
Online IBM.ResourceGroup:db2mnt-db2sd_20120619135527-rg Nominal=Online
        '- Online IBM.Application:db2mnt-db2sd_20120619135527-rs
                |- Online IBM.Application:db2mnt-db2sd_20120619135527-rs:node02
                |- Online IBM.Application:db2mnt-db2sd_20120619135527-rs:node03
                '- Online IBM.Application:db2mnt-db2sd_20120619135527-rs:node04
Failed offline IBM.ResourceGroup:idle_db2psc_997_node02-rg Control=MemberInProblemState Nominal=Online
        '- Failed offline IBM.Application:idle_db2psc_997_node02-rs
                '- Failed offline IBM.Application:idle_db2psc_997_node02-rs:node02
Online IBM.ResourceGroup:idle_db2psc_997_node03-rg Nominal=Online
        '- Online IBM.Application:idle_db2psc_997_node03-rs
                '- Online IBM.Application:idle_db2psc_997_node03-rs:node03
Online IBM.ResourceGroup:idle_db2psc_997_node04-rg Nominal=Online
        '- Online IBM.Application:idle_db2psc_997_node04-rs
                '- Online IBM.Application:idle_db2psc_997_node04-rs:node04
Failed offline IBM.ResourceGroup:idle_db2psc_998_node02-rg Control=MemberInProblemState Nominal=Online
        '- Failed offline IBM.Application:idle_db2psc_998_node02-rs
                '- Failed offline IBM.Application:idle_db2psc_998_node02-rs:node02
Online IBM.ResourceGroup:idle_db2psc_998_node03-rg Nominal=Online
        '- Online IBM.Application:idle_db2psc_998_node03-rs
                '- Online IBM.Application:idle_db2psc_998_node03-rs:node03
Online IBM.ResourceGroup:idle_db2psc_998_node04-rg Nominal=Online
        '- Online IBM.Application:idle_db2psc_998_node04-rs
                '- Online IBM.Application:idle_db2psc_998_node04-rs:node04
Failed offline IBM.ResourceGroup:idle_db2psc_999_node02-rg Control=MemberInProblemState Nominal=Online
        '- Failed offline IBM.Application:idle_db2psc_999_node02-rs
                '- Failed offline IBM.Application:idle_db2psc_999_node02-rs:node02
Online IBM.ResourceGroup:idle_db2psc_999_node03-rg Nominal=Online
        '- Online IBM.Application:idle_db2psc_999_node03-rs
                '- Online IBM.Application:idle_db2psc_999_node03-rs:node03
Online IBM.ResourceGroup:idle_db2psc_999_node04-rg Nominal=Online
        '- Online IBM.Application:idle_db2psc_999_node04-rs
                '- Online IBM.Application:idle_db2psc_999_node04-rs:node04
Pending online IBM.ResourceGroup:primary_db2psc_900-rg Nominal=Online
        '- Offline IBM.Application:primary_db2psc_900-rs Control=StartInhibited
                |- Offline IBM.Application:primary_db2psc_900-rs:node02
                '- Offline IBM.Application:primary_db2psc_900-rs:node03
Online IBM.Equivalency:ca_db2psc_0-rg_group-equ
        |- Online IBM.PeerNode:node02:node02
        '- Online IBM.PeerNode:node03:node03
Online IBM.Equivalency:cacontrol_db2psc_equ
        |- Online IBM.Application:cacontrol_db2psc_128_node02:node02
        '- Online IBM.Application:cacontrol_db2psc_129_node03:node03
Online IBM.Equivalency:db2_db2psc_0-rg_group-equ
        |- Online IBM.PeerNode:node02:node02
        |- Online IBM.PeerNode:node03:node03
        '- Online IBM.PeerNode:node04:node04
Online IBM.Equivalency:db2_db2psc_1-rg_group-equ
        |- Online IBM.PeerNode:node03:node03
        |- Online IBM.PeerNode:node04:node04
        '- Online IBM.PeerNode:node02:node02
Online IBM.Equivalency:db2_db2psc_2-rg_group-equ
        |- Online IBM.PeerNode:node04:node04
        |- Online IBM.PeerNode:node02:node02
        '- Online IBM.PeerNode:node03:node03
Online IBM.Equivalency:db2_public_network_db2psc_0
        |- Online IBM.NetworkInterface:eth0:node02
        |- Online IBM.NetworkInterface:eth0:node03
        '- Online IBM.NetworkInterface:eth0:node04
Online IBM.Equivalency:db2mnt-db2sd_20120619135527-rg_group-equ
        |- Online IBM.PeerNode:node02:node02
        |- Online IBM.PeerNode:node03:node03
        '- Online IBM.PeerNode:node04:node04
Online IBM.Equivalency:idle_db2psc_997_node02-rg_group-equ
        '- Online IBM.PeerNode:node02:node02
Online IBM.Equivalency:idle_db2psc_997_node03-rg_group-equ
        '- Online IBM.PeerNode:node03:node03
Online IBM.Equivalency:idle_db2psc_997_node04-rg_group-equ
        '- Online IBM.PeerNode:node04:node04
Online IBM.Equivalency:idle_db2psc_998_node02-rg_group-equ
        '- Online IBM.PeerNode:node02:node02
Online IBM.Equivalency:idle_db2psc_998_node03-rg_group-equ
        '- Online IBM.PeerNode:node03:node03
Online IBM.Equivalency:idle_db2psc_998_node04-rg_group-equ
        '- Online IBM.PeerNode:node04:node04
Online IBM.Equivalency:idle_db2psc_999_node02-rg_group-equ
        '- Online IBM.PeerNode:node02:node02
Online IBM.Equivalency:idle_db2psc_999_node03-rg_group-equ
        '- Online IBM.PeerNode:node03:node03
Online IBM.Equivalency:idle_db2psc_999_node04-rg_group-equ
        '- Online IBM.PeerNode:node04:node04
Online IBM.Equivalency:instancehost_db2psc-equ
        |- Online IBM.Application:instancehost_db2psc_node03:node03
        |- Online IBM.Application:instancehost_db2psc_node02:node02
        '- Online IBM.Application:instancehost_db2psc_node04:node04
Online IBM.Equivalency:primary_db2psc_900-rg_group-equ
        |- Online IBM.PeerNode:node02:node02
        '- Online IBM.PeerNode:node03:node03
node02:~ #
Troubleshooting – Part 2
The first part of troubleshooting was more drastic, and it is unlikely that you will be in that situation. This part assumes that you can at least see an online RSCT peer domain from one of the nodes and that you do not see the message
A Resource Manager terminated while attempting to enumerate resources for this command
The first thing you should do is use the db2cluster command to try to repair the resources.
$ db2cluster -cm -verify -resources
$ db2cluster -cm -repair -resources
Again, keep a copy of ~/sqllib/db2nodes.cfg. Check the contents of the db2nodes.cfg file; if it is messed up or contains garbage, restore it and try to verify and repair the resources again before going any further.
The most common cause of a node or peer domain being offline is GPFS mount points not being mounted. This can happen for several reasons. One reason is that the node has been fenced by RSCT because it is unreachable. Once a node is fenced, GPFS will not be mounted on it. That, in turn, could be due to a tie-breaker issue, the network, or something else.
Run the lsrpnode command and see which node is offline.
# lsrpnode -B -Q -P
Name   OpState RSCTVersion Quorum Preferred Tiebreaker
node04 Offline 3.1.2.2     Yes    Yes       Yes
node02 Online  3.1.2.2     Yes    Yes       Yes
node03 Online  3.1.2.2     Yes    Yes       Yes
Go to the offline node and try these commands. (Note: you should not run these commands unless the procedure mentioned below does not yield results.)
# stopsrc -g rsct
# stopsrc -g rsct_rm
# stopsrc -g cthats
# stopsrc -g cthags
# mmstartup -a
# mmmount all
# startsrc -g rsct
# startsrc -g rsct_rm
# startsrc -g cthats
# startsrc -g cthags
Again, please note that there should not normally be any need to run the above commands; run them only in extreme circumstances, when the following procedure fails.
Use this method first to bring a node online.
# db2cluster -cm -start -host <hostname>     --> Run from another online node
If you get the "Resource Manager terminated" message after running the lsrpdomain command, chances are that GPFS was not mounted through RSCT. Type the command db2cluster -cfs -mount -filesystem <fsname>.
If you do not remember the file system names, run these GPFS commands directly; you need root access to do so. The mm commands are in /usr/lpp/mmfs/bin.
# mmstartup -a     --> Run from the node which is offline.
# mmmount all
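If you need the file system names themselves (for the db2cluster -cfs -mount form), one way to list them with the standard GPFS commands, assuming they live in /usr/lpp/mmfs/bin, is:

# /usr/lpp/mmfs/bin/mmlsfs all -T
# /usr/lpp/mmfs/bin/mmlsmount all -L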
Wait a few minutes for RSCT to bring the nodes online. Type lsrpnode and lsrpdomain on each node to see if the node and the domain come online.
The other most common problem is the network. Try this command to see which nodes are down from the network perspective.
# lssrc -ls cthats
Subsystem         Group            PID     Status
 cthats           cthats           19413   active
Network Name   Indx Defd  Mbrs St   Adapter ID      Group ID
CG1            [ 0]    3     2 S    192.168.142.103 192.168.142.103
CG1            [ 0] eth0           0x87e116f9       0x87e11f13
HB Interval = 0.800 secs. Sensitivity = 4 missed beats
Ping Grace Period Interval = 60.000 secs.
Missed HBs: Total: 0 Current group: 0
Packets sent    : 5802 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 6376 ICMP 0 Dropped: 0
NIM's PID: 19466
2 locally connected Clients with PIDs:
rmcd( 20076) hagsd( 19454)
Dead Man Switch Enabled:
   reset interval = 1 seconds
   trip interval = 67 seconds
   Watchdog module in use: softdog
Client Heartbeating Enabled. Period: 6 secs. Timeout: 13 secs.
Configuration Instance = 1340128716
Daemon employs no security
Segments pinned: Text Data Stack.
Text segment size: 650 KB. Static data segment size: 1475 KB.
Dynamic data segment size: 1190. Number of outstanding malloc: 97
User time 0 sec. System time 3 sec.
Number of page faults: 0. Process swapped out 0 times.
Number of nodes up: 2. Number of nodes down: 1.
Nodes down : 3
Make sure that you can ping the IP addresses from any node to any other node as they are known to RSCT.
# lscomg
Name Sensitivity Period Priority Broadcast SourceRouting NIMPathName NIMParameters Grace MediaType UseForNodeMembership
CG1  4           0.8    1        Yes       Yes                                     60    1 (IP)    1

# lscomg -i CG1
Name NodeName                   IPAddress       Subnet      SubnetMask
eth0 node04.purescale.ibm.local 192.168.142.104 192.168.0.0 255.255.0.0
eth0 node02.purescale.ibm.local 192.168.142.102 192.168.0.0 255.255.0.0
eth0 node03.purescale.ibm.local 192.168.142.103 192.168.0.0 255.255.0.0
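For example, a quick loop run from each node (using the host names from the /etc/hosts example earlier) could look like this:

# for h in node02 node03 node04; do ping -c 2 $h; done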
Check the status of the nodes using lsrsrc -Ab IBM.PeerNode and look for OpUsabilityState; it should be 1.
# lsrsrc -Ab IBM.PeerNode
Resource Persistent and Dynamic Attributes for IBM.PeerNode
resource 1:
        Name               = "node02"
        NodeList           = {1}
        RSCTVersion        = "3.1.2.2"
        ClassVersions      = {}
        CritRsrcProtMethod = 0
        IsQuorumNode       = 1
        IsPreferredGSGL    = 1
        NodeUUID           = ""
        ActivePeerDomain   = "db2domain_20120619135407"
        NodeNameList       = {"node02"}
        OpState            = 1
        ConfigChanged      = 0
        CritRsrcActive     = 1
        OpUsabilityState   = 1
If OpUsabilityState is not 1, use the command runact -s "Name like '%'" IBM.PeerNode SetOpUsabilityState StateValue=1.
If RSCT does not mount GPFS, this is most probably due to lost quorum. Run the fcslogrpt /var/log/messages command and look for a NO_QUORUM message.
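Since fcslogrpt writes to standard output, one quick way to narrow things down is simply to grep for quorum-related entries:

# /usr/sbin/rsct/bin/fcslogrpt /var/log/messages | grep -i quorum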
The lsrpnode command may show a node as online in a peer domain that supports the VerifyQuorum state action, yet the node may be in a subset of the cluster that does not have cluster quorum (NO_QUORUM). As a result, the node is I/O-fenced and kicked out of the GPFS cluster, which prevents it from mounting the file system.
You may also be in a situation where the peer domain is subdivided; for example, each node sees itself as online but sees the other nodes as offline.
Check your tie breaker.
# lsrsrc -c IBM.PeerNode
Resource Class Persistent Attributes for IBM.PeerNode
resource 1:
        CommittedRSCTVersion     = ""
        ActiveVersionChanging    = 0
        OpQuorumOverride         = 0
        CritRsrcProtMethod       = 1
        OpQuorumTieBreaker       = "Operator"
        QuorumType               = 0
        QuorumGroupName          = ""
        Fanout                   = 32
        OpFenceGroup             = "gpfs_grp"
        NodeCleanupCommand       = "/usr/sbin/rsct/sapolicies/db2/hostCleanupV10.ksh"
        NodeCleanupCriteria      = "Enable,RetryCount=10,RetryInterval=30000, Parms= 1 DB2 0 CLEANUP_ALL"
        QuorumLessStartupTimeout = 120
The tie-breaker used here is Operator, which means a poor DB2 DBA. You do not want to be in this situation. Your tie-breaker should be a SCSI-3 PR capable disk or an IP address. Please remember that the IP address tie-breaker is not officially supported, and you will not find a mention of it in the Information Center, mainly because it is not a recommended approach. But you can use an IP address as a tie-breaker when you do not have a SCSI-3 PR disk.
# db2cluster -cm -set -tiebreaker -ip 192.168.142.2
Generally you should use a highly available IP address, such as the router gateway address. Please remember that if the gateway is down, you have a trouble ticket anyway and are not in a good situation.
# lsrsrc -c IBM.PeerNode OpQuorumTieBreaker
Resource Class Persistent Attributes for IBM.PeerNode
resource 1:
        OpQuorumTieBreaker = "db2_Quorum_Network_192_168_142_2:21_51_17"
List all tie breakers.
# lsrsrc -Ab IBM.TieBreaker Name
Resource Persistent Attributes for IBM.TieBreaker
resource 1:
        Name = "db2_Quorum_Network_192_168_142_2:21_51_17"
resource 2:
        Name = "Operator"
resource 3:
        Name = "Fail"
Suppose you want to delete db2_Quorum_Network_192_168_142_2:21_51_17. You cannot, since it is the active tie-breaker. You have to set another tie-breaker first and then delete it.
# export CT_MANAGEMENT_SCOPE=2
# rmrsrc -s "Name == 'db2_Quorum_Network_192_168_142_2:21_51_17'" IBM.TieBreaker
2632-092 The active tie breaker cannot be removed.
You can add a majority tie-breaker and then delete the IP address tie-breaker. For example:
# db2cluster -cm -set -tiebreaker -majority
Configuring quorum device for domain 'db2domain_20120619135407' ...
Configuring quorum device for domain 'db2domain_20120619135407' was successful.
# rmrsrc -s "Name == 'db2_Quorum_Network_192_168_142_2:21_51_17'" IBM.TieBreaker
# lsrsrc -Ab IBM.TieBreaker Name
Resource Persistent Attributes for IBM.TieBreaker
resource 1:
        Name = "db2_Quorum_MNS:22_7_6"
resource 2:
        Name = "Fail"
resource 3:
        Name = "Operator"
If you want to change the tie-breaker back to Operator:
# lsrsrc -c IBM.PeerNode OpQuorumTieBreaker
Resource Class Persistent Attributes for IBM.PeerNode
resource 1:
        OpQuorumTieBreaker = "db2_Quorum_MNS:22_7_6"
# chrsrc -c IBM.PeerNode OpQuorumTieBreaker="Operator"
Please remember that you do not want to set the tie-breaker to Operator, as it is then a human being who has to resolve tie situations; as a result, RSCT waits on you to resolve conflicts or problems. But you now have a clue about how to become an important person as a DBA.
Set a disk as the tie-breaker. The disk that you want to use as a tie-breaker must pass the SCSI-3 PR Type 5 test. Please see my other article for how to test the disk.
# /lib/udev/scsi_id -g -u /dev/sdh
1494554000000000031323334353637383930000000000000
# db2cluster -cm -set -tiebreaker -disk WWID=1494554000000000031323334353637383930000000000000
Configuring quorum device for domain 'db2domain_20120619135407' ...
Configuring quorum device for domain 'db2domain_20120619135407' was successful.
# lsrsrc -c IBM.PeerNode OpQuorumTieBreaker
Resource Class Persistent Attributes for IBM.PeerNode
resource 1:
        OpQuorumTieBreaker = "db2_Quorum_Disk:22_19_43"
Check who has the disk reservation on this disk.

# sg_persist -d /dev/sdh --in --read-keys
  IET       VIRTUAL-DISK      0
  Peripheral device type: disk
  PR generation=0x2, there are NO registered reservation keys
Troubleshooting – Part 3
For example, if one node shows as offline in the lsrpnode output, follow these steps.
db2psc@node04:~> ssh node02 db2cluster -cm -verify -resources
Cluster manager resource states for the DB2 instance are consistent.
db2psc@node04:~> ssh node03 db2cluster -cm -verify -resources
Cluster manager resource states for the DB2 instance are consistent.
db2psc@node04:~> ssh node04 db2cluster -cm -verify -resources
Cluster manager resource states for the DB2 instance are inconsistent. Refer to the db2diag.log for more information on inconsistencies.
Our node04 has the problem.
$ db2cluster -cm -repair -resources
Query failed. Refer to db2diag.log and the DB2 Information Center for details.
There was an error with one of the issued cluster manager commands. Refer to db2diag.log and the DB2 Information Center for details.
The above command did not work since our domain is not even online on this node.