If you need to drop a member in DB2 pureScale, the db2iupdt process expects that member to be available during the drop. The real world, however, is less cooperative than db2iupdt assumes.
For example: you have a 3-member cluster, someone migrated the data store off the SAN, and during the process one of the members became inaccessible, or a hardware failure means you cannot bring that member online. The only option left is to rebuild that node and reintegrate it into the pureScale cluster.
So – these are the steps you can take, knowing in advance that db2iupdt will fail.
Run db2instance -list from one of the available members. You will see that one member is down, and you cannot bring it back up and running.
Please note: commands prefixed with # must be run as root, and commands prefixed with $ as the db2 instance owner. Pay attention to the prefix.
$ db2instance -list
ID  TYPE    STATE                 HOME_HOST  CURRENT_HOST  ALERT  PARTITION_NUMBER  LOGICAL_PORT  NETNAME
--  ----    -----                 ---------  ------------  -----  ----------------  ------------  -------
0   MEMBER  STARTED               node02     node02        NO     0                 0             node02.purescale.ibm.local
1   MEMBER  STARTED               node03     node03        NO     0                 0             node03.purescale.ibm.local
2   MEMBER  WAITING_FOR_FAILBACK  node04     node03        NO     0                 1             node03.purescale.ibm.local
128 CF      CATCHUP               node02     node02        NO     -                 0             node02.purescale.ibm.local
129 CF      PRIMARY               node03     node03        NO     -                 0             node03.purescale.ibm.local

HOSTNAME  STATE     INSTANCE_STOPPED  ALERT
--------  -----     ----------------  -----
node02    ACTIVE    NO                NO
node03    ACTIVE    NO                NO
node04    INACTIVE  NO                YES
There is currently an alert for a member, CF, or host in the data-sharing instance. For more information on the alert, its impact, and how to clear it, run the following command: 'db2cluster -cm -list -alert'.
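Spotting the failed host by eye is easy in a three-node cluster, but the check is also simple to script. The following is my own helper sketch, not a DB2 tool: it parses the host section of a saved db2instance -list output (the here-doc mirrors the sample output above) and prints every host flagged with ALERT=YES.

```shell
# Hypothetical helper: list hosts with ALERT=YES in the host section of
# saved 'db2instance -list' output (columns: HOSTNAME STATE INSTANCE_STOPPED ALERT).
hosts_with_alert() {
  awk 'in_hosts && NF == 4 && $4 == "YES" { print $1 }
       /^HOSTNAME/ { in_hosts = 1 }' "$1"
}

# Sample host section, as captured above.
cat > /tmp/db2instance.out <<'EOF'
HOSTNAME STATE INSTANCE_STOPPED ALERT
-------- ----- ---------------- -----
node02 ACTIVE NO NO
node03 ACTIVE NO NO
node04 INACTIVE NO YES
EOF

alert_hosts=$(hosts_with_alert /tmp/db2instance.out)
echo "$alert_hosts"
```

On a live system you would feed it `db2instance -list` output captured to a file instead of the here-doc.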
So, we want to remove node04 from the cluster.
Step-1: Stop the DB2 instance
Dropping a member requires an outage, so stop DB2 first.
$ db2stop -force
Step-2: Run the db2iupdt -drop command
We know this will fail, but it will still remove the entry from the db2nodes.cfg file, which is exactly what we want for the failed node.
Login as root:
# ./db2iupdt -d -drop -m node04 db2psc
You may be told that a minor error occurred. In reality, this "minor" error is a major one.
A minor error occurred during the execution.
For more information see the DB2 installation log at "/tmp/db2iupdt.log.74369".
DBI1264E This program failed. Errors encountered during execution were written to the installation log file. Program name: db2iupdt. Log file name: /tmp/db2iupdt.log.74369.
Check your db2nodes.cfg: the node04 entry is now gone. The log file shows that db2iupdt skipped the RSCT and GPFS portion of removing the resources, so we have to clean up this leftover ourselves.
[root@node02 instance]# cat /db2sd/db2psc/sqllib_shared/db2nodes.cfg
0 node02.purescale.ibm.local 0 node02.purescale.ibm.local - MEMBER
1 node03.purescale.ibm.local 0 node03.purescale.ibm.local - MEMBER
128 node02.purescale.ibm.local 0 node02.purescale.ibm.local - CF
129 node03.purescale.ibm.local 0 node03.purescale.ibm.local - CF
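If you want to verify this step in a script rather than by eye, a one-line grep is enough. The sketch below runs against a scratch copy of the file; on a real system, point it at /db2sd/db2psc/sqllib_shared/db2nodes.cfg (the path from this example setup).

```shell
# Scratch copy of the post-drop db2nodes.cfg shown above.
cfg=/tmp/db2nodes.cfg
cat > "$cfg" <<'EOF'
0 node02.purescale.ibm.local 0 node02.purescale.ibm.local - MEMBER
1 node03.purescale.ibm.local 0 node03.purescale.ibm.local - MEMBER
128 node02.purescale.ibm.local 0 node02.purescale.ibm.local - CF
129 node03.purescale.ibm.local 0 node03.purescale.ibm.local - CF
EOF

# The drop succeeded (as far as db2nodes.cfg is concerned) if the failed
# host no longer appears anywhere in the file.
if grep -q 'node04' "$cfg"; then
  result="node04 still present"
else
  result="node04 entry removed"
fi
echo "$result"
```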
Step-3: Remove failed host from GPFS domain
# db2cluster -cfs -remove -host node04
Host 'node04' has been successfully removed from the shared file system cluster.
Step-4: Remove failed host from RSCT domain
This is where we start running into problems, because the command fails.
# db2cluster -cm -remove -host node04
The above command fails with the following message:
Removing cluster node 'node04' from the cluster ...
Host 'node04' could not be removed from the peer domain as it contains managed resources. These resources must be removed before the host can be removed from the peer domain.
There must be a better way to remove the node04 resources, perhaps using the chrsrc or rmrsrc commands, but I did not have time to explore it. So, I took a more surgical approach.
# rmrpnode node04
# lsrpnode
Name   OpState RSCTVersion
node03 Online  22.214.171.124
node02 Online  126.96.36.199
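To confirm the peer-domain membership after the removal, the lsrpnode output can be checked the same way. Another small sketch of mine, run against captured output (the sample mirrors the listing above):

```shell
# Captured 'lsrpnode' output; on a live system, redirect the real command
# into this file instead.
cat > /tmp/lsrpnode.out <<'EOF'
Name   OpState RSCTVersion
node03 Online  22.214.171.124
node02 Online  126.96.36.199
EOF

# Skip the header line and collect the remaining peer node names.
peers=$(awk 'NR > 1 { print $1 }' /tmp/lsrpnode.out | sort | xargs)
echo "peers: $peers"
```

node04 should no longer appear in the list.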
Now, logout and login as the db2 instance user. If you run db2instance -list, you won't get anything back.
$ db2instance -list
The member, CF, or host information could not be obtained. Verify the cluster manager resources are valid by entering db2cluster -cm -verify -resources. Check the db2diag.log for more information.
Now, try to verify the resources; they are all in a bad state.
$ db2cluster -cm -verify -resources
Cluster manager resource states for the DB2 instance are inconsistent. Refer to the db2diag.log for more information on inconsistencies.
Now, try to repair the resources; that cannot be done either.
$ db2cluster -cm -repair -resources
Resources could not be repaired for the DB2 instance, and the resource state is now inconsistent. Refer to the db2diag.log for more details.
Basically, we have hosed the RSCT resources. Let's rebuild them.
Step-5: Rebuild Resources Again
Login as root, then stop and remove the RSCT domain. Make sure you note down the RSCT domain name.
# lsrpdomain
Name                     OpState RSCTActiveVersion MixedVersions TSPort GSPort
db2domain_20160519174133 Online  188.8.131.52      No            12347  12348
# stoprpdomain -f db2domain_20160519174133
# rmrpdomain -f db2domain_20160519174133

# lsrpdomain
# lssam
lssam: No resource groups defined or cluster is offline!
Now, wait 2-3 minutes; the domain should disappear from node03 as well. If it doesn't, go to node03 and remove it there.
# stoprpdomain -f db2domain_20160519174133
# rmrpdomain -f db2domain_20160519174133
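If you script this teardown, it is safer to capture the domain name from lsrpdomain than to retype it. A sketch, working from captured output; swap the here-doc for the live command and run the two commented commands as root:

```shell
# Captured 'lsrpdomain' output from above; on the live node, use:
#   lsrpdomain > /tmp/lsrpdomain.out
cat > /tmp/lsrpdomain.out <<'EOF'
Name                     OpState RSCTActiveVersion MixedVersions TSPort GSPort
db2domain_20160519174133 Online  188.8.131.52      No            12347  12348
EOF

# Domain name = second line, first column.
domain=$(awk 'NR == 2 { print $1 }' /tmp/lsrpdomain.out)
echo "tearing down domain: $domain"
# On the real cluster, as root:
#   stoprpdomain -f "$domain"
#   rmrpdomain -f "$domain"
```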
Switch back to node02 and run the following commands.
Create the domain with the same name.
# db2cluster -cm -create -domain db2domain_20160519174133 -host node02
Creating domain 'db2domain_20160519174133' in the cluster ...
Creating domain 'db2domain_20160519174133' in the cluster was successful.
Add node03 to the domain.
# db2cluster -cm -add -host node03
Adding node 'node03' to the cluster ...
Adding node 'node03' to the cluster was successful.
Logout and login as db2 instance owner.
$ db2cluster -cm -repair -resources
All cluster configurations have been completed successfully.
db2cluster exiting ...
$ db2instance -list
ID  TYPE    STATE    HOME_HOST  CURRENT_HOST  ALERT  PARTITION_NUMBER  LOGICAL_PORT  NETNAME
--  ----    -----    ---------  ------------  -----  ----------------  ------------  -------
0   MEMBER  STOPPED  node02     node02        NO     0                 0             node02.purescale.ibm.local
1   MEMBER  STOPPED  node03     node03        NO     0                 0             node03.purescale.ibm.local
128 CF      STOPPED  node02     node02        NO     -                 0             node02.purescale.ibm.local
129 CF      STOPPED  node03     node03        NO     -                 0             node03.purescale.ibm.local

HOSTNAME  STATE   INSTANCE_STOPPED  ALERT
--------  -----   ----------------  -----
node03    ACTIVE  NO                NO
node02    ACTIVE  NO                NO

[db2psc@node02 ~]$ db2start
05/20/2016 19:56:22     0   0   SQL1063N  DB2START processing was successful.
05/20/2016 19:56:23     1   0   SQL1063N  DB2START processing was successful.
SQL1063N  DB2START processing was successful.
Now you are up and running with 2 hosts, and you were able to drop the failed host that never came back.
But you still need to reset your tie-breaker and the host failure detection time. If you miss these, recovering from the next failure will be tough, and your job may be on the line.
Step-6: Set up tie-breaker again
# db2cluster -cm -list -tiebreaker
The current quorum device is of type Operator.

# lsrsrc IBM.TieBreaker Name
Resource Persistent Attributes for IBM.TieBreaker
resource 1:
        Name = "Success"
resource 2:
        Name = "Fail"
resource 3:
        Name = "Operator"

# lsscsi
[0:0:0:0]    disk    VMware,  VMware Virtual S 1.0   /dev/sda
[2:0:0:0]    cd/dvd  NECVMWar VMware IDE CDR10 1.00  /dev/sr0
[3:0:0:0]    storage IET      Controller       0001  -
[3:0:0:1]    disk    IET      VIRTUAL-DISK     0001  /dev/sdb
[4:0:0:0]    storage IET      Controller       0001  -
[4:0:0:1]    disk    IET      VIRTUAL-DISK     0001  /dev/sdc

# db2cluster -cm -set -tiebreaker -disk "ID=0 LUN=1 HOST=4 CHAN=0"
Configuring quorum device for domain 'db2domain_20160519174133' ...
Configuring quorum device for domain 'db2domain_20160519174133' was successful.
Note: I am using an iSCSI disk, so I had to use the ID, LUN, HOST, and CHAN attributes that I got from the lsscsi output. If you are using a SAN disk, specify the disk name instead, such as /dev/hdisk7 on AIX or /dev/dm-7 on Linux, as the case may be.
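The mapping from the lsscsi tuple to the db2cluster attributes is easy to get backwards, so here is a sketch that derives the disk specification mechanically. The bracketed tuple in lsscsi output is [host:channel:id:lun]; the sample lines and the /dev/sdc choice match this example setup.

```shell
# Captured 'lsscsi' output (trimmed to the disk lines from above).
cat > /tmp/lsscsi.out <<'EOF'
[0:0:0:0] disk VMware, VMware Virtual S 1.0 /dev/sda
[3:0:0:1] disk IET VIRTUAL-DISK 0001 /dev/sdb
[4:0:0:1] disk IET VIRTUAL-DISK 0001 /dev/sdc
EOF

# Build the "ID=.. LUN=.. HOST=.. CHAN=.." string for the chosen device.
spec=$(awk -v dev=/dev/sdc '$NF == dev {
  gsub(/[\[\]]/, "", $1)            # strip brackets -> host:chan:id:lun
  split($1, t, ":")
  printf "ID=%s LUN=%s HOST=%s CHAN=%s", t[3], t[4], t[1], t[2]
}' /tmp/lsscsi.out)
echo "$spec"
# As root on the real cluster:
#   db2cluster -cm -set -tiebreaker -disk "$spec"
```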
Step-7: Set host failure detection time
This step requires an outage, so plan for it. If the immediate need is to be up and running, you can do this later during a planned outage window. Follow these steps.
$ db2stop force
05/20/2016 20:44:56     0   0   SQL1064N  DB2STOP processing was successful.
05/20/2016 20:45:04     1   0   SQL1064N  DB2STOP processing was successful.
SQL1064N  DB2STOP processing was successful.

[db2psc@node02 ~]$ db2stop instance on node02
SQL1064N  DB2STOP processing was successful.

[db2psc@node02 ~]$ db2stop instance on node03
SQL1064N  DB2STOP processing was successful.
# db2cluster -cfs -stop -all
All specified hosts have been stopped successfully.

# db2cluster -cm -set -option hostfailuredetectiontime -value 4
The host failure detection time has been set to 4 seconds.

# db2cluster -cfs -start -all
All specified hosts have been started successfully.
$ db2start instance on node02
SQL1063N  DB2START processing was successful.
$ db2start instance on node03
SQL1063N  DB2START processing was successful.

$ db2start
05/20/2016 20:53:07     0   0   SQL1063N  DB2START processing was successful.
05/20/2016 20:53:08     1   0   SQL1063N  DB2START processing was successful.
SQL1063N  DB2START processing was successful.
Now you are all good. The whole operation takes around 20 minutes, and then you are back in business. Remember that this is not an unplanned outage: you were already up and running, and you did this to remove a host from the cluster that died and never came back to life.
Step-8: Add host again to the cluster
Now the reverse happens. After you are done removing the host, your SA comes back and says: hey, I fixed everything, you are good to go. Do you have a brick in your cube? If yes, make good use of it.
On node04, make sure that when it starts, it does not have any leftover from the previous domain.
Run lsrpdomain; if it shows a domain name, whether offline or online, remove it by running stoprpdomain -f <domainName> and rmrpdomain -f <domainName>.
Do the following clean-up before adding this node back to the cluster.
# cd /opt/ibm/db2
This is your DB2 base installation directory; find it with db2ls if you are not sure.
# cd instance
# ./db2idrop_local db2psc
DBI1081E The file or directory "/db2sd/db2psc/sqllib_shared/.update" is missing.
DBI1070I Program db2idrop_local completed successfully.

# ./db2greg -dump
S,DB2,10.5.0.7,/opt/ibm/db2,,,7,0,,1463699309,0
V,TSA_MIN_REQUIRED_VERSION,NAME,184.108.40.206.1,/opt/ibm/db2,-
V,GPFS_MIN_REQUIRED_VERSION,NAME,220.127.116.11.5,/opt/ibm/db2,-
V,DB2GPRF,DB2SYSTEM,node04,/opt/ibm/db2,
V,INSTPROF,db2psc,/db2sd,-,-
V,GPFS_CLUSTER,NAME,db2cluster_20160519174147.purescale.ibm.local,-,DB2_CREATED
V,PEER_DOMAIN,NAME,db2domain_20160519174133,-,DB2_CREATED
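The dump is comma-separated, so it is easy to pull out just the cluster resources DB2 created, which are the entries that still tie this node to the old cluster. A sketch against a captured dump, trimmed to the relevant records:

```shell
# Records from the './db2greg -dump' output above; on the real node,
# capture the live dump into this file instead.
cat > /tmp/db2greg.out <<'EOF'
V,INSTPROF,db2psc,/db2sd,-,-
V,GPFS_CLUSTER,NAME,db2cluster_20160519174147.purescale.ibm.local,-,DB2_CREATED
V,PEER_DOMAIN,NAME,db2domain_20160519174133,-,DB2_CREATED
EOF

# Field 2 is the variable name, field 4 its value; keep only the records
# whose last field carries the DB2_CREATED flag.
created=$(awk -F, '$NF == "DB2_CREATED" { print $2 "=" $4 }' /tmp/db2greg.out)
echo "$created"
```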
Since these entries are still in the registry, it is better to delete the registry file; it will be created again.
# rm -f /var/db2/global.reg
Now, db2greg -dump should be blank.
Go back to the active node and add node04 again. This does not require an outage, as adding a node is an online operation.
# /opt/ibm/db2/instance/db2iupdt -d -add -m node04 -mnet node04.purescale.ibm.local db2psc
Oops. This fails with the following error message:
ERROR: Host 'node04' is part of another shared file system cluster. Remove the host from that domain before adding the host to this domain.
Once again, I took a surgical approach rather than fixing it properly.
Remove the /var/mmfs/gen directory; with that, the whole GPFS configuration is gone from this node. Reboot the node and run db2iupdt again.
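Blowing away /var/mmfs/gen is drastic, so it is worth a guard before the rm. The sketch below works on a scratch copy of the directory so it is safe to run anywhere; the mount check is only a cheap stand-in for a real GPFS state inspection (mmgetstate).

```shell
# Scratch stand-in for /var/mmfs/gen; point gen_dir at the real path on
# the node you are cleaning.
gen_dir=/tmp/mmfs-demo/gen
mkdir -p "$gen_dir"
touch "$gen_dir/mmsdrfs"   # the GPFS cluster configuration file lives here

# Cheap guard: refuse to remove the config while a gpfs filesystem is
# still mounted (a proper check would use mmgetstate).
if grep -q ' gpfs ' /proc/mounts 2>/dev/null; then
  echo "GPFS still mounted; not touching $gen_dir"
else
  rm -rf "$gen_dir"
  echo "removed $gen_dir"
fi
```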
If it fails again, this time with the message:
ERROR: The "db2start add MEMBER 2 hostname node04.purescale.ibm.local netname node04.purescale.ibm.local port 0 -iupdt" command failed with the return code: "31", and following error: "SQL6073N An add or drop operation of a database partition or DB2 member failed. SQLCODE = "-1015". Database name = "PSDB ".".
ERROR: The "db2stop -fixtopology" command failed with the return code: "8", and following error: "SQL6030N START or STOP DATABASE MANAGER failed. Reason code "36".".
Just activate the database again and resubmit the db2iupdt command:
$ db2 activate database psdb
$ db2 restart database psdb