If you need to drop a member in DB2 pureScale, the db2iupdt process requires that member to be available during the drop. But the real world is often crueler than db2iupdt assumes.

For example: you have a 3-member cluster and someone migrated the data store from the SAN, and during the process one of the members became inaccessible, or a hardware failure means you cannot bring that member online. The only option left is to rebuild that node and reintegrate it with the pureScale cluster.

So – these are the steps you can take, since you know for sure that db2iupdt will fail.

When you run db2instance -list from one of the available members, you see that one of the members is down, and you cannot bring it back up and running.

Please note: commands prefixed with # require root, and commands prefixed with $ require the DB2 instance owner. Pay attention to the prefix.

$ db2instance -list
ID TYPE STATE HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME
-- ---- ----- --------- ------------ ----- ---------------- ------------ -------
0 MEMBER STARTED node02 node02 NO 0 0 node02.purescale.ibm.local
1 MEMBER STARTED node03 node03 NO 0 0 node03.purescale.ibm.local
2 MEMBER WAITING_FOR_FAILBACK node04 node03 NO 0 1 node03.purescale.ibm.local
128 CF CATCHUP node02 node02 NO - 0 node02.purescale.ibm.local
129 CF PRIMARY node03 node03 NO - 0 node03.purescale.ibm.local

HOSTNAME STATE INSTANCE_STOPPED ALERT
-------- ----- ---------------- -----
 node02 ACTIVE NO NO
 node03 ACTIVE NO NO
 node04 INACTIVE NO YES

There is currently an alert for a member, CF, or host in the data-sharing instance. For more information on the alert, its impact, and how to clear it, run the following command: 'db2cluster -cm -list -alert'.

So, we want to remove node04 from the cluster.

Step-1: Stop db2 instance

Dropping a member requires an outage, so stop DB2 first.

$ db2stop force

Step-2: Run db2iupdt -drop command

Run it knowing full well that it will fail; it will still remove the entry from the db2nodes.cfg file, which is exactly what we want for the failed node.

Login as root
# ./db2iupdt -d -drop -m node04 db2psc

You may receive a message that a minor error occurred. In reality, this "minor" error is a major one.

A minor error occurred during the execution.

For more information see the DB2 installation log at "/tmp/db2iupdt.log.74369". DBI1264E This program failed. Errors encountered during execution were written to the installation log file. Program name: db2iupdt. Log file name: /tmp/db2iupdt.log.74369.

Check your db2nodes.cfg; the node04 entry is gone now. The log file indicates that db2iupdt skipped the RSCT and GPFS portions of removing the resources. Now we have to clean up this leftover.

[root@node02 instance]# cat /db2sd/db2psc/sqllib_shared/db2nodes.cfg 
0 node02.purescale.ibm.local 0 node02.purescale.ibm.local - MEMBER
1 node03.purescale.ibm.local 0 node03.purescale.ibm.local - MEMBER
128 node02.purescale.ibm.local 0 node02.purescale.ibm.local - CF
129 node03.purescale.ibm.local 0 node03.purescale.ibm.local - CF
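For reference, each line of db2nodes.cfg in a pureScale instance carries the member or CF ID, the hostname, the logical port, the netname, a resource-set placeholder, and the role (MEMBER or CF). A quick sanity check that the failed host is really gone can be sketched like this -- note this writes sample data to a temp file, not your live configuration:

```shell
# Sketch: parse a pureScale-style db2nodes.cfg (sample data in a temp
# file -- not your live configuration)
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
0 node02.purescale.ibm.local 0 node02.purescale.ibm.local - MEMBER
1 node03.purescale.ibm.local 0 node03.purescale.ibm.local - MEMBER
128 node02.purescale.ibm.local 0 node02.purescale.ibm.local - CF
129 node03.purescale.ibm.local 0 node03.purescale.ibm.local - CF
EOF
# Column 1 is the member/CF ID, column 6 is the role
awk '{print $1, $6}' "$cfg"
if ! grep -q 'node04' "$cfg"; then
  echo "node04 fully removed"
fi
rm -f "$cfg"
```

Point this at /db2sd/db2psc/sqllib_shared/db2nodes.cfg instead of the temp file to check your own instance.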

Step-3: Remove failed host from GPFS domain

# db2cluster -cfs -remove -host node04

Host 'node04' has been successfully removed from the shared file system cluster.

Step-4: Remove failed host from RSCT domain

Now we start running into problems, as this command will fail.

# db2cluster -cm -remove -host node04

The above command fails with the following message:

Removing cluster node 'node04' from the cluster ...
Host 'node04' could not be removed from the peer domain as it contains managed resources. These resources must be removed before the host can be removed from the peer domain.

Now, there is probably a better way to remove node04's resources using the chrsrc or rmrsrc commands, but I did not have time to explore it. So I took a more surgical approach.
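If you do want to explore the cleaner route first, a hedged sketch would be to list the managed resources that still reference the failed host before falling back to rmrpnode. The class names used here (IBM.PeerNode, IBM.Application) are the standard RSCT/TSA ones, but verify them against your RSCT level; the script is guarded so it is a no-op on hosts without RSCT installed:

```shell
# Sketch: inspect which managed resources still reference the failed host.
# IBM.PeerNode / IBM.Application are standard RSCT/TSA resource classes --
# verify them on your RSCT level before relying on this.
FAILED_HOST=node04
if command -v lsrsrc >/dev/null 2>&1; then
  echo "Peer domain nodes matching $FAILED_HOST:"
  lsrsrc IBM.PeerNode Name | grep -i "$FAILED_HOST"
  echo "TSA application resources referencing $FAILED_HOST:"
  lsrsrc IBM.Application Name NodeNameList | grep -i -B1 "$FAILED_HOST"
else
  echo "RSCT tools not found; skipping inspection"
fi
```

Anything this lists would need to be removed (rmrsrc) or rehosted (chrsrc) before db2cluster -cm -remove -host can succeed.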

# rmrpnode node04
# lsrpnode
Name OpState RSCTVersion 
node03 Online 3.1.5.5 
node02 Online 3.1.5.5 

Now, logout and login as the DB2 instance owner. If you run db2instance -list, you won't get anything.

$ db2instance -list

The member, CF, or host information could not be obtained. Verify the cluster manager resources are valid by entering db2cluster -cm -verify -resources. Check the db2diag.log for more information.

Now, try to verify the resources; they are all bad now.

$ db2cluster -cm -verify -resources

Cluster manager resource states for the DB2 instance are inconsistent. Refer to the db2diag.log for more information on inconsistencies.

Now, try to repair the resources; it cannot be done.

$ db2cluster -cm -repair -resources

Resources could not be repaired for the DB2 instance, and the resource state is now inconsistent. Refer to the db2diag.log for more details.

Basically, we have hosed the RSCT resources. Now, let's rebuild them.

Step-5: Rebuild Resources Again


Login as root, then stop and remove the RSCT domain. Make sure you note the RSCT domain name.

# lsrpdomain
Name OpState RSCTActiveVersion MixedVersions TSPort GSPort 
db2domain_20160519174133 Online 3.1.5.5 No 12347 12348
# stoprpdomain -f db2domain_20160519174133
# rmrpdomain -f db2domain_20160519174133
# lsrpdomain
# lssam
lssam: No resource groups defined or cluster is offline!

Now, wait 2-3 minutes and the domain should disappear from node03 as well; if it doesn't, go to node03 and remove it there.

node03:

# stoprpdomain -f db2domain_20160519174133
# rmrpdomain -f db2domain_20160519174133

Switch back to node02 and run the following commands.

Create domain with the same name.

# db2cluster -cm -create -domain db2domain_20160519174133 -host node02
Creating domain 'db2domain_20160519174133' in the cluster ...
Creating domain 'db2domain_20160519174133' in the cluster was successful.

Add node03 to the domain

# db2cluster -cm -add -host node03
Adding node 'node03' to the cluster ...
Adding node 'node03' to the cluster was successful.

Logout and login as db2 instance owner.

$ db2cluster -cm -repair -resources
All cluster configurations have been completed successfully. db2cluster exiting ...
$ db2instance -list
ID TYPE STATE HOME_HOST CURRENT_HOST ALERT PARTITION_NUMBER LOGICAL_PORT NETNAME
-- ---- ----- --------- ------------ ----- ---------------- ------------ -------
0 MEMBER STOPPED node02 node02 NO 0 0 node02.purescale.ibm.local
1 MEMBER STOPPED node03 node03 NO 0 0 node03.purescale.ibm.local
128 CF STOPPED node02 node02 NO - 0 node02.purescale.ibm.local
129 CF STOPPED node03 node03 NO - 0 node03.purescale.ibm.local
HOSTNAME STATE INSTANCE_STOPPED ALERT
-------- ----- ---------------- -----
 node03 ACTIVE NO NO
 node02 ACTIVE NO NO
[db2psc@node02 ~]$ db2start
05/20/2016 19:56:22 0 0 SQL1063N DB2START processing was successful.
05/20/2016 19:56:23 1 0 SQL1063N DB2START processing was successful.
SQL1063N DB2START processing was successful.

Now you are up and running with 2 hosts, and you were able to drop the failed host that never came back.

But you still need to reset your tiebreaker and set the host failure detection time again. If you miss these, recovering from a future failure will be tough, and your job may be on the line.

Step-6: Set up tie-breaker again

# db2cluster -cm -list -tiebreaker
The current quorum device is of type Operator.
# lsrsrc IBM.TieBreaker Name
Resource Persistent Attributes for IBM.TieBreaker
resource 1:
 Name = "Success"
resource 2:
 Name = "Fail"
resource 3:
 Name = "Operator"
# lsscsi
[0:0:0:0] disk VMware, VMware Virtual S 1.0 /dev/sda 
[2:0:0:0] cd/dvd NECVMWar VMware IDE CDR10 1.00 /dev/sr0 
[3:0:0:0] storage IET Controller 0001 - 
[3:0:0:1] disk IET VIRTUAL-DISK 0001 /dev/sdb 
[4:0:0:0] storage IET Controller 0001 - 
[4:0:0:1] disk IET VIRTUAL-DISK 0001 /dev/sdc 
# db2cluster -cm -set -tiebreaker -disk "ID=0 LUN=1 HOST=4 CHAN=0"
Configuring quorum device for domain 'db2domain_20160519174133' ...
Configuring quorum device for domain 'db2domain_20160519174133' was successful.

Note: I am using an iSCSI disk, so I had to use the ID, LUN, HOST, and CHAN attributes taken from the lsscsi output. If you are using a SAN disk instead, specify the disk name, such as /dev/hdisk7 on AIX or /dev/dm-7 on Linux, as the case may be.
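The mapping from lsscsi to those attributes can trip people up: lsscsi prints the SCSI address as [host:channel:target:lun], while db2cluster wants HOST, CHAN, ID (the target), and LUN. A small sketch of the translation, using the /dev/sdc line from the output above as sample input:

```shell
# Sketch: map an lsscsi address [host:channel:target:lun] to the
# HOST/CHAN/ID/LUN attributes db2cluster expects for a tiebreaker disk.
# Sample line copied from the lsscsi output above; substitute your own.
line='[4:0:0:1] disk IET VIRTUAL-DISK 0001 /dev/sdc'
addr=$(echo "$line" | sed 's/^\[\([0-9:]*\)\].*/\1/')
IFS=: read -r host chan id lun <<EOF
$addr
EOF
echo "ID=$id LUN=$lun HOST=$host CHAN=$chan"
```

The echoed string is exactly what goes into db2cluster -cm -set -tiebreaker -disk "...".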

Step-7: Set host failure detection time

This step requires an outage, so plan for it. If the immediate need is to be up and running, you can do this later during a planned outage window. Follow these steps.

$ db2stop force
05/20/2016 20:44:56 0 0 SQL1064N DB2STOP processing was successful.
05/20/2016 20:45:04 1 0 SQL1064N DB2STOP processing was successful.
SQL1064N DB2STOP processing was successful.
[db2psc@node02 ~]$ db2stop instance on node02
SQL1064N DB2STOP processing was successful.
[db2psc@node02 ~]$ db2stop instance on node03
SQL1064N DB2STOP processing was successful.
# db2cluster -cfs -stop -all
All specified hosts have been stopped successfully.
# db2cluster -cm -set -HOSTFAILUREDETECTIONTIME -value 4
The host failure detection time has been set to 4 seconds.
# db2cluster -cfs -start -all
All specified hosts have been started successfully.
$ db2start instance on node02
SQL1063N DB2START processing was successful.
$ db2start instance on node03
SQL1063N DB2START processing was successful.
$ db2start
05/20/2016 20:53:07 0 0 SQL1063N DB2START processing was successful.
05/20/2016 20:53:08 1 0 SQL1063N DB2START processing was successful.
SQL1063N DB2START processing was successful.

Now – you are all good. The whole operation takes around 20 minutes to complete and get back in business. Remember that this is not an unplanned outage: you were already up and running, and you did this to remove a host from the cluster that died and never came back to life.

Step-8: Add host again to the cluster

Now – the reverse happens. After you are done removing the host, your SA comes back and says: "Hey, I fixed everything and you are good to go." Do you have a brick in your cube? If yes, make good use of it.

On node04, make sure that when it starts, it does not have a leftover of the previous domain.

Run lsrpdomain, and if it shows a domain name (whether offline or online), remove it: run stoprpdomain -f <domainName> and then rmrpdomain -f <domainName>.
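That check-then-remove loop can be sketched as a small script. It is guarded so it does nothing on hosts without RSCT, and it assumes the default lsrpdomain output layout (header line first, domain name in column 1):

```shell
# Sketch: remove any leftover RSCT peer domain on the repaired node.
# Guarded so it is a no-op on hosts without RSCT installed.
if command -v lsrpdomain >/dev/null 2>&1; then
  # Skip the header line; column 1 is the domain name
  for dom in $(lsrpdomain | awk 'NR>1 {print $1}'); do
    echo "Removing leftover domain: $dom"
    stoprpdomain -f "$dom"
    rmrpdomain -f "$dom"
  done
else
  echo "lsrpdomain not found; nothing to clean up"
fi
```

Only run this on the node being rebuilt -- on an active member it would tear down the live peer domain.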

Do the following clean-up before adding this node back to the cluster.

# cd /opt/ibm/db2    (your DB2 base install directory; find it with db2ls if you are not sure)

# cd instance
# ./db2idrop_local db2psc
DBI1081E The file or directory "/db2sd/db2psc/sqllib_shared/.update" is missing.
DBI1070I Program db2idrop_local completed successfully.

# ./db2greg -dump
S,DB2,10.5.0.7,/opt/ibm/db2,,,7,0,,1463699309,0
V,TSA_MIN_REQUIRED_VERSION,NAME,3.2.2.8.1,/opt/ibm/db2,-
V,GPFS_MIN_REQUIRED_VERSION,NAME,3.5.0.24.5,/opt/ibm/db2,-
V,DB2GPRF,DB2SYSTEM,node04,/opt/ibm/db2,
V,INSTPROF,db2psc,/db2sd,-,-
V,GPFS_CLUSTER,NAME,db2cluster_20160519174147.purescale.ibm.local,-,DB2_CREATED
V,PEER_DOMAIN,NAME,db2domain_20160519174133,-,DB2_CREATED

Since entries are still there in the registry, it is better to delete the registry file, as it will be created again.

# rm -f /var/db2/global.reg

Now, db2greg -dump should be blank.
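To double-check the cleanup before moving on, a small guarded sketch (using the same path as above) can confirm the registry file is really gone:

```shell
# Sketch: verify the DB2 global registry file has been removed, so that
# db2greg -dump will come back empty on the rebuilt node
REG=/var/db2/global.reg
if [ -e "$REG" ]; then
  echo "registry still present: $REG -- remove it before re-adding the node"
else
  echo "registry removed"
fi
```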

Go back to the active node and add node04 again. This does not require an outage, as adding a node is an online process.

# /opt/ibm/db2/instance/db2iupdt -d -add -m node04 -mnet node04.purescale.ibm.local db2psc

Oops. This fails with the following error message:

ERROR: Host 'node04' is part of another shared file system cluster. Remove the host from that domain before adding the host to this domain.

Once again, I took a surgical approach rather than fixing it properly.

Remove the /var/mmfs/gen directory; the whole GPFS configuration is now gone from this node. Reboot the node and run db2iupdt again.
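As a sketch, with a guard so nothing happens if the directory is absent. This is destructive: only run it on the node being rebuilt, never on an active member:

```shell
# Sketch: wipe local GPFS configuration state on the node being rebuilt.
# DESTRUCTIVE -- never run this on an active member of the cluster.
GEN_DIR=/var/mmfs/gen
if [ -d "$GEN_DIR" ]; then
  rm -rf "$GEN_DIR"
  echo "Removed $GEN_DIR; reboot before re-running db2iupdt -add"
else
  echo "$GEN_DIR not present; nothing to remove"
fi
```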

If the above fails again with this message:

ERROR: The "db2start add MEMBER 2 hostname node04.purescale.ibm.local netname node04.purescale.ibm.local port 0 -iupdt" command failed with the return code: "31", and following error: "SQL6073N An add or drop operation of a database partition or DB2 member failed. SQLCODE = "-1015". Database name = "PSDB "."
ERROR: The "db2stop -fixtopology" command failed with the return code: "8", and following error: "SQL6030N START or STOP DATABASE MANAGER failed. Reason code "36"."

just activate the database again and resubmit the command:

$ db2 activate database psdb
$ db2 restart database psdb