Useful day-to-day commands for managing DB2 pureScale with RSCT and GPFS


Did you ever wonder why GPFS commands start with mm? GPFS began as an IBM research project in the early 90s to build a multi-media (music and video) networked file system for a university, which is why the researchers chose to prefix each file system command with mm (multi-media). That legacy continues to this day.

Look at the latest GPFS log, /var/adm/ras/mmfs.log.latest.

It records events such as a node being evicted from the cluster, which explains why a node that is online may still not show the GPFS mount points. Once a node has been evicted, look for clues in this file.
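A quick way to scan the log for eviction entries is a sketch like the following. The exact message text varies by GPFS release, so this greps for common keywords rather than an exact phrase; adjust the pattern to what your version actually logs.

```shell
# Scan the latest GPFS log for eviction/expulsion entries.
# The keyword pattern is an assumption -- verify against your release's logs.
LOG=/var/adm/ras/mmfs.log.latest
[ -r "$LOG" ] && grep -i -E 'expel|evict' "$LOG" || echo "no log at $LOG (or no matches)"
```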

Check the status of the GPFS (General Parallel File System) disks on all nodes in the cluster, and determine which node does not show a file system online.

# mmlsnsd -M

 Disk name    NSD volume ID      Device         Node name                Remarks       
 gpfs1nsd     C0A88E664FE7F0D7   /dev/sdc 
 gpfs1nsd     C0A88E664FE7F0D7   /dev/sdc 
 gpfs1nsd     C0A88E664FE7F0D7   /dev/sdc 
 gpfs2nsd     C0A88E664FE7F468   /dev/sdb 
 gpfs2nsd     C0A88E664FE7F468   /dev/sdb 
 gpfs2nsd     C0A88E664FE7F468   /dev/sdb 
 gpfs3nsd     C0A88E664FE7F485   /dev/sdd 
 gpfs3nsd     C0A88E664FE7F485   /dev/sdd 
 gpfs3nsd     C0A88E664FE7F485   /dev/sdd 
 gpfs4nsd     C0A88E664FE7F4B2   /dev/sde 
 gpfs4nsd     C0A88E664FE7F4B2   /dev/sde 
 gpfs4nsd     C0A88E664FE7F4B2   /dev/sde 
 gpfs5nsd     C0A88E664FE7F4DB   /dev/sdf 
 gpfs5nsd     C0A88E664FE7F4DB   /dev/sdf 
 gpfs5nsd     C0A88E664FE7F4DB   /dev/sdf 
 gpfs6nsd     C0A88E664FE7F4FB   /dev/sdg 
 gpfs6nsd     C0A88E664FE7F4FB   /dev/sdg 
 gpfs6nsd     C0A88E664FE7F4FB   /dev/sdg
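In a healthy 3-node cluster, every NSD appears once per node in the output above; a lower count points at the node that cannot see the device. That check can be sketched as an awk filter. The sample input below is embedded so the sketch is self-contained (the node names node01-node03 are illustrative, since the listing above truncates that column); in practice pipe `mmlsnsd -M` into the awk program and set `want` to your node count.

```shell
# Flag NSDs that are not visible on every node in `mmlsnsd -M` output.
# want=3 because this example cluster has 3 nodes; adjust for yours.
awk -v want=3 '
    NF >= 3 && $1 ~ /nsd$/ { seen[$1]++ }
    END { for (d in seen)
              if (seen[d] != want)
                  print d, "visible on only", seen[d], "of", want, "node(s)" }' <<'EOF'
 Disk name    NSD volume ID      Device         Node name
 gpfs1nsd     C0A88E664FE7F0D7   /dev/sdc       node01
 gpfs1nsd     C0A88E664FE7F0D7   /dev/sdc       node02
 gpfs2nsd     C0A88E664FE7F468   /dev/sdb       node01
 gpfs2nsd     C0A88E664FE7F468   /dev/sdb       node02
 gpfs2nsd     C0A88E664FE7F468   /dev/sdb       node03
EOF
# gpfs1nsd is missing on one node, so this prints:
#   gpfs1nsd visible on only 2 of 3 node(s)
```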

Check the status of GPFS on the node where the command was run.

# mmlsnsd -X
 Disk name    NSD volume ID      Device         Devtype  Node name                Remarks          
 gpfs1nsd     C0A88E664FE7F0D7   /dev/sdc       generic 
 gpfs2nsd     C0A88E664FE7F468   /dev/sdb       generic 
 gpfs3nsd     C0A88E664FE7F485   /dev/sdd       generic 
 gpfs4nsd     C0A88E664FE7F4B2   /dev/sde       generic 
 gpfs5nsd     C0A88E664FE7F4DB   /dev/sdf       generic 
 gpfs6nsd     C0A88E664FE7F4FB   /dev/sdg       generic

If SCSI-3 PR is enabled, you will see an additional pr=yes message in the Remarks column. When you install DB2 pureScale for the first time, GPFS can be made aware of the SCSI-3 PR capability (depending on the storage), in which case pr=yes appears in the Remarks column. If it does not, and you know your storage is SCSI-3 PR capable, use the manual procedure outlined in the DB2 Information Center to enable it.

The procedure is outlined as:

  1. $ db2stop force
    $ db2stop instance on <hostname> --> Repeat this for all hosts
  2. # db2cluster -cm -stop -domain <name>

    --> Find the domain name with the lsrpdomain command. This shuts down the RSCT peer domain.

  3. # db2cluster -cfs -stop -all

    --> This brings GPFS down on the nodes, but the GPFS cluster (domain) definition remains.

  4. # touch /var/mmfs/etc/prcapdevices
  5. # /usr/lpp/mmfs/bin/tsprinquiry >> /var/mmfs/etc/prcapdevices
  6. # /usr/lpp/mmfs/bin/mmchconfig usePersistentReserve=yes
  7. # scp /var/mmfs/etc/prcapdevices <nodename>:/var/mmfs/etc/prcapdevices

    --> Repeat this for all nodes

  8. # /usr/lpp/mmfs/bin/mmchconfig totalPingTimeout=45

    --> Reduce from the default of 75 seconds, since PR now provides fast fencing

  9. # db2cluster -cfs -start -all
  10. # db2cluster -cm -start -domain <domainname>
  11. # /usr/lpp/mmfs/bin/mmlsnsd -X

    --> Check that pr=yes is set in the Remarks column of the output, on all nodes.

  12. # db2cluster -cm -list -hostfailuredetectiontime

    --> The default is 8 seconds

  13. # db2cluster -cm -set -hostfailuredetectiontime -value 4

    --> Since PR is enabled, 4 seconds is a good value.

  14. # db2cluster -cfs -verify -resources
  15. # db2cluster -cm -verify -resources
  16. $ db2start instance on <hostname> --> Repeat for all nodes
  17. $ db2start
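For step 11, you can mechanize the pr=yes check with a small awk filter rather than eyeballing the output. The sample `mmlsnsd -X` lines below are embedded for illustration (node name and pr=yes placement per the format shown earlier); in practice run `mmlsnsd -X | awk '...'` on each node.

```shell
# Quick check for step 11: every NSD line from `mmlsnsd -X` should carry
# pr=yes in the Remarks column once SCSI-3 PR is enabled.
awk '$1 ~ /nsd$/ { total++; if (/pr=yes/) ok++ }
     END { printf "%d of %d NSDs show pr=yes\n", ok, total }' <<'EOF'
 gpfs1nsd     C0A88E664FE7F0D7   /dev/sdc       generic  node01  pr=yes
 gpfs2nsd     C0A88E664FE7F468   /dev/sdb       generic  node01  pr=yes
EOF
# prints: 2 of 2 NSDs show pr=yes
```

If the two numbers differ, PR is not active for every disk and the manual procedure above needs to be revisited on that node.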

To test whether SCSI-3 PR is effective, do a node failover, pull the power cord, or run kill -11 on the PID of the GPFS daemon, and then do the following check.

Run the mmfsadm command:

# mmfsadm dump sgmgr

Stripe groups managed by this node:

If you see only the header above with no stripe groups listed, the node you ran the command on is not managing any file systems; determine the GPFS manager node by running the mmlsmgr command.

# mmlsmgr
file system      manager node
---------------- ------------------
db2data2 (node03)
db2data4 (node03)
db2log  (node03)
db2data1 (node04)
db2data3 (node04)
db2fs1  (node04)

Cluster manager node: (node03)

The GPFS cluster manager is node03. Run the mmfsadm dump sgmgr command on the GPFS cluster manager node. If you still do not see the output, run it from the other nodes.
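Extracting the cluster manager node from `mmlsmgr` output can be scripted, so you can ssh straight to it. A minimal sketch, with the sample output from above embedded for illustration:

```shell
# Pull the cluster manager node name out of `mmlsmgr` output; the name is
# the last field of the "Cluster manager node:" line, wrapped in parentheses.
mgr=$(awk '/Cluster manager node:/ { gsub(/[()]/, "", $NF); print $NF }' <<'EOF'
file system      manager node
---------------- ------------------
db2data2 (node03)
db2fs1  (node04)

Cluster manager node: (node03)
EOF
)
echo "cluster manager: $mgr"    # prints: cluster manager: node03
# In practice: mgr=$(mmlsmgr | awk '...'); ssh "$mgr" /usr/lpp/mmfs/bin/mmfsadm dump sgmgr
```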

node02:~ # ssh node03 mmfsadm dump sgmgr

Stripe groups managed by this node:
  "db2data2" id C0A88E68:4FE7F48A: status recovered, fsck not active seq 1662705587
     mgrTakeover noTakeover mgrRestricted 0
     asyncRecovery: needed 0 inProgress 0, onetimeRecoveryDone 1
     dmDoDeferredDeletions dmnoDefDel
     pending operations 0 [ ], quiesce level -1, blocked 0
     mgrOperationInProgress 0, logFileAssignmentInProgress 0
     initialLogRecoveryCompleted 0, logMigrateWhat 0x00
     FenceDone 0, aclGarbageCollectInProgress 0 resetEFOptions 0
     mounts:    3 nodes:  :1  :9  :9
     multiTMWanted true multiTMCfgChange false
     panics:    0 nodes
     unfenced:  3 nodes:  :0  :0  :0
   log group  1, index    0, flags 0x00, replicas 1, status     in use, user , migratePending 0
   log group  2, index    1, flags 0x00, replicas 1, status     in use, user , migratePending 0
   Node failure recovery statistics for last 2 failures:

    Completed at        |Total sec nodes|TM recov|AllocMgr| Fencing nodes disks|Log recov logs|
    --------------------+--------- -----+--------+--------+-------- ----- -----+--------- ----|
    2012-06-25@18:54:59 |   62.016     1|   0.007|   0.000|  62.001     1     1|    0.006    1|
    2012-06-25@19:33:53 |   52.031     1|   0.014|   0.000|  52.000     1     1|    0.012    1|

    Totals for 2 since  |Total sec nodes|TM recov|AllocMgr| Fencing nodes disks|Log recov logs|
    --------------------+--------- -----+--------+--------+-------- ----- -----+--------- ----|
    2012-06-25@18:54:59 |  114.047     2|   0.021|   0.000| 114.001     2     2|    0.019    2|

Notice the Fencing column: it shows 62 and 52 seconds for my failover tests. That is because I am running pureScale in a VM environment on my laptop, and the laptop hard drive serving as storage does not support SCSI-3 PR. In a real cluster, you will see this value between 0.5 and 2 seconds. Including DB2 recovery time, applications start processing transactions again within 5-10 seconds. This is possible because of the fast disk I/O fencing that SCSI-3 PR capable storage provides.
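Rather than reading the table by eye, you can pull the per-failure fencing time out of the `mmfsadm dump sgmgr` statistics with an awk one-liner. The Fencing block is the 5th '|'-separated column, and its first number is the time in seconds. The two sample rows from the output above are embedded here; in practice pipe `mmfsadm dump sgmgr` through the filter.

```shell
# Extract the fencing time (seconds) for each recorded node failure.
awk -F'|' '$1 ~ /@/ && NF >= 6 {
        gsub(/^ +| +$/, "", $1)   # trim the timestamp field
        split($5, f, " ")         # f[1] = fencing seconds
        print $1, "fencing:", f[1], "s"
    }' <<'EOF'
    2012-06-25@18:54:59 |   62.016     1|   0.007|   0.000|  62.001     1     1|    0.006    1|
    2012-06-25@19:33:53 |   52.031     1|   0.014|   0.000|  52.000     1     1|    0.012    1|
EOF
```

For the sample rows this prints 62.001 s and 52.000 s; with SCSI-3 PR working you would expect values around 0.5-2 seconds instead.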

Please see my previous articles for the script I attached to test whether your storage supports SCSI-3 PR.

Just remember this:

You need Type 7 PR (Persistent Reserve) for GPFS (fast I/O fencing) and Type 5 PR (tie-breaker) for RSCT / Tivoli System Automation (TSA), i.e., the cluster manager (the db2cluster -cm switch).

Type 7 PR is “Write Exclusive, All Registrants” – GPFS

Type 5 PR is “Write Exclusive, Registrants Only” – RSCT tie-breaker

SCSI-3 PR uses the concepts of registration and reservation. Each system (DB2 server) registers its key with a SCSI-3 device. The DB2 servers that register keys form a membership and establish a reservation, typically “Write Exclusive, All Registrants” (Type 7). This setting allows only registered systems to perform write operations. For a given disk, only one reservation can exist among the many registrations. With SCSI-3 PR, blocking write access is as simple as removing a registration from the device, and this process takes 3-20 seconds. Only registered DB2 members can eject the registration of another DB2 member.