Say, for example, you are creating a file system with the db2cluster command in DB2 pureScale and it fails with this error:

# db2cluster -cfs -create -filesystem db2data1 -disk /dev/dm-2 -mount /db2data/data1
There was an internal db2cluster error. Refer to the diagnostic logs (db2diag.log or 
/tmp/ibm.db2.cluster.*) 
and the DB2 Information Center for details.
A diagnostic log has been saved to '/tmp/ibm.db2.cluster.ZaVD0P'.

Here are the relevant excerpts from the diagnostic log.

 # cat /tmp/ibm.db2.cluster.ZaVD0P

Information in this record is only valid at the time when this file was
created (see this record's time stamp)

2014-03-06-07.38.07.957949-300 I1833E342             LEVEL: Info
PID     : 27676                TID : 139915009038112 PROC : db2cluster
INSTANCE: db2psc               NODE : 000
HOSTNAME: node02
FUNCTION: <0>, <0>, <0>, probe:2562
DATA #1 : String, 83 bytes
db2cluster -cfs -create -filesystem db2data1 -disk /dev/dm-2 -mount /db2data/data1

mmcrnsd: syncServerNodes: Unable to obtain mmsdrfs version line from backup server.
mmcrnsd: 6027-1639 Command failed.  Examine previous error messages to determine cause.
DATA #5 : signed integer, 4 bytes
255

2014-03-06-07.38.16.702172-300 I3045E1163            LEVEL: Severe
PID     : 27676                TID : 139915009038112 PROC : db2cluster
INSTANCE: db2psc               NODE : 000
HOSTNAME: node02
FUNCTION: DB2 UDB, high avail services, GPFSCluster::addDisk, probe:3592
DATA #1 : String, 49 bytes
We found less NSD created from the disk supplied.
DATA #2 : unsigned integer, 8 bytes
0
DATA #3 : unsigned integer, 8 bytes
1

2014-03-06-07.38.16.703015-300 I4209E1232            LEVEL: Severe
PID     : 27676                TID : 139915009038112 PROC : db2cluster
INSTANCE: db2psc               NODE : 000
HOSTNAME: node02
FUNCTION: DB2 UDB, high avail services, GPFSCluster::addDisk, probe:3617
DATA #1 : String, 157 bytes
The disk is not available as a free disk for a file system.  A concurrent create file system or add disk to 
file system may have been run with the same disk.
DATA #2 : String, 9 bytes
/dev/dm-2

The two Severe errors above point to two different issues. Let's take them one by one.

Error 1 – Unable to obtain mmsdrfs version line from backup server

The message above is not actually a GPFS disk error. It merely indicates that GPFS was not able to contact the other members in the cluster. You can see the issue in the output of the following commands.

node02:~ # ssh node02 mmlsnsd -M
Disk name    NSD volume ID      Device         Node name                Remarks
 ---------------------------------------------------------------------------------------
 gpfs1nsd     0A4D130C5317E766   /dev/dm-1      node02.rtp.purescale.local
 gpfs1nsd     0A4D130C5317E766   -              node03.rtp.purescale.local (not found) directly attached
 gpfs1nsd     0A4D130C5317E766   -              node04.rtp.purescale.local (not found) directly attached
node02:~ # ssh node03 mmlsnsd -M
Disk name    NSD volume ID      Device         Node name                Remarks
 ---------------------------------------------------------------------------------------
 gpfs1nsd     0A4D130C5317E766   /dev/dm-1      node03.rtp.purescale.local
 gpfs1nsd     0A4D130C5317E766   /dev/dm-1      node04.rtp.purescale.local
 gpfs1nsd     0A4D130C5317E766   -              node02.rtp.purescale.local (not found) directly attached
node02:~ # ssh node04 mmlsnsd -M
Disk name    NSD volume ID      Device         Node name                Remarks
 ---------------------------------------------------------------------------------------
 gpfs1nsd     0A4D130C5317E766   /dev/dm-1      node03.rtp.purescale.local
 gpfs1nsd     0A4D130C5317E766   /dev/dm-1      node04.rtp.purescale.local
 gpfs1nsd     0A4D130C5317E766   -              node02.rtp.purescale.local (not found) directly attached

Note the lines marked (not found), which reveal the communication problem. Check your passwordless SSH configuration between the machines. In DB2 pureScale, passwordless communication between members is done through the db2locssh command, so check the following:

node02:/var/db2/db2ssh # /var/db2/db2ssh/db2locssh node02 date
 Thu Mar  6 09:00:55 EST 2014
 node02:/var/db2/db2ssh # /var/db2/db2ssh/db2locssh node03 date
 failure - examine the system log on the remotehost for additional information
 node02:/var/db2/db2ssh # /var/db2/db2ssh/db2locssh node04 date
 failure - examine the system log on the remotehost for additional information
node03:/var/log # /var/db2/db2ssh/db2locssh node02 date
 failure - examine the system log on the remotehost for additional information
 node03:/var/log # /var/db2/db2ssh/db2locssh node03 date
 Thu Mar  6 10:00:27 EST 2014
 node03:/var/log # /var/db2/db2ssh/db2locssh node04 date
 Thu Mar  6 10:00:29 EST 2014
node04:~ # /var/db2/db2ssh/db2locssh node02 date
 failure - examine the system log on the remotehost for additional information
 node04:~ # /var/db2/db2ssh/db2locssh node03 date
 Thu Mar  6 10:01:38 EST 2014
 node04:~ # /var/db2/db2ssh/db2locssh node04 date
 Thu Mar  6 10:01:38 EST 2014

If communication through db2locssh works, we should get the date back from the other machine. So it looks as if something is wrong with the db2locssh key setup.

But it turns out that was not the case. On closer examination, one of the servers' clocks was an hour behind. As soon as the server date was corrected, db2locssh <OtherServerName> date worked fine from all servers, and consequently mmlsnsd -M showed correct output from all members. This sensitivity of db2locssh to server time differences has been fixed in DB2 10.5 FP3. Note the domino effect: a server clock out of sync broke db2locssh, which in turn broke file system creation. The importance of time synchronization between the members of a DB2 pureScale cluster cannot be overstated.
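Keeping member clocks in sync is normally NTP's job, but an ad-hoc skew check is easy to script. The sketch below is illustrative only: the hostnames are the ones from this article, and it assumes plain non-interactive ssh already works between the members.

```shell
#!/bin/bash
# Sketch: report clock skew (in seconds) between this host and each member.
# Hostnames passed to check_skew are illustrative; substitute your own.

# abs_diff A B -> absolute difference of two integers
abs_diff() {
  local d=$(( $1 - $2 ))
  echo "${d#-}"
}

# check_skew HOST... -> one "HOST skew: Ns" line per reachable host
check_skew() {
  local h remote
  for h in "$@"; do
    remote=$(ssh "$h" date +%s) || { echo "$h: unreachable"; continue; }
    echo "$h skew: $(abs_diff "$remote" "$(date +%s)")s"
  done
}

# e.g. check_skew node03 node04
```

Anything more than a second or two of skew is worth fixing at the NTP level rather than by hand.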

Error 2 – The disk is not available as a free disk for a file system.

This happens when the disk you are using was a GPFS disk before and was not properly cleaned up when it was made available for reuse. In general, follow the procedure outlined in the Information Center when permanently uninstalling GPFS.

On a related note, if you accidentally delete the /var/mmfs/cfg folder, you are dead in the water, so back this folder up as a best practice. After the disks are freed up, it is a good idea to wipe the disk header with this command:

# dd if=/dev/zero of=/dev/dm-1 bs=512 count=1

Of course, nothing guarantees that you will run this command on the right disk rather than one that is still in use. We are fallible and make mistakes, so it is also a good idea to back up the disk header first, in case we lose or overwrite it by accident:

# dd if=/dev/dm-2 of=/dm2.header bs=512 count=1
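The reverse operation, restoring a saved header, is just as short. A minimal sketch, wrapped in a function so that the source file and target device are explicit (the function name and the conv=notrunc option are my additions; the /dm2.header path is from the backup command above):

```shell
#!/bin/bash
# Sketch: write a previously saved 512-byte header back to a device.
# restore_header SAVED_FILE TARGET_DEVICE
# conv=notrunc stops dd from truncating the target when it is a regular file.
restore_header() {
  dd if="$1" of="$2" bs=512 count=1 conv=notrunc 2>/dev/null
}

# e.g. restore_header /dm2.header /dev/dm-2   # double-check the target first!
```

As with the zeroing command, triple-check the target device name before running this; dd overwrites without asking.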

You may also see this error in the diagnostic log, indicating another problem that prevents creating a file system on the disk:

mmcrnsd: 6027-1940 Unable to set reserve policy PR_shared on disk dm-4 on node node02

This usually comes from a failed GPFS operation that obtained a disk reservation but never released it. When you see this error, check the disk's reservations:

# sg_persist -i -k /dev/dm-4
 IBM 1814 FAStT 1060
 Peripheral device type: disk
 PR generation=0xb0, 1 registered reservation key follows:
 0x6d0000000001

If there are disk reservations, you can use a script like this to clear them:

node02:~/bin # cat clearreservation
#!/bin/bash
# Clear all SCSI-3 persistent reservation keys on the given disk device.
if [ "$#" != "1" ]; then
    echo "Usage: $0 <diskdevicename>" 1>&2
    echo "Usage: $0 /dev/dm-1" 1>&2
    exit 1
fi
ddevice=$1
echo "============================================================="
echo "disk device=${ddevice}"
echo "============================================================="
echo -n "Check the state of the disk ............. "
sg_persist -d ${ddevice} --in --read-keys > /tmp/sgp.txt
rc=$?
if [ $rc -eq 0 ]; then
    echo " Success"
    # Extract every registered key (lines starting with 0x) and clear it.
    KEYS=$(sg_persist -i -k -d ${ddevice} | grep -E "^[\t ]*0x.*" | sed -e 's/^ *//')
    for ik in $KEYS
    do
        sg_persist -d ${ddevice} --out -C -K $ik > /dev/null
    done
else
    echo " Failure RC=$rc"
    echo "============================================================="
    cat /tmp/sgp.txt
    echo "============================================================="
    exit 1
fi
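The key-extraction line in the script simply pulls the 0x… tokens out of the sg_persist listing. As a quick sanity check of that parsing, here it is run against a captured copy of the /dev/dm-4 output shown earlier (the sample text is hard-coded, so no disk is touched):

```shell
#!/bin/bash
# Reproduce the script's key extraction on a captured sg_persist listing.
# The sample mirrors the sg_persist -i -k /dev/dm-4 output shown above.
sample='  IBM 1814 FAStT 1060
  Peripheral device type: disk
  PR generation=0xb0, 1 registered reservation key follows:
    0x6d0000000001'

keys=$(printf '%s\n' "$sample" | grep -E '^[[:space:]]*0x' | sed -e 's/^ *//')
echo "$keys"
# -> 0x6d0000000001
```

Note that the "PR generation=0xb0" line is not matched, because its 0x does not start the line; only the registered keys themselves are extracted.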

After you clear the disk reservation, you can now use the disk to create the file system.