Additional Packages

After installing the base Red Hat 6.2 OS, install the following additional packages.

# yum install perftest
# yum install sysfsutils
# yum install kernel-devel
# yum install compat-libstdc++
# yum install tcl
# yum install tk
# yum install kernel-headers
# yum install libstdc++*
# yum install *i686*
# yum install libstdc++*.i686
# yum install compat-libstdc++*
# yum install make
# yum install gcc*
# yum install cpp*
# yum install device-mapper-multipath
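The individual yum commands above can also be combined into a single transaction. The following is a minimal sketch of one way to do that, assuming a POSIX shell. The function name build_yum_cmd is hypothetical and not part of yum; it only prints the command so that it can be reviewed before being run as root.

```shell
# Hypothetical helper: print one yum command covering a list of packages.
# Repeated package names are dropped while the original order is kept.
build_yum_cmd() {
    seen=""
    for pkg in "$@"; do
        case " $seen " in
            *" $pkg "*) ;;               # already listed, skip it
            *) seen="$seen $pkg" ;;
        esac
    done
    # $seen starts with a space, so no extra separator is needed.
    echo "yum -y install$seen"
}
```

For example, build_yum_cmd tcl tk make tcl prints a single command that lists tcl only once.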

Install OFED

After the Mellanox cards are installed in the system, we need to make sure that they are properly configured for OFED.

OFED can be installed on Red Hat in two ways:

  • Using Infiniband Support from RHN
  • Using Mellanox OFED packages

To install from RHN, run:

# yum groupinstall "Infiniband Support"

If using the Mellanox OFED package, mount the ISO file and run the installer:

# mount -o loop /root/ofed/MLNX_OFED_LINUX-1.5.3-4.0.2-rhel6.2-x86_64.iso /mnt/ofed
# cd /mnt/ofed
# ls -la
dr-xr-xr-x  7 root root   2048 May  9  2012 .
drwxr-xr-x. 3 root root   4096 Feb 22 23:59 ..
dr-xr-xr-x  8 root root   2048 May  7  2012 docs
dr-xr-xr-x  4 root root   2048 Sep 26  2011 firmware
-r--r--r--  1 root root     12 May  9  2012 .mlnx
-r-xr-xr-x  1 root root 163493 May  9  2012 mlnxofedinstall
dr-xr-xr-x  2 root root   2048 May  9  2012 repodata
dr-xr-xr-x  2 root root  22528 May  9  2012 RPMS
dr-xr-xr-x  2 root root   2048 May  9  2012 src
-r--r--r--  1 root root     22 May  9  2012 .supported_kernels
-r-xr-xr-x  1 root root  13093 May  9  2012 uninstall.sh
# ./mlnxofedinstall

After installing OFED from Mellanox, also install the 32-bit binaries, as they are used by TSA.

# cd RPMS
# for x in *i686*
> do
> echo "$x"
> rpm -ivh --nodeps "$x"
> done

There are some basic differences between the Red Hat and Mellanox OFED distributions. For example:

The dat.conf file is in /etc/rdma in the Red Hat implementation, but it is in /etc for Mellanox.

The service name used by Red Hat is rdma, whereas Mellanox and OFED use openibd.
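Because the location of dat.conf depends on which distribution was installed, a script can probe both places. The following is a minimal sketch, assuming a POSIX shell. The function name find_dat_conf and its optional root-prefix argument are hypothetical and exist only to make the check easy to try out against a scratch directory.

```shell
# Print the path of the first dat.conf found.
# /etc/dat.conf is the Mellanox OFED location and /etc/rdma/dat.conf is the
# Red Hat location. $1 is an optional root prefix (an assumption, for testing).
find_dat_conf() {
    root="${1:-}"
    for f in "$root/etc/dat.conf" "$root/etc/rdma/dat.conf"; do
        if [ -f "$f" ]; then
            echo "$f"
            return 0
        fi
    done
    return 1
}
```

Called with no argument, it checks the real /etc paths and returns a non-zero status if neither file exists.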

The next important step is to make sure that we have proper entries in the dat.conf file. To do that, we need to follow these steps.

1. Find the device name of the Mellanox card.

# lspci | grep -i Mellanox
1b:00.0 Ethernet controller: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)

2. Find the MAC addresses associated with device 1b:00.0 (the device name shown in the output above).

# mstflint -d 1b:00.0 q
Image type:      ConnectX
FW Version:      2.9.1100
Rom Info:        type=UEFI version=4.0.0 proto=ETH
                 type=PXE  version=3.3.300 devid=26448 proto=VPI
Device ID:       26448
Description:     Port1            Port2
MACs:            0002c94cac48     0002c94cac49
Board ID:        (IBM08C0110009)
VSD:
PSID:            IBM08C0110009

In this example, the MAC addresses shown for port 1 and port 2 are 0002c94cac48 and 0002c94cac49.

Once we know the MAC address, we need to associate it with an IP address in the network-scripts directory.

For example:

# cd /etc/sysconfig/network-scripts
# cat ifcfg-eth0
DEVICE="eth0"
BOOTPROTO="static"
IPADDR="192.168.100.101"
IPV6INIT="no"
MTU=""
NETMASK="255.255.255.0"
NM_CONTROLLED="no"
ONBOOT="yes"
TYPE="Ethernet"
NAME='db2-roce'
STARTMODE=auto
USERCONTROL=no
BOOTPROTO=none
USERCTL=no
HWADDR=00:02:C9:4C:AC:48
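The MAC address reported by mstflint has no separators, whereas HWADDR in the ifcfg file uses colon-separated pairs. The following is a small sketch of the conversion, assuming a POSIX shell with tr and sed. The function name format_mac is hypothetical.

```shell
# Convert a bare 12-digit MAC (as printed by mstflint) to HWADDR form.
format_mac() {
    # Reject anything that is not exactly 12 hexadecimal digits.
    case "$1" in
        *[!0-9a-fA-F]*) return 1 ;;
    esac
    [ "${#1}" -eq 12 ] || return 1
    # Upper-case the digits, add a colon after each pair, drop the last colon.
    echo "$1" | tr 'a-f' 'A-F' | sed 's/../&:/g; s/:$//'
}
```

For the port 1 address above, format_mac 0002c94cac48 prints 00:02:C9:4C:AC:48.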

Verify this with the following command.

# ip addr show eth0
9: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:02:c9:4c:ac:48 brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.101/24 brd 192.168.100.255 scope global eth0
    inet6 fe80::202:c9ff:fe4c:ac48/64 scope link
       valid_lft forever preferred_lft forever
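The inet6 link-local address in the output above is normally derived from the interface MAC address using the modified EUI-64 rule: the 0x02 bit of the first octet is flipped and ff:fe is inserted in the middle. The following is a minimal sketch of that calculation, assuming a POSIX shell. The function name mac_to_link_local is hypothetical, and the sketch does not handle the rare case where the first 16-bit group would be zero.

```shell
# Derive the expected IPv6 link-local address from a colon-separated MAC.
mac_to_link_local() {
    # Split the MAC into six octets ($1 .. $6).
    old_ifs=$IFS
    IFS=:
    set -- $1
    IFS=$old_ifs
    [ $# -eq 6 ] || return 1
    # Flip the universal/local bit (0x02) of the first octet.
    first=$(( 0x$1 ^ 2 ))
    # Build the four 16-bit groups; %x drops leading zeros, as ip does.
    printf 'fe80::%x:%x:%x:%x\n' \
        $(( (first << 8) | 0x$2 )) \
        $(( (0x$3 << 8) | 0xff )) \
        $(( 0xfe00 | 0x$4 )) \
        $(( (0x$5 << 8) | 0x$6 ))
}
```

For the eth0 MAC address 00:02:c9:4c:ac:48, the function prints fe80::202:c9ff:fe4c:ac48, which matches the inet6 line shown by ip addr.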

Even if you are not using IPv6, make sure that an inet6 address with scope link is shown in the above output.

Once this is done, check the output of ibstat -l and ibstat -v.

# ibstat -l
mlx4_0
# ibstat -v
CA 'mlx4_0'
        CA type: MT26448
        Number of ports: 2
        Firmware version: 2.9.1100
        Hardware version: b0
        Node GUID: 0x0002c903004cac48
        System image GUID: 0x0002c903004cac4b
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0x0202c9fffe4cac48
                Link layer: Ethernet
        Port 2:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0x0202c9fffe4cac49
                Link layer: Ethernet
# service openibd status

  HCA driver loaded

Configured MLX4_EN devices:
eth0 eth1

Currently active MLX4_EN devices:
eth0

The following OFED modules are loaded:

  rdma_ucm
  rdma_cm
  ib_addr
  ib_ipoib
  mlx4_core
  mlx4_ib
  mlx4_en
  ib_mthca
  ib_uverbs
  ib_umad
  ib_ucm
  ib_sa
  ib_cm
  ib_mad
  ib_core
  iw_cxgb3
  iw_nes

If only one port on the Mellanox card is attached to a switch, ‘Currently active MLX4_EN devices’ should show only eth0, while the configured devices should show both ports. If the currently active devices show both eth0 and eth1 and the eth1 port is not connected to the switch, delete the ifcfg-eth1 file.
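The check described above can be scripted by picking the active device list out of the status output. The following is a minimal sketch, assuming a POSIX shell with awk and assuming the status output has the layout shown earlier. The function name active_mlx4_en_devices is hypothetical.

```shell
# Read `service openibd status` output on stdin and print the line that
# follows the "Currently active MLX4_EN devices:" heading.
active_mlx4_en_devices() {
    awk '/^Currently active MLX4_EN devices:/ { getline; print; exit }'
}
```

A typical use would be service openibd status | active_mlx4_en_devices, which for the output above prints eth0.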

The following is an example of a configuration where both ports are connected to different switches. Note that eth1 is not associated with an IP address, but it is associated with the MAC address of port 2.

# lspci | grep Mellanox
11:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]
# mstflint -d 11:00.0 q
Image type:      ConnectX
FW Version:      2.10.2326
Rom Info:        type=UEFI version=4.0.420 proto=ETH
Device ID:       4099
Description:     Node             Port1            Port2            Sys image
GUIDs:           ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
MACs:                                 0002c9e64a20     0002c9e64a21
Board ID:         (IBM0FE0140023)
VSD:
PSID:            IBM0FE0140023

# cd /etc/sysconfig/network-scripts
# cat ifcfg-eth0
DEVICE=eth0
TYPE=Ethernet
NAME='db2-roce'
STARTMODE=auto
USERCONTROL=no
BOOTPROTO=none
ONBOOT=yes
USERCTL=no
VLAN=yes
HWADDR=00:02:C9:E6:4A:20
IPADDR=10.153.1.17
NETMASK=255.255.255.0

# cat ifcfg-eth1
DEVICE=eth1
TYPE=Ethernet
NAME='db2-roce'
STARTMODE=auto
USERCONTROL=no
BOOTPROTO=none
ONBOOT=yes
USERCTL=no
VLAN=yes
HWADDR=00:02:C9:E6:4A:21

After the IP addresses are properly attached to the Mellanox card's ports, the next step is to make sure that we have proper entries in the dat.conf file. For example:

ofa-v2-eth0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth0 0" ""
ofa-v2-eth1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth1 0" ""

The above 2 entries ofa-v2-eth0 and ofa-v2-eth1 are associated with eth0 and eth1 (Mellanox ports). In our setup, we are only using eth0.
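Since each entry differs only in the interface name, the line for a given port can be generated and checked with a small script. The following is a sketch, assuming a POSIX shell and assuming the same provider library and port number as in the entries above. The function names dat_entry and has_dat_entry are hypothetical.

```shell
# Print a dat.conf entry for one Ethernet interface, in the same form as
# the ofa-v2-eth0 and ofa-v2-eth1 entries shown above.
dat_entry() {
    printf 'ofa-v2-%s u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "%s 0" ""\n' "$1" "$1"
}

# Succeed if the given dat.conf file ($2) already has an entry for
# interface $1.
has_dat_entry() {
    grep -q "^ofa-v2-$1 " "$2"
}
```

For example, has_dat_entry eth0 /etc/dat.conf || dat_entry eth0 >> /etc/dat.conf would append the entry only when it is missing (use /etc/rdma/dat.conf with the Red Hat packages).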

Test RDMA

Once 10 GbE connectivity is established and the machines are in the cluster, we need to test RDMA between two machines. For example, run the following commands.

On one machine, start dtest as a server. After you see the following messages, run the same command as a client on another machine.

# dtest -P ofa-v2-eth0
9721 Running as server - ofa-v2-eth0
9721 Local Address AF_INET - 192.168.100.110 port 45248
9721 Server waiting for connect request on port b0c0

Start the client on another machine.

# dtest -P ofa-v2-eth0 -h 192.168.100.110
17204 Running as client - ofa-v2-eth0
17204 Local Address AF_INET - 192.168.100.111 port 45248
17204 Server Name: 192.168.100.110
17204 Server Net Address: 192.168.100.110 port b0c0
17204 Waiting for connect response

17204 CONNECTED!

17204 Send RMR msg to remote: r_key_ctx=0x2803e51b,va=0xc6c7a0,len=0x40
17204 remote RMR data arrived!
17204 Received RMR from remote: r_iov: r_key_ctx=7003b419,va=21a17a0,len=0x40

17204 Query EP: LOCAL addr 192.168.100.111 port a001
17204 Query EP: REMOTE addr 192.168.100.110 port b0c0

 17204 RDMA WRITE DATA with SEND MSG

17204 Sending RDMA WRITE completion message
17204 inbound rdma_write; send message arrived!
17204 Received RMR from remote: r_iov: r_key_ctx=7003b419,va=21a17a0,len=0x40
17204 CLIENT: RDMA write buffer contains: server RDMA write data...

 17204 RDMA READ DATA with SEND MSG

17204 Sending RDMA read completion message
17204 Waiting for inbound message....
17204 inbound rdma_read; send message arrived!
17204 Received RMR from remote: r_iov: r_key_ctx=7003b419,va=21a17a0,len=0x40
17204 CLIENT: RCV RDMA read buffer contains: server RDMA read data...
 17204 PING DATA with SEND MSG
17204: DAPL Test Complete. PASSED

Once the above test succeeds, we know that RDMA works between the servers.
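When the test is run from a script, the result can be checked by looking for the final PASSED line in the client output. The following is a minimal sketch, assuming a POSIX shell and assuming the completion message has the wording shown above. The function name dtest_passed is hypothetical.

```shell
# Read dtest output on stdin and succeed only if it reports a pass.
dtest_passed() {
    grep -q 'DAPL Test Complete\. PASSED'
}
```

For example, dtest -P ofa-v2-eth0 -h 192.168.100.110 2>&1 | dtest_passed && echo "RDMA OK" prints the message only when the test completed successfully.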

The best practice is to have a public network through which clients connect to the database server. The interconnect between pureScale members should be on a private network, where traffic stays within the subnet. An example /etc/hosts file may look like this:

$ cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
# Public network through which clients would connect to the databases
10.10.10.109 db01.zinox.com db01
10.10.10.110 db02.zinox.com db02
10.10.10.111 db03.zinox.com db03
10.10.10.112 db04.zinox.com db04
10.10.10.113 db05.zinox.com db05
# Private network through which DB2 servers talk to each other using Mellanox adapters and RDMA
192.168.100.109 db01-re.zinox.com db01-re
192.168.100.110 db02-re.zinox.com db02-re
192.168.100.111 db03-re.zinox.com db03-re
192.168.100.112 db04-re.zinox.com db04-re
192.168.100.113 db05-re.zinox.com db05-re
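With two names per server, it is easy to forget an entry on one of the machines. The following is a minimal sketch of a check that reports names missing from a hosts file, assuming a POSIX shell with awk. The function name missing_hosts is hypothetical.

```shell
# Print every name from the argument list that has no entry in the hosts
# file given as $1. The return status is 0 only when nothing is missing.
missing_hosts() {
    hosts_file=$1
    shift
    status=0
    for name in "$@"; do
        # Match the name as a whole field on a non-comment line.
        if ! awk -v n="$name" '
            /^[ \t]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$hosts_file"; then
            echo "$name"
            status=1
        fi
    done
    return $status
}
```

For example, missing_hosts /etc/hosts db01 db01-re db02 db02-re prints nothing when all four names are present.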

Other settings required for Mellanox are in /etc/modprobe.d:

# cd /etc/modprobe.d/
# ls -l
total 44
-rw-r--r--. 1 root root 52 Jan 21 22:35 anaconda.conf
-rw-r--r--. 1 root root 933 Feb 22 23:35 blacklist.conf
-rw-r--r--. 1 root root 382 Aug 10 2010 dist-alsa.conf
-rw-r--r--. 1 root root 5596 Aug 10 2010 dist.conf
-rw-r--r--. 1 root root 473 Aug 10 2010 dist-oss.conf
-rw-r--r--  1 root root 308 Feb 23 04:12 ib_ipoib.conf
-rw-r--r--  1 root root 46 May 9 2012 ib_sdp.conf
-rw-r--r--  1 root root 956 Feb 23 04:14 mlx4_en.conf
-rw-r--r--  1 root root 63 Feb 23 00:49 modprobe.conf
-rw-r--r--. 1 root root 30 Oct 9 2009 openfwwf.conf

# cat modprobe.conf
options mlx4_core log_mtts_per_seg=7
options mlx4_en num_lro=0
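After the modules are reloaded, the value that the driver actually picked up can be read back from sysfs. The following is a minimal sketch, assuming a POSIX shell and assuming the driver exposes the parameter under /sys/module, which depends on the driver version. The function name module_param and its optional root-prefix argument are hypothetical.

```shell
# Print the current value of a kernel module parameter from sysfs.
# $1 = module name, $2 = parameter name,
# $3 = optional root prefix (an assumption, used only for testing).
module_param() {
    f="${3:-}/sys/module/$1/parameters/$2"
    [ -r "$f" ] || return 1
    cat "$f"
}
```

With the options above in effect, module_param mlx4_core log_mtts_per_seg would be expected to print 7.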

# cat mlx4_en.conf
install mlx4_core modprobe --ignore-install $((modprobe -c | grep -wq "^allow_unsupported_modules") && echo '--allow-unsupported-modules') mlx4_core && if [ -e /etc/infiniband/openib.conf ]; then if ( grep -q "^MLX4_EN_LOAD=yes" /etc/infiniband/openib.conf > /dev/null 2>&1); then modprobe mlx4_en; fi; else modprobe mlx4_en; fi
install mlx4_en modprobe --ignore-install $((modprobe -c | grep -wq "^allow_unsupported_modules") && echo '--allow-unsupported-modules') mlx4_en && if [ -e /etc/infiniband/openib.conf ]; then if ( grep -q "^RUN_SYSCTL=yes" /etc/infiniband/openib.conf > /dev/null 2>&1); then /sbin/sysctl_perf_tuning load; fi; fi
remove mlx4_en /sbin/sysctl_perf_tuning unload ; modprobe -r --ignore-remove mlx4_en

# Configure Flow Control
# pfctx:Priority based Flow Control policy on TX[7:0]. Per priority bit mask (uint)
# pfcrx:Priority based Flow Control policy on RX[7:0]. Per priority bit mask (uint)
options mlx4_core pfctx=0 pfcrx=0

Two entries are added to blacklist.conf:

blacklist iTCO_wdt
blacklist iTCO_vendor_support
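To confirm that the blacklist entries are in place on every machine, the configuration directory can be searched for them. The following is a minimal sketch, assuming a POSIX shell with grep. The function name is_blacklisted and its optional directory argument are hypothetical.

```shell
# Succeed if module $1 is blacklisted in any .conf file under the
# directory $2 (default: /etc/modprobe.d).
is_blacklisted() {
    dir="${2:-/etc/modprobe.d}"
    grep -qs "^[[:space:]]*blacklist[[:space:]][[:space:]]*$1[[:space:]]*\$" "$dir"/*.conf
}
```

For example, is_blacklisted iTCO_wdt || echo "iTCO_wdt is not blacklisted" reports a machine where the entry is missing.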