Additional Packages
After installing the base Red Hat 6.2 OS, install the following additional packages.
# yum install perftest
# yum install sysfsutils
# yum install kernel-devel
# yum install compat-libstdc++
# yum install tcl
# yum install tk
# yum install kernel-headers
# yum install libstdc++*
# yum install *i686*
# yum install libstdc++*.i686
# yum install compat-libstdc++*
# yum install make
# yum install gcc*
# yum install cpp*
# yum install device-mapper-multipath
Install OFED
After the Mellanox cards are installed in the system, make sure that they are properly configured for OFED.
There are two ways to install OFED on Red Hat:
- Using Infiniband Support from RHN
- Using Mellanox OFED packages
From RHN, do
# yum groupinstall "Infiniband Support"
If using the Mellanox OFED package, mount the ISO file and run the installer:
# mount -o loop /root/ofed/MLNX_OFED_LINUX-1.5.3-4.0.2-rhel6.2-x86_64.iso /mnt/ofed
# cd /mnt/ofed
# ls -l
dr-xr-xr-x 7 root root 2048 May 9 2012 .
drwxr-xr-x. 3 root root 4096 Feb 22 23:59 ..
dr-xr-xr-x 8 root root 2048 May 7 2012 docs
dr-xr-xr-x 4 root root 2048 Sep 26 2011 firmware
-r--r--r-- 1 root root 12 May 9 2012 .mlnx
-r-xr-xr-x 1 root root 163493 May 9 2012 mlnxofedinstall
dr-xr-xr-x 2 root root 2048 May 9 2012 repodata
dr-xr-xr-x 2 root root 22528 May 9 2012 RPMS
dr-xr-xr-x 2 root root 2048 May 9 2012 src
-r--r--r-- 1 root root 22 May 9 2012 .supported_kernels
-r-xr-xr-x 1 root root 13093 May 9 2012 uninstall.sh
# ./mlnxofedinstall
After installing OFED from Mellanox, also install the 32-bit binaries, because TSA uses them.
# cd RPMS
# for x in `echo *i686*`
> do
> echo $x
> rpm -ivh --nodeps $x
> done
There are some basic differences between Redhat and Mellanox OFED distributions. For example:
The dat.conf file is in /etc/rdma in the Red Hat implementation, but in /etc for Mellanox.
The service name used by Red Hat is rdma, whereas it is openibd for Mellanox or OFED.
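The two differences above can be handled in scripts with a small lookup. This is a minimal sketch, not part of either distribution: the function names are hypothetical, the optional root prefix exists only so the lookup can be tried against a scratch directory, and inferring the service name from the file location is an assumption based on the two cases listed above.

```shell
# Sketch: locate dat.conf for either OFED flavor.
# The optional first argument is a hypothetical root prefix (empty by default).
find_dat_conf() {
    root=${1:-}
    # Red Hat keeps the file in /etc/rdma, Mellanox OFED in /etc.
    for candidate in "$root/etc/rdma/dat.conf" "$root/etc/dat.conf"; do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Assumption: the dat.conf location tells us which service name applies.
ofed_service_name() {
    case "$1" in
        */etc/rdma/dat.conf) printf 'rdma\n' ;;
        *) printf 'openibd\n' ;;
    esac
}
```

For example, `service "$(ofed_service_name "$(find_dat_conf)")" status` would query whichever service matches the installed flavor.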
The next important step is to make sure that the dat.conf file has the proper entries. To do so, follow these steps.
1. Find the PCI device name of the Mellanox card.
# lspci | grep -i Mellanox
1b:00.0 Ethernet controller: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
2. Find out the MAC address associated with the device 1b:00.0 (Output from above shows the device name)
# mstflint -d 1b:00.0 q
Image type: ConnectX
FW Version: 2.9.1100
Rom Info: type=UEFI version=4.0.0 proto=ETH
          type=PXE version=3.3.300 devid=26448 proto=VPI
Device ID: 26448
Description: Port1 Port2
MACs: 0002c94cac48 0002c94cac49
Board ID: (IBM08C0110009)
VSD:
PSID: IBM08C0110009
In this example, the MAC addresses shown for port 1 and port 2 are 0002c94cac48 and 0002c94cac49.
Once we know the MAC address, we need to associate it with an IP address in the network-scripts directory.
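The two lookup steps above can also be scripted. The following is a minimal sketch that only parses text in the shape shown in the sample outputs; the function names are hypothetical, and both functions read command output from standard input.

```shell
# Print the PCI address (first field) of every Mellanox device in lspci output.
mellanox_pci_addrs() {
    awk 'tolower($0) ~ /mellanox/ { print $1 }'
}

# Print one MAC per line from the "MACs:" row of an mstflint query.
mstflint_macs() {
    awk '$1 == "MACs:" { for (i = 2; i <= NF; i++) print $i }'
}
```

A possible use is `lspci | mellanox_pci_addrs` to get the device name, followed by `mstflint -d 1b:00.0 q | mstflint_macs` to list the port MACs.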
For example:
# cd /etc/sysconfig/network-scripts
# cat ifcfg-eth0
DEVICE="eth0"
BOOTPROTO="static"
IPADDR="192.168.100.101"
IPV6INIT="no"
MTU=""
NETMASK="255.255.255.0"
NM_CONTROLLED="no"
ONBOOT="yes"
TYPE="Ethernet"
NAME='db2-roce'
STARTMODE=auto
USERCONTROL=no
BOOTPROTO=none
USERCTL=no
HWADDR=00:02:C9:4C:AC:48
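mstflint prints the MAC without separators, while the HWADDR entry in the example above uses colon-separated upper-case pairs. A minimal sketch of the conversion follows; the function name format_mac is hypothetical.

```shell
# Convert a bare 12-digit MAC (as printed by mstflint) to HWADDR form.
format_mac() {
    # Insert a colon after every two hex digits, drop the trailing colon,
    # then upper-case the hex letters.
    printf '%s\n' "$1" | sed 's/../&:/g; s/:$//' | tr 'a-f' 'A-F'
}
```

For the port 1 MAC above, `format_mac 0002c94cac48` prints `00:02:C9:4C:AC:48`, which is the value used for HWADDR.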
Verify this with the following command.
# ip addr show eth0
9: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
    link/ether 00:02:c9:4c:ac:48 brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.101/24 brd 192.168.100.255 scope global eth0
    inet6 fe80::202:c9ff:fe4c:ac48/64 scope link
       valid_lft forever preferred_lft forever
Even if you are not using IPv6, make sure that an inet6 address with scope link is shown in the above output.
Once the above is done, check the output of ibstat -l and ibstat -v, and the status of the openibd service.
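The link-local address in the output above is derived from the MAC address using the modified EUI-64 scheme: the universal/local bit of the first octet is flipped and ff:fe is inserted in the middle. The sketch below reproduces that derivation so the expected value can be compared with what ip addr reports; the function name mac_to_eui64 is hypothetical.

```shell
# Derive the 64-bit modified EUI-64 interface identifier from a MAC address.
mac_to_eui64() {
    # Normalize: strip colons and lower-case the hex digits.
    mac=$(printf '%s' "$1" | tr -d ':' | tr 'A-F' 'a-f')
    first=$(printf '%s' "$mac" | cut -c1-2)
    oui_rest=$(printf '%s' "$mac" | cut -c3-6)
    nic=$(printf '%s' "$mac" | cut -c7-12)
    # Flip the universal/local bit (0x02) of the first octet.
    flipped=$(( 0x$first ^ 0x02 ))
    # Insert fffe between the two halves of the MAC.
    printf '%02x%sfffe%s\n' "$flipped" "$oui_rest" "$nic"
}
```

For 00:02:C9:4C:AC:48 this prints `0202c9fffe4cac48`, which corresponds to fe80::202:c9ff:fe4c:ac48 once leading zeros are dropped. The same 64-bit value appears as the Port GUID in the ibstat output for this card.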
# ibstat -l
mlx4_0
# ibstat -v
CA 'mlx4_0'
    CA type: MT26448
    Number of ports: 2
    Firmware version: 2.9.1100
    Hardware version: b0
    Node GUID: 0x0002c903004cac48
    System image GUID: 0x0002c903004cac4b
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0x0202c9fffe4cac48
        Link layer: Ethernet
    Port 2:
        State: Active
        Physical state: LinkUp
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0x0202c9fffe4cac49
        Link layer: Ethernet
# service openibd status
HCA driver loaded
Configured MLX4_EN devices:
eth0 eth1
Currently active MLX4_EN devices:
eth0
The following OFED modules are loaded:
rdma_ucm rdma_cm ib_addr ib_ipoib mlx4_core mlx4_ib mlx4_en ib_mthca ib_uverbs ib_umad ib_ucm ib_sa ib_cm ib_mad ib_core iw_cxgb3 iw_nes
If only one port on the Mellanox card is attached to a switch, 'Currently active MLX4_EN devices' should list eth0 only, while the configured devices should list both ports. If the currently active devices list both eth0 and eth1 but the eth1 port is not connected to the switch, delete the eth1 configuration file (ifcfg-eth1).
The following is an example of a configuration in which both ports are connected to different switches. Note that eth1 is not associated with an IP address, but it is associated with the MAC address of port 2.
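The comparison between configured and active devices can be automated. This is a minimal sketch that assumes the status text has the section headings shown in the sample output; it accepts the device names either on the heading line or on the following lines. The function name is hypothetical.

```shell
# Print MLX4_EN devices that are configured but not currently active.
# Reads the output of "service openibd status" from standard input.
inactive_mlx4_en_devices() {
    awk '
        /Configured MLX4_EN devices:/       { section = "conf";   sub(/.*devices:/, "") }
        /Currently active MLX4_EN devices:/ { section = "active"; sub(/.*devices:/, "") }
        /The following OFED modules/        { section = "" }
        section == "conf"   { for (i = 1; i <= NF; i++) conf[++n] = $i }
        section == "active" { for (i = 1; i <= NF; i++) active[$i] = 1 }
        END { for (i = 1; i <= n; i++) if (!(conf[i] in active)) print conf[i] }
    '
}
```

With the sample status above, `service openibd status | inactive_mlx4_en_devices` would print `eth1`, which is expected when only one port is cabled.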
# lspci | grep Mellanox
11:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3]
root@pdtxecomdb02.lowes.com:/etc/sysconfig/network-scripts> mstflint -d 11:00.0 q
Image type: ConnectX
FW Version: 2.10.2326
Rom Info: type=UEFI version=4.0.420 proto=ETH
Device ID: 4099
Description: Node Port1 Port2 Sys image
GUIDs: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
MACs: 0002c9e64a20 0002c9e64a21
Board ID: (IBM0FE0140023)
VSD:
PSID: IBM0FE0140023
# cd /etc/sysconfig/network-scripts
# cat ifcfg-eth0
DEVICE=eth0
TYPE=Ethernet
NAME='db2-roce'
STARTMODE=auto
USERCONTROL=no
BOOTPROTO=none
ONBOOT=yes
USERCTL=no
VLAN=yes
HWADDR=00:02:C9:E6:4A:20
IPADDR=10.153.1.17
NETMASK=255.255.255.0
# cat ifcfg-eth
DEVICE=eth1
TYPE=Ethernet
NAME='db2-roce'
STARTMODE=auto
USERCONTROL=no
BOOTPROTO=none
ONBOOT=yes
USERCTL=no
VLAN=yes
HWADDR=00:02:C9:E6:4A:21
After IP addresses are properly assigned to the Mellanox card ports, the next step is to make sure that the dat.conf file has the proper entries. For example:
ofa-v2-eth0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth0 0" ""
ofa-v2-eth1 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "eth1 0" ""
The above 2 entries ofa-v2-eth0 and ofa-v2-eth1 are associated with eth0 and eth1 (Mellanox ports). In our setup, we are only using eth0.
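Both entries follow the same pattern and differ only in the interface name. A minimal sketch that prints the entry for a given interface and checks whether a dat.conf file already contains it follows; the function names are hypothetical, and the fixed fields are copied from the two entries above.

```shell
# Print the dat.conf entry for an Ethernet interface, e.g. eth0.
dat_entry() {
    printf 'ofa-v2-%s u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "%s 0" ""\n' "$1" "$1"
}

# Succeed if the given dat.conf file already has an entry for the interface.
has_dat_entry() {
    grep -q "^ofa-v2-$1[[:space:]]" "$2"
}
```

For example, `has_dat_entry eth0 /etc/dat.conf || dat_entry eth0 >> /etc/dat.conf` would append the entry only when it is missing (use /etc/rdma/dat.conf on the Red Hat implementation).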
Test RDMA
Once 10 GbE connectivity is established and the machines are in the cluster, test RDMA between two machines. For example, run the following commands.
On one machine, start dtest as a server. After seeing the following message, you need to run the same command as client on another machine.
# dtest -P ofa-v2-eth0
9721 Running as server - ofa-v2-eth0
9721 Local Address AF_INET - 192.168.100.110 port 45248
9721 Server waiting for connect request on port b0c0
Start client on another machine.
# dtest -P ofa-v2-eth0 -h 192.168.100.110
[root@lwsecompsdb03 ofed]# dtest -P ofa-v2-eth0 -h 192.168.134.110
17204 Running as client - ofa-v2-eth0
17204 Local Address AF_INET - 192.168.134.111 port 45248
17204 Server Name: 192.168.100.110
17204 Server Net Address: 192.168.134.110 port b0c0
17204 Waiting for connect response
17204 CONNECTED!
17204 Send RMR msg to remote: r_key_ctx=0x2803e51b,va=0xc6c7a0,len=0x40
17204 remote RMR data arrived!
17204 Received RMR from remote: r_iov: r_key_ctx=7003b419,va=21a17a0,len=0x40
17204 Query EP: LOCAL addr 192.168.100.111 port a001
17204 Query EP: REMOTE addr 192.168.100.110 port b0c0
17204 RDMA WRITE DATA with SEND MSG
17204 Sending RDMA WRITE completion message
17204 inbound rdma_write; send message arrived!
17204 Received RMR from remote: r_iov: r_key_ctx=7003b419,va=21a17a0,len=0x40
17204 CLIENT: RDMA write buffer contains: server RDMA write data...
17204 RDMA READ DATA with SEND MSG
17204 Sending RDMA read completion message
17204 Waiting for inbound message....
17204 inbound rdma_read; send message arrived!
17204 Received RMR from remote: r_iov: r_key_ctx=7003b419,va=21a17a0,len=0x40
17204 CLIENT: RCV RDMA read buffer contains: server RDMA read data...
17204 PING DATA with SEND MSG
17204: DAPL Test Complete. PASSED
After the above test succeeds, you know that RDMA works between the servers.
The best practice is to have a public network through which clients connect to the database server. The interconnect between pureScale members should be on a private network, where traffic is limited to the subnet, even without going through the switch. An example /etc/hosts file may look like this:
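When the test is run from a script, the final status line can be checked instead of reading the whole log. This is a minimal sketch that assumes the client ends with the "DAPL Test Complete. PASSED" line shown above and that dtest writes it to standard output; the function name is hypothetical.

```shell
# Succeed only if the dtest log on standard input ends in a PASSED result.
dtest_passed() {
    grep -q 'DAPL Test Complete\. PASSED'
}
```

For example, `dtest -P ofa-v2-eth0 -h 192.168.100.110 | dtest_passed && echo "RDMA OK"`.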
$ cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
# Public network through which clients would connect to the databases
10.10.10.109 db01.zinox.com db01
10.10.10.110 db02.zinox.com db02
10.10.10.111 db03.zinox.com db03
10.10.10.112 db04.zinox.com db04
10.10.10.113 db05.zinox.com db05
# Private network through which DB2 servers talk to each other using Mellanox adapters and RDMA
192.168.100.109 db01-re.zinox.com db01-re
192.168.100.110 db02-re.zinox.com db02-re
192.168.100.111 db03-re.zinox.com db03-re
192.168.100.112 db04-re.zinox.com db04-re
192.168.100.113 db05-re.zinox.com db05-re
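With the naming convention in this example, the private address of a member is found by appending -re to its short host name. A minimal sketch of that lookup follows; the function name private_ip is hypothetical and the -re suffix is specific to this example file.

```shell
# Print the private (RDMA) IP address for a short host name such as db02.
# Usage: private_ip HOST HOSTS_FILE
private_ip() {
    awk -v name="$1-re" '
        /^[[:space:]]*#/ { next }    # skip comment lines
        {
            for (i = 2; i <= NF; i++)
                if ($i == name) { print $1; exit }
        }
    ' "$2"
}
```

For example, `private_ip db02 /etc/hosts` prints `192.168.100.110` for the file above, which is the address passed to dtest with -h.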
The other required settings for Mellanox are in /etc/modprobe.d:
# cd /etc/modprobe.d/
# ls -l
total 44
-rw-r--r--. 1 root root 52 Jan 21 22:35 anaconda.conf
-rw-r--r--. 1 root root 933 Feb 22 23:35 blacklist.conf
-rw-r--r--. 1 root root 382 Aug 10 2010 dist-alsa.conf
-rw-r--r--. 1 root root 5596 Aug 10 2010 dist.conf
-rw-r--r--. 1 root root 473 Aug 10 2010 dist-oss.conf
-rw-r--r-- 1 root root 308 Feb 23 04:12 ib_ipoib.conf
-rw-r--r-- 1 root root 46 May 9 2012 ib_sdp.conf
-rw-r--r-- 1 root root 956 Feb 23 04:14 mlx4_en.conf
-rw-r--r-- 1 root root 63 Feb 23 00:49 modprobe.conf
-rw-r--r--. 1 root root 30 Oct 9 2009 openfwwf.conf
# cat modprobe.conf
options mlx4_core log_mtts_per_seg=7
options mlx4_en num_lro=0
# cat mlx4_en.conf
install mlx4_core modprobe --ignore-install $((modprobe -c | grep -wq "^allow_unsupported_modules") && echo '--allow-unsupported-modules') mlx4_core && if [ -e /etc/infiniband/openib.conf ]; then if ( grep -q "^MLX4_EN_LOAD=yes" /etc/infiniband/openib.conf > /dev/null 2>&1); then modprobe mlx4_en; fi; else modprobe mlx4_en; fi
install mlx4_en modprobe --ignore-install $((modprobe -c | grep -wq "^allow_unsupported_modules") && echo '--allow-unsupported-modules') mlx4_en && if [ -e /etc/infiniband/openib.conf ]; then if ( grep -q "^RUN_SYSCTL=yes" /etc/infiniband/openib.conf > /dev/null 2>&1); then /sbin/sysctl_perf_tuning load; fi; fi
remove mlx4_en /sbin/sysctl_perf_tuning unload ; modprobe -r --ignore-remove mlx4_en
# Configure Flow Control
# pfctx:Priority based Flow Control policy on TX[7:0]. Per priority bit mask (uint)
# pfcrx:Priority based Flow Control policy on RX[7:0]. Per priority bit mask (uint)
options mlx4_core pfctx=0 pfcrx=0
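To confirm that a module option is set as expected, the options lines in these files can be read back. This is a minimal sketch; the function name module_option is hypothetical, and it only inspects the given file, not the value the loaded driver is actually using.

```shell
# Print the value of OPTION for MODULE as set by "options" lines in FILE.
# Usage: module_option MODULE OPTION FILE
module_option() {
    awk -v mod="$1" -v opt="$2" '
        $1 == "options" && $2 == mod {
            for (i = 3; i <= NF; i++) {
                split($i, kv, "=")
                # A later line overrides an earlier one.
                if (kv[1] == opt) { value = kv[2]; found = 1 }
            }
        }
        END { if (found) print value }
    ' "$3"
}
```

For example, `module_option mlx4_core log_mtts_per_seg /etc/modprobe.d/modprobe.conf` prints `7` for the file shown above.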
Two entries are added to blacklist.conf:
blacklist iTCO_wdt
blacklist iTCO_vendor_support