IPoIB, or Internet Protocol over Infiniband, is an Infiniband extension which allows TCP/IP data to be transmitted over an IB link. It is also used to address nodes on a network for other extensions such as RDMA or SDP. This guide will cover the installation and setup of an IB network between 2 computers using IPoIB. This basic configuration can act as a base for other extensions such as RDMA.
IB kernel modules are present in modern Ubuntu builds, so this guide will not cover building them into the kernel.
From now on we will refer to the 2 computers as the client and the server, and assume we are using Mellanox IB cards.
An Infiniband Subnet Manager (SM) is required on one or more devices in the network. It is advised that this be on the client as it only needs to sweep once at startup.
Install the subnet manager from the OFED package by running:
# apt install opensm
Optionally, the SM can also be installed on the server as a backup: should the client SM fail or crash, the SM on the server will take over and keep the network up.
It is recommended that the sweep period be changed to allow for the network to operate more stably.
To do this, create a config file. First stop opensm:
# service opensm stop
and then run:
# opensm --create-config /etc/opensm/opensm.conf
In a network with changing devices, you can speed up sweeping of the network for faster bringup of nodes. Set this in
#
# SWEEP OPTIONS
#
# The number of seconds between subnet sweeps (0 disables it)
sweep_interval 5
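The edit above can also be scripted. The following is a minimal sketch, not part of opensm itself: `set_sweep_interval` is a hypothetical helper name, and it assumes GNU `sed` (as shipped with Ubuntu) and a config file in which the setting appears as a line starting with `sweep_interval`, as in the snippet above.

```shell
#!/bin/sh
# set_sweep_interval FILE SECONDS   (hypothetical helper, not an opensm tool)
# Rewrites the "sweep_interval" line of an opensm config file in place,
# or appends one if the file has no such line.
set_sweep_interval() {
    conf=$1
    secs=$2
    # Refuse anything that is not a non-negative whole number.
    case $secs in
        ''|*[!0-9]*) echo "sweep interval must be a whole number" >&2; return 1 ;;
    esac
    if grep -q '^sweep_interval[[:space:]]' "$conf"; then
        # Replace the existing setting and leave the rest of the file intact.
        sed -i "s/^sweep_interval[[:space:]].*/sweep_interval $secs/" "$conf"
    else
        # No setting present: append one.
        printf 'sweep_interval %s\n' "$secs" >> "$conf"
    fi
}
```

For example, `set_sweep_interval /etc/opensm/opensm.conf 5` (run as root, with opensm stopped) would apply the value shown above to the file created in the previous step.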
You will need to start the SM by running:
# service opensm start
or by starting the daemon directly by running:
# opensm -B
Add the following to /etc/modules on both the server and client:
ib_mthca
mlx4_ib
ib_umad
ib_ipoib
Of the two hardware driver modules listed (ib_mthca and mlx4_ib), you only need one. The exact module depends on the card you are using.
Add the following to /etc/network/interfaces on both the client and server:
auto ib0
iface ib0 inet static
    address 192.168.10.100
    netmask 255.255.255.0
    broadcast 192.168.10.255
    post-up echo connected > /sys/class/net/ib0/mode
    post-up /sbin/ip link set dev $IFACE mtu 65520
    post-up /sbin/ip link set dev $IFACE txqueuelen 10000
Make sure the client and the server are given different IP addresses.
The last 3 lines change from Unreliable Datagram mode to Connected mode, set the MTU to the maximum, and set the transmit queue to 10000. Remove those lines to use UD mode which provides better latency but lower throughput.
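Because the stanza is identical on both machines apart from the address, it can be generated rather than typed twice. This is only a sketch: `ib_stanza` is a hypothetical helper, and it assumes the /24 network used in this guide, so the broadcast address is simply the first three octets followed by `.255`.

```shell
#!/bin/sh
# ib_stanza ADDRESS   (hypothetical helper)
# Prints an /etc/network/interfaces stanza for ib0 in connected mode.
# Assumes a /24 network, as in the example above.
ib_stanza() {
    addr=$1
    prefix=${addr%.*}      # first three octets, e.g. 192.168.10
    # \$IFACE is escaped so that it reaches the file literally;
    # ifupdown expands it when the interface is brought up.
    cat <<EOF
auto ib0
iface ib0 inet static
    address $addr
    netmask 255.255.255.0
    broadcast $prefix.255
    post-up echo connected > /sys/class/net/ib0/mode
    post-up /sbin/ip link set dev \$IFACE mtu 65520
    post-up /sbin/ip link set dev \$IFACE txqueuelen 10000
EOF
}
```

Running `ib_stanza 192.168.10.101` prints the stanza for a second host; review the output before appending it to /etc/network/interfaces.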
Reboot the client and server, then verify that the link is set up correctly by running the following on either computer:
$ cat /sys/class/infiniband/mthca0/ports/1/state
and verify that the output contains "ACTIVE" (for example "4: ACTIVE"). This confirms that the SM is running. If you get "INIT", then the link is up but the SM is not running.
Similarly verify that the link is in the correct mode by running:
$ cat /sys/class/net/ib0/mode
and ensuring the output is "connected" if you configured for connected mode.
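The two checks above can be combined into one small script. This is a sketch under assumptions: `port_is_active` and `check_link` are hypothetical helper names, and the sysfs paths are passed in as arguments because the device name (mthca0 here) depends on the card.

```shell
#!/bin/sh
# port_is_active STATE_FILE   (hypothetical helper)
# Succeeds when the sysfs port state file reports ACTIVE.
port_is_active() {
    grep -q 'ACTIVE' "$1"
}

# check_link STATE_FILE MODE_FILE   (hypothetical helper)
# Prints a one-line summary; returns non-zero if the port is not ACTIVE.
check_link() {
    state_file=$1
    mode_file=$2
    if ! port_is_active "$state_file"; then
        # Typically means the link is up but no SM has configured the port.
        echo "port not ACTIVE: $(cat "$state_file")"
        return 1
    fi
    echo "port ACTIVE, IPoIB mode: $(cat "$mode_file")"
}
```

With the paths used in this guide, the call would be `check_link /sys/class/infiniband/mthca0/ports/1/state /sys/class/net/ib0/mode`.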
At this point the IB link will work but will suffer from high overhead and poor performance due to the way the kernel handles TCP data by default. Some changes can be made to improve performance.
Mellanox recommends a set of parameters that can be applied by running the following on both the client and server:
# sysctl -w net.ipv4.tcp_timestamps=0
# sysctl -w net.ipv4.tcp_sack=1
# sysctl -w net.core.netdev_max_backlog=250000
# sysctl -w net.core.rmem_max=4194304
# sysctl -w net.core.wmem_max=4194304
# sysctl -w net.core.rmem_default=4194304
# sysctl -w net.core.wmem_default=4194304
# sysctl -w net.core.optmem_max=4194304
# sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"
# sysctl -w net.ipv4.tcp_wmem="4096 65536 4194304"
# sysctl -w net.ipv4.tcp_low_latency=1
# sysctl -w net.ipv4.tcp_adv_win_scale=1
These settings can be tweaked at runtime to maximize performance. Once satisfied, make them permanent by adding them to
net.ipv4.tcp_timestamps=0
net.ipv4.tcp_sack=1
net.core.netdev_max_backlog=250000
net.core.rmem_max=4194304
net.core.wmem_max=4194304
net.core.rmem_default=4194304
net.core.wmem_default=4194304
net.core.optmem_max=4194304
net.ipv4.tcp_rmem="4096 87380 4194304"
net.ipv4.tcp_wmem="4096 65536 4194304"
net.ipv4.tcp_low_latency=1
net.ipv4.tcp_adv_win_scale=1
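While experimenting, it can be convenient to keep a single list of `key=value` settings and derive the runtime commands from it, so the tested list and the permanent list never drift apart. The sketch below assumes such a list is fed on standard input; `sysctl_cmds` is a hypothetical helper that only prints commands and changes nothing itself.

```shell
#!/bin/sh
# sysctl_cmds   (hypothetical helper)
# Reads "key=value" lines on stdin and prints one "sysctl -w" command
# per setting. Blank lines and lines starting with '#' are skipped.
sysctl_cmds() {
    while IFS= read -r line; do
        case $line in
            ''|'#'*) continue ;;
        esac
        key=${line%%=*}     # text before the first '='
        val=${line#*=}      # text after the first '='
        # Drop surrounding double quotes, if any, then re-quote consistently.
        val=${val#\"}
        val=${val%\"}
        printf 'sysctl -w %s="%s"\n' "$key" "$val"
    done
}
```

Given a file of settings (say a hypothetical `ib-tuning.txt`), `sysctl_cmds < ib-tuning.txt` prints the commands for review, and piping that output to a root shell would apply them.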
In small networks with only slightly more than 2 devices, buying an IB switch can be hard to justify because of cost, size, or power consumption. In such networks, dual-port IB HCAs can be used to daisy-chain hosts together; that is, hosts communicate by passing traffic through one or more other nodes to reach their final destination. For this to work, several configuration changes need to be made on each node. In this example, let's say we have 3 nodes: A, B, and C, connected together through node B.
IPv4 Packet Forwarding
All non-edge nodes (nodes which will pass traffic for other nodes) will need to be able to perform packet routing (forwarding). In this example this is node B. To enable this, add/uncomment the following line in
The change can be made on the fly by running:
# sysctl -w net.ipv4.ip_forward=1
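To confirm that forwarding is actually on for node B, the kernel's current value can be read back from /proc. A minimal sketch follows; `forwarding_enabled` is a hypothetical helper, and the optional argument exists only so the check can be pointed at another file for testing.

```shell
#!/bin/sh
# forwarding_enabled [FILE]   (hypothetical helper)
# Succeeds when IPv4 forwarding is switched on.
# FILE defaults to the kernel's live setting under /proc.
forwarding_enabled() {
    f=${1:-/proc/sys/net/ipv4/ip_forward}
    [ "$(cat "$f")" = "1" ]
}
```

For example: `forwarding_enabled && echo "forwarding on" || echo "forwarding off"`.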
All nodes will need to know how to get packets to their final destination (which is not immediately known by those nodes). You will need to add static routes to all devices without direct access to a particular subnet. In this example, nodes A and C will need static routes to C and A respectively.
Routes can be added temporarily for testing with the command:
# ip route add <dest subnet>/<dest mask> via <gateway IP> dev <local interface>
In our example, let's say the A <-> B link is on the 192.168.10.0/24 subnet, the B <-> C link is on the 192.168.11.0/24 subnet, and B has the IP 192.168.10.100. In that case the requisite static route on A would be:
# ip route add 192.168.11.0/24 via 192.168.10.100 dev ib0
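When several nodes need routes, building the command from its three parts helps avoid typos. The following sketch only prints the command in the form given above; `static_route_cmd` is a hypothetical helper and does not touch the routing table.

```shell
#!/bin/sh
# static_route_cmd DEST_SUBNET GATEWAY DEV   (hypothetical helper)
# Prints the "ip route add" command for one static route.
static_route_cmd() {
    dest=$1
    gw=$2
    dev=$3
    # All three parts are required; refuse to print a partial command.
    if [ -z "$dest" ] || [ -z "$gw" ] || [ -z "$dev" ]; then
        echo "usage: static_route_cmd DEST_SUBNET GATEWAY DEV" >&2
        return 1
    fi
    printf 'ip route add %s via %s dev %s\n' "$dest" "$gw" "$dev"
}
```

`static_route_cmd 192.168.11.0/24 192.168.10.100 ib0` reproduces the route for node A. Node C needs the mirror image: destination 192.168.10.0/24 via B's address on the 192.168.11.0/24 link. That address is not given in this example, so substitute whatever B actually uses there. After adding a route, `ip route get <address>` can be used to check which gateway the kernel would choose.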
Once static routes are tested and working, they must be made permanent by adding them to
auto ib0
iface ib0 inet static
    address 192.168.10.101
    netmask 255.255.255.0
    broadcast 192.168.10.255
    post-up echo connected > /sys/class/net/ib0/mode
    post-up /sbin/ip link set dev $IFACE mtu 65520
    post-up /sbin/ip link set dev $IFACE txqueuelen 10000
    up route add -net 192.168.11.0 netmask 255.255.255.0 gw 192.168.10.100
    down route del -net 192.168.11.0 netmask 255.255.255.0 gw 192.168.10.100