From Mark Furneaux's Wiki

IPoIB, or Internet Protocol over InfiniBand, is an InfiniBand extension that allows TCP/IP traffic to be carried over an IB link. It is also used to address nodes on a network for other extensions such as RDMA or SDP. This guide covers the installation and setup of an IB network between two computers using IPoIB, which can then act as a base for those other extensions.


IB kernel modules are present in modern Ubuntu builds, so this guide will not cover building them into the kernel.

From now on we will refer to the 2 computers as the client and the server, and assume we are using Mellanox IB cards.

An InfiniBand Subnet Manager (SM) is required on at least one device in the network. It is advised that this be on the client as it only needs to sweep once at startup.
Install the subnet manager from the OFED package by running:
# apt install opensm
Optionally, this can also be installed on the server as a backup. Should the client SM fail or crash, the server SM will take over and keep the network up.

Changing the sweep period is recommended, as it allows the network to operate more stably.
To do this, first stop opensm:
# systemctl stop opensm
and then create a config file by running:
# opensm --create-config /etc/opensm/opensm.conf

In a network with changing devices, you can speed up sweeping of the network for faster bringup of nodes. Set this in opensm.conf:

# The number of seconds between subnet sweeps (0 disables it)
sweep_interval 5
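
The edit above can also be scripted. The following is a minimal sketch, assuming the generated config contains a line of the form "sweep_interval <seconds>" as shown; the helper name set_sweep_interval is made up for this example and is not part of opensm.

```shell
#!/bin/sh
# set_sweep_interval FILE SECONDS   (hypothetical helper, not an opensm tool)
# Rewrite the sweep_interval line in an opensm config file,
# or append one if the file does not contain it yet.
set_sweep_interval() {
    conf=$1
    secs=$2
    if grep -q '^sweep_interval[[:space:]]' "$conf"; then
        tmp=$(mktemp) || return 1
        # Write the edited copy to a temp file, then copy it back so the
        # original file keeps its owner and permissions.
        sed "s/^sweep_interval[[:space:]].*/sweep_interval $secs/" "$conf" > "$tmp" &&
            cat "$tmp" > "$conf"
        rm -f "$tmp"
    else
        printf 'sweep_interval %s\n' "$secs" >> "$conf"
    fi
}

# Example (as root, with opensm stopped):
#   set_sweep_interval /etc/opensm/opensm.conf 5
```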

You will need to start the SM by running:
# systemctl start opensm
or by starting the daemon directly by running:
# opensm -B


Add the following to /etc/modules on both the server and client:


You only need one of ib_mthca or mlx4_ib. The exact module depends on the card you are using: ib_mthca drives the older InfiniHost cards, while mlx4_ib drives ConnectX cards.
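
To see which of the two drivers is actually in use on a machine, something like the sketch below may help. It assumes the driver is built as a loadable module (so it appears in /proc/modules rather than being compiled into the kernel); the function names are made up for this example.

```shell
#!/bin/sh
# module_loaded NAME [MODULES_FILE]   (hypothetical helper)
# Succeeds if NAME is listed as a loaded module. MODULES_FILE defaults to
# /proc/modules, where the first field of each line is the module name.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# report_ib_driver [MODULES_FILE]   (hypothetical helper)
# Print which of the two HCA drivers named in this guide is loaded.
report_ib_driver() {
    file=${1:-/proc/modules}
    for mod in ib_mthca mlx4_ib; do
        if module_loaded "$mod" "$file"; then
            echo "$mod loaded"
        fi
    done
}

# Example:
#   report_ib_driver
```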

Add the following to /etc/network/interfaces on both the client and server:

auto ib0
iface ib0 inet static
	post-up echo connected > /sys/class/net/ib0/mode
	post-up /sbin/ip link set dev $IFACE mtu 65520
	post-up /sbin/ip link set dev $IFACE txqueuelen 10000

Make sure the client and server are given different IP addresses on the same subnet.
The last 3 lines switch the interface from Unreliable Datagram (UD) mode to Connected mode, set the MTU to the maximum, and set the transmit queue length to 10000. Remove those lines to use UD mode, which provides better latency but lower throughput.
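
Since the stanza is identical on both machines apart from the address, it can be generated. This is only a sketch: the function name ib_stanza is invented here, and the 192.0.2.x addresses are documentation-range placeholders, not the addresses used by this guide. The address and netmask keywords are the standard ones for a static stanza in /etc/network/interfaces.

```shell
#!/bin/sh
# ib_stanza IFACE ADDRESS NETMASK   (hypothetical helper)
# Print an /etc/network/interfaces stanza for an IPoIB interface in
# connected mode. \$IFACE is escaped so it reaches the file literally
# and is expanded later by ifupdown, not by this script.
ib_stanza() {
    cat <<EOF
auto $1
iface $1 inet static
    address $2
    netmask $3
    post-up echo connected > /sys/class/net/$1/mode
    post-up /sbin/ip link set dev \$IFACE mtu 65520
    post-up /sbin/ip link set dev \$IFACE txqueuelen 10000
EOF
}

# Example with placeholder addresses (assumptions, pick your own):
#   ib_stanza ib0 192.0.2.1 255.255.255.0    # on the client
#   ib_stanza ib0 192.0.2.2 255.255.255.0    # on the server
```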

Reboot the client and server, then verify the link is set up correctly by running on one computer:
$ cat /sys/class/infiniband/mthca0/ports/1/state
and verify that the output is "ACTIVE". This confirms the SM is running. If you get "INIT", then the link is up but the SM is not running. The device directory is named after the driver, so it may be something other than mthca0 (for example mlx4_0 with the mlx4_ib module).
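
Because the device directory name varies with the card, it can be convenient to list the state of every port instead of hard-coding a path. The sketch below assumes the state files contain text such as "4: ACTIVE" (a number, a colon, then the state name); the function name and the optional root-directory argument are inventions of this example.

```shell
#!/bin/sh
# ib_port_states [SYSFS_ROOT]   (hypothetical helper)
# Print "<device> port <n>: <STATE>" for every port found under
# SYSFS_ROOT, which defaults to /sys/class/infiniband.
ib_port_states() {
    root=${1:-/sys/class/infiniband}
    for f in "$root"/*/ports/*/state; do
        [ -r "$f" ] || continue          # glob did not match anything
        dev=${f#"$root"/}                # strip the root prefix
        dev=${dev%%/*}                   # keep the device directory name
        port=${f%/state}
        port=${port##*/}                 # keep the port number
        state=$(cat "$f")
        # Drop a leading "<number>: " if present, leaving only the name.
        printf '%s port %s: %s\n' "$dev" "$port" "${state#*: }"
    done
}

# Example:
#   ib_port_states
```

A port reported as INIT here points at the subnet manager rather than the cabling, as described above.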

Similarly verify that the link is in the correct mode by running:
$ cat /sys/class/net/ib0/mode
and ensuring the output is "connected" if you configured for connected mode.
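
The mode and MTU can be checked together. A rough sketch, assuming the interface exposes mode and mtu files under /sys/class/net/<interface>/; check_ipoib is a made-up name, and the third argument exists only so the function can be pointed at a different directory tree.

```shell
#!/bin/sh
# check_ipoib IFACE EXPECTED_MODE [SYSFS_NET_ROOT]   (hypothetical helper)
# Print the interface's IPoIB mode and MTU, and succeed only if the mode
# matches EXPECTED_MODE ("connected" or "datagram").
check_ipoib() {
    iface=$1
    want=$2
    root=${3:-/sys/class/net}
    mode=$(cat "$root/$iface/mode" 2>/dev/null) || {
        echo "$iface: cannot read mode (is it an IPoIB interface?)"
        return 2
    }
    mtu=$(cat "$root/$iface/mtu" 2>/dev/null)
    echo "$iface: mode=$mode mtu=$mtu"
    [ "$mode" = "$want" ]
}

# Example:
#   check_ipoib ib0 connected || echo "ib0 is not in connected mode"
```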

Kernel Tuning

At this point the IB link will work but will suffer from high overhead and poor performance due to the way the kernel handles TCP data by default. Some changes can be made to improve performance.

Mellanox recommends a set of parameters that can be applied by running the following on both the client and server:

# sysctl -w net.ipv4.tcp_timestamps=0
# sysctl -w net.ipv4.tcp_sack=1
# sysctl -w net.core.netdev_max_backlog=250000
# sysctl -w net.core.rmem_max=4194304
# sysctl -w net.core.wmem_max=4194304
# sysctl -w net.core.rmem_default=4194304
# sysctl -w net.core.wmem_default=4194304
# sysctl -w net.core.optmem_max=4194304
# sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"
# sysctl -w net.ipv4.tcp_wmem="4096 65536 4194304"
# sysctl -w net.ipv4.tcp_low_latency=1
# sysctl -w net.ipv4.tcp_adv_win_scale=1

These settings can be tweaked at runtime to maximize performance. Once satisfied, make them permanent by adding them to /etc/sysctl.conf, without the "sysctl -w" prefix and without the quotes, for example:

net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 65536 4194304
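
As an alternative to editing /etc/sysctl.conf by hand, the full set of values listed earlier can be written to a separate file. This is a sketch only: the function name and the drop-in file name 90-ipoib.conf are assumptions of this example; the values are the ones from the sysctl -w commands above.

```shell
#!/bin/sh
# write_ipoib_sysctl FILE   (hypothetical helper)
# Write the twelve tuning values from this section to FILE in
# sysctl.conf syntax (key = value, no quotes).
write_ipoib_sysctl() {
    cat > "$1" <<'EOF'
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 1
net.core.netdev_max_backlog = 250000
net.core.rmem_max = 4194304
net.core.wmem_max = 4194304
net.core.rmem_default = 4194304
net.core.wmem_default = 4194304
net.core.optmem_max = 4194304
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 65536 4194304
net.ipv4.tcp_low_latency = 1
net.ipv4.tcp_adv_win_scale = 1
EOF
}

# Example (as root); the file name is a hypothetical choice:
#   write_ipoib_sysctl /etc/sysctl.d/90-ipoib.conf
#   sysctl -p /etc/sysctl.d/90-ipoib.conf
```

Keeping the values in their own file makes it easy to remove or revise them later without touching unrelated settings.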

Daisy-Chaining Hosts

In networks with only slightly more than 2 devices, buying an IB switch can be hard to justify on cost, size, or power consumption grounds. Instead, dual-port IB HCAs can be used to daisy-chain hosts together. That is, hosts communicate by passing traffic through one or more other nodes to reach their final destination. For this to work, several configuration changes need to be made on each node. In this example, let's say we have 3 nodes, A, B, and C, where A and C are each connected to B.

IPv4 Packet Forwarding

All non-edge nodes (nodes which will pass traffic for other nodes) will need to be able to perform packet routing (forwarding). In this example this is node B. To enable this, add/uncomment the following line in /etc/sysctl.conf:

net.ipv4.ip_forward=1

The change can be made on the fly by running:
# sysctl -w net.ipv4.ip_forward=1
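
To confirm the setting took effect on node B, the current value can be read back from /proc. A small sketch; forwarding_enabled is a made-up name, and the optional argument only exists so the check can be aimed at another file.

```shell
#!/bin/sh
# forwarding_enabled [FILE]   (hypothetical helper)
# Succeeds if IPv4 forwarding is switched on. FILE defaults to
# /proc/sys/net/ipv4/ip_forward, which holds 1 (on) or 0 (off).
forwarding_enabled() {
    [ "$(cat "${1:-/proc/sys/net/ipv4/ip_forward}" 2>/dev/null)" = "1" ]
}

# Example:
#   forwarding_enabled && echo "forwarding on" || echo "forwarding off"
```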

Static Routes

Every node needs to know how to reach destinations that are not on a directly connected subnet. You will need to add static routes on all devices without direct access to a particular subnet. In this example, node A needs a static route to C's subnet and node C needs one to A's subnet, both via B.

Routes can be added temporarily for testing with the command:
# ip route add <dest subnet>/<dest mask> via <gateway IP> dev <local interface>

In our example, let's say the A <-> B link is on the subnet <A-B subnet>, the B <-> C link is on the subnet <B-C subnet>, and B has the IP <B's IP>. In that case the requisite static route on A would be:
# ip route add <B-C subnet>/<B-C mask> via <B's IP> dev ib0
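
To make the pattern concrete, here is a sketch that prints the route commands for both edge nodes without running them. All addresses are placeholders from the documentation ranges and are assumptions of this example, as is the function name; B is assumed to have one address on each link, and each edge node is assumed to reach B through ib0.

```shell
#!/bin/sh
# static_route_cmd DEST_CIDR GATEWAY DEV   (hypothetical helper)
# Print (not execute) the ip route command for one static route.
static_route_cmd() {
    printf 'ip route add %s via %s dev %s\n' "$1" "$2" "$3"
}

# Placeholder addressing (assumptions, substitute your own):
AB_NET=192.0.2.0/24        # subnet of the A <-> B link
BC_NET=198.51.100.0/24     # subnet of the B <-> C link
B_ON_AB=192.0.2.2          # B's address on the A <-> B link
B_ON_BC=198.51.100.1       # B's address on the B <-> C link

# On A: reach the B <-> C subnet through B's address on A's own link.
static_route_cmd "$BC_NET" "$B_ON_AB" ib0
# On C: reach the A <-> B subnet through B's address on C's own link.
static_route_cmd "$AB_NET" "$B_ON_BC" ib0
```

The gateway is always the address B holds on the link shared with the node where the route is added.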

Once static routes are tested and working, they must be made permanent by adding them to /etc/network/interfaces:

auto ib0
iface ib0 inet static
        post-up echo connected > /sys/class/net/ib0/mode
        post-up /sbin/ip link set dev $IFACE mtu 65520
        post-up /sbin/ip link set dev $IFACE txqueuelen 10000
        up route add -net <dest subnet> netmask <dest mask> gw <gateway IP>
        down route del -net <dest subnet> netmask <dest mask> gw <gateway IP>
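
Note that the ip route command takes the mask as a prefix length (for example /24), while the route lines above take a dotted netmask. A minimal sketch of the conversion, with a made-up function name:

```shell
#!/bin/sh
# prefix_to_netmask PREFIX   (hypothetical helper)
# Convert an IPv4 prefix length (0-32) to a dotted netmask.
prefix_to_netmask() {
    prefix=$1
    mask=""
    i=0
    while [ "$i" -lt 4 ]; do
        if [ "$prefix" -ge 8 ]; then
            octet=255                                # fully masked octet
            prefix=$((prefix - 8))
        else
            octet=$((256 - (1 << (8 - prefix))))     # partially masked or 0
            prefix=0
        fi
        mask="${mask:+$mask.}$octet"
        i=$((i + 1))
    done
    echo "$mask"
}

# Example:
#   prefix_to_netmask 24
```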

See Also