ZFS


ZFS is a combined filesystem and logical volume manager for UNIX and Linux systems.

This walkthrough assumes that we are building a RAIDZ1 pool named "tank" with 4x 1TB drives, a single hot spare, a single SSD for the L2ARC, and a single SSD partition for the ZIL.
We will then add a dataset called "storage".

Setup

All drives in the pool should be of the same type and size. The system should have at least 1GB of free RAM and must be running a 64-bit OS. Make a list of all drives, including their model, serial number, location in the chassis, and the motherboard/HBA port they are plugged into; this will make replacing dead disks much safer. Any new CPU will be fine for most applications; even Intel Atoms now have enough power for a dedicated file server, even with compression enabled.
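One quick way to build that list is with lsblk, which can print the model and serial number of every drive so they can be matched against the physical labels:
$ lsblk -d -o NAME,MODEL,SERIAL,SIZE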

Installation

For Ubuntu 16.04 and later, simply install the ZFS utilities by running:
# apt install zfsutils-linux

I highly recommend compiling ZFS from source if possible.
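Either way, you can confirm that the kernel module is available before continuing:
$ modinfo zfs | grep -i version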

Setting Up A Pool

We are going to assume that you have recorded the drive models and serial numbers before they were installed in the server.

Partition the SSD for the ZIL. The ZIL will never need to be more than 4GB in size. You can use the remaining space for a root filesystem or an L2ARC.
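As a rough sketch (the sizes are examples, and the device name is the placeholder SSD used later in this walkthrough), the SSD could be split with parted into a root partition and a 4GiB ZIL partition, which becomes the -part2 device added as the log further down:
# parted -s /dev/disk/by-id/ata-60-1234-AAAA mklabel gpt
# parted -s /dev/disk/by-id/ata-60-1234-AAAA mkpart primary 1MiB 50GiB
# parted -s /dev/disk/by-id/ata-60-1234-AAAA mkpart primary 50GiB 54GiB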

Find the disk IDs for both the SSD partition and the HDDs by running:
$ ls -l /dev/disk/by-id/
The IDs take the form of drive model followed by drive serial number.
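The listing (trimmed here; these IDs are the placeholder ones used throughout this walkthrough) maps each stable ID to its current kernel device name:
ata-1000-1234-AAAA -> ../../sda
ata-1000-1234-BBBB -> ../../sdb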

Create the pool by running:
# zpool create -f -o ashift=12 -o autoreplace=on -o autoexpand=on tank raidz1 \
/dev/disk/by-id/ata-1000-1234-AAAA \
/dev/disk/by-id/ata-1000-1234-BBBB \
/dev/disk/by-id/ata-1000-1234-CCCC \
/dev/disk/by-id/ata-1000-1234-DDDD

-f forces the creation of the pool. This may or may not be necessary, depending on whether the disks already contain partition tables or old labels.
ashift=12 forces 4KiB sectors for new AF disks. Use ashift=9 for older 512B drives.
autoreplace=on allows ZFS to automatically switch to an available hot spare if it detects hardware errors on an online disk.
autoexpand=on allows the pool to grow when all VDEVs have been replaced with larger ones. This must be set before any drives are replaced, so it is best to set it now.
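After creation, it is a good idea to verify the layout of the new pool before putting any data on it:
# zpool status tank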

Setting Up Automatic Scrub

It is best practice to scrub consumer-grade SATA disks on a weekly or biweekly basis.

Weekly

A weekly scrub can be scheduled via cron by adding the following to /etc/crontab:
0 2 * * 3 root zpool scrub tank
This scrubs the pool "tank" every Wednesday at 2:00am.

Biweekly

Unfortunately, cron does not have a method for biweekly execution. One elegant solution is to create the following script at /usr/local/bin/scrubzfs:

#!/bin/bash

# Alternate weeks using a marker file: if the marker exists, this is an
# "off" week, so remove it and exit without scrubbing; otherwise create
# the marker and run the scrub.

cd /root

if [ -e ran_zfs_scrub_last_week ]; then
        rm -f ran_zfs_scrub_last_week
        exit 0
else
        touch ran_zfs_scrub_last_week
fi

zpool scrub tank

exit 0

and be sure to make it executable.
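For example:
# chmod +x /usr/local/bin/scrubzfs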

Then add the following to /etc/crontab:
0 2 * * 3 root /usr/local/bin/scrubzfs
This scrubs the pool "tank" every other Wednesday at 2:00am.

Adding Spares

Spare disks can be added simply by running:
# zpool add -f tank spare /dev/disk/by-id/ata-1000-1234-EEEE

Adding The ZIL

The ZIL partition created earlier can be added by running:
# zpool add -f tank log /dev/disk/by-id/ata-60-1234-AAAA-part2

Adding The L2ARC

The L2ARC drive can be added by running:
# zpool add -f tank cache /dev/disk/by-id/ata-60-5678-AAAA
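Once the spare, log, and cache devices have been added, the full pool layout and per-device activity can be reviewed with:
# zpool iostat -v tank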

Creating Datasets

Datasets can be created dynamically by running:
# zfs create tank/storage
# zfs set compression=lz4 tank/storage
# zfs set xattr=sa tank/storage
compression=lz4 turns on dynamic data compression.
xattr=sa allows small system attributes to be stored in inodes rather than in hidden directories. This can speed up operations that touch extended attributes by up to 3x.
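Both properties can be verified afterwards with:
# zfs get compression,xattr tank/storage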

Since the dataset will be owned by root, ownership must be changed for users to have write access. To do that run:
# chown -R mark:root /tank/storage

Tuning

Once ZFS is installed and working, it is beneficial to tune some parameters to improve scrub, resilver, and general performance. Changes can be made temporarily by echoing values into /sys/module/zfs/parameters/ (an example follows the list below) and made permanent by adding lines to /etc/modprobe.d/zfs.conf in the form shown below.

options zfs zfs_arc_max=21474836480
options zfs zfs_top_maxinflight=64
options zfs zfs_scrub_delay=0
options zfs zfs_scan_idle=0
options zfs zfs_resilver_delay=0
options zfs zfs_vdev_scrub_max_active=10
options zfs zfs_scan_min_time_ms=5000
options zfs zfs_resilver_min_time_ms=5000
options zfs zfs_vdev_sync_read_max_active=100
options zfs zfs_vdev_sync_write_max_active=100
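For example, the ARC limit above (20GiB) can be applied immediately, without reloading the module, by writing the value to the corresponding parameter file:
# echo 21474836480 > /sys/module/zfs/parameters/zfs_arc_max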

ARM Installation

The kmod packages cannot be built on ARM, so you must install with a regular make install. After installing, the binaries will not run because the libraries land in /usr/local/lib, which is not on the default library search path.
To fix this, add the following to /etc/ld.so.conf:

/usr/local/lib

And then run:
# ldconfig
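To confirm that the libraries now resolve, query the dynamic linker cache:
$ ldconfig -p | grep libzfs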

Custom Packages

If installing a custom kmod package, several steps must be performed after installation to ensure a bootable system.

Add the ZFS and SPL modules to the module tree:
# depmod -a <kernel version>

Generate a new initramfs:
# update-initramfs -u -k <kernel version>

Update the grub menu list:
# update-grub
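To double-check that the ZFS module actually made it into the new initramfs, it can be listed with lsinitramfs (substitute your kernel version as above):
# lsinitramfs /boot/initrd.img-<kernel version> | grep zfs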

Notes

Many storage engineers and sysadmins claim that RAID5, and by extension RAIDZ1, is "broken". This comes from the fact that, given the high capacity of modern drives and their intrinsic error rates, it is very unlikely that a pool can be rebuilt after a failure without suffering bit damage. With RAID5 this is for the most part true; ZFS, however, adds another level of checksumming and healing. Even so, during a rebuild with no remaining redundancy this will in most cases only detect that corruption has taken place, not recover from it. It is advised that large pools use RAIDZ2 at a minimum. The exact definition of "large" is up to the creator of the pool and depends on the type of data stored on it.
