Data Infrastructures: ZFS survival guide


"584 files, 92,000 lines of change, 56 patents, 5 years... and there it is. Just like that." -- Jeff Bonwick


Abstract: ZFS is both a file system and a volume manager, and it makes manipulating large amounts of data easier, faster and safer. It provides transparent compression, clones, encryption, snapshots and other features critical for data engineering and data science. This survival guide is a crash course in ZFS essentials (and a little bit of history) and is part of a series of articles on data infrastructures: Data Infrastructures for the rest of us Part I and Part II - several other parts upcoming.

Brief History

ZFS, a combined file system and logical volume manager, was conceptualized at Sun Microsystems in the early 2000s, and by the end of 2001 a prototype was up and running. It was finally made available as part of the Solaris Express program at the end of 2005. 584 files and 92,000 lines of code, according to Jeff Bonwick. At the 2007 Usenix LISA (Large Installation System Administration) conference, Bonwick famously started his presentation with a slide that read:


# zpool create tank mirror c2d0 c3d0

That's it. You're done.


# df
Filesystem size used avail capacity Mounted on
tank       233G  18K  233G     1%   /tank

Thank you for coming.

Goodbye.

He created a mirrored disk pool using devices c2d0 and c3d0, which automatically mounted itself at /tank, all in one single, simple command. It might have been a funny introduction to his presentation, but it highlighted just how simple things were when volume management and file system management became one. Not only that, but ZFS broke through size limits by providing a 128-bit file system. And it provided an in-memory cache and a queued COW (Copy On Write) writer providing safe transactions. No need for fsck.

From 2008 onward, many new features would be added, including additional caches (a level two read cache and a write intent log), compression, encryption, free snapshots, cloning, deduplication, sending and receiving filesystems for backup and restore, and all forms of RAID configurations one might imagine (without any possibility of the write hole issue of hardware RAID). Even ZFS mirrored boot (root) drives are becoming more mainstream, thanks to Ubuntu 19.10.

The features and ease of use convinced me right away, starting with the Solaris Nevada release that first included ZFS. I would end up using it on database servers, then in 2007 on laptops for mirroring, on large scale systems, and, in 2012, even running it on the Raspberry Pi and integrating it with the cloud (see my slides from Cloudcamp 2013 for examples of sending / receiving snapshots).

I've also been using it for Data Engineering and Data Science tasks for several years now. I will touch on some of these aspects in future parts of "Data Infrastructures for the rest of us", but some like compression and encryption should be pretty obvious...

Jumping In

Ok, ZFS is pretty cool and it has been around for quite a while. So where should you look to get started?

If you like paper books, I mention Solaris 10 ZFS Essentials (2010) by Scott Watanabe in my "ex-libris" LinkedIn post. Although it is written for Solaris, the essentials are very similar between ZFS on Solaris and OpenZFS (the OpenZFS initiative now covers FreeBSD, Linux, Mac OS and Windows).

Next up is the Oracle Solaris ZFS Administration Guide. Whereas the other book covered the essentials, this one covers 11 chapters and over 300 pages of detail, and is available online.

A paper on flash memory and its role in file systems ("A file system all its own") by Adam Leventhal includes some really interesting references to other papers related to ZFS, like "Triple-Parity RAID and Beyond".

You can also see what operating systems / distributions include ZFS on the distributions wiki page.

Finally, you can access getting started guides for various Linux distributions at the ZFS on Linux website.

In the case of Ubuntu, the main reference is the Ubuntu ZFS Wiki. There, you will find that installing / enabling ZFS on that distribution is as simple as:


user@server ~ $ sudo apt install zfsutils-linux

Minimum Hardware



One of the goals of ZFS is to provide end to end data integrity, and to survive disk and memory bit flips. To achieve that, it is highly recommended to use ECC (Error-Correcting Code) memory. This is pretty much the standard in rack mount servers, but not in desktop or laptop computers. In the absence of ECC, ZFS will still provide data integrity, but will require restoring from backup in certain cases. The cost of ECC is inconsequential compared to wasted time, and it is also a requirement for in-memory dataframes, so just get ECC memory (about 94% of memory errors are correctable by ECC, even more with extended ECC and chipkill; see also "DRAM errors in the wild").

The original Solaris implementation of ZFS required a minimum of 768MB of RAM to run, and although I've run ZFS FUSE on a Raspberry Pi with 256MB of RAM, nowadays I wouldn't recommend running on systems with less than 16GB, realistically. This is especially true if you have a high performance network (10Gb Ethernet, 40Gb/s InfiniBand, etc.) and a lot of data to share.

ZFS uses as much memory as it can find to cache files and it also structures many small random reads and writes as sequential reads and writes (again, the more RAM the merrier). Bottom line, we've already addressed sizing memory in "Data Infrastructures for the rest of us", just make sure it is ECC (and low voltage if you care about power consumption).
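On ZFS on Linux, you can peek at how the ARC (Adaptive Replacement Cache) is using that RAM; a quick sketch (the stats path assumes the ZFS kernel module is loaded):

```shell
# Current and maximum ARC size, in bytes (the value is in the third column)
awk '$1 == "size" || $1 == "c_max" { print $1, $3 }' /proc/spl/kstat/zfs/arcstats

# arc_summary, shipped with zfsutils-linux, gives a friendlier report
arc_summary | head -30
```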


The second hardware consideration is HBAs (Host Bus Adapters) versus hardware RAID disk controllers. ZFS prefers to have full control of the storage, with as little abstraction as possible between it and the drives. This allows it to have a more intimate understanding of the pool makeup, of hard disk health, of cache states, etc.

If all you are using are your motherboard's SATA ports to connect the disk drives, that's OK. If you need more connections, HBAs can typically connect 4 hard disk drives per port, and they usually have two ports, for a total of eight drives using cables, or many more using drive bays with built-in expanders (although, see the note on SAS vs SATA). Bottom line, stick with recommended HBAs and stay away from hardware RAID.

Extra Cache

Note: advanced topic, you can always revisit later

ZFS also supports a write log (technically speaking, not quite a cache, since it is only needed for in-flight data when a system is brought down hard and needs to recover the intent data from the log; see also "Data Infrastructures for the rest of us part II" for some additional information on caches). You will see references to a ZIL (ZFS Intent Log) cache or SLOG (separate log) devices.

By default, ZFS writes the ZIL to the pool, on the main drives. The idea here is to buy a fast-write SSD (Solid State Drive, perhaps a SAS drive, or a PCIe or NVMe device) that can guarantee writes on power loss. With such an intent log, ZFS combined with regular spinning rust hard disks performs very well, even on small random synchronous writes (as required by databases or by NFS). If you are not planning to share files over a network using NFS or to run a database, this cache won't be of benefit, but if you are, read this presentation: ZIL Performance.
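As a sketch, this is how you would add a dedicated SLOG device to an existing pool named tank (the device names are assumptions; mirroring the log protects in-flight data if a single log SSD dies):

```shell
# Add a single fast, power-loss-protected SSD as a separate intent log
sudo zpool add tank log nvme0n1

# Or, safer, mirror the log across two such devices
sudo zpool add tank log mirror nvme0n1 nvme1n1
```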

ZFS can also cache reads, beyond the RAM capacity. I've talked about swap before, but this goes beyond that. The idea is to add a fast SSD to cache the slower spinning rust hard disks. See Brendan Gregg's blog post: "ZFS L2ARC".
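Adding an L2ARC works the same way as adding a log device; a sketch, assuming a spare SSD shows up as sdx:

```shell
# Add a fast SSD as a level 2 read cache (L2ARC) for pool tank
sudo zpool add tank cache sdx

# Cache devices appear under their own "cache" section in the status output
sudo zpool status tank
```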

Let's do it

Alright, so you are eager to start.

One Device

Byte, November 1981, p. 114
$18K, 100MB, 14" drive

Let's start with the scenario where you have a single drive. This can be either a SAS or SATA drive of the spinning rust or SSD variety, a PCIe flash device, or even an NVMe device, integrated or on a PCIe adapter.

For this example, we will use an NVMe drive on a PCIe card. It is identified by Linux as /dev/nvme0n1 on my machine.

Assuming I want to create a storage pool named "inbound" and accessible at /inbound, all I have to do (as root, or using sudo) is:


user@server ~ $ sudo zpool create inbound nvme0n1

There is no output on success; you only get output when there is an error. But we can still check on our disk pool:

user@server ~ $ sudo zpool status -v inbound
  pool: inbound
 state: ONLINE
  scan: none requested
config:

 NAME         STATE     READ WRITE CKSUM
 inbound      ONLINE       0     0     0
   nvme0n1    ONLINE       0     0     0

errors: No known data errors


This is the simplest pool, with one device and no redundancy. If the device fails, you lose everything. In practice, this is rarely used, except when combined with compression, as perhaps an inbound (temporary) folder. You still get protection from the checksums and multiple copies of the metadata.
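Even without redundancy, you can tell ZFS to keep extra copies of each data block on that single device; this helps against localized corruption (bad sectors), though not against a full drive failure. A sketch:

```shell
# Store two copies of every data block written from now on
sudo zfs set copies=2 inbound
```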

Two Devices

Let's increase the robustness of our pool. After all, disk drives do fail (pdf), even SSDs and flash (pdf). Let's destroy this pool and create a new one using two NVMe devices, one mirroring the other (it is also possible to add a mirror to an existing pool using the attach command):

user@server ~ $ sudo zpool destroy inbound
user@server ~ $ sudo zpool create inbound mirror nvme0n1 nvme1n1

And checking on the status:

user@server ~ $ sudo zpool status -v inbound
  pool: inbound
 state: ONLINE
  scan: none requested
config:

 NAME         STATE     READ WRITE CKSUM
 inbound      ONLINE       0     0     0
   mirror-0   ONLINE       0     0     0
     nvme0n1  ONLINE       0     0     0
     nvme1n1  ONLINE       0     0     0

errors: No known data errors

The advantage of a mirror is that if one device dies, the other keeps the pool available for reading / writing. The failed device can be replaced by another device that is online (or even a hot spare on standby, in which case this is done automatically), or the bad device can be replaced in situ. If the device can be hot swapped, then this can be done with zero downtime. NVME devices are typically not hot swap, but SSDs and hard disks of the SATA and SAS variety typically are (if they are mounted in a removable tray).
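When a mirror member does die, the replacement is one command; a sketch (device names are illustrative, assuming the new drive appeared as nvme2n1):

```shell
# Replace the failed device; ZFS resilvers the new one from its mirror
sudo zpool replace inbound nvme0n1 nvme2n1

# Watch the resilver progress on the scan: line
sudo zpool status inbound
```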

Multiple Devices, Maximize Space

Let's create a pool named mpool. We want some redundancy, but want to maximize the space. Instead of using a mirror, which uses 50% of our storage for redundancy, we will use what ZFS calls a RAIDZ2 pool. RAIDZ is similar to hardware RAID5 using one drive out of the group for parity (meaning you can lose 1 drive and still operate), while RAIDZ2 is similar to hardware RAID6 and allows the loss of 2 drives and still operate. Note that a RAIDZ pool doesn't have to have an even number of drives, unlike mirroring. As far as choosing how wide to make the stripe and what RAID level to use, I'd suggest reading Matt Ahrens's "How I learned to stop worrying and love RAIDZ".

For the purpose of this example, we will create a RAIDZ2 pool out of 5 files, instead of 5 hard disks. First, let's create these "disks". This is just an example, so we will create 5 files of 1GB in size:

root@server ~ # for i in {0..4}; do fallocate -l 1G disk$i; done

And now we create the pool from the "devices" (either using sudo, or as root):

root@server ~ # zpool create mpool raidz2 /root/disk0 /root/disk1 \
                /root/disk2 /root/disk3 /root/disk4
root@server ~ # zpool status mpool
  pool: mpool
 state: ONLINE
  scan: none requested
config:

 NAME             STATE     READ WRITE CKSUM
 mpool            ONLINE       0     0     0
   raidz2-0       ONLINE       0     0     0
     /root/disk0  ONLINE       0     0     0
     /root/disk1  ONLINE       0     0     0
     /root/disk2  ONLINE       0     0     0
     /root/disk3  ONLINE       0     0     0
     /root/disk4  ONLINE       0     0     0

errors: No known data errors
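Since this pool is backed by files, it is a safe sandbox for experimenting; for instance, you can take one of the "disks" offline and watch the pool keep running in a DEGRADED state:

```shell
# Simulate a failure: the pool stays usable, since RAIDZ2 tolerates two losses
zpool offline mpool /root/disk0
zpool status mpool

# Bring it back; ZFS resilvers any changes made while it was offline
zpool online mpool /root/disk0
```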


Multiple Devices, Maximize Performance

Let's recreate mpool, this time for performance. Assuming we have 10 hard disk drives, as /dev/sdc, /dev/sdd, etc., all the way to /dev/sdl, and we want redundancy while maximizing performance, after destroying the pool we created in the previous section, we would mirror each pair of drives, then stripe across them, like this:

user@server ~ $ sudo zpool destroy mpool
user@server ~ $ sudo zpool create mpool mirror sdc sdd \
                mirror sde sdf mirror sdg sdh \
                mirror sdi sdj mirror sdk sdl

And checking on the status:

user@server ~ $ sudo zpool status -v mpool
    pool: mpool
 state: ONLINE
  scan: none requested
config:

 NAME        STATE     READ WRITE CKSUM
 mpool       ONLINE       0     0     0
   mirror-0  ONLINE       0     0     0
     sdc     ONLINE       0     0     0
     sdd     ONLINE       0     0     0
   mirror-1  ONLINE       0     0     0
     sde     ONLINE       0     0     0
     sdf     ONLINE       0     0     0
   mirror-2  ONLINE       0     0     0
     sdg     ONLINE       0     0     0
     sdh     ONLINE       0     0     0
   mirror-3  ONLINE       0     0     0
     sdi     ONLINE       0     0     0
     sdj     ONLINE       0     0     0
   mirror-4  ONLINE       0     0     0
     sdk     ONLINE       0     0     0
     sdl     ONLINE       0     0     0

errors: No known data errors


We will now focus the rest of this article on a few of the many features built into ZFS.

Killer Feature #1: Compression

18th century book press

No doubt you've dealt with CSV files. Some I've worked with can have tens of millions of rows. And sometimes a data set can include hundreds or thousands of these files. Many people use ZIP, gzip, or some other compression software, to save space, and time when moving the files around. The downside is that you have to uncompress the file before you can use it. Or do you?

ZFS has many properties that can be set. To get a list of the properties and their values, type:


user@server ~ $  zfs get all mpool
NAME   PROPERTY              VALUE                  SOURCE
mpool  type                  filesystem             -
mpool  creation              Thu Nov 28 22:46 2019  -
mpool  used                  239G                   -
mpool  available             4.59T                  -
mpool  referenced            25K                    -
mpool  compressratio         1.00x                  -
mpool  mounted               yes                    -
mpool  quota                 none                   default
mpool  reservation           none                   default
mpool  recordsize            128K                   default
mpool  mountpoint            /mpool                 default
mpool  sharenfs              off                    default
mpool  checksum              on                     default
mpool  compression           off                    default
.
.
.
mpool  overlay               off                    default


Compression is off by default on Linux. You can do your own tests, but I've never been in a situation where keeping it disabled helped in any way. To enable compression using the default algorithm (lz4):


user@server ~ $  sudo zfs set compression=on mpool

The content already on the pool will not be compressed, but anything new will be. Copying the files to another location and back will compress them. To avoid all of that, set compression the moment you create the pool.
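For example, for a new single-device pool, compression can be enabled at creation time with -O, which sets the property on the root dataset before any data is written:

```shell
sudo zpool create -O compression=lz4 inbound nvme0n1
```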


fdion@datus:~$ zfs get all mpool
NAME   PROPERTY              VALUE                  SOURCE
mpool  type                  filesystem             -
.
.
.
mpool  compressratio         3.10x                  -
mpool  mounted               yes                    -
mpool  quota                 none                   default
mpool  reservation           none                   default
mpool  recordsize            128K                   default
mpool  mountpoint            /mpool                 default
mpool  sharenfs              off                    default
mpool  checksum              on                     default
mpool  compression           on                     local
.
.
.

Transparent compression of 3 to 4x, depending on the data, is really nice to have. Plus you can send ZFS filesystems in compressed form to other systems for backups. On top of that, both the in-memory ARC and disk based level 2 ARC use compression, so data stays compressed everywhere until your application reads it in, tremendously improving cache hits for the same amount of RAM.

Killer Feature #2: Encryption

Gasparis Schotti, Schola Steganographica, 1665
Figura I, Rotularum (p. 95)

Most people would agree that encrypting sensitive data is a good thing. What would be the consequences of somebody stealing any of your servers, or some of the drives from them? Why not encrypt the whole device, just to be safe? Well...

Some people have relied on the built-in encryption support of disk drives, but time and again these have proven to be poorly implemented, leaving the data exposed. Oh, and BitLocker relied on these flawed implementations, so yeah, unsafe.

The solution? Yep, you guessed it. ZFS encryption. In the early days, I relied on ZFS on Solaris for encryption, but it is now available on all OpenZFS systems, first through a contribution to OpenZFS by Datto. Ubuntu 19.10 has direct support for encryption, as it ships ZFS 0.8.1. So ideally, you would set up a hierarchy of ZFS pools or datasets. No / low risk data (aka public data), use compression only. Next level, use compression and encryption. For the most secure storage possible, use encryption only (no compression, to prevent a very unlikely, yet potential, leakage).

Once you have ZFS 0.8 or above, you can use encryption, as follows (using -O to set options):


user@server ~ $ sudo zpool destroy mpool
user@server ~ $ sudo zpool create mpool mirror sdc sdd mirror sde sdf \
                mirror sdg sdh mirror sdi sdj mirror sdk sdl \
                -O encryption=aes-256-gcm -O keylocation=prompt \
                -O keyformat=passphrase
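That tiered approach can also be sketched with datasets inside a single pool, since each dataset carries its own properties (the dataset names here are illustrative):

```shell
# Public data: compression only
sudo zfs create -o compression=lz4 mpool/public

# Sensitive data: compression and encryption
sudo zfs create -o compression=lz4 -o encryption=aes-256-gcm \
     -o keyformat=passphrase -o keylocation=prompt mpool/sensitive

# Most secure: encryption only, no compression
sudo zfs create -o compression=off -o encryption=aes-256-gcm \
     -o keyformat=passphrase -o keylocation=prompt mpool/secret
```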


Another nice feature that no hard disk hardware encryption can ever do: sending ZFS filesystems encrypted to a backup system, without having to send unencrypted data to it (although I did provide an alternative using openssl, that adds conversions and risk of exposure of the key). Also, ZFS can still maintain integrity even without loaded keys. For more interesting details on ZFS encryption, see Tom Caputi's "Encryption at rest".
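With ZFS 0.8+, this is done with a raw send (zfs send -w), which transmits blocks exactly as stored on disk, so an encrypted dataset arrives still encrypted and the backup host never needs the key; a sketch (host and dataset names are assumptions):

```shell
sudo zfs snapshot mpool/sensitive@backup1
sudo zfs send -w mpool/sensitive@backup1 | \
    ssh backuphost sudo zfs receive backuppool/sensitive
```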

Killer Feature #3: Snapshots

Fujica AZ-1 35mm camera w/ Schneider-Kreuznach lens
"Like a Polaroid picture..."

As a SQL user, you are familiar with transactions. You know how valuable it is to start a transaction, do a lot of stuff, then commit all the steps, or roll back if there are issues. Snapshots provide that for your file system.

You run a shell script with a bunch of steps. Ideally, you'd like it to process everything, or, in case of trouble, to roll back. How do ZFS snapshots help?

Let's say you have a pool named mpool, on which you created a data filesystem (mpool/data). The first line of your script would create a snapshot (we will name it prescript1):



zfs snapshot mpool/data@prescript1

Then, since we got to the last line without error, the last line would be:


zfs destroy mpool/data@prescript1

How about error handling? You do the usual shell script error handling and set up a function that will be called on error and roll back to the original file system state. For example, you could set this up at the top of a bash script:


rollback() {
    echo 'Error raised. Rolling back.'
    zfs rollback mpool/data@prescript1
}

trap 'rollback' ERR

.
.
.

Of course you can also do this directly from a xonsh script (combining bash and python directly) or a python script using a module such as pyzfs. You could even do all of this from a Jupyter Notebook!
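Putting the pieces together, a minimal bash sketch of the whole snapshot / trap / rollback pattern (same pool and snapshot names as above):

```shell
#!/usr/bin/env bash
set -e  # abort on the first failing command, which fires the ERR trap

rollback() {
    echo 'Error raised. Rolling back.'
    zfs rollback mpool/data@prescript1
}
trap 'rollback' ERR

zfs snapshot mpool/data@prescript1

# ... the actual work against /mpool/data goes here ...

# Everything succeeded: drop the safety snapshot
zfs destroy mpool/data@prescript1
```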

One last point on snapshots, you can easily set up automated snapshots using zfs-auto-snapshot with:


user@server ~ $ wget https://github.com/zfsonlinux/zfs-auto-snapshot/archive/upstream/1.2.4.tar.gz
user@server ~ $ tar -xzf 1.2.4.tar.gz
user@server ~ $ cd zfs-auto-snapshot-upstream-1.2.4
user@server ~/zfs-auto-snapshot-upstream-1.2.4 $ sudo make install

This will create snapshots at the following intervals and retentions:
  • every 15 minutes, keeping 4
  • hourly, keeping 24
  • daily, keeping 31
  • weekly, keeping 8
  • monthly, keeping 12

Conclusion


We've barely scratched the surface of ZFS and its functionality. Network sharing, delegation, quotas, reservations, clones (see storage backends for LXD), data scrubbing, deduplication, send/receive, endian independence, disk format upgrades, hierarchical inheritance, Unicode normalization, etc. The list goes on.

So start with what you've learned, get to know the features of ZFS and improve your day to day operations as a data scientist, data engineer, data modeler, devops, dba, dataops, data visualizer, architect or technologist.

And of course feel free to engage Dion Research in your data infrastructure or data science project, and leverage years of experience doing this.

Francois Dion
Chief Data Scientist
@f_dion
