What I learned from setting up ZFS on my fileserver

ian – Fri, 2007-07-06 04:30

I've been using Linux with a software RAID5 array for my fileserver for a few years now. It's pretty good. I've had one drive failure in about five years. But, of course, I filled it up (again). I decided to switch to ZFS for the most recent array for a few reasons:

  • I believed that I could add drives to the RAID array whenever I wanted to
  • I liked the idea of end-to-end checksums
  • It's supposed to be more robust against power failures than RAID5

In particular, I liked the idea of having two parity drives (RAID6 or raidz2).

This is what I learned.

Linux is really quite a mature product

At least, it's mature compared with OpenSolaris. The idea was to install an OpenSolaris distribution - with its native and reliable ZFS support - onto a USB drive. That way the machine would have a very reliable boot drive and I'd remove a spindle from the system. Previously, I was using a cron job every night to back up the boot drive onto the RAID array.

I also had the idea of installing everything onto the USB drive through VMware. This was partly a practical decision - I don't have a lot of spare computers hanging around - and partly because I liked the idea of upgrading the system on the USB drive from my comfy laptop. I didn't want to have to pull the server out of the closet and find a monitor and power and keyboard and blah blah blah.

So. Install OpenSolaris onto a USB drive from VMware. For a Linux person like me this sounds entirely reasonable.

I identified two candidates: Nexenta and Belenix. Nexenta is roughly 'Ubuntu with the OpenSolaris kernel and tools' so I thought 'great, I love Ubuntu'.

OpenSolaris doesn't like VMware's USB

Buggered if I know why. But OpenSolaris wouldn't touch USB devices attached through VMware. It knew they were plugged in, it just wouldn't talk to them. Oh well, better dig up a computer.

Nexenta doesn't like USB drives

I bought two 2 gigabyte USB drives. "Two gigabytes ought to be enough for anyone, especially if they're OpenSolaris", I thought. Nexenta won't touch them. First it says that 2G isn't enough and flatly refuses to even try. Please, just try! I promise there's enough space! So I drop another $90 on a 4G drive and find that the partitioner won't touch the drive full-stop because there's a device file missing or something. And on top of that, the installer crashes if it doesn't see any partitions. So that was an expensive timesink.

If you want to try this, make sure you connect the USB drive before the installer starts. If you don't, the installer assumes you have no hard drives and says "ha ha, you have no hard drives! you're not cool enough to run Nexenta!" and jams an RJ45 cable in your ear. Honest.

Belenix doesn't like anyone

Belenix has big promises about being USB-drive compatible on its website. It also promises to have you up and running in two minutes.

The first few machines just got kernel panics during bootup. I eventually figured out that the boot process was blindly probing for serial ports and ATA interfaces, and because I'm so damn cool, my PCIe-based laptop has no serial ports or old-style ATA interfaces. I found an older laptop and hurrah! we're booting.

After about half an hour, anyway. It wasn't a slow laptop or anything. I'm sure the website promised two minutes. I built a LiveUSB thingo, tried booting from it, and got kernel crash messages ("Nested trap, calling reset()") scrolling off the screen. On every machine I tried.

Excuse me if I feel a little frustrated.

FreeBSD is really, really old

My first Unix-like operating system was FreeBSD. Back in 1995 - at the ripe old age of 13 - a friend of mine decided that we should start an ISP. I said "yeah, I've heard of this thing called FreeBSD that runs ISPs" - I'd actually seen it on the cdrom.com FTP welcome message - so I downloaded a bunch of 136kb floppy images through my 14.4k modem.

Good times.

Anyway, the installer hasn't visibly changed since then. Since 1995.

FreeBSD sees the USB drives OK, but the installation takes hours. It also doesn't boot afterwards - something wrong with the bootloader. 'Wizard mode' actually means "I am a wizard", not "I am stupid and want a wizard to help me", unlike every other OS on the planet. The 'emergency holographic shell' has no binaries, so I can't actually do anything.

There's also some kerfuffle about ISO images and me using a 7.0 snapshot, but that's boring. I found a HOWTO on installing FreeBSD to a USB drive, but I never got past the bootloader.

Oh yeah, and FreeBSD under VMware has kernel crashes and other fun stuff if you boot up with the USB drive already plugged in. This is the point where I really start to appreciate Linux's maturity - at some point, some idiot has tried booting up their machine with USB stuff attached and thought, "oh, shit. That crashed. I should fix that!"

ZFS on FUSE isn't as bad as I thought. In fact, it's pretty good.

I really didn't want to run ZFS on Linux. Buggy, untested implementation and performance issues. Plus, I wanted the excuse to play with something new; hang with the cool BSD crowd and the homies and pick up fly bitches.

But then I tried it and it was really easy. This page lists Debian packages that install ZFS-FUSE, and it was pretty much that easy. Add the apt repository to /etc/apt/sources.list, install the zfs-fuse package, start using zfs-fuse! Very, very slick.

I make my pool with:

zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc

and bam! I have some storage.

You can't expand RAID arrays using ZFS. Really. You can't.

There is some major confusion floating around with RAID array expansion under ZFS. You can't do it. Sorry. Sun, please add this feature. It's not 'the last word in filesystems' unless we can grow our RAID arrays.

Sun differentiates between a pool and a vdev. A pool is what we all mentally think of as a RAID array. It's not. A vdev can be a RAID array. A pool is made up of one or more vdevs. You can expand the pool by adding another vdev to it, but you can't add a disk to an existing vdev. Hence, no RAID array expansion.
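To put rough numbers on why this stings, here's a back-of-the-envelope sketch in shell. It assumes equal-sized disks and ignores metadata overhead, and the function name raidz_data_disks is made up for illustration (it is not a ZFS tool). Each raidz2 vdev gives up two disks' worth of space to parity, so growing the pool by adding a second vdev pays that cost again, whereas expanding the original vdev in place (which isn't possible) would not.

```shell
#!/bin/sh
# raidz_data_disks PARITY DISKS_PER_VDEV...
# Prints how many disks' worth of usable space a pool has, given the
# parity level and the number of disks in each of its vdevs.
# Rough model only: equal-sized disks, no metadata overhead.
raidz_data_disks() {
    parity=$1
    shift
    total=0
    for disks in "$@"; do
        total=$((total + disks - parity))
    done
    echo "$total"
}

# The pool created above: one 3-disk raidz2 vdev.
raidz_data_disks 2 3      # prints 1

# What ZFS allows: a second 3-disk raidz2 vdev added to the same pool.
raidz_data_disks 2 3 3    # prints 2

# What I wanted: the same six disks as a single raidz2 vdev.
raidz_data_disks 2 6      # prints 4
```

Six disks, and the difference between two and four disks of usable space. That's the cost of not being able to grow the array itself.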

I was rather pissed off when I figured this out (can you tell?) The best explanation I could find for why this rather critical feature was missing is that you would have to take the array offline to do it. Well, shit, you'll also have to take the machine offline to plug in the disk! It ruins their design and so we don't get the feature.

Well, Sun, this is the same thing that happened a dozen times with Java. Then C# came out with nice features that actually made it easy for people to do their jobs and lo and behold, everyone switched to C#. Then you had to play catch-up.

</rant>

You can't delete vdevs from a pool

This is 'the last word in filesystems', but we can't expand RAID arrays or delete vdevs. It sounds rather incomplete to me. If I have to rearchitect my pool to use a whole bunch of extra drives, then the least you could do is make it easy for me to remove the old noisy, hot, power-sucking drives when I replace them. But no. There has to be something in that slot.

ZFS-FUSE performance is pretty mediocre

... but then, who's surprised? It's a filesystem. In userspace. I'm getting about 4 megabytes/second, which is enough for what I'm doing (archival). But don't deploy it in your company, because you'll be redeploying it on something faster within the month.

Linux actually supports SATA hotswap!

As part of my testing I disconnected drives while the machine was running. I'm happy to report that ZFS coped, Linux coped, and in fact, when I plugged a drive back in, it was re-detected automatically. ZFS kept using it once I ran

zpool online tank sdc

Great! No reboot. There was no checksum verification, which made me a little nervous, but I figured out that I could force it with

zpool scrub tank

ZFS's end-to-end checksumming is cool

I explored this "no verification on device replacement" idea. I disabled the same disk with

zpool offline tank sda

and overwrote a bunch of it with random data:

dd if=/dev/urandom of=/dev/sda bs=1024 count=11000

Obviously, don't do this if you have valuable data and no backups - I still had my original RAID5 array available. I onlined the disk and a few checksum errors were detected and corrected quickly, but still no scrub (a scrub is a guy that just can't get no...). I manually fired up the scrub and a lot more errors were detected. So, a big reminder: if you replace a disk and it doesn't churn for eight hours, fire off a scrub to be absolutely sure your redundant storage is actually redundant over the whole array.

ZFS might scare you if you rearrange the devices. But you're probably OK

I removed the old RAID5 drives, started the machine, and was greeted with 'pool is broken, ha ha, you should restore from backups'.

Maybe I accidentally shuffled the drive order - I'm not sure. But

zpool export tank

and

zpool import -f tank

seemed to fix the problem. No idea why. I ran another scrub to make sure everything was OK (another eight hours...) No issues.

Still to do

Set up a cron job to run daily snapshots of home directories and weekly snapshots on the public archive.
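A minimal sketch of what that cron job could run. The dataset names tank/home and tank/public are assumptions (only the pool name tank appears above), and snapshot_cmd is a hypothetical helper that only prints the zfs snapshot command, so it can be checked before anything touches the pool.

```shell
#!/bin/sh
# snapshot_cmd DATASET [STAMP]
# Prints the zfs command that takes a dated snapshot of DATASET.
# STAMP defaults to today's date (YYYY-MM-DD).
snapshot_cmd() {
    dataset=$1
    stamp=${2:-$(date +%Y-%m-%d)}
    echo "zfs snapshot ${dataset}@${stamp}"
}

# Dry run: show what would be executed.
snapshot_cmd tank/home 2007-07-06    # prints: zfs snapshot tank/home@2007-07-06

# To actually take the snapshot, pipe the printed command into sh:
#   snapshot_cmd tank/home | sh
#
# Example crontab entries, daily and weekly. The script path is
# hypothetical; it would wrap the function above and take the dataset
# name as its argument:
#   0 3 * * *  /usr/local/bin/zfs-snap tank/home
#   0 4 * * 0  /usr/local/bin/zfs-snap tank/public
```

Putting the date in the snapshot name keeps each day's snapshot distinct. A second run on the same day would fail because the name already exists, which is probably what you want.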

Have the machine email me when something goes wrong (such as a drive failure). Right now I have

cat /proc/mdstat

in my .bashrc, so I notice pretty quickly if a drive goes offline.

zpool status

gives roughly the same information, but you need to be logged in as root to run it. Which I don't do often, of course. So I need to figure out how to make it viewable by mortals such as myself.
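One way around both problems might be a script run from root's crontab that only speaks up when something is wrong. This is a sketch, not something I've deployed: it assumes zfs-fuse's zpool status -x prints the same "all pools are healthy" line that Solaris does when nothing is wrong, and that a working mail command is installed. pool_is_healthy is a made-up helper name.

```shell
#!/bin/sh
# pool_is_healthy: reads `zpool status -x` output on stdin and succeeds
# only if it contains the all-clear message.
pool_is_healthy() {
    grep -q 'all pools are healthy'
}

# Intended use at the bottom of the script (commented out here so the
# file can be sourced safely on a machine without ZFS). If the all-clear
# line is missing, mail the full status report to root:
#   zpool status -x | pool_is_healthy || zpool status | mail -s 'ZFS problem' root
```

Since cron runs the script as root, the "need to be root" problem goes away too, and the mail lands wherever root's mail is forwarded.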

NFS

Update 10 Aug 2007: You can't export ZFS-FUSE filesystems over NFS. Even the nfs-user-server.

However, the cifs filesystem works really nicely now. CIFS is the next-gen SMB (Windows File Sharing) filesystem. It includes a soft-mount option and enables it by default, so unreliable network connections (is there any other type?) aren't a problem. I actually wish more people knew about this - it'd solve a lot of hang bugs.

mount -t cifs //server/path /mount-path -o username=bob

Same as before.

Long-term reliability

I've seen the zfs-fuse process trigger the Out-Of-Memory killer twice now after being run for a while. This time, it lasted 42 days. I presume that there's a memory leak in it somewhere.

Update 28 Nov 2007

I had a disk die (already!) so while the machine was down, I tried the OpenSolaris thing again.

I got Belenix running OK from a USB stick, but discovered that you can't modify the setup easily (it mounts a compressed ISO image off there and hence can't be written to). It didn't like my oldish CPU (Athlon XP) due to libm/SSE2 issues. Then it started kernel panicking a lot. So the answer is No.

I got Nexenta running from the USB stick (wooh!) using a Dell Latitude D800. It all seemed to be running nicely - I could install packages and not crash. I moved it onto the actual server and ran into two fatal problems:

  • The machine was too old to boot from a USB stick (USB floppies were OK). I installed to a hard drive.
  • Nexenta hasn't been updated for a very long time. It supports ZFS version 3 (I hadn't even realised there were different versions). We're up to ZFS version 8 now.

(It also didn't support one of my NICs - a Broadcom 5700-based card - but that wasn't fatal.)

So Nexenta was a No.

I considered downloading Solaris Express, but that was 2GB and provided no particular guarantees of success either.

So I'm back to zfs-fuse and cranking up the CPU clock speed (I normally run it underclocked at 1250MHz down from 2GHz because the cooling isn't very good). Recent zfs-fuse versions support NFS (through some arguing) and so that might work.

Somewhere along the line I made the realization that there are no stable ZFS implementations right now. Even if the ZFS code itself is stable in an OpenSolaris, the rest of the environment (especially drivers) is still very immature. Installation is an absolute horror. Maybe if you buy some big iron Sun gear and a proper Solaris license it'll be better - but then, maybe you'll just have someone to sue when it all goes pear-shaped.

FreeBSD can often be booted

FreeBSD can often be booted from USB; it'll look as if you're booting from a SCSI disk. "Often" because people have trouble booting Supermicro server motherboards (with Xeon Woodcrest, e.g.) - something doesn't work in the boot code, so it doesn't get to the boot loader or kernel. Maybe GRUB will help with that...

I have both 6.2 and 7-current booting and running off read-only USB flash drives (128MB is enough for BGP via Quagga and a ton of not-that-much-needed packages, like rrdtool, Midnight Commander, perl-5.8... thanks to geom_uzip, 256MB allows having all of the compiler stuff and even more packages). You'll have to modify startup scripts for things.

I'm also running 7-current with ZFS (/usr is ZFS, and anything but root and /var is ZFS) ever since it appeared in the source tree. There are stability issues if you go after heavy I/O; they may not be related only to ZFS.

You can expand a ZFS pool - zpool add -f bla-bla. Don't do that :). Without -f it won't work for raidz. It seems the problem with expanding a raidz array is somewhat math-related - the information on each device depends on every other device - so it's not straightforward how to expand it.

unisol (not verified) – Wed, 2007-07-25 09:41
