Open Source Bug

OpenZFS Fixes Data Corruption Issue (phoronix.com)

A pull request has been merged to fix a data corruption issue in OpenZFS (the open-source implementation of the ZFS file system and volume manager). "OpenZFS 2.2.2 and 2.1.14 released with fix in place," reports a Thursday comment on GitHub.

Earlier this week, jd (Slashdot reader #1,658) wrote: All versions of OpenZFS 2.2 suffer from a defect that can corrupt the data. Attempts to mitigate the bug have reduced the likelihood of it occurring, but so far nobody has been able to pinpoint what is going wrong or why.

Phoronix reported on Monday: Over the US holiday weekend it became more clear that this OpenZFS data corruption bug isn't isolated to just the v2.2 release — older versions are also susceptible — and that v2.2.1 is still prone to possible data corruption. The good news at least is that data corruption in real-world scenarios is believed to be limited but with some scripting help the corruption can be reproduced. It's also now believed that the OpenZFS 2.2 block cloning feature just makes encountering the problem more likely.

  • "All versions of OpenZFS 2.2 suffer from a defect that can corrupt the data. Attempts to mitigate the bug have reduced the likelihood of it occurring, but so far nobody has been able to pinpoint what is going wrong or why."

    They've cut down on how often it can occur... which is certainly a good thing! But the problem is definitely NOT fixed and the headline is WRONG.

    • by bill_mcgonigle ( 4333 ) * on Saturday December 02, 2023 @11:10PM (#64050185) Homepage Journal

      They found the problem and fixed it (two code paths needed a dirty check), and they have not been able to reproduce the error despite significant testing (on systems that had 2-minute reproducers before the fix).

      See: https://github.com/openzfs/zfs... [github.com]

      > But the problem is definitely NOT fixed and the headline is WRONG.

      What is your basis for this assertion?

      • > What is your basis for this assertion?

        You mean besides the part of the summary that I quoted - where it says "Attempts to mitigate the bug have reduced the likelihood of it occurring"?

        That doesn't sound like a fix to me.

        • Re: (Score:3, Informative)

          by JBeretta ( 7487512 )

          Reading the discussion thread on GitHub, it would appear the developers are fairly certain it is fixed. I've seen lots of folks reporting that systems that were affected by the bug no longer exhibit the problem after patching.

          Is someone going to declare that it is, without any doubt, fixed? Probably not. But hundreds of tests, and a very detailed explanation of EXACTLY where the problem was occurring (some sort of race condition?) along with code to fix the errant behavior seem to show it is

        • by _merlin ( 160982 )

          The "attempts to mitigate the bug" happened prior to the reliable reproduction script and this fix. People were recommending various configuration changes and things, but none of it reliably stopped the data corruption. This fix is an actual fix.

  • Coreutils 9.2 (Score:5, Informative)

    by bill_mcgonigle ( 4333 ) * on Saturday December 02, 2023 @11:05PM (#64050181) Homepage Journal

    From
    https://github.com/openzfs/zfs... [github.com]

    The incorrect dirty check becomes a problem when the first block of a file is being appended to while another process is calling lseek to skip holes. It can happen that the dnode part is undirtied, while dirty records are still on the dnode for the next txg. In this case, lseek(fd, 0, SEEK_DATA) would not know that the file is dirty, and would go to dnode_next_offset(). Since the object has no data blocks yet, it returns ESRCH, indicating no data found, which results in ENXIO being returned to lseek()'s caller.

    Since coreutils 9.2, cp performs sparse copies by default, that is, it uses SEEK_DATA and SEEK_HOLE against the source file and attempts to replicate the holes in the target. When it hits the bug, its initial search for data fails, and it goes on to call fallocate() to create a hole over the entire destination file.

    This has come up more recently as users upgrade their systems, getting OpenZFS 2.2 as well as a newer coreutils. However, this problem has been reproduced against 2.1, as well as on FreeBSD 13 and 14.

    It looks like bookworm has coreutils 9.1.

    Somebody said RHEL9 backported the 9.2 behavior to coreutils 8.

    Gentoo blokes get all the goodies and all the gremlins right away and seem to have been the canaries on this one. TYFYS.
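
The lseek() behaviour quoted in the comment above can be illustrated with a small Python sketch. This is not the coreutils implementation: the function names (first_data_offset, copy_range, sparse_copy) are hypothetical, it assumes a Unix-like system where os.SEEK_DATA and os.SEEK_HOLE are available, and it uses ftruncate() to leave the hole in the destination rather than the fallocate() call mentioned in the quote. It only shows why an ENXIO from the very first SEEK_DATA search makes a sparse-aware copier treat the whole source as a hole.

```python
import errno
import os


def first_data_offset(fd, start=0):
    """Return the offset of the first data at or after `start`, or None.

    None corresponds to lseek() failing with ENXIO, i.e. the filesystem
    reporting that there is no data from `start` to the end of the file.
    """
    try:
        return os.lseek(fd, start, os.SEEK_DATA)
    except OSError as exc:
        if exc.errno == errno.ENXIO:
            return None
        raise


def copy_range(src, dst, start, end, chunk_size):
    """Copy bytes [start, end) from src to dst at the same offsets."""
    pos = start
    while pos < end:
        chunk = os.pread(src, min(chunk_size, end - pos), pos)
        if not chunk:
            break  # source shrank while we were copying
        # pwrite may write fewer bytes than asked; advance by what was written.
        pos += os.pwrite(dst, chunk, pos)


def sparse_copy(src_path, dst_path, chunk_size=1 << 20):
    """Copy only the data extents of src_path; holes stay holes in dst_path."""
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        size = os.fstat(src).st_size
        offset = 0
        while offset < size:
            data_start = first_data_offset(src, offset)
            if data_start is None:
                # The copier trusts this answer. If the filesystem wrongly
                # reports "no data" at offset 0, nothing is copied and the
                # destination becomes one big hole (all zeros when read).
                break
            data_end = os.lseek(src, data_start, os.SEEK_HOLE)
            copy_range(src, dst, data_start, data_end, chunk_size)
            offset = data_end
        # Extend the destination to the source size; unwritten ranges are holes.
        os.ftruncate(dst, size)
    finally:
        os.close(src)
        os.close(dst)
```

With a correct filesystem, sparse_copy("a", "b") produces a byte-identical file. The corruption described in the quote corresponds to first_data_offset() returning None for a file that actually has dirty, not-yet-synced data.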

  • ZFS development has, in recent years, moved from FreeBSD to Linux. There is, apparently, no way to dual-license it, at least not yet.

    But its development on Linux matters, because that increases the number of eyes that can look at it whilst in development, so (in principle) accelerating that development.

    More eyes didn't prevent this corruption bug, but it may well have reduced the time the bug has been allowed to survive, which is a Good Thing.

    ZFS will be increasingly important, I suspect, over the next few year

    • by Bongo ( 13261 )

      I'm just curious, where and how do data centres protect against bit rot or similar issues?

      • by Entrope ( 68843 )

        It depends on the data center and the scale of the system being deployed. Some applications use erasure codes, but those work better with multiple computers cooperating to increase redundancy. Some systems use RAID with multiple parity disks, but that requires the whole array to be in one physical location. Some places abstract those things behind an API that provides a file, block or object store.

  • Corrupted ZFS Data [oracle.com]

    “Data corruption occurs when one or more device errors (indicating one or more missing or damaged devices) affects a top-level virtual device. For example, one half of a mirror can experience thousands of device errors without ever causing data corruption. If an error is encountered on the other side of the mirror in the exact same location, corrupted data is the result.”
    • by jd ( 1658 )

      Oracle ZFS and OpenZFS diverged after Oracle naively closed the license. OpenZFS, I believe, has RAID modes not available in Oracle ZFS. Whether there are other protections present is unclear to me, but it seems safe to say that Oracle's ZFS documentation is reasonably valid for now but you're still advised to double check claims against up-to-date OpenZFS docs.

  • The bug exists in Oracle Solaris' ZFS as well. There's no indication as to whether they've fixed their version or not.
