Battle testing PHP fopen, SQLite, PostgreSQL and MariaDB on FFS2, UFS, ext4, XFS and ZFS

Published on 2024-01-22. Modified on 2024-01-23.

In this article I share the results of a lab test in which I tested writing data to the filesystem directly with PHP fopen versus storing data in SQLite, PostgreSQL and MariaDB on different filesystems, mainly FFS2 on OpenBSD, UFS on FreeBSD, ext4 and XFS on Linux and ZFS on both Linux and FreeBSD, while unplugging power to hard drives and to the machines mid-writing.


Introduction

We (software developers/engineers) tend to stuff everything into a database by default, and the Internet is filled with advice stating that you should always use a database, which seems to be the majority opinion. As such I wanted to take a closer look at what goes on under the hood of some of these complex database systems and try to better determine when you should utilize a database rather than just store files on the filesystem.

I decided to do some "old-school" testing in which I set up multiple physical machines with identical hardware. I wrote some small PHP scripts that store text in files on the filesystem and text in the database. Then I set the scripts in motion and physically cut the power to the machines, by pulling the power cord either to the entire machine or to the destination hard drive, in order to see how many data integrity problems I could create. Each test was then repeated several times in order to determine how often problems would occur, if any.

Professional servers at data centers are usually well protected from unforeseen shutdowns and power losses, but some clients still host locally, which is often not only cheaper, but sometimes also more secure (you know exactly who has physical access to your machines). Sometimes a machine is at the mercy of power failures, accidental resets, a UPS that fails at just the wrong moment, and other incidents that can lead to unclean shutdowns.

There are many other tests that can be done, such as simulating bad sectors on a hard drive, storage running out of space, bad disk controllers, etc., but those are not the issues I am interested in for this test. What I am interested in is the claim that a database is always much better at protecting your data than the filesystem alone.

As with my article from 2019, Battle testing ZFS, Btrfs and mdadm+dm-integrity, this is a write up in which I share some of my findings.

It is worth noting that I have run these tests using different brands of consumer grade spinning hard drives and SSDs. I have not tested anything on data center grade hardware.

Also, all the tests were run primarily with the default settings of each filesystem and database, except for SQLite, for which I enabled write-ahead logging and increased the timeout. I also tested the FULL option for the synchronous flag, but unless you're using transactions, that doesn't seem to provide any extra protection when you're already running with WAL enabled.
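For reference, this is roughly how the SQLite connection was tuned. This is only a sketch; the database filename and the timeout value below are placeholders, not the exact ones used in the tests:

// Tune the SQLite connection (placeholder filename and timeout).
$db = new SQLite3('/test/test.sqlite');

// Increase the busy timeout (in milliseconds) so that concurrent access
// waits for the lock instead of failing immediately with "database is locked".
$db->busyTimeout(15000);

// Enable write-ahead logging.
$db->exec('PRAGMA journal_mode = WAL;');

// Optionally test with the FULL synchronous setting.
// $db->exec('PRAGMA synchronous = FULL;');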

In a real production environment we need to take advantage of whatever fine tuning knobs the operating system, the filesystem and the database provide us with, but in this test I just wanted to see how these systems behave with the default settings when you experience a sudden loss of power, either to the entire machine or to the destination drive.

I have done the testing over about a week in which I have worked on multiple "bare metal" machines running at the same time (identical hardware). I believe it's important to test on real machines versus virtual machines or simulated situations, as the virtual stuff often does not work the same way as a real machine when things fail. If you test a database on a virtual machine, you need to be running your database in a virtual machine in production in order for the test results to be useful. If you move your database to bare metal, you might discover that things do not behave the same.

Also, built-in software testing features or frameworks might try to simulate a power loss, but a real sudden loss of power is actually difficult to simulate. Sometimes (depending on the hardware and the supply of power) enough residual electricity is present in the capacitors on the motherboard and in the power supply that a couple more bits of data manage to get from memory to physical storage, which - even if it's just a split second - might produce different results. On some hardware this is actually consistent. Hence, all of the tests have been done by manually and physically cutting the power to hard drives or entire machines while the machines were doing the writing.

Each test was run by starting a basic file copy of a big binary ISO file from one hard drive to the target hard drive in order to fill up some of the write buffer (both the physical memory on the motherboard and the buffer in the destination disk); I wanted to get as many dirty pages in memory as possible. Then, after transferring 20%, 60%, and 80% of the file, one or two other processes were started in the background which wrote data to the destination drive, either to the filesystem via PHP fopen or to the database via the database driver. Then, after the write loop had run a little, the power was cut.

I ran the tests on OpenBSD 7.4 current (snapshot) with the FFS2 filesystem. I also tested on FreeBSD 14 stable with UFS and ZFS and on Arch Linux with ext4, XFS and ZFS.

Unlike ext4 and XFS, FFS2 and UFS do not have any journaling capabilities, but UFS can utilize the journal class of the GEOM framework in FreeBSD, which can then be used to provide a filesystem independent journaling system.

A filesystem journal uses a log to record all transactions that take place in the filesystem, which preserves data integrity in the event of a system crash or power failure. Although it is still possible to lose unsaved changes to files, journaling almost completely eliminates the possibility of filesystem corruption caused by an unclean shutdown or a power loss. It also shortens the time required for after-failure filesystem checking.

The ext4 and XFS journals are both filesystem specific, whereas the GEOM journal for UFS isn't; it functions at the block level instead.

If you don't enable and set up the GEOM based journal for UFS, UFS can use soft updates, which is the default. Soft updates are an approach to maintaining filesystem metadata integrity that works by tracking and enforcing dependencies among updates to filesystem metadata. They are an alternative to the more commonly used approach of journaling filesystems.

Neither journaling nor soft updates guarantee that no data will be lost, but they do make sure that the filesystem remains consistent.

ZFS is a completely different beast. It is a copy-on-write filesystem and one of its major features is its focus on data integrity, protecting the user's data on disk against silent data corruption caused by data degradation, power surges (voltage spikes), bugs in disk firmware, phantom writes (the previous write did not make it to disk), misdirected reads/writes (the disk accesses the wrong block), DMA parity errors between the array and server memory or from the driver (since the checksum validates data inside the array), driver errors (data winds up in the wrong buffer inside the kernel), accidental overwrites (such as swapping to a live file system), etc.

ZFS uses hierarchical checksumming of all data and metadata, ensuring that the entire storage system can be verified on use, and confirmed to be correctly stored, or remedied if corrupt. Checksums are stored with a block's parent block, rather than with the block itself. This contrasts with many file systems where checksums (if held) are stored with the data so that if the data is lost or corrupt, the checksum is also likely to be lost or incorrect. When ZFS reads data, it recalculates the checksums. If the checksums do not match, meaning one or more data errors have been detected, ZFS will attempt to automatically correct the errors when ditto, mirror, or parity blocks are available.

ZFS can also perform a rollback of recent changes to the file system and data in the event of an error or inconsistency.

In the tests I performed I used PHP 8.3.1, SQLite 3.44.2, PostgreSQL 15 and 16 and MariaDB 10.6.16, 10.9.6 and 11.2.2 (depending on what package version the operating system had available through the package manager).

For SQLite I used the PHP SQLite3 class, while I used PDO for both PostgreSQL and MariaDB.
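The test scripts themselves are not included in this write up, but the database inserts were done along these lines. This is only a sketch; the DSN, the credentials, the second column name and the loop count are placeholders (the table foo and the column foo_id appear in the error output later in this article, the rest is made up):

// Rough sketch of the PDO based insert loop used against PostgreSQL and MariaDB.
$dsn = 'pgsql:host=localhost;dbname=test'; // or 'mysql:host=localhost;dbname=test'
$pdo = new PDO($dsn, 'testuser', 'testpassword', [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

$stmt = $pdo->prepare('INSERT INTO foo (foo_id, foo_text) VALUES (:id, :text)');

for ($i = 1; $i <= 500; $i++) {
    $stmt->execute([':id' => $i, ':text' => 'Some test data for row '.$i]);
    echo 'Inserted row num: '.$i."\n";
}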

As usual, please forgive any shortcomings, missing parts, and mistakes in my attempt to make this write up. Also, please note that English is not my native language.

Myths and misunderstandings

Let's begin by getting rid of some of the myths and misunderstandings.

Myth: The database is always better than the filesystem.

I suspect that this myth originates from back when filesystems were really bad. In the past you could lose an entire OS just by not shutting down the computer cleanly. Occasionally you could not recover from an unclean shutdown and would have to reinstall the operating system.

Back then the databases had to battle these problems, and they utilized checksumming, created roll-back journals, etc. As such, they became much better at handling a power loss than just writing files straight to the filesystem.

Filesystems have however evolved since then and some of them have become very robust.

Furthermore, the database always depends on the filesystem itself. The database stores its files and data on the storage medium via the filesystem. Unless you're running a database on an embedded device without a filesystem, as can e.g. be done with SQLite, or you're working at a place like Google, the database is storing data on the storage medium via the filesystem. If there is a serious problem with the filesystem, there is going to be a serious problem with the database.

This can perhaps best be understood by comparing storing data in a database to storing data in a gzip tar archive. gzip creates a CRC checksum and you can use tar to compare the content of the archive with the content of the filesystem. Such an archive will protect your data better than just storing the data directly on the filesystem without any checksumming. Still, as you can imagine, if the filesystem goes crazy at the exact time you're creating your tar archive, the archive will "break" and no amount of checksumming and data validation is going to help it. The same goes for a database.

Myth: Database developers are the experts, they know best.

Database developers are human; they make mistakes and (sometimes) wrong assumptions - just like the rest of us.

On e.g. the SQLite website it is clearly stated:

SQLite assumes that the operating system will buffer writes and that a write request will return before data has actually been stored in the mass storage device. SQLite further assumes that write operations will be reordered by the operating system. For this reason, SQLite does a "flush" or "fsync" operation at key points. SQLite assumes that the flush or fsync will not return until all pending write operations for the file that is being flushed have completed. We are told that the flush and fsync primitives are broken on some versions of Windows and Linux. This is unfortunate. It opens SQLite up to the possibility of database corruption following a power loss in the middle of a commit. However, there is nothing that SQLite can do to test for or remedy the situation. SQLite assumes that the operating system that it is running on works as advertised. If that is not quite the case, well then hopefully you will not lose power too often.

Making assumptions about the underlying technology can be very bad, especially when you're developing software that's supposed to protect data. But often, as a developer, you simply cannot test your product on every single platform and technology out there; sometimes you just have to make assumptions based on e.g. the documentation, especially if some of the operating system or filesystem developers are less than forthcoming about how feature X really works. Also, working with the POSIX standard can be really difficult; sometimes it's rather vague about a specific subject.

Furthermore, you cannot test everything and if an operating system syscall deviates from e.g. the POSIX standard, you cannot program yourself around every single edge case out there.

A good example of this was discovered with PostgreSQL back in 2018 (something that was later referred to as fsyncgate 2018), but this is not unique to PostgreSQL; the same assumptions were made by many other database developers as well.

If you look through source code of different products, you will occasionally see that assumptions are being made about something that goes untested or unvalidated.

My point is that you should never trust anything blindly, even if it is stated by someone with a "high rank or point score" on e.g. Stack Overflow, Reddit, Hacker News, some blog post, or something else.

If your data matters to you, test your setup and run data recovery drills.

Myth: You should never store binary data in the database, that's what the filesystem is for!

I had not used a database for storing binary data since about 2009, but all the databases I have tested have become extremely good at handling binary data (as BLOBs). Performance-wise there is a difference between the databases, but storing binary data in the database is very safe.

During some of the tests I did, I stuffed hundreds of thousands of images of different sizes (PNG, JPEG, etc.) into the databases, and they handled this very well.
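As a rough illustration of how simple this is (using SQLite here; the table layout and filenames are just examples, not the ones used in the tests):

// Store a binary file as a BLOB in SQLite.
$db = new SQLite3('/test/images.sqlite');
$db->exec('CREATE TABLE IF NOT EXISTS images (id INTEGER PRIMARY KEY, name TEXT, data BLOB)');

$stmt = $db->prepare('INSERT INTO images (name, data) VALUES (:name, :data)');
$stmt->bindValue(':name', 'photo.jpg', SQLITE3_TEXT);
$stmt->bindValue(':data', file_get_contents('/path/to/photo.jpg'), SQLITE3_BLOB);
$stmt->execute();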

Other considerations are relevant though. Dumping and restoring a database with a ton of binary data is never fun. The bigger the database, the longer it takes to do a complete dump and/or recovery.

In any case, I personally do not believe that you should be afraid of storing binary data in the database if that is what you want, they can handle it and they handle it very well.

Myth: You can only scale with a database!

There are other ways you can scale too. Some filesystems can scale, such as GlusterFS and Ceph. You can even scale with NFS simply by sharding the data out into different mounted directories. You can also scale by using different HTTP based APIs on different machines, sending one type of file to one machine and another type to another machine. You can e.g. hash the data and send it to different machines depending on the characters in the hash, as shown in the sketch below.
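A simple sketch of the hashing approach; the shard list is made up for the example and could just as well be NFS mounted directories or hostnames of machines behind an HTTP API:

// Pick a destination shard based on the hash of the filename.
function pick_shard(string $filename, array $shards): string
{
    $hash = sha1($filename);

    // Use the first hex character of the hash to pick a shard.
    $index = hexdec($hash[0]) % count($shards);

    return $shards[$index];
}

$shards = ['/mnt/shard-a', '/mnt/shard-b', '/mnt/shard-c', '/mnt/shard-d'];
echo pick_shard('customer-invoice-42.pdf', $shards), "\n";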

Also, some databases do not scale without external third party applications.

When to use a database rather than the filesystem?

My default opinion is that you should not use a database unless you actually need one.

Don't get me wrong, I absolutely love databases, and I especially love SQLite. But the fact of the matter is that the database, the database driver, and all the database tooling add layers of complexity to your application.

Whenever you "stuff" your data into a database, you are stuffing your data into a kind of "container". Hence, you should only stuff your data into a "container" when that particular container provides some extra benefit you need, such as e.g. searchable indexes or transactional roll-backs etc.

When you store data directly on the filesystem, you have direct access to the data via basic tools such as grep, sed and e.g. a text editor. Those tools are just much easier to work with - especially when problems occur.

However, there are situations when you need to use a database.

Of course other reasons to use a database exist too.

What about product X?

There exist so many amazing open source projects today, too many to test all of them.

So, I am just looking at the good "old and boring" options in this test.

Some source code

Since I am working with PHP in this project, I can start by saying that PHP's fsync() and fdatasync() functions are basically just wrappers around the underlying system calls of the operating system.

I used a simple PHP function to write data to disk that looks like this:

function store_data(string $file, string $content, int $loop_count = 0): void
{
    // fopen() will emit an E_WARNING upon failing to acquire the file, but we
    // still need to handle the case where fopen() returns false.
    //
    // Without the binary mode flag the data - be it correctly encoded or
    // not - might get mangled by fwrite() or its underlying routines on some
    // platforms. To be on the safe side, we use fopen() with the binary mode
    // flag, that's "b". fwrite() will then save the string data "as-is",
    // i.e. as binary data, because strings in PHP are binary strings.
    if (!$fp = fopen($file, 'wb')) {
        trigger_error('WRITE ERROR: Could not write to file: '. $file.' Loop count: '.$loop_count."\n");
    }
    flock($fp, LOCK_EX);
    ftruncate($fp, 0);
    fwrite($fp, $content.PHP_EOL);
    fsync($fp);
    flock($fp, LOCK_UN);
    fclose($fp);
}

I take a lock on the file just before I write to it. This is because the application also needs to be able to read while a write might occur. Once the data has been written, a call to fsync is made. The fsync call tells the underlying operating system to flush the data to disk; however, this completely depends on how the operating system handles fsync, and the implementation differs between operating systems, e.g. Linux and FreeBSD do not handle this in the exact same manner.
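For completeness, the reading counterpart could look something like the sketch below, assuming the same file layout as the store_data() function above; a shared lock is enough for reading:

// Read the content of a file while holding a shared lock.
function read_data(string $file): string
{
    if (!$fp = fopen($file, 'rb')) {
        trigger_error('READ ERROR: Could not open file: '.$file."\n");
        return '';
    }
    flock($fp, LOCK_SH);
    $content = stream_get_contents($fp);
    flock($fp, LOCK_UN);
    fclose($fp);

    return $content;
}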

The sync functions of the databases are more complex as they try to take into consideration the differences between operating system implementations.

A really good example of the quality of code and level of detail that goes into the development of SQLite can be seen in the unixSync function.

/*
** Make sure all writes to a particular file are committed to disk.
**
** If dataOnly==0 then both the file itself and its metadata (file
** size, access time, etc) are synced.  If dataOnly!=0 then only the
** file data is synced.
**
** Under Unix, also make sure that the directory entry for the file
** has been created by fsync-ing the directory that contains the file.
** If we do not do this and we encounter a power failure, the directory
** entry for the journal might not exist after we reboot.  The next
** SQLite to access the file will not know that the journal exists (because
** the directory entry for the journal was never created) and the transaction
** will not roll back - possibly leading to database corruption.
*/
static int unixSync(sqlite3_file *id, int flags){
  int rc;
  unixFile *pFile = (unixFile*)id;

  int isDataOnly = (flags&SQLITE_SYNC_DATAONLY);
  int isFullsync = (flags&0x0F)==SQLITE_SYNC_FULL;

  /* Check that one of SQLITE_SYNC_NORMAL or FULL was passed */
  assert((flags&0x0F)==SQLITE_SYNC_NORMAL
      || (flags&0x0F)==SQLITE_SYNC_FULL
  );

  /* Unix cannot, but some systems may return SQLITE_FULL from here. This
  ** line is to test that doing so does not cause any problems.
  */
  SimulateDiskfullError( return SQLITE_FULL );

  assert( pFile );
  OSTRACE(("SYNC    %-3d\n", pFile->h));
  rc = full_fsync(pFile->h, isFullsync, isDataOnly);
  SimulateIOError( rc=1 );
  if( rc ){
    storeLastErrno(pFile, errno);
    return unixLogError(SQLITE_IOERR_FSYNC, "full_fsync", pFile->zPath);
  }

  /* Also fsync the directory containing the file if the DIRSYNC flag
  ** is set.  This is a one-time occurrence.  Many systems (examples: AIX)
  ** are unable to fsync a directory, so ignore errors on the fsync.
  */
  if( pFile->ctrlFlags & UNIXFILE_DIRSYNC ){
    int dirfd;
    OSTRACE(("DIRSYNC %s (have_fullfsync=%d fullsync=%d)\n", pFile->zPath,
            HAVE_FULLFSYNC, isFullsync));
    rc = osOpenDirectory(pFile->zPath, &dirfd);
    if( rc==SQLITE_OK ){
      full_fsync(dirfd, 0, 0);
      robust_close(pFile, dirfd, __LINE__);
    }else{
      assert( rc==SQLITE_CANTOPEN );
      rc = SQLITE_OK;
    }
    pFile->ctrlFlags &= ~UNIXFILE_DIRSYNC;
  }
  return rc;
}

The SQLite source code reveals that the developers are meticulous in their work and that they do a lot to make sure that the data SQLite works with is well protected against any kind of problem.

From what I can tell, the same amount of detail and thoroughness also goes into the development of PostgreSQL.

This is the source code for the pg_fsync function:

/*
 * pg_fsync --- do fsync with or without writethrough
 */
int
pg_fsync(int fd)
{
#if !defined(WIN32) && defined(USE_ASSERT_CHECKING)
    struct stat st;

    /*
     * Some operating system implementations of fsync() have requirements
     * about the file access modes that were used when their file descriptor
     * argument was opened, and these requirements differ depending on whether
     * the file descriptor is for a directory.
     *
     * For any file descriptor that may eventually be handed to fsync(), we
     * should have opened it with access modes that are compatible with
     * fsync() on all supported systems, otherwise the code may not be
     * portable, even if it runs ok on the current system.
     *
     * We assert here that a descriptor for a file was opened with write
     * permissions (either O_RDWR or O_WRONLY) and for a directory without
     * write permissions (O_RDONLY).
     *
     * Ignore any fstat errors and let the follow-up fsync() do its work.
     * Doing this sanity check here counts for the case where fsync() is
     * disabled.
     */
    if (fstat(fd, &st) == 0)
    {
        int            desc_flags = fcntl(fd, F_GETFL);

        /*
         * O_RDONLY is historically 0, so just make sure that for directories
         * no write flags are used.
         */
        if (S_ISDIR(st.st_mode))
            Assert((desc_flags & (O_RDWR | O_WRONLY)) == 0);
        else
            Assert((desc_flags & (O_RDWR | O_WRONLY)) != 0);
    }
    errno = 0;
#endif

    /* #if is to skip the sync_method test if there's no need for it */
#if defined(HAVE_FSYNC_WRITETHROUGH) && !defined(FSYNC_WRITETHROUGH_IS_FSYNC)
    if (sync_method == SYNC_METHOD_FSYNC_WRITETHROUGH)
        return pg_fsync_writethrough(fd);
    else
#endif
        return pg_fsync_no_writethrough(fd);
}

Also regarding syncing directories:

/*
 * fsync_fname -- fsync a file or directory, handling errors properly
 *
 * Try to fsync a file or directory. When doing the latter, ignore errors that
 * indicate the OS just doesn't allow/require fsyncing directories.
 */
void
fsync_fname(const char *fname, bool isdir)
{
  fsync_fname_ext(fname, isdir, false, data_sync_elevel(ERROR));
}

It's always a good idea to take at least a quick look at some of the source code of the open source projects that you use. You don't need to read through it all and you don't need to understand all of it. Sometimes a "gotcha" is right there in the comments.

You will perhaps be surprised at how often you can find something like this:

/* Should we really be doing this??? */

Depending on what the code is actually doing, such a comment might be a reason for concern ;)

Running the tests

In this part of the write up I will post some examples of the results I got during the different tests. Please note that these examples are not exhaustive in any way; they are just a summary.

ext4 tests

ext4 SQLite

Cutting the power to the drive before the writing script is done, dmesg shows the problem:

[ 4123.680448] ata5.00: disable device
[ 4123.680471] sd 4:0:0:0: [sdb] tag#9 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=11s
[ 4123.680480] sd 4:0:0:0: [sdb] tag#9 Sense Key : Illegal Request [current]
[ 4123.680484] sd 4:0:0:0: [sdb] tag#9 Add. Sense: Unaligned write command
[ 4123.680488] sd 4:0:0:0: [sdb] tag#9 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[ 4123.680492] I/O error, dev sdb, sector 486909448 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 2
[ 4123.680503] ata5: EH complete
[ 4123.680505] Aborting journal on device sdb1-8.
[ 4123.680516] ata5.00: detaching (SCSI 4:0:0:0)
[ 4123.680522] device offline error, dev sdb, sector 486803456 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 2
[ 4123.680528] device offline error, dev sdb, sector 486803456 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 2
[ 4123.680532] Buffer I/O error on dev sdb1, logical block 60850176, lost sync page write
[ 4123.680541] JBD2: I/O error when updating journal superblock for sdb1-8.
[ 4123.680684] ext4-fs error (device sdb1): ext4_journal_check_start:84: comm php: Detected aborted journal
[ 4123.680976] device offline error, dev sdb, sector 2048 op 0x1:(WRITE) flags 0x3800 phys_seg 1 prio class 2
[ 4123.680982] device offline error, dev sdb, sector 2048 op 0x1:(WRITE) flags 0x3800 phys_seg 1 prio class 2
[ 4123.680986] Buffer I/O error on dev sdb1, logical block 0, lost sync page write
[ 4123.681663] ext4-fs (sdb1): I/O error while writing superblock
[ 4123.681665] ext4-fs (sdb1): Remounting filesystem read-only

From PHP we get a disk I/O problem:

PHP Warning:  SQLite3Stmt::execute(): Unable to execute statement: disk I/O error in php-insert-into-sqlite.php on line 143
ERROR writing row num: 11
PHP Warning:  SQLite3Stmt::execute(): Unable to execute statement: database disk image is malformed in php-insert-into-sqlite.php on line 143

Then after getting the drive back up:

$ ls /test/
ls: reading directory '/test/': Input/output error

We let ext4 automatically fix the drive by unmounting and mounting again:

$ doas umount /test/
$ doas mount /test/

The log then shows us that the drive is being fixed:

[ 4230.505449] ext4-fs warning (device sdb1): htree_dirblock_to_tree:1082: inode #2: lblock 0: comm ls: error -5 reading directory block
[ 4251.362302] ext4-fs (sdb1): unmounting filesystem bb933623-bb25-4723-8b79-80bf78e7a0ef.
[ 4257.417957] ext4-fs (sdc1): recovery complete
[ 4257.443758] ext4-fs (sdc1): mounted filesystem bb933623-bb25-4723-8b79-80bf78e7a0ef r/w with ordered data mode. Quota mode: none.

Then let's check the database:

sqlite> SELECT COUNT(*) FROM foo;
┌──────────┐
│ COUNT(*) │
├──────────┤
│ 10       │
└──────────┘

We got 10 inserts completed and no data integrity problems:

sqlite> PRAGMA integrity_check;
┌─────────────────┐
│ integrity_check │
├─────────────────┤
│ ok              │
└─────────────────┘

As expected, the last write, which would have been the 11th row, did not reach the disk.

All the results were the same with SQLite on ext4 with both writes and updates. ext4 cleared up all disk issues and no errors occurred with the database.

ext4 PHP fopen

Again I cut the power in the middle of writing:

[ 5966.767188] ata5.00: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
[ 5966.767196] ata5.00: irq_stat 0x00400040, connection status changed
[ 5966.767198] ata5: SError: { HostInt PHYRdyChg 10B8B DevExch }
[ 5966.767202] ata5.00: failed command: FLUSH CACHE ext
[ 5966.767204] ata5.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 6
                        res 50/00:08:48:18:07/00:00:1d:00:00/40 Emask 0x50 (ATA bus error)
[ 5966.767212] ata5.00: status: { DRDY }
[ 5966.767217] ata5: hard resetting link
[ 5967.507192] ata5: SATA link down (SStatus 0 SControl 300)
[ 5972.537178] ata5: hard resetting link
[ 5972.850110] ata5: SATA link down (SStatus 0 SControl 300)

PHP gives us an error:

PHP Warning:  fopen(/test/php-files/file-11.txt): Failed to open stream: Input/output error in php-fopen.php on line 13
PHP Notice:  WRITE ERROR: Could not write to file: /test/php-files/file-11.txt Loop count: 11
 in php-fopen.php on line 14
PHP Fatal error:  Uncaught TypeError: flock(): Argument #1 ($stream) must be of type resource, false given in php-fopen.php:16
Stack trace:
#0 php-fopen.php(16): flock()
#1 php-fopen.php(146): store_data()
#2 {main}
  thrown in php-fopen.php on line 16

Let's reattach the drive and let ext4 fix the filesystem:

[ 6050.129811] ext4-fs (sdc1): recovery complete
[ 6050.155587] ext4-fs (sdc1): mounted filesystem bb933623-bb25-4723-8b79-80bf78e7a0ef r/w with ordered data mode. Quota mode: none.

The error log gives us more relevant information:

kernel: JBD2: I/O error when updating journal superblock for sdc1-8.
kernel: ext4-fs error (device sdc1) in ext4_reserve_inode_write:5764: Journal has aborted
kernel: ext4-fs error (device sdc1): ext4_dirty_inode:5968: inode #14: comm mc: mark_inode_dirty error
kernel: ext4-fs error (device sdc1) in ext4_dirty_inode:5969: Journal has aborted
kernel: ext4-fs error (device sdc1): ext4_check_bdev_write_error:224: comm kworker/u8:12: Error while async write b

kernel: ext4-fs (sdc1): Delayed block allocation failed for inode 14 at logical offset 498688 with max blocks 1024

kernel: ext4-fs (sdc1): This should not happen!! Data will be lost

kernel: ext4-fs error (device sdc1) in ext4_do_writepages:2700: Journal has aborted
kernel: device offline error, dev sdc, sector 2048 op 0x1:(WRITE) flags 0x3800 phys_seg 1 prio class 2
kernel: device offline error, dev sdc, sector 2048 op 0x1:(WRITE) flags 0x3800 phys_seg 1 prio class 2
kernel: Buffer I/O error on dev sdc1, logical block 0, lost sync page write
kernel: ext4-fs error (device sdc1) in ext4_reserve_inode_write:5764: Journal has aborted
kernel: ext4-fs error (device sdc1): __ext4_ext_dirty:202: inode #16121867: comm kworker/u8:6: mark_inode_dirty err

kernel: ext4-fs warning (device sdc1): ext4_convert_unwritten_extents:4813: inode #16121867: block 0: len 7: ext4_e
d -30
kernel: ext4-fs warning (device sdc1): ext4_end_bio:343: I/O error 17 writing to inode 14 starting block 578752)
kernel: ext4-fs error (device sdc1): ext4_check_bdev_write_error:224: comm kworker/u8:6: Error while async write ba

kernel: ext4-fs error (device sdc1) in ext4_reserve_inode_write:5764: Journal has aborted
kernel: ext4-fs (sdc1): I/O error while writing superblock
kernel: ext4-fs (sdc1): failed to convert unwritten extents to written extents -- potential data loss!  (inode 1612
1867, error -30)

The error messages stated, "This should not happen!! Data will be lost". That was correct. The loop counter showed that we should have had 11 files written to disk, but we only got 9. The good thing is that none of the remaining files ever contained any missing data. In every single test all the files contained the data they should.

The results were exactly the same as with SQLite. One or two files missing or one or two rows missing, but never any missing data in the files or rows that were written to storage.
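The verification scripts are not included in this write up, but checking the surviving files amounts to something like this, assuming each file was written with the same known test string by the store_data() function shown earlier:

// Verify that every surviving file contains exactly the expected content.
$expected = 'known test data'.PHP_EOL; // placeholder for the actual test string

foreach (glob('/test/php-files/file-*.txt') as $file) {
    if (file_get_contents($file) !== $expected) {
        echo 'Data mismatch in: '.$file."\n";
    }
}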

ext4 PostgreSQL

As before, the script is run and the power to the hard drive is disconnected during the write.

PHP Fatal error:  Uncaught PDOException: SQLSTATE[HY000]: General error: 7 PANIC:  could not fdatasync file "000000010000000000000001": Input/output error
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request. in php-insert-into-postgresql.php:151
Stack trace:
#0 php-insert-into-postgresql.php(151): PDOStatement->execute()
#1 {main}
  thrown in php-insert-into-postgresql.php on line 151

Interesting, but not surprising - since I already knew this from looking at the PostgreSQL source code - PostgreSQL is set to panic when there is a write error.

This is in my humble opinion the most correct approach to handling a serious write error like this. The database is fully and completely dependent upon the storage solution; if data cannot be written to disk, something very serious is wrong and the database should shut down completely.

I confirm that the PostgreSQL database is no longer running:

$ systemctl status postgresql
× postgresql.service - PostgreSQL database server
     Loaded: loaded (/usr/lib/systemd/system/postgresql.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/postgresql.service.d
             └─PGROOT.conf
     Active: failed (Result: core-dump) since Sun 2024-01-14 05:54:07 CET; 1min 34s ago
   Duration: 32min 58.074s
    Process: 901 ExecStartPre=/usr/bin/postgresql-check-db-dir ${PGROOT}/data (code=exited, status=0/SUCCESS)
    Process: 904 ExecStart=/usr/bin/postgres -D ${PGROOT}/data (code=dumped, signal=ABRT)
   Main PID: 904 (code=dumped, signal=ABRT)
        CPU: 1.354s

postgres[904]: 2024-01-14 05:54:07.029 CET [904] DETAIL:  Failed process was running: INSERT INTO foo (foo_id, foo>
postgres[904]: 2024-01-14 05:54:07.029 CET [904] LOG:  terminating any other active server processes
postgres[904]: 2024-01-14 05:54:07.032 CET [904] LOG:  all server processes terminated; reinitializing
postgres[904]: 2024-01-14 05:54:07.032 CET [904] LOG:  could not open directory "base/pgsql_tmp": Input/output err>
postgres[904]: 2024-01-14 05:54:07.032 CET [904] LOG:  could not read directory "base": Input/output error
postgres[904]: 2024-01-14 05:54:07.032 CET [904] LOG:  could not open directory "pg_tblspc": Input/output error
postgres[904]: 2024-01-14 05:54:07.051 CET [904] PANIC:  could not open file "global/pg_control": Input/output err>
systemd[1]: postgresql.service: Main process exited, code=dumped, status=6/ABRT
systemd[1]: postgresql.service: Failed with result 'core-dump'.
systemd[1]: postgresql.service: Consumed 1.354s CPU time.

So, this time, without further investigation, let's just restart PostgreSQL and see how it behaves:

# systemctl restart postgresql

In the log we can now see PostgreSQL is cleaning up:

systemd[1]: Starting PostgreSQL database server...
LOG:  starting PostgreSQL 16.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 13.2.1 20230801, 64-bit
LOG:  listening on IPv6 address "::1", port 5432
LOG:  listening on IPv4 address "127.0.0.1", port 5432
LOG:  listening on Unix socket "/run/postgresql/.s.PGSQL.5432"
LOG:  database system was interrupted; last known up at 2024-01-14 05:51:10 CET
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/198F730
LOG:  invalid record length at 0/19A5560: expected at least 24, got 0
LOG:  redo done at 0/19A5538 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.05 s
LOG:  checkpoint starting: end-of-recovery immediate wait
LOG:  checkpoint complete: wrote 15 buffers (0.1%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.046 s, sync=0.084 s, total=0.323 s; sync files=5, longest=0.055 s, average=0.017 s; distance=87 kB, estimate=87 kB; lsn=0/19A5560, redo lsn=0/19A5560
LOG:  database system is ready to accept connections
systemd[1]: Started PostgreSQL database server.

Looking through the database afterwards shows that PostgreSQL managed to clean up everything nicely and no data was ever missing from any rows.

ext4 MariaDB

Again we get an error from the writing script when the power is cut:

PHP Fatal error:  Uncaught PDOException: SQLSTATE[HY000]: General error: 2006 MySQL server has gone away in php-insert-into-mariadb.php:143
Stack trace:
#0 php-insert-into-mariadb.php(143): PDOStatement->execute()
#1 {main}
  thrown in php-insert-into-mariadb.php on line 143

From looking through the code of MariaDB, and also from the "fsyncgate 2018" issue, we know that the MariaDB developers decided to follow the approach of PostgreSQL and make MariaDB panic when an error like this happens. And this is exactly what happened on ext4. However, the developers are very humble, stating that they probably made a mistake and that a bug has occurred:

kernel: Buffer I/O error on device sdb1, logical block 602121
mariadbd[1925]: 2024-01-14  6:58:13 0 [Warning] InnoDB: Retry attempts for writing partial data failed.
mariadbd[1925]: 2024-01-14  6:58:13 0 [ERROR] [FATAL] InnoDB: write("ib_logfile0") returned I/O error
mariadbd[1925]: 240114  6:58:13 [ERROR] mysqld got signal 6 ;

mariadbd[1925]: Sorry, we probably made a mistake, and this is a bug.
mariadbd[1925]: Your assistance in bug reporting will enable us to fix this for the next release.
mariadbd[1925]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
mariadbd[1925]: We will try our best to scrape up some info that will hopefully help
mariadbd[1925]: diagnose the problem, but since we have already crashed,
mariadbd[1925]: something is definitely wrong and this may fail.

mariadbd[1925]: Server version: 11.2.2-MariaDB source revision: 929532a9426d085111c24c63de9c23cc54382259
mariadbd[1925]: key_buffer_size=134217728
mariadbd[1925]: read_buffer_size=131072
mariadbd[1925]: max_used_connections=2
mariadbd[1925]: max_threads=153
mariadbd[1925]: thread_count=2
mariadbd[1925]: It is possible that mysqld could use up to
mariadbd[1925]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 468089 K  bytes of memory
mariadbd[1925]: Hope that's ok; if not, decrease some variables in the equation.
mariadbd[1925]: Thread pointer: 0x0
mariadbd[1925]: Attempting backtrace. You can use the following information to find out
mariadbd[1925]: where mysqld died. If you see no messages after this, something went
mariadbd[1925]: terribly wrong...
mariadbd[1925]: 2024-01-14  6:58:13 14 [ERROR] [FATAL] InnoDB: fdatasync() returned 5
kernel: Buffer I/O error on dev sdb1, logical block 0, lost sync page write
kernel: ext4-fs (sdb1): previous I/O error to superblock detected

...

systemd-coredump[2031]: [🡕] Process 1925 (mariadbd) of user 971 dumped core.

     Stack trace of thread 1940:
     #0  0x00007f19eb2ac83c n/a (libc.so.6 + 0x8e83c)
     #1  0x00007f19eb25c668 raise (libc.so.6 + 0x3e668)
     #2  0x00007f19eb2444b8 abort (libc.so.6 + 0x264b8)
     #3  0x000055fad4473f29 n/a (mariadbd + 0x5f0f29)
     #4  0x000055fad4457485 n/a (mariadbd + 0x5d4485)

No. This was not a bug, this was a hardware problem :)

systemd automatically tried to restart MariaDB, something which I would highly recommend be disabled after such a crash:

mariadbd[2076]: 2024-01-14  6:58:23 0 [ERROR] mariadbd: Can't create/write to file './ddl_recovery.log' (Errcode: 3
tem")
mariadbd[2076]: 2024-01-14  6:58:23 0 [ERROR] DDL_LOG: Failed to create ddl log file: ./ddl_recovery.log
mariadbd[2076]: 2024-01-14  6:58:23 0 [ERROR] Aborting
systemd[1]: mariadb.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: mariadb.service: Failed with result 'exit-code'.
systemd[1]: Failed to start MariaDB 11.2.2 database server.

So we make ext4 clean up the filesystem:

[ 9561.781767] ext4-fs (sdc1): recovery complete
[ 9561.807910] ext4-fs (sdc1): mounted filesystem bb933623-bb25-4723-8b79-80bf78e7a0ef r/w with ordered data mode. Quota mode: none.

Then we restart MariaDB manually in order to see how it handles the situation:

mariadbd[2146]: 2024-01-14  7:04:06 0 [Note] InnoDB: Rollback of non-prepared transactions completed
mariadbd[2146]: 2024-01-14  7:04:06 0 [Note] Recovering after a crash using tc.log
mariadbd[2146]: 2024-01-14  7:04:06 0 [Note] Starting table crash recovery...
mariadbd[2146]: 2024-01-14  7:04:06 0 [Note] Crash table recovery finished.
mariadbd[2146]: 2024-01-14  7:04:06 0 [Note] Server socket created on IP: '0.0.0.0'.
mariadbd[2146]: 2024-01-14  7:04:06 0 [Note] Server socket created on IP: '::'.
mariadbd[2146]: 2024-01-14  7:04:06 0 [Note] mariadbd: Event Scheduler: Loaded 0 events
mariadbd[2146]: 2024-01-14  7:04:06 0 [Note] /usr/bin/mariadbd: ready for connections.
mariadbd[2146]: Version: '11.2.2-MariaDB'  socket: '/run/mysqld/mysqld.sock'  port: 3306  Arch Linux
systemd[1]: Started MariaDB 11.2.2 database server.

A quick look through the data in the database shows that MariaDB always managed to recover from all of the crashes on ext4, whether during a write or an update of the data. No rows were ever missing any data.

ext4 conclusions

While the delayed allocation used by ext4 poses some additional risk of data loss in cases where the system crashes or loses power before all of the data has been written to disk, I did not experience a single problem with either writing to the filesystem or to the databases (except for the occasional missing file or row of data, which is exactly what is expected). This is most likely because the typical scenario in which such a problem occurs is a program replacing the contents of a file WITHOUT forcing a write to the disk with fsync. Hence the importance of using fsync if you choose to write directly to the filesystem without a database.
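A common and somewhat safer pattern when replacing the contents of a file is to write to a temporary file, fsync it and then rename it into place. This is only a sketch; note that PHP has no direct way to fsync the containing directory, which a stricter implementation would also do:

// Replace the contents of a file via a fsync'ed temporary file and rename.
function replace_file(string $file, string $content): void
{
    $tmp = $file.'.tmp';

    if (!$fp = fopen($tmp, 'wb')) {
        trigger_error('WRITE ERROR: Could not write to file: '.$tmp."\n");
        return;
    }
    fwrite($fp, $content);
    fsync($fp);
    fclose($fp);

    // rename() is atomic on POSIX filesystems when the source and target
    // are on the same filesystem, so readers see either the old or the
    // new content, never a partially written file.
    rename($tmp, $file);
}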

ext4 uses checksums in the journal to improve reliability. This feature has the side benefit that it can safely avoid a disk I/O wait during journaling, which improves performance slightly. In ext4, unallocated block groups and sections of the inode table are marked as such, which enables e2fsck to skip them entirely and greatly reduces the time it takes to check the filesystem.

As the tests show, ext4 recovery is performed automatically the first time the file system is mounted after a crash and I did not experience any problems with using the filesystem directly versus using a database.

XFS tests

XFS preparations

As of version 3.2.0 XFS introduced a new on-disk format (v5) that includes a metadata checksum scheme called Self-Describing Metadata. Based on CRC32, it provides additional protection against metadata corruption (e.g. on unexpected power losses). Unlike Btrfs and ZFS, the checksumming only applies to the metadata, not to the actual data.

I am going to leave the XFS checksumming enabled. Furthermore, when using PHP fopen I am relying on the filesystem being as resilient as possible.

# mkfs.xfs /dev/sdb1
meta-data=/dev/sdb1              isize=512    agcount=4, agsize=30524098 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
data     =                       bsize=4096   blocks=122096390, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=59617, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

XFS SQLite

Just like with ext4, we start by inserting data into SQLite and then pull the power cord to the destination disk, and in some tests, to the entire computer.

We see an error during the INSERT:

PHP Warning:  SQLite3Stmt::execute(): Unable to execute statement: disk I/O error in php-insert-sqlite.php on line 437
PHP Warning:  SQLite3Stmt::execute(): Unable to execute statement: database disk image is malformed in php-insert-sqlite.php

And we see the following in dmesg:

[ 1479.868546] ata5.00: detaching (SCSI 4:0:0:0)
[ 1479.868549] I/O error, dev sdb, sector 4821496 op 0x1:(WRITE) flags 0x104000 phys_seg 20 prio class 2
[ 1479.868556] device offline error, dev sdb, sector 4824056 op 0x1:(WRITE) flags 0x104000 phys_seg 20 prio class 2
[ 1479.868564] device offline error, dev sdb, sector 4826616 op 0x1:(WRITE) flags 0x104000 phys_seg 20 prio class 2
[ 1479.868571] device offline error, dev sdb, sector 4829176 op 0x1:(WRITE) flags 0x104000 phys_seg 20 prio class 2
[ 1479.868847] sdb1: writeback error on inode 1075, offset 655360, sector 8492920
[ 1479.868858] sdb1: writeback error on inode 1075, offset 659456, sector 8492928
[ 1479.868865] sdb1: writeback error on inode 1076, offset 0, sector 8493432
[ 1479.869222] sdb1: writeback error on inode 1074, offset 2386558976, sector 4862456
[ 1479.870842] sdb1: writeback error on inode 1074, offset 2436890624, sector 4871160
[ 1479.870870] sdb1: writeback error on inode 1074, offset 2441084928, sector 4879352
[ 1479.870895] sdb1: writeback error on inode 1074, offset 2445279232, sector 4887544
[ 1479.870921] sdb1: writeback error on inode 1074, offset 2449473536, sector 4895736
[ 1479.870945] sdb1: writeback error on inode 1074, offset 2453667840, sector 4903928
[ 1479.870971] sdb1: writeback error on inode 1074, offset 2457862144, sector 4912120
[ 1479.913302] XFS (sdb1): log I/O error -19
[ 1479.913311] XFS (sdb1): Filesystem has been shut down due to log error (0x2).
[ 1479.913313] XFS (sdb1): Please unmount the filesystem and rectify the problem(s).
[ 1480.003861] XFS (sdb1): Unmounting Filesystem dc0feaa4-37d8-4bfa-bb44-806eb3a8836d

Of course the system log also shows the errors:

kernel: sd 4:0:0:0: rejecting I/O to offline device
kernel: I/O error, dev sdb, sector 4813816 op 0x1:(WRITE) flags 0x104000 phys_seg 20 prio class 2
kernel: I/O error, dev sdb, sector 4816376 op 0x1:(WRITE) flags 0x104000 phys_seg 20 prio class 2
kernel: I/O error, dev sdb, sector 8492920 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 2
kernel: I/O error, dev sdb, sector 8492928 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 2
kernel: I/O error, dev sdb, sector 4818936 op 0x1:(WRITE) flags 0x104000 phys_seg 20 prio class 2
kernel: ata5.00: detaching (SCSI 4:0:0:0)
kernel: I/O error, dev sdb, sector 4821496 op 0x1:(WRITE) flags 0x104000 phys_seg 20 prio class 2
kernel: device offline error, dev sdb, sector 4824056 op 0x1:(WRITE) flags 0x104000 phys_seg 20 prio class 2
kernel: device offline error, dev sdb, sector 4826616 op 0x1:(WRITE) flags 0x104000 phys_seg 20 prio class 2
kernel: device offline error, dev sdb, sector 4829176 op 0x1:(WRITE) flags 0x104000 phys_seg 20 prio class 2
kernel: sdb1: writeback error on inode 1075, offset 655360, sector 8492920
kernel: sdb1: writeback error on inode 1075, offset 659456, sector 8492928
kernel: sdb1: writeback error on inode 1076, offset 0, sector 8493432
kernel: sdb1: writeback error on inode 1074, offset 2386558976, sector 4862456
kernel: sdb1: writeback error on inode 1074, offset 2436890624, sector 4871160
kernel: sdb1: writeback error on inode 1074, offset 2441084928, sector 4879352
kernel: sdb1: writeback error on inode 1074, offset 2445279232, sector 4887544
kernel: sdb1: writeback error on inode 1074, offset 2449473536, sector 4895736
kernel: sdb1: writeback error on inode 1074, offset 2453667840, sector 4903928
kernel: sdb1: writeback error on inode 1074, offset 2457862144, sector 4912120
kernel: XFS (sdb1): log I/O error -19
kernel: XFS (sdb1): Filesystem has been shut down due to log error (0x2).
kernel: XFS (sdb1): Please unmount the filesystem and rectify the problem(s).
kernel: XFS (sdb1): Unmounting Filesystem dc0feaa4-37d8-4bfa-bb44-806eb3a8836d
systemd[1]: Stopped target Local File Systems.
systemd[1]: Unmounting /test...
systemd[1]: test\x2ddisk.mount: Deactivated successfully.
systemd[1]: Unmounted /test.

xfs_repair is not run at boot time, even when an XFS filesystem was not cleanly unmounted. In the event of an unclean unmount, the log is simply replayed at mount time, ensuring a consistent filesystem.

XFS automatically unmounted the filesystem, a reasonable precaution and something which ext4 doesn't do. However, as we will see in a moment, this unfortunately doesn't happen every time.

$ doas mount /test/

Now, dmesg shows XFS doing recovery:

[ 1864.482335] XFS (sdb1): Mounting V5 Filesystem dc0feaa4-37d8-4bfa-bb44-806eb3a8836d
[ 1864.651106] XFS (sdb1): Starting recovery (logdev: internal)
[ 1864.803359] XFS (sdb1): Ending recovery (logdev: internal)

The INSERT loop broke at the 17th run, and looking through the SQLite database we can see that we got 16 rows cleanly inserted:

sqlite> SELECT COUNT(*) FROM foo;
┌──────────┐
│ COUNT(*) │
├──────────┤
│ 16       │
└──────────┘

No data is missing and no data integrity problems occurred.

In another test XFS was not automatically unmounted. dmesg showed the following:

[ 3988.028709] XFS (sdb1): metadata I/O error in "xfs_buf_ioend+0x282/0x5f0 [xfs]" at daddr 0x1 len 1 error 5
[ 3988.030221] XFS (sdb1): log I/O error -5
[ 3988.030264] XFS (sdb1): log I/O error -19
[ 3988.030624] sdb1: writeback error on inode 1074, offset 2046820352, sector 71173136
[ 3988.030656] sdb1: writeback error on inode 1074, offset 2092957696, sector 71206928
[ 3988.030687] sdb1: writeback error on inode 1074, offset 2097152000, sector 71215120
[ 3988.030717] sdb1: writeback error on inode 1074, offset 2101346304, sector 71223312
[ 3988.030770] sdb1: writeback error on inode 1074, offset 2105540608, sector 71221776
[ 3988.031423] XFS (sdb1): Metadata I/O Error (0x1) detected at xfs_buf_ioend+0x5c3/0x5f0 [xfs] (fs/xfs/xfs_buf.c:1266).  Shutting down filesystem.
[ 3988.031763] XFS (sdb1): Please unmount the filesystem and rectify the problem(s)

We need to manually unmount the filesystem.

And the log said:

kernel: sdb1: writeback error on inode 1074, offset 2105540608, sector 71221776
kernel: XFS (sdb1): Metadata I/O Error (0x1) detected at xfs_buf_ioend+0x5c3/0x5f0 [xfs] (fs/xfs/xfs_buf.c:1266).  Shutting down filesystem.
kernel: XFS (sdb1): Please unmount the filesystem and rectify the problem(s)
systemd[1]: Unmounting /test...
systemd[1]: test\x2ddisk.mount: Mount process exited, code=exited, status=32/n/a
umount[13201]: umount: /test: target is busy.
systemd[1]: Failed unmounting /test.

So I reattached the power cord to the drive, unmounted the directory and remounted. dmesg shows:

[ 4135.286433] XFS (sdb1): Unmounting Filesystem dc0feaa4-37d8-4bfa-bb44-806eb3a8836d
[ 4219.326391] XFS (sdc1): Mounting V5 Filesystem dc0feaa4-37d8-4bfa-bb44-806eb3a8836d
[ 4219.508601] XFS (sdc1): Starting recovery (logdev: internal)
[ 4219.630847] XFS (sdc1): Ending recovery (logdev: internal)

And the log shows:

kernel: sd 4:0:0:0: [sdc] Attached SCSI removable disk
kernel: XFS (sdb1): Unmounting Filesystem dc0feaa4-37d8-4bfa-bb44-806eb3a8836d
doas[13208]: foo ran command umount /test as root from /home/foo
systemd[1]: test\x2ddisk.mount: Failed with result 'exit-code'.
doas[13211]: foo ran command mount /test as root from /home/foo
kernel: XFS (sdc1): Mounting V5 Filesystem dc0feaa4-37d8-4bfa-bb44-806eb3a8836d
kernel: XFS (sdc1): Starting recovery (logdev: internal)
kernel: XFS (sdc1): Ending recovery (logdev: internal)

Still no data integrity problems and all data was fine:

sqlite> PRAGMA integrity_check;
┌─────────────────┐
│ integrity_check │
├─────────────────┤
│ ok              │
└─────────────────┘

XFS PHP fopen

Again, I am running with the same PHP script as before, creating a bunch of files while cutting the power both to the drive and to the entire computer.

PHP Warning:  fopen(/test/php-files/file-18.txt): Failed to open stream: Input/output error in php-fopen.php on line 13
PHP Notice:  WRITE ERROR: Could not write to file: /test/php-files/file-18.txt Loop count: 18
 in php-fopen.php on line 14
PHP Fatal error:  Uncaught TypeError: flock(): Argument #1 ($stream) must be of type resource, false given in php-fopen.php:16
Stack trace:
#0 php-fopen.php(16): flock()
#1 php-fopen.php(146): store_data()
#2 {main}
  thrown in php-fopen.php on line 16

And dmesg shows:

[ 4530.531854] device offline error, dev sdc, sector 4054584 op 0x1:(WRITE) flags 0x104000 phys_seg 20 prio class 2
[ 4530.532286] sdc1: writeback error on inode 1074, offset 1992294400, sector 4087864
[ 4530.562099] sd 4:0:0:0: [sdc] Synchronizing SCSI cache
[ 4530.562151] sd 4:0:0:0: [sdc] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK

Again, this time the filesystem was not automatically unmounted:

kernel: device offline error, dev sdc, sector 4054584 op 0x1:(WRITE) flags 0x104000 phys_seg 20 prio class 2
systemd[1]: Unmounting /test...
umount[13255]: umount: /test: target is busy.
systemd[1]: test\x2ddisk.mount: Mount process exited, code=exited, status=32/n/a
systemd[1]: Failed unmounting /test.
kernel: sdc1: writeback error on inode 1074, offset 1992294400, sector 4087864
kernel: sd 4:0:0:0: [sdc] Synchronizing SCSI cache
kernel: sd 4:0:0:0: [sdc] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK

I reattached the drive's power cord, unmounted and remounted:

[ 4658.605302] XFS (sdb1): Mounting V5 Filesystem dc0feaa4-37d8-4bfa-bb44-806eb3a8836d
[ 4658.824771] XFS (sdb1): Starting recovery (logdev: internal)
[ 4659.075935] XFS (sdb1): Ending recovery (logdev: internal)

The loop counter reached file 18, so we should at least have "file-17.txt" in the directory, which we got:

$ ls -l /test/php-files
...
file-15.txt
file-16.txt
file-17.txt

A look inside "file-17.txt" reveals a healthy file without any data missing.

XFS PostgreSQL and MariaDB

No data corruption or data missing in any of the tests I did.

XFS conclusions

Red Hat Enterprise Linux uses XFS as its default file system and XFS has been getting a lot of attention from Red Hat (IBM) since they decided to drop Btrfs.

The XFS journal contains entries that describe the portions of the disk blocks changed by filesystem operations. Journal updates are performed asynchronously to avoid a decrease in performance. When a system crash happens, filesystem operations which occurred immediately prior to the crash are reapplied if possible and completed as recorded in the journal. This is how data stored in XFS filesystems remains consistent.

As the tests show, XFS recovery is performed automatically the first time the file system is mounted after a crash. The speed of recovery is independent of the size of the file system, instead depending on the amount of file system operations to be reapplied.

I did not manage to create a single problem on XFS with any of the tests I did. All the fsync calls with PHP completed successfully and XFS managed to nicely clean up anything that could have caused a problem.

XFS seemed as performant as ext4, in some cases even more so, and I have always had positive experiences in the past running XFS.

ZFS tests

Time to do some testing on ZFS.

ZFS SQLite

When I unplugged the power cable to the disk, dmesg showed the problem:

[ 2193.014347] ata2.00: detaching (SCSI 1:0:0:0)
[ 2193.047720] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
[ 2193.047763] sd 1:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 2193.105719] WARNING: Pool 'pool1' has encountered an uncorrectable I/O failure and has been suspended.

However, an important change to the behavior of the PHP script occurred: it froze during the transfer with no error output.

After some time dmesg and the log provided some very relevant information:

[ 2334.574912] INFO: task php:758 blocked for more than 122 seconds.
[ 2334.574915]       Tainted: P          IOE      6.6.11-1-lts #1
[ 2334.574916] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2334.574918] task:php             state:D stack:0     pid:758   ppid:427    flags:0x00004002
[ 2334.574922] Call Trace:
[ 2334.574923]  <TASK>
[ 2334.574925]  __schedule+0x3e7/0x1410
[ 2334.574930]  ? __switch_to_asm+0x3e/0x70
[ 2334.574935]  ? __schedule+0x3ef/0x1410
[ 2334.574941]  schedule+0x5e/0xd0
[ 2334.574945]  io_schedule+0x46/0x70
[ 2334.574949]  cv_wait_common+0xaa/0x130 [spl 4309e7c412c817caa2ceb3a215ecd3901ae54471]
[ 2334.574971]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 2334.574975]  txg_wait_synced_impl+0xfd/0x150 [zfs ff9a20dc6f96e0755362c016011bbca975259357]
[ 2334.575330]  txg_wait_synced+0x10/0x40 [zfs ff9a20dc6f96e0755362c016011bbca975259357]
[ 2334.575684]  zil_commit_impl+0x810/0x1330 [zfs ff9a20dc6f96e0755362c016011bbca975259357]
[ 2334.576028]  zfs_fsync+0x95/0x130 [zfs ff9a20dc6f96e0755362c016011bbca975259357]
[ 2334.576370]  zpl_fsync+0x107/0x190 [zfs ff9a20dc6f96e0755362c016011bbca975259357]
[ 2334.576692]  __x64_sys_fdatasync+0x52/0x90
[ 2334.576698]  do_syscall_64+0x60/0x90
[ 2334.576703]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 2334.576707] RIP: 0033:0x7f976ca8c074
[ 2334.576714] RSP: 002b:00007fff441a77c8 EFLAGS: 00000202 ORIG_RAX: 000000000000004b
[ 2334.576718] RAX: ffffffffffffffda RBX: 0000559d40e8b5e0 RCX: 00007f976ca8c074
[ 2334.576720] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000005
[ 2334.576722] RBP: 0000559d40e8b550 R08: 0000559d40e8b600 R09: 0000559d40fce4d8
[ 2334.576724] R10: 000000000000001e R11: 0000000000000202 R12: 000000000001e2f0
[ 2334.576726] R13: 0000000000001579 R14: 0000000000000000 R15: 0000559d40fa9fc0
[ 2334.576730]  </TASK>

This is behavior built into ZFS: while the pool is suspended, a call to fsync will simply block and wait.

I reattached the power cord to the drive and instructed ZFS to clear the data errors:

$ doas zpool clear pool1
$ zpool status
  pool: pool1
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Jan 14 22:36:50 2024
        714M / 3.52G scanned, 104M / 2.85G issued at 13.0M/s
        1008K resilvered, 3.56% done, 00:03:36 to go
config:

        NAME                                                STATE     READ WRITE CKSUM
        pool1                                               ONLINE       0     0     0
          ata-HITACHI_HTS545050B9A300_110207PBN403171UA5TE  ONLINE       4     0    11  (resilvering)

The result of this was that both the transfer of the big file and the PHP script resumed and finished without any errors:

sqlite> PRAGMA integrity_check;
┌─────────────────┐
│ integrity_check │
├─────────────────┤
│ ok              │
└─────────────────┘

sqlite> SELECT COUNT(*) FROM foo;
┌──────────┐
│ COUNT(*) │
├──────────┤
│ 499      │
└──────────┘

And a SELECT of the last row shows that all is well.

This test was run on Linux, which by default uses up to about 40% of the available memory for filesystem caching. Once that mark has been reached, the kernel flushes data to disk and all subsequent I/O effectively becomes synchronous. The 120 second figure is the default hung task warning threshold, which you can see in /proc/sys/kernel/hung_task_timeout_secs.

A single failing disk in a ZFS pool can cause the whole system to become unresponsive. In order for ZFS to mark a device as bad, I/O to the device must fail with an error. If the operating system, the disk controller, the disk driver, or the disk itself is doing its own "thing" before timing out the I/O and propagating the error to the upper layers, ZFS won't know that the device is bad; it will keep scheduling I/O to it, and ZFS itself has no timeout setting.

This is by design: I/O timeout problems are supposed to be dealt with at the lower layers (the HDD/SSD and its driver). Some SCSI drives have all kinds of retry tuning. If e.g. a drive takes 30 seconds to perform I/O but is still present and the disk driver refuses to mark it bad, ZFS cannot do much about it.

ZFS PHP fopen

Again I am running both the writing and the updating tests, and the results are the same as before.

From dmesg we see the following when I unplug the power cord to the drive:

[  629.894240] zio pool=pool1 vdev=/dev/disk/by-id/ata-HITACHI_HTS545050B9A300_110207PBN403171UA5TE-part1 error=5 type=2 offset=107378728960 size=4096 flags=1589376
[  629.894252] ata2.00: detaching (SCSI 1:0:0:0)
[  629.941034] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
[  629.947684] sd 1:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  629.997936] WARNING: Pool 'pool1' has encountered an uncorrectable I/O failure and has been suspended.

The pool has been suspended and the file transfers freeze.

This time, rather than just reattaching the power cord to the drive and letting ZFS fix the situation, I purposefully crash the machine by unplugging the power to it.

After rebooting I import the ZFS pool:

$ doas zpool import pool1

And the status shows:

$ doas zpool status
  pool: pool1
 state: ONLINE
  scan: resilvered 2.04M in 00:00:27 with 11 errors on Sun Jan 14 22:37:17 2024
config:

        NAME                                                STATE     READ WRITE CKSUM
        pool1                                               ONLINE       0     0     0
          ata-HITACHI_HTS545050B9A300_110207PBN403171UA5TE  ONLINE       0     0     0

errors: No known data errors

As expected, the transfer of the big file was halted and it is missing data, but the 11 files got written to disk and no data corruption occurred.

ZFS PostgreSQL

On a ZFS mirror nothing is supposed to happen when a single drive is lost (and nothing does happen), but now that we have ZFS on a single drive, will the inserts simply halt, as they did with SQLite and PHP fopen, or will PostgreSQL panic? As the PostgreSQL C code shows, the panic depends on the return value of fsync. And as we have already determined, a failing disk in a ZFS pool can cause the system to become unresponsive: when I/O to the device fails, ZFS suspends the pool, and most likely the transfer just hangs until a system administrator fixes the issue.

Let's take a look. We have a pool with a single drive:

$ zpool status
  pool: pool1
 state: ONLINE
  scan: resilvered 3.52G in 00:00:53 with 0 errors on Mon Jan 15 04:02:17 2024
config:

        NAME                               STATE     READ WRITE CKSUM
        pool1                              ONLINE       0     0     0
          ata-TOSHIBA_MK5056GSY_60GMTQUMT  ONLINE       0     0     0

errors: No known data errors

Let's insert some data into PostgreSQL and unplug the power cord to the destination drive.

Again dmesg shows the error:

[ 3888.800345] zio pool=pool1 vdev=/dev/disk/by-id/ata-TOSHIBA_MK5056GSY_60GMTQUMT-part1 error=5 type=2 offset=113364508672 size=524288 flags=1074267264
[ 3888.852542] WARNING: Pool 'pool1' has encountered an uncorrectable I/O failure and has been suspended.
[ 3888.852588] sd 2:0:0:0: [sdc] Synchronizing SCSI cache
[ 3888.852623] sd 2:0:0:0: [sdc] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 4413.886351] WARNING: Pool 'pool1' has encountered an uncorrectable I/O failure and has been suspended.

Both the transfer of the big file and the inserts into PostgreSQL halt and freeze.

As soon as I reattach the power cord to the drive and run the zpool clear command, the transfers complete and all rows are updated without problems.

Even though PostgreSQL is specifically programmed to panic when there is a problem with fsync, ZFS handles the issue differently and all file transfers are halted until a system administrator can take a look at the situation or until the machine is simply rebooted.

ZFS MariaDB

Before I started the tests, I expected roughly the same results as with PostgreSQL.

Again, as with PostgreSQL, the database did not crash. dmesg shows:

[ 5026.536659] WARNING: Pool 'pool1' has encountered an uncorrectable I/O failure and has been suspended.
[ 5026.546163] sd 2:0:0:0: [sdc] Synchronizing SCSI cache
[ 5026.546210] sd 2:0:0:0: [sdc] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 5026.89473] WARNING: Pool 'pool1' has encountered an uncorrectable I/O failure and has been suspended.

Reattaching the drive and fixing the pool resumes the file transfers and all is completed without errors.

ZFS Conclusions

I do not believe the lack of a panic in PostgreSQL and MariaDB when there is an I/O error on ZFS is a problem, because ZFS suspends all I/O operations, which is different from both ext4 and XFS.

Not only is ZFS amazingly resilient even with failures on a single drive, but it is so good that I would actually state that you do not need a database if all you require is data integrity protection. Add the snapshotting and send/receive capabilities of ZFS and you have got a really amazing filesystem.

I personally use ZFS in my home lab and on production servers. Working with ZFS has always been a joy, and even though ZFS keeps getting new features added (some more useful than others), and even though a serious bug was recently introduced that both Linux and FreeBSD suffered from, I must admit that I love ZFS.

If you combine your database with ZFS you have got a very strong system, especially when you run redundant drives. Of course you need to determine how to best set things up, and you better not do a snapshot in the middle of a database transaction, but that is not the subject of this article.

In my humble opinion, ZFS (and Btrfs for that matter) eliminates the need for a database in scenarios where you don't need database-specific operations such as indexed searches.

FFS2 tests

FFS2 softraid mirror with SQLite

I know from experience that OpenBSD's FFS2 is very robust and resilient. I have run with FFS on many machines for many years and I have experienced several power outages, but I have never lost any data to either FFS or FFS2.

OpenBSD softraid on the other hand is, well, okay. I have not lost any data to it, but it is not as nice to work with as ZFS.

Furthermore, as the tests will show, the OpenBSD AHCI driver is buggy. It repeatedly caused the machine to "hang" completely during disk errors, and the machine then not only needs to be rebooted, but also hard reset, because it eventually freezes completely while trying to sync disks. It seems that DragonFly BSD has ported OpenBSD's AHCI driver and incorporated several workarounds for the problems I stumbled upon below, but as far as I can tell this has not been backported to OpenBSD.

In order to set up a softraid mirror I first need to determine what device names the disks have:

$ sysctl hw.disknames
hw.disknames=sd0:a9a32e09b7f1b37b,sd1,sd2:

I know from looking at /etc/fstab that the main disk OpenBSD is installed on is the one that has also gotten a DUID (sd0). I then blank the other two disks and write a fresh MBR with boot code to the disks I want to use in the mirror.

# fdisk -iy sd1
# fdisk -iy sd2

As I went on to partition the disks I noticed that even though both disks are advertised as 500GB in size, there is a difference in their actual physical size. Softraid will still work, but it will complain because the chunks differ in size and some space is wasted. I therefore partition the disk with the smallest physical size first and then copy its partition table onto the bigger disk, which makes the partitioning equal in size.

$ doas disklabel -E sd2
Label editor (enter '?' for help at any prompt)
sd2> a a
offset: [64]
size: [976773104]
FS type: [4.2BSD] RAID
sd2*> w
sd2> q
No label changes.

Then store the partition layout and copy that to sd1:

$ doas disklabel sd2 > sd2.layout
$ doas disklabel -R sd1 sd2.layout
$ rm sd2.layout

Then we create the softraid volume:

$ doas bioctl -c 1 -l sd1a,sd2a softraid0
softraid0: RAID 1 volume attached as sd3

We can take a look at the status of the RAID at any time with:

$ doas bioctl softraid0
Volume      Status               Size Device
softraid0 0 Online       500104946176 sd3     RAID1
          0 Online       500104946176 0:0.0   noencl <sd1a>
          1 Online       500107558912 0:1.0   noencl <sd2a>

Now we need to create a new FFS2 filesystem on the RAID device with the newfs command. Typically the defaults are reasonable:

# newfs /dev/rsd3c
/dev/rsd3c: 476937.2MB in 976767472 sectors of 512 bytes
2356 cylinder groups of 202.50MB, 12960 blocks, 25920 inodes each
super-block backups (for fsck -b #) at:
 160, 414880,
...

Now, sd3 has gotten a DUID so we can use that in /etc/fstab for mounting. I choose not to use automatic fsck for now.

$ sysctl hw.disknames
hw.disknames=sd0:a9a32e09b7f1b37b,sd1:09095176d2dd74ac,sd2:98d246cc6bc65c9b,sd3:8c99f49a5299b85a

In /etc/fstab we then have:

8c99f49a5299b85a.c /test ffs rw 0 0

And the device has been mounted:

$ mount
/dev/sd3c on /test type ffs (local)

$ df -h
/dev/sd3c      451G    2.0K    429G     1%    /test

There are some caveats that need to be noted; the softraid(4) man page lists them and is worth reading before you rely on softraid for anything important.

Let's do some testing with SQLite on OpenBSD softraid and see how it goes.

I am running with the same setup as I was on Linux, the same database and the same amount of data.

I cut the power to the drive before the script is done. There is a small pause in which every file transfer is halted, and after a few seconds dmesg outputs:

/bsd: ahci0: stopping the port, softreset slot 31 was still active.
/bsd: ahci0: failed to reset port during timeout handling, disabling it

Every file transfer is halted and it's not possible to get access to SQLite:

$ sqlite3 /test/foo.sqlite3

Hmm, SQLite does not respond and cannot be broken with CTRL-C.

Let's check and see what the status of the softraid looks like:

$ doas bioctl softraid0
Volume      Status               Size Device
softraid0 0 Degraded     500104946176 sd3     RAID1
          0 Offline      500104946176 0:0.0   noencl <sd1a>
          1 Online       500107558912 0:1.0   noencl <sd2a>

So, at the moment everything is halted.

I will first try to reattach the failed drive and rebuild the softraid.

$ doas bioctl softraid0
Volume      Status               Size Device
softraid0 0 Degraded     500104946176 sd3     RAID1
          0 Offline      500104946176 0:0.0   noencl <sd1a>
          1 Online       500107558912 0:1.0   noencl <sd2a>

Unfortunately I had to manually reboot the machine in order to get anything up and running again. Even that required forcing the machine off, because the reboot hangs at:

AHCI: Stopping the port, softreset slot 31 was still active
AHCI: Failed to reset port during timeout handling, disabling it
Syncing disks

Eventually it might reboot, but I was impatient during these many tests, so I forced a hard reboot in order to proceed (after having waited a while).

This is problematic if this should happen on a remote host.

From investigating the OpenBSD source code, the issue seems to lie in ahci.c:

/* Reset port to abort running command. */
if (ccb_was_started) {
    DPRINTF(AHCI_D_TIMEOUT, "%s: resetting port to abort%s command "
        "in slot %d, pmp port %d, active %08x\n", PORTNAME(ap),
        ncq_cmd ? " NCQ" : "", ccb->ccb_slot, xa->pmp_port, *active);
    if (ahci_port_softreset(ap) != 0 && ahci_port_portreset(ap, 0)
        != 0) {
        printf("%s: failed to reset port during timeout "
            "handling, disabling it\n", PORTNAME(ap));
        ap->ap_state = AP_S_FATAL_ERROR;
    }

    /* Restart any other commands that were aborted by the reset. */
    if (*active) {
        DPRINTF(AHCI_D_TIMEOUT, "%s: re-enabling%s slots "
            "%08x\n", PORTNAME(ap), ncq_cmd ? " NCQ" : "",
            *active);
        if (ncq_cmd)
            ahci_pwrite(ap, AHCI_PREG_SACT, *active);
        ahci_pwrite(ap, AHCI_PREG_CI, *active);
    }
}

After booting up the machine, the filesystem on softraid0 is marked as unclean:

/bsd: WARNING: R/W mount of /test denied.  Filesystem is not clean - run fsck

The softraid0 volume is naturally still marked as "DEGRADED":

$ doas bioctl softraid0
Volume      Status               Size Device
softraid0 0 Degraded     500104946176 sd3     RAID1
          0 Offline                 0 0:0.0   noencl <sd1a>
          1 Online       500107558912 0:1.0   noencl <sd2a>

I then rebuild the RAID:

$ doas bioctl -R /dev/sd1a sd3
softraid0: sd1a partition too large, wasting 2612736 bytes
softraid0: rebuild of sd3 started on sd1a

After some time:

$ doas bioctl softraid0
Volume      Status               Size Device
softraid0 0 Rebuild      500104946176 sd3     RAID1 0% done
          0 Rebuild      500107558912 0:0.0   noencl <sd1a>
          1 Online       500107558912 0:1.0   noencl <sd2a>

But then suddenly I see this:

$ doas bioctl softraid0
Volume      Status               Size Device
softraid0 0 Offline      500104946176 sd3     RAID1
          0 Rebuild      500107558912 0:0.0   noencl <sd1a>
          1 Offline      500107558912 0:1.0   noencl <sd2a>

And in dmesg:

/bsd: sd3: retrying read on block 7147776
/bsd: softraid0: could not create read io

Trying to mount the softraid gives:

$ doas mount /test/
mount_ffs: 8c99f49a5299b85a.c on /test: Input/output error

Let's try and bring the RAID up again:

$ doas bioctl -c 1 -l sd1a,sd2a softraid0
softraid0: trying to bring up sd3 degraded
softraid0: sd3 is offline, will not be brought online

Ok, we need the force option:

$ doas bioctl -C force -c 1 -l sd1a,sd2a softraid0
softraid0: RAID 1 volume attached as sd3

$ doas bioctl sd3
Volume      Status               Size Device
softraid0 0 Online       500107558912 sd3     RAID1
          0 Online       500107558912 0:0.0   noencl <sd1a>
          1 Online       500107558912 0:1.0   noencl <sd2a>

Trying to mount:

$ doas mount /test/
mount_ffs: 8c99f49a5299b85a.c on /test: filesystem must be mounted read-only; you may need to run fsck

Doh!

I forgot that I had disabled the automatic filesystem check; I needed to run fsck. And there I was, trying to bring everything up by force. Sigh!

Let's run fsck:

$ doas fsck /dev/sd3c
** /dev/rsd3c
** Last Mounted on /test
** Phase 1 - Check Blocks and Sizes
PARTIALLY TRUNCATED INODE I=5
SALVAGE? [Fyn?] F
SUMMARY INFORMATION BAD
SALVAGE? yes

BLK(S) MISSING IN BIT MAPS
SALVAGE? yes

5 files, 1846865 used, 234673808 free (16 frags, 29334224 blocks, 0.0% fragmentation)

MARK FILE SYSTEM CLEAN? yes

***** FILE SYSTEM WAS MODIFIED *****

Now let's mount the filesystem and take a look at the SQLite database:

sqlite> PRAGMA integrity_check;
┌─────────────────┐
│ integrity_check │
├─────────────────┤
│ ok              │
└─────────────────┘

Looking at the data in the database, all is looking fine. No data integrity issues.

Notes on OpenBSD FFS2 softraid

It makes no sense to continue testing anything on softraid. OpenBSD softraid is "ok"; it's kind of like mdadm for Linux, except I do not remember having to do as many reboots with mdadm (it has been years though).

In all the tests I did, I always managed to get the RAID rebuilt and up and running again; it just takes more work than with e.g. ZFS or Btrfs.

I decided to not test more on softraid and instead continue testing on FFS2 with a single drive.

FFS2 SQLite

I have mounted the disk and created the database. Let's test.

The dmesg output reveals that the disk has gone dead when I unplugged the power cord:

/bsd: sd1 detached
/bsd: ahci0: stopping the port, softreset slot 31 was still active.
/bsd: ahci0: failed to reset port during timeout handling, disabling it

We also get an error message from PHP:

PHP Warning:  SQLite3Stmt::execute(): Unable to execute statement: database disk image is malformed in php-insert-into-sqlite.php on line 145

Again AHCI causes the machine to hang and I have to manually reset it. This time even the main drive gets an unclean shutdown because of the manual reset. OpenBSD runs an automatic fsck on it and that drive comes up clean. After the machine is up and running we still have an unclean filesystem on the second disk:

/bsd: WARNING: R/W mount of /test denied.  Filesystem is not clean - run fsck

So let's do that:

$ doas fsck /dev/sd1a
** /dev/rsd1a
** Last Mounted on /test
** Phase 1 - Check Blocks and Sizes
PARTIALLY TRUNCATED INODE I=4
SALVAGE? [Fyn?] F
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE? yes

SUMMARY INFORMATION BAD
SALVAGE? yes

BLK(S) MISSING IN BIT MAPS
SALVAGE? yes

5 files, 4362 used, 120170163 free (11 frags, 15021269 blocks, 0.0% fragmentation)

MARK FILE SYSTEM CLEAN? yes

***** FILE SYSTEM WAS MODIFIED *****

The disk has been mounted and let's take a look at the database:

sqlite> PRAGMA integrity_check;
┌─────────────────┐
│ integrity_check │
├─────────────────┤
│ ok              │
└─────────────────┘

No data integrity errors detected with the database at any point during these tests.

FFS2 PHP fopen

Same procedure again.

$ php-8.3 php-fopen.php
PHP Warning:  fopen(/test/php-files/file-89.txt): Failed to open stream: Input/output error in php-fopen.php on line 13
PHP Notice:  WRITE ERROR: Could not write to file: /test/php-files/file-89.txt Loop count: 89
 in php-fopen.php on line 14
PHP Fatal error:  Uncaught TypeError: flock(): Argument #1 ($stream) must be of type resource, false given in php-fopen.php:16
Stack trace:
#0 php-fopen.php(16): flock()
#1 php-fopen.php(147): store_data()
#2 {main}
  thrown in php-fopen.php on line 16

And in dmesg:

/bsd: ahci0: stopping the port, softreset slot 31 was still active.
/bsd: ahci0: failed to reset port during timeout handling, disabling it

Again, unfortunately a hard reset of the machine is required as AHCI keeps hanging.

$ doas fsck /dev/sd1a
** /dev/rsd1a
** Last Mounted on /test
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE? [Fyn?] F
SUMMARY INFORMATION BAD
SALVAGE? yes

BLK(S) MISSING IN BIT MAPS
SALVAGE? yes

92 files, 4187 used, 120170338 free (98 frags, 15021280 blocks, 0.0% fragmentation)

MARK FILE SYSTEM CLEAN? yes

***** FILE SYSTEM WAS MODIFIED *****

Checking the files shows that not only is file 89 missing, but file 88, although created, is empty:

$ file file-88.txt
file-88.txt: empty

This is the result of FFS2 not having a journal. Empty files or file corruption will occur almost every time with ordinary files, even when you use fsync carefully. That is something to be on the lookout for if you're using FFS2 and you experience an unclean shutdown.
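
For what it's worth, here is a sketch of my own (not part of the test scripts) of a more defensive write pattern for a filesystem without a journal: write to a temporary file, fsync it, then rename it into place so a crash leaves either the old file or the complete new file. It assumes PHP 8.1+ for fsync():

<?php
// Sketch only: write to a temporary file, fsync, then atomically rename into
// place. Even this cannot guarantee anything on FFS2, but it narrows the
// window in which a half-written file can be left behind.
function store_data_safely(string $path, string $data): bool
{
    $tmp = $path . '.tmp';
    $fp = fopen($tmp, 'w');
    if ($fp === false) {
        return false;
    }
    if (fwrite($fp, $data) === false || !fflush($fp) || !fsync($fp)) {
        fclose($fp);
        return false;
    }
    fclose($fp);
    if (!rename($tmp, $path)) {
        return false;
    }
    // Ideally the containing directory is fsynced as well so the rename itself
    // is durable; this only works on systems where fopen() on a directory is allowed.
    $dir = @fopen(dirname($path), 'r');
    if ($dir !== false) {
        fsync($dir);
        fclose($dir);
    }
    return true;
}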

FFS2 PostgreSQL

I have installed PostgreSQL and created the database. Let's test.

$ php-8.3 php-insert-into-postgresql.php

Cutting power at loop 48, dmesg says:

/bsd: ahci0: stopping the port, softreset slot 31 was still active.
/bsd: ahci0: failed to reset port during timeout handling, disabling it

As before I have to manually hard reset the machine.

The hang prevents PostgreSQL from panicking. However, after rebooting the machine and cleaning the disk with fsck, PostgreSQL is up and running again and the database has not been corrupted.

I did not manage to create an incident in which rows were missing data.

FFS2 MariaDB

Same as before.

$ php-8.3 php-insert-into-mariadb.php

Again the processes hang, and the machine needs to be rebooted and eventually hard reset.

/bsd: ahci0: stopping the port, softreset slot 31 was still active.
/bsd: ahci0: failed to reset port during timeout handling, disabling it

As with PostgreSQL, MariaDB did not panic. However, after having rebooted the machine and cleaned the disk with fsck, MariaDB is up and running again and the database has not been corrupted.

I did not manage to create an incident in which rows were missing data.

FFS2 Conclusions

OpenBSD's filesystem FFS2 is an improved version of the FFS filesystem shipped with BSD 4.4.

FFS2 is faster than its predecessor FFS when creating the filesystem, as well as analyzing it with fsck. FFS2 uses 64-bit timestamps and block numbers; so it is not subject to the Y2038 bug. FFS2 supports very large partitions (>= 1TB, since 4.2). FFS is designed to be fast, reliable, and able to handle the most common situations effectively while still supporting weird configurations. By default, OpenBSD tunes FFS for general use, but you can optimize it to fit your needs—whether you need to hold trillions of tiny files or a half dozen 30GB files. You don't need to know much about FFS internals, but you should at least understand blocks, fragments, and inodes.

It is also worth noting that fsck can take a long time to repair big disks.

Now, while it can look like FFS2 is "bad" compared to e.g. ext4 and XFS on Linux, it's important to note that FFS2 does not have a filesystem journal. Despite that, it is very robust, and as I have mentioned, I have experienced a few power outages over the years while running different OpenBSD machines without ever losing any data or having OpenBSD fail to boot.

If you use any kind of application that writes data directly to disk on FFS2, you need to remember to fsync, and if you do have a power outage, you need to scan for empty files before you bring your application back up. You can also add some "hand-made" custom checksumming to your application to increase the protection, but as soon as you begin doing that, you're implementing parts of a database - without the search capabilities of course - and you might as well consider just using a database at that point.
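
As an illustration only, such a scan can be very simple; the path below is an assumption taken from my test setup:

<?php
// Sketch: after an unclean shutdown on FFS2, look for empty (and therefore
// almost certainly truncated) files before bringing the application back up.
$iterator = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator('/test/php-files', FilesystemIterator::SKIP_DOTS)
);

$suspect = [];
foreach ($iterator as $file) {
    if ($file->isFile() && $file->getSize() === 0) {
        $suspect[] = $file->getPathname();
    }
}

if ($suspect !== []) {
    fwrite(STDERR, "Possibly truncated files:\n" . implode("\n", $suspect) . "\n");
    exit(1);
}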

I did not experience any missing data in any of the database tests I did on FFS2. Data integrity on FFS2 does indeed improve when you use one of these databases.

UFS tests

To configure the GEOM journaling on the UFS file system, one should first create a gjournal provider using the gjournal utility, then run newfs or tunefs on it with the -J flag which instructs UFS to cooperate with the gjournal provider below.

There are important differences in how journaled UFS works. The most important one is that sync and fsync system calls do not work as expected anymore. To ensure that data is stored on the data provider, the gjournal sync command should be used after calling sync. For the best performance possible, soft-updates should be disabled when gjournal is used. It is also safe and recommended to use the async mount option.

I began the tests on UFS with the GEOM journal enabled.

UFS SQLite and PHP fopen with gjournal

First I test by just pulling the cable to the single drive.

On all machines we get a lot of GEOM journal errors:

kernel: GEOM_JOURNAL: [copy] Error while writing data (error=6) ada1[WRITE(offset=374969303040, length=86016)]

The SQLite writing script fails with:

Warning: SQLite3Stmt::execute(): Unable to execute statement: database or disk is full in php-insert-into-sqlite.php on line 385

And PHP fopen:

Notice: WRITE ERROR: Could not write to file: /test/php-files/file-520.txt Loop count: 520
 in php-fopen.php on line 14

Fatal error: Uncaught TypeError: flock(): Argument #1 ($stream) must be of type resource, false given in php-fopen.php:16
Stack trace:
#0 php-fopen.php(16): flock(false, 2)
#1 php-fopen.php(387): store_data('/test/php-files...', '\n1\n\nLorem ipsum...', 520)
#2 {main}
  thrown in php-fopen.php on line 16

I reattach the power cords and try to clean UFS.

First I try to unmount. On one of the machines I do not get any errors, but on the other machine I get the following:

kernel: GEOM_JOURNAL: [flush] Error while writing data (error=6) ada1[WRITE(offset=499928533504, length=16896)]
kernel: GEOM_JOURNAL: Flush cache of ada1: error=6.
kernel: GEOM_JOURNAL: Cannot update metadata (error=6). ada1[WRITE(offset=500107861504, length=512)]
kernel: GEOM_JOURNAL: Flush cache of ada1: error=6.
kernel: GEOM_JOURNAL: [copy] Error while writing data (error=6) ada1[WRITE(offset=65536, length=4096)]
kernel: GEOM_JOURNAL: [copy] Error while writing data (error=6) ada1[WRITE(offset=20676608, length=12288)]
kernel: GEOM_JOURNAL: Flush cache of ada1: error=6.
kernel: GEOM_JOURNAL: Cannot update metadata (error=6). ada1[WRITE(offset=500107861504, length=512)]
kernel: GEOM_JOURNAL: Flush cache of ada1: error=6.

Let's see if we can clean this up:

$ doas fsck_ufs /dev/ada1.journal
Attempted recovery for standard superblock: failed
Attempted extraction of recovery data from standard superblock: failed
Attempt to find boot zone recovery data.
Finding an alternate superblock failed.
Check for only non-critical errors in standard superblock
Failed, superblock has critical errors
SEARCH FOR ALTERNATE SUPER-BLOCK FAILED. YOU MUST USE THE
-b OPTION TO FSCK TO SPECIFY THE LOCATION OF AN ALTERNATE
SUPER-BLOCK TO SUPPLY NEEDED INFORMATION; SEE fsck_ffs(8).

No luck. A reboot is required.

That fixed it. During boot we get:

kernel: GEOM_JOURNAL: Journal 1000171608: ada1 contains data.
kernel: GEOM_JOURNAL: Journal 1000171608: ada1 contains journal.
kernel: GEOM_JOURNAL: Journal ada1 consistent.

I set up fstab to have the disks automatically cleaned upon boot this time. The disks have been mounted and are up and running again.

On both machines we get no errors from SQLite:

sqlite> PRAGMA integrity_check;
┌─────────────────┐
│ integrity_check │
├─────────────────┤
│ ok              │
└─────────────────┘

sqlite> SELECT COUNT(*) FROM foo;
┌──────────┐
│ COUNT(*) │
├──────────┤
│ 548      │
└──────────┘

And looking through the data in the database shows that all is well.

What about PHP fopen? Despite the GEOM journal I find one empty file on one of the machines; "file-114.txt" was the last file to be opened just before I disconnected the power cord to the drive:

$ find . -type f -size 0
./file-114.txt

Notes on UFS with the GEOM journal

UFS with the GEOM journal enabled did not improve on the results I had with FFS2, and with PHP fopen it was worse than both ext4 and XFS.

SQLite, PostgreSQL and MariaDB did not get any empty rows or rows missing data, and it seems that using one of these databases does indeed improve data integrity on UFS with the GEOM journal enabled.

UFS with soft updates on the other hand was a complete disaster, which is why I skip ahead to those tests now.

I rebooted the machines in order to destroy and disable the GEOM journal and I then formatted the disks cleanly with UFS with soft updates enabled.

The procedure is described in the FreeBSD Manual.

UFS SQLite and PHP fopen with soft updates

I remove the power cord from the hard drive and get the following:

Warning: fopen(/test/php-files/file-125.txt): Failed to open stream: No such file or directory in php-fopen.php on line 13

Notice: WRITE ERROR: Could not write to file: /test/php-files/file-125.txt Loop count: 125
 in php-fopen.php on line 14

Fatal error: Uncaught TypeError: flock(): Argument #1 ($stream) must be of type resource, false given in php-fopen.php:16
Stack trace:
#0 php-fopen.php(16): flock(false, 2)
#1 php-fopen.php(387): store_data('/test/php-files...', '\n1\n\nLorem ipsum...', 125)
#2 {main}
  thrown in php-fopen.php on line 16

And in dmesg:

ada1:  s/n 5VE2G9RJ detached
g_vfs_done(): ada1 converting all errors to ENXIO
g_vfs_done():ada1[WRITE(offset=6001721344, length=1048576)]error = 6 supressing further ENXIO
UFS: forcibly unmounting /dev/ada1 from /test
(ada1:ahcich1:0:0:0): Periph destroyed
vnode_pager_putpages: I/O error 6
0xfffff8016b242540: type VREG state VSTATE_CONSTRUCTED op 0xffffffff8172e8c0
    usecount 3, writecount 2, refcount 3 seqc users 0
    hold count flags ()
    flags (VIRF_PGREAD|VMP_LAZYLIST)
    v_object 0xfffff8016b24db58 ref 2 pages 8 cleanbuf 1 dirtybuf 0
    lock type ufs: EXCL by thread 0xfffffe000f709720 (pid 0, kernel, tid 100017)
        nlink=1, effnlink=1, size=32768, extsize 0
        generation=24bb4f65, uid=1000, gid=0, flags=0x0
        ino 9, on dev ada1

UFS is forcibly unmounting the disk, which is very good.

I reboot and get a small warning during boot:

kernel: WARNING: /test was not properly dismounted

This time there are no empty files with PHP fopen(), but one thing I do notice is that even though the loop counter runs high both during writing and updating, not many files actually reach the physical disk. E.g. I get the error WRITE ERROR: Could not write to file: /test/php-files/file-88.txt Loop count: 88, but still only 24 files were synced to disk. During an update run the counter reached 85, but not even a single file was written!

But what is perhaps even more concerning is that I am getting a lot of problems with SQLite.

On one machine, there are no integrity issues detected:

sqlite> PRAGMA integrity_check;
┌─────────────────┐
│ integrity_check │
├─────────────────┤
│ ok              │
└─────────────────┘

But just as with PHP fopen, where hardly any data reached the disk, the entire database is empty.

sqlite> SELECT COUNT(*) FROM foo;
0

On another machine it is much worse. I am getting integrity errors:

sqlite> PRAGMA integrity_check;
Parse error: database disk image is malformed (11)

Looking through the database doesn't work of course:

sqlite> SELECT COUNT(*) FROM foo;
Parse error: database disk image is malformed (11)

I did a lot of tests and every time I got data integrity problems with SQLite.

Several times, after the database had been recovered, parts of the data were missing.

$ sqlite3 foo.sqlite3 ".recover" | sqlite3 foo-new.sqlite3

The recovery worked, but when we take a look at the data, rows are missing text:

sqlite> SELECT COUNT(*) FROM foo;
┌──────────┐
│ COUNT(*) │
├──────────┤
│ 182      │
└──────────┘

sqlite> SELECT * FROM foo WHERE foo_id = 182;
┌────────┬──────────────────────────────────────────────────────────────┐
│ foo_id │                           foo_text                           │
├────────┼──────────────────────────────────────────────────────────────┤
│ 182    │                                                              │
│        │ Chapter 1                                                    │
│        │                                                              │
│        │ Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cra │
│        │ s nec nunc accumsan, consectetur nulla volutpat, pellentesqu │
│        │ e mauris. Nam blandit convallis tortor, ac congue lacus luct │
│        │ us nec. Duis quis mauris ultrices, tempus mauris non, congue │
│        │  velit. Nullam laoreet eu mi rhoncus rhoncus. Fusce consequa │
│        │ t nulla nunc, ac interdum orci dapibus vel. Quisque velit ve │
│        │ lit, feugiat vitae dictum vitae, sodales non risus. Ut conva │
│        │ llis ex sem, vel pellentesque tortor dictum in. Proin sed ul │
│        │ tricies metus. P                                             │
└────────┴──────────────────────────────────────────────────────────────┘

A lot more text should have been there.

Rows 177 to 182 only contain a fraction of the text they should hold. Row 177 contains a little more, row 178 a little less, and so on.

Every time I ran this test on UFS with soft updates I got data integrity errors and either no data or only part of the data in SQLite.

UFS PostgreSQL with soft updates

Let's see how PostgreSQL fares on UFS with soft updates.

I disconnect the power to the drive. PHP complains:

Fatal error: Uncaught PDOException: SQLSTATE[58030]: Io error: 7 ERROR:  could not extend file "base/16388/16392": Input/output error
HINT:  Check free disk space. in php-insert-into-postgresql.php:390
Stack trace:
#0 php-insert-into-postgresql.php(390): PDOStatement->execute()
#1 {main}
  thrown in php-insert-into-postgresql.php on line 390

On another machine I get:

Fatal error: Uncaught PDOException: SQLSTATE[HY000]: General error: 7 PANIC:  could not write to log file 000000010000000000000001 at offset 11198464, length 16384: Device not configured
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request. in php-insert-into-postgresql.php:390
Stack trace:
#0 php-insert-into-postgresql.php(390): PDOStatement->execute()
#1 {main}
  thrown in php-insert-into-postgresql.php on line 390

We also get an error in the kernel log which shows that PostgreSQL panics as it is coded to do:

kernel: ada1 at ahcich1 bus 0 scbus3 target 0 lun 0
kernel: ada1:  s/n 5VE2G9RJ detached
kernel: g_vfs_done(): ada1 converting all errors to ENXIO
kernel: g_vfs_done():ada1[WRITE(offset=5359828992, length=1048576)]error = 6 supressing further ENXIO
kernel: UFS: forcibly unmounting /dev/ada1 from /test
kernel: (ada1:ahcich1:0:0:0): Periph destroyed
kernel: Failed to write core file for process postgres (error 6)
kernel: pid 1169 (postgres), jid 0, uid 770: exited on signal 6 (no core dump - other error)
postgres[1169]: [7-1] 2024-01-17 03:32:28.358 CET [1169] PANIC:  could not write to log file 000000010000000000000001 at offset 11198464, length 16384: Device not configured
kernel: Failed to write core file for process postgres (error 6)
kernel: pid 1170 (postgres), jid 0, uid 770: exited on signal 6 (no core dump - other error)

Everything was rebooted and filesystems cleaned:

WARNING: / was not properly dismounted
WARNING: /test was not properly dismounted
WARNING: /test: mount pending error: blocks 0 files 3
...
fsck[1091]: /dev/ada1: INCORRECT BLOCK COUNT I=36297987 (128 should be 0) (CORRECTED)
fsck[1091]: /dev/ada1: UNREF FILE I=36297986  OWNER=mysql MODE=100660
fsck[1091]: /dev/ada1: SIZE=965 MTIME=Jan 17 03:50 2024
fsck[1091]:  (CLEARED)
fsck[1091]: /dev/ada1: UNREF FILE I=36297987  OWNER=mysql MODE=100660
fsck[1091]: /dev/ada1: SIZE=0 MTIME=Jan 17 03:50 2024
fsck[1091]:  (CLEARED)
fsck[1091]: /dev/ada1: Reclaimed: 0 directories, 3 files, 424416 fragments

PostgreSQL did not have any of the problems that SQLite had. In every single test PostgreSQL had no data integrity issues and no rows were missing any data.

UFS MariaDB with soft updates

What about MariaDB then?

Again, we get an error:

Fatal error: Uncaught PDOException: SQLSTATE[HY000]: General error: 2006 MySQL server has gone away in php-insert-into-mariadb.php:251
Stack trace:
#0 php-insert-into-mariadb.php(251): PDOStatement->execute()
#1 {main}
  thrown in php-insert-into-mariadb.php on line 251

Likewise we get an error in the kernel log:

kernel: ada1 at ahcich4 bus 0 scbus5 target 0 lun 0
kernel: ada1:  s/n 110207PBN403171UA5TE detached
kernel: g_vfs_done():ada1[READ(offset=312890195968, length=8192)]error = 6 supressing further ENXIO
kernel: g_vfs_done(): ada1 converting all errors to ENXIO
kernel: UFS: forcibly unmounting /dev/ada1 from /test
kernel: (ada1:ahcich4:0:0:0): Periph destroyed
kernel: pid 1477 (mariadbd), jid 0, uid 88: exited on signal 6 (no core dump - other error)

MariaDB crashed every single time, as it is supposed to do (missed that in the log above in my notes, sorry).

$ doas service mysql-server status
mysql is not running.

In the MariaDB specific log (mysql.err) we have the following:

[Note] Starting MariaDB 10.6.16-MariaDB source revision b83c379420a8846ae4b28768d3c81fa354cca056 as process 1301
[Note] InnoDB: Compressed tables use zlib 1.3
[Note] InnoDB: Number of pools: 1
[Note] InnoDB: Using generic crc32 instructions
[Note] InnoDB: Initializing buffer pool, total size = 134217728, chunk size = 134217728
[Note] InnoDB: Completed initialization of buffer pool
[Note] InnoDB: Starting crash recovery from checkpoint LSN=41354,41354
...
[Note] InnoDB: IO Error: 5during write of 16384 bytes, for file ./ibdata1(10), returned 0
...
[ERROR] InnoDB: Tablespace 5 was not found at ./foo/foo.ibd.
[ERROR] InnoDB: Set innodb_force_recovery=1 to ignore this and to permanently lose all changes to the tablespace.
[ERROR] InnoDB: Plugin initialization aborted with error Tablespace not found
[Note] InnoDB: Starting shutdown...
[ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
[Note] Plugin 'FEEDBACK' is disabled.
[ERROR] Unknown/unsupported storage engine: InnoDB
[ERROR] Aborting

This is bad! It is complaining about [ERROR] Unknown/unsupported storage engine: InnoDB.

I then set innodb_force_recovery=1 in my.cnf, restart the database and run mysqlcheck with the --auto-repair option:

$ doas mysqlcheck --auto-repair --all-databases
mysql.column_stats                                 OK
mysql.columns_priv                                 OK
mysql.db                                           OK
mysql.event                                        OK
mysql.func                                         OK
mysql.global_priv                                  OK
mysql.gtid_slave_pos                               OK
mysql.help_category                                OK
mysql.help_keyword                                 OK
mysql.help_relation                                OK
mysql.help_topic                                   OK
mysql.index_stats                                  OK
mysql.innodb_index_stats                           OK
mysql.innodb_table_stats                           OK
mysql.plugin                                       OK
mysql.proc                                         OK
mysql.procs_priv                                   OK
mysql.proxies_priv                                 OK
mysql.roles_mapping                                OK
mysql.servers                                      OK
mysql.table_stats                                  OK
mysql.tables_priv                                  OK
mysql.time_zone                                    OK
mysql.time_zone_leap_second                        OK
mysql.time_zone_name                               OK
mysql.time_zone_transition                         OK
mysql.time_zone_transition_type                    OK
mysql.transaction_registry                         OK
sys.sys_config                                     OK

This looks fine, but the database table is now completely gone:

[foo]> SELECT COUNT(*) FROM foo;
ERROR 1146 (42S02): Table 'foo.foo' doesn't exist

This happened with MariaDB every single time in all the tests on UFS with soft updates on all machines.

UFS Conclusions

Of all the filesystems that I have tested, UFS with soft updates has been the worst. With PHP fopen, often no files, or only a few, were written to disk, but at least no written file was ever missing any data.

SQLite and MariaDB did not fare well on UFS with soft updates whereas PostgreSQL did not have a single issue even once.

I might just have been "lucky" with PostgreSQL and could possibly have run into errors if I had kept testing long enough. However, it was extremely easy to get errors with SQLite, and MariaDB didn't work at all, so I doubt extra tests would have made a big difference.

NOTE: If you run FreeBSD with UFS I highly recommend that you enable the GEOM journal!

I would like to make a personal note about the GEOM journal. As mentioned in the documentation, in order to ensure that data is stored on the data provider, the gjournal sync command should be used after calling sync. In my humble opinion, this is not the right design. It is non-standard functionality very specific to GEOM. At least it is documented, but the FreeBSD sync syscall should check whether UFS is running on top of a GEOM journal and, if so, trigger the journal switch itself rather than leaving that to the application in the layer above. But that's just me.
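
To make that extra step concrete, here is a sketch of my own (not something the test scripts did) of what a durable write on gjournal-backed UFS ends up looking like from PHP; it assumes PHP 8.1+ for fsync() and sufficient privileges to run gjournal:

<?php
// Sketch only: on UFS with gjournal, fsync() alone is not enough to get the
// data onto the data provider; a journal switch has to be forced as well.
// "gjournal sync" is the documented command for that and needs root
// privileges, so in practice this would live in a small privileged helper or
// a periodic job rather than in the application itself.
$data = "example payload\n";

$fp = fopen('/test/php-files/file-1.txt', 'w');
fwrite($fp, $data);
fflush($fp);
fsync($fp);
fclose($fp);

exec('gjournal sync 2>&1', $output, $status);
if ($status !== 0) {
    fwrite(STDERR, "gjournal sync failed: " . implode("\n", $output) . "\n");
}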

Final comments

Considering that FFS2 doesn't have a journal or any kind of checksumming, it is impressively robust! The worst problem I encountered was a single empty file in some of the writes with PHP fopen. Of course, in the event of a disaster on FFS2, you need to scan for such issues before you re-enable your application. All of the databases fared very well on FFS2; they all managed to clean everything up nicely in every single test, and I never encountered a problem with missing data. As I wrote in my FFS2 conclusion, using a database rather than writing data directly to FFS2 does indeed improve the situation.

It is important to remember, though, that running fsck on very large datasets is time consuming, especially on spinning drives, and I would not recommend this for anything with a lot of data. Also, FFS2 seemed pretty slow compared to ZFS, ext4 and XFS. I did not perform any performance tests, so this is just "how it felt" during the work I did.

I personally use OpenBSD with FFS2 in production and it is an amazing operating system. As I have mentioned, I have experienced power loss several times, but have never had any problems at all with FFS2. Still, I would not use FFS2 on a machine at risk of losing power if I had data-storing applications running; I would rather use ZFS (which is also what I do).

On ext4, XFS or ZFS I would not hesitate to store files directly in the filesystem without using a database. Not only is it rare to lose power, at least if you run your applications on dedicated servers located in a data center, but those filesystems recover amazingly well without losing any data or missing any files.

Now, don't misunderstand me, I am not saying that you cannot lose data, you can ALWAYS lose data, but the risk of losing data or getting corruption is very small and I personally do not see any reason to use a database in order to improve on that on those filesystems - just make sure you have a solid backup and that you run data recovery drills (you should always do that - no matter what).

Make sure you run your own tests, perhaps I have been exceptionally "lucky" and just didn't manage to create the right conditions during the tests I did.

In any case, when you leave out the database, you remove a lot of complexity and what you're left with is "just" the filesystem, which is something you always have to deal with anyway.

Unless you have a specific reason for using a database, you don't have to use one. You do have to consider the complexity of a parser if you need to serialize your data in some manner, but dealing with something like a JSON parser is considerably simpler than a database and its driver, especially if it's a network-based database.
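
As a trivial sketch of what that can look like (the names and paths are mine, not from the test scripts):

<?php
// Sketch: store a record as a JSON file instead of a database row, and read
// it back. No driver, no connection, no schema - just the filesystem.
$record = ['id' => 1, 'text' => 'Lorem ipsum dolor sit amet'];

$json = json_encode($record, JSON_PRETTY_PRINT | JSON_THROW_ON_ERROR);

$fp = fopen('/test/php-files/record-1.json', 'w');
fwrite($fp, $json);
fflush($fp);
fsync($fp);  // still worth doing, no matter the filesystem
fclose($fp);

$copy = json_decode(
    file_get_contents('/test/php-files/record-1.json'),
    true,
    512,
    JSON_THROW_ON_ERROR
);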

When I pull all the data I collected from these tests and from past experience, ZFS is by far the most resilient filesystem.

UFS with soft updates was the worst. It reminds me of the days when ext2 was the standard in Linux.

From my experience, and from looking at the source code, I believe there is a difference in quality between PostgreSQL and SQLite on the one hand and MariaDB on the other. PostgreSQL and SQLite are both really good at protecting data. When and if a data integrity issue happens, such as when a machine physically loses power, you're going to need as much data as can possibly be recovered (especially if some time has passed since the last backup). No days-old backup, or hours-old backup, or even minutes-old backup is going to help you if the data is very important. What you need is as much recoverable data as you can possibly get - PostgreSQL and SQLite shine here, PostgreSQL to an even higher degree in my humble opinion. Still, I love all these database projects.

I have used MariaDB (and MySQL before that) in production for more than 20 years and I have surprisingly never (not even once) experienced any data loss.

Alright, I think that was it for now. Have a nice one!