Thursday, March 29, 2012

Fixing ext3/jbd barrier problems under Linux 2.6.26

I recently found that the journal code ext3 uses (jbd) has some correctness problems when mounted with barrier=1. More specifically, there are two problems:

1. It does not flush the disk when it deletes transactions from the journal (checkpointing). This means you could lose data if the disk reorders some data block writes so that they land after the deletion, and a crash happens in the window between the deletion and the data writes hitting the disk. With a properly sized journal, the probability of this happening is quite small, but it is a correctness issue nevertheless.

2. If you put the journal on an external device (like I do), ext3/jbd only flushes the journal disk and never flushes the data disk. This means that if you are using ordered journal mode (you probably are, since it is the default), you are risking the consistency of the file system: ext3 does not really enforce the ordering of the data blocks, and the disk may still be holding those writes in its cache after the metadata has been written!

I found the problems when looking at the code of Linux 2.6.26. In the latest version of Linux (Linux 3.3), problem 1 is fixed but problem 2 is still there.

In general I would encourage you to use ext4/jbd2, as it has both problems fixed and has some other nice features. But if you somehow need to stick with ext3 and an older Linux version, here is a patch to fix the problems.
(You need to fix another problem in the buffer layer too, as discussed in this thread http://kerneltrap.org/mailarchive/linux-kernel/2008/8/21/3022914, to get this really working. I have included this fix in the patch too. Note that the diff was generated with the patched tree as the first file, so the lines marked with a minus sign are the ones the fix adds; apply it in reverse, for example with patch -R.)


--- linux-2.6.26/fs/jbd/checkpoint.c    2012-03-29 21:38:31.000000000 -0500
+++ old_linux/linux-2.6.26/fs/jbd/checkpoint.c  2008-07-13 16:51:29.000000000 -0500
@@ -22,7 +22,6 @@
 #include <linux/jbd.h>
 #include <linux/errno.h>
 #include <linux/slab.h>
-#include <linux/blkdev.h>
 
 /*
  * Unlink a buffer from a transaction checkpoint list.
@@ -454,10 +453,6 @@
        journal->j_tail_sequence = first_tid;
        journal->j_tail = blocknr;
        spin_unlock(&journal->j_state_lock);
-
-       if (journal->j_flags & JFS_BARRIER)
-               blkdev_issue_flush(journal->j_fs_dev, NULL);
-
        if (!(journal->j_flags & JFS_ABORT))
                journal_update_superblock(journal, 1);
        return 0;
--- linux-2.6.26/fs/jbd/commit.c        2012-03-29 21:58:32.000000000 -0500
+++ old_linux/linux-2.6.26/fs/jbd/commit.c      2008-07-13 16:51:29.000000000 -0500
@@ -20,7 +20,6 @@
 #include <linux/slab.h>
 #include <linux/mm.h>
 #include <linux/pagemap.h>
-#include <linux/blkdev.h>
 
 /*
  * Default IO end handler for temporary BJ_IO buffer_heads.
@@ -733,11 +732,6 @@
 
        jbd_debug(3, "JBD: commit phase 6\n");
 
-       /*flush the data device before write commit record */
-       if((journal->j_fs_dev != journal->j_dev) &&
-                       (journal->j_flags & JFS_BARRIER))
-               blkdev_issue_flush(journal->j_fs_dev, NULL);
-
        if (journal_write_commit_record(journal, commit_transaction))
                err = -EIO;
 
--- linux-2.6.26/fs/buffer.c    2012-03-29 22:45:29.000000000 -0500
+++ old_linux/linux-2.6.26/fs/buffer.c  2008-07-13 16:51:29.000000000 -0500
@@ -2868,14 +2868,14 @@
        BUG_ON(!buffer_mapped(bh));
        BUG_ON(!bh->b_end_io);
 
-       if (buffer_ordered(bh) && (rw & WRITE))
-               rw |= WRITE_BARRIER;
+       if (buffer_ordered(bh) && (rw == WRITE))
+               rw = WRITE_BARRIER;
 
        /*
         * Only clear out a write error when rewriting, should this
         * include WRITE_SYNC as well?
         */
-       if (test_set_buffer_req(bh) && (rw & WRITE))
+       if (test_set_buffer_req(bh) && (rw == WRITE || rw == WRITE_BARRIER))
                clear_buffer_write_io_error(bh);
 
        /*
