Thursday, March 29, 2012

Header file for ioprio

The Linux kernel lets you set the priority of the I/O requests issued by a process. The corresponding system calls are ioprio_get and ioprio_set. However, libc does not yet provide a header file for using this functionality from user space.

The following is the header file I use. It simply wraps the system calls and copies a few macro definitions from the kernel source.

Name the file ioprio.h, copy it into /usr/include, and start using ioprio :)



#include <unistd.h>

extern int sys_ioprio_set(int, int, int);
extern int sys_ioprio_get(int, int);

#if defined(__i386__)
#define __NR_ioprio_set         289
#define __NR_ioprio_get         290
#elif defined(__ppc__) || defined(__powerpc__)
#define __NR_ioprio_set         273
#define __NR_ioprio_get         274
#elif defined(__x86_64__)
#define __NR_ioprio_set         251
#define __NR_ioprio_get         252
#elif defined(__ia64__)
#define __NR_ioprio_set         1274
#define __NR_ioprio_get         1275
#else
#error "Unsupported arch"
#endif

static inline int ioprio_set(int which, int who, int ioprio)
{
        return syscall(__NR_ioprio_set, which, who, ioprio);
}


static inline int ioprio_get(int which, int who)
{
        return syscall(__NR_ioprio_get, which, who);
}


enum { 
        IOPRIO_CLASS_NONE, 
        IOPRIO_CLASS_RT, 
        IOPRIO_CLASS_BE, 
        IOPRIO_CLASS_IDLE, 
};

enum {
        IOPRIO_WHO_PROCESS = 1,
        IOPRIO_WHO_PGRP,
        IOPRIO_WHO_USER,
};

#define IOPRIO_BITS             (16)
#define IOPRIO_CLASS_SHIFT      (13)
#define IOPRIO_PRIO_MASK        ((1UL << IOPRIO_CLASS_SHIFT) - 1)

#define IOPRIO_PRIO_CLASS(mask) ((mask) >> IOPRIO_CLASS_SHIFT)
#define IOPRIO_PRIO_DATA(mask)  ((mask) & IOPRIO_PRIO_MASK)
#define IOPRIO_PRIO_VALUE(class, data)  (((class) << IOPRIO_CLASS_SHIFT) | (data))

#define ioprio_valid(mask)      (IOPRIO_PRIO_CLASS((mask)) != IOPRIO_CLASS_NONE)
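
As a usage sketch, here is how the wrappers and macros above might be used to put the calling process into the best-effort class at a given level. To keep the snippet standalone it does not include ioprio.h; it assumes your libc's <sys/syscall.h> defines SYS_ioprio_set and SYS_ioprio_get, and it repeats the few constants it needs. The helper names set_self_ioprio and get_self_ioprio are made up for this example.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Same values as in the header above. */
#define IOPRIO_CLASS_SHIFT      (13)
#define IOPRIO_PRIO_MASK        ((1UL << IOPRIO_CLASS_SHIFT) - 1)
#define IOPRIO_PRIO_CLASS(mask) ((mask) >> IOPRIO_CLASS_SHIFT)
#define IOPRIO_PRIO_DATA(mask)  ((mask) & IOPRIO_PRIO_MASK)
#define IOPRIO_PRIO_VALUE(class, data)  (((class) << IOPRIO_CLASS_SHIFT) | (data))

enum { IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE };
enum { IOPRIO_WHO_PROCESS = 1, IOPRIO_WHO_PGRP, IOPRIO_WHO_USER };

/* Set the I/O priority of the calling process (who == 0 means "myself").
 * Returns 0 on success, -1 with errno set on failure. */
static inline int set_self_ioprio(int ioprio_class, int level)
{
        return (int)syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                            IOPRIO_PRIO_VALUE(ioprio_class, level));
}

/* Read back the encoded priority of the calling process, or -1 on error.
 * Decode it with IOPRIO_PRIO_CLASS() and IOPRIO_PRIO_DATA(). */
static inline int get_self_ioprio(void)
{
        return (int)syscall(SYS_ioprio_get, IOPRIO_WHO_PROCESS, 0);
}
```

For example, set_self_ioprio(IOPRIO_CLASS_BE, 4) asks for best-effort level 4, which is encoded as (2 << 13) | 4 = 16388. Check the return value, since the call can fail (for instance when an unprivileged process asks for the real-time class).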

Fixing the ext3/jbd barrier problems under Linux 2.6.26

I recently found that the journaling code used by ext3 (jbd) has some correctness problems when the file system is mounted with barrier=1. More specifically, there are two problems:

1. It does not flush the disk when deleting transactions from the journal. This means you could lose data if the disk reorders some data block writes to after the deletion and a crash occurs in the window between the deletion and the data writes reaching the disk. With a properly sized journal the probability of this happening is quite small, but it is nevertheless a correctness issue.

2. If you put the journal on an external device (like I do), ext3/jbd only flushes the journal disk and never flushes the data disk. This means that if you are using ordered journal mode (you probably are, since it is the default), you are risking the consistency of the file system: ext3 does not really enforce the ordering of the data blocks, and the disk may still hold those writes in its cache after the metadata has been written!
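
The ordering requirement behind problem 2 can be illustrated with a user-space analogy. This is only a sketch, not jbd code: two ordinary file descriptors stand in for the data device and the external journal device, fsync() stands in for the cache flush that the kernel issues with blkdev_issue_flush(), and the names write_all and ordered_commit are hypothetical. The only point is the order of the steps: the data must be durable before the commit record is written.

```c
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write len bytes, retrying on short writes.
 * Returns 0 on success, -1 on error. */
static int write_all(int fd, const char *buf, size_t len)
{
        while (len > 0) {
                ssize_t n = write(fd, buf, len);
                if (n <= 0)
                        return -1;
                buf += n;
                len -= (size_t)n;
        }
        return 0;
}

/* Commit in "ordered" fashion across two separate files:
 * data first, then the commit record. */
static int ordered_commit(int data_fd, int journal_fd,
                          const char *data, const char *commit_record)
{
        if (write_all(data_fd, data, strlen(data)) < 0)
                return -1;

        /* Flush the data "device" before touching the journal.
         * Skipping this step is the analogue of problem 2. */
        if (fsync(data_fd) < 0)
                return -1;

        if (write_all(journal_fd, commit_record, strlen(commit_record)) < 0)
                return -1;

        /* Flush the journal "device" so the commit record is durable. */
        return fsync(journal_fd);
}
```

Without the first fsync(), the commit record could become durable while the data it describes is still sitting in a cache, which is the inconsistency described above.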

I found the problems while reading the code of Linux 2.6.26. In the latest version of Linux (Linux 3.3), problem 1 is fixed but problem 2 is still there.

In general I would encourage you to use ext4/jbd2, as it has both problems fixed and has some other nice features. But if you somehow need to stick with ext3 and an older Linux version, here is a patch to fix the problems:
(To get this really working, you also need to fix another problem in the buffer layer, as discussed in this thread: http://kerneltrap.org/mailarchive/linux-kernel/2008/8/21/3022914. I have included that fix in the patch too.)
Note that the diff was generated with the patched tree first and the original tree (old_linux) second, so the lines marked with - are the ones the fix adds. Apply it in reverse, for example with patch -R.


--- linux-2.6.26/fs/jbd/checkpoint.c    2012-03-29 21:38:31.000000000 -0500
+++ old_linux/linux-2.6.26/fs/jbd/checkpoint.c  2008-07-13 16:51:29.000000000 -0500
@@ -22,7 +22,6 @@
 #include <linux/jbd.h>
 #include <linux/errno.h>
 #include <linux/slab.h>
-#include <linux/blkdev.h>
 
 /*
  * Unlink a buffer from a transaction checkpoint list.
@@ -454,10 +453,6 @@
        journal->j_tail_sequence = first_tid;
        journal->j_tail = blocknr;
        spin_unlock(&journal->j_state_lock);
-
-       if (journal->j_flags & JFS_BARRIER)
-               blkdev_issue_flush(journal->j_fs_dev, NULL);
-
        if (!(journal->j_flags & JFS_ABORT))
                journal_update_superblock(journal, 1);
        return 0;
--- linux-2.6.26/fs/jbd/commit.c        2012-03-29 21:58:32.000000000 -0500
+++ old_linux/linux-2.6.26/fs/jbd/commit.c      2008-07-13 16:51:29.000000000 -0500
@@ -20,7 +20,6 @@
 #include <linux/slab.h>
 #include <linux/mm.h>
 #include <linux/pagemap.h>
-#include <linux/blkdev.h>
 
 /*
  * Default IO end handler for temporary BJ_IO buffer_heads.
@@ -733,11 +732,6 @@
 
        jbd_debug(3, "JBD: commit phase 6\n");
 
-       /*flush the data device before write commit record */
-       if((journal->j_fs_dev != journal->j_dev) &&
-                       (journal->j_flags & JFS_BARRIER))
-               blkdev_issue_flush(journal->j_fs_dev, NULL);
-
        if (journal_write_commit_record(journal, commit_transaction))
                err = -EIO;
 
--- linux-2.6.26/fs/buffer.c    2012-03-29 22:45:29.000000000 -0500
+++ old_linux/linux-2.6.26/fs/buffer.c  2008-07-13 16:51:29.000000000 -0500
@@ -2868,14 +2868,14 @@
        BUG_ON(!buffer_mapped(bh));
        BUG_ON(!bh->b_end_io);
 
-       if (buffer_ordered(bh) && (rw & WRITE))
-               rw |= WRITE_BARRIER;
+       if (buffer_ordered(bh) && (rw == WRITE))
+               rw = WRITE_BARRIER;
 
        /*
         * Only clear out a write error when rewriting, should this
         * include WRITE_SYNC as well?
         */
-       if (test_set_buffer_req(bh) && (rw & WRITE))
+       if (test_set_buffer_req(bh) && (rw == WRITE || rw == WRITE_BARRIER))
                clear_buffer_write_io_error(bh);
 
        /*