Sunday, September 25, 2011

How to run customized ext4 under Linux 2.6.26

There are two difficulties:
1. How to build your own ext4 and jbd2 modules without rebuilding the whole kernel.
2. How to enable ext4 support on the old Linux 2.6.26 kernel.

My platform is CentOS.

Solution:

0. Make a copy of your Linux kernel source tree. Modify the ext4 and jbd2 source code as needed.

1. Modify the .config file in your Linux kernel source directory. Delete all the existing xxx=m lines and add the following lines:
CONFIG_CRC16=m        # needed by ext4
CONFIG_EXT4DEV_FS=m
CONFIG_JBD=m
CONFIG_JBD2=m

2. do "make modules" under this directory. This will produce the needed kernel modules.

3. Check the /etc/modprobe.conf, /etc/modules.conf, and /etc/modprobe.d/blacklist files, and make sure the ext4 and jbd2 modules are not loaded automatically. This prevents the kernel from loading some other version of these modules from a default location.
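For example, a couple of lines like the following in /etc/modprobe.d/blacklist should keep the stock modules from being pulled in automatically (the exact file name and layout vary by distribution, so treat this as an illustration):

# illustrative blacklist entries; adjust to your distribution
blacklist ext4dev
blacklist jbd2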

4. Reboot. Then run insmod on lib/crc16.ko, fs/jbd2/jbd2.ko, and fs/ext4/ext4dev.ko; these are the modules you modified. Note that crc16.ko is needed by ext4dev.

5. yum install e4fsprogs. This gives you the necessary userspace utilities for ext4.

6. Choose a disk partition and make the filesystem on it: mkfs -t ext4dev /dev/sdc

7. Tune the filesystem to mark it as a test filesystem (otherwise the kernel will refuse to mount it): tune4fs -E test_fs /dev/sdc

8. mount -t ext4dev /dev/sdc /dir/you/want/to/mount. Note that you may need to mount with some special flags (e.g., with extent support disabled), since ext4 under 2.6.26 has some bugs.

Done.



Wednesday, September 21, 2011

Efficient Locking for Concurrent Operations on B-Trees

Lehman and Yao, 1981

Summary:

This paper proposed a new solution for supporting concurrent access to B-trees that reside on secondary storage. Basically, a link pointer is added to the B-tree data structure to chain all the pages at the same level into a linked list. The intent is that if an insertion splits a node into two, other processes will still see a "logically single node" by following the link pointer, so the consistency of the tree is not lost.

Search proceeds as usual, without grabbing any locks. However, if the search key exceeds the highest value in a node (as indicated by the high key), the tree structure has changed underneath us, and we should follow the link pointer to reach the node produced by the split. Insertion is divided into two phases: a search phase to locate where to insert, which is identical to a normal search, and the actual insertion phase, where we lock a node, split it if necessary, and add the appropriate link pointers. If a split happens, an insertion into the parent node is also needed, and it works the same way as at the leaf level. A stack of the rightmost node examined at each level during the descent is maintained, to handle the case where we have to backtrack up the tree. A sketch of the "move right" search step is shown below.
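To make the "move right" idea concrete, here is a minimal sketch of the search step in C. The node layout and the helpers (read_page, scan_node_for_child) are invented for illustration and are not the paper's actual pseudocode:

struct blink_node {
    int  level;       /* 0 for leaf nodes */
    int  high_key;    /* upper bound of the keys this node covers */
    long link;        /* page number of the right sibling, or -1 if none */
    /* ... separator keys and child/record pointers ... */
};

/* Assumed helpers, not from the paper: */
struct blink_node *read_page(long page);                  /* fetch a page; no lock taken */
long scan_node_for_child(struct blink_node *n, int key);  /* usual B-tree descent step */

/* Return the leaf page that should contain key, moving right past splits. */
long blink_search(long root_page, int key)
{
    long page = root_page;
    struct blink_node *node = read_page(page);

    while (1) {
        /* If the key exceeds the high key, a concurrent split has moved
         * part of this node's key range to the right sibling: chase the
         * link pointer until we reach the node that covers the key. */
        if (key > node->high_key && node->link != -1) {
            page = node->link;
            node = read_page(page);
            continue;
        }
        if (node->level == 0)
            return page;
        page = scan_node_for_child(node, key);
        node = read_page(page);
    }
}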

An informal proof of the correctness of this solution is given: it ensures that each process sees a consistent state of the B-tree, and it prevents deadlock. Livelock, on the other hand, is possible if a process keeps following link pointers that other processes create. This solution is highly concurrent because an insertion process holds at most three locks simultaneously, and a search process doesn't need to grab any locks (not even shared locks).

The whole model they developed is based on the assumption that the entire B-tree sits on disk and every process has its own private memory cache. However, I don't think this is very practical. A more common case would be that the first and second levels of the B-tree sit in main memory and are accessed by all processes. How to correctly access that kind of B-tree remains a problem.

A Critique of ANSI SQL Isolation Levels

From Microsoft, 1995


Summary


This paper discussed the definitions of the ANSI SQL-92 isolation levels, and pointed out their ambiguity and their failure to preclude some anomalies. The authors proposed a new set of definitions that maps naturally to the standard locking implementation of each level, and also discussed some additional isolation levels, such as snapshot isolation.


The authors first introduced the Dirty Read, Non-repeatable Read, and Phantom phenomena, and pointed out that, due to the ambiguity of their English statements, there is a broad and a strict interpretation of each phenomenon. For example, the strict interpretation of Dirty Read only rules out histories in which T1 writes x, T2 reads it, and then T1 aborts while T2 commits; the broad interpretation rules out any history in which T2 reads x after T1's uncommitted write. They stated each interpretation in terms of execution histories and defined the different ANSI SQL isolation levels by which phenomena they preclude.


The authors then defined another set of isolation levels in terms of different locking implementations. They discussed which anomalies each of these isolation levels precludes, and noticed the discrepancy with the ANSI isolation levels. They defined a new phenomenon, Dirty Write, and proposed a new set of isolation levels defined by how they preclude the four phenomena (dirty write, dirty read, non-repeatable read, and phantom) under the strict interpretation. The authors then showed that these levels are equivalent to the corresponding locking isolation levels.


The authors also discussed two additional isolation levels: Cursor Stability and Snapshot Isolation. They defined them precisely by which anomalies they preclude, and showed how they fit into the isolation level hierarchy.


Finally, a summary is given showing which anomalies each isolation level precludes under the improved definitions, and how the levels form an isolation hierarchy.

On Optimistic Methods for Concurrency Control

paper here:

http://www.google.com/url?sa=t&source=web&cd=1&ved=0CCMQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.114.3052%26rep%3Drep1%26type%3Dpdf&rct=j&q=On%20optmistic%20method%20for%20concurrency%20control&ei=EPB5Tp6rFYWJsQLW28joAw&usg=AFQjCNGrkzBJkj-S35zwDWNlA5iYM--eWg&sig2=vLC4D24pmZlETmYNr0q4Kw


Summary:

In this paper the authors analyzed the drawback of doing concurrency control with traditional locking techniques: you have to pay the locking overhead even when you don't really need it. They then proposed optimistic concurrency control: let transactions proceed as if there were no conflicts, and back them up (restart them) if a conflict is detected.


The main idea behind the optimistic approach is to divide a transaction into three phases: read, validation, and write. In the read phase, the transaction only reads database values and all updates take place on local copies; we also remember what the transaction read and wrote. In the validation phase, we use some algorithm to detect whether there was a conflict during the read phase. If there is no conflict, we proceed to the write phase and apply the updates to the global database; otherwise we back the transaction up and do the necessary clean-up.


The authors discussed in detail how to validate a transaction. They laid out three conditions under which transactions are serially equivalent (mainly that the read and write sets of concurrent transactions must not intersect in certain ways), and discussed how to assign each transaction a transaction number to facilitate validation. Two families of validation algorithms are then proposed: one requires the write phases of all transactions to be serial; the other allows concurrent write phases. A rough sketch of the serial-validation flavor is shown below.
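As a rough illustration of the serial-validation flavor, here is a small C sketch. The types and helpers (key_set, set_intersects, committed_write_set, write_phase, the global transaction counter) are invented for the example and this is not the paper's exact algorithm:

struct key_set;                       /* opaque set of object identifiers */

struct txn {
    int start_tn;                     /* highest txn number when the read phase began */
    struct key_set *read_set;         /* objects read during the read phase */
    struct key_set *write_set;        /* objects written (to local copies) */
};

/* Assumed helpers, not from the paper: */
int  set_intersects(const struct key_set *a, const struct key_set *b);
int  current_tn(void);                            /* global transaction counter */
const struct key_set *committed_write_set(int tn);
void write_phase(struct txn *t);                  /* apply local copies globally */
void assign_tn(struct txn *t);                    /* hand out the next txn number */

/* Serial validation: the whole function is assumed to run in a critical
 * section, so validation and the write phase are atomic. Returns 1 on
 * commit, 0 if the transaction must be backed up and restarted. */
int validate_and_commit(struct txn *t)
{
    int finish_tn = current_tn();

    /* Check every transaction that committed while t was in its read phase:
     * if one of them wrote something t read, t may have seen stale data. */
    for (int tn = t->start_tn + 1; tn <= finish_tn; tn++) {
        if (set_intersects(committed_write_set(tn), t->read_set))
            return 0;
    }

    write_phase(t);
    assign_tn(t);
    return 1;
}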


The authors also discussed what kinds of workloads are suitable for optimistic concurrency control: query-dominated workloads and other workloads in which conflicts happen with low probability. An example is given and analyzed (in the paper) to show that optimistic concurrency control is promising for maintaining B-tree indexes. However, no experimental results are shown.


Comments:

1. This whole optimistic CC scheme has copy-on-write semantics. Note that it could break the original physical layout, so extra caution should be taken.

2. The paper didn't quantify the overhead of the optimistic approach at all: maintaining the sets, intersecting them, making local copies (memory copies are expensive!). The authors did that intentionally, but you shouldn't let them manipulate you....

3. Optimistic concurrency control has not yet been implemented in any commercial system as far as I know, but some techniques developed in this paper, like assigning transaction numbers, are very influential.

Thursday, September 1, 2011

Do NOT block make_request function!

Today I encountered a bug in my code which I think is interesting enough to be worth documenting.

So I have two block device drivers, journal_bd and checkpoint_bd, which communicate with each other. Basically, when journal_bd serves an I/O request in its journal_bd_make_request function, it looks at the type of bio it's serving; for a special kind of bio, it postpones servicing that bio, issues some I/O requests (writes, actually) to checkpoint_bd, waits for those writes to be completed by checkpoint_bd, and finally serves the original bio.

This seems like straightforward functionality, so I started with the following implementation:

static int journal_bd_make_request(struct request_queue *q, struct bio *bio)
{
        /* .... */
        if (bio_is_special(bio)) {      /* pseudocode: the special kind of bio */
                checkpoint();           /* a function the checkpoint_bd module exports */
                serve_bio(bio);
        }
        /* ... */
        return 0;
}

In checkpoint(), which is implemented inside the checkpoint_bd module, we issue the writes to disk by calling generic_make_request() and wait for them to complete:

void checkpoint(void)
{
        /* pseudocode: submit every write bio we want completed */
        for_each_write_bio_we_want_completed(write_bio) {
                generic_make_request(write_bio);
        }
        down(&checkpoint_sem);  /* wait until checkpoint_end_io() raises the semaphore */
}

checkpoint() blocks until all the writes hit disk. At that point, we raise the semaphore in the I/O completion function checkpoint_end_io(), which allows checkpoint() to return, and journal_bd_make_request() to proceed and serve the special bio.

static void checkpoint_end_io(struct bio *bio, int error)
{
        /* check if we are done with all the writes (e.g. by decrementing an
         * atomic count of outstanding bios); if we are, raise the semaphore */
        if (atomic_dec_and_test(&writes_pending))
                up(&checkpoint_sem);
}

The above scheme seems nice, except that it didn't work...

When I tried this implementation, I noticed that we got stuck waiting on the semaphore, which never got raised.

At first I thought this was a simple synchronization bug: I probably didn't initialize the semaphore correctly, or ignored some condition where I should have raised the semaphore but didn't. All of these have happened before. However, after more inspection I realized that checkpoint_end_io(), our I/O completion function, was never called! That's why it never got the chance to raise the semaphore. Things were getting interesting....

After making sure that I assigned the right value to the bio->bi_end_io field, I wrote a pseudo device driver that intercepts all the bios and logs them. To my astonishment, I found that the write_bios we submitted to disk using generic_make_request() in checkpoint() were never received by the disk!

OK... So apparently the kernel was silently discarding our bio requests. Maybe I didn't set the bio fields right? But then the kernel should at least complain about it... Using a debugger, I confirmed that all the bio requests I submitted were legitimate.

So was it some permission problem? After all, checkpoint() is called from another module. Maybe the kernel doesn't allow that? Or maybe the module makes some implicit assumption about its execution context, which we are not ensuring when calling checkpoint()?

Ishani came by at this point and suggested checking that all the global variables in the checkpoint_bd module are read with their correct values when we execute checkpoint() on behalf of journal_bd. We checked, and it didn't give us much information.

At last we decided to actually step through the bio submission process to understand where all the bios we submitted went. We used gdb to step into the checkpoint() function, which calls into generic_make_request(write_bio). We found that instead of submitting the bio, the kernel chained the bio onto a list and immediately returned (lines 1417-1422 in blk-core.c, if you are using Linux 2.6.26). The comment above this code explains it fairly clearly, so I am going to copy it here:

/*
 * We only want one ->make_request_fn to be active at a time,
 * else stack usage with stacked devices could be a problem.
 * So use current->bio_{list,tail} to keep a list of requests
 * submited by a make_request_fn function.
 * current->bio_tail is also used as a flag to say if
 * generic_make_request is currently active in this task or not.
 * If it is NULL, then no make_request is active.  If it is non-NULL,
 * then a make_request is active, and new requests should be added
 * at the tail
 */
So what happens is that the kernel only allows one make_request function to be active at a time per kernel thread. Since we are already inside journal_bd_make_request, i.e., it is active, all the bios we submit are kept on that per-task list instead of being handed to the block device. And since we then sleep inside journal_bd_make_request, it stays active forever, so the chained bios never get a chance to be served. This is why those bios are not served, and why the I/O completion function never gets to raise the semaphore.
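In other words, the top of generic_make_request() in 2.6.26 does roughly the following (a simplified sketch of the behaviour we observed, not the verbatim kernel code):

void generic_make_request(struct bio *bio)
{
        if (current->bio_tail) {
                /* A make_request_fn is already active in this task:
                 * just append the bio to the per-task list and return.
                 * It will only be submitted after that make_request_fn
                 * returns and the outer loop below drains the list. */
                *(current->bio_tail) = bio;
                bio->bi_next = NULL;
                current->bio_tail = &bio->bi_next;
                return;
        }
        /* Otherwise mark ourselves active, submit this bio, and keep
         * draining current->bio_list until it is empty. */
        /* ... */
}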

Once we understood the problem, the fix was straightforward: make journal_bd_make_request return immediately instead of blocking on the semaphore, and serve the special bio from inside checkpoint_bd once all the write_bios have completed. This fix works perfectly.
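Here is a minimal sketch of the fixed flow, with illustrative names (bio_is_special, pending_special_bio, writes_pending, serve_bio); the real code differs in detail:

static struct bio *pending_special_bio;   /* illustrative: the postponed special bio */
static atomic_t writes_pending;           /* illustrative: outstanding checkpoint writes */

static int journal_bd_make_request(struct request_queue *q, struct bio *bio)
{
        /* ... */
        if (bio_is_special(bio)) {
                pending_special_bio = bio;   /* remember it, but don't block here */
                checkpoint_start();          /* submit the write bios and return */
                return 0;                    /* make_request returns immediately */
        }
        /* ... */
        return 0;
}

/* In checkpoint_bd: completion handler for the checkpoint writes. */
static void checkpoint_end_io(struct bio *bio, int error)
{
        if (atomic_dec_and_test(&writes_pending))
                serve_bio(pending_special_bio);  /* now serve the postponed bio */
}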

So the moral of the story: never block in your make_request function; doing so kills the chance of the other bios you submitted ever being served.