2011年9月1日星期四

Do NOT block make_request function!

Today I encountered a bug in my code, which I think is interesting enough and worth documenting.

So I have two block device drivers, journal_bd and checkpoint_bd, which communicate with each other. Basically, when journal_bd serves I/O request in its journal_bd_make_request function, it will look at the type of bio it's serving; and for a special kind of bio, it will postpone the service for that bio, issue some I/O requests (actually, writes) to checkpoint_bd, wait for those writes to be completed by checkpoint_bd, and finally serve the original bio.

This seems like a straightforward functionality. Thus I had the following implementation:

journal_bd_make_request(bio * bio){
//....
if(bio is special bio){
checkpoint(); //this is a function checkpoint_bd module exports
serve_bio();
}
//...
return;
}

In checkpoint(), which is implemented inside checkpoint_bd module, we issue writes to disk by calling generic_make_request() and wait for them to complete:

checkpoint(){
for_each_write_bio_we_want_complete{
generic_make_request(write_bio);
}
wait_on_semaphore();
return;
}

checkpoint() will block until all the writes hit disk. At that point, we will raise the semaphore in an I/O completion function checkpoint_end_io(), which will enable checkpoint() to return, and journal_bd_make_request() to proceed and serve the special bio.

checkpoint_end_io(){
//check if we are done with all the writes
//if we are, raise the semaphore
}

The above scheme seems nice, except for that it didn't work...

When I tried the above implementation, I noticed that we got stuck waiting on the semaphore, which is never getting raised.

At first I thought that this a simple synchronization bug: I probably didn't initialize the semaphore correctly, or ignored some condition where I should have raised the semaphore but didn't. All of these have happened before. However, after more inspection I realized that checkpoint_end_io(), our I/O completion function, was never called! That's why it doesn't have the chance to raise the semaphore. Things are getting interesting....

After making sure that I did assign the right value to bio->bi_end_io field, I wrote a pseudo device driver which intercepts all the bios and log them. Astonished, I found that those write_bios we submitted to disk using generic_make_request() in checkpoint() were never received by the disk!

OK...So apparently kernel is silently discarding our bio requests. Maybe I didn't set the bio fields right? But then kernel should at least complain about it...And using a debugger, I confirmed that all the bio requests I submitted are legitimated ones.

So is it because of some permission problem? Anyway, checkpoint() is called by another module. Maybe kernel doesn't allow that? Or maybe module makes some implicit assumption about its execution context, which we are not ensuring when calling checkpoint()?

Ishani came at this point and suggested to check all the global variables in checkpoint_bd module are being read as its correct value when we are executing checkpoint() on behalf of journal_bd. We checked that, and it didn't offer us too much information.

At last we decided to actually step through the bio submit process to understand where all the bios we submitted actually went. We used gdb to step into the checkpoint() function, which calls into generic_make_request(write_bio). However, we found that instead of submitting the bio, kernel chained the bio in a list and immediately returns. (Line 1417 - 1422 in blk-core.c, if you are using Linux 2.6.26). The comments for this function explain it fairly clearly, and I am going to copy them here:

/*
1405  * We only want one ->make_request_fn to be active at a time, 1406  * else stack usage with stacked devices could be a problem. 1407  * So use current->bio_{list,tail} to keep a list of requests 1408  * submited by a make_request_fn function. 1409  * current->bio_tail is also used as a flag to say if 1410  * generic_make_request is currently active in this task or not. 1411  * If it is NULL, then no make_request is active.  If it is non-NULL, 1412  * then a make_request is active, and new requests should be added 1413  * at the tail 1414  */
So what happens is that kernel only allows one make_request function to be active at a time for a particular kernel thread. Since we are already inside the journal_bd_make_request function, i.e., it is active, all the bio's we submitted will be kept in a list instead of being handed to the block device. And since we then sleep inside the journal_bd_make_request function, which makes it always active, so the chained bio cannot get a chance to be served. This why those bios are not served and why we never see the I/O completion function to raise the semaphore.

Once we understand the problem, the fix is straightforward: just make journal_bd_make_request immediately return instead of blocking on the semaphore. Instead, we serve the special bio inside checkpoint_bd, once we are completed all the write_bios. And this fix works perfectly.

So moral of story: do not ever block your make_request function; this will kill the chance of serving the other bio's you submitted.


没有评论:

发表评论