Tuesday, December 18, 2012

STATISTICAL TESTS FOR SIGNIFICANCE


Other parts of this site explain how to do the common statistical tests. Here is a guide to choosing the right test for your purposes. When you have found it, click on "More information?" to confirm that the test is suitable. If you know it is suitable, click on "Go for it!"
Important: Your data might not be in a suitable form (e.g. percentages, proportions) for the test you need. You can overcome this by using a simple transformation. Always check this - click HERE.
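For readers working in Python, a minimal sketch of one common transformation for percentage or proportion data (the arcsine square-root transformation, with invented example values) might look like this:

```python
# Minimal sketch: arcsine square-root transformation for percentage data.
# This is one common choice; the linked page describes which transformation
# suits which kind of data. The percentages below are invented for illustration.
import math

percentages = [12.0, 45.5, 78.2, 93.0]
proportions = [p / 100 for p in percentages]
transformed = [math.degrees(math.asin(math.sqrt(p))) for p in proportions]
print(transformed)
# The transformation stretches values near 0% and 100%, so the transformed data
# behave more like normally distributed measurement data for the tests below.
```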
Test 1: Student's t-test
Use this test for comparing the means of two samples (but see test 2 below), even if they have different numbers of replicates. For example, you might want to compare the growth (biomass, etc.) of two populations of bacteria or plants, the yield of a crop with or without fertiliser treatment, the optical density of samples taken from each of two types of solution, etc. This test is used for "measurement data" that are continuously variable (with no fixed limits), not for counts of 1, 2, 3 etc. You would need to transform percentages and proportions because these have fixed limits (0-100, or 0-1).
More information?
Go for it!
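If you work in Python, a minimal sketch of this test using scipy.stats (with invented biomass values purely for illustration) might look like this:

```python
# Minimal sketch: unpaired (two-sample) t-test with SciPy.
# The biomass values below are invented purely for illustration.
from scipy import stats

biomass_fertilised = [4.2, 5.1, 4.8, 5.5, 4.9]       # crop with fertiliser
biomass_control    = [3.9, 4.1, 3.7, 4.4, 4.0, 3.8]  # without (different n is fine)

t_stat, p_value = stats.ttest_ind(biomass_fertilised, biomass_control)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# If p < 0.05 we would normally conclude the two means differ significantly.
```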
Test 2: Paired-samples t-test
Use this test like the t-test but in special circumstances - when you can arrange the two sets of replicate data in pairs. For example: (1) in a crop trial, use the "plus" and "minus" nitrogen crops on one farm as a pair, the "plus" and "minus" nitrogen crops on a second farm as a pair, and so on; (2) in a drug trial where a drug treatment is compared with a placebo (no treatment), one pair might be 20-year-old Caucasian males, another pair might be 30-year-old Asian females, and so on.
More information?
Go for it!
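A minimal Python sketch of the paired version, again using scipy.stats with invented values, might look like this:

```python
# Minimal sketch: paired-samples t-test with SciPy.
# Each position in the two lists is one pair (e.g. one farm); values are invented.
from scipy import stats

yield_plus_nitrogen  = [6.1, 5.8, 7.0, 6.4, 5.9]   # "plus nitrogen" plot on each farm
yield_minus_nitrogen = [5.2, 5.5, 6.1, 5.9, 5.4]   # "minus nitrogen" plot on the same farm

t_stat, p_value = stats.ttest_rel(yield_plus_nitrogen, yield_minus_nitrogen)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# The test works on the within-pair differences, so both lists must be the same
# length and in matching pair order.
```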
Test 3: Analysis of variance (ANOVA)
Use this test if you want to compare several treatments. For example, the growth of one bacterium at different temperatures, the effects of several drugs or antibiotics, the sizes of several types of plant (or animals' teeth, etc.). You can also compare two things simultaneously - for example, the growth of 3 bacteria at different temperatures, and so on. Like the t-test, this test is used for "measurement data" that are continuously variable (with no fixed limits), not for counts of 1, 2, 3 etc. You would need to transform percentages and proportions because these have fixed limits (0-100, or 0-1).
More information? You need this, because there are different forms of this test.
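As one example of the simplest form (one-way ANOVA), a minimal Python sketch with invented growth measurements might look like this:

```python
# Minimal sketch: one-way analysis of variance (ANOVA) with SciPy.
# Growth of one bacterium at three temperatures; the values are invented.
from scipy import stats

growth_25C = [1.2, 1.5, 1.3, 1.4]
growth_30C = [2.1, 2.4, 2.2, 2.5]
growth_37C = [1.8, 1.6, 1.9, 1.7]

f_stat, p_value = stats.f_oneway(growth_25C, growth_30C, growth_37C)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
# A small p-value says at least one treatment mean differs; a two-way design
# (e.g. several bacteria x several temperatures) needs a different form of the test.
```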
Test 4: Chi-squared test
Use this test to compare counts (numbers) of things that fall into different categories. For example, the numbers of blue-eyed and brown-eyed people in a class, or the numbers of progeny (AA, Aa, aa) from a genetic crossing experiment. You can also use the test for combinations of factors (e.g. the incidence of blue/brown eyes in people with light/dark hair, or the numbers of oak and birch trees with or without a particular type of toadstool beneath them on different soil types, etc.).
More information?
Go for it!
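A minimal Python sketch of this test on an invented 2x2 table of counts, using scipy.stats, might look like this:

```python
# Minimal sketch: chi-squared test on a 2x2 contingency table with SciPy.
# Counts of eye colour versus hair colour; the numbers are invented.
from scipy import stats

#            light hair  dark hair
observed = [[38,        14],   # blue eyes
            [11,        37]]   # brown eyes

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi-squared = {chi2:.3f}, p = {p_value:.4f}, df = {dof}")
# For goodness-of-fit against expected ratios (e.g. a 1:2:1 genetic cross),
# scipy.stats.chisquare(observed_counts, f_exp=expected_counts) is the analogue.
```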
Test 5: Poisson distribution
Use this test for putting confidence limits on the mean of counts of random events, so that different count means can be compared for statistical difference. For example, numbers of bacteria counted in the different squares of a counting chamber (haemocytometer) should follow a random distribution, unless the bacteria attract one another (in which case the numbers in some squares should be abnormally high, and abnormally low in others) or repel one another (in which case the counts should be abnormally similar in all squares). Very few things in nature are randomly distributed, but testing the recorded data against the expectation of the Poisson distribution will show whether they are. By using the Poisson distribution you have a powerful test for analysing whether objects or events are randomly distributed in space and time (or, conversely, whether they are clustered).
More information?
Go for it!
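A minimal Python sketch of the usual variance-to-mean check and approximate confidence limits, with invented haemocytometer counts, might look like this:

```python
# Minimal sketch: checking counts against the Poisson (random) expectation.
# The per-square bacterial counts below are invented for illustration.
import math
import statistics as st

counts = [4, 6, 5, 3, 7, 5, 4, 6, 5, 5, 4, 6]   # bacteria per haemocytometer square

mean = st.mean(counts)
variance = st.variance(counts)                  # sample variance
dispersion = variance / mean                    # ~1 if random; >1 clustered; <1 repelling
ci_half_width = 1.96 * math.sqrt(mean / len(counts))   # approximate 95% limits on the mean

print(f"mean = {mean:.2f}, variance/mean = {dispersion:.2f}")
print(f"approximate 95% confidence limits on the mean: "
      f"{mean - ci_half_width:.2f} to {mean + ci_half_width:.2f}")
```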

Monday, December 3, 2012

My personal view of software defined storage

(This is the result of some recently emerging thoughts, which means my view is likely to change dramatically pretty soon; or I probably have no idea what I am talking about.)

In order to talk about what software-defined storage is, we might first ask ourselves, "what is software-defined, and what is the antonym of software-defined?" Some people think software-defined basically means moving stuff out of big black-box appliances (which you buy from, say, NetApp or Epic) and having open software do whatever was done inside that box (Remzi). Some people emphasize the automatic provisioning and single-point management of storage hardware for VMs (VMware). My personal understanding of software-defined storage is a bit different from both of these views.

In my opinion, software-defined is just what we computer science people have been practicing for decades and have had enormous success with: break a problem into pieces, define abstractions to factor out details, and then focus on a few details at a time and deal with them really well. Thus, in my opinion, the opposite of software-defined would be "undefined" (instead of, say, hardware-defined). Undefined basically means doing everything at the same time and being unable to do any of it well.

To make my definition concrete, let me illustrate with an example. Let's say you want to solve a complex physics simulation problem, and you are given a computer with some CPU, memory, and disks. You need to write a program which runs on this particular set of hardware.

How do you approach it? Of course, you could manage everything by yourself: in your program logic you control every instruction running on the CPU, every memory reference, and exactly how you store your data in each sector of your hard disk, and you deal with the fact that the CPU, memory, or disk could fail in unexpected ways while you are sharing these resources with others. In the early days of computing, people actually did that. However, if you tried to do this now, I'd say any reasonable programmer would think you are absolutely crazy.

What is the computer science way, or "software-defined" way, of approaching the same problem? You sit back and think hard: "How can I decompose this complex problem so that I only deal with one thing at a time?" and "For each part of the problem I decompose, what abstraction should I present so that the internal complexity of this part is hidden from the outside and I don't have to worry about it when I am dealing with the rest of the problem?" Thinking this way, you would probably first develop an operating system which manages the CPU and memory, and presents the abstraction of a process to the outside. You might then go ahead and define some higher-level languages, which hide the complexity of dealing with machine code. You will probably get a file system in place too, which presents a simple file abstraction, so that you don't have to worry about where to put your data on the disk, how to locate it later, and what will happen if the machine crashes in the middle of writing out your data.

So what have you done? You have divided the whole problem into several parts, presented some simple abstractions, and tackled the sub-problems one by one, with their complexities hidden behind each abstraction. Now you can go ahead and deal with the actual physics simulation, which may very well be a very hard problem. But that is because it is inherently hard, not because it has been made hard by the fact that you also have to think about disk failures. Better yet, this approach enables innovation: once you have a great idea about how to store data on the disk, you can just change your file system without redoing the whole huge piece of software you already have in hand.
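As a tiny illustration of this point, here is all the application has to say when it relies on the file abstraction (the file name and content are invented for the example):

```python
# The file abstraction in action: the program says *what* to store; the file
# system decides where on disk the data lives, how to find it again, and how to
# survive a crash in the middle of the write. (File name and content are invented.)
with open("simulation_checkpoint.dat", "wb") as f:
    f.write(b"state of the physics simulation at step 42")
# No sector addresses, no free-space bookkeeping, no crash-recovery logic here.
```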

This is it. This is my definition of software-defined: decompose the problem, define abstractions to hide complexity, and modularly solve each sub-problem in isolation. It is the opposite of the undefined, ad-hoc way of trying to solve the whole problem at once and being unable to do it well, simply because there is so much to manage and worry about. Making something software-defined, to me, means re-examining how we do things and asking ourselves whether we would be better off breaking things up and doing one thing at a time.

So why is "software-defined" getting so much attention recently? It is because with the data center trend, multi-tenant computing at scale, and other technology advances, we have found that certain things have become hard -- so hard that we have to re-examine how we do them, and do them in a well-defined, well-structured way that breaks the complexity up.

This is true for software-defined networking: as people require more and more control over the network, more and more stuff has been put in: firewalls, middleboxes, control programs which monitor network traffic and provide isolation, deduplication... So much, in fact, that managing the network control plane and reasoning about network performance has become very difficult. The network community's reaction to this situation has been to break it down: a single layer which handles distributing state across the whole network (to hide the distributed-state complexity), a single layer which virtualizes the complex physical network into a simple network view (to hide the physical-network complexity), and a standard way for control programs to express their needs (to hide the hardware-configuration complexity). And that is what they call software-defined networking.

And if you look at storage management, especially storage in the data center, we are pretty much in the same situation. Applications are asking for more control over the data they store: they need availability, integrity, ordering, and performance guarantees for the data they access. Multiple tenants and different applications share the same storage infrastructure, which calls for isolation, both for performance and for security. People are managing storage in fancier ways: they need to back up, restore, and take snapshots whenever they want, and to have plug-and-play storage hardware which they can manage from a single point. And we have more and more hardware: disks, flash, RAM caches, deduplication appliances, RAID arrays, encryption hardware, archive devices, and many more. They are constantly failing and recovering, and new devices regularly get plugged in. All of them are attached to different machines with different configurations and capacities, and sit in different locations in the data center's network, which is highly dynamic by itself. In a word, managing storage is becoming incredibly hard and complex, yet we do not have a systematic way to tackle this complexity. The state-of-the-art solutions for large-scale, highly virtualized, multi-tenant storage all exhibit "undefined" behavior, in that they are doing too many things, yet doing them poorly.

GFS and its open-source variant HDFS have been widely used for data center storage. In order to write this post, I took another quick look at the GFS paper, and was amazed by how much stuff GFS is trying to do simultaneously. Just to name a few:
  • Distributed device management, including failure detection and reaction. This is very low-level stuff, direct interaction with hardware. And GFS does it in a very limited way: it only manages disks, and indirectly uses RAM through Linux's page cache mechanism. No fine-grained control, and no heterogeneous devices, e.g., flash caches, deduplication hardware, or any special-purpose storage. And this is not unique to GFS: RAMCloud (SOSP'11), along with many other storage systems, deliberately chooses to use only memory, or some other single kind of storage hardware -- not because other hardware, say flash, has nothing good to offer, but because it's just too hard to manage heterogeneous distributed storage devices when you have other things to worry about.
  • Namespace management and data representation to the application. This is, in contrast, a very high-level interaction, which requires understanding of, and assumptions about, what the application needs. GFS makes one reasonable assumption, which works for certain kinds of applications but certainly not for all; other applications have many different kinds of storage needs.
  • Data locality and correlated-failure-region decisions. GFS itself decides which data are closer to each other in the network, and which storage nodes are likely to fail together. It makes very naive decisions, though, considering only rack locality and requiring human configuration effort. It would work awkwardly in the increasingly popular full-bisection-bandwidth networks, and it takes no account of the complexity and dynamism of the underlying network and power-supply system. Flat Datacenter Storage (OSDI'12) takes the other position and simply assumes the network is always flat and good enough -- another oversimplified assumption. There is no way these systems could make anything but naive decisions, because they know too little about the current network and device status -- too much information for GFS to keep up with.
  • Storage management functionalities. GFS actually tries to offer certain management functionality, such as snapshots. Not much, though, because it is not a storage configuration/management system after all.
  • Data distribution and replication.
  • Consistency model and concurrent-access control. This, again, is tied closely to application semantics.
  • Data integrity maintenance. GFS tries to detect and recover from data corruption using checksums. This is certainly one way to achieve data integrity, but arguably not the best or most complete one.
I could continue the list with hotspot reaction, (very limited) isolation attempts, and many more. But the point is that GFS is trying to do virtually everything in storage provisioning and management, from very high-level application interaction, down to very low-level physical hardware management, plus storage administration on the side. The same is true of Amazon's Dynamo, Google's MegaStore, and pretty much every storage solution deployed in today's data centers that I can think of; all of them have redone the things GFS attempted. This is exactly what I consider undefined: you have too big a task, and you are not decomposing it carefully. So you end up doing a little bit of everything in an uncontrolled way, and you have to redo the whole thing whenever you want to change a small part of how you do it. And this is what causes a lot of the problems in today's storage stack: no single point of control and configuration, inability to make efficient use of different hardware, very weak isolation guarantees, great difficulty conforming to applications' SLA requirements, and many others.

This is why now is the time for software-defined storage. Just like the network folks did, we should sit back and ask ourselves: how can we decompose the problem, and what abstractions should we provide to hide the complexity?

I would argue that the decomposition techniques software-defined networking uses could partially be applied here. We need an I/O distribution layer to manage and control the heterogeneous I/O devices distributed all over the network: monitoring their status and capacity, handling new devices being plugged in, responding to network status changes, and presenting a single storage pool to the layers above. It only needs to do this, and it needs to do it well. This service could be used by every storage solution running in the data center, without each system re-implementing its own version. We need an isolation layer, which handles security, performance, and failure isolation, and presents an isolated storage view to the upper-layer storage systems so that they can confidently reason about their performance and robustness without worrying about interference from others. And above that we need virtualized storage, which is simple enough for applications to use yet flexible enough to express their storage needs. This virtualized storage could be a file system (which, in my view, is a fantastic storage virtualization layer and presents a beautiful virtualized storage view in the form of files and directories) for applications that are happy with POSIX APIs. However, it could also be something else for applications with different storage needs. A database-like data management system, say, could probably use more extensive APIs which allow fine control over I/O behavior. A key-value store might benefit from yet another form of virtualized storage. And with all the other layers and services in place, developing another virtualized storage system shouldn't be as difficult as it is today.
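To make this division of responsibilities a little more concrete, here is a rough Python sketch of what the layer boundaries might look like. Every name in it (IODistributionLayer, IsolationLayer, VirtualizedStorage, and their methods) is hypothetical, invented only to illustrate the decomposition; none of it is an existing system or API.

```python
# A rough, hypothetical sketch of the three-layer decomposition described above.
# None of these classes or methods exist anywhere; they only illustrate which
# responsibility each layer would own and hide from the layer above it.
from abc import ABC, abstractmethod


class IODistributionLayer(ABC):
    """Manages heterogeneous, distributed I/O devices and exposes one storage pool."""

    @abstractmethod
    def register_device(self, device_id: str, capacity_bytes: int) -> None:
        """Handle a new disk/flash/archive device plugged in somewhere in the network."""

    @abstractmethod
    def read(self, block_id: str) -> bytes:
        """Read from the pool; which physical device serves the read is hidden here."""

    @abstractmethod
    def write(self, block_id: str, data: bytes) -> None:
        """Write to the pool; placement, replication, and device failures are handled here."""


class IsolationLayer(ABC):
    """Carves the shared pool into isolated views, one per tenant or storage system."""

    @abstractmethod
    def create_tenant_view(self, tenant_id: str, capacity_bytes: int,
                           iops_limit: int) -> IODistributionLayer:
        """Return a view of the pool with performance, security, and failure isolation."""


class VirtualizedStorage(ABC):
    """Application-facing abstraction built on an isolated view: a file system,
    a key-value store, a database storage engine, and so on."""

    @abstractmethod
    def put(self, key: str, value: bytes) -> None:
        """Store application data; how it maps onto the pool is hidden below."""

    @abstractmethod
    def get(self, key: str) -> bytes:
        """Retrieve application data by key."""
```

The point of the sketch is only that each layer owns one concern: device management stays below the pool abstraction, isolation stays below the per-tenant view, and applications only ever see whichever virtualized interface suits them.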

These are, of course, very preliminary thoughts on how to divide the storage stack, and you may very well have a different view on how we should decompose the task and what the abstractions should be. But I think it is fair to say that we should seriously examine this, and that it should be our first step toward software-defined storage.

(I have no idea why this post ended up so lengthy. I should really learn how to express my thoughts concisely and how to cut down what I wrote... :( )