Wednesday, January 30, 2013

f2fs: flash-friendly file system

by Jooyoung Hwang from Samsung

It is basically some measurements on FTL devices (optimal segment size, optimal number of concurrent write threads, etc.) and then some natural (known-technique) tweaks on LFS.


Samsung has the benefit of both building the devices and hearing the requirements directly from end users

Log-structured file system approach for flash storage (with tweaks specially designed for flash/FTL devices)

Traditional LFS has the wandering tree problem (to write one data block, you have to write 4-5 metadata blocks)

FTL address mapping methods:
1. block mapping (too large a granularity)
2. page mapping (high memory and processor cost)
3. hybrid mapping (a.k.a. log block mapping), which includes three variants:
    BAST (block-associative sector translation)
    FAST (fully associative)
    SAST (set associative)

     Basically you have log blocks (which use page mapping) and data blocks (which use whole-block mapping).
     You have to merge log blocks into data blocks periodically: full merge, partial merge (a.k.a. copy merge), switch merge.

     They did some measurements showing that the FTL works best with >4M segment size and at most 6 concurrently written (active) log blocks (for most devices; also supported by IBM measurements)
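The three merge types can be sketched as a toy classifier. This is a hypothetical illustration, not real FTL firmware; the block representation (slot i holds the logical page number written there, or None if the slot is still erased) and `PAGES` are made up:

```python
PAGES = 4  # pages per block (assumed, for illustration)

def classify_merge(log_block):
    """Classify which merge a hybrid-mapping FTL needs to fold a log block
    back into its data block.

    log_block[i] is the logical page number written into physical slot i,
    or None if the slot is still erased.
    - switch merge:  fully written, every page in place -> just relabel the
      log block as the new data block (cheapest).
    - partial merge: a sequential prefix in place -> copy only the missing
      tail pages from the old data block.
    - full merge:    pages arrived out of order -> copy the newest version
      of every page into a fresh block (most expensive).
    """
    written = [i for i, lpn in enumerate(log_block) if lpn is not None]
    in_place = all(log_block[i] == i for i in written)
    if in_place and len(written) == PAGES:
        return "switch"
    if in_place and written == list(range(len(written))):
        return "partial"
    return "full"
```

Sequential workloads mostly trigger cheap switch merges, while random writes force full merges, which is one reason the measured FTLs favor large, sequentially written segments.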


Differences from traditional LFS:
1. alignment with FTL

2. avoid metadata update propagation by introducing an indirection layer for the indexing structure; this indirection data is kept at a fixed location
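A toy sketch of the indirection idea (f2fs calls the table a NAT, node address table; the addresses and numbers below are invented for illustration): index blocks refer to nodes by ID, and a fixed-location table maps node ID to the node's current physical address, so a log-structured rewrite of a node only updates its table entry instead of propagating up the tree.

```python
nat = {}            # node id -> physical block address (fixed-location table)
next_free = [100]   # next free physical address in the log (made-up number)

def write_node(node_id):
    """Log-structured write: the node block lands at a new physical address,
    but only its NAT entry changes -- parents keep referring to node_id,
    so the update does not wander up the indexing tree."""
    addr = next_free[0]
    next_free[0] += 1
    nat[node_id] = addr
    return addr

def resolve(node_id):
    """Follow the indirection: look up the node's current physical address."""
    return nat[node_id]
```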

3. multi-level hash table to implement directories (a known technique)


Tuesday, December 18, 2012

STATISTICAL TESTS FOR SIGNIFICANCE


Other parts of this site explain how to do the common statistical tests. Here is a guide to choosing the right test for your purposes.
Important: Your data might not be in a suitable form (e.g. percentages, proportions) for the test you need. You can overcome this by using a simple transformation. Always check this.
Test 1 (Student's t-test). Use this test for comparing the means of two samples (but see test 2 below), even if they have different numbers of replicates. For example, you might want to compare the growth (biomass, etc.) of two populations of bacteria or plants, the yield of a crop with or without fertiliser treatment, the optical density of samples taken from each of two types of solution, etc. This test is used for "measurement data" that are continuously variable (with no fixed limits), not for counts of 1, 2, 3 etc. You would need to transform percentages and proportions because these have fixed limits (0-100, or 0-1).
Test 2 (paired-samples t-test). Use this test like the t-test but in special circumstances - when you can arrange the two sets of replicate data in pairs. For example: (1) in a crop trial, use the "plus" and "minus" nitrogen crops on one farm as a pair, the "plus" and "minus" nitrogen crops on a second farm as a pair, and so on; (2) in a drug trial where a drug treatment is compared with a placebo (no treatment), one pair might be 20-year-old Caucasian males, another pair might be 30-year-old Asian females, and so on.
Test 3 (analysis of variance). Use this test if you want to compare several treatments. For example, the growth of one bacterium at different temperatures, the effects of several drugs or antibiotics, the sizes of several types of plant (or animals' teeth, etc.). You can also compare two things simultaneously - for example, the growth of 3 bacteria at different temperatures, and so on. Like the t-test, this test is used for "measurement data" that are continuously variable (with no fixed limits), not for counts of 1, 2, 3 etc. You would need to transform percentages and proportions because these have fixed limits (0-100, or 0-1).
Note: there are different forms of this test, so check which one suits your design.
Test 4 (chi-squared test). Use this test to compare counts (numbers) of things that fall into different categories. For example, the numbers of blue-eyed and brown-eyed people in a class, or the numbers of progeny (AA, Aa, aa) from a genetic crossing experiment. You can also use the test for combinations of factors (e.g. the incidence of blue/brown eyes in people with light/dark hair, or the numbers of oak and birch trees with or without a particular type of toadstool beneath them on different soil types, etc.).
Test 5 (Poisson distribution). Use this test for putting confidence limits on the mean of counts of random events, so that different count means can be compared for statistical difference. For example, numbers of bacteria counted in the different squares of a counting chamber (haemocytometer) should follow a random distribution, unless the bacteria attract one another (in which case the numbers in some squares should be abnormally high, and abnormally low in other squares) or repel one another (in which case the counts should be abnormally similar in all squares). Very few things in nature are randomly distributed, but testing the recorded data against the expectation of the Poisson distribution would show this. By using the Poisson distribution you have a powerful test for analysing whether objects/events are randomly distributed in space and time (or, conversely, whether the objects/events are clustered).
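For concreteness, here is a minimal pure-Python sketch of the arithmetic behind two of the tests above -- the two-sample t statistic (Welch's form) and the chi-squared statistic. This is a sketch only; for real analyses you would use a statistics package that also gives you degrees of freedom and p-values:

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Two-sample t statistic (Welch's form, unequal variances allowed),
    for comparing the means of two samples of measurement data."""
    va, vb = variance(a), variance(b)   # sample variances
    return (mean(a) - mean(b)) / sqrt(va / len(a) + vb / len(b))

def chi_square(observed, expected):
    """Chi-squared statistic for comparing category counts against
    expected counts (e.g. a 3:1 genetic ratio)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

For example, observed counts of [90, 10] against expected [75, 25] give a chi-squared statistic of 12.0, which would then be compared against the critical value for the appropriate degrees of freedom.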

Monday, December 3, 2012

My personal view of software defined storage

(This is the result of some recently emerging thoughts, which means my view is likely to change dramatically pretty soon; or I probably have no idea what I am talking about.)

In order to talk about what software defined storage is, we might first ask ourselves, "what is software-defined, and what is the antonym of software-defined?". Some people think software-defined basically means moving stuff out of big black-box appliances (which you buy from, say, NetApp or Epic) and having open software do whatever was done inside that box (Remzi). Some people emphasize the automatic provisioning and single-point management of storage hardware for VMs (VMware). My personal understanding of software defined storage is a bit different from the above two views.

In my opinion, software-defined is just what we computer science people have been practicing for decades and have had enormous success with: break a problem into pieces, define abstractions to factor out details, and then focus on a few details at a time and deal with them really well. Thus, in my opinion, the opposite of software-defined would be "undefined" (instead of, say, hardware-defined). What undefined means is basically trying to do all things at the same time, yet being unable to do any of them well.

To make my definition concrete, let me illustrate with an example. Let's say you want to solve a complex physics simulation problem, and you are given a computer with some CPU, memory and disks. You need to write a program which runs on this particular set of hardware.

How do you approach it? Of course you could manage everything by yourself: in your program logic you control every instruction running on the CPU, every bit of memory you reference, exactly how you store your data in each sector of your hard disk, and you deal with the fact that the CPU or memory or disk could fail in unexpected ways and that you are actually sharing these resources with others. And in the early days of computing, people actually did that. However, if you tried to do this now, I'd say any reasonable programmer would think you are absolutely crazy.

What is the computer science way, or "software defined" way, of approaching the same problem? You just sit back and think hard: "how can I decompose this complex problem so that I only deal with one thing at a time?" and "for each part of the problem I decomposed, what abstraction should I present so that the internal complexity of this part is hidden from the outside and I don't have to worry about it when I am dealing with the rest of the problem?" And by thinking this way, you would probably first develop an operating system which manages the CPU and memory and presents the abstraction of a process to the outside. You might then go ahead and define some higher-level languages, which hide the complexity of dealing with machine code. You will probably get a file system in place too, which presents a simple file abstraction, so that you don't have to worry about where to put your data on the disk, how to locate it later, and what will happen if the machine crashes in the middle of writing out your data.

So what have you done? You basically divided the whole problem into several parts, presented some simple abstractions, and tackled the sub-problems one by one, with their complexity hidden behind each abstraction. Now you can go ahead and deal with the actual physics simulation, which may very well be a very hard problem. But that is because it is inherently hard, not because it has been made hard by the fact that you have to think about disk failures. Better yet, this approach enables innovation: once you have a great idea about how to store data on the disk, you can just change your file system without redoing the whole huge piece of software you already have in hand.

This is it. This is my definition of software-defined: decompose the problem, define abstractions to hide complexity, and modularly solve each sub-problem in isolation. It is the opposite of the undefined, or ad-hoc, way of trying to solve all the problems at once and being unable to do it well, just because you have so much to manage and worry about. Making something software defined, to me, is to re-examine how we do things, and to ask ourselves whether we would be better off trying to break things up and doing one at a time.

So why is "software-defined" getting so much attention recently? It is because with the data center trend, multi-tenant computing at scale, and other technology advances, we have found certain things becoming hard -- so hard that we have to re-examine how we do them, and do them in a well defined, well structured way so that we can break the complexity up.

This is true for software-defined networking: with people requiring increasingly more control over the network, more and more stuff has been put in: firewalls, middleboxes, control programs which monitor network traffic and provide isolation, deduplication... So much that managing the network control plane and reasoning about network performance have become very difficult. The network community's reaction to this situation is to break it down: have a single layer which handles distributing state across the whole network (to hide the distributed-state complexity), a single layer which virtualizes the complex network into a simple network view (to hide the physical-network complexity), and a standard way for control programs to express their needs (to hide hardware-configuration complexity). And that is what they call software defined networking.

And if you look at storage management, especially storage in the data center, we are pretty much in the same situation. Applications are asking for more control over the data they store: they need availability, integrity, ordering and performance guarantees associated with the data they access. Multiple tenants and different applications are sharing the same storage infrastructure, which calls for isolation, both for performance and for security. People are managing storage in fancier ways: they need to back up, restore, and take snapshots whenever they want, and to have plug-and-play storage hardware which they can manage from a single point. And we have more and more hardware: disks, flash, RAM caches, deduplication appliances, RAID arrays, encryption hardware, archival devices, and many more. They are constantly failing and recovering, and new stuff regularly gets plugged in. All of them are attached to different machines with different configurations and capacities, and are located in different places in the data center network, which is highly dynamic by itself. In a word, managing storage is becoming incredibly hard and complex, yet we do not have a systematic way to tackle this complexity. The state-of-the-art solutions for large-scale, highly virtualized, multi-tenant storage all exhibit "undefined" behavior, in that they are doing too many things, yet doing them poorly.

GFS and its open-source variant HDFS have been widely used for data center storage. In order to write this blog post, I actually took a quick review of the GFS paper, and was amazed by how much stuff GFS is trying to do simultaneously. Just to name a few:
  • Distributed device management, including failure detection and reaction. This is very low-level stuff, direct interaction with hardware. And GFS does it in a very limited way: it only manages disks, and indirectly uses RAM through Linux's page cache mechanism. No fine-grained control, and no heterogeneous devices, e.g., flash caches, deduplication hardware, or any special-purpose storage. And this is not unique to GFS: RAMCloud (SOSP'11), along with many other storage systems, deliberately chose to use only memory, or some other single kind of storage hardware; not because other hardware, say, flash, has nothing good to offer, but because it's just too hard to manage heterogeneous distributed storage devices when you have other things to worry about.
  • Namespace management and data representation to the application. This is, by contrast, very high-level interaction, which actually requires understanding of and assumptions about what applications need. GFS made one reasonable assumption, which works for certain kinds of applications, but certainly not for all. Other applications have many different kinds of storage needs.
  • Data locality and correlated-failure-region decisions. GFS itself decides which data is closer in the network than other data, and which storage nodes are likely to fail together. It is a very naive decision procedure, though, which only considers rack locality and requires human configuration effort. It would work awkwardly in the increasingly popular full-bisection-bandwidth networks, and takes no account of the complexity and dynamism of the underlying network and power supply system. Flat Datacenter Storage (OSDI'12) takes another position and simply assumes the network is always flat and good enough -- another oversimplified assumption. There is no way these systems could make informed decisions beyond naive ones, because they know too little about the current network status and device status -- too much information for GFS to keep up with.
  • Storage management functionalities. GFS actually tries to offer certain management functionality, say, snapshots. Not too many, though, because it is not a storage configuration/management system after all.
  • Data distribution and replication.
  • Consistency model and concurrent access control. This, again, is tied closely to application semantics.
  • Data integrity maintenance. GFS tries to detect and recover from data corruption using checksums. This is certainly one way to achieve data integrity, but arguably not the best or a complete solution.
I could continue the list with hotspot reaction, (very limited) isolation attempts, and many more. But the point is that GFS is virtually trying to do everything in storage provisioning and management, from very high-level application interaction, to very low-level physical hardware management, to storage administration. This is also true of Amazon's Dynamo, Google's Megastore, and pretty much every storage solution deployed in today's data centers that I can think of. All of them have redone the things GFS attempted to do. This is exactly what I consider undefined: you have too big a task, and you are not decomposing the task carefully. Thus you end up doing everything a little bit, in an uncontrolled way, and redoing the whole thing whenever you need to change a little bit of how you do it. And this is what causes a lot of problems in today's storage stack: no single point of control and configuration, inability to make efficient use of different hardware, very little isolation guarantee, great difficulty conforming to applications' SLA requirements, and many others.

This is why now is the time for software-defined storage. Just like the network folks did, we should sit back and ask ourselves: how can we decompose the problem and what abstraction should we provide to hide the complexity. 

I would argue that the decomposition techniques software-defined networking uses could partially be applied here. We need an I/O distribution layer to manage and control the heterogeneous I/O devices which are distributed all over the network: monitoring their status and capacity, handling new devices being plugged in, responding to network status changes, and presenting a single storage pool to the upper level. It only needs to do this, and it needs to do it well. This service could be used by every storage solution running in the data center, without each system re-implementing its own version. We need an isolation layer, which handles security, performance, and failure isolation, and presents an isolated storage view to the upper-layer storage systems so that they can confidently reason about the system's performance and robustness without worrying about others' interference. And above that we need virtualized storage, which is simple enough for applications to use yet flexible enough to express their storage needs. This virtualized storage could be a file system (which, in my view, is a fantastic storage virtualization layer and presents a beautiful virtualized storage view in the form of files and directories) for applications which are happy with POSIX APIs. However, it could also be something else for applications with different storage needs. A database-like data management system, say, could probably use some extensive APIs which allow fine control over I/O behavior. A key-value store might benefit from yet another form of virtualized storage. And with all the other layers and services in place, developing another virtualized storage system shouldn't be as difficult as before.

These are, of course, very preliminary thoughts on how to divide the storage stack, and you may very well have a different view on how we should decompose this task and what the abstractions should be. But I think it is fair to say we should seriously examine this, and that this should be our first step toward software defined storage.

(I have no idea why this post ended up so lengthy. I should really learn how to express my thoughts concisely and how to cut down what I write... :( )





Thursday, November 22, 2012

What is Software Defined Networking

(This is mainly Scott Shenker's definition, but I wholeheartedly agree)

SDN is three abstractions aiming to extract simplicity from the network control plane (which is currently ad-hoc ACLs, middleboxes, DPI, and other functionality). They are the distributed state abstraction, the specification abstraction, and the configuration abstraction.

1. Distributed state abstraction -- centralized state
Network state is physically distributed over many, many switches. But that doesn't mean we always have to deal with this. This distributed state should be abstracted into a logically centralized task, where you are dealing with a global network view, i.e., some data structure, not some distributed state. Then this logically centralized task can be dealt with in whatever way you like; you could even do it in a distributed fashion for scalability when appropriate. But that is a distributed-systems problem, not a networking problem with inherently distributed state. And you are not forced to deal with network-scale complexity.


2. Specification abstraction (or network virtualization) -- simple network view
The control program should describe functionality, not how to realize it in the particular physical network. So what the control program sees should be a virtual network view which is only complex enough to express its desire, not as complex as the actual underlying physical network.
E.g., for an ACL problem, the program should only see an endpoint-to-endpoint network.


3. Configuration abstraction (or forwarding abstraction) -- hardware-oblivious forwarding specification.
The configuration abstraction should expose enough to enable flexible forwarding decisions, but it should NOT expose the hardware details. (OpenFlow comes in here, but it only partially solves the problem: it assumes switches are the unit of the forwarding abstraction, instead of, say, a fabric.)

All in all, SDN is NOT OpenFlow. SDN doesn't have to happen in a data center either. SDN is just re-examining how we manage the control plane of our network.


How to realize SDN (not that important, and you have probably seen this a dozen times...):

            control programs
    -----------------------------------------  (control program's network view, or virtualized network)
            virtualization layer
    -----------------------------------------  (centralized network view, i.e., one data structure)
      common distribution layer (network OS)
    -----------------------------------------  (physical, distributed network state)
        physical network + switches
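A tiny sketch of what the virtualization layer does for the ACL example above (all names are hypothetical): the control program writes an ACL against an endpoint-to-endpoint view, and the layer compiles it into per-switch drop rules using path knowledge the program never sees.

```python
def compile_acl(deny_pairs, paths):
    """Compile a virtual-view ACL into per-switch rules.

    deny_pairs: {(src, dst), ...} -- flows the control program wants
    blocked, expressed against the one-big-switch virtual network.
    paths: {(src, dst): [switch, ...]} -- the physical route each flow
    takes, known only to the layers below the control program.
    Returns {switch: set of (src, dst) flows that switch should drop}.
    """
    rules = {}
    for pair in deny_pairs:
        first_hop = paths[pair][0]          # drop as early as possible
        rules.setdefault(first_hop, set()).add(pair)
    return rules
```

The point of the sketch is the division of labor: the control program only ever names endpoints; everything about topology and placement lives below the abstraction boundary.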

Tuesday, November 13, 2012

Security in the Cloud

1. resource sharing among distrustful customers
    cross-VM side-channel attacks
    proof-of-concept attack: the attacker and victim share the same core; the attacker tries to wake up as frequently as possible, fills the instruction cache, lets the victim run (which uses a portion of the cache), then wakes up again and measures the access performance of the previously cached data (so that how the victim uses the cache is learned). This could enable you to learn the victim's secret key
   for a multi-core attack: force the scheduler to reschedule frequently, so you end up sharing a core with the victim a lot
   DNA sequence reassembly techniques were used to go from a partial, noisy secret key to the complete secret key

2. pricing of fine-grained resources
    performance varies with different types of CPUs, and network performance varies too, so it is not very predictable
   either predictable but low performance, or high but unpredictable performance
   the loss comes from workload contention (Xen does a good job of CPU performance isolation, but is not so great for memory, disk or network; not that there is much Xen can do anyway)
   thus the uniform abstraction fails
   and attackers have opportunities to interfere with others' workloads
 a. placement gaming:
    start multiple instances and shut down the ones which perform worse
   when seeing bad performance, just shut the VM down and launch a new one
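The placement-gaming strategy is trivially simple; a sketch (the benchmark numbers are made up, and a real tenant would also relaunch on later degradation):

```python
def placement_game(measurements):
    """Launch several instances, benchmark each, keep only the best.

    measurements: benchmark score of each launched instance, in launch
    order. Returns (index_kept, score); every other instance would be
    shut down, and the tenant pays only for the brief benchmarking window.
    """
    best = max(range(len(measurements)), key=lambda i: measurements[i])
    return best, measurements[best]
```

The attack works precisely because performance is unpredictable across placements but somewhat stable within one, so a short benchmark predicts long-run performance.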

b. resource-freeing attack:
   the attacker and victim both run Apache on the same physical machine, and both want more bandwidth
 the attacker can request a lot of dynamic pages from the victim, which is CPU-intensive, and while the victim is busy processing these requests, the bandwidth is free for the attacker to use

storage panel:

storage in the context of cloud and networking (no slides...)

intersection of storage, SDN and computing
system management paradigm important
redefines what it is to sell IT
slaf? (an open-source storage cluster stack built on commodity hardware)
software defined storage aligns with SDN: volume management, virtual network + virtual storage devices

 I have no idea what he talked about for 5 minutes.....


reserch problems in software defined storage (NetApp)

trends in datacenter storage: heterogeneity, dynamism, sharing, and ???
software defined storage: have all types/layers of storage components seamlessly communicate with each other, in a way which abstracts out details (like whether you support dedup or not)

He has a table of an SLO language (which is worth looking up later)
different stacks for different SLOs at different cost points
SLO is a core idea and needs to be standardized

storage management not in a single box, but at multiple layers
isolation for performance and for failure/security
failure handling when one storage service is composed of multiple components (Remzi covered the layered structure; could be other structures)

now storage happens a little bit in the hypervisor, the virtual machine and the application. Which layer should do what? How do we coordinate the different layers?

new storage problems due to dynamism

want: data structure storage.


Evolution of software defined storage (VMware)
go back and look at how SDS changes, as we already have software defined storage
traditionally: hardware defined storage realized by special-purpose boxes
clear boundaries between hardware/software, and a limited interface: block read/write
now the interface is changing, and the enabling factors are:
1. commoditization of those big boxes
2. because of CPU advances, hosts are more powerful and can have more intelligence, rather than, say, putting dedup inside the box
3. richer interface standards (like those offered by SCSI, say, XCOPY)
4. boxes simplifying, allowing applications to specify what they want
5. no distinction between server nodes and storage nodes anymore, just bricks (which have CPU, memory, flash and disk) -- already happening at Google, Facebook, etc.

summary: software defined storage is what happens when the distinction between server nodes and storage nodes blurs, and we have a disruptive, shared-nothing model (what does he mean by this???)

Storage Infrastructure Vision (Google)
Google needs multiple data centers
data consistency is a first-class citizen now; scalability, reliability and pricing problems
Google infrastructure goal: not just look simple to outside users, but look simple to application programmers too. Complexity managed by infrastructure operators, not application developers

Datacenter storage systems (Facebook)
challenges:
1. the software stack will be obsolete soon (more CPUs, not faster CPUs; faster storage; faster, flatter networks)
2. heterogeneous workloads -- but you could potentially have specific datacenters for specific things, so the workloads are not that different
3. heterogeneous hardware -- how to take advantage of different hardware profiles
4. dynamic applications -- need adaptive systems

opportunities:
1. multi-threaded programming is clumsy -- new parallel programming paradigms
2. flash != hard disk -- new storage engines for flash
3. high-speed network stack -- new network stack
4. dynamic systems will win big -- high throughput vs. low latency, space vs. time, data-temperature awareness


WISDOM discussion (Microsoft)
COSMOS:
   a service internal to Microsoft, used for batch processing and analytics
   challenges:
         transient outliers can pummel performance, thus hard to reason about
         any storage node services multiple types of requests
         exploiting cheap bandwidth (flat network), but storage always outpaces the network







SDS (kinda) work by Remzi's group and others

use flash as a cache (from Mike Swift's group)
flash is widely used as a cache
using the SSD's block interface is inefficient for a cache, because a cache is different from storage
new firmware in the SSD, to get rid of block mapping and use a unified address space
they also provide a consistent cache interface (which block is clean/dirty, which block has been evicted, etc.)
when doing garbage collection, they don't have to migrate data because it's a cache, and the primary copy of the data lives somewhere else
they plan to propose new interfaces and virtual SSDs (so it could be more software defined???)
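A rough sketch of what such a cache-aware interface might look like (class and method names are my own invention, not the group's actual API): the device tracks clean/dirty state per block and may silently drop clean blocks, so garbage collection discards instead of migrating.

```python
class CacheSSD:
    """Toy model of an SSD whose firmware knows it is a cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = {}              # lba -> "clean" | "dirty"

    def write_cache(self, lba, dirty=False):
        """Insert a block, reclaiming space first if the device is full."""
        if len(self.blocks) >= self.capacity:
            self.gc()
        self.blocks[lba] = "dirty" if dirty else "clean"

    def gc(self):
        # Unlike a storage SSD, a cache SSD need not migrate valid data:
        # clean blocks can simply be dropped, because the primary copy
        # exists upstream. (A real device would write dirty blocks back.)
        for lba in [l for l, s in self.blocks.items() if s == "clean"]:
            del self.blocks[lba]

    def exists(self, lba):
        """Cache lookup: the host must tolerate a miss after eviction."""
        return lba in self.blocks
```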

combating ordering loss in the storage stack -- the no-order FS work (Vijay)
ordering is not respected at many layers
don't use ordering to ensure consistency
1. coerced cache eviction -- additional writes to flush the cache to ensure ordering
2. backpointer-based consistency -- verify backpointers when following pointers in the file system
3. (ongoing) inconsistency in virtualized storage settings
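The backpointer idea in miniature (a toy model, not the actual NoFS implementation): every data block carries a pointer back to its owner, and the file system checks forward pointers against backpointers on access instead of relying on write ordering.

```python
blocks = {}   # physical addr -> (owner_inode, data); backpointer is stored
              # atomically with the data itself

def write_block(addr, inode, data):
    """Write data together with its backpointer to the owning inode."""
    blocks[addr] = (inode, data)

def read_checked(addr, inode):
    """Follow a forward pointer, validating it against the backpointer.
    A mismatch means the forward pointer is stale (e.g. a crash left an
    inode pointing at a block it never came to own)."""
    owner, data = blocks[addr]
    if owner != inode:
        raise IOError("stale pointer: block %d belongs to inode %d"
                      % (addr, owner))
    return data
```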

de-virtualization for flash-based SSDs (Yiying) -- the nameless writes in SSDs
take off too many layers of block mapping: file system - FTL - physical block
store physical block numbers directly in the file system (when the file system writes a block, no position is specified; the disk decides where to put the block and informs the file system where the block got written)

in a virtualized environment: file system de-virtualizer (ongoing, I think)
the file system performs normal writes, but the fsdv does nameless writes and stores the physical mapping
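The nameless-write interface in miniature (a toy model; the real interface sits at the block layer, and the device must also be able to call up to the file system when it later moves blocks):

```python
class NamelessDevice:
    """Toy device implementing nameless writes: the host supplies data
    with no target address; the device chooses placement and returns the
    physical address, removing one level of block remapping."""

    def __init__(self):
        self.media = {}
        self.next_addr = 0

    def nameless_write(self, data):
        addr = self.next_addr          # the device, not the FS, picks this
        self.next_addr += 1
        self.media[addr] = data
        return addr                    # FS records this physical address

# The file system then stores the returned physical addresses directly
# in its metadata (e.g. an inode's block pointers):
dev = NamelessDevice()
inode_ptrs = [dev.nameless_write(b"hello"), dev.nameless_write(b"world")]
```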


zettabyte reliability (Yupu)
add checksums at the file system/memory boundaries
also an analytical framework to reason about reliability -- the sum-of-probabilities model
future: data protection as a service? without modifications to the OS? (I am not quite sure how to realize this...)
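The boundary-checksum idea in miniature (a sketch using CRC32; the actual work's checksum choice and placement may differ): data gets a checksum when it crosses into memory and is verified when it crosses back out, so a bit flipped while resident is detected instead of being silently written back to disk.

```python
import zlib

def protect(data):
    """Attach a checksum as data crosses the FS/memory boundary inward."""
    return (data, zlib.crc32(data))

def verify(protected):
    """Verify the checksum as data crosses back outward; a mismatch means
    the data was corrupted while resident in memory."""
    data, crc = protected
    if zlib.crc32(data) != crc:
        raise ValueError("memory corruption detected")
    return data
```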

low-latency storage-class memory
data-centric workloads are sensitive to storage latency
storage-class memory (phase-change memory, STT-RAM, etc.): persistent, low latency, byte addressable
Mnemosyne: persistent regions + consistency
                     persistence: so that data structures don't get corrupted
                     consistency: update data in a crash-safe way


Hardening HDFS (Thanh Do)
a software defined way to ensure system reliability
1. selective 2-version programming
2. encode file system state using Bloom filters
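A toy Bloom filter of the kind that could encode such state compactly (a generic sketch, not the actual encoding from the talk): membership queries may yield false positives but never false negatives, which is the trade-off the compact encoding buys.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bits, k hash functions derived from SHA-256."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _hashes(self, item):
        # Derive k indices by salting the hash with the function number.
        for i in range(self.k):
            h = hashlib.sha256(b"%d:%s" % (i, item)).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for idx in self._hashes(item):
            self.bits[idx] = 1

    def __contains__(self, item):
        # All k bits set -> "probably present"; any bit clear -> definitely
        # absent (no false negatives).
        return all(self.bits[idx] for idx in self._hashes(item))
```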