Tuesday, December 18, 2012

STATISTICAL TESTS FOR SIGNIFICANCE


Other parts of this site explain how to do the common statistical tests. Here is a guide to choosing the right test for your purposes. When you have found it, click on "more information?" to confirm that the test is suitable. If you know it is suitable, click on "go for it!"
Important: Your data might not be in a suitable form (e.g. percentages, proportions) for the test you need. You can overcome this by using a simple transformation. Always check this - click HERE.
Use this test for comparing the means of two samples (but see test 2 below), even if they have different numbers of replicates. For example, you might want to compare the growth (biomass, etc.) of two populations of bacteria or plants, the yield of a crop with or without fertiliser treatment, the optical density of samples taken from each of two types of solution, etc. This test is used for "measurement data" that are continuously variable (with no fixed limits), not for counts of 1, 2, 3 etc. You would need to transform percentages and proportions because these have fixed limits (0-100, or 0-1).
More information?
Go for it!
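A minimal sketch of how such an unpaired two-sample t-test might be run in Python (assuming scipy is available; the numbers are made up for illustration):

import numpy as np
from scipy import stats

# made-up biomass measurements; unequal numbers of replicates are fine
control    = np.array([4.2, 3.9, 4.5, 4.1, 4.4, 3.8])
fertilised = np.array([5.1, 4.8, 5.3, 4.9, 5.0])

t_stat, p_value = stats.ttest_ind(control, fertilised)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")   # p < 0.05 suggests the means differ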
Use this test like the t-test but in special circumstances - when you can arrange the two sets of replicate data in pairs. For example: (1) in a crop trial, use the "plus" and "minus" nitrogen crops on one farm as a pair, the "plus" and "minus" nitrogen crops on a second farm as a pair, and so on; (2) in a drug trial where a drug treatment is compared with a placebo (no treatment), one pair might be 20-year-old Caucasian males, another pair might be 30-year-old Asian females, and so on.
More information?
Go for it!
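A minimal sketch of the paired version in Python (scipy assumed; made-up yields, paired by farm):

import numpy as np
from scipy import stats

plus_n  = np.array([6.1, 5.8, 7.0, 6.4, 5.9])   # yield with nitrogen, farms 1-5
minus_n = np.array([5.4, 5.5, 6.2, 6.0, 5.1])   # yield without nitrogen, same farms

t_stat, p_value = stats.ttest_rel(plus_n, minus_n)   # paired t-test
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")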
Use this test if you want to compare several treatments. For example, the growth of one bacterium at different temperatures, the effects of several drugs or antibiotics, the sizes of several types of plant (or animals' teeth, etc.). You can also compare two things simultaneously - for example, the growth of 3 bacteria at different temperatures, and so on. Like the t-test, this test is used for "measurement data" that are continuously variable (with no fixed limits), not for counts of 1, 2, 3 etc. You would need to transform percentages and proportions because these have fixed limits (0-100, or 0-1).
More information? You need this, because there are different forms of this test.
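A minimal sketch of a one-way analysis of variance in Python (scipy assumed; made-up growth data at three temperatures):

import numpy as np
from scipy import stats

growth_25C = np.array([2.1, 2.4, 2.2, 2.5])
growth_30C = np.array([3.0, 3.2, 2.9, 3.1])
growth_37C = np.array([2.6, 2.8, 2.7, 2.9])

f_stat, p_value = stats.f_oneway(growth_25C, growth_30C, growth_37C)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
# A two-factor design (e.g. 3 bacteria x several temperatures) needs a
# two-way ANOVA, e.g. statsmodels' anova_lm, rather than f_oneway.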
Use this test to compare counts (numbers) of things that fall into different categories. For example, the numbers of blue-eyed and brown-eyed people in a class, or the numbers of progeny (AA, Aa, aa) from a genetic crossing experiment. You can also use the test for combinations of factors (e.g. the incidence of blue/brown eyes in people with light/dark hair, or the numbers of oak and birch trees with or without a particular type of toadstool beneath them on different soil types, etc.).
More information?
Go for it!
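A minimal sketch of a chi-squared test in Python (scipy assumed; made-up progeny counts tested against the 1:2:1 ratio expected from a monohybrid cross):

import numpy as np
from scipy import stats

observed = np.array([28, 56, 16])                        # AA, Aa, aa counts
expected = observed.sum() * np.array([0.25, 0.5, 0.25])  # 1:2:1 ratio

chi2, p_value = stats.chisquare(observed, expected)
print(f"chi-squared = {chi2:.3f}, p = {p_value:.4f}")
# For combinations of factors (e.g. eye colour vs hair colour), build a
# contingency table and use stats.chi2_contingency(table) instead.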
Use this test for putting confidence limits on the mean of counts of random events, so that different count means can be compared for statistical difference. For example, numbers of bacteria counted in the different squares of a counting chamber (haemocytometer) should follow a random distribution, unless the bacteria attract one another (in which case the numbers in some squares should be abnormally high, and abnormally low in other squares) or repel one another (in which case the counts should be abnormally similar in all squares). Very few things in nature are randomly distributed, but testing the recorded data against the expectation of the Poisson distribution would show this. By using the Poisson distribution you have a powerful test for analysing whether objects/ events are randomly distributed in space and time (or, conversely, whether the objects/ events are clustered).
More information?
Go for it!
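A minimal sketch of checking counts against the Poisson expectation in Python (scipy assumed; made-up haemocytometer counts). For a Poisson distribution the variance equals the mean, so the index of dispersion gives a quick test of randomness:

import numpy as np
from scipy import stats

counts = np.array([3, 5, 2, 4, 6, 3, 2, 5, 4, 3,
                   4, 2, 3, 5, 4, 3, 6, 2, 4, 3])   # cells per square (made up)

n = len(counts)
dispersion = (n - 1) * counts.var(ddof=1) / counts.mean()   # ~ chi-squared(n-1) if random
p_clustered = 1 - stats.chi2.cdf(dispersion, df=n - 1)
print(f"index of dispersion = {dispersion:.2f}, p = {p_clustered:.4f}")
# A very small p suggests clustering (variance >> mean); a variance/mean
# ratio near 1 is consistent with a random (Poisson) distribution.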

Monday, December 3, 2012

My personal view of software defined storage

(This is the result of some recently emerging thoughts, which means my view is likely to be changing dramatically pretty soon; or I probably have no idea what I am talking about.)

In order to talk about what software-defined storage is, we might first ask ourselves, "what is software-defined, and what is the antonym of software-defined?" Some people think software-defined basically means moving functionality out of big black-box appliances (which you buy from, say, NetApp or Epic) and having open software do whatever was done inside that box (Remzi's view). Some people emphasize automatic provisioning and single-point management of storage hardware for VMs (VMware's view). My personal understanding of software-defined storage is a bit different from both of these.

In my opinion, software-defined is just what we computer science people have been practicing for decades and have had enormous success with: break a problem into pieces, define abstractions to factor out details, and then focus on a few details at a time and handle them really well. Thus, in my opinion, the opposite of software-defined would be "undefined" (instead of, say, hardware-defined). Undefined means trying to do everything at the same time and being unable to do any of it well.

To make my definition concrete, let me illustrate with an example. Let's say you want to solve a complex physics simulation problem, and you are given a computer with some CPU, memory, and disks. You need to write a program which runs on this particular set of hardware.

How do you approach it? Of course you could manage everything yourself: in your program logic you control every instruction running on the CPU, every memory reference, exactly how your data is laid out in each sector of the hard disk, and you deal with the fact that the CPU, memory, or disk could fail in unexpected ways and that you are actually sharing these resources with others. In the early days of computing, people actually did that. However, if you tried to do this now, I'd say any reasonable programmer would think you are absolutely crazy.

What is the computer science way, or "software-defined" way, of approaching the same problem? You sit back and think hard: "how can I decompose this complex problem so that I only have to deal with one thing at a time?" and "for each part of the problem, what abstraction should I present so that its internal complexity is hidden from the outside and I don't have to worry about it while dealing with the rest of the problem?" Thinking this way, you would probably first develop an operating system which manages the CPU and memory and presents the abstraction of a process to the outside. You might then go ahead and define some higher-level languages, which hide the complexity of dealing with machine code. You would probably get a file system in place too, which presents a simple file abstraction, so that you don't have to worry about where to put your data on the disk, how to locate it later, and what will happen if the machine crashes in the middle of writing out your data.

So what have you done? You have divided the whole problem into several parts, presented some simple abstractions, and tackled the sub-problems one by one, with each one's complexity hidden behind its abstraction. Now you can go ahead and deal with the actual physics simulation, which may very well be a very hard problem. But that is because it is inherently hard, not because it has been made hard by the fact that you also have to think about disk failures. Better yet, this approach enables innovation: once you have a great idea about how to store data on the disk, you can just change your file system without redoing the whole huge piece of software you already have in hand.

That is it. This is my definition of software-defined: decompose the problem, define abstractions to hide complexity, and modularly solve each sub-problem in isolation. It is the opposite of the undefined, or ad-hoc, way of trying to solve the whole problem at once and being unable to do it well simply because you have so much to manage and worry about. Making something software-defined, to me, means re-examining how we do things and asking ourselves whether we would be better off breaking things up and doing one thing at a time.

So why is "software-defined" getting so much attention recently? Because with the data center trend, multi-tenant computing at scale, and other technology advances, we have found that certain things have become hard -- so hard that we have to re-examine how we do them, and do them in a well-defined, well-structured way so that we can break the complexity up.

This is true for software-defined networking: as people demand ever more control over the network, more and more stuff has been put in -- firewalls, middleboxes, control programs which monitor network traffic and provide isolation, deduplication... So much that managing the network control plane and reasoning about network performance has become very difficult. The network community's reaction to this situation is to break it down: a single layer which handles distributing state across the whole network (to hide the distributed-state complexity), a single layer which virtualizes the complex network into a simple network view (to hide the physical-network complexity), and a standard way for control programs to express their needs (to hide hardware-configuration complexity). And that is what they call software-defined networking.

And if you look at storage management, especially storage in the data center, we are in pretty much the same situation. Applications are asking for more control over the data they store: they need availability, integrity, ordering, and performance guarantees for the data they access. Multiple tenants and different applications share the same storage infrastructure, which calls for isolation, both for performance and for security. People are managing storage in fancier ways: they need to back up, restore, and take snapshots whenever they want, and have plug-and-play storage hardware which they can manage from a single point. And we have more and more hardware: disks, flash devices, RAM caches, deduplication appliances, RAID arrays, encryption hardware, archive devices, and many more. They are constantly failing and recovering, and new equipment regularly gets plugged in. All of it is attached to different machines with different configurations and capacities, and sits at different locations in the data center network, which is highly dynamic by itself. In short, managing storage is becoming incredibly hard and complex, yet we do not have a systematic way to tackle this complexity. The state-of-the-art solutions for large-scale, highly virtualized, multi-tenant storage all exhibit "undefined" behavior, in that they are doing too many things, yet doing them poorly.

GFS and its open-source variant HDFS have been widely used for data center storage. Before writing this post, I took another quick look at the GFS paper, and was amazed by how much GFS is trying to do simultaneously. Just to name a few things:
  • Distributed device management, including failure detection and reaction. This is very low-level work, direct interaction with hardware. And GFS does it in a very limited way: it only manages disks, and indirectly uses RAM through Linux's page-cache mechanism. There is no fine-grained control and no support for heterogeneous devices, e.g., flash caches, deduplication hardware, or any special-purpose storage. And this is not unique to GFS: RAMCloud (SOSP'11), along with many other storage systems, deliberately chooses to use only memory, or some other single kind of storage hardware -- not because other hardware, say flash, has nothing to offer, but because it is just too hard to manage heterogeneous distributed storage devices when you have other things to worry about.
  • Namespace management and data representation to the application. This is, by contrast, very high-level interaction, which requires an understanding of, and assumptions about, what applications need. GFS makes one reasonable assumption, which works for certain kinds of applications, but certainly not for all. Other applications have many different kinds of storage needs.
  • Data locality and correlated-failure-region decisions. GFS itself decides which data is closer in the network, and which storage nodes are likely to fail together. Its decisions are quite naive, though: it only considers rack locality and requires manual configuration. It would work awkwardly in the increasingly popular full-bisection-bandwidth networks, and it takes no account of the complexity and dynamism of the underlying network and power-supply systems. Flat Datacenter Storage (OSDI'12) takes the opposite position and simply assumes the network is always flat and good enough -- another oversimplified assumption. There is no way these systems could make informed decisions rather than naive ones, because they know too little about the current network and device status -- too much information for GFS to keep up with.
  • Storage management functionality. GFS does try to offer certain management features, such as snapshots. Not too many, though, because it is not a storage configuration/management system after all.
  • Data distribution and replication.
  • Consistency model and concurrent access control. This, again, is tied closely to application semantics.
  • Data integrity maintenance. GFS tries to detect and recover from data corruption using checksums. This is certainly one way to achieve data integrity, but arguably not the best or most complete solution.
I could continue the list with hotspot reaction, (very limited) isolation attempts, and more. But the point is that GFS is effectively trying to do everything in storage provisioning and management, from very high-level application interaction, down to very low-level physical hardware management, plus storage administration. This is also true of Amazon's Dynamo, Google's MegaStore, and pretty much every storage solution deployed in today's data centers that I can think of. All of them have re-implemented the things GFS attempted to do. This is exactly what I consider undefined: you have too big a task, and you are not decomposing it carefully. Thus you end up doing a little bit of everything in an uncontrolled way, and redoing the whole thing whenever you need to change one small part of how you do it. And this is what causes a lot of the problems in today's storage stack: no single point of control and configuration, inability to make efficient use of different hardware, very little isolation guarantee, great difficulty meeting applications' SLA requirements, and many others.

This is why now is the time for software-defined storage. Just like the network folks did, we should sit back and ask ourselves: how can we decompose the problem, and what abstractions should we provide to hide the complexity?

I would argue that the decomposition techniques software-defined networking uses could partially be applied here. We need an I/O distribution layer that manages and controls the heterogeneous I/O devices distributed all over the network: monitoring their status and capacity, handling new devices being plugged in, responding to changes in network conditions, and presenting a single storage pool to the layer above. It only needs to do this, and it needs to do it well. This service could be used by every storage solution running in the data center, without each system re-implementing its own version. We need an isolation layer, which handles security, performance, and failure isolation, and presents an isolated storage view to the storage systems above it, so that they can confidently reason about performance and robustness without worrying about interference from others. And above that we need virtualized storage, which is simple enough for applications to use yet flexible enough to express their storage needs. This virtualized storage could be a file system (which, in my view, is a fantastic storage virtualization layer and presents a beautiful virtualized storage view in the form of files and directories) for applications that are happy with POSIX APIs. However, it could also be something else for applications with different storage needs. A database-like data management system, say, could probably use more extensive APIs which allow fine control over I/O behavior. A key-value store might benefit from yet another form of virtualized storage. And with all the other layers and services in place, developing another virtualized storage system shouldn't be as difficult as it is today.
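To make the division of responsibility concrete, here is a minimal sketch of the three layers as Python interfaces. All names and method signatures are hypothetical, invented purely for illustration, not a description of any real system:

from abc import ABC, abstractmethod

class IODistributionLayer(ABC):
    """Tracks heterogeneous devices across the network; presents one pool."""
    @abstractmethod
    def register_device(self, device_id: str, capacity_gb: int) -> None: ...
    @abstractmethod
    def read(self, extent: int, length: int) -> bytes: ...
    @abstractmethod
    def write(self, extent: int, data: bytes) -> None: ...

class IsolationLayer(ABC):
    """Carves the shared pool into isolated per-tenant views."""
    @abstractmethod
    def create_view(self, tenant: str, iops_limit: int, capacity_gb: int) -> IODistributionLayer: ...

class VirtualizedStorage(ABC):
    """Application-facing abstraction: a file system, key-value store, etc."""
    def __init__(self, backing: IODistributionLayer):
        self.backing = backing   # built on an isolated view, not on raw devices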

These are, of course, very preliminary thoughts on how to divide the storage stack, and you may very well have a different view on how we should decompose this task and what the abstractions should be. But I think it is fair to say we should seriously examine this, and that this should be our first step toward software-defined storage.

(I have no idea why this post ended up so lengthy. I should really learn how to express my thoughts concisely and how to cut down what I write.... :( )





Thursday, November 22, 2012

What is Software Defined Networking

(This is mainly Scott Shenker's definition, but I wholeheartedly agree.)

SDN is three abstractions aimed at extracting simplicity from the network control plane (which is currently an ad-hoc pile of ACLs, middleboxes, DPI, and other functionality). They are the distributed-state abstraction, the specification abstraction, and the configuration abstraction.

1. Distributed State abstraction -- centralized state
Network state is physically distributed over many, many switches, but that doesn't mean we always have to deal with it that way. This distributed state should be abstracted into a logically centralized task, where you deal with a global network view, i.e., a data structure, not a pile of distributed state. That logically centralized task can then be implemented however you like -- you could even implement it in a distributed fashion for scalability where appropriate. But that is a distributed-systems problem, not a networking problem with inherently distributed state. And you are not forced to deal with network-scale complexity.


2. Specification abstraction (or network virtualization) -- simple network view
The control program should describe functionality, not how to realize it in a particular physical network. So what the control program sees should be a virtual network view that is only complex enough to express its intent, not as complex as the actual underlying physical network.
e.g., for an ACL problem, the program should only see an endpoint-to-endpoint network.


3. Configuration abstraction (or forwarding abstraction) -- hardware-oblivious forwarding specification
The configuration abstraction should expose enough to enable flexible forwarding decisions, but it should NOT expose hardware details. (This is where OpenFlow comes in, but it only partially solves the problem: it assumes the switch is the unit of the forwarding abstraction, instead of, say, a fabric.)

All in all, SDN is NOT OpenFlow. SDN doesn't have to happen in a data center either. SDN is just a re-examination of how we manage the control plane of our networks.


How to realize SDN (not that important, and you have probably seen this a dozen times...):

            control programs
 ------------------------------------------------  (control program's network view, i.e. the virtualized network)
            virtualization layer
 ------------------------------------------------  (centralized network view, i.e. one data structure)
        common distribution layer (network OS)
 ------------------------------------------------  (physical, distributed network state)
            physical network + switches
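A tiny illustration (in Python, using networkx) of what the "centralized view" buys you: the control program just computes over an ordinary graph data structure, and keeping that graph in sync with the physical network is the distribution layer's job, not the control program's. The topology here is invented for the example:

import networkx as nx

view = nx.Graph()                        # the logically centralized network view
view.add_edge("s1", "s2", weight=1)
view.add_edge("s2", "s3", weight=1)
view.add_edge("s1", "s3", weight=5)

def control_program(topology, src, dst):
    # The control program expresses *what* it wants (a path); it never
    # touches individual switches or their forwarding tables directly.
    return nx.shortest_path(topology, src, dst, weight="weight")

print(control_program(view, "s1", "s3"))   # ['s1', 's2', 's3']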

Tuesday, November 13, 2012

Security in the Cloud

1. resource sharing among distrustful customers
    cross-VM side-channel attacks
    proof-of-concept attack: attacker and victim share the same core; the attacker tries to wake up as frequently as possible, fills the instruction cache, lets the victim run (which uses a portion of the cache), then wakes up again and measures the access time of the previously cached data (so it learns how the victim uses the cache). This can let you learn the victim's secret key.
   For a multi-core attack: force the scheduler to reschedule frequently, so you end up sharing a core with the victim a lot
   DNA-reassembly techniques are used to go from partial, noisy key fragments to the complete secret key

2. pricing of fine-grained resources
    performance varies with different types of CPUs, and network performance varies too, so it is not very predictable
   either predictable but low performance, or high but unpredictable performance
   the loss comes from workload contention (Xen does a good job of CPU performance isolation, but is not so great for memory, disk, or network -- not that there is much Xen can do anyway)
   thus the uniform abstraction fails
   and attackers have opportunities to interfere with others' workloads
 a. placement gaming:
    start multiple instances and shut down the ones that perform worse
   when seeing bad performance, just shut the VM down and launch a new one

b. resource-freeing attack:
   attacker and victim both run Apache on the same physical machine, and both want more bandwidth
 the attacker can request a lot of dynamic pages from the victim, which is CPU-intensive, and while the victim is busy processing these requests, the bandwidth is free for the attacker to use

storage panel:

storage in the context of cloud and network (no slides....)

intersection of storage, SDN and computing
the systems-management paradigm is important
redefines what it means to sell IT
slaf? (an open-source storage cluster stack built on commodity hardware)
software-defined storage aligns with SDN: volume management, virtual network + virtual storage devices

 I have no idea what he talked about for 5 minutes.....


research problems in software defined storage (NetApp)

trends in data center storage: heterogeneity, dynamism, sharing, and ???
software-defined storage: have all types/layers of storage components communicate with each other seamlessly, in a way which abstracts out details (like whether you support dedup or not)

He has a table of an SLO language (which is worth looking up later)
different stacks for different SLOs at different cost points
SLOs are a core idea and need to be standardized

storage management not in a single box, but at multiple layers
isolation for performance and for failure/security cases
failure handling when one storage service is composed of multiple components (Remzi covered the layered structure; there could be other structures)

now storage happens a little bit in the hypervisor, the virtual machine, and the application. Which layer should do what? How do we coordinate the different layers?

new storage problems due to dynamism

want: data structure storage.


Evolution of software defined storage (VMware)
go back and look at how SDS changes, since we already have software-defined storage
traditionally: hardware-defined storage realized by special-purpose boxes
clear boundaries between hardware/software, and a limited interface: block read/write
now the interface is changing, and the enabling factors are:
1. commoditization of those big boxes
2. because of CPU advances, hosts are more powerful and can have more intelligence, rather than, say, putting dedup inside the box
3. richer interface standards (like those offered by SCSI, e.g., XCOPY)
4. boxes simplifying, allowing applications to specify what they want
5. no distinction between server nodes and storage nodes anymore, just bricks (which have CPU, memory, flash and disk) -- already happening at Google, Facebook, etc.

summary: software-defined storage is what happens when the distinction between server node and storage node blurs, and we have a disruptive, shared-nothing model (what does he mean by this???)

Storage Infrastructure Vision (Google)
Google needs multiple data centers
data consistency is a first-class citizen now; scalability, reliability and pricing problems
Google infrastructure goal: not just look simple to outside users, but look simple to application programmers too. Complexity is managed by infrastructure operators, not application developers.

Datacenter storage systems (Facebook)
challenges:
1. the software stack will be obsolete soon (more CPUs, not faster CPUs; faster storage; faster, flatter networks)
2. heterogeneous workloads -- but you could potentially have specific data centers for specific things, so the workloads may not be that different
3. heterogeneous hardware -- how to take advantage of different hardware profiles
4. dynamic applications -- need adaptive systems

opportunities:
1. multi-threaded programming is clumsy -- new parallel programming paradigms
2. flash != hard disk -- new storage engines for flash
3. high-speed networks -- new network stack
4. dynamic systems will win big -- high throughput vs. low latency, space vs. time, data-temperature aware


WISDOM discussion (Microsoft)
COSMOS:
   a service internal to Microsoft, used for batch processing and analytics
   challenges:
         transient outliers can pummel performance, and are thus hard to reason about
         any storage node services multiple types of requests
         exploiting cheap bandwidth (flat network), but storage always outpaces the network







SDS (kinda) work by Remzi's group and others

use flash as a cache (from Mike Swift's group)
flash is widely used as a cache
using the SSD's block interface is inefficient for caching, because a cache is different from storage
new firmware in the SSD to get rid of the block mapping and use a unified address space
they also provide a consistent cache interface (which block is clean/dirty, which block has been evicted, etc.)
when doing garbage collection, you don't have to migrate data because it's a cache, and we have the primary copy of the data somewhere else
we plan to propose new interfaces and virtual SSDs (so it could be more software-defined???)

combating ordering loss in the storage stack -- the no-order FS work (Vijay)
ordering is not respected in many layers
don't use ordering to ensure consistency
1. coerced cache eviction -- additional writes to flush the cache to ensure ordering
2. backpointer-based consistency -- verify backpointers when following pointers in the file system
3. (ongoing) inconsistency in virtualized storage settings

de-virtualization for flash-based SSDs (Yiying) -- the nameless writes in SSDs work
removes the excess layers of block mapping: file system -> FTL -> physical block
store physical block numbers directly in the file system (when the file system writes a block, no position is specified; the disk decides where to put the block and informs the file system where the block got written)

in a virtualized environment: a file system de-virtualizer (ongoing, I think)
the file system performs normal writes, but the fsdv does nameless writes and stores the physical mapping


zettabyte reliability (Yupu)
add checksums at the file system/memory boundaries
also an analytical framework to reason about reliability -- the sum-of-probabilities model
future: data protection as a service? without modifications to the OS? (I am not quite sure how to realize this....)

low-latency storage-class memory
data-centric workloads are sensitive to storage latency
storage-class memory (phase-change memory, STT-RAM, etc.): persistent, low latency, byte-addressable
Mnemosyne: persistent regions + consistency
                     persistence: so that data structures don't get corrupted
                     consistency: update data in a crash-safe way


Harden HDFS (Thanh Do)
a software-defined way to ensure system reliability
1. selective 2-versioning programming
2. encode file system state using Bloom filters



software defined storage by Remzi (more like virtualized storage to me...)

today:
    big box storage dominates, but the selling point is just the software
    so go from software/hardware to software/VMM/hardware

problem:
    1. reliability
         when errors are propagated, where in the system are they handled? -- file systems just lose errors during propagation! (something like 1 in 10)
         error handling is fundamentally hard (not because Linux is written in C, or because it is open source and therefore poorly programmed)
         how do we reason about a system's error profile?

    2. isolation
         storage should be isolated from VM to VM
        question: how does a file system react to block failures? (type-aware fault injection) -- Remzi somehow linked this work to fault isolation =,= ----- they get a result matrix of how file systems react to different types of faults -- write errors are largely ignored by ext3 and other file systems; sometimes they panic too (for ReiserFS)
        so if one VM does something funny, it might cause the underlying storage system to panic -- thus the need to isolate faults systematically.
   

    3. performance
       how does performance change when you stack storage systems (say file systems) together, and how do you systematically reason about it?

    4. application
   how applications use storage (something along the lines of Tyler's "a file is not a file" paper)


beyond virtualization:
composition of storage!


Questions:
1. have you seen correctness guarantees being broken?
    Of course! Disks lie. Apple just changed the semantics of fsync.

2. how about error propagation in a layered storage stack?
    we are doing that work. but we have shown that error propagation within a single layer is already hard; you can imagine it being even worse across multiple virtualized storage layers

3. your vision of software-defined storage is quite different from SDN. how do you relate your work to SDN, and is there maybe some analogy with SDN for how to manage storage?
     we are starting small, and on problems we have already seen.
     (I personally think Remzi doesn't have a good answer on an SDN-flavored SDS, in terms of a storage control plane, yet...)


Future trends on storage

Future trends on hard drives:

1. areal density growth
    challenge: interference between neighboring blocks
                       SMR (overlapping sectors): can do sequential reads/writes and random reads, but can't do random writes
     heat-assisted magnetic recording to increase areal density (asymmetric temperatures for read and write; heated recording, using a laser to heat)


2. what about SSDs:
    SSDs are important for improving performance
    not a viable candidate for capacity, though (fabs cost a lot, but there is not that much revenue in it for the whole industry)


Future trends in NV memory (Fusion-io)
 multi-layer vision
multi-layer memory (less reliable) is the vast majority of what is used in data centers

 how to effectively use flash:
  a hierarchy of DRAM, flash, disk

 Fusion-io: an API to interact with flash directly instead of through the traditional block interface
             an API more appropriate for the flash media (transactional semantics, etc.)
            more memory-like semantics for flash instead of the traditional storage view of flash, and a corresponding API

      basic I/O: read/write
   transactional I/O: commit
    memory-like: the ability to chase a pointer

  challenges:
     1. reliability with low-cost/high-density media
     2. integration with existing software stacks, caches, tiering (flash is consumer-oriented, not data center-oriented, so it is up to the software folks to make it work for data centers)
     3. system and data center implications -- networking, scale out vs. scale up
   

Future trends in data protection/backup (Data Domain)
phase 0: tape
phase 1: deduplicated disk
              difference between backup storage and primary storage (see their FAST'12 paper) -- you don't care about IOPS for backup
              so the disk should be optimized for the backup purpose
phase 2: optimized deduplicated disk (the disk no longer has to behave like a tape, and can do things differently than you would with tapes)
              new I/O interfaces to do backup
              incremental forever, virtual fulls (instead of a weekly full backup and daily incrementals)
 phase 3: integrated data protection (backup) silos
                now we end up doing multiple backups for everybody and at each layer (and you don't know how much your organization is spending on backups!)
phase 4: solve the problems of phase 3
               provide a data protection cloud (what is that?)
             
Why buy innovation (Microsoft Research; not a storage guy, works on hardware acceleration for search engines)

what innovations matter in the data center?
1. a different scale-cost curve (lower cost for the same scale) delivering the same value -- but innovation has a fixed start-up cost (even at zero scale)
2. a different (and greater-than-linear!) scale-value curve!
    e.g., new capabilities, competitive advantage, new businesses


University research typically focuses on category 1 rather than category 2; industry wants category 2 a lot.
Why does academia try so hard to make little tweaks to performance instead of thinking about category 2???



Questions:
1. why increase areal density rather than just adding platters or RPM?
     there are limits on how many platters you can have and on RPM (energy, for example)

2. what are the best mechanisms for the new interfaces?
     people like to write programs differently (some people like the memory model, while some people do I/O well)

3. when do these new media hit the mainstream?
     well, still in an early phase.... we'll see where they can go

4. for category-2 research, it's hard to measure new value as opposed to measuring performance. how do we convince people our research has value?
    it's the decision of the community as a whole to reward research of type 2. thirty years ago we had papers with fewer measurements than today's. so it is a cultural problem of the community

5. why didn't you mention arrays? (disk arrays, flash arrays, memory arrays, etc.)
    RAIDs are well known
    people are building flash arrays, and there are some new things compared to traditional RAIDs

6. innovations in interfaces to storage: for years we have had read/write blocks; what has happened to change that? or are we going to end up with read/write blocks anyway? (by Remzi)
    Data Domain: we got value by changing the interface
     Fusion-io: the media changes, and the role of the open-source community has also changed. even though new interfaces are not picked up by old apps, they can be used by new apps
     Microsoft: if we can get a lot of performance, then it's worth changing the interface
    Data Domain: not just performance -- a lot of the time it's for new capabilities







SDN: industry view

Cisco:
    Missed what he said.....
    but one point: slicing is fundamental for networks
    and how to balance (academic) network flexibility with production isolation/security

NEC:
   the OpenFlow market potential is huge
   use cases:
    1. enterprise: departmental isolation/mobility
    2. campus: virtual circuits for research collaboration (say, GENI)
    3. service provider: network efficiency
    4. cloud: comprehensive virtualization (virtual networks working with VMs and other virtualized resources)

Google: (not a view, but what Google is doing)
   Google's SDN (OpenFlow) WAN -- to connect data centers throughout the world
        they still use BGP and IS-IS (for backward compatibility)
        they also use OpenFlow to create (isolated?) large network test environments
        they use centralized control to push workload to the edges as much as possible, because edge switches are cheap
        management can't be centralized entirely -- it's a distributed-systems problem, but not a networking problem anymore
   All in all, Google likes to turn networking problems into distributed-systems problems, which is what SDN does anyway
    but they want to aggregate network information and be able to understand and reason about network performance (as there is no formal way to debug/diagnose network performance problems yet)

Big Switch:
    what SDN is: the transition from closed big boxes to open programmable solutions!
    make network operators able to develop applications (easy, intuitive APIs, easy debugging)
    use case: Big Tap
           monitoring the network: use a separate tap network, dump a huge amount of traffic into this tap network, then analyze it (a dumb idea, but a great win for operators)
     middleboxes: focus on virtual paths and on middlebox-controller communication design (this will need to integrate middleboxes into OpenFlow, I think)

questions:
   1. in a virtualized environment, where is the controller?
       Google: we are not VMware; we have a controller managing all layers
       Big Switch: we do not want to preclude traditional network vendors from this space

   2. is OpenFlow as powerful as the closed solutions?
       NEC: this is a matter of time; as use cases emerge, it will become more powerful as functionality is integrated (5-10 yrs)
       Google: SDN is not OpenFlow, but OpenFlow is a form of SDN; as long as you have external control, it doesn't have to be OpenFlow
        Cisco: software enables new things (verification, say)
        Google: how to have security in an open network environment is a challenge
        Big Switch: the controller will become an OS, so what the Google folks talk about is really process isolation
   
       
   

Consistency-First Approach for Routing (a BGP revision)

Consistency:


a consistency-first approach (over availability or performance)
1. necessary
2. feasible
3. powerful abstractions

Scatter: consistent storage manager??? (maybe some storage abstraction here????)

Internet availability is not high: 2.0-2.6%
because an available physical path doesn't imply an available routed path

90% of outages last less than 700 secs (due to transient network failures, maybe?)
so we want more robust routing protocols to avoid the short-term outages caused by the dynamics of routing protocols

the BGP protocol:
1. opaque local policies
2. a distributed mechanism to update paths (with some delay, thus inconsistency)
which causes short periods of network unavailability

so the underlying cause of short-term outages is inconsistent global state

Consensus Routing:
consistency-first approach
decouple safety and liveness

safety: forwarding tables always consistent and police compliant:
apply route updates only after they have reached all dependent Ases
apply updaes synchronously accross Ases
mechanism:
1. run bgp but don't apply the updates
2. a distributed snapshot taken periodically
3. ases send list of incomplet consolidators
4. sonsolidators run a consensus algorithm to agree on the set of imcomplete updates
5. consolidators flood new routes, and forwarding table updated (Note: they are route states updates, not route updates! which route to take is still within choices of each as, and doesn't have to be exposed to outside world)
(in essense, a two-phase commit process)
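A toy sketch (my own, heavily simplified; not the actual protocol) of the buffer-then-agree-then-apply idea in the list above -- updates are held back until a consolidation step decides which ones are complete, so every AS flips its forwarding state consistently at the epoch boundary:

class ToyAS:
    def __init__(self, name):
        self.name = name
        self.forwarding_table = {}   # prefix -> next hop (the applied state)
        self.pending = {}            # update_id -> (prefix, next_hop)

    def receive_update(self, update_id, prefix, next_hop):
        # step 1: process the update BGP-style but do NOT apply it yet
        self.pending[update_id] = (prefix, next_hop)

def consolidate(ases):
    # steps 2-4 (stand-in): treat an update as complete once every AS has
    # seen it; real consensus routing uses distributed snapshots and a
    # proper consensus protocol among the consolidators
    complete = set.intersection(*(set(a.pending) for a in ases))
    # step 5: all ASes apply exactly the agreed-upon updates, so their
    # forwarding state stays mutually consistent across the epoch boundary
    for a in ases:
        for uid in sorted(complete):
            prefix, next_hop = a.pending.pop(uid)
            a.forwarding_table[prefix] = next_hop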

liveness: the routing system adapts to failures quickly
solution: keep using the old path, but dynamically re-route around the failed link (using existing techniques)


Scatter (a P2P consistent storage layer)

design pattern for the consistency-first approach:
1. separation of safety and liveness
2. consistency as a baseline guarantee: the trade-off is then between performance and availability (a more constrained design space)












Software defined network in datacenter (From AA's group)

Elastic Middleboxes in the Cloud: (Robert)

where is the bottleneck (app? bandwidth? middlebox?)

Fast bottleneck Identification:

Processing time (hard for packets in/out)
CPU/memory info (hard to decide how often to sample)
Open connections

Strato: capture MB and NW bottlenecks
Use a greedy heuristic (tentatively add a middlebox) --- doesn't work in complex MB topologies
Refinement: most commonly used (overlapping) MBs first?





Middlebox scaling: (Aaron)
Move some of the middlebox control to the controller.
1. How is the logic divided?
    classify middlebox state, and define interfaces between middleboxes and the controller
    Action state + support state + tuning state
    Representing state: key (field1 = value1, field2 = value2, ...) + action (drop, forward, etc.)
   interfaces: get, remove, add state (see the sketch below)
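A hypothetical sketch of that state representation and controller-facing interface (names and fields are invented for illustration, not taken from the actual system):

from dataclasses import dataclass

@dataclass(frozen=True)
class StateKey:
    fields: tuple          # e.g. (("src_ip", "10.0.0.1"), ("dst_port", 80))

@dataclass
class StateEntry:
    key: StateKey
    action: str            # e.g. "drop", "forward"

class MiddleboxStateAPI:
    # the interface a middlebox could expose to the controller (illustrative)
    def __init__(self):
        self._table = {}

    def add(self, entry):
        self._table[entry.key] = entry

    def get(self, key):
        return self._table.get(key)

    def remove(self, key):
        self._table.pop(key, None)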



NaPs: network-aware placement and scheduling in clusters (Yizheng):
motivation: little work examining the interplay between CPU/memory resource sharing and network resource sharing (via TCP congestion control, etc.)

Quincy placement: put instances as close together as possible (not optimal)
NaPs design:
a general framework to enable network awareness in the cluster:

Lowest level: an SDN controller to expose network state
Higher level: a cluster scheduler that talks to the SDN controller for network status, to worker nodes for workload info, and to storage for data placement information




 

Saturday, November 3, 2012

A script to modify the Samsung Galaxy S3's rootfs (including the init script)



Note that you will need to modify it according to your own environment and install the appropriate software (mkbootimg, etc.).

After running this script, you can use ClockworkMod (which should be in your recovery image if your phone is rooted, so you might not want to reflash recovery.img...) to flash the modified image (here only boot.img is modified) to your phone.

In this script, the new boot.img ends up in /emmc/clockwork/mod in recovery mode.

You only need to restore boot.img to save time.

Reference:
how to unpack/repack boot images (including ramdisk):
http://android-dls.com/wiki/index.php?title=HOWTO:_Unpack%2C_Edit%2C_and_Re-Pack_Boot_Images
how to calculate an md5:
http://www.mydigitallife.info/how-to-calculate-and-generate-md5-hash-value-in-linux-and-unix-with-md5sum/
Linux cpio facility:
man cpio
http://www.gnu.org/software/cpio/

Restore.sh:
# Work inside the ClockworkMod backup directory that holds the original images.
cd modified3-2012-11-01.17.05.34/
mkdir tmp
cd tmp/
# Split boot.img into its kernel (zImage) and ramdisk parts.
unpackbootimg -i ../boot.img
mkdir ramdisk
cd ramdisk
# Unpack the gzipped cpio ramdisk so its files can be edited.
gunzip -c ../boot.img-ramdisk.gz | cpio -i
# Drop in the modified init script and the extra kernel module.
/bin/cp /scratch/suli/android/samsung/config/init.rc init.rc
/bin/cp /scratch/suli/android/samsung/programs/slowdevice/null_bd/null_bd.ko lib/modules/null_bd.ko
# Repack the ramdisk (newc cpio format, gzipped).
find . | cpio -o -H newc | gzip > ../newramdisk.cpio.gz
cd ..
rm -rf ramdisk
/bin/mv newramdisk.cpio.gz boot.img-ramdisk.gz
# Rebuild boot.img from the original kernel plus the new ramdisk.
mkbootimg  --kernel boot.img-zImage --ramdisk boot.img-ramdisk.gz --pagesize 2048 --base 10000000 -o ../newboot.img
cd ..
rm -rf tmp
/bin/mv newboot.img boot.img
# Regenerate nandroid.md5 so ClockworkMod accepts the modified backup.
rm -f nandroid.md5
md5sum boot.img  cache.ext4.tar  data.ext4.tar   recovery.img  system.ext4.tar > nandroid.md5
# Push the new image and checksum file back into the phone's backup folder.
adb push boot.img /sdcard/clockworkmod/backup/modified-2012-11-01.17.05.34/boot.img
adb push nandroid.md5 /sdcard/clockworkmod/backup/modified-2012-11-01.17.05.34/nandroid.md5

Thursday, October 25, 2012

network sessions in OSDI'12 (likely involving how systems use the network / SDN stuff?)


Tuesday, October 9, 2012

9:00 a.m.–10:30 a.m., Tuesday

Distributed Systems and Networking

Ray Dolby Ballroom 123
Session Chair: Jason Flinn, University of Michigan

Spotting Code Optimizations in Data-Parallel Pipelines through PeriSCOPE

Zhenyu Guo, Microsoft Research Asia; Xuepeng Fan, Microsoft Research Asia and Huazhong University of Science and Technology; Rishan Chen, Microsoft Research Asia and Peking University; Jiaxing Zhang, Hucheng Zhou, and Sean McDirmid, Microsoft Research Asia; Chang Liu, Microsoft Research Asia and Shanghai Jiao Tong University; Wei Lin and Jingren Zhou, Microsoft Bing; Lidong Zhou, Microsoft Research Asia

MegaPipe: A New Programming Interface for Scalable Network I/O

Sangjin Han and Scott Marshall, University of California, Berkeley; Byung-Gon Chun, Yahoo! Research; Sylvia Ratnasamy, University of California, Berkeley

DJoin: Differentially Private Join Queries over Distributed Databases

Arjun Narayan and Andreas Haeberlen, University of Pennsylvania

Wednesday, October 24, 2012

Storage-network interaction


Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems
CMU, 2007, tech report

The network incast problem is caused by the storage system reading different blocks in parallel.



A cost effective, high-bandwidth storage architecture. 
ASPLOS-VIII, 1998


High Performance NFS: Facts & Fictions
SC'06



The panasas activescale storage cluster: Delivering scalable high bandwidth storage
SC'04

Remote Direct Memory Access over the Converged Enhanced Ethernet Fabric: Evaluating the Options
HotTI'09
D. Cohen, T. Talpey, A. Kanevsky, U. Cummings, M. Krause, R. Recio,
D. Crupnicoff, L. Dickman, and P. Grun


C. DeSanti and J. Jiang. FCoE in Perspective. In Proceedings of the 2008 International Conference on Advanced Infocomm Technology (ICAIT '08), pages 1–8, Shenzhen, China, July 2008.


Network support for network-attached storage
Hot Interconnects, 1999

incast:
http://www.cs.cmu.edu/~dga/papers/incast-sigcomm2009.pdf

outcast:
https://engineering.purdue.edu/~ychu/publications/nsdi12_outcast.pdf

Sunday, October 21, 2012

Software defined storage

Windows Azure Storage: a highly available cloud storage service with strong consistency:
Microsoft Research
SOSP 2011

A GFS-like stream layer. They added another layer on top of it to implement the Blob, Table, Queue (for reliable messaging) and Drive (NTFS volume) abstractions.
Focused on load balancing (via their fancy index techniques?) and the consistency protocol.
Scaling computing separately from storage: nice for a multi-tenant, full-bisection-bandwidth environment; bad for latency/bandwidth to storage. They didn't talk about how Azure Storage stresses their network, though, just general load balancing.



Argon: performance insulation for shared storage servers
Matthew Wachs, Michael Abd-El-Malek, Eno Thereska, Gregory R. Ganger
Carnegie Mellon University
FAST 2007 

Performance isolation for storage systems. However, they focus on the I/O side, assigning time quotas for disk use, combined with prefetching, etc.
For non-storage resources like CPU time and network bandwidth, they claim that well-established resource-management mechanisms can support time-sharing with minimal inefficiency from interference and context switching.


The Panasas ActiveScale Storage Cluster: Delivering Scalable High Bandwidth Storage:

FIXME: read it!




Proportional-Share Scheduling for Distributed Storage Systems
UMich/HP Lab
FAST 2007

Assumes a dedicated storage system (clients vs. data nodes) and a variation of fair queuing to serve requests. In a sense similar to FairCloud, which schedules requests to a key-value store; however, there is no replication/migration/multiple resource types, etc., so the flexibility of this scheme is limited. They do it in a distributed way, whereas FairCloud uses a central scheduler.

They assume the network is good enough.


Ursa Minor: versatile cluster-based storage
CMU
FAST 2005

Online change (software-defined, late binding, whatever you call it) and adaptive management of the data encoding (replication vs. parity, etc.) and the fault model (how many fail-stop failures and how many Byzantine failures to tolerate). Not quite sure how they learn it, though??


Others:
Storage Virtualization / Software Defined Storage:

SNIA Technical tutorial on Storage Virtualization from 2003 -
http://www.snia.org/sites/default/files/sniavirt.pdf
SANRAD white paper about snapshots and Global replication etc with
storage virtualization -
http://www.sanrad.com/uploads/Server_and_Storage_Virtualization_a_Complete_Solution.pdf




Cloud database / file systems:

Smoke and mirrors: reflecting files at a geographically remote
location without loss of performance -
http://dl.acm.org/citation.cfm?id=1525924

 
Cassandra - A Decentralized Structured Storage System -
http://dl.acm.org/citation.cfm?id=1773922


API:
Cassandra offers a key + structured object (consisting of multiple, hierarchical column families) data model, and provides atomic operations per key per replica; there is no transactional guarantee. Namely, insert row, get column of row, and delete column of row are the main API.
 
Distributed aspects:
Cassandra is a decentralized (Chord-like) distributed storage system which uses Chord-style consistent hashing for key partitioning and Chord-style replication management (replicate to the next N-1 nodes in the ring, possibly rack-aware). All standard distributed techniques here. One interesting thing:
they use an accrual failure detector to detect node failures, and observe an exponential distribution of inter-arrival times for gossip messages.

Local persistence mechanism (all sequential):
1. Data is first persisted to a commit log (to ensure durability), and only after that are changes applied to the in-memory data structure. This commit log (or journal) is written sequentially to a dedicated disk with, of course, a lot of fsync.

2. They also write the in-memory structure to disk in a sequential-only fashion, i.e., every time memory is full, a bunch of new files are generated, instead of modifying and augmenting existing files. More specifically: one file per column family, one file for the primary key, and also an index file for the primary key (sort of column-oriented storage, but not entirely). After this, the corresponding commit log can be deleted.

3. They periodically combine these generated files into larger files.
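A toy sketch (not Cassandra's actual code; names and the flush threshold are invented) of the write path described above -- append to a sequential commit log first, then update the in-memory table, and flush it as a brand-new file when it grows too large:

import json, os

class ToyColumnStore:
    def __init__(self, data_dir, memtable_limit=1000):
        self.data_dir = data_dir
        self.memtable = {}                 # key -> {column: value}
        self.memtable_limit = memtable_limit
        self.flush_count = 0
        os.makedirs(data_dir, exist_ok=True)
        self.commit_log = open(os.path.join(data_dir, "commit.log"), "a")

    def insert(self, key, column, value):
        # 1. durability first: append to the sequential commit log and fsync
        self.commit_log.write(json.dumps([key, column, value]) + "\n")
        self.commit_log.flush()
        os.fsync(self.commit_log.fileno())
        # 2. only then apply the change to the in-memory structure
        self.memtable.setdefault(key, {})[column] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # 3. write the memtable out sequentially as a brand-new file;
        #    existing files are never modified in place
        path = os.path.join(self.data_dir, "flush-%d.json" % self.flush_count)
        with open(path, "w") as f:
            json.dump(self.memtable, f)
        self.flush_count += 1
        self.memtable = {}
        # the corresponding commit-log segment could now be dropped, and a
        # periodic compaction step would merge the flushed files (not shown)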

 
The Case for RAMClouds: Scalable High-Performance Storage Entirely in
DRAM - http://dl.acm.org/citation.cfm?id=1713276