Tuesday, March 19, 2013

Self-encrypting disk and the disk write process

Technical talk from Seagate

I will document the self-encrypting disk later. Almost all of the complexity lies in the bootstrapping process.

But here is the data write process within a Seagate disk (a toy sketch follows the list):

1. User data comes in from the SATA interface, along with the command, an ECC code, and a DFF code (to verify that the data hasn't been modified).

2. The disk checks the ECC code and the DFF code, and recomputes another checksum (they have a name for it, I forgot). This checksum is what gets used afterwards, not the ECC and DFF codes.
    If the data needs encryption, encryption is also done here.
    Note that this processing happens *on the wire*: only after the whole computation completes is the data stored in the buffer cache.

3. After the data is in buffer memory, the disk does further processing:
   a. scagger (or another name? perhaps scrambling) --- to filter out representative patterns in the data
   b. ??? --- to filter out repetitive bit strings in the data?
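As a toy model of this write path (every function name, the checksum choice, and the stand-in cipher below are my own placeholders, not Seagate's actual firmware):

    import hashlib

    # Placeholders for firmware internals; none of these are the real algorithms.
    def compute_ecc(data): return hashlib.md5(data).digest()
    def compute_dff(data): return hashlib.sha256(data).digest()[:4]
    def encrypt(data):     return bytes(b ^ 0x5A for b in data)  # stand-in cipher
    def scramble(data):    return data                           # stand-in scrambler

    def disk_write(data: bytes, ecc: bytes, dff: bytes) -> bytes:
        # Steps 1-2: verify the interface-level integrity codes, then
        # recompute an internal checksum that travels with the data from
        # here on (the ECC and DFF codes are dropped).
        if compute_ecc(data) != ecc or compute_dff(data) != dff:
            raise IOError("data corrupted on the SATA link")
        internal_checksum = hashlib.sha1(data).digest()

        # Self-encrypting disk: encrypt on the wire, before the data
        # ever reaches the buffer cache.
        data = encrypt(data)

        # Step 3: further processing once the data is in buffer memory,
        # e.g., scrambling away repetitive bit patterns.
        data = scramble(data)
        return data + internal_checksum

    blob = disk_write(b"hello", compute_ecc(b"hello"), compute_dff(b"hello"))
    print(len(blob))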

Saturday, March 16, 2013

Measurements in a virtualized setting

Evaluating the Accuracy of Java Profilers
PLDI'10  IBM Research

This paper is not directly related to virtualization. Instead it shows how popular Java profilers fail to sample the call stack randomly, because most profilers defer sampling to application yield points.
The proposed fix is to interrupt the application's execution using signals, and do truly random sampling based on time.
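A minimal Python sketch of the idea (the paper's implementation is for Java profilers; this just shows the signal-based, time-randomized flavor of sampling, Unix-only):

    import collections, random, signal

    samples = collections.Counter()

    def sample(signum, frame):
        # Record where the program is right now; a real profiler would
        # walk the whole call stack.
        samples[(frame.f_code.co_filename, frame.f_lineno)] += 1
        # Re-arm with a *randomized* interval so samples don't correlate
        # with any periodic behavior in the program.
        signal.setitimer(signal.ITIMER_REAL, random.uniform(0.005, 0.015))

    signal.signal(signal.SIGALRM, sample)
    signal.setitimer(signal.ITIMER_REAL, 0.01)

    total = 0
    for i in range(10**7):   # the workload being profiled
        total += i * i

    signal.setitimer(signal.ITIMER_REAL, 0)  # disarm
    print(samples.most_common(3))

(Ironically, CPython only delivers signals at bytecode boundaries, so even this is an approximation of truly random interruption.)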

However, it might be interesting to think about measurements in general in a cloud/virtualized setting.
1. What would be useful for people to measure in a virtualized/cloud setting? Call stacks? CPU statistics, e.g., branches? Memory usage?
2. What properties do we need to ensure for each measurement, in order for the results to be relevant and useful? E.g., sampling needs to be random in time to give accurate profiles.
3. How are these measurements typically performed? How do they rely on hardware events, e.g., CPU cycles or interrupts?
4. How are these hardware events typically emulated in a virtualized setting? Via trap-and-emulate, or via binary translation?
5. Given the way these events are emulated, will they still have the properties we desire? For example, software-generated interrupts may be batched or deferred before delivery to the VM, which affects the timing of the interrupts. An instruction in the VM may trap into the VMM and take thousands of cycles to emulate, which makes the notion of a CPU cycle fuzzy.
6. How will that affect our measurement results? How do we work around it to get correct measurements that can shed light on optimizing our programs?


I don't have a satisfactory answer to any of the above questions, but it might be interesting to read and think about them.


Another example of this is that TCP relies on accurate RTT estimates in order to perform congestion control and adjust window sizes. A VM, however, may be descheduled for tens or even hundreds of milliseconds while a packet is pending. As a result, CPU time-multiplexing can distort a VM's RTT estimates, causing its congestion window to grow too slowly, which degrades throughput significantly. To solve this, some have proposed offloading more TCP functionality to the hypervisor, or presenting VMs with virtual NIC hardware that supports an optional TOE (TCP Offload Engine).
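A back-of-the-envelope illustration (the 30 ms stall and the 64 KB window are made-up numbers):

    # A single TCP flow's throughput is bounded by roughly cwnd / RTT.
    cwnd_bytes = 64 * 1024        # 64 KB congestion window
    rtt_native = 0.001            # 1 ms real network RTT
    rtt_vm     = 0.001 + 0.030    # plus a 30 ms VM descheduling stall

    for rtt in (rtt_native, rtt_vm):
        print(f"RTT {rtt*1e3:5.1f} ms -> at most {cwnd_bytes*8/rtt/1e6:6.1f} Mbit/s")
    # ~524 Mbit/s native vs. ~17 Mbit/s with the inflated RTT.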

Sunday, March 3, 2013

CA-NFS: A Congestion-Aware Network File System

FAST'09 NetApp

Key Idea:
1. Make the NFS client behave better to improve whole-system performance.
2. They define a congestion price in a device-independent way, based on utilization (of disk, CPU, network, and even virtual devices such as read-ahead effectiveness, which serves as a heuristic); see the sketch below. -- ref.: "Throughput-competitive online routing", FOCS '93
3. The servers and the clients then use the price as a way to coordinate and schedule file system operations.
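My loose reading of the pricing idea, as a sketch (the exponential form follows the online-routing reference; the constant and the client policy are made up):

    # Exponential congestion pricing: cheap at low utilization, explodes
    # near saturation, so the most congested resource dominates.
    K = 64.0  # made-up aggressiveness constant

    def price(utilization: float) -> float:
        return (K ** utilization - 1) / (K - 1)   # 0 when idle, 1 when full

    def server_price(disk_util, cpu_util, net_util):
        # The server's asking price follows its most congested resource.
        return max(price(disk_util), price(cpu_util), price(net_util))

    # A client compares the server's price with its own (e.g., the cost of
    # holding dirty pages) to decide whether to write back now or defer.
    if server_price(0.2, 0.3, 0.1) < price(0.9):
        print("client is more congested than the server: write back now")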

Networking Abstractions (in a cloud computing era)

Coflow: A Networking Abstraction for Cluster Applications
HotNets'2012 UC Berkeley (same people behind "Resilient Distributed Datasets")

Key Ideas/Takeaway:
1. The completion time of a cluster application depends more on the fate of a collection of flows than on any individual flow.
2. Most application needs can be expressed in terms of minimizing completion time or meeting deadlines. (really?)
3. Flows can be decoupled in time (by using storage) and in space (by using broadcast or multicast). --- Note: this is an example of the network using storage!
4. Typical cluster application dataflow patterns: MapReduce, dataflow with barriers (multi-stage MapReduce, e.g., Pig), dataflow without explicit barriers (Dryad), dataflow with cycles (Spark), bulk synchronous parallel (parallel scientific computing), partition-aggregate (the Google search engine).

API:
    four players: driver (cluster coordinator, or cloud controller), sender, receiver, network
    create(pattern, [options]) => coflow handle, called by the driver; pattern may be shuffle, broadcast, aggregation, etc.
    update(handle, [options]) => result, called by the driver
    put(handle, flow id, content, [options]) => result, called by a sender
    get(handle, flow id, [options]) => content, called by a receiver
    terminate(handle, [options]) => result, called by the driver
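A hypothetical rendering of the API as code (all types, option names, and the toy in-memory behavior are my guesses; a real implementation would hand the flows to the network):

    from dataclasses import dataclass, field

    @dataclass
    class Coflow:
        pattern: str                         # "shuffle", "broadcast", ...
        options: dict = field(default_factory=dict)
        flows: dict = field(default_factory=dict)

    # Driver side
    def create(pattern, options=None) -> Coflow:
        return Coflow(pattern, options or {})

    def update(handle, options) -> bool:
        handle.options.update(options)       # e.g., revised size/deadline hints
        return True

    def terminate(handle, options=None) -> bool:
        handle.flows.clear()
        return True

    # Sender / receiver side
    def put(handle, flow_id, content, options=None) -> bool:
        handle.flows[flow_id] = content      # toy stand-in for the network
        return True

    def get(handle, flow_id, options=None):
        return handle.flows.pop(flow_id)     # would block in a real system

    shuffle = create("shuffle", {"deadline_ms": 500})
    put(shuffle, "map0->reduce1", b"partition bytes")
    print(get(shuffle, "map0->reduce1"))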
 
Underlying Assumption:
    Coflow assumes a fixed set of senders and receivers; the driver has to determine them without network participation. I.e., they exclude the possibility that the network determines where to put replicas, etc. This might be mitigated by using candidate senders/receivers; I am not quite sure.
   "Coflow comes into action once you have already determined where your end-points are located. If the decision of end-point placement is not a good one, there is only a limited opportunity."

Questions:
1. How does this work in a virtualized environment?
2. How does the cloud controller coordinate multiple coflows? They propose sharing (reservation-based), prioritization, and ordering. This gives the cloud controller a way to allocate the (abstracted) network resources? What are the implications?
3. How does the network coordinate requests from multiple cloud controllers? Not explicit in the paper.
4. What network topology is presented to the cloud controller? The real topology? Rings?
5. How does this framework handle the situation where the network is not the bottleneck? Does it have to dynamically interact with the computation and storage units?




Programming Your Network at Run-time for Big Data Applications
HotSDN'2012 IBM Watson, Rice University

Key ideas:
1. The application manager sends a traffic demand matrix to the SDN controller, which in turn uses this information to optimize the network (e.g., using optical switches to set up the topology); see the sketch after this list. The traffic demand matrix is estimated using application-level knowledge.
2. The application manager, knowing that it is operating on an optical-switch-enabled network, can do some simple optimizations, e.g., aggregate reducers in the same rack, or submit requests to the SDN in batches. Here the application manager uses two pieces of network information: that the network is optically switched, and which nodes are in the same rack.
3. Using the traffic matrix (?), they argue for efficient implementations on optical switches of some particular communication patterns, e.g., aggregation, shuffling, or overlapped aggregation.
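A toy version of the controller's decision (the greedy policy is my own simplification of whatever optimization the paper actually uses):

    # The application manager reports a rack-to-rack demand matrix; the
    # SDN controller assigns its few optical circuits to the heaviest
    # rack pairs, leaving the rest on the packet-switched network.
    demand = {("rack1", "rack3"): 80e9,   # bytes, e.g., a big shuffle
              ("rack1", "rack2"): 5e9,
              ("rack2", "rack3"): 30e9}

    NUM_CIRCUITS = 2                      # optical ports are scarce
    circuits = sorted(demand, key=demand.get, reverse=True)[:NUM_CIRCUITS]
    print("optical circuits:", circuits)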

Comments:
1. This work operates at rack granularity, because only the ToR switches have optical links.
2. Another paper that mixes API and implementation.......=.=
3. I am really sick of Hadoop...-__-b!!!

Questions:
1. What is the network topology presented? A ring per rack?




Fabric: A Retrospective on Evolving SDN
HotSDN'2012,  Nicira and UC Berkeley,  Scott Shenker

This paper advocates a router/switch-chassis-style implementation of the network.

Before the paper:
1. I love Scott Shenker!!!! I agree with almost every word he says about SDN!
2. We definitely need some layer-2.5 addressing, since both IP and MAC addresses have fundamental deficiencies.

What's the API of network:
1.  Host --- Network interface
     Hosts ask the network to send their packets, along with QoS requirements.
     Currently this is done via the packet header.

2. Operator -- Network interface
     Operators give requirements for, and decisions about, the network's operation to the network.
     Currently this is done by box-by-box router configuration. SDN provides a more programmable interface for this, by decoupling the distribution model of the control plane from the topology of the data plane.

3. Packet -- Switch interface
     How a packet identifies itself to a switch. The switch then uses this piece of information to do forwarding, thus actually implementing the connectivity of the network.
     Currently this is done by the packet header.

  The problem is that we currently don't distinguish the Host--Network interface from the Packet--Switch interface, which unnecessarily couples the implementation of network services (isolation, security, etc.) with the implementation of core connectivity.

The fabric architecture:
1. hosts, which ask for network services
2. edge switches, which implement network services, using current headers and protocols (e.g., IPv4)
3. core fabrics, which implement network connectivity, potentially using their own labels (like MPLS)

This is very much like the internal architecture of a modern switch.
(Two versions of) SDN should be introduced separately: to the edges for service management (complex), and to the core for connectivity management (very basic).
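A sketch of the edge/core split (the tables and the toy lookup are made-up examples, not Nicira's design):

    import ipaddress

    EDGE_IN = {ipaddress.ip_network("10.0.2.0/24"): 7}   # dst prefix -> fabric label
    CORE    = {7: "port3"}                                # label -> output port

    def edge_ingress(dst_ip):
        # Edge switch: classify on the IP header and attach an opaque,
        # MPLS-like label; services (ACLs, isolation) would live here too.
        addr = ipaddress.ip_address(dst_ip)
        for prefix, label in EDGE_IN.items():
            if addr in prefix:
                return label

    def core_forward(label):
        # Core fabric: forwards on the label alone, knows nothing about IP.
        return CORE[label]

    label = edge_ingress("10.0.2.15")
    print("core forwards label", label, "via", core_forward(label))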


Mesos: Fine-Grained Resource Sharing in the Data Center
NSDI'2011, UC Berkeley, Scott Shenker and Ion Stoica

Key Idea: Resource Offers
Instead of doing the scheduling itself, make resource offers and push scheduling decisions to the framework applications. E.g., Mesos offers two nodes with 8 GB of RAM; Hadoop decides whether to take this offer and which task to launch on it (see the sketch below).
A more traditional approach would be for applications to express their needs in a (specially designed) language, with a central scheduler scheduling based on those needs. But what if an application has needs which can't be expressed in such a language? Also, Hadoop already has scheduling logic, why not utilize it?
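A toy offer loop under my own simplification of the model (the dict layout and the scheduler policy are invented):

    # The master offers resources; each framework's scheduler decides
    # whether to accept and what to launch. Scheduling logic stays in
    # the framework, not in the master.
    offers = [{"node": "n1", "cpus": 4, "mem_gb": 8},
              {"node": "n2", "cpus": 2, "mem_gb": 4}]

    class HadoopScheduler:
        def resource_offer(self, offer):
            # Framework-side policy, e.g., only take big-memory nodes
            # (a real scheduler might prefer nodes holding local data).
            if offer["mem_gb"] >= 8:
                return {"task": "map-17", "node": offer["node"]}
            return None   # decline; the master re-offers it elsewhere

    sched = HadoopScheduler()
    for offer in offers:
        launch = sched.resource_offer(offer)
        print("declined" if launch is None else f"launched {launch}")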



The Datacenter Needs an Operating System
HotOS'2011, UC Berkeley, Scott Shenker and Ion Stoica

Think of the datacenter as the new computer, and think about the datacenter infrastructure problem from an OS perspective.

A datacenter OS needs to provide:
1. Resource sharing
    Hadoop already does scheduling between jobs.
    Unsolved: inter-framework sharing, sharing the network, independent services, and virtualization

2. Data sharing
    Currently done in the form of distributed file system
    Unsolved: standardized interfaces (like VFS?), performance isolation, etc.

3. Program abstractions
    including communication primitives

4. Debugging and Monitoring

Questions:
If we think of Hadoop as a form of data center OS, where does it fall short?



Location, Location, Location! Modeling Data Proximity in the Cloud
HotNets'2010, MSR and U Mich

Key Idea:
Insert a layer (which they call Contour) between the application and the key-value store, which reports to the application the latency of accessing a particular key.
To calculate this, the key-value store reports to Contour a replication topology for each key; Contour combines this information with network latencies, etc., to calculate update latency.
It suffers from a security problem, as it reveals too many details about the storage layer to the application.
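My guess at the shape of the Contour computation (the latency numbers, the per-key topology format, and the min/max rules are all made up):

    # Combine each key's replication topology with measured latencies to
    # estimate access latency from this client's point of view.
    net_latency_ms = {"us-east": 5, "us-west": 80, "eu": 120}
    replicas = {"user:42": ["us-east", "eu"]}   # per-key topology

    def read_latency(key):
        # A read can be served by the nearest replica.
        return min(net_latency_ms[r] for r in replicas[key])

    def update_latency(key):
        # A synchronous update must reach every replica.
        return max(net_latency_ms[r] for r in replicas[key])

    print(read_latency("user:42"), update_latency("user:42"))  # 5 120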