Mark from UT-Austin
argues for more suitable abstractions for accelerators.
Accelerators != co-processors (and shouldn't be handled as such at the software layer). Even though in hardware the GPU is a co-processor, it cannot open files by itself (it cannot interrupt into the host hardware).
Thus: on-accelerator OS support + accelerator applications.
In the current model the GPU is a co-processor, and you have to manually do double-buffering, pipelining, etc., i.e., too many low-level details are exposed (9 CPU LoC for every 1 GPU LoC; see the sketch below).
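A minimal sketch of that host-side boilerplate, assuming a hypothetical process_kernel and chunk size; not the speaker's code, just the pattern the talk argues shouldn't be the programmer's job:

    // Manual double-buffering: alternate two device buffers/streams so the
    // copy of chunk i+1 overlaps the compute on chunk i. process_kernel and
    // CHUNK are hypothetical names.
    #include <cuda_runtime.h>

    #define CHUNK (1 << 20)

    __global__ void process_kernel(char *buf, size_t n) { /* ...compute... */ }

    void process_stream(char *host_data, size_t total) {  // host_data: pinned
        char *dev[2];
        cudaStream_t s[2];
        for (int i = 0; i < 2; i++) {
            cudaMalloc(&dev[i], CHUNK);
            cudaStreamCreate(&s[i]);
        }
        for (size_t off = 0, i = 0; off < total; off += CHUNK, i ^= 1) {
            size_t n = (total - off < CHUNK) ? (total - off) : CHUNK;
            // Same-stream ordering keeps each buffer safe to reuse.
            cudaMemcpyAsync(dev[i], host_data + off, n,
                            cudaMemcpyHostToDevice, s[i]);
            process_kernel<<<64, 256, 0, s[i]>>>(dev[i], n);
        }
        cudaDeviceSynchronize();
        for (int i = 0; i < 2; i++) { cudaFree(dev[i]); cudaStreamDestroy(s[i]); }
    }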
The GPU has three levels of parallelism (see the sketch after this list):
1. multiple cores on the GPU
2. multiple contexts on a core (to compensate for memory access latency)
3. SIMD vector parallelism, i.e., multiple ALUs via data parallelism
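A toy CUDA launch just to make the three levels concrete (saxpy is illustrative, not from the talk):

    // 1. Blocks are distributed across the GPU's cores (SMs).
    // 2. Each SM keeps many warps resident and switches among them when one
    //    stalls on memory -- the "multiple contexts" level.
    // 3. The 32 threads of a warp run in lockstep on the SIMD ALUs.
    __global__ void saxpy(float a, const float *x, float *y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one SIMD lane each
        if (i < n) y[i] = a * x[i] + y[i];
    }

    // 1024 blocks (level 1), 256 threads = 8 warps per block (levels 2 and 3):
    // saxpy<<<1024, 256>>>(2.0f, x, y, n);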
The GPU accesses its local memory ~20x faster than CPU memory (200 GB/s vs. 16 GB/s over PCIe); consistency between the two is also compromised.
File system API design (see the sketch after this list):
1. Threads in the same SIMD group are disallowed from opening different files (all of them collaboratively execute the same API call), to avoid the divergence problem.
2. gopen() is cached on the GPU: opening the same file returns the same file descriptor, so the offset is shared, and read/write must therefore specify the offset explicitly: gread() = pread(), gwrite() = pwrite().
3. When to sync? It can't be done asynchronously because the GPU can't have preemptive threads, and CPU polling is too inefficient. So they require an explicit sync (otherwise data never gets to disk!).
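A hedged sketch of what a kernel using this API could look like; the declarations below paraphrase the notes, and the real GPUfs signatures and flag names may well differ:

    // GPUfs-style declarations, paraphrased -- treat as assumptions.
    #define G_RDONLY 0  /* hypothetical flag name */
    __device__ int    gopen(const char *path, int flags);
    __device__ size_t gread(int fd, size_t offset, size_t size, void *buf);
    __device__ int    gclose(int fd);

    __global__ void scan_file(const char *path) {
        __shared__ char buf[4096];
        // Every thread in the group makes the same call with the same
        // arguments (rule 1): the API is executed collaboratively.
        int fd = gopen(path, G_RDONLY);
        // Offsets are explicit, pread-style, since the descriptor (and any
        // implicit offset) is shared by all openers of the file (rule 2).
        size_t off = (size_t)blockIdx.x * sizeof(buf);
        gread(fd, off, sizeof(buf), buf);
        /* ... process buf ... */
        gclose(fd);
    }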
GPUfs design:
1. System-wide buffer cache with AFS close(sync)-to-open consistency semantics (but per block instead of per file).
2. Buffer-page false sharing (when the GPU and CPU write to different offsets of the same page).
3. A client (GPU) / server (CPU) RPC system (see the sketch after this list).
4. L1 bypassing is needed to make the whole thing work (the GPU's L1 caches are incoherent, so cached reads could miss the other side's updates).
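A minimal sketch (not GPUfs code) of that GPU-client / CPU-server RPC pattern, assuming a request slot in host-mapped pinned memory; the volatile accesses play the role of the L1 bypass, since otherwise the GPU could spin on a stale cached value forever:

    // One request slot; a real system needs a slot (or ring) per client.
    // The slot lives in host-mapped memory (cudaHostAlloc with
    // cudaHostAllocMapped) so both sides can see it.
    struct Rpc {
        volatile int ready;      // 0 = idle, 1 = request posted, 2 = served
        int          op;         // e.g. a hypothetical OP_READ
        size_t       offset, size;
    };

    __device__ void rpc_call(volatile struct Rpc *r, int op,
                             size_t off, size_t sz) {
        if (threadIdx.x == 0) {          // one representative thread per block
            r->op = op; r->offset = off; r->size = sz;
            __threadfence_system();      // make the request visible to the CPU
            r->ready = 1;
            // The CPU server polls for ready==1, performs the actual
            // pread/pwrite on the host, then sets ready=2.
            while (r->ready != 2) { }
        }
        __syncthreads();                 // the rest of the block waits here
    }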
Q&A:
1. Why a file system abstraction instead of a shared memory abstraction? (You don't need durability on the GPU anyway.) Because if you have weak consistency on memory, it's hard to program; but if you have weak consistency on a file system, it's no big deal.