Review of "Managing Data Transfers in Computer Clusters with Orchestra"

20 Nov 2015

Problem: Cluster computing applications transfer large amounts of data between computation stages, and these transfers take up much of the job execution time. Traditional networking research, however, focuses mainly on per-flow traffic management. This paper proposes a global management architecture and a set of algorithms that improve the transfer times of common communication patterns and allow scheduling policies at the transfer level, such as prioritized transfers.

Key point & trade-off: Orchestra uses global coordination both within a transfer and across transfers, organized as a hierarchical control structure. The top level is an Inter-Transfer Controller (ITC) that implements scheduling policies, such as prioritizing transfers from ad-hoc queries over batch jobs. The ITC manages multiple Transfer Controllers (TCs), one for each transfer in the cluster. Each TC is responsible for selecting the mechanism to use for its transfer, and it actively monitors and controls the nodes participating in that transfer. TCs are meant to be used whenever the cluster needs to perform a data transfer. The main trade-off is extra complexity: there is another layer above the transport, and if it fails, the communication is aborted. Another possible trade-off is scalability, that is, how many TCs the ITC can schedule efficiently at the same time.
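
To make the ITC/TC hierarchy described above a bit more concrete, here is a minimal sketch in Python. It is my own illustration, not code from the paper: the class names, the `register` and `schedule` methods, and the numeric priority values are all assumptions. It only shows the idea of one controller per transfer plus a top-level controller that orders transfers by priority (lower number runs first, ties keep arrival order).

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass
class TransferController:
    """Hypothetical stand-in for a TC: one instance per transfer."""
    name: str
    kind: str  # e.g. "broadcast" or "shuffle" (labels are illustrative)


class InterTransferController:
    """Hypothetical stand-in for the ITC: orders transfers by priority."""

    def __init__(self):
        self._queue = []                   # heap of (priority, seq, tc)
        self._seq = itertools.count()      # tie-breaker: arrival order

    def register(self, tc, priority):
        # Lower priority value means "schedule earlier" (an assumption).
        heapq.heappush(self._queue, (priority, next(self._seq), tc))

    def schedule(self):
        # Drain the queue and return transfer names in scheduling order.
        order = []
        while self._queue:
            _, _, tc = heapq.heappop(self._queue)
            order.append(tc.name)
        return order


if __name__ == "__main__":
    itc = InterTransferController()
    itc.register(TransferController("batch-shuffle", "shuffle"), priority=1)
    itc.register(TransferController("adhoc-broadcast", "broadcast"), priority=0)
    itc.register(TransferController("batch-broadcast", "broadcast"), priority=1)
    print(itc.schedule())
```

In this sketch the ad-hoc transfer is scheduled ahead of both batch transfers even though it was registered second. The scalability concern above shows up here too: every transfer passes through the single queue held by the ITC.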

Will this paper be influential in 10 years? Maybe. It provides a good solution for speeding up communication in clusters. As long as the complexity can be kept under control and the design scales well, I think it is a very good choice for system-level optimization.