Review of "VL2: A Scalable and Flexible Data Center Network"

VL2 is a practical network architecture that scales to support huge data centers with uniform high capacity between any pair of servers, performance isolation between services, and Ethernet layer-2 semantics. It uses flat addressing to allow service instances to be placed anywhere in the physical network, Valiant Load Balancing to spread traffic uniformly across network paths, and end-system-based address resolution to scale to large server pools without complicating the network control plane. It does away with the expensive over-subscription model usually used to achieve high performance, and it is built on proven network technologies already available at low cost in high-speed hardware implementations.

By offering layer-2 semantics, VL2 lets servers believe they share a single large IP subnet with other servers in the same service, while eliminating the ARP and DHCP scaling bottlenecks. The network uses two separate address families: topologically significant Locator Addresses (LAs) and Application Addresses (AAs). When a server S wants to send a packet to server D, the VL2 agent on S traps the packet and encapsulates it with the LA of the destination's ToR switch. Once the packet arrives at the destination ToR, the switch decapsulates it and delivers it to the destination AA carried in the inner header. Address resolution and access control are handled in a similar way: the agent traps the request, queries the directory service for the corresponding information, and then decides how to proceed.
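The encapsulation path above can be sketched in a few lines. This is a toy model, not VL2's actual implementation: the directory contents, function names, and packet representation are all illustrative assumptions.

```python
# Toy sketch of VL2's agent-side encapsulation. The directory service
# maps an Application Address (AA) to the Locator Address (LA) of the
# ToR switch serving that AA. All names and addresses are made up.

DIRECTORY = {
    "10.0.0.7": "20.1.1.1",   # AA of server D -> LA of D's ToR
}

def encapsulate(src_aa, dst_aa, payload):
    """VL2 agent on the sender: trap the packet and wrap it with the
    LA of the destination's ToR, obtained from the directory service
    (this lookup replaces a broadcast ARP)."""
    dst_la = DIRECTORY[dst_aa]
    inner = {"src": src_aa, "dst": dst_aa, "payload": payload}
    return {"dst_la": dst_la, "inner": inner}

def decapsulate(packet):
    """Destination ToR: strip the outer LA header and deliver the packet
    to the AA carried in the inner header."""
    return packet["inner"]

pkt = encapsulate("10.0.0.3", "10.0.0.7", b"hello")
assert pkt["dst_la"] == "20.1.1.1"
assert decapsulate(pkt)["dst"] == "10.0.0.7"
```

The key point the sketch shows is that the core network only ever routes on LAs, so server placement (the AA) is decoupled from topology.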

Will this paper be influential in 10 years? Maybe. It proposes a good data center networking prototype that provides both good performance and flexibility. If it is also easy to operate, it would be a strong candidate for the next generation of data center networking infrastructure.

Review of "Neo4J"

Ironically, traditional relational databases don't actually handle relationships very well. They require the user to perform often-expensive JOINs to retrieve them, but in some cases relationships must be traversed in real time. Under this requirement, people turn to denormalization, which pushes the extra complexity onto the user. The newer NoSQL databases provide the required scalability and availability, but fail to provide relationships and require JOINs at the application level, which is even more trouble. Neo4J was created to handle the complex relationship storage that modern applications need. It's obvious that the most successful Internet companies today pay a lot of attention to relationships between entities: Google organizes web content by using links as relationships, and LinkedIn uses connections to form organic professional networks.

Neo4J has the advantage that the same model carries over from the whiteboard design to what you eventually see in the database, which eases communication friction within teams. And it handles relationships really well without incurring the several expensive joins that usually occur in traditional databases. But Neo4J is not a silver bullet: its major drawback is that it lacks the ability to run arbitrary queries the way a traditional relational database can.

Will Neo4J be influential in 10 years? I think so. Although not a panacea, it does fill the blank when it comes to the storage of graph data, and a lot of systems rely heavily on graph storage.

Review of "Tao: Facebook's Distributed Data Store For The Social Graph"

Tao is a distributed cache for Facebook's social graph. It was built to replace an old PHP data-access interface used to render Facebook's website. That interface essentially proxied the real MySQL database and let you retrieve data from memcache for faster reads. Tao is a new system built to replace this somewhat "ad-hoc" setup while providing the same interface to the web servers.

Facebook's workload is about 99.8% reads and 0.2% writes. This read-heavy I/O pattern makes it crucial to optimize reads heavily, while writes matter much less. Tao acts as a write-through cache for the underlying MySQL databases, unlike a look-aside cache, and so avoids the look-aside design's expensive writes. When you write to the datastore, Tao routes the write to the leader cache; the leader updates its own cache, asks the other cache servers to do a refill, and writes the change through to the database. This avoids the cost of reloading the entire cached data structure from the database, which would be huge if that structure is large. The reason they send refill requests instead of resending the data to replica caches is that it makes the refill request idempotent and saves bandwidth when the replica already has the data. Writes are acknowledged synchronously to the writer and propagated asynchronously to the other followers. When a read misses at a follower, the follower asks its leader to load the data.
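The write path described above can be condensed into a toy model. Class names and the synchronous refill are illustrative assumptions on my part; in Tao the refill happens asynchronously and the caches are sharded.

```python
# Toy model of Tao's write-through path: the leader updates its own
# cache, writes through to the database, and asks followers to refill.
# Names are illustrative; the refill here is synchronous for clarity.

class Follower:
    def __init__(self):
        self.cache = {}

    def refill(self, key, leader):
        # Refill is idempotent: the follower re-fetches the value, so
        # repeating the request (or already having the data) is harmless.
        self.cache[key] = leader.cache[key]

    def read(self, key, leader):
        if key not in self.cache:        # read miss -> ask the leader
            self.cache[key] = leader.cache[key]
        return self.cache[key]

class Leader:
    def __init__(self, db, followers):
        self.db, self.cache, self.followers = db, {}, followers

    def write(self, key, value):
        self.cache[key] = value          # update the leader's own cache
        self.db[key] = value             # write through to MySQL
        for f in self.followers:         # ask followers to refill
            f.refill(key, self)

db = {}
leader = Leader(db, [Follower(), Follower()])
leader.write("post:1", "hello")
assert db["post:1"] == "hello"
assert leader.followers[0].cache["post:1"] == "hello"
```

Note how the write never forces a full reload of the cached structure: only the changed key is refilled.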

Tao also provides high read availability: it can tolerate follower failures at the cost of potentially breaking read-after-write consistency, while leader failures are handled by routing read misses directly to the database (a potentially huge load increase on the DB?) and routing writes to another member of the leader tier.

Will this project be influential in 10 years? I think so. The giant cache architecture serves Facebook's infrastructure really well. Even though Facebook has all these intricate relationships between data entities, the system still provides amazingly high performance.

Review of "PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs"

Large-scale graph-structured computation is useful for applications from targeted advertising to natural language processing, so several graph-parallel abstractions, including Pregel and GraphLab, have been developed. However, these systems ignore the fact that natural graphs commonly found in the real world have highly skewed power-law degree distributions (a small fraction of the vertices connect to a large fraction of the graph). This paper proposes PowerGraph, which exploits the internal structure of graph programs to address the challenges of computing on natural graphs. Building on the PowerGraph abstraction, it introduces a new approach to distributed graph placement and representation that exploits the structure of power-law graphs, explicitly factoring computation over edges instead of vertices. As a consequence, it achieves substantially greater parallelism, reduces network communication and storage costs, and provides a highly effective approach to distributed graph placement.

A graph-parallel abstraction consists of a sparse graph G = {V, E} and a vertex program Q that is executed in parallel on each vertex v and can interact with neighboring instances through shared state or messages. To address the challenges of computation on power-law graphs, PowerGraph eliminates the degree dependence of the vertex program by directly exploiting the GAS (Gather, Apply, Scatter) decomposition to factor the vertex program over edges. By lifting the Gather and Scatter phases into the abstraction, PowerGraph retains the natural "think-like-a-vertex" philosophy while distributing the computation of a single vertex program over the entire cluster.
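A minimal sketch of the GAS decomposition, under my own simplifications: the vertex program just sums the values of in-neighbors, the graph is tiny, and the function names are illustrative rather than PowerGraph's actual API. The point is that the gather phase is a commutative, associative sum over a vertex's edges, which is exactly what lets PowerGraph split a high-degree vertex's work across machines.

```python
# Toy GAS (Gather-Apply-Scatter) pass: each vertex gathers the values
# of its in-neighbors, then applies the combined sum. Because gather
# results are combined with +, the per-edge work can be distributed.

edges = [(0, 1), (0, 2), (1, 2), (2, 0)]   # (src, dst)
values = {0: 1.0, 1: 2.0, 2: 3.0}

def gather(src, dst):
    # Runs once per in-edge; independent calls can run on different machines.
    return values[src]

def apply(vertex, acc):
    # Runs once per vertex on the combined gather result.
    return acc

# Gather phase: combine per-edge contributions with an associative sum.
acc = {}
for s, d in edges:
    acc[d] = acc.get(d, 0.0) + gather(s, d)

# Apply phase (the scatter phase is a no-op in this toy example).
new_values = {v: apply(v, acc.get(v, 0.0)) for v in values}
assert new_values == {0: 3.0, 1: 1.0, 2: 3.0}
```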

Will this paper be influential in 10 years? Yes, I think so. I like the fact that it is designed around a real-world data pattern (it takes the power-law degree distribution into account) and performs really well.

Review of "GraphX: Graph Processing in a Distributed Dataflow Framework"

Distributed dataflow frameworks are generally slower than specialized graph processing systems, which provide tailored programming abstractions and accelerate the execution of iterative graph algorithms. But graph analysis is usually just one part of a larger analytics process, and stitching several systems together to solve one problem is unappealing. Moreover, graph processing systems often abandon fault tolerance in favor of snapshot recovery. In contrast, general-purpose dataflow frameworks are good at analyzing unstructured and tabular data but fall short on iterative graph algorithms, which require multiple stages of complex joins, and they often miss the opportunity to leverage the common structural patterns of iterative graph algorithms. This paper argues that many of the optimizations in specialized graph processing systems can be recovered in a modern general-purpose distributed dataflow system. GraphX is introduced as an embedded graph processing framework built on top of Spark. To overcome the performance issues, GraphX recasts graph optimizations as distributed join optimizations and materialized view maintenance. And since it is built on Spark, it is automatically fault tolerant.

GraphX leverages a normalized representation of a property graph as a pair of vertex and edge property collections to embed graphs in a distributed dataflow framework. Graph-parallel computation is the process of computing aggregate properties of the neighborhood of each vertex, and it can be expressed in a distributed dataflow framework as a sequence of join stages and group-by stages punctuated by map operations. In the join stage, vertex and edge properties are joined to form the triplets view, consisting of each edge together with its source and destination vertex properties. In the group-by stage, the triplets are grouped by source or destination vertex to construct the neighborhood of each vertex and compute aggregates.
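The join + group-by pattern can be shown with plain Python collections standing in for Spark RDDs. The data and the particular aggregate (summing edge weights into each destination vertex) are my own toy choices, not GraphX's API.

```python
# Sketch of GraphX's triplets view and neighborhood aggregation as a
# join stage followed by a group-by stage, over plain collections
# standing in for distributed vertex and edge property collections.

vertices = {1: "a", 2: "b", 3: "c"}              # id -> vertex property
edges = [(1, 2, 5.0), (2, 3, 7.0), (1, 3, 1.0)]  # (src, dst, edge property)

# Join stage: attach source and destination vertex properties to each
# edge, producing the triplets view.
triplets = [(s, vertices[s], d, vertices[d], w) for s, d, w in edges]

# Group-by stage: group triplets by destination vertex and aggregate
# (here, sum the edge weights arriving at each vertex).
agg = {}
for s, s_prop, d, d_prop, w in triplets:
    agg[d] = agg.get(d, 0.0) + w

assert agg == {2: 5.0, 3: 8.0}
```

In GraphX these two stages become actual distributed joins and aggregations, which is what lets it reuse Spark's join optimizations and lineage-based fault tolerance.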

Will this paper be influential in 10 years? Yes, I think so. We already have so many processing frameworks; it's good to integrate graph analytics and tabular data analytics into one. And the paper also introduces a general way of building graph processing engines on other dataflow frameworks.

Review of "Towards a Unified Architecture for in-RDBMS Analytics"

As the use of statistical analysis in enterprise applications increases, database vendors have started implementing new statistical techniques from scratch inside the RDBMS, which leads to a lengthy and complex development process. This paper proposes Bismarck, a unified architecture that can implement a variety of machine learning analytics inside existing RDBMSs.

The paper identifies a classical algorithm from the mathematical programming canon, incremental gradient descent (IGD), which is used to solve convex programming problems and has a data-access pattern essentially identical to that of any SQL aggregation function, e.g., SQL AVG. Leveraging this observation, the paper builds a unified architecture showing that one can implement IGD for different models using the user-defined aggregate features available inside every major RDBMS. Analytical tasks that can be implemented this way include logistic regression, support vector machine classification, recommendation (LMF), labeling (CRF), Kalman filters, portfolio optimization, etc.
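To make the observation concrete, here is IGD phrased in the initialize / transition / terminate shape of a SQL user-defined aggregate, in Python rather than an actual UDA. The model (least-squares regression), learning rate, and data are my own toy choices; Bismarck plugs different gradient steps into this same skeleton.

```python
# IGD written in the three-callback shape of a SQL user-defined
# aggregate: one transition call per tuple, just like SQL AVG.
# Toy model: least-squares regression on two features.

def initialize():
    return [0.0, 0.0]                    # the model: one weight per feature

def transition(w, row, lr=0.1):
    x, y = row                           # one training example per tuple
    pred = w[0] * x[0] + w[1] * x[1]
    grad = pred - y                      # d/dpred of 0.5 * (pred - y)^2
    return [w[0] - lr * grad * x[0],
            w[1] - lr * grad * x[1]]     # one incremental gradient step

def terminate(w):
    return w                             # final aggregate value = the model

# One "aggregation" pass over the table = one epoch of IGD.
table = [((1.0, 0.0), 2.0), ((0.0, 1.0), 3.0)]
w = initialize()
for row in table:
    w = transition(w, row)
w = terminate(w)
assert abs(w[0] - 0.2) < 1e-9 and abs(w[1] - 0.3) < 1e-9
```

The data-access pattern is the whole trick: the RDBMS already knows how to stream tuples through a transition function, so the model trains wherever the data already lives.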

The paper shows that this approach is 2-4x faster than existing in-database analytics tools for simple tasks, and for some newly added tasks such as matrix factorization, an order of magnitude faster.

Although the method proposed by this paper is both easier to implement than existing approaches (building analytics tools from scratch for each RDBMS) and faster at solving the same problems, I have some doubts about how easy the programming interface for different machine learning models will be for end developers. Will analysts buy into programming these complex models in SQL, and are the models flexible enough to cover different requirements on different kinds of data?

Review of "Materialization Optimizations for Feature Selection Workloads"

One of the pressing challenges raised by the increased interest in data analytics is improving the efficiency of the feature selection process. This paper proposes Columbus, the first data-processing system designed to support the enterprise feature-selection process.

Columbus is an R language extension and execution framework designed for feature selection. To use it, a user writes a standard R program; Columbus provides a library of common feature selection operations such as stepwise addition, i.e., "add each feature to the current feature set and solve." The library mirrors the most common operations in the feature selection literature and what the authors observed in analysts' programs. Columbus's optimizer then uses these higher-level, declarative constructs to recognize opportunities for data and computation reuse, using the block as its main unit of optimization.
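The stepwise addition operation quoted above is easy to sketch. This is a Python toy, not Columbus's R API: the `loss` function is a made-up stand-in for actually fitting a model on a feature set.

```python
# Toy "stepwise addition": try adding each candidate feature to the
# current set, "solve" for each, and keep the best one. The loss
# table below stands in for real model fitting.

def loss(features):
    # Hypothetical fitted-model losses per feature set; lower is better.
    scores = {("f1",): 0.9, ("f2",): 0.5, ("f1", "f2"): 0.3}
    return scores.get(tuple(sorted(features)), 1.0)

def stepwise_add(current, candidates):
    """Add each candidate to `current`, solve, keep the best."""
    best = min(candidates, key=lambda f: loss(current + [f]))
    return current + [best]

assert stepwise_add([], ["f1", "f2"]) == ["f2"]
assert stepwise_add(["f2"], ["f1"]) == ["f2", "f1"]
```

The optimization opportunity Columbus targets is visible even here: each round re-solves many nearly identical problems, so subsampling, materialization, and warm-starting from a cached model can all pay off.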

This paper studies three novel classes of optimizations. 1) Subsampling, which reduces the amount of data the system has to process, to improve runtime or reduce overfitting; Columbus uses the coresets technique, which provides a provably small error when d << N. 2) Transformation materialization, which handles linear algebra decompositions such as the QR decomposition, widely used to optimize regression problems. 3) Model caching, which is used to warm-start tasks, because in feature selection people usually have to solve many similar problems.

Will this paper be influential in 10 years? Columbus is the first system to treat the feature selection dialogue as a database systems problem. Although I have my doubts about whether this should be treated as a DB systems problem, the different optimization techniques proposed by the paper do have their place.

Review of "Scaling Distributed Machine Learning with the Parameter Server"

As more and more large-scale optimization and inference problems arise, it becomes necessary to use a distributed framework to solve these machine learning problems. This paper proposes the Parameter Server framework for this purpose.

There are several features provided by this framework.

  1. All communication is asynchronous unless requested otherwise, and it is optimized for machine learning tasks to reduce network traffic and overhead.

  2. It provides a flexible consistency model that can allow some level of relaxation to better balance convergence rate and system efficiency.

  3. It provides elastic scalability in which new nodes can be added without restarting the running framework.

  4. It provides fault tolerance and durability.

  5. Globally shared parameters are represented as vectors and matrices to facilitate machine learning applications. By treating the parameters as sparse linear algebra objects, the parameter server can provide the same functionality as a (key, value) abstraction while also admitting important optimized operations such as vector addition, multiplication, 2-norm, and other more sophisticated operations.

A parameter server instance can run more than one algorithm simultaneously. Parameter server nodes are grouped into one server group and several worker groups. Server nodes communicate with each other to replicate and/or migrate parameters for reliability and scaling. Each worker group runs an application; a worker typically stores a portion of the training data locally to compute local statistics such as gradients. Worker nodes communicate only with server nodes, pushing and pulling parameters. Each worker group has a scheduler node, which assigns tasks to workers and monitors their progress.
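The push/pull pattern between workers and servers can be sketched in miniature. Everything here is illustrative: the real system shards keys across many server nodes, batches messages, and runs asynchronously, whereas this toy is a single synchronous server.

```python
# Minimal push/pull sketch: a worker pulls the shared parameters,
# computes gradients on its local data shard, and pushes them back.

class Server:
    def __init__(self, dim):
        self.w = [0.0] * dim                  # globally shared parameters

    def pull(self, keys):
        return [self.w[k] for k in keys]

    def push(self, keys, grads, lr=0.1):
        for k, g in zip(keys, grads):         # apply the gradient update
            self.w[k] -= lr * g

class Worker:
    def __init__(self, data):
        self.data = data                      # local shard of training data

    def step(self, server):
        keys = [0, 1]
        w = server.pull(keys)                 # pull current parameters
        # Least-squares gradient over the local shard (toy model).
        grads = [sum(x[k] * (w[0] * x[0] + w[1] * x[1] - y)
                     for x, y in self.data)
                 for k in keys]
        server.push(keys, grads)              # push local gradients

server = Server(2)
Worker([((1.0, 0.0), 1.0)]).step(server)
assert server.w == [0.1, 0.0]
```

Workers never talk to each other, only to the servers, which matches the communication structure described above.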

Will this paper be influential in 10 years? Not sure. On one hand, it provides a way to scale machine learning computations; on the other hand, it doesn't provide a flexible enough programming model for building different machine learning applications. It would be great if there were APIs for Java and Python too.