Review of "Paxos Made Simple"

The Paxos algorithm is widely used in fault-tolerant distributed systems, but it has long been regarded as difficult to understand. This paper makes it much easier for readers to understand the Paxos algorithm and the guarantees it provides.

A consensus algorithm guarantees the safety of values committed to the system; more specifically: 1) only a value that has been proposed may be chosen, 2) only a single value is chosen, and 3) a process never learns that a value has been chosen unless it actually has been. A Paxos run involves three kinds of agents: proposers, acceptors, and learners. In an implementation, a single process may act as more than one agent.

Paxos uses two phases to propose and accept a value. In the prepare phase, a proposer selects a proposal number n and sends a prepare request with number n to a majority of acceptors. When an acceptor receives a prepare request with number n greater than that of any prepare request to which it has already responded, it responds with a promise not to accept any more proposals numbered less than n, together with the highest-numbered proposal (if any) that it has accepted. In the second phase, the accept phase, if the proposer receives responses to its prepare request from a majority of acceptors, it sends an accept request to each of those acceptors for a proposal numbered n with a value v, where v is the value of the highest-numbered proposal among the responses, or any value of the proposer's choosing if the responses reported no proposals. When an acceptor receives an accept request for a proposal numbered n, it accepts it unless it has already responded to a prepare request having a number greater than n.
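
To make the acceptor's side of the two phases concrete, here is a minimal sketch in Python; the class and method names are mine rather than the paper's, and a real implementation would add durable storage and messaging.

class Acceptor:
    def __init__(self):
        self.promised_n = -1      # highest prepare number responded to
        self.accepted_n = -1      # number of the highest-numbered accepted proposal
        self.accepted_v = None    # value of that proposal

    def on_prepare(self, n):
        # Phase 1: promise not to accept proposals numbered less than n,
        # and report the highest-numbered proposal accepted so far, if any.
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted_n, self.accepted_v)
        return ("reject",)

    def on_accept(self, n, v):
        # Phase 2: accept unless a prepare with a higher number was promised.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_v = v
            return ("accepted", n, v)
        return ("reject",)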

Will this paper be influential in 10 years? Yes, I think so; it already has been for over a decade, and a lot of systems use it to build robust, scalable infrastructure.

Review of "ZooKeeper: Wait-free coordination for Internet-scale systems"

ZooKeeper is a service for coordinating the processes of distributed applications. It provides a simple, high-performance kernel for building complex coordination primitives at the client. It incorporates elements of group messaging, shared registers, and distributed locks in a replicated, centralized service.

ZooKeeper presents to the client the abstraction of a set of data nodes (znodes), organized in a hierarchical namespace similar to a UNIX file system. Clients can use znodes to store metadata for coordination purposes.

The interface exposed by ZooKeeper has the wait-free aspects of shared registers combined with an event-driven mechanism similar to the cache invalidations of distributed file systems. It also guarantees per-client FIFO execution of requests and linearizability for all requests that change ZooKeeper state. As an example of how these guarantees help applications, consider how a leader transition works. With ZooKeeper, the new leader can designate a path as the ready znode; other processes will only use the configuration when that znode exists. The new leader starts changing the configuration by deleting the ready znode, then updates the configuration znodes, and then recreates the ready znode. The wait-free property lets the leader issue the configuration changes without blocking on each one, which is much faster than performing them one at a time. The FIFO guarantee ensures that even though these requests are issued asynchronously, worker processes will not see the ready znode before all of the configuration changes.
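
Here is a minimal sketch of this ready-znode pattern using the kazoo Python client as a stand-in ZooKeeper client; it assumes a ZooKeeper server on localhost, the paths and configuration contents are illustrative, and watches and error handling are omitted.

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
zk.ensure_path("/config")

# New leader: hide the ready znode, rewrite the configuration, then recreate it.
if zk.exists("/config/ready"):
    zk.delete("/config/ready")
if zk.exists("/config/workers"):
    zk.set("/config/workers", b"worker-1,worker-2,worker-3")
else:
    zk.create("/config/workers", b"worker-1,worker-2,worker-3")
# Per-client FIFO ordering means no one can observe the recreated ready znode
# before the configuration writes above have been applied.
zk.create("/config/ready", b"")

# Worker: only act on the configuration once the ready znode exists.
if zk.exists("/config/ready"):
    data, _stat = zk.get("/config/workers")
    print(data.decode())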

Very interestingly, ZooKeeper uses a fuzzy snapshot scheme: it does not lock the data structure while snapshotting, but instead simply traverses the tree and records whatever state it finds. The reason this works is that state changes are idempotent, so a change can be applied twice and the final state will still be the same. The snapshot is used to speed up message replay when recovering from a crash.
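
A tiny illustration of why replaying idempotent changes over a fuzzy snapshot is safe, assuming (as the paper describes) that each logged transaction records the full new value and version rather than a relative change; the in-memory tree below is a toy of my own.

# znode path -> (value, version): a toy stand-in for the in-memory data tree
tree = {"/foo": ("f1", 1)}

def apply_txn(tree, path, new_value, new_version):
    # An idempotent "set data" transaction: it overwrites with the recorded
    # result rather than applying a relative change like "append" or "increment".
    tree[path] = (new_value, new_version)

apply_txn(tree, "/foo", "f2", 2)
apply_txn(tree, "/foo", "f2", 2)   # replayed again after a crash: same final state
assert tree["/foo"] == ("f2", 2)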

Will this paper be influential in 10 years? I think so. It tackles a tricky issue in distributed systems, namely how processes coordinate with each other, and it has already gained a lot of adoption in industry.

Review of "In Search of an Understandable Consensus Algorithm"

Consensus algorithms allow a collection of machines to work as a coherent group that can survive the failures of some of its members, which makes them a key building block for scalable, reliable systems. Paxos has dominated the discussion of consensus algorithms for the last decade, but Paxos is also known for its complexity and for being a poor foundation for practical implementations. This paper therefore rethinks consensus and comes up with a much simpler algorithm called Raft.

Typically, consensus algorithms ensure safety under all non-Byzantine conditions, remain available as long as a majority of the servers are functioning and can communicate, do not depend on timing for correctness, and can complete a command as soon as a majority of the cluster has responded.

The three main novel features of Raft are: 1) Strong leader: for example, log entries only flow from the leader to the other servers, which simplifies the management of the replicated log and makes the algorithm easier to understand. 2) Leader election based on randomized timers (sketched below), which adds only a small amount of mechanism to the heartbeats already required, while resolving conflicts simply and rapidly. 3) Membership changes: a joint consensus approach is used for changing the set of servers in the cluster, in which the majorities of the two different configurations overlap during the transition. Raft guarantees that the following properties hold at all times. Election safety: at most one leader can be elected in a given term. Leader append-only: a leader never overwrites or deletes entries in its log; it only appends new entries. Log matching: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through that index. Leader completeness: if a log entry is committed in a given term, then that entry will be present in the logs of the leaders of all higher-numbered terms. State machine safety: if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index.
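
Here is a minimal sketch of the randomized election timeout mentioned above; the 150-300 ms range follows the values the paper suggests, but the class and method names and the polling structure are mine.

import random
import time

class Follower:
    def __init__(self):
        self.reset_election_timer()

    def reset_election_timer(self):
        # Each server picks its timeout independently, so in most terms one
        # server times out first, wins the vote, and sends heartbeats before
        # anyone else starts a competing election.
        self.timeout = random.uniform(0.150, 0.300)
        self.deadline = time.monotonic() + self.timeout

    def on_heartbeat(self):
        # An AppendEntries RPC from the current leader resets the timer.
        self.reset_election_timer()

    def tick(self):
        # Called periodically; if no heartbeat arrived in time, start an election.
        if time.monotonic() >= self.deadline:
            self.become_candidate()

    def become_candidate(self):
        # In real Raft this increments the term, votes for itself, and sends
        # RequestVote RPCs; those steps are omitted in this sketch.
        print("timed out after %.0f ms; starting election" % (self.timeout * 1000))
        self.reset_election_timer()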

The single leader makes Raft much easier to implement and understand, but I'm wondering whether this will be enough for systems that want to scale out writes, in which case information would have to flow between followers as well.

Will this paper be influential in 10 years? Maybe; it does a lot to bridge the gap between the understandability of consensus algorithms and the growing need for them. But I'm not sure whether the restriction on information flow will become a bottleneck for low-latency systems.

Review of "CRDTs: Consistency without concurrency control"

Sharing mutable data at large scale has long been a difficult problem. Two approaches dominate in practice: one ensures scalability by giving up consistency guarantees (for instance, last-writer-wins); the other guarantees consistency by serializing all updates, which sacrifices scalability. But in some limited cases a radical simplification is possible: if concurrent updates to some data commute, then no matter in what order the replicas execute the updates, they will eventually converge. The paper calls such a design a Commutative Replicated Data Type (CRDT). It ensures that there are no conflicts and hence no need for consensus-based concurrency control. One thing to note is that this is not a universal solution, but the paper claims it is applicable to a surprisingly large number of applications.

A simple example of a CRDT is a set with a single add-element operation. A delete-element operation can be emulated by adding the element to a second set that records all deleted elements. Because adding and deleting elements are now commutative and idempotent, this data structure can easily scale out to many replicas.
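
A minimal sketch of this two-set idea (a grow-only add set plus a tombstone set, merged by set union); the class and method names are my own, and note that in this design a removed element can never be re-added.

class TwoSetCRDT:
    def __init__(self):
        self.added = set()
        self.removed = set()    # tombstones: elements that were deleted

    def add(self, x):
        self.added.add(x)       # idempotent and commutative

    def remove(self, x):
        self.removed.add(x)     # "delete" = add to the tombstone set

    def contains(self, x):
        return x in self.added and x not in self.removed

    def merge(self, other):
        # Union per set: the result is the same no matter the merge order.
        self.added |= other.added
        self.removed |= other.removed

a, b = TwoSetCRDT(), TwoSetCRDT()
a.add("x")
b.add("x")
b.remove("x")
a.merge(b)
assert not a.contains("x")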

As a more fully fledged example, the paper presents TreeDoc, a replicated data type for cooperative text editing that likewise provides consistency without concurrency control.

Although it is possible to design the underlying data structures so that concurrent updates can proceed without sacrificing consistency, designing such data structures usually requires much more effort than using more intuitive ones. So I'm not sure how willing people will be to adopt this approach, but it certainly points out a new possibility for certain types of applications.

Review of "Coordination Avoidance in Database Systems"

In order to achieve low latency and high scalability, systems tend to minimize coordination, i.e. blocking communication between concurrently executing operations. But uninhibited coordination-free execution can compromise application-level consistency. This paper shows that, given knowledge of the application's transactions and correctness criteria (invariants), it is often possible to avoid coordination while still preserving those criteria, gaining much more performance than coordinated execution.

Invariant confluence (I-confluence) is applied in a transactional context; informally, it requires that divergent but I-valid database states can be merged into a valid database state. Under I-confluence, coordination can be avoided only if all local commit decisions are globally valid, or, put differently, if commit decisions are composable. If two independent decisions to commit can result in an invalid merged state, then replicas must coordinate to ensure that only one of the commits succeeds.
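
A toy illustration of this intuition, using a non-negative account balance as the invariant; the merge-by-summing-deltas model below is my simplification for the sketch, not the paper's formal merge function.

def merged_balance(initial, delta_a, delta_b):
    # Merge the effects of two replicas' locally committed transactions.
    return initial + delta_a + delta_b

def invariant(balance):
    return balance >= 0

# Withdrawals: each replica's commit alone preserves the invariant ...
assert invariant(merged_balance(100, -80, 0))
assert invariant(merged_balance(100, 0, -80))
# ... but the merged state violates it, so withdrawals are not I-confluent
# with respect to "balance >= 0" and require coordination.
assert not invariant(merged_balance(100, -80, -80))

# Deposits always merge into a valid state, so they can run coordination-free.
assert invariant(merged_balance(100, 50, 70))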

Invariants are the key to I-confluence. By directly capturing application-level correctness criteria as invariants, I-confluence analysis identifies only "true" conflicts. This allows it to make a more accurate assessment of the required concurrency restrictions than related conditions such as commutativity. But it also depends on the correctness criteria specified by the user: if the user gets the invariants wrong, the analysis can reach the wrong conclusions as well.

Will this paper be influential in 10 years? Maybe. It provides a good way to reason about system consistency under concurrency, and in an age of ever larger systems it certainly has its areas of application.

Review of "Making Geo-Replicated Systems Fast as Possible, Consistent when Necessary"

As more and more online services are geo-replicated, cross-site consistency is becoming increasingly important for a wide range of systems. This paper proposes RedBlue consistency to meet this requirement.

RedBlue consistency is, in other words, a combination of strong consistency and eventual consistency: blue operations are executed with low latency and eventual consistency, whereas red operations are those that require transactions to be serialized across sites. This coexistence of multiple consistency levels gives the application the flexibility to choose the appropriate level of consistency for each operation, according to its business-logic requirements.

In order to use fast operations as much as possible and only resort to strong consistency when it is needed, the paper also identifies conditions delineating when operations can be blue and when they must be red, since RedBlue consistency does not ensure a priori that the system achieves state convergence and invariant preservation. The key idea is to split each original application operation into two parts: a generator operation with no side effects, which is executed only at the primary site against its current state and produces a shadow operation; and the shadow operation itself, which performs the actual state change and is executed at every site. Based on this, the paper gives a procedure for deciding which shadow operations must be red.
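
A minimal sketch of the generator/shadow split for a toy bank account; the function names and the blue/red labels in the comments reflect my reading of the paper's banking example, and all replication plumbing is omitted.

def deposit_generator(site_state, amount):
    # Generator: runs only at the primary site, has no side effects, and
    # decides what the replicated shadow operation should be.
    def shadow(state):
        state["balance"] += amount        # commutes with other shadows: can be blue
    return shadow

def withdraw_generator(site_state, amount):
    if site_state["balance"] >= amount:
        def shadow(state):
            # May break "balance >= 0" under concurrency: must be red.
            state["balance"] -= amount
        return shadow
    return lambda state: None             # insufficient funds: a no-op shadow

site_a = {"balance": 100}
site_b = {"balance": 100}
op = withdraw_generator(site_a, 60)       # decided once, against the primary's state
for site in (site_a, site_b):
    op(site)                              # the same shadow is applied at every site
print(site_a, site_b)                     # {'balance': 40} {'balance': 40}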

Will this paper be influential in 10 years? Maybe. Not only does it propose a consistency model flexible enough to cater to both low latency and strong consistency, it also gives practical guidance on how to build on this model so that applications run with low latency as often as possible.

Review of "Shielding applications from an untrusted cloud with Haven"

As cloud computing becomes more and more prevalent, more and more data is stored in the cloud, including sensitive data. Without special techniques, the security of that data depends very much on the cloud provider: both the humans involved and the software and hardware running the cloud infrastructure. We all know that such a dependency is not always dependable, so how do we cope with it? Haven is a prototype that achieves shielded execution of unmodified legacy applications, such as SQL Server and Apache, on a commodity OS and commodity hardware. It uses Intel SGX to defend against privileged code and physical attacks such as memory probes.

Two key mechanisms are used to provide shielded execution: Intel SGX and Drawbridge. Intel SGX protects the confidentiality and integrity of pages in an enclave, a region of user-mode address space. While cache-resident, enclave data is protected by CPU access controls. SGX also allows CPU-based attestation, so a remote system can verify cryptographically that specific software has been loaded within an enclave, establish shared secrets, and use them to bootstrap an end-to-end encrypted channel with the enclave. Besides protecting the contents and integrity of memory mappings, SGX also mediates transitions into and out of the enclave and protects the enclave's register file from OS exception handlers.

Drawbridge provides two core mechanisms: the picoprocess and the library OS (LibOS). A picoprocess is a secure isolation container constructed from a hardware address space, but with no access to traditional OS services or system calls; instead, a narrow ABI of OS primitives is provided, implemented by a security monitor. The LibOS is a version of Windows 8 refactored to run as a set of libraries within the picoprocess, depending only on this ABI. Together, the picoprocess and LibOS enable sandboxing of unmodified Windows applications with security comparable to virtual machines, but at much lower overhead. Building on Drawbridge and SGX, Haven enables mutual distrust between host and guest.

Will this paper be influential in 10 years? Maybe. It provides a great way to secure cloud computing, but if it also supported other OSes, like CoreOS and Linux, I think its prospects would be even better.

Review of "CryptDB: Protecting Confidentiality with Encrypted Query Processing"

Sensitive information is vulnerable to theft; data leaks, for example, are reported all the time. CryptDB is a system that provides practical and provable confidentiality in the face of such attacks for applications backed by SQL databases. It works by executing SQL queries over encrypted data using a collection of efficient SQL-aware encryption schemes. It also allows encryption keys to be chained to user passwords, so that a data item can be decrypted only through the password of a user with access to it.

The key techniques include: 1) a set of well-defined primitive operators, such as equality checks, order comparisons, aggregates, and joins; CryptDB encrypts each data item in a way that allows the DBMS to execute these operators on the transformed data, and symmetric-key encryption is used for efficiency. 2) Adjustable query-based encryption, which avoids revealing all possible encryptions of the data to the DBMS a priori; onions of encryption are used to adjust the encryption level of each item as queries require. 3) Chaining encryption keys to user passwords, so that each data item in the database can be decrypted only through a chain of keys rooted in the password of one of the users with access to that data.
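
A conceptual sketch of the onion idea: a value is wrapped in layers, and an outer layer is peeled off (by handing its key to the DBMS) only when a query needs a weaker but more functional layer. Fernet from the Python cryptography package is used here as a stand-in cipher for every layer; real CryptDB uses different SQL-aware schemes (RND, DET, OPE, and so on) at different layers, and my layer names are purely illustrative.

from cryptography.fernet import Fernet

inner_key = Fernet.generate_key()   # inner layer: stands in for DET (equality)
outer_key = Fernet.generate_key()   # outer layer: stands in for RND (strongest)

def wrap(value: bytes) -> bytes:
    # Build the onion: encrypt with the inner layer, then with the outer layer.
    return Fernet(outer_key).encrypt(Fernet(inner_key).encrypt(value))

def peel_outer(ciphertext: bytes) -> bytes:
    # "Adjusting" the onion: remove the outer layer so the DBMS can operate
    # on the inner layer, e.g. to run equality checks.
    return Fernet(outer_key).decrypt(ciphertext)

stored = wrap(b"alice@example.com")
inner_level = peel_outer(stored)                 # what the DBMS would work with
print(Fernet(inner_key).decrypt(inner_level))    # only the trusted proxy recovers plaintext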

CryptDB provides a group of efficient techniques to secure the data stored inside databases, which is a constant source of security problems in an ever more connected world. I think it will be influential in 10 years.