Review of "A Sampling Algebra for Aggregate Estimation"

First, sampling tables is generally useful when debugging a large query or when speeding up queries by trading off some accuracy. Two of the major issues are: 1) how to write such sampling SQL queries, and 2) how to estimate how accurate the result is. This paper defines the notion of Second Order Analytical equivalence (SOA equivalence), a key equivalence relationship between query plans that is strong enough to allow quantile analysis yet weak enough to ensure commutativity of sampling and relational operators. It also defines the GUS (Generalized Uniform Sampling) operator, which emulates a wide range of sampling classes and commutes with most relational operators under SOA equivalence. The paper then develops an algebra over GUS and relational operators that allows deriving SOA-equivalent plans, from which moments can be calculated for quantile estimation.

The high-level goal of this paper is a tool that computes confidence bounds for estimates based on sampling. It takes a query plan as used to execute the query and transforms it into an SOA-equivalent plan with a particular structure: all relational operators except the final aggregate form a subtree feeding a single GUS sampling operator, whose output feeds the aggregate that produces the final result. Confidence intervals are the end goal, and preserving expected value and variance is enough to guarantee the same confidence interval under both the CLT and Chebyshev methods. Thus the paper defines two query plans to be equivalent when their results have the same expected value and variance. Although this enables significant progress, the equivalence says nothing about intermediate results, so the paper extends the notion into the classic relational algebra.
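
To make that last point concrete, here is a minimal sketch (my own illustration, not code from the paper) of why the first two moments of an estimator are all that is needed to form a confidence interval, whether via the CLT or via Chebyshev's inequality:

```python
import math

def confidence_interval(estimate, variance, confidence=0.95):
    """Derive confidence bounds from an estimator's value and variance.

    Only the first two moments are used, which is why SOA equivalence
    (same expected value and variance) preserves confidence intervals.
    """
    # CLT method: assume the estimator is approximately normal.
    z = 1.96  # ~97.5th percentile of N(0,1), for a 95% two-sided interval
    clt_halfwidth = z * math.sqrt(variance)

    # Chebyshev method: distribution-free, hence wider.
    # P(|X - E[X]| >= k*sigma) <= 1/k^2, so coverage p needs k = 1/sqrt(1-p).
    k = 1.0 / math.sqrt(1.0 - confidence)
    cheb_halfwidth = k * math.sqrt(variance)

    return ((estimate - clt_halfwidth, estimate + clt_halfwidth),
            (estimate - cheb_halfwidth, estimate + cheb_halfwidth))

clt, cheb = confidence_interval(estimate=1.0e6, variance=4.0e8)
print("CLT 95% interval:      ", clt)
print("Chebyshev 95% interval:", cheb)
```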

The paper also provides the theoretical foundation for GUS quasi-operators and lays out the interactions between GUS and relational operators. Will this paper be influential in 10 years? I think so: it lays a solid foundation for efficient sampling in relational databases and for computing confidence levels over the results, a need that keeps growing as systems collect more data than can practically be queried exactly.

Review of "BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data"

Modern data analytics applications usually involve computing aggregates over a large number of records. Traditionally, such queries have been executed using sequential scans over a large fraction of a database. But increasingly, some applications require near real-time response rates; for example, updating ads based on trends in social networks. In such cases the application can usually tolerate a certain amount of error but requires a fast response. BlinkDB is a massively parallel approximate query engine for running interactive SQL queries on large volumes of data that provides fast responses with bounded errors, and it lets the user flexibly trade accuracy against response time.

It consists of two main modules: 1) sample creation and 2) sample selection. The sample creation module builds stratified samples on the most frequently used QCSs (query column sets: the columns appearing in WHERE, GROUP BY, and HAVING clauses) to ensure efficient execution even for queries on rare values. Sample creation is formulated as an optimization problem: BlinkDB chooses a collection of stratified samples, with total storage cost below a user-configurable threshold, that is neither over- nor under-specialized for the query workload.
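
As a toy illustration of the stratification idea (my own sketch, not BlinkDB's actual implementation): a uniform sample can miss rare groups entirely, whereas a stratified sample caps the rows taken per distinct QCS value, guaranteeing every group is represented:

```python
import random
from collections import defaultdict

def stratified_sample(rows, qcs, cap):
    """Keep at most `cap` rows per distinct value of the QCS columns.

    Rare groups keep all their rows; huge groups are downsampled,
    so every group survives into the sample.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in qcs)].append(row)
    sample = []
    for members in groups.values():
        sample.extend(members if len(members) <= cap
                      else random.sample(members, cap))
    return sample

rows = ([{"city": "NYC", "sales": random.random()} for _ in range(10000)]
        + [{"city": "Nome", "sales": random.random()}])  # one rare row
sample = stratified_sample(rows, qcs=("city",), cap=100)
print(len(sample))  # 101: Nome's single row is guaranteed to survive
```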

BlinkDB extends the Hive framework with two major components: 1) an offline sampling module that creates and maintains samples over time, and 2) a run-time sample selection module that builds an Error-Latency Profile (ELP) for queries. Samples are determined by the QCSs that appear in queries; once the QCSs are decided, BlinkDB uses distributed reservoir sampling or binomial sampling to create a range of uniform and stratified samples across a number of dimensions. At run time, the ELP decides which sample to run the query on, given the error and response time requirements.
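
Reservoir sampling, one of the techniques named above, keeps a uniform random sample of fixed size k from a stream without knowing its length in advance. The classic single-node version (my sketch; BlinkDB runs a distributed variant) looks like this:

```python
import random

def reservoir_sample(stream, k):
    """Algorithm R: uniform sample of size k from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i replaces a reservoir slot with probability k/(i+1),
            # which keeps every prefix uniformly sampled.
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), k=5))
```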

Will this paper be influential in 10 years? Yes: it provides a flexible mechanism for balancing response time against error rate, which is increasingly needed as big data systems proliferate.

Review of "F1: A Distributed SQL Database That Scales "

Previously, the AdWords system used sharded MySQL as its backend. But sharding has several issues. First, it is not transparent, so it places an extra burden on developers. Second, a shard server might become very hot, in which case the database must be manually resharded, which is very complicated. Cross-shard queries are difficult, and cross-shard transactions are not supported at all. So a sharded RDBMS is really not a solution for large-scale systems like AdWords. What about NoSQL systems like Bigtable? It turns out these also have their own problems, even though they provide high availability and scalability: Bigtable has no full-fledged transaction support, no secondary indexes, and no SQL query support, which matters for relatively complex business logic; it also provides no joins, which results in a lot of data islands. So the engineers developed F1, which has both the usability and transaction support of a traditional RDBMS and the scalability and availability of NoSQL systems. It is also designed to support both OLTP and OLAP workloads.

F1 is built on top of Spanner, a distributed, highly scalable storage system. Since Spanner controls the storage of all the data, one key design decision of F1 is that F1 servers hold no data themselves: all data is remote, and F1 acts as an orchestration layer on top of Spanner, providing the RDBMS features. This prevents F1 from exploiting data locality but keeps the implementation simpler, and several techniques mitigate the cost of having no local data. The hierarchical schema makes reads efficient, because data usually needed together is stored in one Spanner directory (bucket), enabling single range reads and parallel fetching. Support for Protocol Buffer column types also helps by storing structured data inline, avoiding many small joins.
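
A toy illustration (mine, with made-up table names) of why the hierarchical schema helps: if child rows carry their ancestor's primary key as a prefix, sorting by key physically interleaves each Customer with its Campaigns, so fetching a customer and all its children is one contiguous range read instead of a join:

```python
# Hypothetical keys in a hierarchical schema rooted at Customer.
# Child rows carry the parent's key as a prefix, so a lexicographic
# sort clusters each customer with all of its campaigns.
rows = [
    ("Customer", (1,)),
    ("Campaign", (1, 3)),
    ("Campaign", (1, 4)),
    ("Customer", (2,)),
    ("Campaign", (2, 5)),
]
rows.sort(key=lambda r: r[1])

# All data for customer 1 lies in one contiguous key range: (1,) <= key < (2,)
customer_1 = [r for r in rows if r[1][0] == 1]
print(customer_1)  # one range read fetches the customer and its campaigns
```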

Because Spanner uses synchronous replication, the system's latency is relatively high. But thanks to heavy use of hash repartitioning and optimized distributed query techniques, performance on complex queries is actually better than the sharded MySQL system's, and the availability and scalability are much better!

Besides these, F1 also has some creative features such as the Change History publish/subscribe mechanism and non-blocking schema updates. I think it really proves that you can have the usability of a traditional RDBMS and the scalability and availability of NoSQL systems at the same time, so I think it will be an influential paper in 10 years.

Review of "Spanner: Google’s Globally-Distributed Database"

Systems like Bigtable have drawn complaints that they are difficult to use for applications with complex, evolving schemas or those requiring strong consistency in the presence of wide-area replication. Spanner is a database that shards data across many sets of Paxos state machines in datacenters spread around the world, and it automatically reshards data across machines as the amount of data or the number of servers changes.

Spanner provides externally consistent reads and writes, and globally consistent reads across the database at a timestamp, enabled by assigning globally meaningful commit timestamps to distributed transactions. The key enabler of these timestamps is a new TrueTime API and its implementation. The API directly exposes clock uncertainty, and the guarantees on Spanner's timestamps depend on the bounds the implementation provides: if the uncertainty is large, Spanner slows down to wait it out. Using GPS and atomic clocks as references, the uncertainty is typically less than 10ms.
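
A heavily simplified sketch (my own, not Spanner's code) of the commit-wait idea: TrueTime returns an interval [earliest, latest] guaranteed to contain the true time, and a transaction stamped s is not made visible until s is guaranteed to be in the past on every clock:

```python
import time

EPSILON = 0.007  # assumed clock uncertainty bound; typically < 10ms per the paper

def tt_now():
    """TrueTime-style interval guaranteed to contain the true absolute time."""
    t = time.time()
    return (t - EPSILON, t + EPSILON)  # (earliest, latest)

def commit_wait(s):
    """Block until s is definitely in the past everywhere (TT.after(s))."""
    while tt_now()[0] <= s:   # while earliest <= s, s might still be "now"
        time.sleep(0.001)

# Assign the commit timestamp as the latest possible current time,
# then wait out the uncertainty before acknowledging the commit.
s = tt_now()[1]
commit_wait(s)
print("commit at", s, "is now externally visible")
```

The wait is roughly 2 * EPSILON, which is why shrinking clock uncertainty with GPS and atomic clocks directly reduces commit latency.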

A Spanner deployment is called a universe. Spanner is organized as a set of zones, the unit of administrative deployment. Each zone has a zonemaster and between one hundred and several thousand spanservers; the zonemaster assigns data to spanservers, which serve it to clients. A universe also has a universemaster and a placement driver, the latter of which moves data across zones.

Data is organized into directories, or more accurately buckets: sets of contiguous keys that share a common prefix. A directory is the unit of data placement. Spanner supports SQL-like queries plus extensions for protocol-buffer-valued fields, and every table is required to have an ordered set of one or more primary-key columns.

Will the paper be influential in 10 years? Yes: it proposes a very creative way to give the whole cluster a consistent view of the current time and uses it to achieve transaction consistency. The guarantees and rich features it provides also make it possible for other systems, like F1, to build on top of it.

Review of "Mining Modern Repositories with Elasticsearch"

Organizations are generating more and more data, at a rate that has exceeded their ability to analyze it, yet analyzing this data is crucial to the success of their businesses. ElasticSearch was developed to meet this need: it is a distributed full-text search engine that is scalable and provides near real-time query responses.

ElasticSearch (ES) is based on Apache Lucene; each ES index consists of one or more Lucene indices, called shards. When a new document is added to an index, the ES server decides which shard will be responsible for storing and indexing it, which lets ES automatically balance shards among the nodes in a cluster. ES exposes a REST API for communication with other applications. It is schema-less, so you can insert different types of documents into the same index, but you can also use a fixed mapping to better control the data inserted into an index, for example to disable indexing on certain fields. To balance insertion throughput against data visibility, ES provides a tunable fixed time interval for index refreshing. Searches can be expressed as filters or queries, where the latter also return relevance scores for the matched items.
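
To give a flavor of the REST interface, here is a minimal sketch using Python's requests library, assuming an ES node on localhost:9200; the index and type names are my placeholders, and the exact URL layout varies across ES versions:

```python
import requests

BASE = "http://localhost:9200"  # assumed local single-node cluster

# Index a document; no schema needs to be declared up front.
requests.post(f"{BASE}/commits/commit",
              json={"author": "alice", "message": "fix flaky test"})

# Force a refresh so the document is searchable immediately
# (normally visibility follows the tunable refresh interval).
requests.post(f"{BASE}/commits/_refresh")

# A "query" (as opposed to a filter) also returns relevance scores.
resp = requests.post(f"{BASE}/commits/_search",
                     json={"query": {"match": {"message": "flaky"}}})
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"])
```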

ES provides horizontal scalability and great performance compared to a traditional RDBMS on similar queries, while also providing much better agility. However, it has its weaknesses too, such as the lack of ACLs and the steep learning curve of its query language.

You can use ES to build modules that require search beyond the limits of a traditional RDBMS; it fills the gap between traditional RDBMSs and certain analytical requirements very well. But for a lot of analytics, especially where structured data is required, it is still better to use a big data processing framework and implement the logic in your own code. Still, ES serves unstructured full-text analytics very well, so it is a very good project.

Review of "Succinct: Enabling Queries on Compressed Data"

In traditional data stores, data scans incur high latency at large data sizes and have limited throughput, since queries typically touch all machines. One can use indexes instead, but indexes have a high memory footprint, as much as 8x the input size according to the paper. Thus the authors propose Succinct, a distributed data store that operates directly on compressed data, providing memory efficiency close to that of data scans and latency close to that of indexes.

The key idea is that Succinct stores an entropy-compressed representation of the input data that allows random access, enabling efficient storage and retrieval. The data representation also supports count, search, range, and wildcard queries without storing indexes: all the required information is embedded in the compressed data, and query execution does not require decompression. By doing this, Succinct achieves similar or stronger functionality with 10-11x lower memory than index-based data stores, while giving up only part of the throughput of traditional compression techniques.

More concretely, Succinct stores semi-structured data as compressed suffix arrays. The array of suffixes is compressed by exploiting the fact that the suffixes are ordered, together with a helper array called NextCharIdx; essentially, you only need to store each distinct character of the input once, along with its number of occurrences, and NextCharIdx provides all the information needed to reconstruct any suffix. Mappings between positions of suffixes and the original input are maintained by the AoS2Input and Input2AoS arrays. A naive binary search would require reconstructing a suffix at every step; Succinct instead aggressively exploits the two-dimensional NextCharIdx representation, so that for a string of length m the algorithm performs only 2(m - t - 1) binary searches rather than two full binary searches on the AoS2Input array.
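
For intuition, here is a toy sketch (mine) of the uncompressed baseline Succinct improves on: a plain suffix array, where binary search over sorted suffixes answers substring search. Succinct's contribution is keeping this structure compressed, via NextCharIdx and the AoS2Input/Input2AoS mappings, so the suffixes never have to be materialized:

```python
import bisect

def build_suffix_array(text):
    """Sorted order of all suffix start positions (naive O(n^2 log n) build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def search(text, sa, pattern):
    """All occurrences of pattern, via binary search on the sorted suffixes."""
    suffixes = [text[i:] for i in sa]  # materialized here; Succinct avoids this
    lo = bisect.bisect_left(suffixes, pattern)
    hi = bisect.bisect_right(suffixes, pattern + "\uffff")
    return sorted(sa[lo:hi])

text = "banana"
sa = build_suffix_array(text)
print(search(text, sa, "ana"))  # [1, 3]
```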

Will this paper be influential in 10 years? I think so: it creatively applies compressed suffix arrays to semi-structured data and achieves far better memory efficiency while maintaining performance similar to indexed stores. It is a great example of applying theoretical advances in algorithms to real-world problems!

Review of "Impala: A Modern, Open-Source SQL Engine for Hadoop"

Impala is an interactive SQL query engine built on top of Hadoop. It is 5-65x faster than Apache Hive, responding in seconds instead of minutes, and it runs natively against Hadoop/HBase storage and metadata, so there is no need to duplicate or synchronize data between multiple systems.

Impala supports most of the SQL-92 SELECT statement syntax, plus additional SQL-2003 analytic functions and most of the standard scalar data types.

An Impala deployment is comprised of three services. The daemon service impalad is dually responsible for accepting queries from client processes and orchestrating their execution across the cluster, and for executing individual query fragments on behalf of other Impala daemons; running alongside the HDFS DataNode process allows it to take advantage of data locality. The statestored service is Impala's pub-sub service, which disseminates cluster-wide metadata to all Impala processes. Finally, the catalogd service serves as Impala's catalog repository and metadata access gateway.

Applications communicate with Impala through ODBC or JDBC interfaces. An impalad takes the request, uses the query planner to find an optimal query plan, and then executes the plan across the nodes, each of which processes its local data to avoid network bottlenecks.
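
For instance, a client might issue a query over ODBC like this (a hedged sketch using the generic pyodbc module; the DSN name and the logs table are my placeholders, not from the paper):

```python
import pyodbc

# "Impala" is a hypothetical ODBC data source name configured to point
# at an impalad; any impalad in the cluster can coordinate the query.
conn = pyodbc.connect("DSN=Impala", autocommit=True)
cursor = conn.cursor()

# The coordinating impalad plans this query and farms fragments out to
# the other daemons, which scan their local HDFS blocks.
cursor.execute("SELECT page, COUNT(*) AS hits FROM logs GROUP BY page")
for page, hits in cursor.fetchall():
    print(page, hits)
```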

Will this paper be influential in 10 years? Yes, I think so. It provides a much more efficient way to do analytics over large-scale data, and since it is built on top of Hadoop, it can serve the large part of the big data community already using Hadoop MapReduce.

Review of "Dremel: Interactive Analysis of Web-Scale Datasets"

Dremel, the root of the later-developed Apache Hive, Cloudera Impala, and Apache Drill, was designed for interactive analysis of web-scale datasets, addressing the need for efficient analysis of large-scale data.

Dremel uses a columnar storage representation for nested data. Unlike traditional record/row-oriented storage, it is column-oriented: the same column of different records is stored together, which, combined with a tree-structured schema, lets it support strongly-typed nested records. But this raises the challenge of efficiently reassembling records from the columnar layout, which Dremel solves with a finite state machine.
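
A toy sketch (mine) of the flat case of column striping: the same field from every record is stored contiguously, so a query touching two columns never reads the others. Dremel's actual format additionally records repetition and definition levels so that nested and repeated fields can be striped and reassembled, which is what the finite state machine consumes:

```python
records = [
    {"url": "a.com", "hits": 10, "lang": "en"},
    {"url": "b.com", "hits": 3,  "lang": "de"},
    {"url": "c.com", "hits": 7,  "lang": "en"},
]

# Row-oriented: one list of whole records. Column-oriented: one array per field.
columns = {field: [r[field] for r in records] for field in records[0]}

# A query over (url, hits) reads just those two arrays; "lang" is never touched.
for url, hits in zip(columns["url"], columns["hits"]):
    print(url, hits)
```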

Dremel's query language is based on SQL and is designed to be efficiently implementable on columnar nested storage. Each SQL statement takes as input one or more nested tables and their schemas and produces a nested table and its output schema. The language supports nested subqueries, inter- and intra-record aggregation, top-k, joins, etc. Queries are executed using a multi-level serving tree: a root server receives incoming queries, reads metadata from the tables, and routes the queries to the next level in the serving tree, while the leaf servers communicate with the storage layer to access the data, as in the sketch below.
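
The serving tree works because aggregates like SUM and COUNT decompose into partial aggregates; each level combines the partials from the level below, as in this toy sketch (mine):

```python
def leaf(partition):
    """Leaf server: aggregate over the tablets it reads from storage."""
    return sum(partition)

def combine(children_results):
    """Inner/root server: merge partial aggregates from the level below."""
    return sum(children_results)

# SELECT SUM(x): the root's answer equals the sum of per-leaf partial sums.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
partials = [leaf(p) for p in partitions]   # computed in parallel at the leaves
print(combine(partials))                   # 45, same as summing all the data
```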

Will this paper be influential in 10 years? I think so. It innovatively combines a SQL query language with big data, enabling more efficient analytics over large amounts of data. Under its influence, we now have Apache Drill, Google BigQuery, Spark SQL, Apache Hive, and many other interactive query frameworks.