Review of "Materialization Optimizations for Feature Selection Workloads"

01 Nov 2015

One of the pressing challenges arising from the growing interest in data analytics is improving the efficiency of the feature selection process. This paper proposes Columbus, the first data-processing system designed to support the enterprise feature-selection process.

Columbus is an R language extension and execution framework designed for feature selection. To use it, a user writes a standard R program. Columbus provides a library of several common feature selection operations, such as stepwise addition, i.e., "add each feature to the current feature set and solve." The library mirrors the most common operations in the feature selection literature and those the authors observed in analysts' programs. The Columbus optimizer then uses these higher-level, declarative constructs to recognize opportunities for data and computation reuse, with the block as the main unit of optimization.
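To make the stepwise addition operation concrete, here is a minimal sketch in plain Python. It is not Columbus's R library or its optimizer: the function names (`solve_linear`, `least_squares_loss`, `stepwise_add`) and the toy data are made up for illustration, and each candidate is solved naively from scratch through the normal equations, with none of the reuse the paper is about.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def least_squares_loss(rows, y, features):
    """Fit y on the chosen feature columns; return the squared error."""
    cols = [[row[f] for row in rows] for f in features]
    gram = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(u * t for u, t in zip(ci, y)) for ci in cols]
    w = solve_linear(gram, rhs)
    preds = [sum(wk * row[f] for wk, f in zip(w, features)) for row in rows]
    return sum((p - t) ** 2 for p, t in zip(preds, y))


def stepwise_add(rows, y, current, candidates):
    """Add each candidate to the current feature set, solve, keep the best."""
    losses = {f: least_squares_loss(rows, y, current + [f]) for f in candidates}
    best = min(losses, key=losses.get)
    return best, losses


# Toy data (hypothetical): y is constructed as 2*x0 + 3*x2.
rows = [[1, 0, 1], [0, 1, 2], [1, 1, 0], [2, 1, 1]]
y = [5, 6, 2, 7]
best, losses = stepwise_add(rows, y, current=[0], candidates=[1, 2])
```

Every candidate here triggers an independent solve over the same rows, which is exactly the kind of repeated work a system that sees the whole operation declaratively has a chance to share.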

The paper studies three novel classes of optimizations. 1) Subsampling, which reduces the amount of data the system has to process in order to improve runtime or reduce overfitting. Columbus uses the coreset technique, which can provide a provably small error when d << N. 2) Transformation materialization, which handles linear algebra decompositions such as the QR decomposition, widely used to optimize regression problems. 3) Model caching, which warm-starts tasks, because in feature selection one usually has to solve many similar problems.
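The warm-start idea behind model caching can be illustrated with a small sketch. This is an assumption-laden toy rather than Columbus's implementation: the solver is plain gradient descent on least squares, the function name `gradient_descent` and the data are hypothetical, and the two feature columns are deliberately orthogonal so the effect is easy to see.

```python
def gradient_descent(rows, y, w_init, lr=0.1, tol=1e-6, max_iters=10_000):
    """Minimize sum((x.w - y)^2); return (weights, iterations used)."""
    w = list(w_init)
    for it in range(max_iters):
        residuals = [sum(wj * xj for wj, xj in zip(w, row)) - t
                     for row, t in zip(rows, y)]
        grad = [2 * sum(r * row[j] for r, row in zip(residuals, rows))
                for j in range(len(w))]
        if sum(g * g for g in grad) ** 0.5 < tol:
            return w, it
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w, max_iters


# Toy data (hypothetical): y = 10*x0 + 0.1*x1 with orthogonal columns.
rows = [[1, 1], [1, -1], [1, 1], [1, -1]]
y = [10.1, 9.9, 10.1, 9.9]

# Earlier task: fit on feature 0 only, and cache the resulting model.
w_prev, _ = gradient_descent([[row[0]] for row in rows], y, [0.0])

# New, similar task: feature 1 is added to the feature set.
w_cold, cold_iters = gradient_descent(rows, y, [0.0, 0.0])      # from scratch
w_warm, warm_iters = gradient_descent(rows, y, w_prev + [0.0])  # from the cache
```

Both runs reach the same weights, but the warm-started run begins much closer to the solution and stops after fewer iterations. How much is saved in practice depends on how similar consecutive problems are and on the solver, which this sketch does not try to model.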

Will this paper be influential in 10 years? Columbus is the first system to treat the feature selection dialogue as a database systems problem. Although I have my doubts about whether it should be treated as a database systems problem, the individual optimization techniques proposed in this paper do have their place.