CS267 SoI

23 Jan 2016

About Me

I'm a first year master student at EECS. Prior to Berkeley, I have worked at Morgan Stanley and VMWare as software engineer and frontend engineer. So most of my time have been spent with JavaScript, DOM, CSS stuffs. I also write "backend" code like Scala and Java too.

I'm interested in a lot of topics – high performance web apps, distributed systems, data visualization, machine learning, etc.


Currently I'm trying to use machine learning techniques to learn the patterns of web pages. Not the text information of web pages, but the layout/design of it. We see all kinds of web pages all the time, but the layout are kind of not that random – they have their intrinsic patterns, for example, a title text is usually followed by several paragraphs of smaller text; a horizontal box on the top of a web page tends to contain a list of navigation elements; image element tends to be followed by a short description; with card layout becomes prevalent, boxes tend to be of the same width and has the same spaces in between. etc. Some of the patterns I'm not even aware of. So I want to use "big data" to learn and reveal those patterns for me.

Currently, I have a crawler that's collecting the DOM information of the top-million websites [1] ranked by Alexa. Since it needs to learn the DOM structure, so it's not as simple as an HTTP GET, there's plenty of work need to do. It also preserves the screenshots of all these web pages, hoping to find a correlation in the screenshots and the respective DOM trees. Or just do some simple analytics like what's the most popular color used in top-million websites, what will it look like when you cluster the top million website by color, etc. That can already generate a lot of interesting results.

My current approach on learning the pattern is to treat different DOM elements as different entities, and then a web page can be represented as a sequence of different DOM elements (with the tree hierarchy encoded). And then use LSTM (long short term memory neural networks) to fit the dataset I get.

I have about 100,000 web pages and their screenshots already and it takes about 60GB of space already, the required computation power is enormous too – my initial model would require an encoding space of several million, which on one hand needs some work to reduce the dimension (currently I'm considering similar techniques to Word2Vec), and on the other hand needs fast enough hardware to drive. So I hope to gain some skills from this course to cope with it.

Parallel computing is already used in the crawler. Since the crawler needs to contact remote servers, a lot of the crawling time is actually used in downloading the web page, and the associated resources like JavaScript, CSS, images and fonts, etc. If we do this sequentially, on one hand, it will only use one core, on the other hand even within one core, a lot of time is spent on waiting, which could actually be used to do computation or parallel fetching. So parallel crawling can make the whole process work like a pipeline.

In the screenshot clustering case, there's distributed K-means algorithm [2] which could be used to do compute the clustering in a parallel way.

As for LSTM, the parallelism mainly comes from GPUs, because matrix operations can be executed on GPUs in a highly parallel fashion thus gives about 10x speedup for LSTM training and evaluation.

Technology: NodeJS, PhantomJS, ES6, HTML, CSS, Python, Scala, etc.

[1]. https://www.domainnamenews.com/up-to-the-minute/alexa-releases-top-1-million-sites-free-download/3412

[2] Oliva, Gabriele, Roberto Setola, and Christoforos N. Hadjicostis. "Distributed k-means algorithm." arXiv preprint arXiv:1312.4176 (2013).