
How does Babbar crawl billions of webpages? (2/2)

In part 1 of this article, you learned how Babbar crawls the web, extracts links and computes its metrics (more to come on this subject soon). By now, you know almost everything there is to know about Babbar’s computing side.

Let’s now talk about how Babbar makes all this data available to its clients.

To make the data available efficiently, there is a “serving” side of Babbar that requires additional work on the database front, as well as a few acrobatic feats within the cluster. We put a strong effort into a step of the utmost importance: compression.

To give you a rough idea, let’s take 10 billion pages as a big, round number. When Babbar crawls 10 billion pages, it collects around 200 billion external links. It also gets twice as many internal links, but those are stored differently (more on this later as well).

An external link is, at the very least, two URLs plus all the metrics, an anchor text, a date, etc. Knowing that a URL is 90 characters on average, storing all of this without any compression becomes totally intractable for cost reasons.
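A quick back-of-envelope computation makes the scale concrete, using the figures from the article (200 billion links, two 90-character URLs per link). The 100-byte metadata estimate is an illustrative assumption, not a Babbar figure:

```python
# Rough uncompressed storage estimate for the external-link data.
external_links = 200_000_000_000  # from the article
avg_url_len = 90                  # bytes per URL, from the article
urls_per_link = 2                 # source and target URL
metadata = 100                    # bytes for anchor, date, metrics (a guess)

bytes_per_link = urls_per_link * avg_url_len + metadata
total_tb = external_links * bytes_per_link / 1e12
print(f"{bytes_per_link} B/link, ~{total_tb:.0f} TB uncompressed")
```

Tens of terabytes just for raw link records, before any index or replication, is what makes aggressive compression non-negotiable.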

The first challenge was to write an algorithmic solution that compresses URLs as fast and efficiently as possible. This first compression step divides by three the size of the data that needs to be handled for all the computations. After that, there is another compression step, done by the database, that relies on the many repetitions in the data.
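The article does not describe Babbar’s actual URL compression algorithm, but one classic technique that exploits the long shared prefixes of URLs from the same host is front coding: sort the URLs, then store for each one only the length of the prefix it shares with its predecessor plus the differing suffix. A minimal sketch:

```python
def front_code(urls):
    """Delta-encode a sorted list of URLs: each entry is
    (shared_prefix_length, suffix). URLs on the same host share
    long prefixes, so the suffixes are typically short."""
    urls = sorted(urls)
    out, prev = [], ""
    for u in urls:
        shared = 0
        for a, b in zip(prev, u):
            if a != b:
                break
            shared += 1
        out.append((shared, u[shared:]))
        prev = u
    return out

def front_decode(coded):
    """Reverse front_code: rebuild each URL from the previous one."""
    urls, prev = [], ""
    for shared, suffix in coded:
        prev = prev[:shared] + suffix
        urls.append(prev)
    return urls
```

This is only one possible scheme; a production crawler would combine something like it with dictionary coding of hosts and a general-purpose compressor, tuned for decompression speed.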

It is impossible for us to finish this article without mentioning the compression of semantic vectors, which we are really proud of.

A semantic vector is around 1 kB and is very hard to compress. The problem is that, in order to compute the semantic value, we have to include this vector in the information we send for each outlink (that’s 200 billion links at 1 kB per link for 10 billion pages).
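The arithmetic shows why shipping raw vectors with every outlink is intractable, using the article’s own figures:

```python
# 200 billion links, each carrying a ~1 kB semantic vector.
links = 200_000_000_000
vector_bytes = 1_000

raw_tb = links * vector_bytes / 1e12
print(f"~{raw_tb:.0f} TB of vector data in flight")
```

Hundreds of terabytes of vector payload, regenerated and moved around as the crawl progresses, dwarfs the link data itself.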

This is totally unrealistic from an operational point of view, so we had to find a way to compress these vectors. We tried existing solutions such as Product Quantization, Scalar Quantization, or representing vectors using words. All these techniques were very effective but way too costly to deploy in Babbar.

After a lot of trial and error, we finally found the very efficient solution we were looking for. It is based on scalar quantization but uses a few tricks regarding the shape of the vectors. Using this technique, a vector can be compressed down to 24 bytes (instead of 1 kB!) while maintaining 70% of the characteristics of the original vector (for specialists, 70% corresponds to the Recall@100 of a nearest-neighbor search done using these vectors).
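To illustrate the scalar-quantization idea this scheme builds on (the additional “shape” tricks that get down to 24 bytes are not public, so this sketch stops at plain uniform quantization):

```python
def quantize(vec, bits=8):
    """Uniform scalar quantization: map each float in [lo, hi] to an
    integer code of `bits` bits, keeping only the codes plus the
    (lo, scale) pair needed to decode. For a 256-dim float32 vector
    (1 KiB), 8-bit codes already give a 4x reduction."""
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in vec]
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Approximate reconstruction; error is bounded by scale / 2."""
    return [lo + c * scale for c in codes]
```

Getting from the 4x of plain 8-bit quantization to the ~40x the article reports requires exploiting the statistics of the vectors themselves, which is where the undisclosed tricks come in.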

Putting all of this together, we store less than 4 kB of data on average for each crawled page, including the 30% overhead due to the database (RocksDB is based on an LSM tree, which is extremely efficient but has a slight storage overhead, more than compensated by the possibility of compressing data blocks).

Now you know everything there is to know about how Babbar crawls, analyzes and stores tens of billions of webpages. Babbar is currently in its “early access” phase; feel free to contact us if you want to know more (through the website or @BabbarTech on Twitter).

More to come pretty soon!