Categories
tech

How Babbar is crawling billions of webpages? (1/2)

Those of you who follow Babbar since the beginning already know that Babbar is the reunion of both Exensa and ix-labs teams, people that have known each other for up to 20 years but chose to work together only a few months back.

Today we start sharing some technical details about what Babbar does.

First, some of the building blocks of Babbar’s technology have been developed by Exensa. Exensa has been working more than two years on the analytics technology behind the “similar sites” feature in Babbar. Those of you who attended the iSwag conferences in 2015 and 2016 (organized by the ix-labs together with queduweb) may have seen the conferences presenting the algorithmic building blocks of this technology.

In order to test this technology, Exensa started crawling the web using an open source software. While efficient it was pretty “dumb”. Still, these early results were impressive and recently we all decided to take some time to build a service providing information about web linking a bit more serious than the already existing competition.

Since then, we started working on a more intelligent crawling technology and on the storage/indexing of all information collected during the crawl. We needed both aspects to be really efficient as we intended to store A LOT of data :).

The first idea we had was to use a distributed storage system. We benchmarked a few and decided to go with Apache Ignite at the time to prove the feasibility of our approach. It was a complete disaster, we could never reach reasonable performances, and this solution made it  very hard to control the data locality as required for efficient computing on this kind of data. A service that collects data and offers metrics but which is unable to compute them is not very useful…

We quickly shifted to handling the storage ourselves. After some tests and thinking about local storage solutions we chose RocksDB. It offers some really interesting features but is still far from perfect for our particular needs. We had to seriously “hack” the technology to achieve good performances especially on merging operators in Java. In the end it was the right choice, we now have the quality and the performance we were reaching for.

Then we focused on tackling the message distribution problem. It was much simpler than anticipated. We quickly reached the conclusion that the solution consisted in choosing between two already mature technologies: Kafka and Pulsar.

We chose Pulsar for its flexibility even if the robustness is not always spot on.

The architecture of our solution is quite simple:

  • The main cluster sends crawl requests through Pulsar (This, in fact, is a whole subject since we finely tuned the crawler to visit in priority high value pages. This will be further explained in another article).
  • The requests are received in a crawler node which handles the queueing by website and IP address (not to clog up crawled websites)
  • When the crawler fetches a webpage (60kb to 80kb on average based on our experience), it is parsed (HTML tree analysis), the HTML is cleaned and the result is sent, through Pulsar to one of the machines of the main cluster.
  • The main cluster extracts the links found within the webpage and the raw content of the page. At this stage the information is already smaller (We will talk about compression a bit later) but still between 10kb and 20kb per webpage. It may seem small enough but since we want to store information on a really high volume of webpages it is still too much.

The semantic information is obtained from our own vector embedding method. We developed it to be truly efficient and optimised. Basically, it does the same thing as Doc2Vec and similar methods, but only way more efficient :).

In the end we only keep a numeric vector representing the semantic orientation of the webpage.

Once all these first steps are done, things start to get serious, the “reactive” part of the main cluster will process the webpage’s information, first by storing it, then by sending through Pulsar all the outlinks from the webpage (since we want all the backlinks, we need to process every link a webpage receive). We also store additional information at this stage and the “reactive” part of the main cluster is done.

Next to the “reactive” part there is a “background activity” which continuously processes all the data doing two things: it computes the metrics and broadcasts them (these are graph metrics and they need to be broadcasted at each update) and also it pre-aggregates metrics (number of unique referring IPs for example) at the URL, host and domain level. It is also at this stage that Babbar collects the “IP to website” information.

Now you know everything there is to know about the computing side of Babbar. We will continue to unveil technical information in the next article in a few days.

Babbar is currently in its “early access” phase, feel free to contact us if you want to know more (through the website or @BabbarTech on twitter).

More to come pretty soon!