Two years after leaving Facebook, I still often get asked what kind of tools data people were using there. Below, I have tried to summarize the main tools that were used to explore and work with datasets at the time:
Scuba: Scuba offered the slice-and-dice functionality typically handled by a pivot table or some cube-like structure, just at a significantly larger scale and in real time. Its downside was that the data displayed in the tool was not always fully accurate. Further description of the tool is available in the following paper.
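For readers who have never used a slice-and-dice tool, the core operation is conceptually the same as a pivot/group-by aggregation. The sketch below (pandas on a toy dataframe, purely illustrative and not Scuba's actual interface) shows the kind of breakdown Scuba could compute interactively, over far larger, real-time data.

```python
import pandas as pd

# Toy event log standing in for the kind of real-time data Scuba ingested.
events = pd.DataFrame({
    "country":    ["US", "US", "FR", "FR", "BR"],
    "platform":   ["ios", "android", "ios", "ios", "android"],
    "latency_ms": [120, 340, 95, 110, 450],
})

# "Slice and dice": break a metric down by arbitrary dimensions.
pivot = events.pivot_table(
    index="country",
    columns="platform",
    values="latency_ms",
    aggfunc="mean",
)
print(pivot)
```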
Dataswarm: A data workflow automation and scheduling platform, and the predecessor to Airflow. Like Airflow, it is centered around the concept of directed acyclic graphs (DAGs). At Facebook, Dataswarm was the standard way to automate anything that required batch data pipelines. It allowed running multi-step pipelines that interacted with different platforms or programming languages through the concept of operators, and it made the dependencies between the different processing steps explicit. Some are available here
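Dataswarm itself was never open-sourced, but since Airflow descends from it, a minimal Airflow DAG gives a reasonable flavor of the operator and dependency model described above. The pipeline, task names, and schedule below are made up for illustration.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def load_to_warehouse():
    # Placeholder for a step written in Python (e.g. loading cleaned data).
    pass


with DAG(
    dag_id="daily_events_pipeline",      # hypothetical pipeline name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Each operator wraps one step, possibly targeting a different platform or language.
    extract = BashOperator(
        task_id="extract_logs",
        bash_command="echo extracting logs",
    )
    load = PythonOperator(
        task_id="load_to_warehouse",
        python_callable=load_to_warehouse,
    )

    # Dependencies between steps are declared explicitly, forming the DAG.
    extract >> load
```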
Deltoid: Deltoid, through its numerous iterations, offered a standard A/B testing platform for the reporting and analysis of the different experiences being tested at Facebook. It allowed for easy ingestion and analysis of experiments based on user allocation or exposure, providing per-metric estimates and confidence intervals of a test's impact, checks for group imbalance, and much more.
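Deltoid was internal, but the core computation it ran per metric, an estimate of the treatment effect with a confidence interval plus a sanity check that the observed group sizes match the intended allocation, can be sketched with standard libraries. All numbers below are simulated.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated per-user metric values for the control and test groups.
control = rng.normal(loc=10.0, scale=3.0, size=10_000)
test = rng.normal(loc=10.2, scale=3.0, size=10_000)

# Impact estimate: difference in means with a 95% confidence interval.
diff = test.mean() - control.mean()
se = np.sqrt(test.var(ddof=1) / len(test) + control.var(ddof=1) / len(control))
ci = (diff - 1.96 * se, diff + 1.96 * se)
print(f"impact estimate: {diff:.3f}, 95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")

# Group-imbalance check: does the observed split match a 50/50 allocation?
counts = [len(control), len(test)]
expected = [sum(counts) / 2] * 2
chi2, p_value = stats.chisquare(counts, expected)
print(f"sample ratio mismatch p-value: {p_value:.3f}")
```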
Scribe & Puma: Scribe was a log server for aggregating log data streamed in real time. Data from Scribe was then fed into Scuba, into Hive tables, or into Puma, a real-time computation engine supporting SQL-like commands.
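Puma's actual interface was internal and SQL-like, so the following is only a rough Python sketch of the kind of real-time work it did: counting events per key over a tumbling time window as they stream in. The event stream here is simulated rather than read from Scribe.

```python
import time
from collections import Counter, defaultdict

WINDOW_SECONDS = 60  # tumbling window size, chosen arbitrarily for the example


def window_start(ts: float) -> int:
    """Bucket a timestamp into the start of its tumbling window."""
    return int(ts // WINDOW_SECONDS) * WINDOW_SECONDS


def aggregate(stream):
    """Count events per (window, event name), like a GROUP BY over a time window."""
    counts = defaultdict(Counter)
    for ts, event_name in stream:
        counts[window_start(ts)][event_name] += 1
    return counts


# Simulated log lines as (timestamp, event_name) pairs, standing in for Scribe data.
now = time.time()
stream = [(now + i, "page_view" if i % 3 else "like") for i in range(10)]
for window, counter in aggregate(stream).items():
    print(window, dict(counter))
```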
Hive/Presto and Hipal/DaggerDS/Daiquery: These were the then de facto ways to analyze datasets at Facebook. Hive and Presto are both SQL-like frameworks/computation engines on top of Hadoop, while Hipal, DaggerDS, and Daiquery were web interfaces for running queries on them. Hive and Presto were both open-sourced by Facebook, while Airpal is an open-source successor to Hipal.
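Hipal and Daiquery were browser front-ends, but the same kind of query can be run programmatically against an open-source Presto cluster. Below is a minimal sketch using the open-source presto-python-client; the hostname, table, and partition names are made up.

```python
import prestodb

# Connection details are placeholders; point them at your own Presto coordinator.
conn = prestodb.dbapi.connect(
    host="presto.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)

cur = conn.cursor()
# A typical exploratory aggregation over a (hypothetical) Hive table.
cur.execute("""
    SELECT country, COUNT(*) AS n_events
    FROM events_daily
    WHERE ds = '2023-01-01'
    GROUP BY country
    ORDER BY n_events DESC
    LIMIT 10
""")
for country, n_events in cur.fetchall():
    print(country, n_events)
```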
FBLearner: Machine learning pipelines were handled in a way similar to most data pipelines at Facebook, except that they used some additional machine-learning-specific operators and a different management UI. More info on FBLearner is available here
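FBLearner's operator DSL is internal, so the sketch below only illustrates the general idea the paragraph describes: ML steps expressed as reusable operator-like functions wired into a pipeline. It uses plain Python and scikit-learn, and all names are invented.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


# Each function plays the role of an "operator": a reusable, self-contained step.
def prepare_data():
    X, y = load_iris(return_X_y=True)
    return train_test_split(X, y, test_size=0.2, random_state=0)


def train_model(X_train, y_train):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    return model


def evaluate(model, X_test, y_test):
    return accuracy_score(y_test, model.predict(X_test))


# The "pipeline" wires the operators together; a platform like FBLearner would also
# schedule these steps and surface their runs and outputs in a management UI.
X_train, X_test, y_train, y_test = prepare_data()
model = train_model(X_train, y_train)
print(f"accuracy: {evaluate(model, X_test, y_test):.3f}")
```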
Argus & Unidash: Dashboarding was handled by web dashboards that could query databases and Hive tables directly. They offered the basic functionality that Apache Superset offers nowadays.