With tens of millions of users and more than a billion page views every day, Facebook ends up accumulating massive amounts of data. One of the challenges that we have faced since the early days is developing a scalable way of storing and processing all these bytes since using this historical data is a very big part of how we can improve the user experience on Facebook. This can only be done by empowering our engineers and analysts with easy to use tools to mine and manipulate large data sets.
About a year back we began playing around with an open source project called Hadoop. Hadoop provides a framework for large scale parallel processing using a distributed file system and the map-reduce programming paradigm. Our hesitant first steps of importing some interesting data sets into a relatively small Hadoop cluster were quickly rewarded as developers latched on to the map-reduce programming model and started doing interesting projects that were previously impossible due to their massive computational requirements. Some of these early projects have matured into publicly released features (like the Facebook Lexicon) or are being used in the background to improve user experience on Facebook (by improving the relevance of search results, for example).
We have come a long way from those initial days. Facebook has multiple Hadoop clusters deployed now - with the biggest having about 2500 cpu cores and 1 PetaByte of disk space. We are loading over 250 gigabytes of compressed data (over 2 terabytes uncompressed) into the Hadoop file system every day and have hundreds of jobs running each day against these data sets. The list of projects that are using this infrastructure has proliferated - from those generating mundane statistics about site usage, to others being used to fight spam and determine application quality. An amazingly large fraction of our engineers have run Hadoop jobs at some point (which is also a great testament to the quality of technical talent here at Facebook).
The rapid adoption of Hadoop at Facebook has been aided by a couple of key decisions. First, developers are free to write map-reduce programs in the language of their choice. Second, we have embraced SQL as a familiar paradigm to address and operate on large data sets. Most data stored in Hadoop's file system is published as Tables. Developers can explore the schemas and data of these tables much like they would do with a good old database. When they want to operate on these data sets, they can use a small subset of SQL to specify the required dataset. Operations on datasets can be written as map and reduce scripts or using standard query operators (like joins and group-bys) or as a mix of the two. Over time, we have added classic data warehouse features like partitioning, sampling and indexing to this environment. This in-house data warehousing layer over Hadoop is called Hive and we are looking forward to releasing an open source version of this project in the near future.
At Facebook, it is incredibly important that we use the information generated by and from our users to make decisions about improvements to the product. Hadoop has enabled us to make better use of the data at our disposal. So we'd like to take this opportunity to say, "Thank you" to all the people who have contributed to this awesome open-source project.
Joydeep is a Facebook Engineer