How Facebook Handles Data in the Big Data World
Before starting with this, let's first understand what Big Data is and how big it really is.
Big data usually includes data sets with sizes beyond the ability of commonly used software to store, manage, and process within a tolerable elapsed time. Big data "size" is a constantly moving target, ranging from a few dozen terabytes to many zettabytes of data.
Big data can be described by the following characteristics:
- Volume — The size of data plays a very crucial role in determining its value. Whether a particular data set can actually be considered Big Data also depends on its volume.
- Variety — Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining, and analyzing data.
- Velocity — The term 'velocity' refers to the speed at which data is generated. How fast data is generated and processed to meet demand determines its real potential.
- Variability — This refers to the inconsistency the data can show at times, which hampers the ability to handle and manage it effectively.
HOW IS DATA INCREASING?
- SOCIAL MEDIA — Social media is a place where people connect with each other online and share their emotions and journeys through images, audio, videos, etc. Social media is one of the biggest contributors to Big Data. Platforms like Instagram, Facebook, and WhatsApp collect a lot of data, such as personal details, pictures, likes or reactions, etc.
- FACEBOOK — Facebook is a social media platform that had almost 2.7 billion active users as of the second quarter of 2020. Facebook generates 4 petabytes of data per day. People can chat and upload images, videos, etc. on Facebook.
- GOOGLE SEARCH ENGINE — Google is a search engine that has 4 billion users and processes 3.5 billion searches per day; if we break this down, it processes about 40,000 searches per second on average. Google processes approximately 20 petabytes of data per day through an average of 100,000 MapReduce jobs spread across its massive computing clusters.
- INTERNET OF THINGS (IoT) — IoT connects devices and makes them smarter. Nowadays we have smart A.C.s, smart rooms, etc. Due to IoT, a humongous amount of data is generated. It is estimated that by 2025 there will be 41.6 billion connected IoT devices generating data.
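The per-second figures above follow directly from the daily totals; a quick sanity check in Python:

```python
# Back-of-the-envelope check of the Google search-rate figures above.
searches_per_day = 3.5e9          # ~3.5 billion searches per day
seconds_per_day = 24 * 60 * 60    # 86,400 seconds in a day

searches_per_second = searches_per_day / seconds_per_day
print(f"{searches_per_second:,.0f} searches per second")  # ~40,509
```

So "3.5 billion per day" and "40,000 per second" are the same figure expressed at different time scales.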
There are many more sources due to which data keeps increasing. This growth brings several challenges:
- STORING THE DATA — Data arrives in huge volumes, and where to store it is a big issue. Storing such a huge amount of data in a traditional system is not possible. Buying one expensive piece of hardware with huge storage capacity is not a good idea either, because it raises another issue: suppose we have one file of 500 MB but only 200 MB of storage left. Now what do we do?
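Distributed file systems such as HDFS sidestep exactly this problem: instead of needing 500 MB free on one machine, a large file is split into fixed-size blocks spread across many cheap machines. A minimal sketch of the idea (the block size and node names here are hypothetical, not any real cluster's configuration):

```python
# Sketch: split a file into fixed-size blocks and assign each block
# to a storage node round-robin, the core idea behind HDFS-style storage.
BLOCK_SIZE_MB = 100                      # hypothetical block size
nodes = ["node-1", "node-2", "node-3"]   # hypothetical storage nodes

def place_blocks(file_size_mb):
    """Return a mapping of block index -> node for a file of the given size."""
    num_blocks = -(-file_size_mb // BLOCK_SIZE_MB)  # ceiling division
    return {i: nodes[i % len(nodes)] for i in range(num_blocks)}

# A 500 MB file no longer needs 500 MB free on any single machine:
print(place_blocks(500))  # 5 blocks spread across 3 nodes
```

Real systems also replicate each block (HDFS defaults to 3 copies) so that losing one cheap machine loses no data.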
- VARIOUS FORMATS OF DATA — Earlier, we used to store data in relational databases, but currently around 80% of data is unstructured. There are now many different types of data, and it is hard to handle them all in a traditional manner.
- PROCESSING DATA FASTER — Let's take an example: we have a hard disk of 100 MB and store data on it, but more data keeps coming, so we increase its size from 100 MB to 500 MB, and then again and again. Now all the data is stored, but have you thought about how we are going to retrieve or process it? Although CPU speed, RAM capacity, and disk capacity have improved a lot, disk I/O speed has not kept pace: for the last 7–10 years, the read/write speed of a typical disk has stayed around 80 MB/sec. These are the problems industries face when their data turns into Big Data.
TYPES OF BIG DATA
Relational database data is known as structured data, which is stored in the form of rows and columns. Examples: stock information, credit card details, hospital medical records, bank records, etc.
Facebook even built its own SQL-based query language to handle Big Data, known as Hive Query Language (HiveQL).
Unstructured data includes images, audio, videos, etc. Almost 80% of all data is unstructured, and much of it is generated by social media.
JSON, XML, CSV files, tab-delimited files, log files, etc. are semi-structured data. Log files store data from the moment we log in to an application until we log out. On Facebook, for example, when we log in, what activity we perform, and when we log out is all stored in a log file.
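Semi-structured data like a log line has some internal structure but no fixed schema, so processing it usually means applying a pattern to recover the fields. A small sketch (the log format here is made up for illustration, not Facebook's actual format):

```python
import re

# Hypothetical application log line: timestamp, user, action.
line = "2020-07-15 10:32:01 user=alice action=login"

# Semi-structured: a regex recovers the implicit fields into a record.
pattern = r"(?P<ts>\S+ \S+) user=(?P<user>\w+) action=(?P<action>\w+)"
record = re.match(pattern, line).groupdict()
print(record)
# {'ts': '2020-07-15 10:32:01', 'user': 'alice', 'action': 'login'}
```

Once parsed into records like this, the data can be loaded into a warehouse such as Hive and queried like a structured table.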
Facebook's Big Data challenges
Big data stores are the workhorses for data analysis at Facebook. They grow by millions of events (inserts) per second and process tens of petabytes and hundreds of thousands of queries per day. The three data stores used most heavily are:
1. ODS (Operational Data Store) stores 2 billion time series of counters. It is used most commonly in alerts and dashboards and for trouble-shooting system metrics with 1–5 minutes of time lag. There are about 40,000 queries per second.
2. Scuba is Facebook’s fast slice-and-dice data store. It stores thousands of tables in about 100 terabytes in memory. It ingests millions of new rows per second and deletes just as many. Throughput peaks around 100 queries per second, scanning 100 billion rows per second, with most response times under 1 second.
3. Hive is Facebook’s data warehouse, with 300 petabytes of data in 800,000 tables. Facebook generates 4 new petabytes of data and runs 600,000 queries and 1 million map-reduce jobs per day. Presto, HiveQL, Hadoop, and Giraph are the common query engines over Hive.
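The daily totals above translate into steady per-second rates, which is a useful way to picture the sustained load:

```python
# Converting Hive's daily totals (from the text) into per-second rates.
seconds_per_day = 86_400

queries_per_day = 600_000
mapreduce_jobs_per_day = 1_000_000

print(f"{queries_per_day / seconds_per_day:.1f} queries/sec")         # ~6.9
print(f"{mapreduce_jobs_per_day / seconds_per_day:.1f} MR jobs/sec")  # ~11.6
```

Seven warehouse queries per second sounds modest until you remember each one may scan petabytes of data.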
ODS, Scuba, and Hive share an important characteristic: none is a traditional relational database. They process data for analysis, not to serve users, so they do not need ACID guarantees for data storage or retrieval. Instead, challenges arise from high data insertion rates and massive data quantities.
Much of this data, including data related to 2.7 billion Likes and 2.5 billion content items shared per day, is devoured and analyzed to optimize product features and Facebook advertising performance. Hadoop is the key tool Facebook uses, not simply for analysis, but as an engine to power many features of the Facebook site, including messaging. That multitude of monster workloads drove the company to launch its Prism project, which supports geographically distributed Hadoop data stores.