Big Data Challenges: how big MNCs like Google, Facebook, and Instagram store, manage, and manipulate thousands of terabytes of data at high speed -By Ritvik Ranjan
Before we talk about serious issues like the top big data challenges, it makes sense to first understand what big data is.
So, What Is Big Data?
The majority of experts define big data using three ‘V’ terms. Your organization has big data if your data stores bear the characteristics below.
Volume – your data is so large that your company faces challenges linked to processing, monitoring, and storage. With trends such as mobility, Internet of Things (IoT), social media and eCommerce in place, a lot of information is being generated. As a result, almost every organization satisfies this criterion.
Velocity – does your firm generate new data at a high speed and you are required to respond in real-time? If yes, then your organization has the velocity associated with big data. The majority of companies involved with technologies such as social media, the Internet of Things and eCommerce meet this criterion.
Variety – does your data come in many forms, not just structured records but also semi-structured and unstructured content such as text, logs, images and video? If your organization must store and process such heterogeneous data together, it meets this criterion.
Big Data Challenges and How You Can Solve Them
Data Storage
Relational database management systems (RDBMSs) are traditional storage systems designed for structured data and accessed by means of SQL. RDBMSs face challenges in handling Big Data and in providing the horizontal scalability, availability, and performance required by Big Data applications. In contrast to relational databases, MapReduce provides computational scalability, but it relies on data storage in a distributed file system such as the Google File System (GFS) or the Hadoop Distributed File System (HDFS).

NoSQL and NewSQL data stores have emerged as alternatives for Big Data storage. NoSQL refers to "Not Only SQL", highlighting that SQL is not a crucial objective of those systems. Their main defining characteristics are schema flexibility and effective scaling over a large number of commodity machines. NoSQL horizontal scalability includes data storage scaling as well as scaling of read/write operations. K. Grolinger et al. (2013) analyze the features driving the NoSQL systems' ability to scale, such as partitioning, replication, consistency, and concurrency control.

NoSQL systems typically adopt the MapReduce paradigm and push processing to the nodes where the data is located in order to scale read operations efficiently; consequently, data analysis is performed via MapReduce jobs. MapReduce itself is schema-free and index-free, which provides great flexibility and enables MapReduce to work with semi-structured and unstructured data. Moreover, MapReduce can run as soon as the data is loaded. However, the lack of indexes in standard MapReduce may result in poor performance in comparison to relational databases, although this may be outweighed by MapReduce's scalability and parallelization. Database vendors such as Oracle provide in-database MapReduce (X. Su and G. Swart 2012), taking advantage of database parallelization. Another example of providing in-database analytics capabilities is the MAD Skills project (J. Cohen et al. 2009), which implements MapReduce within the database using an SQL runtime execution engine; the Map and Reduce functions are written in Python, Perl, or R, and passed to the database for execution.

NoSQL systems from the column-family and document categories adopt the MapReduce paradigm while also providing support for various indexing methods. In this approach MapReduce jobs can access data using an index, so query performance is significantly improved. For example, Cassandra supports primary and secondary indexes (Apache Cassandra). In CouchDB (J. C. Anderson and N. Slater 2010) the primary way of querying and reporting is through views, which use the MapReduce paradigm with JavaScript as the query language. A view consists of a Map function and an optional Reduce function; the data emitted by the Map function is used to construct an index, and consequently queries against that view run quickly.

Another challenge related to MapReduce and data storage is the lack of a standardized SQL-like language, so one direction of research is concerned with providing SQL on top of MapReduce. An example in this category is Apache Hive (A. Thusoo et al. 2009), which provides an SQL-like language on top of Hadoop. Another Apache effort, Mahout (Apache Mahout), aims to build scalable machine learning libraries on top of MapReduce. Although these efforts provide powerful data processing capabilities, they lack data management features such as advanced indexing and a sophisticated optimizer. NoSQL solutions choose different approaches to providing querying abilities (K. Grolinger et al. 2013): Cassandra and MongoDB provide proprietary SQL-like querying, while HBase uses Hive.

It is also important to point out the efforts on integration between traditional databases, MapReduce, and Hadoop. For example, the Oracle SQL Connector for HDFS (Oracle 2014) provides the ability to query data in Hadoop from within the database using SQL, and the Oracle Data Integrator for Hadoop generates Hive-like queries which are transformed into native MapReduce and executed on Hadoop clusters. Even though the presented efforts have advanced the state of the art for data storage and MapReduce, a number of challenges remain, such as:
* the lack of a standardized SQL-like query language,
* limited optimization of MapReduce jobs,
* limited integration among MapReduce, distributed file systems, RDBMSs, and NoSQL stores.
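To make the MapReduce paradigm discussed above concrete, here is a minimal, single-process word-count sketch in Python. The map, shuffle, and reduce phases mirror what Hadoop distributes across nodes; the function names and the in-memory shuffle are illustrative simplifications, not Hadoop's actual API.

```python
from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word. This is schema-free: the
    # input is raw text, with no table definition or index required.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate all counts emitted for one word.
    return (key, sum(values))

def word_count(documents):
    pairs = [p for doc in documents for p in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

docs = ["big data needs big storage", "data storage scales out"]
print(word_count(docs))  # e.g. {'big': 2, 'data': 2, ...}
```

Note that nothing here consults an index: every document is scanned in full, which is exactly why the article observes that index-free MapReduce can underperform an indexed relational database on selective queries.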
Big Data Management and Storage
In Big Data, "big" means that the size of the data is growing continuously, while storage capacity is increasing much more slowly than the amount of data. The existing information framework therefore needs to be reconstructed into a hierarchical framework, because researchers have concluded that available DBMSs are not adequate to process such large amounts of data (Changqing et al. 2012).
The architecture commonly used for data processing relies on a database server, but database servers face constraints of scalability and cost, which are prime goals of Big Data. Database providers have suggested different business models, but these are essentially application-specific; Google, for example, seems to be more interested in small applications.
Big Data storage is another big issue in Big Data management: available computer algorithms are sufficient for storing homogeneous data, but they are not able to smartly store data that arrives in real time because of its heterogeneous behaviour (Avita Katal et al. 2013).
How to rearrange data is therefore another big problem in the context of Big Data management. Virtual server technology can sharpen the problem, because it raises the issue of overcommitted resources, especially when there is a lack of communication between the application server and the storage administrator. The problems of concurrent I/O and of a single-node master/slave architecture also need to be solved.
Hadoop Distributed File System (HDFS) and MapReduce
Hadoop comes with its default distributed file system, the Hadoop Distributed File System (HDFS) (Amrit Pal et al. 2014). It stores files in blocks of 64 MB and can store files of varying size, from hundreds of megabytes up to gigabytes and terabytes. The Hadoop architecture contains a Name Node, Data Nodes, a Secondary Name Node, Task Trackers, and a Job Tracker. The Name Node maintains the metadata about the blocks stored in HDFS; files are stored as blocks in a distributed manner. The Secondary Name Node maintains the validity of the Name Node and updates the Name Node's information from time to time. The Data Nodes actually store the data.
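As a rough illustration of the block layout described above, the sketch below splits a file into 64 MB blocks and records which data nodes hold each block, the way Name Node metadata does. The node names, the replication factor of 3, and the round-robin placement are assumptions for illustration; real HDFS placement is rack-aware and considerably more sophisticated.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # the 64 MB HDFS block size mentioned above
REPLICATION = 3                # a common HDFS replication factor (assumed)

def plan_blocks(file_size, data_nodes):
    """Return Name-Node-style metadata: block id -> list of data nodes."""
    num_blocks = math.ceil(file_size / BLOCK_SIZE)
    metadata = {}
    for block_id in range(num_blocks):
        # Simplified round-robin replica placement across data nodes.
        replicas = [data_nodes[(block_id + r) % len(data_nodes)]
                    for r in range(REPLICATION)]
        metadata[block_id] = replicas
    return metadata

nodes = ["dn1", "dn2", "dn3", "dn4"]
plan = plan_blocks(200 * 1024 * 1024, nodes)  # a 200 MB file -> 4 blocks
print(len(plan), plan[0])
```

A 200 MB file does not divide evenly into 64 MB blocks, so the last block is simply smaller; the Name Node only tracks which nodes hold which block, not the bytes themselves.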
The Job Tracker receives a job from the user and splits it into parts, then assigns these split jobs to Task Trackers. Task Trackers run on the Data Nodes; they fetch data from the Data Node, execute their tasks, and continuously talk to the Job Tracker, which coordinates the job submitted by the user. Each Task Tracker has a fixed number of slots for running tasks, and the Job Tracker selects a Task Tracker that has free slots available.
It is useful to choose a Task Tracker on the same rack where the data is stored; this is known as rack awareness, and with it inter-rack bandwidth can be saved.

The different components of Hadoop can also be arranged on a single node. In this arrangement all the components, the Name Node, Secondary Name Node, Data Node, Job Tracker, and Task Tracker, are on the same system, and the user submits a job in the form of a MapReduce task. Because the Data Node and the Task Tracker are on the same system, the best speed for reads and writes can be achieved.
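The scheduling rule just described, prefer a Task Tracker with free slots on the rack holding the data, can be sketched as follows. The tracker names, the rack map, and the slot counts are hypothetical, and the real Job Tracker's scheduling logic is far more involved; this only illustrates the slot check plus the rack-awareness preference.

```python
def pick_task_tracker(trackers, data_rack):
    """Pick a tracker with free slots, preferring the rack where the data lives.

    trackers: list of dicts like {"name": ..., "rack": ..., "free_slots": int}
    """
    candidates = [t for t in trackers if t["free_slots"] > 0]
    if not candidates:
        return None  # the task waits until a slot frees up
    # Rack awareness: a tracker on the data's rack avoids inter-rack traffic.
    local = [t for t in candidates if t["rack"] == data_rack]
    chosen = max(local or candidates, key=lambda t: t["free_slots"])
    chosen["free_slots"] -= 1  # occupy one of the tracker's fixed slots
    return chosen["name"]

trackers = [
    {"name": "tt1", "rack": "rack-A", "free_slots": 0},
    {"name": "tt2", "rack": "rack-A", "free_slots": 2},
    {"name": "tt3", "rack": "rack-B", "free_slots": 4},
]
print(pick_task_tracker(trackers, data_rack="rack-A"))  # prefers tt2 over tt3
```

Note that tt1 is on the right rack but has no free slots, so it is never chosen; only when no rack-local tracker has a free slot does the scheduler fall back to a remote rack.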
MapReduce was introduced by Google in order to process and store large datasets on commodity hardware. It provides a programming paradigm which allows many computationally intensive tasks to be distributed in a usable and manageable way, and as a result many programming languages now have MapReduce implementations, which extends its uptake. Hadoop is a highly popular free MapReduce implementation by the Apache Foundation (White T 2012). With the popularity of Hadoop, many complementary applications have been developed by the open source community and packaged under the Apache Foundation (Saecker et al. 2013).