HETEROGENEITY-BASED FAIR DATA DISTRIBUTION IN A HADOOP ENVIRONMENT
Keywords: Hadoop, Data Locality, Distributed Computing, Big Data, Cloud Computing

Abstract
The Hadoop framework has been developed to effectively process data-intensive MapReduce applications.
Hadoop users specify the application computation logic in terms of a map and a reduce function, which are often termed
MapReduce applications. The Hadoop distributed file system is used to store the MapReduce application data on the
Hadoop cluster nodes, called Datanodes, while the Namenode serves as the control point for all Datanodes. Although this design increases resilience, the current data-distribution methodologies are not necessarily efficient for heterogeneous distributed environments such as public clouds. This work contends that existing data distribution techniques are not necessarily
suitable, since the performance of Hadoop typically degrades in heterogeneous environments whenever data distribution
is not aligned with the computing capability of the nodes. Data locality and its impact on the performance of Hadoop are key factors, since they affect performance in the Map phase when tasks are scheduled. The
task scheduling techniques in Hadoop should therefore consider data locality to enhance performance. Various task scheduling techniques have been analysed to understand their data-locality awareness when scheduling applications. Other system factors also play a major role in achieving high performance in Hadoop data processing.
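The map and reduce functions mentioned above can be illustrated with a minimal sketch. This is a pure-Python word-count simulation of the MapReduce model, not Hadoop's actual Java API; the function names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative only:

```python
from collections import defaultdict

def map_phase(document):
    """Map function: emit (word, 1) pairs, as a Hadoop mapper would."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Group intermediate values by key (the shuffle/sort step
    that Hadoop performs between the Map and Reduce phases)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce function: aggregate all values for one key."""
    return (key, sum(values))

# Two toy "input splits"; on a real cluster each would live on a Datanode.
documents = ["big data on hadoop", "data locality in hadoop"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'big': 1, 'data': 2, 'on': 1, 'hadoop': 2, 'locality': 1, 'in': 1}
```

In Hadoop itself, data locality matters because each `map_phase` call is scheduled as a task that ideally runs on the Datanode already holding its input split, avoiding network transfer of the data.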