Developed under the Apache Software Foundation, Hadoop is a Java-based open-source platform for processing massive amounts of data in a distributed computing environment. Hadoop's key innovations lie in its ability to store and access enormous volumes of data across thousands of computers and to present that data coherently.
Though data warehouses can store data on a similar scale, they are costly and do not allow for effective exploration of large amounts of discordant data. Hadoop addresses this limitation by taking a data query and distributing it across the nodes of a cluster. By spreading the workload over thousands of loosely networked computers (nodes), Hadoop can examine and present petabytes of heterogeneous data in a meaningful format. At the same time, the software scales down as well and can run on a single server or a small network.
Hadoop's distributed computing abilities derive from two software frameworks: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS facilitates rapid data transfer between computer nodes and allows continued operation even in the event of node failure. MapReduce distributes all data processing over these nodes, reducing the workload on each individual computer and allowing computations and analysis beyond the capabilities of a single computer or network. For example, Facebook reportedly uses MapReduce to analyze user behavior and advertisement tracking across roughly 21 petabytes of data. Other prominent users include IBM and Yahoo, typically for search indexing and advertising; the MapReduce programming model itself originated in research published by Google.
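To make the map and reduce phases concrete, the following is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API; the class name and the input and output paths are placeholders, not part of any particular deployment. Each mapper processes the blocks of input stored on its own node and emits (word, 1) pairs, and the reducers sum those counts into a single result set.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each node tokenizes its share of the input and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: the counts for each word are gathered from all nodes and summed.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0] is an HDFS input directory; args[1] is an output directory that must not yet exist.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Submitted to a cluster with a command along the lines of "hadoop jar wordcount.jar WordCount /input /output", the framework splits the input across the cluster's mappers and merges the reducers' output into a single result set, regardless of how many machines took part.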
Using Hadoop effectively requires understanding that it is designed to run on a large number of machines that share no hardware or memory. When a financial institution wants to analyze data from dozens of servers, Hadoop breaks the data apart and distributes it across those servers. Hadoop also replicates the data, preventing data loss in the event of most failures. In addition, MapReduce increases potential computing speed by dividing a large data analysis across all servers or computers in a cluster, yet it answers the query with a single result set.
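Replication is handled by HDFS itself and can be controlled from the client. The snippet below is a minimal sketch, assuming the standard HDFS Java client and purely illustrative file names, of loading a file into HDFS and requesting three replicas of its blocks.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadIntoHdfs {
    public static void main(String[] args) throws Exception {
        // Reads the cluster settings (core-site.xml, hdfs-site.xml) from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path local = new Path("/tmp/transactions.csv");   // illustrative local file
        Path remote = new Path("/data/transactions.csv"); // illustrative HDFS destination

        // Copy the file into HDFS, where it is split into blocks spread over the cluster.
        fs.copyFromLocalFile(local, remote);

        // Ask HDFS to keep three copies of each block on different nodes,
        // so the loss of one or two machines does not lose the data.
        fs.setReplication(remote, (short) 3);

        fs.close();
    }
}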
Though Hadoop offers a scalable approach to data storage and analysis, it is not meant as a substitute for a standard relational database (such as SQL Server 2012). Hadoop stores data in files but does not index them for fast lookup. Finding a particular piece of data requires a MapReduce job that scans the files, which takes far longer than a simple indexed database query. Hadoop works best when the dataset is too large for conventional storage and too diverse for easy analysis.
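To make the contrast concrete, here is a sketch of what a record lookup means in Hadoop: where an indexed database answers a selective query by jumping straight to the matching row, a MapReduce job must read every line of every file. The class name, record layout, and lookup field below are hypothetical, used only for illustration.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FindCustomer {

    // Map-only job: every line of every file is read, and only matching lines are written out.
    public static class FilterMapper extends Mapper<Object, Text, Text, NullWritable> {
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Hypothetical layout: comma-separated records whose first field is a customer id.
            String[] fields = value.toString().split(",");
            String wanted = context.getConfiguration().get("lookup.id");
            if (fields.length > 0 && fields[0].equals(wanted)) {
                context.write(value, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("lookup.id", args[2]); // the id to search for, passed on the command line
        Job job = Job.getInstance(conf, "find customer");
        job.setJarByClass(FindCustomer.class);
        job.setMapperClass(FilterMapper.class);
        job.setNumReduceTasks(0); // no reduce phase: matches are written directly to the output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Even if only one record matches, the job still touches the entire dataset, which is why a well-indexed relational database remains the better tool for small, selective queries.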
The amount of digitized information has grown roughly ninefold in the last five years, with companies spending an estimated four trillion dollars worldwide on data management in 2011. Doug Cutting, the creator of Hadoop and now an architect at Cloudera, cites estimates that 1.8 zettabytes (1.8 trillion gigabytes) of data were created and replicated in that same year. Roughly ninety percent of this information is unstructured, and Hadoop and applications like it are among the few practical ways of keeping it comprehensible.
For more information about SQL Server 2012 development, visit Magenic, one of the leading software development companies in the nation, providing innovative custom software development to meet unique business challenges for some of the most recognized companies and organizations.