Hadoop Components

Posted by Sagar Patil

Quick Info

Flume : Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS).

Sqoop 2: Sqoop is a tool for transferring bulk data between relational databases (RDBMS) and the Hadoop ecosystem. It works in both directions: importing from an RDBMS into Hadoop and exporting from Hadoop back to an RDBMS.

Pig : Pig is an open-source, high-level dataflow system. It provides a simple language for queries and data manipulation, Pig Latin, which is compiled into MapReduce jobs that run on Hadoop.
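
To make "compiled into map-reduce jobs" a little more concrete, here is a minimal pure-Python sketch of the map, shuffle, and reduce stages for a word count. It is an illustration only, not Pig's actual compiler output and not the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group the emitted values by key between the two stages."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three stages in sequence on an in-memory input."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["pig runs on hadoop", "hadoop runs jobs"]))
```

A Pig Latin script that groups records by a key and counts each group describes the same computation at a higher level. Pig plans the stages, and Hadoop runs them across the cluster instead of in one process.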

HBase :  HBase is an open source, non-relational, distributed database modeled after Google’s BigTable and written in Java.

Hive : The Apache Hive ™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.
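
The idea of "projecting structure onto data" can be sketched in a few lines of Python: the raw delimited lines stay as they are, and a schema is applied only when the data is read. This is a rough illustration under assumed names (RAW_LINES, SCHEMA, project), not Hive's implementation, and the sample records are invented.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical schema: (column name, function that parses the raw text field).
Schema = List[Tuple[str, Callable[[str], object]]]

# Invented sample data standing in for tab-delimited files in distributed storage.
RAW_LINES = [
    "alice\t2015-01-03\t1200",
    "bob\t2015-01-04\t300",
]

SCHEMA: Schema = [("user", str), ("day", str), ("bytes", int)]


def project(lines: Iterable[str], schema: Schema,
            delimiter: str = "\t") -> Iterator[Dict[str, object]]:
    """Apply the schema at read time, yielding one typed row per raw line."""
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        yield {name: cast(raw) for (name, cast), raw in zip(schema, fields)}


if __name__ == "__main__":
    # Roughly the shape of: SELECT user FROM logs WHERE bytes > 1000
    rows = project(RAW_LINES, SCHEMA)
    print([row["user"] for row in rows if row["bytes"] > 1000])
```

The stored lines are never rewritten. Only the reader's view of them changes, so the same files can be read under a different schema later.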

Hue: Hue is an open-source Web interface that supports Apache Hadoop and its ecosystem.
Hue aggregates most common Apache Hadoop components into a single interface and targets user experience. Its main goal is to have users “just use” Hadoop without worrying about the underlying complexity or using a command line.

Impala : Impala is a fully integrated, state-of-the-art analytic database architected specifically to leverage the flexibility and scalability strengths of Hadoop – combining the familiar SQL support and multi-user performance of a traditional analytic database with the rock-solid foundation of open source Apache Hadoop and the production-grade security and management extensions of Cloudera Enterprise.

Key-Value Store Indexer:  The Key-Value Store Indexer service uses the Lily HBase NRT Indexer to index the stream of records being added to HBase tables. Indexing allows you to query data stored in HBase with the Solr service.

The Key-Value Store Indexer service is installed in the same parcel or package along with the CDH 5 or Solr service. The Indexer service depends on the HBase, HDFS, Solr, and ZooKeeper services.

Oozie : Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.
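
The "DAG of actions" idea can be sketched with Python's standard graphlib module (Python 3.9+): each action lists the actions it depends on, and a topological sort gives an order in which they can run. Real Oozie workflows are defined in XML, not Python. The action names and helper functions below are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set

# Hypothetical workflow: each action maps to the set of actions it depends on.
WORKFLOW: Dict[str, Set[str]] = {
    "import_data": set(),
    "clean_data": {"import_data"},
    "build_report": {"clean_data"},
    "update_index": {"clean_data"},
    "notify": {"build_report", "update_index"},
}


def execution_order(dag: Dict[str, Set[str]]) -> List[str]:
    """Return one valid order in which every action follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


def is_acyclic(dag: Dict[str, Set[str]]) -> bool:
    """A workflow is only a DAG if no action (transitively) depends on itself."""
    try:
        execution_order(dag)
        return True
    except CycleError:
        return False


if __name__ == "__main__":
    print(execution_order(WORKFLOW))
    print(is_acyclic({"a": {"b"}, "b": {"a"}}))
```

The independent actions build_report and update_index can come out in either order, which is why a scheduler is free to run such branches in parallel.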

Solr : Apache Solr is an open-source search platform. In a Hadoop deployment it is used to index and search data stored in HDFS.

Spark : Apache Spark is an open-source cluster computing framework. In contrast to Hadoop’s two-stage disk-based MapReduce paradigm, Spark’s in-memory primitives provide performance up to 100 times faster for certain applications.

YARN (MR2 Included) : YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform, unlocking an entirely new approach to analytics.

ZooKeeper : ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications.
