projects:
- ambari
- avro
- bigtop
- cassandra
- chukwa
- crunch
- derby
- drill
- falcon
- flink
- flume
- hadoop
- hbase
- hive
- kafka
- knox
- mahout
- mesos
- oozie
- parquet
- phoenix
- pig
- solr
- spark
- sqoop
- storm
- tez
- thrift
- zookeeper
ambari:
- cate: big-data
- lang: java/python/js
- desc: apache ambari makes hadoop cluster provisioning, managing, and monitoring dead simple.
avro:
- cate: big-data/library
- lang: c/c++/c#/java/php/python/ruby
- desc: apache avro is a data serialization system.
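# a minimal sketch of what "data serialization system" means in practice from
# python, assuming the third-party fastavro package is installed; the user
# record schema and users.avro path are hypothetical examples.
#
#   from fastavro import writer, reader, parse_schema
#
#   # define and parse a hypothetical record schema
#   schema = parse_schema({
#       "type": "record", "name": "user",
#       "fields": [{"name": "name", "type": "string"}],
#   })
#
#   # write one record to an avro container file, then read it back
#   with open("users.avro", "wb") as out:
#       writer(out, schema, [{"name": "alice"}])
#   with open("users.avro", "rb") as f:
#       for record in reader(f):
#           print(record)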
bigtop:
- cate: big-data
- lang: java
- desc: bigtop is a project for the development of packaging and tests of the apache hadoop ecosystem.
cassandra:
- cate: database
- lang: java
- desc: apache cassandra database is the right choice when you need scalability and high availability without compromising performance. linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.
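# a minimal sketch of talking to cassandra from python, assuming the official
# cassandra-driver package and a node on 127.0.0.1; the demo keyspace, users
# table, and simplestrategy replication settings are hypothetical examples.
#
#   from cassandra.cluster import Cluster
#
#   session = Cluster(["127.0.0.1"]).connect()
#   session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
#                   "{'class': 'SimpleStrategy', 'replication_factor': 1}")
#   session.execute("CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)")
#   session.execute("INSERT INTO demo.users (id, name) VALUES (%s, %s)", (1, "alice"))
#   for row in session.execute("SELECT id, name FROM demo.users"):
#       print(row.id, row.name)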
chukwa:
- cate: hadoop
- lang: java/js
- desc: chukwa is an open source data collection system for monitoring large distributed systems.
crunch:
- cate: big-data/library
- lang: java/scala
- desc: the apache crunch java library provides a framework for writing, testing, and running mapreduce pipelines.
derby:
- cate: database
- lang: java
- desc: apache derby is an open source relational database implemented entirely in java.
drill:
- cate: big-data
- lang: java
- desc: apache drill is a distributed mpp query layer that supports sql and alternative query languages against nosql and hadoop data storage systems. it was inspired in part by google's dremel.
falcon:
- cate: big-data
- lang: java
- desc: apache falcon is a data processing and management solution for hadoop designed for data motion, coordination of data pipelines, lifecycle management, and data discovery. falcon enables end consumers to quickly onboard their data and its associated processing and management tasks on hadoop clusters.
flink:
- cate: big-data
- lang: java/scala
- desc: flink is an open source system for expressive, declarative, fast, and efficient data analysis.
flume:
- cate: big-data
- lang: java
- desc: apache flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
hadoop:
- cate: database
- lang: java
- desc: hadoop is a distributed computing platform. it includes the hadoop distributed filesystem (hdfs) and an implementation of mapreduce.
hbase:
- cate: database
- lang: java
- desc: use apache hbase software when you need random, realtime read/write access to your big data. this project's goal is the hosting of very large tables -- billions of rows x millions of columns -- atop clusters of commodity hardware. hbase is an open-source, distributed, versioned, column-oriented store modeled after google's bigtable, a distributed storage system for structured data by chang et al. just as bigtable leverages the distributed data storage provided by the google file system, hbase provides bigtable-like capabilities on top of hadoop and hdfs.
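# a minimal sketch of the random read/write access described above from
# python, assuming the third-party happybase package and an hbase thrift
# server on localhost; the mytable table and cf column family are
# hypothetical examples.
#
#   import happybase
#
#   connection = happybase.Connection("localhost")  # hbase thrift server, default port 9090
#   table = connection.table("mytable")
#
#   # write a cell, then read the row back
#   table.put(b"row1", {b"cf:greeting": b"hello"})
#   print(table.row(b"row1"))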
hive:
- cate: database
- lang: java
- desc: the apache hive (tm) data warehouse software facilitates querying and managing large datasets residing in distributed storage. built on top of apache hadoop (tm), it provides tools to enable easy data extract/transform/load (etl), a mechanism to impose structure on a variety of data formats, access to files stored either directly in apache hdfs (tm) or in other data storage systems such as apache hbase (tm), and query execution via mapreduce. hive defines a simple sql-like query language, called hiveql, that enables users familiar with sql to query the data. at the same time, this language also allows programmers who are familiar with the mapreduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language. hiveql can also be extended with custom scalar functions (udf's), aggregations (udaf's), and table functions (udtf's).
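# a minimal sketch of running hiveql from python, assuming the third-party
# pyhive package and a hiveserver2 instance on localhost:10000; the logs
# table and page column are hypothetical examples.
#
#   from pyhive import hive
#
#   cursor = hive.connect(host="localhost", port=10000).cursor()
#   cursor.execute("SELECT page, count(*) FROM logs GROUP BY page")
#   for row in cursor.fetchall():
#       print(row)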
kafka:
- cate: big-data
- lang: scala
- desc: a single kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients. kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. it can be elastically and transparently expanded without downtime. data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers. kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees. messages are persisted on disk and replicated within the cluster to prevent data loss. each broker can handle terabytes of messages without performance impact.
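# a minimal sketch of the producer/consumer pattern described above from
# python, assuming the third-party kafka-python package and a broker on
# localhost:9092; the events topic name is a hypothetical example.
#
#   from kafka import KafkaProducer, KafkaConsumer
#
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   producer.send("events", b"hello kafka")
#   producer.flush()
#
#   consumer = KafkaConsumer("events", bootstrap_servers="localhost:9092",
#                            auto_offset_reset="earliest")
#   for message in consumer:
#       print(message.value)
#       break  # read just one message in this sketch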
knox:
- cate: big-data
- lang: java
- desc: the apache knox gateway is a rest api gateway for interacting with hadoop clusters. the knox gateway provides a single access point for all rest interactions with hadoop clusters. in this capacity, the knox gateway is able to provide valuable functionality to aid in the control, integration, monitoring and automation of critical administrative and analytical needs of the enterprise, including authentication (ldap and active directory authentication provider), federation/sso (http header based identity federation), authorization (service level authorization), and auditing. while there are a number of benefits for unsecured hadoop clusters, the knox gateway also complements a kerberos secured cluster quite nicely. coupled with proper network isolation of a kerberos secured hadoop cluster, the knox gateway provides the enterprise with a solution that integrates well with enterprise identity management solutions, protects the details of the hadoop cluster deployment (hosts and ports are hidden from end users), and simplifies the number of services that clients need to interact with.
mahout:
- cate: library
- lang: java
- desc: scalable machine learning library.
mesos:
- cate: cloud
- lang: c++
- desc: apache mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. it can run hadoop, mpi, hypertable, spark, and other frameworks on a dynamically shared pool of nodes.
oozie:
- cate: big-data
- lang: java/js
- desc: oozie is a workflow scheduler system to manage apache hadoop jobs. oozie is integrated with the rest of the hadoop stack, supporting several types of hadoop jobs out of the box (such as java map-reduce, streaming map-reduce, pig, hive, sqoop and distcp) as well as system specific jobs (such as java programs and shell scripts).
parquet:
- cate: big-data
- lang: java
- desc: apache parquet is a general-purpose columnar storage format, built for hadoop, usable with any choice of data processing framework, data model, or programming language.
phoenix:
- cate: big-data/database
- lang: java/sql
- desc: apache phoenix is a relational database layer on top of apache hbase. it is accessed as a jdbc driver and enables querying, updating, and managing hbase tables through standard sql. instead of using map-reduce, apache phoenix compiles your sql query into a series of hbase scans and orchestrates the running of those scans to produce regular jdbc result sets. direct use of the hbase api, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.
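# a minimal sketch of the sql-over-hbase access described above from python,
# assuming the third-party phoenixdb package and a phoenix query server on
# localhost:8765; the users table is a hypothetical example. (the jdbc driver
# mentioned in the description is the usual route from java.)
#
#   import phoenixdb
#
#   conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
#   cursor = conn.cursor()
#   cursor.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name VARCHAR)")
#   cursor.execute("UPSERT INTO users VALUES (1, 'alice')")
#   cursor.execute("SELECT * FROM users")
#   print(cursor.fetchall())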
pig:
- cate: database
- lang: java
- desc: apache pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. the salient property of pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. pig's infrastructure layer consists of a compiler that produces sequences of map-reduce programs. pig's language layer consists of a textual language called pig latin, which has the following key properties. * ease of programming. it is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain. * optimization opportunities. the way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency. * extensibility. users can create their own functions to do special-purpose processing.
solr:
- cate: web-framework/network-server
- lang: java
- desc: solr is an open source enterprise search server based on the lucene java search library, with xml/http and json, ruby, and python apis, hit highlighting, faceted search, caching, replication, and a web administration interface.
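# a minimal sketch of the http/json search api mentioned above from python,
# assuming the requests package and a solr core named mycore on
# localhost:8983; the query string is a hypothetical example.
#
#   import requests
#
#   resp = requests.get("http://localhost:8983/solr/mycore/select",
#                       params={"q": "title:hadoop", "wt": "json"})
#   print(resp.json()["response"]["numFound"])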
spark:
- cate: big-data
- lang: java/scala/python
- desc: apache spark is a fast and general engine for large-scale data processing. it offers high-level apis in java, scala and python as well as a rich set of libraries including stream processing, machine learning, and graph analytics.
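# a minimal sketch of the high-level python api mentioned above, assuming
# pyspark is installed and running in local mode; input.txt is a hypothetical
# input file.
#
#   from pyspark.sql import SparkSession
#
#   spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()
#   counts = (spark.sparkContext.textFile("input.txt")
#             .flatMap(lambda line: line.split())
#             .map(lambda word: (word, 1))
#             .reduceByKey(lambda a, b: a + b))
#   print(counts.take(5))
#   spark.stop()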
sqoop:
- cate: big-data
- lang: java
- desc: apache sqoop(tm) is a tool designed for efficiently transferring bulk data between apache hadoop and structured datastores such as relational databases.
storm:
- cate: big-data
- lang: java
- desc: apache storm is a distributed real-time computation system. similar to how hadoop provides a set of general primitives for doing batch processing, storm provides a set of general primitives for doing real-time computation.
tez:
- cate: big-data
- lang: java
- desc: the apache tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. it is currently built atop apache hadoop yarn.
thrift:
- cate: http/library/network-client/network-server
- lang: actionscript/c/c#/c++/cocoa/d/delphi/erlang/go/haskell/java
- desc: apache thrift allows you to define data types and service interfaces in a simple definition file. taking that file as input, the compiler generates code to be used to easily build rpc clients and servers that communicate seamlessly across languages. instead of writing a load of boilerplate code to serialize and transport your objects and invoke remote methods, you can get right down to business.
zookeeper:
- cate: database
- lang: java
- desc: apache zookeeper is an effort to develop and maintain an open-source server which enables highly reliable distributed coordination.
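# a minimal sketch of distributed coordination from python, assuming the
# third-party kazoo package and a zookeeper server on 127.0.0.1:2181; the
# /app/workers path is a hypothetical example.
#
#   from kazoo.client import KazooClient
#
#   zk = KazooClient(hosts="127.0.0.1:2181")
#   zk.start()
#   zk.ensure_path("/app/workers")
#   # ephemeral nodes disappear when the session ends -- handy for liveness
#   zk.create("/app/workers/worker-1", b"alive", ephemeral=True)
#   data, stat = zk.get("/app/workers/worker-1")
#   print(data, stat.version)
#   zk.stop()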