projects:
- ambari
- avro
- bigtop
- cassandra
- chukwa
- crunch
- derby
- drill
- falcon
- flink
- flume
- hadoop
- hbase
- hive
- kafka
- knox
- mahout
- mesos
- oozie
- parquet
- phoenix
- pig
- solr
- spark
- sqoop
- storm
- tez
- thrift
- zookeeper
ambari:
- cate: big-data
- lang: java/python/js
- desc: apache ambari makes hadoop cluster provisioning, managing, and monitoring dead simple.
avro:
- cate: big-data/library
- lang: c/c++/c#/java/php/python/ruby
- desc: apache avro is a data serialization system.
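# a minimal sketch of what "data serialization system" means in practice from
# python, assuming the third-party fastavro package is installed; the user
# record schema and users.avro path are hypothetical examples.
#
#   from fastavro import writer, reader, parse_schema
#
#   # define and parse a hypothetical record schema
#   schema = parse_schema({
#       "type": "record", "name": "user",
#       "fields": [{"name": "name", "type": "string"}],
#   })
#
#   # write one record to an avro container file, then read it back
#   with open("users.avro", "wb") as out:
#       writer(out, schema, [{"name": "alice"}])
#   with open("users.avro", "rb") as f:
#       for record in reader(f):
#           print(record)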
bigtop:
- cate: big-data
- lang: java
- desc: bigtop is a project for the development of packaging and tests of the apache hadoop ecosystem.
cassandra:
- cate: database
- lang: java
- desc: apache cassandra database is the right choice when you need scalability and high availability without compromising performance. linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.
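# a minimal sketch of talking to cassandra from python, assuming the official
# cassandra-driver package and a node on 127.0.0.1; the demo keyspace, users
# table, and simplestrategy replication settings are hypothetical examples.
#
#   from cassandra.cluster import Cluster
#
#   session = Cluster(["127.0.0.1"]).connect()
#   session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
#                   "{'class': 'SimpleStrategy', 'replication_factor': 1}")
#   session.execute("CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)")
#   session.execute("INSERT INTO demo.users (id, name) VALUES (%s, %s)", (1, "alice"))
#   for row in session.execute("SELECT id, name FROM demo.users"):
#       print(row.id, row.name)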
chukwa:
- cate: hadoop
- lang: java/js
- desc: chukwa is an open source data collection system for monitoring large distributed systems.
crunch:
- cate: big-data/library
- lang: java/scala
- desc: the apache crunch java library provides a framework for writing, testing, and running mapreduce pipelines.
derby:
- cate: database
- lang: java
- desc: apache derby is an open source relational database implemented entirely in java.
drill:
- cate: big-data
- lang: java
- desc: apache drill is a distributed mpp query layer that supports sql and alternative query languages against nosql and hadoop data storage systems. it was inspired in part by google's dremel.
falcon:
- cate: big-data
- lang: java
- desc: apache falcon is a data processing and management solution for hadoop designed for data motion, coordination of data pipelines, lifecycle management, and data discovery. falcon enables end consumers to quickly onboard their data and its associated processing and management tasks on hadoop clusters.
flink:
- cate: big-data
- lang: java/scala
- desc: flink is an open source system for expressive, declarative, fast, and efficient data analysis.
flume:
- cate: big-data
- lang: java
- desc: apache flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
hadoop:
- cate: database
- lang: java
- desc: hadoop is a distributed computing platform. it includes the hadoop distributed filesystem (hdfs) and an implementation of mapreduce.
hbase:
- cate: database
- lang: java
- desc: use apache hbase software when you need random, realtime read/write access to your big data. this project's goal is the hosting of very large tables -- billions of rows x millions of columns -- atop clusters of commodity hardware. hbase is an open-source, distributed, versioned, column-oriented store modeled after google's bigtable, a distributed storage system for structured data by chang et al. just as bigtable leverages the distributed data storage provided by the google file system, hbase provides bigtable-like capabilities on top of hadoop and hdfs.
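# a minimal sketch of the random read/write access described above from
# python, assuming the third-party happybase package and an hbase thrift
# server on localhost; the mytable table and cf column family are
# hypothetical examples.
#
#   import happybase
#
#   connection = happybase.Connection("localhost")  # hbase thrift server, default port 9090
#   table = connection.table("mytable")
#
#   # write a cell, then read the row back
#   table.put(b"row1", {b"cf:greeting": b"hello"})
#   print(table.row(b"row1"))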
hive:
- cate: database
- lang: java
- desc: the apache hive (tm) data warehouse software facilitates querying and managing large datasets residing in distributed storage. built on top of apache hadoop (tm), it provides tools to enable easy data extract/transform/load (etl), a mechanism to impose structure on a variety of data formats, access to files stored either directly in apache hdfs (tm) or in other data storage systems such as apache hbase (tm), and query execution via mapreduce. hive defines a simple sql-like query language, called hiveql, that enables users familiar with sql to query the data. at the same time, this language also allows programmers who are familiar with the mapreduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language. hiveql can also be extended with custom scalar functions (udf's), aggregations (udaf's), and table functions (udtf's).
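# a minimal sketch of running hiveql from python, assuming the third-party
# pyhive package and a hiveserver2 instance on localhost:10000; the logs
# table and page column are hypothetical examples.
#
#   from pyhive import hive
#
#   cursor = hive.connect(host="localhost", port=10000).cursor()
#   cursor.execute("SELECT page, count(*) FROM logs GROUP BY page")
#   for row in cursor.fetchall():
#       print(row)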
kafka:
- cate: big-data
- lang: scala
- desc: a single kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients. kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. it can be elastically and transparently expanded without downtime. data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers. kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees. messages are persisted on disk and replicated within the cluster to prevent data loss. each broker can handle terabytes of messages without performance impact.
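# a minimal sketch of the producer/consumer pattern described above from
# python, assuming the third-party kafka-python package and a broker on
# localhost:9092; the events topic name is a hypothetical example.
#
#   from kafka import KafkaProducer, KafkaConsumer
#
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   producer.send("events", b"hello kafka")
#   producer.flush()
#
#   consumer = KafkaConsumer("events", bootstrap_servers="localhost:9092",
#                            auto_offset_reset="earliest")
#   for message in consumer:
#       print(message.value)
#       break  # read just one message in this sketch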
knox:
- cate: big-data
- lang: java
- desc: the apache knox gateway is a rest api gateway for interacting with hadoop clusters. the knox gateway provides a single access point for all rest interactions with hadoop clusters. in this capacity, the knox gateway is able to provide valuable functionality to aid in the control, integration, monitoring and automation of critical administrative and analytical needs of the enterprise, including authentication (ldap and active directory authentication provider), federation/sso (http header based identity federation), authorization (service level authorization), and auditing. while there are a number of benefits for unsecured hadoop clusters, the knox gateway also complements a kerberos secured cluster quite nicely. coupled with proper network isolation of a kerberos secured hadoop cluster, the knox gateway provides the enterprise with a solution that integrates well with enterprise identity management solutions, protects the details of the hadoop cluster deployment (hosts and ports are hidden from end users), and simplifies the number of services that clients need to interact with.
mahout:
- cate: library
- lang: java
- desc: scalable machine learning library.
mesos:
- cate: cloud
- lang: c++
- desc: apache mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. it can run hadoop, mpi, hypertable, spark, and other frameworks on a dynamically shared pool of nodes.
oozie:
- cate: big-data
- lang: java/js
- desc: oozie is a workflow scheduler system to manage apache hadoop jobs. oozie is integrated with the rest of the hadoop stack, supporting several types of hadoop jobs out of the box (such as java map-reduce, streaming map-reduce, pig, hive, sqoop and distcp) as well as system specific jobs (such as java programs and shell scripts).
parquet:
- cate: big-data
- lang: java
- desc: apache parquet is a general-purpose columnar storage format, built for hadoop, usable with any choice of data processing framework, data model, or programming language.
phoenix:
- cate: big-data/database
- lang: java/sql
- desc: apache phoenix is a relational database layer on top of apache hbase. it is accessed as a jdbc driver and enables querying, updating, and managing hbase tables through standard sql. instead of using map-reduce, apache phoenix compiles your sql query into a series of hbase scans and orchestrates the running of those scans to produce regular jdbc result sets. direct use of the hbase api, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.
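# a minimal sketch of the sql-over-hbase access described above from python,
# assuming the third-party phoenixdb package and a phoenix query server on
# localhost:8765; the users table is a hypothetical example. (the jdbc driver
# mentioned in the description is the usual route from java.)
#
#   import phoenixdb
#
#   conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
#   cursor = conn.cursor()
#   cursor.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name VARCHAR)")
#   cursor.execute("UPSERT INTO users VALUES (1, 'alice')")
#   cursor.execute("SELECT * FROM users")
#   print(cursor.fetchall())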
pig:
- cate: database
- lang: java
- desc: apache pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. the salient property of pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. pig's infrastructure layer consists of a compiler that produces sequences of map-reduce programs. pig's language layer consists of a textual language called pig latin, which has the following key properties. * ease of programming. it is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain. * optimization opportunities. the way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency. * extensibility. users can create their own functions to do special-purpose processing.
solr:
- cate: web-framework/network-server
- lang: java
- desc: solr is an open source enterprise search server based on the lucene java search library, with xml/http and json, ruby, and python apis, hit highlighting, faceted search, caching, replication, and a web administration interface.
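# a minimal sketch of the http/json search api mentioned above from python,
# assuming the requests package and a solr core named mycore on
# localhost:8983; the query string is a hypothetical example.
#
#   import requests
#
#   resp = requests.get("http://localhost:8983/solr/mycore/select",
#                       params={"q": "title:hadoop", "wt": "json"})
#   print(resp.json()["response"]["numFound"])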
spark:
- cate: big-data
- lang: java/scala/python
- desc: apache spark is a fast and general engine for large-scale data processing. it offers high-level apis in java, scala and python as well as a rich set of libraries including stream processing, machine learning, and graph analytics.
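# a minimal sketch of the high-level python api mentioned above, assuming
# pyspark is installed and running in local mode; input.txt is a hypothetical
# input file.
#
#   from pyspark.sql import SparkSession
#
#   spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()
#   counts = (spark.sparkContext.textFile("input.txt")
#             .flatMap(lambda line: line.split())
#             .map(lambda word: (word, 1))
#             .reduceByKey(lambda a, b: a + b))
#   print(counts.take(5))
#   spark.stop()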
sqoop:
- cate: big-data
- lang: java
- desc: apache sqoop(tm) is a tool designed for efficiently transferring bulk data between apache hadoop and structured datastores such as relational databases.
storm:
- cate: big-data
- lang: java
- desc: apache storm is a distributed real-time computation system. similar to how hadoop provides a set of general primitives for doing batch processing, storm provides a set of general primitives for doing real-time computation.
tez:
- cate: big-data
- lang: java
- desc: the apache tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. it is currently built atop apache hadoop yarn.
thrift:
- cate: http/library/network-client/network-server
- lang: actionscript/c/c#/c++/cocoa/d/delphi/erlang/go/haskell/java
- desc: apache thrift allows you to define data types and service interfaces in a simple definition file. taking that file as input, the compiler generates code to be used to easily build rpc clients and servers that communicate seamlessly across languages. instead of writing a load of boilerplate code to serialize and transport your objects and invoke remote methods, you can get right down to business.
zookeeper:
- cate: database
- lang: java
- desc: apache zookeeper is an effort to develop and maintain an open-source server which enables highly reliable distributed coordination.
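# a minimal sketch of distributed coordination from python, assuming the
# third-party kazoo package and a zookeeper server on 127.0.0.1:2181; the
# /app/workers path is a hypothetical example.
#
#   from kazoo.client import KazooClient
#
#   zk = KazooClient(hosts="127.0.0.1:2181")
#   zk.start()
#   zk.ensure_path("/app/workers")
#   # ephemeral nodes disappear when the session ends -- handy for liveness
#   zk.create("/app/workers/worker-1", b"alive", ephemeral=True)
#   data, stat = zk.get("/app/workers/worker-1")
#   print(data, stat.version)
#   zk.stop()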