23 September 2015

projects

  1. ambari

  2. avro

  3. bigtop

  4. cassandra

  5. chukwa

  6. crunch

  7. derby

  8. drill

  9. falcon

  10. flink

  11. flume

  12. hadoop

  13. hbase

  14. hive

  15. kafka

  16. knox

  17. mahout

  18. mesos

  19. oozie

  20. parquet

  21. phoenix

  22. pig

  23. solr

  24. spark

  25. sqoop

  26. storm

  27. tez

  28. thrift

  29. zookeeper

ambari

  1. site

  2. cate big-data

  3. lang java/python/js

  4. desc

         apache ambari makes hadoop cluster provisioning, managing, and monitoring dead simple. 
    

avro

  1. site

  2. cate big-data/library

  3. lang c/c++/c#/java/php/python/ruby

  4. desc

         apache avro is a data serialization system.
    

bigtop

  1. site

  2. cate big-data

  3. lang java

  4. desc

         bigtop is a project for the development of packaging and tests of the apache hadoop ecosystem. 
    

cassandra

  1. site

  2. cate database

  3. lang java

  4. desc

         apache cassandra database is the right choice when you need scalability and high availability without compromising performance. linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.   
    

chukwa

  1. site

  2. cate hadoop

  3. lang java/js

  4. desc

         chukwa is an open source data collection system for monitoring large distributed systems.  
    

crunch

  1. site

  2. cate big-data/library

  3. lang java/scala

  4. desc

         the apache crunch java library provides a framework for writing, testing, and running mapreduce pipelines. 
    

derby

  1. site

  2. cate database

  3. lang java

  4. desc

         apache derby is an open source relational database implemented entirely in java.   
    

drill

  1. site

  2. cate big-data

  3. lang java

  4. desc

         apache drill is a distributed mpp query layer that supports sql and alternative query languages against nosql and hadoop data storage systems. it was inspired in part by google's dremel. 
    

falcon

  1. site

  2. cate big-data

  3. lang java

  4. desc

         apache falcon is a data processing and management solution for hadoop designed for data motion, coordination of data pipelines, lifecycle management, and data discovery. falcon enables end consumers to quickly onboard their data and its associated processing and management tasks on hadoop clusters.
    

flink

  1. site

  2. cate big-data

  3. lang java/scala

  4. desc

         flink is an open source system for expressive, declarative, fast, and efficient data analysis. 
    

flume

  1. site

  2. cate big-data

  3. lang java

  4. desc

         apache flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.
    

hadoop

  1. site

  2. cate database

  3. lang java

  4. desc

         hadoop is a distributed computing platform. this includes the hadoop distributed filesystem (hdfs) and an implementation of mapreduce. 
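
the mapreduce model mentioned above can be illustrated with a small in-memory word count. this is only a sketch of the map, shuffle, and reduce steps in plain python; it does not use the hadoop api, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["hadoop stores data", "hadoop processes data"]))
```

in a real cluster the map and reduce calls would run on different machines and the shuffle would move data over the network; here everything happens in one process.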
    

hbase

  1. site

  2. cate database

  3. lang java

  4. desc

         use apache hbase software when you need random, realtime read/write access to your big data. this project's goal is the hosting of very large tables -- billions of rows x millions of columns -- atop clusters of commodity hardware. hbase is an open-source, distributed, versioned, column-oriented store modeled after google's bigtable: a distributed storage system for structured data by chang et al. just as bigtable leverages the distributed data storage provided by the google file system, hbase provides bigtable-like capabilities on top of hadoop and hdfs.   
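
the "versioned, column-oriented" data model can be pictured as a map from row key to column to timestamp to value. the class below is a toy in-memory sketch of that idea, not the hbase client api; `VersionedTable` and its methods are hypothetical names.

```python
from collections import defaultdict


class VersionedTable:
    """Toy model: row key -> column -> timestamp -> value."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        """Store one version of a cell."""
        self._rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (newest overall if None)."""
        versions = self._rows.get(row, {}).get(column, {})
        candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
        return versions[max(candidates)] if candidates else None


if __name__ == "__main__":
    table = VersionedTable()
    table.put("row1", "info:name", "old", timestamp=1)
    table.put("row1", "info:name", "new", timestamp=2)
    print(table.get("row1", "info:name"))               # newest version
    print(table.get("row1", "info:name", timestamp=1))  # version as of timestamp 1
```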
    

hive

  1. site

  2. cate database

  3. lang java

  4. desc

         the apache hive (tm) data warehouse software facilitates querying and managing large datasets residing in distributed storage. built on top of apache hadoop (tm), it provides:
         * tools to enable easy data extract/transform/load (etl)
         * a mechanism to impose structure on a variety of data formats
         * access to files stored either directly in apache hdfs (tm) or in other data storage systems such as apache hbase (tm)
         * query execution via mapreduce
         hive defines a simple sql-like query language, called hiveql, that enables users familiar with sql to query the data. at the same time, this language also allows programmers who are familiar with the mapreduce framework to be able to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language. hiveql can also be extended with custom scalar functions (udf's), aggregations (udaf's), and table functions (udtf's).
    

kafka

  1. site

  2. cate big-data

  3. lang scala

  4. desc

         a single kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients. kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. it can be elastically and transparently expanded without downtime. data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers. kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees. messages are persisted on disk and replicated within the cluster to prevent data loss. each broker can handle terabytes of messages without performance impact. 
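
the idea of partitioning a data stream across machines can be sketched with a key-based partitioner. this is a plain-python illustration under assumptions: it uses `zlib.crc32` as a stand-in hash (not the hash kafka's own clients use), and `choose_partition` and `spread` are hypothetical helper names.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index; the same key always lands in the same partition."""
    return zlib.crc32(key) % num_partitions


def spread(messages, num_partitions):
    """Distribute (key, value) messages over partitions, keeping per-key order."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in messages:
        partitions[choose_partition(key, num_partitions)].append(value)
    return partitions


if __name__ == "__main__":
    events = [(b"user-1", "login"), (b"user-2", "login"), (b"user-1", "logout")]
    for index, partition in enumerate(spread(events, num_partitions=3)):
        print(index, partition)
```

because all messages with one key share a partition, a consumer reading that partition sees them in the order they were written.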
    

knox

  1. site

  2. cate big-data

  3. lang java

  4. desc

         the apache knox gateway is a rest api gateway for interacting with hadoop clusters. the knox gateway provides a single access point for all rest interactions with hadoop clusters. in this capacity, the knox gateway is able to provide valuable functionality to aid in the control, integration, monitoring and automation of critical administrative and analytical needs of the enterprise.
         * authentication (ldap and active directory authentication provider)
         * federation/sso (http header based identity federation)
         * authorization (service level authorization)
         * auditing
         while there are a number of benefits for unsecured hadoop clusters, the knox gateway also complements the kerberos secured cluster quite nicely. coupled with proper network isolation of a kerberos secured hadoop cluster, the knox gateway provides the enterprise with a solution that:
         * integrates well with enterprise identity management solutions
         * protects the details of the hadoop cluster deployment (hosts and ports are hidden from end users)
         * simplifies the number of services that clients need to interact with
    

mahout

  1. site

  2. cate library

  3. lang java

  4. desc

         scalable machine learning library  
    

mesos

  1. site

  2. cate cloud

  3. lang c++

  4. desc

         apache mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. it can run hadoop, mpi, hypertable, spark, and other frameworks on a dynamically shared pool of nodes.
    

oozie

  1. site

  2. cate big-data

  3. lang java/js

  4. desc

         oozie is a workflow scheduler system to manage apache hadoop jobs. oozie is integrated with the rest of the hadoop stack supporting several types of hadoop jobs out of the box (such as java map-reduce, streaming map-reduce, pig, hive, sqoop and distcp) as well as system specific jobs (such as java programs and shell scripts).
    

parquet

  1. site

  2. cate big-data

  3. lang java

  4. desc

         apache parquet is a general-purpose columnar storage format, built for hadoop, usable with any choice of data processing framework, data model, or programming language.   
    

phoenix

  1. site

  2. cate big-data/database

  3. lang java/sql

  4. desc

         apache phoenix is a relational database layer on top of apache hbase. it is accessed as a jdbc driver and enables querying, updating, and managing hbase tables through standard sql. instead of using map-reduce, apache phoenix compiles your sql query into a series of hbase scans and orchestrates the running of those scans to produce regular jdbc result sets. direct use of the hbase api, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.   
    

pig

  1. site

  2. cate database

  3. lang java

  4. desc

         apache pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. the salient property of pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. pig's infrastructure layer consists of a compiler that produces sequences of map-reduce programs. pig's language layer consists of a textual language called pig latin, which has the following key properties:
         * ease of programming. it is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. complex tasks composed of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
         * optimization opportunities. the way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
         * extensibility. users can create their own functions to do special-purpose processing.
    

solr

  1. site

  2. cate web-framework/network-server

  3. lang java

  4. desc

         solr is an open source enterprise search server based on the lucene java search library, with xml/http and json, ruby, and python apis, hit highlighting, faceted search, caching, replication, and a web administration interface.
    

spark

  1. site

  2. cate big-data

  3. lang java/scala/python

  4. desc

         apache spark is a fast and general engine for large-scale data processing. it offers high-level apis in java, scala and python as well as a rich set of libraries including stream processing, machine learning, and graph analytics.  
    

sqoop

  1. site

  2. cate big-data

  3. lang java

  4. desc

         apache sqoop(tm) is a tool designed for efficiently transferring bulk data between apache hadoop and structured datastores such as relational databases.   
    

storm

  1. site

  2. cate big-data

  3. lang java

  4. desc

         apache storm is a distributed real-time computation system. similar to how hadoop provides a set of general primitives for doing batch processing, storm provides a set of general primitives for doing real-time computation. 
    

tez

  1. site

  2. cate big-data

  3. lang java

  4. desc

         the apache tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. it is currently built atop apache hadoop yarn.
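
a directed-acyclic-graph of tasks can be pictured as follows: each task runs only after the tasks it depends on, and receives their results. the sketch below uses python's standard `graphlib.TopologicalSorter` (python 3.9+) and runs everything in one process; it is not the tez api, and `run_dag` and the task names are invented for illustration.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, dependencies):
    """Run tasks in dependency order.

    tasks: task name -> callable taking a dict of upstream results.
    dependencies: task name -> set of task names that must finish first.
    """
    graph = {name: dependencies.get(name, set()) for name in tasks}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        upstream = {dep: results[dep] for dep in graph[name]}
        results[name] = tasks[name](upstream)
    return results


if __name__ == "__main__":
    tasks = {
        "load": lambda up: [3, 1, 2],
        "sort": lambda up: sorted(up["load"]),
        "total": lambda up: sum(up["load"]),
        "report": lambda up: (up["sort"], up["total"]),
    }
    dependencies = {
        "sort": {"load"},
        "total": {"load"},
        "report": {"sort", "total"},
    }
    print(run_dag(tasks, dependencies)["report"])
```

here "sort" and "total" both depend only on "load", so a distributed framework would be free to run them in parallel; this sketch simply runs them one after the other.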
    

thrift

  1. site

  2. cate http/library/network-client/network-server

  3. lang actionscript/c/c#/c++/cocoa/d/delphi/erlang/go/haskell/java

  4. desc

         apache thrift allows you to define data types and service interfaces in a simple definition file. taking that file as input, the compiler generates code to be used to easily build rpc clients and servers that communicate seamlessly across programming languages. instead of writing a load of boilerplate code to serialize and transport your objects and invoke remote methods, you can get right down to business.
    

zookeeper

  1. site

  2. cate database

  3. lang java

  4. desc

         apache zookeeper is an effort to develop and maintain an open-source server which enables highly reliable distributed coordination.
    

