from spark site
Apache Spark™ is a fast and general engine for large-scale data processing
- speed
  - run programs up to 100x faster than hadoop mapreduce in memory, or 10x faster on disk
  - spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing
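To make the DAG and in-memory points concrete, here is a minimal plain-Python sketch of the idea. It is not Spark's API: the `Dataset` class, its `evaluations` counter, and `lineage()` are hypothetical names invented for illustration. It assumes transformations are lazy, each node remembers its parent, and a cached node is computed once and then served from memory.

```python
class Dataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's RDD)."""

    def __init__(self, compute, parent=None, name="source"):
        self._compute = compute   # function that produces this node's records
        self.parent = parent      # upstream node in the DAG, if any
        self.name = name
        self._keep = False        # True once cache() has been requested
        self._cached = None
        self.evaluations = 0      # how many times this node was actually computed

    def map(self, fn):
        # Nothing runs here; we only record a new node that points at its parent.
        return Dataset(lambda: [fn(x) for x in self.collect()], self, "map")

    def filter(self, pred):
        return Dataset(lambda: [x for x in self.collect() if pred(x)], self, "filter")

    def cache(self):
        self._keep = True
        return self

    def collect(self):
        # Serve from memory when this node was cached by an earlier action.
        if self._keep and self._cached is not None:
            return self._cached
        self.evaluations += 1
        result = self._compute()
        if self._keep:
            self._cached = result
        return result

    def lineage(self):
        # Walk parent links back to the source to show the chain of steps.
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return " <- ".join(names)


numbers = Dataset(lambda: list(range(10)))
squares = numbers.map(lambda x: x * x).cache()
evens = squares.filter(lambda x: x % 2 == 0)

first = evens.collect()    # computes source, map, and filter
second = evens.collect()   # recomputes only the filter; squares come from memory
```

After the two `collect()` calls, `squares.evaluations` stays at 1, which is the toy version of reusing an in-memory dataset across repeated queries.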
- ease of use
  - write apps quickly in java, scala or python
  - spark offers over 80 high-level operators that make it easy to build parallel apps
  - and you can use it interactively from the scala and python shells
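As a rough illustration of what a few of those high-level operators do, the sketch below reimplements the semantics of `flatMap`, `map`, and `reduceByKey` on an ordinary local list. The helper names `flat_map` and `reduce_by_key` are hypothetical stand-ins written for this note; the real operators run in parallel across a cluster, which this sketch does not attempt.

```python
def flat_map(fn, records):
    """Apply fn to every record and flatten the resulting lists into one list."""
    return [out for rec in records for out in fn(rec)]


def reduce_by_key(fn, pairs):
    """Merge the values of (key, value) pairs that share a key, using fn."""
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return merged


lines = ["spark is fast", "spark is general"]

words = flat_map(lambda line: line.split(), lines)   # one record per word
pairs = [(word, 1) for word in words]                # the map step
counts = reduce_by_key(lambda a, b: a + b, pairs)    # sum the 1s per word

# With a SparkContext named sc (assumed, not created here), the same
# pipeline would be written roughly as:
#   sc.parallelize(lines) \
#     .flatMap(lambda line: line.split()) \
#     .map(lambda word: (word, 1)) \
#     .reduceByKey(lambda a, b: a + b)
```

The local version returns a plain dict; the Spark version would return a distributed dataset of `(word, count)` pairs that still needs an action such as `collect()` to bring results back.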
- generality
  - combine sql, streaming, and complex analytics
  - spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming
  - you can combine these libraries seamlessly in the same app

  +-----------+ +-----------+ +-----------+ +-----------+
  |           | |           | |           | |           |
  | spark     | | spark     | | MLlib     | | GraphX    |
  | sql       | | streaming | | machine   | | graph     |
  |           | |           | | learning  | |           |
  |           | |           | |           | |           |
  +-----------+ +-----------+ +-----------+ +-----------+
  +-----------------------------------------------------+
  |                                                     |
  |                    apache spark                     |
  |                                                     |
  +-----------------------------------------------------+

- runs everywhere
  - runs on
    - hadoop
    - mesos
    - standalone
    - in the cloud
  - it can access diverse data sources including
    - hdfs
    - cassandra
    - hbase
    - s3
  - you can run spark readily using its standalone cluster mode, on ec2, or on hadoop yarn or apache mesos
  - it can read from hdfs, hbase, cassandra, and any hadoop data source
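The choice of cluster manager mostly shows up as the master URL handed to Spark. The sketch below builds those URLs for the deployment targets listed above. The helper `master_url` and the example host names are hypothetical; the URL shapes (`local[*]`, `spark://HOST:PORT`, `mesos://HOST:PORT`, `yarn`) and default ports follow Spark's documented conventions as I understand them, and older releases used `yarn-client` / `yarn-cluster` instead of plain `yarn`, so check the docs for your version.

```python
# Master URL shapes per deployment target (assumed from Spark's documentation).
MASTER_URL_TEMPLATES = {
    "local": "local[*]",                  # run in-process using all cores
    "standalone": "spark://{host}:{port}",
    "mesos": "mesos://{host}:{port}",
    "yarn": "yarn",                       # cluster location comes from hadoop config
}

# Conventional default ports for the two targets that take a host.
DEFAULT_PORTS = {"standalone": 7077, "mesos": 5050}


def master_url(target, host=None, port=None):
    """Return the master URL string for a deployment target (hypothetical helper)."""
    template = MASTER_URL_TEMPLATES[target]
    if "{host}" not in template:
        return template
    if host is None:
        raise ValueError(f"{target!r} needs a host name")
    chosen_port = port if port is not None else DEFAULT_PORTS[target]
    return template.format(host=host, port=chosen_port)


# Placeholder host names, for illustration only.
standalone = master_url("standalone", "master.example.com")
mesos = master_url("mesos", "mesos.example.com", 6000)
yarn = master_url("yarn")
local = master_url("local")
```

A string like `standalone` above is what would typically be passed as the `--master` option of `spark-submit` or as the master argument when creating a context.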