get started
-
tutorial exercise 1 ingest structured data
$ sqoop import-all-tables -m 1 \
    --connect jdbc:mysql://quickstart:3306/retail_db \
    --username=retail_dba \
    --password=cloudera \
    --compression-codec=snappy \
    --as-avrodatafile \
    --warehouse-dir=/user/hive/warehouse
-
this command takes a while to complete
-
export data from our mysql database
-
put those exported files in avro format in hdfs
-
also creating avro schemas
-
so that we can easily load our hive tables for use in impala later
$ hadoop fs -ls /user/hive/warehouse
$ hadoop fs -ls /user/hive/warehouse/categories
-
sqoop also created schema files
$ ls -l *.avsc
-
the schema and the data are stored in separate files
-
the schema is only applied when the data is queried, a technique called schema-on-read
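for reference, an avro schema file is just a json document describing the record layout - the sketch below is roughly what a sqoop-generated sqoop_import_categories.avsc could look like (field names are assumed from the retail_db categories table, not copied from the actual file)
{
  "type": "record",
  "name": "sqoop_import_categories",
  "fields": [
    {"name": "category_id", "type": ["null", "int"], "default": null},
    {"name": "category_department_id", "type": ["null", "int"], "default": null},
    {"name": "category_name", "type": ["null", "string"], "default": null}
  ]
}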
-
hive will need the schema files
$ sudo -u hdfs hadoop fs -mkdir /user/examples
$ sudo -u hdfs hadoop fs -chmod +rw /user/examples
$ hadoop fs -put ~/*.avsc /user/examples
-
-
tutorial exercise 2 query structured data
-
use hue's impala app to create the metadata for our tables in hue
-
web-based interface on port 8888 of the master node
http://10.0.2.15:8888 (username: cloudera / password: cloudera)
-
click query editors and open the impala query editor
create external table categories
stored as avro
location 'hdfs:///user/hive/warehouse/categories'
tblproperties ('avro.schema.url'='hdfs://quickstart/user/examples/sqoop_import_categories.avsc');
-
repeat for the other tables: customers... departments... orders... order_items... products...
-
verify all of the tables
show tables;
-
most popular product categories, then total revenue per product - showing the top 10 of each
select c.category_name, count(order_item_quantity) as count
from order_items oi
inner join products p on oi.order_item_product_id = p.product_id
inner join categories c on c.category_id = p.product_category_id
group by c.category_name
order by count desc
limit 10;

select p.product_id, p.product_name, r.revenue
from products p
inner join (
    select oi.order_item_product_id, sum(cast(oi.order_item_subtotal as float)) as revenue
    from order_items oi
    inner join orders o on oi.order_item_order_id = o.order_id
    where o.order_status <> 'CANCELED'
    and o.order_status <> 'SUSPECTED_FRAUD'
    group by order_item_product_id
) r on p.product_id = r.order_item_product_id
order by r.revenue desc
limit 10;
-
-
tutorial exercise 3 correlate structured data with unstructured data
-
bulk upload data
$ sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse/original_access_logs
$ sudo -u hdfs hadoop fs -put /opt/examples/log_files/access.log.2 /user/hive/warehouse/original_access_logs
-
verify data in hdfs
$ hadoop fs -ls /user/hive/warehouse/original_access_logs
-
query hive using beeline, a command-line jdbc client for hive
$ beeline -u jdbc:hive2://quickstart:10000/default -n admin -d org.apache.hive.jdbc.HiveDriver
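the tokenized_access_logs table queried below gets built inside beeline; the ddl here is only a hedged sketch of that approach (the intermediate table name, column names, regex, and jar path are assumptions, not the tutorial's exact statements): parse the raw log with a regex serde, then rewrite it as delimited text that impala can read
-- sketch only: columns and regex are assumptions; adjust to the actual access.log.2 layout
CREATE EXTERNAL TABLE intermediate_access_logs (
    ip STRING, request_date STRING, method STRING, url STRING, http_version STRING,
    code1 STRING, code2 STRING, referer STRING, user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'input.regex' = '([^ ]*) - - \\[([^\\]]*)\\] "([^ ]*) ([^ ]*) ([^ ]*)" ([^ ]*) ([^ ]*) "([^"]*)" "([^"]*)"')
LOCATION '/user/hive/warehouse/original_access_logs';

-- a plain-text copy of the parsed fields that impala can query directly
CREATE EXTERNAL TABLE tokenized_access_logs (
    ip STRING, request_date STRING, method STRING, url STRING, http_version STRING,
    code1 STRING, code2 STRING, referer STRING, user_agent STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/tokenized_access_logs';

-- hive-contrib jar path is an assumption for the quickstart vm
ADD JAR /usr/lib/hive/lib/hive-contrib.jar;
INSERT OVERWRITE TABLE tokenized_access_logs SELECT * FROM intermediate_access_logs;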
-
run the query on impala
-
when you create a new table while impala is running, you must tell it to refresh its metadata
# impala query editor
invalidate metadata;
-
query editor
select count(*), url
from tokenized_access_logs
where url like '%\/product\/%'
group by url
order by count(*) desc;
-
-
tutorial exercise 4 relationship strength analytics using spark
-
start spark
$ spark-shell --jars /usr/lib/avro/avro-mapred.jar \
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
-
paste the following code
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable

val warehouse = "hdfs://quickstart/user/hive/warehouse/"

val order_items_path = warehouse + "order_items"
val order_items = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable, AvroInputFormat[GenericRecord]](order_items_path)

val products_path = warehouse + "products"
val products = sc.hadoopFile[AvroWrapper[GenericRecord], NullWritable, AvroInputFormat[GenericRecord]](products_path)
-
extract the fields from order_items and products
val orders = order_items.map { x => (
    x._1.datum.get("order_item_product_id"),
    (x._1.datum.get("order_item_order_id"), x._1.datum.get("order_item_quantity")))
}.join(
  products.map { x => (
    x._1.datum.get("product_id"),
    (x._1.datum.get("product_name")))
  }
).map(x => (
    // order_id
    scala.Int.unbox(x._2._1._1),
    (
      // quantity
      scala.Int.unbox(x._2._1._2),
      // product_name
      x._2._2.toString
    )
)).groupByKey()
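the pasted code ends after grouping the items by order - as a minimal sketch of the relationship-strength idea (not necessarily the tutorial's exact next step), the grouped rdd can be reduced to counts of how often two products show up in the same order
// sketch: count co-occurrence of product pairs within the same order
val productPairs = orders.flatMap { case (_, items) =>
  // items is an Iterable of (quantity, product_name); keep each product once per order
  items.map(_._2).toList.distinct.sorted.combinations(2).map(pair => (pair, 1))
}
val pairCounts = productPairs.reduceByKey(_ + _)
// show the ten most frequently co-purchased product pairs
pairCounts.map { case (pair, count) => (count, pair) }.sortByKey(ascending = false).take(10).foreach(println)
counting each distinct pair once per order keeps the measure symmetric and avoids double counting when a product appears more than once in the same order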
-
-
tutorial exercise 5 explore log events interactively
-
extend the flume configuration that is already ingesting the web log data to also post events to solr for indexing in real-time
-
create your search index
# 1. creating an empty configuration
$ solrctl --zk 10.0.2.15:2181/solr instancedir --generate solr_configs
# 2. edit your schema
# 3. uploading your configuration
$ cd /opt/examples/flume
$ solrctl --zk 10.0.2.15:2181/solr instancedir --create live_logs ./solr_configs
# 4. creating your collection
$ solrctl --zk 10.0.2.15:2181/solr collection --create live_logs -s 1
-
verify using hue
# search -> indexes -> live_logs -> click
# viewing the fields defined in schema.xml
-
starting the log generator
$ start_logs
# verify it has started
$ tail_logs
# ctrl+c
# stop the log generator
$ stop_logs
-
flume and the morphline
flume is a system for collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. with a few simple configuration files, we can use flume and a morphline (a simple way to accomplish on-the-fly etl) to load our data into the solr index
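as a hedged sketch of the solr-facing piece of such a configuration (agent, channel, and file names below are illustrative assumptions, not necessarily what /opt/examples/flume/conf/flume.conf contains), a flume sink can run each event through a morphline and post it to solr
# illustrative sketch only - agent/channel names and paths are assumptions
agent1.sinks = solrSink
agent1.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent1.sinks.solrSink.channel = memoryChannel
agent1.sinks.solrSink.morphlineFile = /opt/examples/flume/conf/morphline.conf
agent1.sinks.solrSink.morphlineId = morphline1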
-
start flume agent
$ flume-ng agent \
    --conf /opt/examples/flume/conf \
    --conf-file /opt/examples/flume/conf/flume.conf \
    --name agent1 -Dflume.root.logger=DEBUG,INFO,console
-