setup
- have hadoop in your path, or export HADOOP_HOME=<hadoop-install-dir>
- check my post 2015-05-25-apache-hive-on-mac-osx-yosemite
- create dirs in hdfs
  $ hadoop fs -mkdir /tmp
  $ hadoop fs -mkdir -p /user/hive/warehouse
  $ hadoop fs -chmod g+w /tmp
  $ hadoop fs -chmod g+w /user/hive/warehouse
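- a quick sanity check (not in the original guide): the warehouse dir should show group write
  $ hadoop fs -ls /user/hive
  # drwxrwxr-x   - <username> supergroup          0 ... /user/hive/warehouse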
- running hiveserver2 and beeline
  - hiveserver2 has its own cli, beeline
  - HiveCLI is now deprecated
    $ bin/hiveserver2
    $ bin/beeline -u jdbc:hive2://$HS2_HOST:$HS2_PORT
  - start beeline and hiveserver2 in the same process for testing purposes
    $ bin/beeline -u jdbc:hive2://
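  - a minimal beeline session sketch (localhost:10000 is the default hiveserver2 address; the user name is an assumption):
    $ bin/beeline -u jdbc:hive2://localhost:10000 -n $USER
    0: jdbc:hive2://localhost:10000> show tables;
    0: jdbc:hive2://localhost:10000> !quit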
- running hcatalog
  $ hcatalog/sbin/hcat_server.sh
- running webhcat (templeton)
  $ hcatalog/sbin/webhcat_server.sh
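- to check that webhcat came up (50111 is the default templeton port), hit its status endpoint:
  $ curl http://localhost:50111/templeton/v1/status
  # should return something like {"status":"ok","version":"v1"}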
configuration management overview
- default configuration: conf/hive-default.xml
- change the location of the configuration directory by setting HIVE_CONF_DIR
- defining/redefining variables: conf/hive-site.xml
- log4j configuration: conf/hive-log4j.properties
- hive inherits the hadoop configuration
- manipulate hive configuration by
  - editing hive-site.xml
  - using the set command (see the sketch after this list)
  - invoking hive (deprecated), beeline or hiveserver2 with --hiveconf
    $ bin/hive --hiveconf x1=y1 --hiveconf x2=y2
    $ bin/hiveserver2 --hiveconf x1=y1 --hiveconf x2=y2
    $ bin/beeline --hiveconf x1=y1 --hiveconf x2=y2
  - setting the HIVE_OPTS environment variable to "--hiveconf x1=y1 --hiveconf x2=y2"
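- a short sketch of set in action (the property name is just an example):
  hive> set mapreduce.job.reduces;    # print the current value
  hive> set mapreduce.job.reduces=2;  # override for the current session only
  hive> set;                          # dump all variables
  $ export HIVE_OPTS="--hiveconf mapreduce.job.reduces=2"  # picked up by bin/hive at startup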
- hive mr and local-mode
  - enable local mode execution
    hive> set mapred.job.tracker=local;
    # `mapred.local.dir` must point to a valid local path, e.g. /tmp/<username>/mapred/local
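  - hive can also decide per query whether to run locally (a real property from the hive docs, off by default):
    hive> set hive.exec.mode.local.auto=true;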
metadata store
- metadata is stored in an embedded derby database by default
  javax.jdo.option.ConnectionURL  # default => ./metastore_db
- the metastore can be stored in any database that is supported by jpox (see the hive-site.xml sketch after this list)
  javax.jdo.option.ConnectionURL
  javax.jdo.option.ConnectionDriverName
- in the future the metastore itself may become a standalone server
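- a minimal hive-site.xml sketch for a mysql-backed metastore (the property keys are hive's real ones; database name, user and password here are made up):
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
  </property>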
simple example use cases
- create a table with tab-delimited text file format
  hive> create table u_data (
          userid int,
          movieid int,
          rating int,
          unixtime string)
        row format delimited
        fields terminated by '\t'
        stored as textfile;

  # 1. <---------------- with a mysql-backed metastore the create table failed with
  # com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException:
  # Specified key was too long; max key length is 767 bytes
  # fix: switch the metastore schema from utf8 to latin1

  mysql> select * from information_schema.schemata s where schema_name='metastore';
  +--------------+-------------+----------------------------+------------------------+----------+
  | CATALOG_NAME | SCHEMA_NAME | DEFAULT_CHARACTER_SET_NAME | DEFAULT_COLLATION_NAME | SQL_PATH |
  +--------------+-------------+----------------------------+------------------------+----------+
  | def          | metastore   | utf8                       | utf8_general_ci        | NULL     |
  +--------------+-------------+----------------------------+------------------------+----------+
  1 row in set (0.00 sec)

  mysql> alter database metastore character set latin1;
  Query OK, 1 row affected (0.00 sec)

  mysql> select * from information_schema.schemata s where schema_name='metastore';
  +--------------+-------------+----------------------------+------------------------+----------+
  | CATALOG_NAME | SCHEMA_NAME | DEFAULT_CHARACTER_SET_NAME | DEFAULT_COLLATION_NAME | SQL_PATH |
  +--------------+-------------+----------------------------+------------------------+----------+
  | def          | metastore   | latin1                     | latin1_swedish_ci      | NULL     |
  +--------------+-------------+----------------------------+------------------------+----------+
  1 row in set (0.00 sec)

  # ddl works again after the charset change:
  hive> create table employee (emp_id int, emp_name string, emp_title string)
        row format delimited fields terminated by ','
        stored as textfile;
  OK
  Time taken: 0.755 seconds

  mysql> show variables like '%char%';  # handy for inspecting the server/client charsets
- download movielens 100k
  $ wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
  # or
  $ curl --remote-name http://files.grouplens.org/datasets/movielens/ml-100k.zip
  # unzip
  $ unzip ml-100k.zip
- load u.data into the table just created
  hive> load data local inpath '<path>/u.data' overwrite into table u_data;
  hive> select count(*) from u_data;
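  # a quick spot check of the loaded rows (just a sanity query)
  hive> select * from u_data limit 5;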
- complex data analysis on table u_data
- weekday_mapper.py (python 2 syntax, as in the hive wiki example)
  import sys
  import datetime

  for line in sys.stdin:
      line = line.strip()
      userid, movieid, rating, unixtime = line.split('\t')
      weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
      print '\t'.join([userid, movieid, rating, str(weekday)])
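- worth sanity-checking the mapper outside hive first (the path assumes the unzipped ml-100k dir):
  $ head -3 ml-100k/u.data | python weekday_mapper.py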
- use the mapper script
  hive> create table u_data_new (
          userid int,
          movieid int,
          rating int,
          weekday int)
        row format delimited
        fields terminated by '\t';

  hive> add file weekday_mapper.py;

  hive> insert overwrite table u_data_new
        select
          transform (userid, movieid, rating, unixtime)
          using 'weekday_mapper.py'
          as (userid, movieid, rating, weekday)
        from u_data;

  # 2. <---------------- the insert above failed at first:
  Query ID = hqlgree2_20150601170202_150ee6ee-fa50-49d9-ad27-7158cda130ef
  Total jobs = 3
  Launching Job 1 out of 3
  Number of reduce tasks is set to 0 since there's no reduce operator
  Job running in-process (local Hadoop)
  2015-06-01 17:02:12,361 Stage-1 map = 0%, reduce = 0%
  Ended Job = job_local524868591_0001 with errors
  Error during job, obtaining debugging information...
  Job Tracking URL: http://localhost:8080/
  FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
  MapReduce Jobs Launched:
  Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 FAIL
  Total MapReduce CPU Time Spent: 0 msec

  # solution: 2. <---------------- invoke the script through python with an absolute path:
  hive> insert overwrite table u_data_new
        select
          transform (userid, movieid, rating, unixtime)
          using 'python /Users/hqlgree2/Documents/hadoop/hive/weekday_mapper.py'
          as (userid, movieid, rating, weekday)
        from u_data;
  Query ID = hqlgree2_20150601221515_9686053f-8a88-45e9-b02a-90e0fff696e9
  Total jobs = 3
  Launching Job 1 out of 3
  Number of reduce tasks is set to 0 since there's no reduce operator
  Job running in-process (local Hadoop)
  2015-06-01 22:15:19,293 Stage-1 map = 100%, reduce = 0%
  Ended Job = job_local526336603_0006
  Stage-4 is selected by condition resolver.
  Stage-3 is filtered out by condition resolver.
  Stage-5 is filtered out by condition resolver.
  Moving data to: hdfs://localhost:9000/user/hive/warehouse/u_data_new/.hive-staging_hive_2015-06-01_22-15-18_053_7587415369814957289-1/-ext-10000
  Loading data to table default.u_data_new
  Table default.u_data_new stats: [numFiles=1, numRows=100000, totalSize=1179173, rawDataSize=1079173]
  MapReduce Jobs Launched:
  Stage-Stage-1: HDFS Read: 4134474 HDFS Write: 1179256 SUCCESS
  Total MapReduce CPU Time Spent: 0 msec
  OK
  Time taken: 1.377 seconds

  hive> select weekday, count(*) from u_data_new group by weekday;
  Query ID = hqlgree2_20150601221616_fb0af076-8d19-4788-bd16-383152f1856d
  Total jobs = 1
  Launching Job 1 out of 1
  Number of reduce tasks not specified. Defaulting to jobconf value of: 1
  In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=<number>
  In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=<number>
  In order to set a constant number of reducers:
    set mapreduce.job.reduces=<number>
  Job running in-process (local Hadoop)
  2015-06-01 22:16:47,499 Stage-1 map = 100%, reduce = 100%
  Ended Job = job_local111019165_0007
  MapReduce Jobs Launched:
  Stage-Stage-1: HDFS Read: 10627460 HDFS Write: 2358512 SUCCESS
  Total MapReduce CPU Time Spent: 0 msec
  OK
  1    12254
  2    13579
  3    14430
  4    15114
  5    14743
  6    18229
  7    11651
  Time taken: 1.802 seconds, Fetched: 7 row(s)
- java.io.FileNotFoundException
  - with a hive-env.sh in place ($ cp hive-env.sh.template hive-env.sh) the query fails:
    hive> select weekday, count(*) from u_data_new group by weekday;
    Query ID = hqlgree2_20150602221313_3ecfb503-ce16-455f-8b48-64f381df9208
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks not specified. Estimated from input data size: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    java.io.FileNotFoundException: File does not exist: hdfs://localhost:9000/usr/local/hive/libexec/lib/ST4-4.0.4.jar
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1301)
        at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:288)
        at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:224)
        at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestamps(ClientDistributedCacheManager.java:93)
        at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestampsAndCacheVisibilities(ClientDistributedCacheManager.java:57)
        at org.apache.hadoop.mapreduce.JobResourceUploader.uploadFiles(JobResourceUploader.java:179)
        at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:98)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:193)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
        at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
        at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:428)
        at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:137)
        at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:88)
        at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1638)
        at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1397)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1183)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1049)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1039)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:207)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:159)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:370)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:754)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
    Job Submission failed with exception 'java.io.FileNotFoundException(File does not exist: hdfs://localhost:9000/usr/local/hive/libexec/lib/ST4-4.0.4.jar)'
    FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
  - after moving it aside ($ mv hive-env.sh hive-env.sh.bak) the same query succeeds:
    hive> select weekday, count(*) from u_data_new group by weekday;
    Query ID = hqlgree2_20150602221515_cf6735ad-2974-48fc-9122-b97f95d107f5
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks not specified. Estimated from input data size: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapreduce.job.reduces=<number>
    Job running in-process (local Hadoop)
    2015-06-02 22:15:35,666 Stage-1 map = 100%, reduce = 100%
    Ended Job = job_local1524182395_0001
    MapReduce Jobs Launched:
    Stage-Stage-1: HDFS Read: 2358346 HDFS Write: 0 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    1    12254
    2    13579
    3    14430
    4    15114
    5    14743
    6    18229
    7    11651
    Time taken: 2.37 seconds, Fetched: 7 row(s)