01 fundamentals
-
matrices & linear algebra fundamentals
-
hash functions, binary tree, O(n)
-
relational algebra, db basics
-
inner, outer, cross, theta join
-
cap theorem
-
tabular data
-
data frames & series
-
sharding
-
olap
-
multidimensional data model
-
etl
-
reporting vs bi vs analytics
-
json & xml
-
nosql
-
regex
-
vendor landscape
-
env setup
02 statistics
-
pick a dataset (uci repo)
-
descriptive statistics (mean, median, range, sd, var)
-
exploratory statistics
-
histograms
-
percentiles & outliers
-
probability theory
-
bayes theorem
-
random variables
-
cumulative distribution function (cdf)
-
continuous distributions (normal, poisson, gaussian)
-
skewness
-
anova
-
probability density function (pdf)
-
central limit theorem
-
monte carlo method
-
hypothesis testing
-
p-value
-
chi² test
-
estimation
-
confidence interval (ci)
-
mle
-
kernel density estimate
-
regression
-
covariance
-
correlation
-
pearson coeff
-
causation
-
least squares fit
-
euclidean distance
03 programming
-
python basics
-
working in excel
-
r setup, rstudio
-
r basics
-
expressions
-
variables
-
ibm spss
-
rapidminer
-
vectors
-
matrices
-
arrays
-
factors
-
lists
-
data frame
-
reading csv data
-
reading raw data
-
subsetting data
-
manipulate data frames
-
functions
-
factor analysis
-
install pkgs
04 machine learning
-
what is ml?
-
numerical var
-
categorical var
-
supervised learning
-
unsupervised learning
-
concepts, inputs & attributes
-
training & test data
-
classifier
-
prediction
-
lift
-
overfitting
-
bias & variance
-
trees & classification
-
classification rate
-
decision trees
-
boosting
-
naive bayes classifiers
-
k-nearest neighbor
-
logistic regression
-
regression
-
ranking
-
linear regression
-
perceptron
-
clustering
-
hierarchical clustering
-
k-means clustering
-
neural networks
-
sentiment analysis
-
tagging
-
vocabulary mapping
05 text mining / nlp
-
corpus
-
named entity recognition
-
text analysis
-
uima
-
term document matrix
-
term frequency & weight
-
support vector machines
-
association rules
-
market basket analysis
-
feature extraction
-
using mahout
-
using weka
-
using nltk
-
classify text
-
vocabulary mapping
-
tagging
06 visualization
-
data exploration in r (hist, boxplot, etc.)
-
uni, bi & multivariate viz
-
ggplot2
-
histogram & pie (uni)
-
tree & tree map
-
scatter plot (bi)
-
line charts (bi)
-
spatial charts
-
survey plot
-
timeline
-
decision tree
-
d3.js
-
info vis
-
ibm manyeyes
-
tableau
07 big data
-
mapreduce fundamentals
-
hadoop components
-
hdfs
-
data replication principles
-
setup hadoop (ibm/cloudera/hortonworks)
-
name & data nodes
-
job & task tracker
-
mr programming
-
sqoop: loading data into hdfs
-
flume, scribe for unstructured data
-
sql with pig
-
dwh with hive
-
scribe, chukwa for weblog
-
using mahout
-
zookeeper, avro
-
storm: hadoop realtime
-
rhadoop, rhipe
-
rmr
-
cassandra
-
mongodb, neo4j
08 data ingestion
-
summary of data formats
-
data discovery
-
data sources & acquisition
-
data integration
-
data fusion
-
transformation & enrichment
-
data survey
-
google openrefine
-
how much data?
-
using etl
09 data munging
-
dimensionality & numerosity reduction
-
normalization
-
data scrubbing
-
handling missing values
-
unbiased estimators
-
binning sparse values
-
feature extraction
-
denoising
-
sampling
-
stratified sampling
-
principal component analysis
-
transformation & enrichment
10 toolbox
-
ms excel w/ analysis toolpak
-
java, python
-
r, rstudio, rattle
-
weka, knime, rapidminer
-
hadoop dist of choice
-
spark, storm
-
flume, scribe, chukwa
-
nutch, talend, scraperwiki
-
webscraper, flume, sqoop
-
tm, rweka, nltk
-
rhipe
-
d3.js, ggplot2, shiny
-
ibm languageware
-
cassandra, mongodb