18 February 2016

01 fundamentals

  1. matrices & linear algebra fundamentals

  2. hash functions, binary tree, o(n)

  3. relational algebra, db basics

  4. inner, outer, cross, theta join

  5. cap theorem

  6. tabular data

  7. data frames & series

  8. sharding

  9. olap

  10. multidimensional data model

  11. etl

  12. reporting vs bi vs analytics

  13. json & xml

  14. nosql

  15. regex

  16. vendor landscape

  17. env setup

02 statistics

  1. pick a dataset(uci repo)

  2. descriptive statistics(mean, median, range, sd, var)

  3. exploratory statistics

  4. histograms

  5. percentiles & outliers

  6. probability theory

  7. bayes theorem

  8. random variables

  9. cumul dist fn(cdf)

  10. continuous distributions(normal, poisson, gaussian)

  11. skewness

  12. anova

  13. prob den fn(pdf)

  14. central limit theorem

  15. monte carlo method

  16. hypothesis testing

  17. p-value

  18. chi² test

  19. estimation

  20. confid int(ci)

  21. mle

  22. kernel density estimate

  23. regression

  24. covariance

  25. correlation

  26. pearson coeff

  27. causation

  28. least squares fit

  29. euclidean distance

03 programming

  1. python basics

  2. working in excel

  3. r setup rstudio

  4. r basic

  5. expression

  6. variables

  7. ibm spss

  8. rapid miner

  9. vectors

  10. matrices

  11. arrays

  12. factors

  13. lists

  14. data frame

  15. reading csv data

  16. reading raw data

  17. subsetting data

  18. manipulate data frames

  19. functions

  20. factor analysis

  21. install pkgs

04 machine learning

  1. what is ml?

  2. numerical var

  3. categorical var

  4. supervised learning

  5. unsupervised learning

  6. concepts, inputs & attributes

  7. training & test data

  8. classifier

  9. prediction

  10. lift

  11. overfitting

  12. bias & variance

  13. trees & classification

  14. classification rate

  15. decision trees

  16. boosting

  17. naive bayes classifiers

  18. k-nearest neighbor

  19. logistic regression

  20. regression

    1. ranking

    2. linear regression

    3. perceptron

  21. clustering

    1. hierarchical clustering

    2. k-means clustering

  22. neural networks

  23. sentiment analysis

  24. tagging

  25. vocabulary mapping

05 text mining / nlp

  1. corpus

  2. named entity recognition

  3. text analysis

  4. uima

  5. term document matrix

  6. term frequency & weight

  7. support vector machines

  8. association rules

  9. market basket analysis

  10. feature extraction

  11. using mahout

  12. using weka

  13. using nltk

  14. classify text

  15. vocabulary mapping

  16. tagging

06 visualization

  1. data exploration in r(hist, boxplot etc)

  2. uni, bi & multivariate viz

  3. ggplot2

  4. histogram & pie(uni)

  5. tree & tree map

  6. scatter plot(bi)

  7. line charts(bi)

  8. spatial charts

  9. survey plot

  10. timeline

  11. decision tree

  12. d3.js

  13. info vis

  14. ibm manyeyes

  15. tableau

07 big data

  1. map reduce fundamentals

  2. hadoop components

  3. hdfs

  4. data replication principles

  5. setup hadoop(ibm/cloudera/hortonworks)

  6. name & data nodes

  7. job & task tracker

  8. mr programming

  9. sqoop loading data in hdfs

  10. flume, scribe for unstruct data

  11. sql with pig

  12. dwh with hive

  13. scribe, chukwa for weblog

  14. using mahout

  15. zookeeper avro

  16. storm hadoop realtime

  17. rhadoop, rhipe

  18. rmr

  19. cassandra

  20. mongodb, neo4j

08 data ingestion

  1. summary of data formats

  2. data discovery

  3. data sources & acquisition

  4. data integration

  5. data fusion

  6. transformation & enrichment

  7. data survey

  8. google openrefine

  9. how much data?

  10. using etl

09 data munging

  1. dimensionality & numerosity reduction

  2. normalization

  3. data scrubbing

  4. handling missing values

  5. unbiased estimators

  6. binning sparse values

  7. feature extraction

  8. denoising

  9. sampling

  10. stratified sampling

  11. principal component analysis

  12. transformation & enrichment

10 toolbox

  1. ms excel w/ analysis toolpak

  2. java, python

  3. r rstudio rattle

  4. weka, knime, rapidminer

  5. hadoop dist of choice

  6. spark, storm

  7. flume, scribe, chukwa

  8. nutch, talend, scraperwiki

  9. webscraper, flume, sqoop

  10. tm, rweka, nltk

  11. rhipe

  12. d3.js, ggplot2, shiny

  13. ibm languageware

  14. cassandra, mongodb


