machine learning
-
modeling
mathematical (or probabilistic) relationship that exists between different variables
-
what is machine learning
use existing data to `develop models` that we can use to `predict` various outcomes for new data, e.g.
predicting whether an email message is spam or not
predicting whether a credit card transaction is fraudulent
predicting which advertisement a shopper is most likely to click on
predicting which football team is going to win the super bowl
-
overfitting and underfitting
-
the simplest way to guard against overfitting is to split the data set
import random

def split_data(data, prob):
    """split data into fractions [prob, 1 - prob]"""
    result = [], []
    for row in data:
        # put each row into the first or second list at random
        result[0 if random.random() < prob else 1].append(row)
    return result
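A quick sanity check of split_data, using made-up data: with prob = 0.75, roughly three quarters of the rows should land in the first list (the exact counts vary because the split is random)

data = list(range(1000))
train, test = split_data(data, 0.75)
print(len(train), len(test))  # roughly 750 and 250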
-
training data or test data
def train_test_split(x, y, test_pct):
    data = list(zip(x, y))                        # pair corresponding values
    train, test = split_data(data, 1 - test_pct)  # split the pairs
    x_train, y_train = zip(*train)                # un-zip each half
    x_test, y_test = zip(*test)
    return x_train, x_test, y_train, y_test

# do something like
model = SomeKindOfModel()
x_train, x_test, y_train, y_test = train_test_split(xs, ys, 0.33)
model.train(x_train, y_train)
performance = model.test(x_test, y_test)
-
correctness
-
every prediction falls into one of four categories
-
true positive
this message is spam, and correctly predicted spam
-
false positive
type 1 error: not spam but predicted spam
-
false negative
type 2 error: is spam but predicted not spam
-
true negative
is not spam, and correctly predicted not spam
-
-
confusion matrix
+------------------+----------------+----------------+
|                  | is spam        | is not spam    |
+------------------+----------------+----------------+
| predict spam     | true positive  | false positive |
+------------------+----------------+----------------+
| predict not spam | false negative | true negative  |
+------------------+----------------+----------------+
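Not from the text, but a minimal sketch of how these four counts could be tallied from parallel lists of actual and predicted labels:

from collections import Counter

def confusion_counts(actual, predicted):
    """tally (tp, fp, fn, tn) from parallel lists of boolean labels"""
    counts = Counter(zip(predicted, actual))
    tp = counts[(True, True)]     # predicted spam, is spam
    fp = counts[(True, False)]    # predicted spam, is not spam
    fn = counts[(False, True)]    # predicted not spam, is spam
    tn = counts[(False, False)]   # predicted not spam, is not spam
    return tp, fp, fn, tn

print(confusion_counts(actual=[True, False, True, False],
                       predicted=[True, True, False, False]))  # (1, 1, 1, 1)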
-
accuracy
is defined as the fraction of correct predictions

def accuracy(tp, fp, fn, tn):
    correct = tp + tn
    total = tp + fp + fn + tn
    return correct / total

print(accuracy(70, 4940, 13930, 981070))  # 0.98114
-
it’s common to look at the combination of
precision
and recall

def precision(tp, fp, fn, tn):
    return tp / (tp + fp)

print(precision(70, 4940, 13930, 981070))  # 0.014

def recall(tp, fp, fn, tn):
    return tp / (tp + fn)

print(recall(70, 4940, 13930, 981070))  # 0.005

0.014 and 0.005 are both terrible numbers, reflecting that this is a terrible model
-
precision and recall are combined into the
f1 score
def f1_score(tp, fp, fn, tn):
    p = precision(tp, fp, fn, tn)
    r = recall(tp, fp, fn, tn)
    # this is the `harmonic mean` of precision and recall
    # and necessarily lies between them
    return 2 * p * r / (p + r)
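With the same counts as above, the f1 score works out to roughly 0.0074, again confirming how poor the model is:

print(f1_score(70, 4940, 13930, 981070))  # about 0.0074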
-
the bias-variance trade-off
-
another way of thinking about
overfitting
is as a trade-off between bias and variance
-
high bias and low variance typically correspond to underfitting
-
low bias and high variance typically correspond to overfitting
-
if your model has high bias
which means it performs poorly even on your training data, then one thing to try is adding more features
-
if your model has high variance
you can similarly remove features, but another solution is to obtain more data, as the sketch below illustrates
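A minimal sketch of the trade-off, assuming numpy and a made-up noisy quadratic data set: a degree-0 fit has high bias (it misses even the training data), while a very high-degree fit has high variance (it chases the training noise)

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = x ** 2 + rng.normal(scale=0.1, size=len(x))  # made-up noisy quadratic data

for degree in [0, 2, 9]:
    coeffs = np.polyfit(x, y, degree)                        # fit a polynomial of this degree
    train_error = np.mean((np.polyval(coeffs, x) - y) ** 2)  # mean squared error on the training data
    print(degree, round(train_error, 4))

# degree 0 (high bias): large error even on the training data -> underfitting
# degree 9 (high variance): tiny training error, but the fit chases the noise -> overfitting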
-
feature extraction and selection
-
features are whatever inputs we provide to our model
-
when the data doesn't have enough features, the model is likely to underfit
-
when the data has too many features, it's easy to overfit
-
-
e.g.
if you want to predict someone's salary based on her `years of experience`, then `years of experience` is the only feature you have
-
we extract features from our data that fall into one of these three categories (see the sketch after this list)
-
naive bayes classifier
suited to yes-or-no features
-
regression models
require numeric features
-
decision trees
deal with numeric or categorical data
-
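A minimal sketch of those three feature types, using a hypothetical salary-prediction record (the field names are made up for illustration):

from typing import NamedTuple

class Candidate(NamedTuple):
    years_experience: float  # numeric feature -> suits regression models
    has_phd: bool            # yes-or-no feature -> suits a naive bayes classifier
    job_title: str           # categorical feature -> suits decision trees

def extract_features(candidate):
    """turn a raw record into (numeric, yes-or-no, categorical) features"""
    return candidate.years_experience, candidate.has_phd, candidate.job_title

print(extract_features(Candidate(4.5, False, "data scientist")))
# (4.5, False, 'data scientist')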