26 March 2016

multiple regression

  1. hypothesize a linear model with more independent variables

         minutes = α + β1 friends + β2 work hours + β3 phd + ε
    
         # in ch11 we introduced a dummy variable:
         # 1 for users with a phd, 0 for users without
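
     a minimal sketch of encoding the dummy variable (the user dict and its
     has_phd field are hypothetical, just for illustration):

         # hypothetical raw record
         user = {"friends": 49, "work_hours": 4, "has_phd": False}
         phd = 1 if user["has_phd"] else 0  # dummy variable: 1 or 0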
    
  2. the model

    1. ch14

       # in ch14 we fit a model of the form
       yi = α + β xi + εi

       # now each input xi is not a single number
       # but rather a vector of k numbers
       # xi1, ..., xik

       # the multiple regression model assumes that
       yi = α + β1 xi1 + ... + βk xik + εi
      
     2. in multiple regression the vector of parameters is usually called β

       beta = [alpha, beta_1, ..., beta_k]
       # and
       x_i = [1, x_i1, ..., x_ik]
      
       # then our model is just
       def predict(x_i, beta):
           """assumes that the first element of each x_i is 1"""
           return dot(x_i, beta)
      
       # our independent variable x will be a list of vectors
       # each of which looks like this
       [1,   # constant term
        49,  # number of friends
        4,   # work hours per day
        0]   # doesn't have phd
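
        predict relies on dot from ch4 (linear algebra); a minimal version,
        plus a quick use of predict with the beta estimated below:

        def dot(v, w):
            """v_1 * w_1 + ... + v_n * w_n"""
            return sum(v_i * w_i for v_i, w_i in zip(v, w))

        predict([1, 49, 4, 0], [30.63, 0.972, -1.868, 0.911])
        # 30.63 + 0.972*49 - 1.868*4 + 0.911*0 ≈ 70.8 daily minutes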
      
  3. further assumptions of the least squares model

          # e.g. in ch14 we built a model that
          # predicted each additional friend
          # was associated with an extra 0.90 daily minutes on the site
    
    1. imagine

        # 1. people who work more hours spend less time on the site
        # 2. people with more friends tend to work more hours

        # the "actual" model is
        minutes = α + β1 friends + β2 work hours + ε

        # but if we only fit the single-variable model
        minutes = α + β1 friends + ε

        # then, since β2 is negative and work hours is positively
        # correlated with friends, we will underestimate β1
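
       a small simulation (all numbers made up) of this omitted-variable
       bias; β2 is negative and friends/work hours are positively correlated,
       so the single-variable slope comes out below the true β1:

       import random
       random.seed(0)

       # true model: minutes = 30 + 1.0 friends - 2.0 work_hours + noise
       n = 10000
       friends    = [random.gauss(50, 10) for _ in range(n)]
       work_hours = [4 + 0.05 * f + random.gauss(0, 1) for f in friends]
       minutes    = [30 + 1.0 * f - 2.0 * w + random.gauss(0, 3)
                     for f, w in zip(friends, work_hours)]

       def mean(xs): return sum(xs) / len(xs)

       def covariance(xs, ys):
           mx, my = mean(xs), mean(ys)
           return sum((a - mx) * (b - my)
                      for a, b in zip(xs, ys)) / (len(xs) - 1)

       # least squares slope of the single-variable model minutes ~ friends
       slope = covariance(friends, minutes) / covariance(friends, friends)
       print(slope)  # ≈ 0.9, below the true coefficient of 1.0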
      
  4. fitting the model

         def error(x_i, y_i, beta):
             return y_i - predict(x_i, beta)
    
         def squared_error(x_i, y_i, beta):
             return error(x_i, y_i, beta) ** 2
    
         # use calculus
         def squared_error_gradient(x_i, y_i, beta):
             """the gradient (with respect to beta)
             corresponding to the ith squared error term"""
             return [-2 * x_ij * error(x_i, y_i, beta)
                     for x_ij in x_i]
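
     a quick sanity check of these functions on a made-up data point:

         x_i = [1, 2, 3]
         y_i = 30
         beta = [4, 4, 4]                        # prediction = 4 + 8 + 12 = 24
         error(x_i, y_i, beta)                   # 30 - 24 = 6
         squared_error(x_i, y_i, beta)           # 36
         squared_error_gradient(x_i, y_i, beta)  # [-12, -24, -36]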
    
    1. find optimal beta using stochastic gradient descent

       def estimate_beta(x, y):
           beta_initial = [random.random() for x_i in x[0]]
           return minimize_stochastic(squared_error,
                                      squared_error_gradient,
                                      x, y,
                                      beta_initial,
                                      0.001)
        import random
        random.seed(0)
        beta = estimate_beta(x, daily_minutes_good)
        # [30.63, 0.972, -1.868, 0.911]
      
       # this means our model looks like
       minutes = 30.63 + 0.972 friends − 1.868 work hours + 0.911 phd
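
       minimize_stochastic is the stochastic gradient descent routine from
       ch8; a simplified stand-in (the ch8 version also uses target_fn to
       track the best theta seen; this sketch skips that) to make the loop
       concrete:

       import random

       def minimize_stochastic_sketch(target_fn, gradient_fn, x, y, theta,
                                      alpha=0.001, epochs=100):
           """take a small gradient step against one data point at a time"""
           data = list(zip(x, y))
           for _ in range(epochs):
               random.shuffle(data)  # visit the points in random order
               for x_i, y_i in data:
                   grad = gradient_fn(x_i, y_i, theta)
                   theta = [t - alpha * g for t, g in zip(theta, grad)]
           return theta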
      
  5. interpreting the model
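
     reading the coefficients from the fitted beta above, all else being
     equal:

         # each additional friend     ~ +0.97 minutes per day on the site
         # each additional work hour  ~ -1.87 minutes per day
         # having a phd               ~ +0.91 minutes per day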

  6. goodness of fit

    1. r-squared

       def multiple_r_squared(x, y, beta):
           sum_of_squared_errors = sum(error(x_i, y_i, beta) ** 2
                                       for x_i, y_i in zip(x, y))
           return 1.0 - sum_of_squared_errors / total_sum_of_squares(y)
      
        # keep in mind
        # adding new variables to a regression will `never decrease`
        # the r-squared on the data the model was fit to
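
       total_sum_of_squares is the ch14 helper; a minimal version:

       def mean(xs):
           return sum(xs) / len(xs)

       def total_sum_of_squares(y):
           """the total squared variation of the y_i's from their mean"""
           return sum((y_i - mean(y)) ** 2 for y_i in y)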
      
  7. digression the bootstrap

          # a sample of n data points
          # generated by some (unknown to us) distribution
          data = get_sample(num_points=n)

          # recall the `median` function from ch5
          # 1. if all the data are very close to 100
          #    => the actual median is close to 100
          # 2. if half the data is close to 0 and the other half is close to 200
          #    => we can't be nearly as certain about the median
    
    1. bootstrap

       def bootstrap_sample(data):
           """randomly sample len(data) element with replacement"""
           return [random.choice(data) for _ in data]
      
       def bootstrap_statistic(data, stats_fn, num_samples):
           """evaluates stats_fn on num_samples bootstrap samples from data"""
           return [stats_fn(bootstrap_sample(data))
                   for _ in range(num_samples)]
      
       # e.g.
       # 101 points all very close to 100
        close_to_100 = [99.5 + random.random() for _ in range(101)]
      
        # 101 points: one near 100, 50 near 0, 50 near 200
       far_from_100 = ([99.5 + random.random()] +
                       [random.random() for _ in range(50)] +
                       [200 + random.random() for _ in range(50)])
      
        # the median of each dataset is close to 100
        # however (median is the ch5 helper)
        bootstrap_statistic(close_to_100, median, 100)
        # => medians all very close to 100

        bootstrap_statistic(far_from_100, median, 100)
        # => many medians near 0 and many near 200
      
  8. standard error of regression coefficients
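
     the bootstrap above gives one way to estimate these: fit the model on
     many bootstrap samples and look at the spread of each coefficient (a
     sketch; estimate_sample_beta is introduced here, and standard_deviation
     is the ch5 helper):

        def estimate_sample_beta(sample):
            """sample is a list of (x_i, y_i) pairs"""
            x_sample = [x_i for x_i, _ in sample]
            y_sample = [y_i for _, y_i in sample]
            return estimate_beta(x_sample, y_sample)

        bootstrap_betas = bootstrap_statistic(list(zip(x, daily_minutes_good)),
                                              estimate_sample_beta,
                                              100)

        bootstrap_standard_errors = [
            standard_deviation([beta[i] for beta in bootstrap_betas])
            for i in range(4)]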


