Statistics Notes

Created: 4/27/2020 09:44
Updated: 10/21/2020 17:21
Author: michaelg.tadesse@gmail.com
Source: http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Probability/BS704_Probability12.html

normal-normal conjugate
a normal-normal conjugate method uses a normal pdf to weight the alternative hypothesis for mu - meaning the prior distribution for the maple syrup problem is a normal distribution 
#when using the normal-normal conjugate
use a normal distribution as a prior distribution for the unknown parameter of a normal distribution [assume sigma is known]
collect data [from a normal distribution]
use conjugate shortcut to generate the posterior distribution for the unknown parameter - which will also be a normal distribution
#normal-normal conjugate analytical solution
https://gyazo.com/9bdc90ef80e493ca3ff1c7d04674f0d5
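a minimal python sketch of that update (my own illustration, not from the source) - the prior hyperparameters and data below are made up:
  # normal-normal conjugate update, sigma assumed known
  # prior:      mu ~ N(mu0, tau0^2)
  # data:       x_i ~ N(mu, sigma^2), i = 1..n
  # posterior:  mu | data ~ N(post_mean, post_var)
  import numpy as np

  mu0, tau0 = 10.0, 2.0                      # prior mean and prior sd (made up)
  sigma = 1.5                                # known data sd (made up)
  data = np.array([11.2, 9.8, 10.5, 12.1])   # made-up observations

  n, xbar = len(data), data.mean()
  post_var = 1.0 / (1.0 / tau0**2 + n / sigma**2)
  post_mean = post_var * (mu0 / tau0**2 + n * xbar / sigma**2)
  print(post_mean, np.sqrt(post_var))        # posterior mean and sd for mu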

gamma-Poisson conjugate
refresher on conjugation: there are cases where you can use a particular pdf as a prior distribution, collect data of a specific flavor, and then derive the posterior pdf with a closed-form solution 

the pdf's of the prior and posterior distributions are the same probability density function, but their parameters may differ 

such a prior is called a conjugate prior - the effect of the data can be interpreted in terms of changes in the parameter values

when more data comes in -> use the posterior as a prior and use the shortcut to update the old posterior to a new posterior

#to summarize
gamma prior + Poisson data == gamma posterior
https://gyazo.com/b25ec7b7603c3321ea99235ad902c22e
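a minimal python sketch of this shortcut (my own illustration, not from the source) - the hyperparameters and counts are made up:
  # gamma-Poisson conjugate update
  # prior:      lambda ~ Gamma(alpha, beta)   (shape alpha, rate beta)
  # data:       counts k_1..k_n ~ Poisson(lambda)
  # posterior:  lambda | data ~ Gamma(alpha + sum(k), beta + n)
  alpha, beta = 2.0, 1.0         # made-up prior hyperparameters
  counts = [3, 5, 2, 4]          # made-up Poisson counts

  alpha_post = alpha + sum(counts)
  beta_post = beta + len(counts)
  print(alpha_post / beta_post)  # posterior mean of lambda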
#poisson distribution
expresses the probability of a given number of events occurring in a fixed interval of time or space
the mean and the variance of this distribution are both lambda
basically this is telling us the probability of observing a random variable equal to k

the possible values that k can assume for the Poisson distribution are not finite; rather, they are countably infinite

as lambda gets larger the distribution starts to look like normal
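a quick scipy check of these properties (my own addition, with a made-up lambda):
  from scipy.stats import poisson

  lam = 4.0                                    # made-up rate
  print(poisson.pmf(2, lam))                   # P(K = 2)
  print(poisson.mean(lam), poisson.var(lam))   # both equal lambda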
#posterior updating
remember that as data comes in, the Bayesian's previous posterior becomes her new prior, so learning is self-consistent
#hyperparameters
in Bayesian statistics, a hyperparameter is a parameter of a prior or posterior distribution
#gamma function
https://www.youtube.com/watch?v=7y-XTrfNvCs
gamma(n) = (n-1)! - where n is a positive integer
gamma(z+1) = z*gamma(z)
gamma(1) = 1
gamma(1/2) = sqrt(PI)
this was computed by relating the integral of gamma with the normal distribution
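these identities can be checked with python's math.gamma (my own addition):
  import math

  print(math.gamma(5), math.factorial(4))         # gamma(n) = (n-1)!  -> 24.0, 24
  print(math.gamma(3.5), 2.5 * math.gamma(2.5))   # gamma(z+1) = z * gamma(z)
  print(math.gamma(0.5), math.sqrt(math.pi))      # gamma(1/2) = sqrt(pi)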

#gamma distribution

#bayesian way of parameter estimation
#beta family
a flexible family of distributions - describes a wide range of prior beliefs
p ~ beta(alpha, beta)
the uniform distribution is a special case of the beta family with alpha = beta = 1

if enough data are observed, you will converge to an accurate posterior distribution
#conjugacy
in the conjugate case, an analytical solution exists that makes the posterior update possible - it avoids the integration required in the denominator of Bayes' theorem
for the non-conjugate case, there is usually no simple mathematical expression, and one must resort to computation
flat priors are not necessarily non-informative, and non-informative priors are not necessarily flat 
#but what does that really mean?

the beta distribution is a conjugate distribution that can be updated with binomial data - beta-binomial conjugate

the proof for the above can be found in Appendix 1 of Intro to Bayes book

a conjugate prior is an algebraic convenience, giving a closed-form expression for the posterior - otherwise a difficult integration may be necessary

the beta distribution is a suitable model for the random behavior of percentages and proportions - it can be used as a conjugate prior pdf for the Bernoulli, binomial, negative binomial and geometric distributions
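a minimal python sketch of the beta-binomial update (my own illustration, not from the source) - the data are made up:
  # beta-binomial conjugate update
  # prior:      p ~ Beta(a, b)   (a = b = 1 is the uniform prior)
  # data:       k successes in n trials
  # posterior:  p | data ~ Beta(a + k, b + n - k)
  a, b = 1.0, 1.0       # uniform prior
  k, n = 7, 10          # made-up data: 7 successes in 10 trials

  a_post, b_post = a + k, b + (n - k)
  print(a_post / (a_post + b_post))   # posterior mean of p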

#how to present posterior results
  • display the posterior distribution
  • use metrics [mean, median, mode, quantiles]


#bayes formula
in Bayes' formula, both the prior and posterior distributions are pdfs - P is a probability density
  • to compute the likelihood, we first need to make an assumption about how the data was generated - this can be a probability distribution (normal, exponential etc.)
  • likelihood == the probability of observing the data, given the hypothesis
    • P(observing 4 pregnancies | we choose m10 for treatment)
    • P(president Trump has covid | he would do whatever it takes to remain in power)
the area under the likelihood curve is not equal to 1.0 - how is that possible?? 

#difference between likelihood and probability
"because we generally do not entertain the full set of alternative hypotheses and because some are nested within others, the likelihoods that we attach to our hypotheses do not have any meaning in and of themselves; only the relative likelihoods -- that is, the ratios of two likelihoods --- have meaning." 
                                                C. Randy Gallistel
probabilities attach to results (which are exclusive and exhaustive)
likelihoods attach to hypotheses 

#solving for likelihood - example
Suppose we hypothesize that μ = 5.0, and assume that σ is known to be 0.5. And further suppose that we draw a random bacterium that lives x = 4.5 hours. We can ask, “What is the likelihood that x = 4.5 given that μ is 5.0 and σ = 0.5?” We will use the normal pdf to answer this question
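evaluating that likelihood with scipy (my own addition):
  from scipy.stats import norm

  # likelihood of the hypothesis mu = 5.0 (sigma = 0.5) given x = 4.5
  print(norm.pdf(4.5, loc=5.0, scale=0.5))   # ~0.484 - a density, not a probability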


#prior elicitation

#comparison between frequentist and bayesian inferences
the frequentist approach is highly sensitive to the choice of null hypothesis, whereas with the Bayesian method this is typically not the case
#p-value
definition: the probability of observing something at least as extreme as the data, given that the null hypothesis is true (more extreme in the direction of the alternative hypothesis)

For example, a p-value of 0.0254 (2.54%) means that, if the null hypothesis were true, there would be only a 2.54% chance of observing data at least this extreme. That's pretty small. On the other hand, a large p-value of 0.9 (90%) means data like ours would be entirely unsurprising under the null hypothesis. Therefore, the smaller the p-value, the stronger the evidence against the null hypothesis (the more "significant" the result).

a p-value is needed to make an inference decision with the frequentist approach


#credible interval - the Bayesian alternative for confidence interval
a range for which the Bayesian thinks that the probability of including the true value is, say, 0.95

thus, a Bayesian can say that there is a 95% chance that the credible interval contains the true parameter value
#solved problem on how to calculate confidence intervals
https://www.statisticshowto.com/probability-and-statistics/confidence-interval/

#debunk common misconception about confidence interval

#frameworks under which we can define probabilities
  • classical: outcomes are equally likely --> they have equal probabilities, e.g. a fair die: P(X = 4) = 1/6
  • frequentist: have a hypothetical infinite sequence of events -> look at the relative frequency of the events - how frequent did we observe the outcome that interests us out of the total
    • empirical probability distribution: based on raw data 
    • law of large numbers: your estimate of Pr(X) gets closer and closer to the true probability when you use more trials
  • Bayesian: personal perspective - your measure of uncertainty - based on a synthesis of evidence and personal judgement regarding that evidence
    • prior probabilities are updated through an iterative process of data collection
#Bayesian approach
true positive rate | sensitivity | recall | probability of detection
P(ELISA is positive | person tested has HIV)

true negative rate | specificity
P(ELISA is negative | person tested has no HIV) 

#review of probability distributions
Bernoulli - only two possible outcomes 
basically, a binomial with just one trial (n = 1)
X ~ B(p) -> P(X = 1) = p; P(X = 0) = 1 - p
f(X = x|p) = f(x|p) = p^x (1-p)^(1-x)
E(X) = p
Var(X) = p(1-p)

Binomial - generalization of Bernoulli when we have n repeated trials
key assumption: trials are independent
X ~ Bin(n,p)
E[X] = np
Var(X) = np(1-p)
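a quick scipy check of these moments (my own addition; p and n are made up):
  from scipy.stats import bernoulli, binom

  p, n = 0.3, 10
  print(bernoulli.mean(p), bernoulli.var(p))   # p, p(1-p)
  print(binom.mean(n, p), binom.var(n, p))     # np, np(1-p)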

#poisson distribution
it has an unusual property: both the mean and the variance equal lambda


#expected value of a random variable
#test statistics
  • allow us to quantify how close things are to our expectations or theories
 15 (std) -> twice the amount we expect on average

 
    • if your z statistic is more extreme than the critical value - you call it statistically significant
  • we use a t-statistic if we don't know the true population std (see the sketch after this list)
 
  • the t distribution has thicker tails, as we are estimating the population std and estimation adds more uncertainty -> as we get more data, the t distribution approaches the z (standard normal) distribution
  • larger test statistics and smaller p-values correspond to more extreme samples
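a minimal sketch of a one-sample z statistic and its two-sided p-value (my own illustration; all numbers are made up, and a t statistic would be used if the population std were unknown):
  import math
  from scipy.stats import norm

  xbar, mu0, sigma, n = 103.0, 100.0, 15.0, 36   # sample mean, null mean, known std, n
  z = (xbar - mu0) / (sigma / math.sqrt(n))      # z = 1.2
  p_value = 2 * (1 - norm.cdf(abs(z)))           # two-sided p-value
  print(z, p_value)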
#bayesian statistics
  • frequentist
  • Bayesian 
    • take what you currently know about the population -> use that to estimate what the population is (prior) -> this will adjust our belief about what the population is like
    • posterior - instead of saying I will hit 4 stop lights, I will say it's 3 because of the new data - an aggregation of what we believed before and the data we got now

#randomForest
  • from the feature + class matrix -> subset random chunks -> create Decision trees from these sample sets 
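a minimal scikit-learn sketch of that idea (my own illustration; dataset and settings are arbitrary):
  from sklearn.datasets import load_iris
  from sklearn.ensemble import RandomForestClassifier

  X, y = load_iris(return_X_y=True)
  # each tree is fit on a bootstrap sample of rows and a random subset of features
  clf = RandomForestClassifier(n_estimators=100, random_state=0)
  clf.fit(X, y)
  print(clf.predict(X[:3]))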
#PDF value greater than 1
  • even if the pdf takes on values greater than 1, the area under the curve can still equal exactly 1 as long as the interval it is spread over is short enough (length less than 1)
  • for a continuous random variable, we take an integral of a PDF over a certain interval to find its probability that X will fall in that interval
  • what does a probability density at point x mean? it tells us how much probability is concentrated per unit length (dx) near x, i.e. how dense the probability is near x
  • for a normal distribution, knowing the mean and std pins down the entire distribution of probabilities
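a concrete example of a density above 1 (my own addition): Uniform(0, 0.5) has pdf = 2 everywhere on its support, yet the total area is still 1
  from scipy.stats import uniform

  u = uniform(loc=0, scale=0.5)      # uniform on [0, 0.5]
  print(u.pdf(0.25))                 # 2.0 - the density exceeds 1
  print(u.cdf(0.5) - u.cdf(0.0))     # total probability is still 1.0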
#standard error
the standard error quantifies the variation in the means computed from multiple sets of measurements - it is the standard deviation of the sampling distribution of the mean
#confidence intervals
  • bounds on the estimates of a population variable
  • to present the skill of a predictive model
  • they seek to quantify the uncertainty in a population parameter such as mean or std
  • a 95% CI is a range of values, calculated from our data, that most likely includes the true value of what we are estimating
  • smaller confidence interval == more precise estimate
  • also help to facilitate trade-offs between models - matching CIs indicates equivalence between the models and might provide a reason to favor the less complex or more interpretable model
#to calculate margin of error
margin of error = critical value * standard deviation of the statistic (when the population std is known)
margin of error = critical value * standard error of the statistic (when the std is estimated from the sample)
the critical value is either a t-score or a z-score
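a minimal sketch of a 95% confidence interval built from this margin of error (my own illustration; the sample numbers are made up):
  import math
  from scipy.stats import norm

  xbar, s, n = 98.25, 0.73, 130     # made-up sample mean, sample std, sample size
  se = s / math.sqrt(n)             # standard error of the mean
  z_crit = norm.ppf(0.975)          # ~1.96 for a 95% interval
  moe = z_crit * se                 # margin of error
  print(xbar - moe, xbar + moe)     # 95% confidence interval for the mean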


#prediction intervals
  • quantify and communicate the uncertainty in a prediction
  • describe the uncertainty for a single specific outcome
  • uncertainty comes from
    • model
    • noise in the input data
  • larger than a confidence interval - as it takes into account the confidence interval and the variance in the output variable
  • it's computed as some combination of the estimated variance of the model and the variance of the outcome variable
  • makes assumptions - the distributions of x and y, and the prediction errors made by the model (residuals), are Gaussian
  • PI (prediction interval) = yhat +/- z*sigma
    • yhat - predicted value
    • z - number of std from Gaussian distribution (1.96 for a 95% interval)
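a minimal sketch of that formula (my own illustration; sigma is estimated from made-up residuals and Gaussian errors are assumed):
  import numpy as np

  y_true = np.array([3.1, 4.2, 5.1, 6.3, 6.9])   # made-up observed values
  y_pred = np.array([3.0, 4.0, 5.0, 6.0, 7.0])   # made-up model predictions
  sigma = np.std(y_true - y_pred, ddof=1)        # std of the residuals

  yhat = 8.0                                     # prediction for a new input
  z = 1.96                                       # 95% interval
  print(yhat - z * sigma, yhat + z * sigma)      # prediction interval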


#Assumptions of multiple linear regression
* residuals are normally distributed
* no multicollinearity
* homoscedasticity - variance of error terms are similar across the values of the independent variables
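a rough sketch of checking these assumptions with scipy/statsmodels (my own addition; the data are simulated):
  import numpy as np
  import statsmodels.api as sm
  from scipy.stats import shapiro
  from statsmodels.stats.outliers_influence import variance_inflation_factor

  rng = np.random.default_rng(0)
  X = rng.normal(size=(100, 2))   # two predictors
  y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

  Xc = sm.add_constant(X)
  fit = sm.OLS(y, Xc).fit()

  print(shapiro(fit.resid))       # normality of residuals
  print([variance_inflation_factor(Xc, i) for i in range(1, Xc.shape[1])])   # multicollinearity (VIF)
  # homoscedasticity: plot fit.resid against fit.fittedvalues and look for constant spread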
#what are degrees of freedom (DOF)
they indicate the number of independent values that can vary in an analysis without breaking any constraints

the number of values that are free to vary as you estimate parameters 

typically DOF = sample size - number of parameters we need to estimate

DOF == the number of observations in a sample that are free to vary while estimating statistical parameters 



#how are the outliers in box plot determined

Extreme Value Theorem
  • assumes that all components are identically distributed - this is not needed in central limit theorem
  • CLT describes the limit of sums (or averages); EVT describes the limit of maxima (extreme values)
  • The central limit theorem states that if you have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normally distributed. This will hold true regardless of whether the source population is normal or skewed, provided the sample size is sufficiently large (usually n > 30). If the population is normal, then the theorem holds true even for samples smaller than 30. In fact, this also holds true even if the population is binomial, provided that min(np, n(1-p))> 5, where n is the sample size and p is the probability of success in the population. This means that we can use the normal probability model to quantify uncertainty when making inferences about a population mean based on the sample mean. [source]
  • question: what possible distributions might be considered candidates for the distribution of Mn (the sample maximum) as n -> infinity?
  • degenerate random variable - a distribution assigning all of the probability to a single point -> P(x = c) = 1
  • and we want to prevent the degenerate limit in order to have a distribution -> for the CLT we apply a linear scaling (subtract the population mean from xbar and divide by the standard error, sigma/sqrt(n)) -> that gives us a standard normal
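a quick numpy simulation of that CLT scaling (my own addition; an exponential population with mean 2 and std 2):
  import numpy as np

  rng = np.random.default_rng(0)
  n, reps = 50, 10_000
  samples = rng.exponential(scale=2.0, size=(reps, n))    # skewed population, mean 2, std 2
  z = (samples.mean(axis=1) - 2.0) / (2.0 / np.sqrt(n))   # subtract mean, divide by sigma/sqrt(n)
  print(z.mean(), z.std())                                # close to 0 and 1 (approx standard normal)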

#Gumbel distribution

#Frechet distribution
#Weibull distribution
support -> x element of [0, inf)

#Weibull PDF
when the shape parameter is 1 -> it reduces to the exponential pdf
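a quick scipy check of that (my own addition; the x value and scale are made up):
  from scipy.stats import weibull_min, expon

  print(weibull_min.pdf(1.3, c=1.0, scale=2.0))   # Weibull pdf with shape c = 1
  print(expon.pdf(1.3, scale=2.0))                # matches the exponential pdf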