normal-normal conjugate
a normal-normal conjugate method uses a normal pdf to weight the alternative hypotheses for mu - meaning the prior distribution for the unknown mean in the maple syrup problem is a normal distribution
#when using the normal-normal conjugate
use a normal distribution as a prior distribution for the unknown parameter of a normal distribution [assume sigma is known]
collect data [from a normal distribution]
use conjugate shortcut to generate the posterior distribution for the unknown parameter - which will also be a normal distribution
#normal-normal conjugate analytical solution
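a minimal sketch of the closed-form update (my own illustration; the prior and data values below are made up, and sigma is assumed known):

```python
import numpy as np

# normal-normal conjugate update with known data sigma
# prior for mu: Normal(mu_0, tau_0^2); data: x_1..x_n ~ Normal(mu, sigma^2)
def normal_normal_update(mu_0, tau_0, sigma, data):
    n = len(data)
    # posterior precision = prior precision + data precision
    post_var = 1.0 / (1.0 / tau_0**2 + n / sigma**2)
    # posterior mean = precision-weighted average of the prior mean and the data
    post_mean = post_var * (mu_0 / tau_0**2 + np.sum(data) / sigma**2)
    return post_mean, np.sqrt(post_var)

# made-up example: prior belief mu ~ Normal(10, 5^2), three observations
print(normal_normal_update(mu_0=10.0, tau_0=5.0, sigma=2.0, data=[11.2, 9.8, 10.5]))
```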
gamma-Poisson conjugate
refresher on conjugation: there are cases where you can use a particular pdf as a prior distribution, collect data of a specific flavor, and then derive the posterior pdf with a closed-form solution
the prior and posterior distributions belong to the same family of probability density functions, but their parameters may differ
prior is called a conjugate prior - effect of the data can be interpreted in terms of changes in parameter values
when more data comes in -> use the posterior as a prior and use the shortcut to update the old posterior to a new posterior
#to summarize
gamma prior + Poisson data == gamma posterior
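a quick sketch of the shortcut (the alpha/beta and counts below are made up; beta is used as a rate parameter):

```python
# gamma-Poisson conjugate update
# prior for lambda: Gamma(alpha, beta); data: counts x_1..x_n ~ Poisson(lambda)
def gamma_poisson_update(alpha, beta, counts):
    # posterior is Gamma(alpha + sum of counts, beta + number of observations)
    return alpha + sum(counts), beta + len(counts)

alpha_post, beta_post = gamma_poisson_update(alpha=2.0, beta=1.0, counts=[3, 5, 4])
print(alpha_post, beta_post)     # Gamma(14.0, 4.0)
print(alpha_post / beta_post)    # posterior mean of lambda = 3.5
```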
#poisson distribution
expresses the probability of a given number of events occurring in a fixed interval of time or space
the mean and variance of this distribution is lambda
basically this is telling us the probability of observing a random variable equal to k
the possible values that k can assume for the Poisson distribution are not finite; rather, they are countably infinite
as lambda gets larger the distribution starts to look like normal
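the pmf in question, checked numerically with scipy (the lambda value is arbitrary):

```python
from scipy.stats import poisson

lam = 4.0  # arbitrary rate, for illustration only
# P(X = k) = lambda^k * exp(-lambda) / k!   for k = 0, 1, 2, ...
print(poisson.pmf(3, mu=lam))                      # probability of exactly 3 events
print(poisson.mean(mu=lam), poisson.var(mu=lam))   # both equal lambda
```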
#posterior updating
remember that as data comes in, the Bayesian's previous posterior becomes her new prior, so learning is self-consistent
#hyperparameters
in Bayesian statistics, a hyperparameter is a parameter of a prior or posterior distribution
#gamma function
https://www.youtube.com/watch?v=7y-XTrfNvCs
gamma(n) = (n-1)!, where n is a positive integer
gamma(z+1) = z*gamma(z)
gamma(1) = 1
gamma(1/2) = sqrt(PI)
this was computed by relating the integral of gamma with the normal distribution
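a quick numerical check of these identities (a sketch using scipy.special.gamma):

```python
import numpy as np
from scipy.special import gamma

print(gamma(5))                        # (5-1)! = 24
print(gamma(3.5), 2.5 * gamma(2.5))    # gamma(z+1) = z * gamma(z)
print(gamma(1))                        # 1.0
print(gamma(0.5), np.sqrt(np.pi))      # both ~1.7725
```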
#gamma distribution
#bayesian way of parameter estimation
#beta family
a flexible family of distributions - describes a wide range of prior beliefs
p ~ beta(alpha, beta)
the uniform distribution is a special case of the beta family with alpha = beta = 1
if enough data are observed, you will converge to an accurate posterior distribution
#conjugacy
for the conjugate case, an analytical solution exists that makes the posterior update possible - it avoids the integration required in the denominator of Bayes' theorem
for the non-conjugate case, there is usually no simple closed-form expression, and one must resort to computation
flat priors are not necessarily non-informative, and non-informative priors are not necessarily flat
#but what does that really mean?
the beta distribution is a conjugate distribution that can be updated with binomial data - beta-binomial conjugate
the proof for the above can be found in Appendix 1 of Intro to Bayes book
a conjugate prior is an algebraic convenience, giving a closed-form expression for the posterior - otherwise a difficult integration may be necessary
the beta distribution is a suitable model for the random behavior of percentages and proportions - it can be used as a conjugate prior pdf for the Bernoulli, binomial, negative binomial and geometric distributions
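a minimal sketch of the beta-binomial update (the prior and the data below are made up):

```python
from scipy.stats import beta as beta_dist

# prior: p ~ Beta(a, b); data: k successes in n trials
a, b = 1.0, 1.0       # uniform prior (alpha = beta = 1, as noted above)
k, n = 7, 10          # hypothetical data
a_post, b_post = a + k, b + (n - k)    # posterior is Beta(a + k, b + n - k)
print(a_post, b_post)                  # Beta(8.0, 4.0)
print(beta_dist.mean(a_post, b_post))  # posterior mean = 8/12 ~ 0.667
```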
#how to present posterior results
display the posterior distribution
use metrics [mean, median, mode, quantiles]
#bayes formula
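P(θ | data) = P(data | θ) · P(θ) / P(data)
the denominator P(data) = ∫ P(data | θ) · P(θ) dθ is the normalizing constant (the marginal likelihood of the data)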
above, both the prior and posterior distributions are pdfs - P is a probability density
to compute the likelihood, we first need to make an assumption about how the data was generated - this can be a probability distribution (normal, exponential etc.)
likelihood == the probability of observing the data, given the hypothesis
the area under the likelihood curve is not equal to 1.0 - how is that possible??
#difference between likelihood and probability
"because we generally do not entertain the full set of alternative hypotheses and because some are nested within others, the likelihoods that we attach to our hypotheses do not have any meaning in and of themselves; only the relative likelihoods -- that is, the ratios of two likelihoods --- have meaning."
probabilities attach to results (which are exclusive and exhaustive)
likelihoods attach to hypotheses
#solving for likelihood - example
Suppose we hypothesize that μ = 5.0, and assume that σ is known to be 0.5. And further suppose that we draw a random bacterium that lives x = 4.5 hours. We can ask, “What is the likelihood that x = 4.5 given that μ is 5.0 and σ = 0.5?” We will use the normal pdf to answer this question
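a one-line sketch of that calculation with scipy:

```python
from scipy.stats import norm

# likelihood of the hypothesis mu = 5.0 (sigma known to be 0.5) given x = 4.5
likelihood = norm.pdf(4.5, loc=5.0, scale=0.5)
print(likelihood)   # ~0.484 - a density, not a probability
```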
#prior elicitation
#comparison between frequentist and bayesian inferences
the frequentist approach is highly sensitive to the choice of null hypothesis, whereas with the Bayesian method this is typically not the case
#p-value
definition: the probability of observing something at least as extreme as the data, given that the null hypothesis is true (more extreme in the direction of the alternative hypothesis)
For example, a p-value of 0.0254 means that, if the null hypothesis were true, there would be a 2.54% chance of observing data at least as extreme as what was observed. A large p-value of 0.9 (90%) means such data would be unsurprising under the null hypothesis. The smaller the p-value, the stronger the evidence against the null - note that a p-value is not the probability that the results are due to chance.
a p-value is needed to make an inference decision with the frequentist approach
#credible interval - the Bayesian alternative for confidence interval
a range for which the Bayesian thinks that the probability of including the true value is, say, 0.95
thus, a Bayesian can say that there is a 95% chance that the credible interval contains the true parameter value
#solved problem on how to calculate confidence intervals
https://www.statisticshowto.com/probability-and-statistics/confidence-interval/
#debunk common misconception about confidence interval
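the misconception: a 95% confidence interval does not mean there is a 95% probability that this particular interval contains the true parameter - it means that 95% of intervals constructed this way, over repeated samples, would contain the true value; the probability statement is about the procedure, not about any single interval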
#frameworks under which we can define probabilities
classical: outcomes are equally likely --> they have equal probabilities (e.g., a fair die: P(X = 4) = 1/6)
frequentist: have a hypothetical infinite sequence of events -> look at the relative frequency of the events - how frequent did we observe the outcome that interests us out of the total
Bayesian: personal perspective - your measure of uncertainty - based on a synthesis of evidence and personal judgement regarding that evidence
#Bayesian approach
true positive rate | sensitivity | recall | probability of detection
P(ELISA is positive | person tested has HIV)
true negative rate | specificity
P(ELISA is negative | person tested does not have HIV)
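a sketch of how these two rates combine with a prior (the prevalence) via Bayes' theorem - all numbers below are hypothetical, not actual ELISA figures:

```python
# hypothetical numbers - not real ELISA performance figures
prevalence  = 0.002   # P(HIV)
sensitivity = 0.95    # P(positive | HIV)
specificity = 0.99    # P(negative | no HIV)

# P(positive) by the law of total probability
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
# P(HIV | positive) by Bayes' theorem
p_hiv_given_positive = sensitivity * prevalence / p_positive
print(p_hiv_given_positive)   # small despite the positive test, because the prior is small
```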
#review of probability distributions
Bernoulli - only two possible outcomes
basically, a binomial with just one trial (n = 1)
X ~ B(p) -> P(X = 1) = p; P(X = 0) = 1 - p
f(X = x | p) = f(x | p) = p^x (1-p)^(1-x)
E(X) = p
Var(X) = p(1-p)
Binomial - generalization of Bernoulli when we have n repeated trials
key assumption: trials are independent
X ~ Bin(n,p)
E[X] = np
Var(X) = np(1-p)
#poisson distribution
it has an unusual property: both the mean and the variance equal lambda
#expected value of a random variable
#test statistics
15 (std) -> twice the amount we expect on average
the t distribution has thicker tails than the normal, because we are estimating the population std and estimation adds more uncertainty -> as we get more data the t distribution approaches the z (standard normal) distribution
large test statistics and small p-values correspond to samples that are extreme under the null hypothesis
#bayesian statistics
frequentist
Bayesian
take what you currently know about the population -> use that to estimate what the population is like (prior) -> the new data will adjust our belief about what the population is like
posterior - instead of saying I will hit 4 stop lights, I will say it's 3 because of the new data - an aggregation of what we believed before and the data we got now
#randomForest
#PDF value greater than 1
even if the pdf takes on values greater than 1, if the interval it integrates over is narrow enough, the total area can still equal exactly 1
for a continuous random variable, we take an integral of a PDF over a certain interval to find its probability that X will fall in that interval
what does a probability density at point x mean?
it means how much probability is concentrated per unit length (dx) near x, or how dense the probability is near x
for a normal distribution, if we know the mean and std, we know the entire distribution of probabilities
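a concrete sketch: a uniform density on [0, 0.5] has height 2 everywhere (greater than 1), yet its total area is still 1:

```python
from scipy.stats import uniform
from scipy.integrate import quad

dist = uniform(loc=0, scale=0.5)   # uniform on [0, 0.5]
print(dist.pdf(0.25))              # 2.0 - the density exceeds 1
area, _ = quad(dist.pdf, 0, 0.5)
print(area)                        # 1.0 - but the total probability is still 1
```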
#standard error
standard error quantifies the variation in the means from multiple sets of measurements
bounds on the estimates of a population variable
to present the skill of a predictive model
they seek to quantify the uncertainty in a population parameter such as mean or std
95% CI is a range of values calculated from our data that most likely includes the true value of what we are estimating
smaller confidence interval == more precise estimate
also help to facilitate trade-offs between models - matching CIs indicates equivalence between the models and might provide a reason to favor the less complex or more interpretable model
#to calculate margin of error
Critical Value * Standard deviation of the population
Critical Value * Standard error of the sample
critical value is either a t-score or z-score
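a sketch of the second version (critical value * standard error) for a 95% interval on a sample mean - the data values below are made up:

```python
import numpy as np
from scipy import stats

sample = np.array([4.9, 5.3, 5.1, 4.7, 5.0, 5.4])   # made-up data
se = stats.sem(sample)                                # standard error of the sample mean
t_crit = stats.t.ppf(0.975, df=len(sample) - 1)       # t critical value for 95% confidence
margin_of_error = t_crit * se
print(sample.mean() - margin_of_error, sample.mean() + margin_of_error)
```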
#prediction interval
quantify and communicate the uncertainty in a prediction
describe the uncertainty for a single specific outcome
uncertainty comes from
model
noise in the input data
larger than a confidence interval - as it takes into account the confidence interval and the variance in the output variable
it's computed as some combination of the estimated variance of the model and the variance of the outcome variable
make assumptions - the distributions of x and y, and the errors made by the model (residuals), are Gaussian
PI (prediction interval) = yhat +/- z*sigma
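a minimal sketch of that formula, assuming Gaussian residuals and a 95% interval (yhat and sigma below are placeholders):

```python
from scipy.stats import norm

def prediction_interval(yhat, sigma, confidence=0.95):
    # z critical value for the requested two-sided confidence level
    z = norm.ppf(0.5 + confidence / 2.0)
    return yhat - z * sigma, yhat + z * sigma

# placeholder values: predicted value 10.0, estimated residual std 1.5
print(prediction_interval(yhat=10.0, sigma=1.5))   # roughly (7.06, 12.94)
```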
#Assumptions of multiple linear regression
* residuals are normally distributed
* no multicollinearity
* homoscedasticity - variance of error terms are similar across the values of the independent variables
#what are degrees of freedom (DOF)
they indicate the number of independent values that can vary in an analysis without breaking any constraints
the number of values that are free to vary as you estimate parameters
typically DOF = sample size - number of parameters we need to estimate
DOF == the number of observations in a sample that are free to vary while estimating statistical parameters
#how are the outliers in box plot determined
by the 1.5 × IQR rule (Tukey's fences): points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are flagged as outliers