The Normal Distribution
Symmetrical.
Total area under the curve = 1.
The curve never touches zero (it extends infinitely in both directions).
Described by its mean and standard deviation.
68% of values fall within 1 standard deviation of the mean.
95% fall within 2 standard deviations.
99.7% fall within 3 standard deviations.
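The 68-95-99.7 figures above can be checked directly with the standard normal CDF:

```python
# Area within k standard deviations of the mean, via the standard normal CDF.
from scipy.stats import norm

for k in (1, 2, 3):
    area = norm.cdf(k) - norm.cdf(-k)
    print(f"Within {k} sd: {area:.4f}")
```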
Python Code:
What percent of women are shorter than 154 cm? (assume heights have mean 161 cm, sd 7 cm)
from scipy.stats import norm
norm.cdf(154,161,7)
What percent of women are taller than 154 cm?
1-norm.cdf(154,161,7)
What percent of women are between 154 and 157 cm?
norm.cdf(157,161,7) - norm.cdf(154,161,7)
What height are 90% of women shorter than?
norm.ppf(0.9, 161, 7)
What height are 90% of women taller than?
norm.ppf((1-0.9),161,7)
Generating random numbers
norm.rvs(161,7, size=10)
Distribution of Amir's sales:
# Create a histogram with 10 bins:
amir_deals['amount'].hist(bins=10)
plt.show()
Probabilities from the normal distribution.
norm.cdf(value, loc=mean, scale=std_dev)  # P(X <= value)
norm.ppf(quantile, loc=mean, scale=std_dev)  # value at the given quantile
Simulating sales under new market conditions.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# Amir's current average sale amount and standard deviation
current_mean = 5000
current_sd = 2000
# Increase the average amount by 20% and standard deviation by 30%
increase_percentage_mean = 0.20
increase_percentage_sd = 0.30
new_mean = current_mean * (1 + increase_percentage_mean)
new_sd = current_sd * (1 + increase_percentage_sd)
# Generate 36 simulated amounts from a normal distribution
num_simulations = 36
new_sales = norm.rvs(loc=new_mean, scale=new_sd, size=num_simulations)
# Plot the distribution of new_sales
plt.hist(new_sales, bins=10, edgecolor='k', alpha=0.7)
plt.xlabel('Sale Amount')
plt.ylabel('Frequency')
plt.title('Distribution of New Sales Amounts')
plt.show()
# Print the new mean and standard deviation
print("New Mean:", new_mean)
print("New Standard Deviation:", new_sd)
The Poisson Distribution
Poisson processes
Events appear to happen at a specific rate, but completely at random.
Ex: Number of animals adopted from an animal shelter per week.
Time unit is irrelevant, as long as you use the same unit when talking about the same situation.
The Poisson distribution.
Probability of some number of events occurring over a fixed period of time.
The PMF of a discrete probability distribution, such as the Poisson distribution, gives the probability of a specific discrete random variable taking on a particular value.
The CDF of a probability distribution gives the probability that a random variable takes on a value less than or equal to a specific value.
Python code:
Lambda: average number of events per time interval. (peak value of distribution)
If the average number of adoptions per week is 8, what is P(# adoptions in a week = 5)?
from scipy.stats import poisson
poisson.pmf(5,8)
If the average number of adoptions per week is 8, what is P(# adoptions in a week <= 5)?
poisson.cdf(5,8)
If the average number of adoptions per week is 8, what is P(# adoptions in a week >= 5)?
1-poisson.cdf(5,8)
Sampling from a Poisson Distribution
from scipy.stats import poisson
poisson.rvs(8, size=10)
More Probability Distributions
Exponential Distribution
Probability of time between Poisson events.
Probability of >1 day between adoptions.
Probability of < 10 minutes between restaurant arrivals.
Probability of 6-8 months between earthquakes.
Also uses lambda.
Continuous.
Expected value of exponential distribution:
In terms of rate (Poisson): lambda = 0.5 requests per minute.
In terms of time between events (exponential): 1/lambda = 2 minutes between requests, i.e. 1 request per 2 minutes.
How long until a new request is created?
P(wait < 1min)
from scipy.stats import expon
expon.cdf(1, scale=2)  # scale = 1/lambda = 2
P(wait > 1min)
1 - expon.cdf(1, scale=2)
P(1min < wait < 4min)
expon.cdf(4, scale = 2) - expon.cdf(1, scale = 2)
The t-distribution.
Similar shape to the normal distribution.
The tails are thicker.
Has a parameter called degrees of freedom (df), which affects the thickness of the tails.
Lower df = thicker tails, higher standard deviation.
Higher df = closer to the normal distribution.
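The effect of df on the tails can be seen in the 97.5th percentile: a sketch comparing t critical values to the normal one as df grows.

```python
# Lower df -> fatter tails -> larger critical values; as df grows,
# the t critical value approaches the normal one (about 1.96).
from scipy.stats import norm, t

for df in (2, 10, 30, 1000):
    print(df, round(t.ppf(0.975, df), 3))
print("normal:", round(norm.ppf(0.975), 3))
```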
Log-Normal Distribution
Variable whose logarithm is normally distributed.
Examples:
Lengths of chess games.
Adult blood pressure.
Number of hospitalizations in the 2003 SARS outbreak.
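A minimal sketch of the defining property above: if X is log-normal, then log(X) is normally distributed. The mu and sigma values are illustrative, not from the notes; scipy parameterizes lognorm by s (the sd of the underlying normal) and scale = exp(mu).

```python
# Draw log-normal samples, take the log, and check that the result
# looks normal with the expected mean and sd.
import numpy as np
from scipy.stats import lognorm

mu, sigma = 0.0, 0.5  # illustrative parameters
x = lognorm.rvs(s=sigma, scale=np.exp(mu), size=100_000, random_state=42)
log_x = np.log(x)
print(round(log_x.mean(), 2), round(log_x.std(), 2))  # close to mu, sigma
```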
Sampling Distributions
Sample Size:
How the size of the sample affects the accuracy of point estimates.
Sample size is the number of rows sampled.
Relative error: a metric for comparing a point estimate to the population parameter:
relative error = 100 * abs(population_parameter - point_estimate) / population_parameter
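A hypothetical illustration (the population here is simulated, not from the notes) of how the relative error of the sample mean shrinks as sample size grows:

```python
# Relative error of the sample mean vs. sample size on a simulated population.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2021)
pop = pd.Series(rng.normal(5000, 2000, size=10_000))
pop_mean = pop.mean()

for n in (10, 100, 1000, 10_000):
    sample_mean = pop.sample(n=n, random_state=2021).mean()
    rel_error = 100 * abs(pop_mean - sample_mean) / pop_mean
    print(n, round(rel_error, 3))
```

Sampling the whole population (n = 10,000, without replacement) gives a relative error of zero.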
Geometric distributions.
Allows us to calculate the probability that the first success occurs on trial k, given the probability of success on each trial.
Python Code:
Probability mass function (pmf): gives the probability that the first success occurs on the k-th trial.
from scipy.stats import geom
geom.pmf(k=30, p=0.333)
# p specifies the probability of success on each trial.
Cumulative distribution function (cdf): gives the probability that the first success occurs on or before the k-th trial.
geom.cdf(k=4, p=0.3)
Survival function (sf): gives the probability that the first success occurs after the k-th trial.
geom.sf(k=2, p=0.3)
Percent point function (ppf): inverse of the cdf; gives the number of trials needed to reach a specified cumulative probability.
geom.ppf(q=0.6, p=0.3)
Sample generation (rvs): generate a random sample from a geometric distribution.
import numpy as np
import seaborn as sns
sample = geom.rvs(p=0.3, size=1000, random_state=13)
sns.histplot(sample, bins=np.linspace(0, 20, 21))  # distplot is deprecated in seaborn
plt.show()
The Binomial Distribution (Random Numbers and Probability)
Coin Flipping
Binomial distribution:
Probability distribution of the number of successes in a sequence of independent trials.
Described by n (total number of trials) and p (probability of success).
Expected value = n x p
Independence: the trials must be independent; the outcome of one trial does not affect the others. If trials are not independent, the binomial distribution does not apply.
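A quick simulation check of the expected value formula above: the average of many binomial draws approaches n * p.

```python
# Mean of 100,000 binomial(10, 0.5) draws should be close to 10 * 0.5 = 5.
import numpy as np
from scipy.stats import binom

n, p = 10, 0.5
draws = binom.rvs(n, p, size=100_000, random_state=42)
print(binom.mean(n, p), round(np.mean(draws), 2))
```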
Python:
from scipy.stats import binom
Flipping a single coin:
binom.rvs(1, 0.5, size=8)  # flip 1 coin with a 50% chance of heads, 8 times
What's the prob of exactly 7 heads?
binom.pmf(7, 10, 0.5) # num heads, num trials, prob of heads
What's the prob of <= 7 heads?
binom.cdf(7, 10, 0.5)
What's the prob of > 7 heads?
1-binom.cdf(7, 10, 0.5)
Sampling Methods
Simple random sampling
df.sample(n=5, random_state = 190000113)
Systematic sampling.
sample_size = 5
pop_size = len(df)
interval = pop_size // sample_size
df.iloc[::interval]
newdf = df.reset_index()
Making systematic sampling safe
shuffle = df.sample(frac=1)
shuffle = shuffle.reset_index(drop=True).reset_index()
shuffle.plot(x='index', y='aftertaste',kind = 'scatter')
plt.show()
Stratified sampling:
A technique that allows us to sample a population that contains subgroups.
Use simple random sampling on every subgroup.
df.groupby('subgroup').sample(frac=0.1, random_state=2021)  # 'subgroup' is the stratifying column
Weighted random sampling: create a "weight" column that adjusts each row's relative probability of being sampled.
df.sample(frac = 0.1, weights = "weight")
Cluster sampling.
Use simple random sampling to pick some subgroups.
Use simple random sampling on only those subgroups.
Stage 1: sampling for subgroups
import random
varieties_samp = random.sample(varieties_pop, k=3)
Stage 2: sampling within each subgroup
variety_condition = df['variety'].isin(varieties_samp)
cluster = df[variety_condition]
cluster['variety'] = cluster['variety'].cat.remove_unused_categories()
cluster.groupby("variety").sample(n=5, random_state=2021)
Comparing sampling methods.
Performing t-tests.
Compare sample statistics across groups of a variable.
converted_comp is a numerical variable.
age_first_code_cut is a categorical variable with levels ("child" and "adult")
Calculating p-values From t-statistics.
t-distribution:
The t statistic follows a t-distribution.
Has a parameter called degrees of freedom (df).
Looks like the normal distribution, but with fatter tails.
Larger df => the t-distribution gets closer to the normal distribution.
The normal distribution is a t-distribution with infinite degrees of freedom.
Calculating degrees of freedom:
Dataset has 5 independent observations
Four of the values are 2, 6, 8, and 5.
The sample mean is 5.
That last value must be 4.
Here, there are 4 degrees of freedom
df = n_child + n_adult - 2
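A hedged sketch of the two-sample t-test described above, using scipy's ttest_ind on made-up compensation data for the two groups (values are illustrative, not from the survey):

```python
# Two-sample t-test comparing mean compensation across two groups.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
comp_child = rng.normal(120_000, 30_000, size=100)  # hypothetical values
comp_adult = rng.normal(110_000, 30_000, size=150)  # hypothetical values

t_stat, p_value = ttest_ind(comp_child, comp_adult)
print(round(t_stat, 3), round(p_value, 3))
# degrees of freedom = 100 + 150 - 2 = 248
```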
Paired t-tests
Compare two dependent (paired) groups, e.g. measurements on the same subjects before and after a change.
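A minimal sketch of a paired t-test with scipy's ttest_rel, using hypothetical before/after measurements on the same 30 subjects:

```python
# Paired t-test: each 'after' value is paired with a 'before' value
# from the same subject.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(1)
before = rng.normal(100, 10, size=30)
after = before + rng.normal(2, 5, size=30)  # paired, with a small shift

t_stat, p_value = ttest_rel(before, after)
print(round(t_stat, 3), round(p_value, 3))
```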
ANOVA Tests
A test for differences between three or more groups.
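A hedged sketch of a one-way ANOVA with scipy's f_oneway, testing for mean differences across three hypothetical groups:

```python
# One-way ANOVA: is there a difference in means across the groups?
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(7)
group_a = rng.normal(10, 2, size=50)  # hypothetical data
group_b = rng.normal(11, 2, size=50)
group_c = rng.normal(12, 2, size=50)

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(round(f_stat, 3), round(p_value, 4))
```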
Proportion Tests
Chi-square test of independence.
The chi-square distribution has degrees of freedom and non-centrality parameters.
When these numbers are large, the chi-square distribution can be approximated by a normal distribution.
Chi-square goodness of fit tests.
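Hedged sketches of both chi-square tests named above, on made-up counts:

```python
# Chi-square test of independence on a 2x2 contingency table,
# then a goodness-of-fit test against hypothesized counts.
import numpy as np
from scipy.stats import chi2_contingency, chisquare

# Test of independence: observed counts for two categorical variables.
observed = np.array([[30, 10],
                     [20, 40]])
chi2, p_ind, dof, expected = chi2_contingency(observed)
print("independence:", round(chi2, 3), round(p_ind, 4), dof)

# Goodness of fit: do observed counts match a hypothesized distribution?
obs_counts = [18, 22, 20, 40]
exp_counts = [25, 25, 25, 25]  # counts must sum to the same total
chi2_gof, p_gof = chisquare(obs_counts, exp_counts)
print("goodness of fit:", round(chi2_gof, 3), round(p_gof, 4))
```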