The Normal Distribution
Symmetrical.
Total area under the curve = 1.
The curve never touches zero (it extends infinitely in both directions).
Described by its mean and standard deviation.
68% of values fall within 1 standard deviation of the mean.
95% fall within 2 standard deviations.
99.7% fall within 3 standard deviations.
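The 68-95-99.7 figures above can be checked directly with the standard normal CDF:

```python
# Area within k standard deviations of the mean, via the standard normal CDF.
from scipy.stats import norm

for k in (1, 2, 3):
    area = norm.cdf(k) - norm.cdf(-k)
    print(f"Within {k} sd: {area:.4f}")
```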
Python Code:
What percent of women are shorter than 154 cm? (assume heights have mean 161 cm, sd 7 cm)
from scipy.stats import norm
norm.cdf(154,161,7)
What percent of women are taller than 154 cm?
1-norm.cdf(154,161,7)
What percent of women are between 154 and 157 cm?
norm.cdf(157,161,7) - norm.cdf(154,161,7)
What height are 90% of women shorter than?
norm.ppf(0.9, 161, 7)
What height are 90% of women taller than?
norm.ppf((1-0.9),161,7)
Generating random numbers
norm.rvs(161,7, size=10)
Distribution of Amir's sales:
# Create a histogram with 10 bins:
amir_deals['amount'].hist(bins=10)
plt.show()
Probabilities from the normal distribution.
norm.cdf(value, loc=mean, scale=std_dev)  # P(X <= value)
norm.ppf(quantile, loc=mean, scale=std_dev)  # value at the given quantile
Simulating sales under new market conditions.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# Amir's current average sale amount and standard deviation
current_mean = 5000
current_sd = 2000
# Increase the average amount by 20% and standard deviation by 30%
increase_percentage_mean = 0.20
increase_percentage_sd = 0.30
new_mean = current_mean * (1 + increase_percentage_mean)
new_sd = current_sd * (1 + increase_percentage_sd)
# Generate 36 simulated amounts from a normal distribution
num_simulations = 36
new_sales = norm.rvs(loc=new_mean, scale=new_sd, size=num_simulations)
# Plot the distribution of new_sales
plt.hist(new_sales, bins=10, edgecolor='k', alpha=0.7)
plt.xlabel('Sale Amount')
plt.ylabel('Frequency')
plt.title('Distribution of New Sales Amounts')
plt.show()
# Print the new mean and standard deviation
print("New Mean:", new_mean)
print("New Standard Deviation:", new_sd)
The Poisson Distribution
Poisson processes
Events appear to happen at a specific rate, but completely at random.
Ex: Number of animals adopted from an animal shelter per week.
Time unit is irrelevant, as long as you use the same unit when talking about the same situation.
The Poisson distribution.
Probability of some number of events occurring over a fixed period of time.
The PMF of a discrete probability distribution, such as the Poisson distribution, gives the probability of a specific discrete random variable taking on a particular value.
The CDF of a probability distribution gives the probability that a random variable takes on a value less than or equal to a specific value.
Python code:
Lambda: average number of events per time interval. (peak value of distribution)
If the average number of adoptions per week is 8, what is P(# adoptions in a week = 5)?
from scipy.stats import poisson
poisson.pmf(5,8)
If the average number of adoptions per week is 8, what is P(# adoptions in a week <= 5)?
poisson.cdf(5,8)
If the average number of adoptions per week is 8, what is P(# adoptions in a week >= 5)?
1-poisson.cdf(5,8)
Sampling from a Poisson Distribution
from scipy.stats import poisson
poisson.rvs(8, size=10)
More Probability Distributions
Exponential Distribution
Probability of time between Poisson events.
Probability of >1 day between adoptions.
Probability of < 10 minutes between restaurant arrivals.
Probability of 6-8 months between earthquakes.
Also uses lambda.
Continuous.
Expected value of exponential distribution:
In terms of rate (Poisson): lambda = 0.5 requests per minute.
In terms of time between events (exponential): 1/lambda = 2 minutes between requests, i.e. 1 request per 2 minutes.
How long until a new request is created?
P(wait < 1min)
from scipy.stats import expon
expon.cdf(1, scale=2)  # scale = 1/lambda = 2
P(wait > 1min)
1 - expon.cdf(1, scale=2)
P(1min < wait < 4min)
expon.cdf(4, scale = 2) - expon.cdf(1, scale = 2)
The t-distribution.
Similar shape to the normal distribution.
The tails are thicker.
Has a parameter called degrees of freedom (df), which affects the thickness of the tails.
Lower df = thicker tails, higher standard deviation.
Higher df = closer to the normal distribution.
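The effect of df on the tails can be seen in the 97.5th percentile: a sketch comparing t critical values to the normal one as df grows.

```python
# Lower df -> fatter tails -> larger critical values; as df grows,
# the t critical value approaches the normal one (about 1.96).
from scipy.stats import norm, t

for df in (2, 10, 30, 1000):
    print(df, round(t.ppf(0.975, df), 3))
print("normal:", round(norm.ppf(0.975), 3))
```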
Log-Normal Distribution
Variable whose logarithm is normally distributed.
Examples:
Lengths of chess games.
Adult blood pressure.
Number of hospitalizations in the 2003 SARS outbreak.
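A minimal sketch of the defining property above: if X is log-normal, then log(X) is normally distributed. The mu and sigma values are illustrative, not from the notes; scipy parameterizes lognorm by s (the sd of the underlying normal) and scale = exp(mu).

```python
# Draw log-normal samples, take the log, and check that the result
# looks normal with the expected mean and sd.
import numpy as np
from scipy.stats import lognorm

mu, sigma = 0.0, 0.5  # illustrative parameters
x = lognorm.rvs(s=sigma, scale=np.exp(mu), size=100_000, random_state=42)
log_x = np.log(x)
print(round(log_x.mean(), 2), round(log_x.std(), 2))  # close to mu, sigma
```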
Sampling Distributions
Sample Size:
How the size of the sample affects the accuracy of point estimates.
Sample size is the number of rows sampled.
Relative error: a metric for comparing a point estimate to the population parameter:
relative error = 100 * abs(population_parameter - point_estimate) / population_parameter
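A hypothetical illustration (the population here is simulated, not from the notes) of how the relative error of the sample mean shrinks as sample size grows:

```python
# Relative error of the sample mean vs. sample size on a simulated population.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2021)
pop = pd.Series(rng.normal(5000, 2000, size=10_000))
pop_mean = pop.mean()

for n in (10, 100, 1000, 10_000):
    sample_mean = pop.sample(n=n, random_state=2021).mean()
    rel_error = 100 * abs(pop_mean - sample_mean) / pop_mean
    print(n, round(rel_error, 3))
```

Sampling the whole population (n = 10,000, without replacement) gives a relative error of zero.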
Geometric distributions.
Allows us to calculate the probability that the first success occurs on trial k, given the probability of success on each trial.
Python Code:
Probability mass function (pmf): gives the probability that the first success occurs on the k-th trial.
from scipy.stats import geom
geom.pmf(k=30, p=0.333)
# p specifies the probability of success on each trial.
Cumulative distribution function (cdf): gives the probability that the first success occurs on or before the k-th trial.
geom.cdf(k=4, p=0.3)
Survival function (sf): gives the probability that the first success occurs after the k-th trial.
geom.sf(k=2, p=0.3)
Percent point function (ppf): inverse of the cdf; gives the number of trials needed to reach a specified cumulative probability.
geom.ppf(q=0.6, p=0.3)
Sample generation (rvs): generate a random sample from a geometric distribution.
import numpy as np
import seaborn as sns
sample = geom.rvs(p=0.3, size=1000, random_state=13)
sns.histplot(sample, bins=np.linspace(0, 20, 21))  # distplot is deprecated in seaborn
plt.show()
The Binomial Distribution (Random Numbers and Probability)
Coin Flipping
Binomial distribution:
Probability distribution of the number of successes in a sequence of independent trials.
Described by n (total number of trials) and p (probability of success).
Expected value = n x p
Independence: the trials must be independent; the outcome of one trial does not affect the others. If trials are not independent, the binomial distribution does not apply.
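A quick simulation check of the expected value formula above: the average of many binomial draws approaches n * p.

```python
# Mean of 100,000 binomial(10, 0.5) draws should be close to 10 * 0.5 = 5.
import numpy as np
from scipy.stats import binom

n, p = 10, 0.5
draws = binom.rvs(n, p, size=100_000, random_state=42)
print(binom.mean(n, p), round(np.mean(draws), 2))
```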
Python:
from scipy.stats import binom
Flipping a single coin:
binom.rvs(1, 0.5, size=8)  # flip 1 coin with a 50% chance of heads, 8 times
What's the prob of exactly 7 heads?
binom.pmf(7, 10, 0.5) # num heads, num trials, prob of heads
What's the prob of <= 7 heads?
binom.cdf(7, 10, 0.5)
What's the prob of > 7 heads?
1-binom.cdf(7, 10, 0.5)
Sampling Methods
Simple random sampling
df.sample(n=5, random_state = 190000113)
Systematic sampling.
sample_size = 5
pop_size = len(df)
interval = pop_size // sample_size
df.iloc[::interval]
newdf = df.reset_index()
Making systematic sampling safe
shuffle = df.sample(frac=1)
shuffle = shuffle.reset_index(drop=True).reset_index()
shuffle.plot(x='index', y='aftertaste',kind = 'scatter')
plt.show()
Stratified sampling:
A technique that allows us to sample a population that contains subgroups.
Use simple random sampling on every subgroup.
df.groupby('subgroup').sample(frac=0.1, random_state=2021)  # 'subgroup' is the stratifying column
Weighted random sampling: create a "weight" column that adjusts each row's relative probability of being sampled.
df.sample(frac = 0.1, weights = "weight")
Cluster sampling.
Use simple random sampling to pick some subgroups.
Use simple random sampling on only those subgroups.
Stage 1: sampling for subgroups
import random
varieties_samp = random.sample(varieties_pop, k=3)
Stage 2: sampling within each subgroup
variety_condition = df['variety'].isin(varieties_samp)
cluster = df[variety_condition]
cluster['variety'] = cluster['variety'].cat.remove_unused_categories()
cluster.groupby("variety").sample(n=5, random_state=2021)
Comparing sampling methods.
Performing t-tests.
Compare sample statistics across groups of a variable.
converted_comp is a numerical variable.
age_first_code_cut is a categorical variable with levels ("child" and "adult")
Calculating p-values From t-statistics.
t-distribution:
The t statistic follows a t-distribution.
Has a parameter called degrees of freedom (df).
Looks like the normal distribution, but with fatter tails.
Larger df => the t-distribution gets closer to the normal distribution.
The normal distribution is a t-distribution with infinite degrees of freedom.
Calculating degrees of freedom:
Dataset has 5 independent observations
Four of the values are 2, 6, 8, and 5.
The sample mean is 5.
That last value must be 4.
Here, there are 4 degrees of freedom
df = n_child + n_adult - 2
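A hedged sketch of the two-sample t-test described above, using scipy's ttest_ind on made-up compensation data for the two groups (values are illustrative, not from the survey):

```python
# Two-sample t-test comparing mean compensation across two groups.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
comp_child = rng.normal(120_000, 30_000, size=100)  # hypothetical values
comp_adult = rng.normal(110_000, 30_000, size=150)  # hypothetical values

t_stat, p_value = ttest_ind(comp_child, comp_adult)
print(round(t_stat, 3), round(p_value, 3))
# degrees of freedom = 100 + 150 - 2 = 248
```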
Paired t-tests
Compare two dependent (paired) groups, e.g. measurements on the same subjects before and after a change.
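A minimal sketch of a paired t-test with scipy's ttest_rel, using hypothetical before/after measurements on the same 30 subjects:

```python
# Paired t-test: each 'after' value is paired with a 'before' value
# from the same subject.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(1)
before = rng.normal(100, 10, size=30)
after = before + rng.normal(2, 5, size=30)  # paired, with a small shift

t_stat, p_value = ttest_rel(before, after)
print(round(t_stat, 3), round(p_value, 3))
```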
ANOVA Tests
A test for differences between three or more groups.
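A hedged sketch of a one-way ANOVA with scipy's f_oneway, testing for mean differences across three hypothetical groups:

```python
# One-way ANOVA: is there a difference in means across the groups?
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(7)
group_a = rng.normal(10, 2, size=50)  # hypothetical data
group_b = rng.normal(11, 2, size=50)
group_c = rng.normal(12, 2, size=50)

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(round(f_stat, 3), round(p_value, 4))
```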
Proportion Tests
Chi-square test of independence.
The chi-square distribution has degrees of freedom and non-centrality parameters.
When these numbers are large, the chi-square distribution can be approximated by a normal distribution.
Chi-square goodness of fit tests.
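Hedged sketches of both chi-square tests named above, on made-up counts:

```python
# Chi-square test of independence on a 2x2 contingency table,
# then a goodness-of-fit test against hypothesized counts.
import numpy as np
from scipy.stats import chi2_contingency, chisquare

# Test of independence: observed counts for two categorical variables.
observed = np.array([[30, 10],
                     [20, 40]])
chi2, p_ind, dof, expected = chi2_contingency(observed)
print("independence:", round(chi2, 3), round(p_ind, 4), dof)

# Goodness of fit: do observed counts match a hypothesized distribution?
obs_counts = [18, 22, 20, 40]
exp_counts = [25, 25, 25, 25]  # counts must sum to the same total
chi2_gof, p_gof = chisquare(obs_counts, exp_counts)
print("goodness of fit:", round(chi2_gof, 3), round(p_gof, 4))
```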