Definitions
Key Statistics Terms, Definitions and Explanations
What is a Bernoulli distribution and when should I use it?
- Definition: A Bernoulli distribution models a random variable with two possible outcomes: success (1) and failure (0).
- Probability:
- P(X = 1) = p (probability of success)
- P(X = 0) = 1 - p (probability of failure)
- Characteristics:
- Mean: E(X) = p
- Variance: Var(X) = p(1 - p)
- When to use: Use when there are only two outcomes (success/failure), such as:
- Coin toss (heads/tails)
- Yes/No surveys
- Defective/non-defective items
- Example:
- Coin toss: p = 0.5 for heads, 1 - p = 0.5 for tails.
- Quality control: p = 0.02 for a defective product.
The Bernoulli distribution is ideal for modeling simple binary outcomes.
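As a minimal sketch, a Bernoulli trial can be simulated with Python's standard library and checked against the mean and variance formulas above (the `p = 0.5` coin-toss value is taken from the example; the function name is our own):

```python
import random

def bernoulli_trial(p):
    """Return 1 (success) with probability p, else 0 (failure)."""
    return 1 if random.random() < p else 0

random.seed(0)  # seeded only for reproducibility of this sketch

# Simulate many trials and compare the empirical mean/variance
# with the theoretical values E(X) = p and Var(X) = p(1 - p).
p = 0.5  # fair coin toss from the example above
trials = [bernoulli_trial(p) for _ in range(100_000)]
mean = sum(trials) / len(trials)
var = sum((x - mean) ** 2 for x in trials) / len(trials)
print(f"empirical mean ~ {mean:.3f} (theory: {p})")
print(f"empirical variance ~ {var:.3f} (theory: {p * (1 - p):.3f})")
```

With 100,000 trials the empirical values land very close to p = 0.5 and p(1 - p) = 0.25.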
What is a Binomial distribution and when should I use it?
- Definition: A Binomial distribution models the number of successes in a fixed number of independent trials of a binary experiment, with the same probability of success in each trial.
- Probability Mass Function (PMF):
- P(X = r) = C(n, r) * p^r * (1 - p)^(n - r)
- Where:
- n = number of trials
- r = number of successes
- p = probability of success
- C(n, r) = binomial coefficient = n! / (r!(n - r)!)
- Characteristics:
- Mean: E(X) = n * p
- Variance: Var(X) = n * p * (1 - p)
- When to use: Use when:
- There is a fixed number of trials (n)
- Each trial has two possible outcomes (success/failure)
- The probability of success is constant (p)
- The trials are independent
- Examples:
- Coin tosses: Modeling the number of heads in 10 flips of a fair coin
- Quality control: Modeling the number of defective items in a sample of 50 products
- Survey responses: Modeling the number of positive responses in a sample of 200 people
The Binomial distribution is ideal for modeling experiments with repeated trials that have two outcomes.
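The PMF above translates directly into code; this sketch uses `math.comb` for the binomial coefficient and checks the mean formula E(X) = n * p for the 10-coin-flip example:

```python
from math import comb

def binomial_pmf(n, r, p):
    """P(X = r) = C(n, r) * p^r * (1 - p)^(n - r)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# Number of heads in 10 flips of a fair coin (example above):
prob_5_heads = binomial_pmf(10, 5, 0.5)
print(prob_5_heads)  # 252 / 1024 ≈ 0.246

# Sanity check: the mean of the distribution should equal n * p = 5.
n, p = 10, 0.5
mean = sum(r * binomial_pmf(n, r, p) for r in range(n + 1))
print(mean)
```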
Comparison: Bernoulli vs Binomial Distribution
Key Differences
| Aspect | Bernoulli Distribution | Binomial Distribution |
| --- | --- | --- |
| Number of Trials | 1 trial | Fixed number of trials (n) |
| Outcomes | Two outcomes (success/failure) | Two outcomes in each trial (success/failure), modeled over n trials |
| Probability | Single trial probability (p) | Same probability of success (p) in each of n trials |
| Random Variable | Single success/failure (0 or 1) | Number of successes (r) in n trials |
| Mean | p | n * p |
| Variance | p(1 - p) | n * p * (1 - p) |
| When to Use | Single binary outcome (success/failure) | Multiple trials with binary outcomes (successes in n trials) |
| Examples | Coin toss (1 toss), yes/no survey (single respondent) | Coin tosses (multiple tosses), quality control (multiple items tested) |
Summary
The Bernoulli distribution models a single trial with two possible outcomes, whereas the Binomial distribution models the number of successes in multiple independent Bernoulli trials. While the Bernoulli distribution focuses on one trial, the Binomial distribution is used when there are several trials, all with the same probability of success.
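The relationship in the summary can be made concrete: a Binomial(n, p) draw is just the sum of n independent Bernoulli(p) trials. A minimal sketch (helper names are our own):

```python
import random

def bernoulli(p):
    """One Bernoulli(p) trial: 1 with probability p, else 0."""
    return 1 if random.random() < p else 0

def binomial_draw(n, p):
    """A Binomial(n, p) draw is the sum of n Bernoulli(p) trials."""
    return sum(bernoulli(p) for _ in range(n))

random.seed(0)
single = bernoulli(0.5)             # a single 0/1 outcome
successes = binomial_draw(10, 0.5)  # an integer between 0 and 10
print(single, successes)
```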
What is a Continuous random variable and when should I use it?
A continuous random variable is a random variable that can take an infinite number of values within a given range. These variables are often associated with measurements like time, height, temperature, or distance. Unlike discrete random variables, which can only take specific values, continuous random variables have a continuum of possible outcomes.
Key Characteristics
- Infinite Possible Outcomes: Continuous random variables can take any value within a specified range, such as time, weight, or temperature.
- Probability Density Function (PDF): The probability distribution of a continuous random variable is described using a PDF, where the probability of any exact value is zero, but the probability of a range of values is determined by the area under the curve.
- Cumulative Distribution Function (CDF): The CDF gives the probability that the random variable is less than or equal to a certain value.
Probability for Continuous Random Variables
Why is the probability for specific values zero and why can we calculate the probability within an interval?
- Specific Value Probability: For continuous random variables, the probability of a specific value is zero because there are infinitely many possible values. Mathematically, the area under the probability density function (PDF) at a single point is zero.
- Interval Probability: Although the probability for a specific value is zero, the probability within a range (interval) is non-zero. This probability is computed by finding the area under the PDF curve between two points.
- Integral of PDF: The probability of a random variable falling between two values is given by the integral of the PDF over that interval. The area under the curve between two points gives the probability for the interval.
- Example: For a continuous random variable like height, the probability of exactly 180 cm is zero, but the probability that the height is between 179 cm and 181 cm can be calculated and will be non-zero.
In summary, while individual values for continuous random variables have zero probability, the probability of falling within a range (interval) is meaningful and can be calculated by integrating the PDF over that interval.
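The interval idea can be sketched numerically: approximate the area under a PDF between two points with a Riemann sum. Using the Uniform(0, 1) PDF (f(x) = 1 on [0, 1]) keeps the answer easy to verify by hand; the helper name is our own:

```python
def interval_probability(pdf, a, b, steps=100_000):
    """Approximate P(a <= X <= b) as the area under the PDF (midpoint Riemann sum)."""
    width = (b - a) / steps
    return sum(pdf(a + (i + 0.5) * width) for i in range(steps)) * width

# Uniform(0, 1) PDF: f(x) = 1 everywhere on [0, 1].
uniform_pdf = lambda x: 1.0

# P(X = 0.3) is zero, but P(0.2 <= X <= 0.5) is the area under the curve:
prob = interval_probability(uniform_pdf, 0.2, 0.5)
print(prob)  # ≈ 0.3
```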
When to Use a Continuous Random Variable
- Range of Values: Use when the random variable can take any value within a certain range.
- Measurement-based Phenomena: When dealing with measured quantities, such as height, weight, or time.
- Precision in Data: When the data can be measured with high precision (e.g., decimals).
Examples
- Temperature: Temperature is continuous because it can take any value within a given range (e.g., 23.5°C, 23.55°C, etc.).
- Time Taken to Complete a Task: Time is continuous as it can be measured to any level of precision (e.g., seconds, milliseconds).
- Height of Individuals: Height is continuous because it can take any value within a specified range (e.g., between 140 cm and 200 cm).
Conclusion
Continuous random variables are used for measurements where the data can take any value within a given range. These are modeled with probability density functions (PDFs) and are applicable in various fields like physics, engineering, and natural sciences.
Normal Distribution
A normal distribution is a probability distribution that is symmetric about the mean, with a bell-shaped curve. It is widely used in statistics because many natural phenomena follow this distribution.
Key Characteristics
- Symmetry: The distribution is symmetrical around the mean, and the left side is a mirror image of the right side.
- Mean, Median, Mode: In a normal distribution, the mean, median, and mode are all equal and located at the center.
- Bell-Shaped Curve: The graph of a normal distribution is bell-shaped, with the highest point at the mean.
- Standard Deviation: The spread of the distribution is determined by the standard deviation (σ). A smaller σ means a narrower distribution, and a larger σ means a wider distribution.
- 68-95-99.7 Rule:
- 68% of data falls within 1 standard deviation of the mean.
- 95% falls within 2 standard deviations.
- 99.7% falls within 3 standard deviations.
- Total Area: The area under the curve represents the total probability, which equals 1 (100%).
When to Use a Normal Distribution
- Symmetric Data: Use when the data is symmetric and centered around a mean value.
- Large Sample Size: The central limit theorem ensures that with large samples, the distribution of sample means is approximately normal.
- Natural Phenomena: Many natural phenomena, like heights and IQ scores, follow a normal distribution.
- Statistical Analysis: Many statistical techniques assume normality, such as t-tests and regression analysis.
Examples
- Human Heights: Heights typically follow a normal distribution, with most people near the average height.
- Exam Scores: Exam scores in a large class often follow a normal distribution, with most students scoring around the average.
- Measurement Errors: Measurement errors in experiments often follow a normal distribution, with small errors being more common.
Conclusion
Normal distribution is essential for modeling natural and social phenomena. If data is symmetric and bell-shaped, it likely follows a normal distribution, and many statistical methods rely on this assumption.
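The 68-95-99.7 rule for the normal distribution can be checked numerically with the standard normal CDF, which Python's `math.erf` gives directly (the `normal_cdf` helper is our own):

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for a normal distribution with mean mu and std dev sigma."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# Probability of falling within k standard deviations of the mean:
within = {k: normal_cdf(k) - normal_cdf(-k) for k in (1, 2, 3)}
for k, prob in within.items():
    print(f"within {k} sigma: {prob:.4f}")
# 1 sigma ≈ 0.6827, 2 sigma ≈ 0.9545, 3 sigma ≈ 0.9973
```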
Normal Distribution and Its Uses
A normal distribution is often used to model random variables when the data follows a symmetric, bell-shaped curve. It is suitable in the following contexts:
Contexts to Use Normal Distribution
- Natural Phenomena: Many biological and physical traits, such as human heights, body temperatures, and blood pressure, follow a normal distribution.
- Measurement Errors: In scientific experiments, measurement errors are often normally distributed, with smaller errors more common than large ones.
- Test Scores: Large test score distributions, like in IQ tests or standardized exams, tend to follow a normal distribution.
- Financial Models: Stock returns and investment portfolio returns are often modeled with a normal distribution, particularly when factors are numerous and independent.
- Statistical Models: In regression analysis, the residuals or errors are often assumed to follow a normal distribution to facilitate hypothesis testing.
- Social Sciences: Psychological traits such as IQ scores and personality traits typically follow a normal distribution.
- Quality Control: In manufacturing, product sizes or weights often follow a normal distribution, with most products near the target specification.
- Central Limit Theorem: When sampling from any population, the sampling distribution of the mean will tend to follow a normal distribution, especially with large sample sizes.
Conclusion
Normal distributions are used to model random variables in cases where the data is symmetric, influenced by many small, independent factors, and exhibits few extreme values. It is commonly applied in natural, social, and financial sciences, as well as in statistical analysis.
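The Central Limit Theorem point can be illustrated with a quick simulation: individual draws from a skewed exponential distribution are far from normal, but means of samples of 100 draws cluster symmetrically around the population mean. A sketch under those assumptions:

```python
import random

random.seed(42)  # seeded only so the sketch is reproducible

# Population: exponential with mean 1 (clearly right-skewed, not normal).
def sample_mean(n):
    return sum(random.expovariate(1.0) for _ in range(n)) / n

# Distribution of the sample mean for n = 100, over 2,000 repeated samples:
means = [sample_mean(100) for _ in range(2_000)]
grand_mean = sum(means) / len(means)
below = sum(m < 1.0 for m in means) / len(means)

print(f"mean of sample means ~ {grand_mean:.3f} (population mean: 1.0)")
print(f"fraction of sample means below 1.0 ~ {below:.2f}")  # roughly half
```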
Random Sample
A random sample is a subset of individuals or observations selected from a larger population, where each member has an equal chance of being included. This method helps ensure that the sample is representative of the population.
Key Characteristics of a Random Sample
- Equal Chance: Every member of the population has an equal probability of being selected.
- Unbiased: It eliminates selection bias and provides reliable results.
- Representative: The sample should reflect the characteristics of the larger population.
Types of Random Sampling
- Simple Random Sampling: Every individual is selected entirely by chance.
- Stratified Random Sampling: The population is divided into subgroups, and samples are taken from each.
- Systematic Random Sampling: Individuals are selected at regular intervals from an ordered list.
- Cluster Sampling: The population is divided into clusters, and a random sample of clusters is selected.
When to Use a Random Sample
- To make inferences about a large population without surveying everyone.
- To ensure fairness and minimize bias.
- When using statistical methods that rely on random sampling (e.g., confidence intervals, hypothesis testing).
Example
If you're conducting a survey on the effectiveness of a new product, you might randomly sample 500 customers from a pool of 10,000, ensuring each customer has an equal chance of being selected. This random sample can then be used to make generalizations about the entire customer base.
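The survey scenario can be sketched with `random.sample`, which performs simple random sampling without replacement (every subset of the given size is equally likely). The customer IDs below are hypothetical:

```python
import random

# Hypothetical pool of 10,000 customer IDs.
population = list(range(1, 10_001))

random.seed(7)  # seeded only for reproducibility of this sketch
sample = random.sample(population, 500)  # simple random sample, no repeats

print(len(sample))       # 500
print(len(set(sample)))  # 500, confirming sampling without replacement
```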
Sample Proportion
A sample proportion refers to the proportion of individuals in a sample that exhibit a specific characteristic or outcome. It is denoted by p̂ and is calculated by dividing the number of successes by the total number of individuals in the sample.
Formula for Sample Proportion
The formula for the sample proportion is:
p̂ = x / n
- p̂: Sample proportion
- x: Number of successes (individuals with the characteristic of interest)
- n: Total number of individuals in the sample
Example
If 60 out of 100 people surveyed prefer blue as their favorite color, the sample proportion of people who prefer blue is:
p̂ = 60 / 100 = 0.60
This means 60% of the people in the sample prefer blue.
When to Use Sample Proportion
- Estimating Population Proportion: When you're estimating the proportion of a population that has a certain characteristic.
- Hypothesis Testing: Used to compare a sample proportion to a hypothesized population proportion.
- Confidence Intervals: Used in constructing confidence intervals for population proportions.
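The formula p̂ = x / n is a one-liner; this sketch reproduces the blue-preference example above (the function name is our own):

```python
def sample_proportion(successes, n):
    """p-hat = x / n: the fraction of the sample with the characteristic."""
    return successes / n

# 60 of 100 surveyed people prefer blue:
p_hat = sample_proportion(60, 100)
print(p_hat)  # → 0.6
```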
Distribution of Sample Proportion
The sample proportion is approximately normally distributed when the sample size is large enough, specifically when both n * p and n * (1 - p) are at least 10 (the same rule of thumb used below). This ensures that the normal approximation holds.
Approximating Normality for Sample Proportions
You can approximate normality for sample proportions when certain conditions are met, as per the Central Limit Theorem.
Conditions for Approximating Normality
- Large Sample Size: The sample size must be large enough:
n * p ≥ 10
n * (1 - p) ≥ 10
Where:
- n: Sample size
- p: Population proportion
- 1 - p: Complement of the population proportion
- Independent Sampling: The samples must be independent of each other.
Why These Conditions Matter
- n * p ≥ 10: Ensures there are enough successes in the sample.
- n * (1 - p) ≥ 10: Ensures there are enough failures in the sample.
Example
Suppose you are surveying 200 people, and 40% (p = 0.4) of the population prefers a certain brand of soda. To check if normality can be approximated:
- Calculate n * p = 200 * 0.4 = 80
- Calculate n * (1 - p) = 200 * 0.6 = 120
Since both values are greater than 10, the normal approximation is appropriate.
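The condition check from the example can be wrapped in a small helper (the function name and the second test case are our own):

```python
def normal_approx_ok(n, p, threshold=10):
    """Rule of thumb: require n*p >= threshold and n*(1-p) >= threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold

# Survey example above: n = 200, p = 0.4
print(200 * 0.4, 200 * 0.6)        # 80.0 120.0
print(normal_approx_ok(200, 0.4))  # True: both counts exceed 10

# Hypothetical small sample where the approximation fails:
print(normal_approx_ok(15, 0.4))   # False: n*p = 6 is too small
```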
When Not to Use Normal Approximation
- If n * p or n * (1 - p) is less than 10, use the binomial distribution instead.
What is a Confidence Interval?
A confidence interval (CI) is a statistical tool used to estimate the range within which a population parameter (such as the mean or proportion) is likely to fall. It provides an interval estimate, rather than a single value estimate, and reflects the uncertainty associated with sampling.
The confidence level represents how certain we are that the true population parameter lies within this interval. A 90% confidence interval, for example, suggests that if we were to take 100 different samples from the population, approximately 90 of those intervals would contain the true population parameter, while 10 might not.
Confidence Interval Formula for a Proportion:
The formula for a confidence interval for a population proportion p is given by:
CI = p̂ ± z * √(p̂(1 - p̂) / n)
Where:
- p̂ is the sample proportion
- z is the z-score associated with the desired confidence level
- n is the sample size
Confidence Interval Formula for a Mean:
The formula for a confidence interval for a population mean μ is:
CI = x̄ ± z * (s / √n)
Where:
- x̄ is the sample mean
- s is the sample standard deviation
- n is the sample size
- z is the z-score corresponding to the desired confidence level
Confidence Levels and Z-Scores:
- 90% Confidence Level: The z-score is 1.645.
- 95% Confidence Level: The z-score is 1.96.
- 99% Confidence Level: The z-score is 2.576.
These z-scores come from the standard normal distribution and represent the number of standard deviations from the mean that correspond to the desired level of confidence.
Examples:
Example 1: 90% Confidence Interval
Imagine you are estimating the proportion of people who prefer a certain brand of soda. You take a sample of 100 people, and 60 of them say they prefer the brand. The sample proportion is:
p̂ = 60 / 100 = 0.60
To construct a 90% confidence interval for the population proportion:
CI = 0.60 ± 1.645 * √(0.60 * 0.40 / 100) = 0.60 ± 0.0806
So, the 90% confidence interval is approximately: [0.5194, 0.6806].
Example 2: 95% Confidence Interval
Now, imagine you want to estimate the average height of students in a class. From a sample of 50 students, you find the mean height is 170 cm with a sample standard deviation of 10 cm. To construct a 95% confidence interval for the population mean:
CI = 170 ± 1.96 * (10 / √50) = 170 ± 2.77
So, the 95% confidence interval is: [167.23, 172.77].
Example 3: 99% Confidence Interval
Finally, let’s estimate the proportion of people who approve of a new policy. You survey 200 people, and 130 of them approve. The sample proportion is:
p̂ = 130 / 200 = 0.65
To construct a 99% confidence interval for the population proportion:
CI = 0.65 ± 2.576 * √(0.65 * 0.35 / 200) = 0.65 ± 0.0869
So, the 99% confidence interval is: [0.5631, 0.7369].
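The three examples can be reproduced with two small helpers, one per formula above (the function names are our own; the z-scores are the standard values listed earlier):

```python
from math import sqrt

def proportion_ci(p_hat, n, z):
    """CI = p_hat ± z * sqrt(p_hat * (1 - p_hat) / n)."""
    moe = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - moe, p_hat + moe

def mean_ci(x_bar, s, n, z):
    """CI = x_bar ± z * (s / sqrt(n))."""
    moe = z * (s / sqrt(n))
    return x_bar - moe, x_bar + moe

print(proportion_ci(0.60, 100, 1.645))  # Example 1: 90% CI for a proportion
print(mean_ci(170, 10, 50, 1.96))       # Example 2: 95% CI for a mean
print(proportion_ci(0.65, 200, 2.576))  # Example 3: 99% CI for a proportion
```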
Conclusion
A confidence interval gives us a range within which we expect a population parameter to lie, with a specified level of confidence. The wider the confidence interval, the less precise the estimate, but the more confident we are that the interval contains the true parameter. The level of confidence (e.g., 90%, 95%, or 99%) determines how likely it is that the interval will capture the true population parameter.
What is the Margin of Error?
The margin of error is a statistic that quantifies the amount of random sampling error in a survey's results. It provides a range within which the true value of a population parameter (such as a population mean or proportion) is expected to lie, given the sample data. The margin of error is typically expressed as a plus-or-minus figure that indicates the range around the sample estimate.
Relation to the Confidence Interval:
The confidence interval (CI) is a range of values that is used to estimate the true population parameter. The margin of error is directly related to the confidence interval as it defines how wide the interval will be.
The confidence interval is constructed by adding and subtracting the margin of error from the sample estimate (e.g., sample mean or sample proportion):
Confidence Interval = Sample Estimate ± Margin of Error
Formula for Margin of Error:
The margin of error is calculated using the sample data and the desired confidence level (e.g., 90%, 95%, or 99%). It is influenced by the sample size, the variability in the data, and the confidence level:
Margin of Error = z × (σ / √n)
Where:
- z is the z-score corresponding to the chosen confidence level (e.g., 1.96 for 95% confidence),
- σ is the standard deviation of the population (or an estimate from the sample),
- n is the sample size.
Example:
For a 95% confidence level, the margin of error tells us how much we expect the sample statistic (such as a sample mean) to differ from the true population mean. If the margin of error is ±5, the confidence interval would be the sample estimate ± 5.
If a poll reports a 95% confidence interval of 50% ± 3%, the true proportion in the population is expected to fall between 47% and 53% with 95% confidence.
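The margin-of-error formula is a direct translation; in this sketch the σ and n values are illustrative, not taken from the text:

```python
from math import sqrt

def margin_of_error(z, sigma, n):
    """Margin of Error = z * (sigma / sqrt(n))."""
    return z * (sigma / sqrt(n))

# 95% confidence (z = 1.96), population std dev 15, sample of 100:
moe = margin_of_error(1.96, 15, 100)
print(round(moe, 2))  # → 2.94
```

As the formula shows, quadrupling the sample size halves the margin of error, since n sits under a square root.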
Summary:
- The margin of error is a measure of uncertainty or variability in the sample estimate.
- The confidence interval is a range of values based on the sample data that is likely to contain the true population parameter.
- The margin of error defines how wide the confidence interval will be, and it is based on the sample size, variability, and the chosen confidence level.