Probability in statistics

Imed Krisna Gupta

January 14, 2024

Profile

  • I Made Krisna Gupta (Imed)

  • Politeknik APP Jakarta, Universitas Indonesia, Center for Indonesian Policy Studies

  • PhD at the Australian National University, Master's at UI/VU Amsterdam

  • Research focus on international trade and public policy (particularly industrial policy)

  • more at krisna.or.id or @imedkrisna

About the course

  • Built on the “Introduction to Probability and Statistics” materials from ocw.mit.edu by Jeremy Orloff and Jonathan Bloom.

  • The content is the same as in various standard intros to probability for statistics & econometrics.

slides here

On today's session

  • Left out counting and motivation because these are trivial.

  • Didn’t have time to go through central tendency and variance properly.

  • Borrowed materials from Dr. Uka Wikarya.

  • Use English and proper notation so you can have a smooth transition to the VU program.

Frequentist vs Bayesian

  • Frequentists: probability measures the frequency of various outcomes of an experiment.

    • e.g., a 50% probability of heads means that if we toss the coin \(N\) times, roughly \(0.5N\) of the tosses are heads.
  • Bayesians: probability is an abstract concept that measures a state of knowledge or a degree of belief in a given proposition.

    • there is no single value of \(P(\text{Heads})\). Instead we ask: what is the probability that \(P(\text{Heads})=0.5\)? 0.4? 0.1? etc.
  • Most of our tools were developed by frequentists, but increasingly powerful computers have led to a resurgence of Bayesian methods.

Terminology

  • Experiment: a repeatable procedure with well-defined possible outcomes.

  • Sample space: the set of all possible outcomes, denoted sometimes by \(\Omega\), sometimes by \(S\).

  • Fair: all \(\omega \in \Omega\) have the same probability.

  • Event: a subset of the sample space.

  • Probability function: a function giving the probability of each outcome.

Examples

Tossing a fair coin.

Experiment: toss the coin, report if it lands heads or tails. Sample space: \(\Omega=\{H,T\}\). Probability function: \(P(H)=0.5, P(T)=0.5\).

Toss a fair coin 3 times.

Experiment: toss the coin 3 times, report outcomes. Sample space: \(\Omega=\{HHH,HHT,HTH,HTT,THH,THT,TTH,TTT\}\). Probability function: \(P(\omega)=\frac{1}{8} \ \forall \ \omega \in \Omega\).

Examples

Taxis (An infinite discrete sample space)

Experiment: count the number of taxis that pass UI Salemba during class. Sample space: \(\Omega=\{0,1,2,3,4, \dots \}\). Probability function: the Poisson distribution \(P(k)=e^{-\lambda} \frac{\lambda^k}{k!}\), where \(\lambda\) is the average number of taxis.

| Outcome     | 0                | 1                        | 2                                     | 3                                     | \(\dots\) | \(k\)                                 |
|-------------|------------------|--------------------------|---------------------------------------|---------------------------------------|-----------|---------------------------------------|
| Probability | \(e^{-\lambda}\) | \(e^{-\lambda} \lambda\) | \(e^{-\lambda} \frac{\lambda^2}{2!}\) | \(e^{-\lambda} \frac{\lambda^3}{3!}\) | \(\dots\) | \(e^{-\lambda} \frac{\lambda^k}{k!}\) |
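
As a quick numerical illustration, here is a minimal Python sketch of the Poisson pmf for the taxi example; the value \(\lambda = 3\) is an arbitrary assumption, not part of the original example.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(k) = e^(-lambda) * lambda^k / k!"""
    return exp(-lam) * lam**k / factorial(k)

lam = 3  # assumed average number of taxis per class; any positive value works
for k in range(5):
    print(k, round(poisson_pmf(k, lam), 4))

# Rule 2 check: the probabilities over all k sum to 1
print(sum(poisson_pmf(k, lam) for k in range(100)))  # ~1.0
```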

Events

  • An event \(E\) is a collection of outcomes. i.e., a subset of the sample space \(\Omega\).

  • For example, using the 3 coin experiment, what is the probability that exactly two heads show up?

  • We can write it as \(E=\)’exactly 2 heads’, or \(E=\{HHT,HTH,THH\}\). Note that \(E \subset \Omega\).

  • Since we know that \(P(\omega)=\frac{1}{8} \ \forall \ \omega \in \Omega\), we can compute \(P(E)=3/8\).
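
To see where 3/8 comes from, a minimal Python sketch (purely illustrative) that enumerates the 8 equally likely outcomes and counts those in \(E\):

```python
from itertools import product

# All 8 equally likely outcomes of tossing a fair coin 3 times
omega = list(product("HT", repeat=3))

# Event E: exactly two heads
E = [w for w in omega if w.count("H") == 2]

print(E)                    # [('H','H','T'), ('H','T','H'), ('T','H','H')]
print(len(E) / len(omega))  # 0.375 = 3/8
```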

Discrete sample space

  • A discrete sample space is one that is listable; it can be either finite or infinite.

  • \(\{H,T\}, \{1,2,3,4,5,6\}, \{1,2,3,\dots \}\) are all discrete sets. The first two are finite, the last one is infinite.

  • The interval \(0\leq x \leq 1\) is not discrete. It is continuous.

Probability function

  • For a discrete sample space \(S\), a probability function \(P\) assigns to each outcome \(\omega\) a number \(P(\omega)\) called the probability of \(\omega\).

  • \(P\) must satisfy two rules:

    • Rule 1: \(0 \leq P(\omega) \leq 1\)

    • Rule 2: the sum of the probabilities of all possible outcomes is 1.

  • Rule 2 in symbols: if \(S=\{\omega_1, \omega_2,...,\omega_n\}\), then \(\sum_{j=1}^n P(\omega_j)=1\)

  • The probability of an event is the sum of the probabilities of its outcomes: \(P(E)=\sum_{\omega \in E} P(\omega) \leq 1\)

Probability rules

For events \(A\), \(L\), and \(R\) contained in a sample space \(\Omega\):

Rule 1. \(P(A^c)=1-P(A)\)

Rule 2. If \(L\) and \(R\) are disjoint then \(P(L \cup R)=P(L)+P(R)\)

Rule 3. Inclusion-exclusion principle: if \(L\) and \(R\) are not disjoint (i.e., they overlap), then \(P(L \cup R)=P(L)+P(R)-P(L \cap R)\)
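
A quick worked example of Rule 3 (my own illustration, not from the original slides): roll one fair die, and let \(L\) = ‘the roll is even’ \(=\{2,4,6\}\) and \(R\) = ‘the roll is at least 4’ \(=\{4,5,6\}\). Then \(L \cap R=\{4,6\}\) and

\[ P(L \cup R)=P(L)+P(R)-P(L \cap R)=\frac{3}{6}+\frac{3}{6}-\frac{2}{6}=\frac{4}{6} \]

which matches counting \(L \cup R=\{2,4,5,6\}\) directly.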

Conditional probability

Conditional probability answers the question: ‘how does the probability of an event change if we have extra information?’

Example 1. Toss a fair coin 3 times.

  1. What is the probability of 3 heads? \(\Omega=\{HHH,HHT,HTH,HTT,THH,THT,TTH,TTT\}\). Since only one outcome, \(HHH\), gives 3 heads, \(P(HHH)=1/8\).

Conditional probability

  2. What is \(P(HHH)\) if we know the first toss is H? We have a new, reduced sample space \(\Omega'=\{HHH,HHT,HTH,HTT\}\). Can you answer \(P(HHH | \text{first toss is H})\)?

This is called conditional probability, since it takes into account additional conditions.

Conditional probability

Rephrase question 2 as events: Let \(A\) be the event ‘all three tosses are heads’ = \(\{HHH\}\). Let \(B\) be the event ‘the first toss is heads’ = \(\{HHH,HHT,HTH,HTT\}\).

The conditional probability of A knowing that B has happened is written \(P(A|B)\).

This is read as ‘the conditional probability of A given B’, or ‘the probability of A conditioned on B’, or simply ‘the probability of A given B’.

Conditional probability

We can give a formal definition of conditional probability:

Let \(A\) and \(B\) be events. The conditional probability of \(A\) given \(B\) is defined as

\[ P(A|B)=\frac{P(A \cap B)}{P(B)},\text{ provided } P(B) \neq 0 \]

Conditional probability

Let’s redo our previous calculation of 3 heads using this definition. Recall that \(A=\{HHH\}\) and \(B=\)‘the first toss is \(H\)’.

\[ P(A|B)=\frac{P(A \cap B)}{P(B)}=\frac{1/8}{1/2}=\frac{1}{4} \]

For more complicated events, using this formula is often preferred to counting.

Multiplication rule

Rearranging the definition of conditional probability gives the multiplication rule:

\[ P(A \cap B)=P(A|B) \cdot P(B) \]

Example: draw two cards from a deck. Using the multiplication rule, show that the chance of drawing two spades is \(3/51\).
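
A worked sketch of this exercise: let \(S_1\) = ‘first card is a spade’ and \(S_2\) = ‘second card is a spade’. Then

\[ P(S_1 \cap S_2)=P(S_2 | S_1) \cdot P(S_1)=\frac{12}{51} \cdot \frac{13}{52}=\frac{12}{51} \cdot \frac{1}{4}=\frac{3}{51} \]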

Law of total probability

Suppose the sample space \(\Omega\) is divided into 3 disjoint events \(B_1, B_2, B_3\), then for any event \(A\):

\[\begin{align*} P(A)&=P(A \cap B_1)+P(A \cap B_2)+P(A \cap B_3) \\ P(A)&=P(A|B_1)P(B_1)+P(A|B_2)P(B_2)+P(A|B_3)P(B_3) \end{align*}\]

The first line says that if \(A\) is divided into 3 disjoint pieces (one inside each \(B_i\)), then \(P(A)\) is the sum of the probabilities of the pieces. The second line, which applies the multiplication rule to each piece, is called the law of total probability.

Probability urn

An urn contains 5 red balls and 2 green balls. We draw 2 balls, one after the other and without replacement. What is the probability that the second ball is red?

Sample space \(\Omega=\{rr,rg,gr,gg\}\). Let \(R_1\)=‘first ball red’, \(R_2\)=‘second ball red’, \(G_1\)=‘first ball green’, \(G_2\)=‘second ball green’. The question is \(P(R_2)\).

\[ P(R_2)=P(R_2 | R_1)P(R_1)+P(R_2|G_1)P(G_1)=\frac{4}{6} \cdot \frac{5}{7}+\frac{5}{6} \cdot \frac{2}{7}=\frac{30}{42} \]

Probability urn

Under a slightly more complex rule, we can no longer count on simple counting.

Suppose that if the first draw is green, a red ball is added to the urn, and if the first draw is red, a green ball is added. The first ball isn’t returned. Find \(P(R_2)\).

Now \(P(R_2 | R_1)=4/7\) and \(P(R_2|G_1)=6/7\), therefore

\[ P(R_2)=P(R_2 | R_1)P(R_1)+P(R_2|G_1)P(G_1)=\frac{4}{7} \cdot \frac{5}{7}+\frac{6}{7} \cdot \frac{2}{7}=\frac{32}{49} \]

A probability tree is useful for this type of question.
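
As a sanity check, here is a minimal Monte Carlo sketch of the modified urn rule (my own illustration; the counts follow the slide). The estimate should land near \(32/49 \approx 0.653\).

```python
import random

def second_draw_is_red():
    """One trial: 5 red and 2 green balls; the first ball is not returned,
    and a ball of the opposite colour is added before the second draw."""
    urn = ["r"] * 5 + ["g"] * 2
    first = random.choice(urn)
    urn.remove(first)                         # first ball is not returned
    urn.append("g" if first == "r" else "r")  # add a ball of the other colour
    return random.choice(urn) == "r"

n = 100_000
estimate = sum(second_draw_is_red() for _ in range(n)) / n
print(estimate, 32 / 49)  # simulated vs exact P(R_2)
```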

Independence

  • Two events are independent if knowledge that one occurred does not change the probability that the other occurred.

  • \(A\) is independent of \(B\) if \(P(A|B)=P(A)\)

  • If \(A\) is independent of \(B\), then \(P(A \cap B)=P(A|B)P(B)=P(A)P(B)\).

  • Equivalently, two events \(A\) and \(B\) are independent if \(P(A \cap B)=P(A) \cdot P(B)\).

  • \(A\) is independent of \(B\) if and only if \(B\) is independent of \(A\).

Testing independence

Toss a fair coin twice. \(H_1\)=‘first toss is H’ and \(H_2\)=‘second toss is H’. Are \(H_1\) and \(H_2\) independent?

Toss a fair coin 3 times. Let \(A\)=‘total 2 heads’. Are \(H_1\) and \(A\) independent? Hint: find \(P(A)\) then check if \(P(A)=P(A|H_1)\).
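
For the second question, a minimal brute-force sketch following the hint (enumerating all 8 outcomes):

```python
from itertools import product

omega = list(product("HT", repeat=3))         # 8 equally likely outcomes

A  = [w for w in omega if w.count("H") == 2]  # A: exactly 2 heads in total
H1 = [w for w in omega if w[0] == "H"]        # H1: first toss is heads

p_A       = len(A) / len(omega)               # P(A)    = 3/8
p_A_given = sum(w in H1 for w in A) / len(H1) # P(A|H1) = 2/4

print(p_A, p_A_given)  # 0.375 vs 0.5, so A and H1 are not independent
```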

Bayes’ rule

For two events A and B, Bayes’ rule says

\[ P(B|A)=\frac{P(A|B)\cdot P(B)}{P(A)} \]

Bayes’ rule tells us how to ‘invert’ conditional probabilities. In practice, \(P(A)\) is often computed using the law of total probability.

Bayes’ rule

It is common to confuse \(P(A|B)\) and \(P(B|A)\).

Toss a fair coin 5 times. Let \(H_1=\)‘first toss is heads’ and let \(H_A=\)‘all 5 tosses are heads’. Then \(P(H_1|H_A)=1\) but \(P(H_A|H_1)=1/16\).

\[ P(H_1 | H_A)=\frac{P(H_A | H_1)P(H_1)}{P(H_A)}=\frac{1/16 \cdot 1/2}{1/32}=1 \]

The base rate fallacy

Consider a routine screening test for a disease. Suppose the frequency of the disease in the population (the base rate) is 0.5%. The test is fairly accurate, with a 5% false positive rate and a 10% false negative rate. You take the test and it comes back positive. What’s the probability you actually have the disease?

Let’s define the events: \(D^+=\) ‘you have the disease’, \(D^-=\) ‘you don’t have the disease’, \(T^+=\) ‘you tested positive’, and \(T^-=\) ‘you tested negative’.

\(P(D^+)=0.005\), therefore \(P(D^-)=0.995\). The false positive and false negative rates are conditional probabilities.

The base rate fallacy

\(P(\text{false positive})=P(T^+|D^-)=0.05\) and \(P(\text{false negative})=P(T^-|D^+)=0.1\)

The complements are true negative and true positive rates, which are:

\(P(T^-|D^-)=1-P(T^+|D^-)=0.95\) and \(P(T^+|D^+)=1-P(T^-|D^+)=0.9\)

You can actually put this in a probability tree.

The base rate fallacy

The question is: what’s the probability that you have the disease given that your test is positive, i.e., what is the value of \(P(D^+|T^+)\)? We don’t have this value directly, but we can use Bayes’ rule:

\[ P(D^+|T^+)=\frac{P(T^+|D^+) \cdot P(D^+)}{P(T^+)} \]

We use the law of total probability to compute \(P(T^+)\) (or just use the tree):

\[ P(T^+)=P(T^+|D^-)P(D^-)+P(T^+|D^+)P(D^+)=0.05 \times 0.995+0.9 \times 0.005=0.05425 \]

The answer is \(P(D^+|T^+)=\frac{0.9 \times 0.005}{0.05425} \approx 0.083\), i.e., about 8.3%.
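
The same arithmetic in a minimal Python sketch (the numbers come straight from the slide):

```python
p_disease  = 0.005               # base rate P(D+)
p_healthy  = 1 - p_disease       # P(D-)
p_pos_dis  = 0.90                # true positive rate  P(T+ | D+)
p_pos_heal = 0.05                # false positive rate P(T+ | D-)

# Law of total probability: P(T+)
p_pos = p_pos_dis * p_disease + p_pos_heal * p_healthy  # 0.05425

# Bayes' rule: P(D+ | T+)
print(p_pos_dis * p_disease / p_pos)  # ~0.083, i.e. about 8.3%
```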

The base rate fallacy

This is called “the base rate fallacy” because the base rate of the disease in the population is so low that the vast majority of people taking the test are actually healthy, so most positive results are false positives. To summarize:

“95% of all tests are accurate” does not imply “95% of positive tests are accurate”.

The base rate fallacy is also often worked out using a table of counts, here out of 10,000 people:

The base rate fallacy

|         | \(D^+\)          | \(D^-\)          | total |
|---------|------------------|------------------|-------|
| \(T^+\) | \(D^+ \cap T^+\) | \(D^- \cap T^+\) |       |
| \(T^-\) | \(D^+ \cap T^-\) | \(D^- \cap T^-\) |       |
| total   | 50               | 9950             | 10000 |

|         | \(D^+\) | \(D^-\) | total |
|---------|---------|---------|-------|
| \(T^+\) | 45      | 498     | 543   |
| \(T^-\) | 5       | 9452    | 9457  |
| total   | 50      | 9950    | 10000 |

Discrete random variables

A random variable assigns a number to each outcome in a sample space.

Let \(\Omega\) be a sample space. A discrete random variable is a function

\[ X : \Omega \rightarrow \mathbb{R} \]

that takes a discrete set of values. It’s random because its value depends on a random outcome of an experiment.

A game of dice

For any value \(a\), we write \(X=a\) to mean the event consisting of all outcomes \(\omega\) with \(X(\omega)=a\).

Roll a fair die twice and record the outcome as \((i,j)\), where \(i\) is the outcome of the first roll and \(j\) is the outcome of the second roll. The sample space is thus

\[ \Omega=\{(1,1),(1,2),\dots,(6,6)\}=\{(i,j) \mid i,j=1,\dots,6\} \]

A game of dice

In this game, you win $500 if the sum is 7 and lose $100 otherwise. The payoff function is \[ X(i,j)= \begin{cases} 500 & \text{if } i+j=7 \\ -100 & \text{if } i+j \neq 7 \end{cases} \]

The event \(X=500\) is the set \(\{(1,6),(2,5),(3,4),(4,3),(5,2),(6,1)\}\), so \(P(X=500)=1/6\).
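
A minimal enumeration sketch of the payoff variable \(X\) (illustrative only):

```python
from itertools import product
from collections import Counter

# All 36 equally likely rolls (i, j) of two fair dice
rolls = list(product(range(1, 7), repeat=2))

# Payoff: win 500 if the sum is 7, lose 100 otherwise
payoffs = Counter(500 if i + j == 7 else -100 for i, j in rolls)

for value, count in sorted(payoffs.items()):
    print(value, count, "/ 36")  # P(X=500) = 6/36 = 1/6, P(X=-100) = 30/36
```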

Probability mass function

The probability mass function (pmf) of a discrete random variable \(X\) is the function \(p(a)=P(X=a)\). Note that:

  1. \(p(a)\) always satisfies \(0\leq p(a) \leq 1\).

  2. \(a\) can be any number; if \(a\) is a value that \(X\) never takes, then \(p(a)=0\).

Let \(\Omega\) be the sample space for rolling 2 dice. Let \(M\) be the maximum value of the 2 dice. The pmf of \(M\) is:

| \(a\)        | 1    | 2    | 3    | 4    | 5    | 6     |
|--------------|------|------|------|------|------|-------|
| pmf \(p(a)\) | 1/36 | 3/36 | 5/36 | 7/36 | 9/36 | 11/36 |

Cumulative distribution function

The cumulative distribution function (cdf) of a random variable \(X\) is the function \(F\) given by \(F(a)=P(X\leq a)\).

| \(a\)        | 1    | 2    | 3    | 4     | 5     | 6     |
|--------------|------|------|------|-------|-------|-------|
| pmf \(p(a)\) | 1/36 | 3/36 | 5/36 | 7/36  | 9/36  | 11/36 |
| cdf \(F(a)\) | 1/36 | 4/36 | 9/36 | 16/36 | 25/36 | 36/36 |

\(F(a)\) is called the cumulative distribution function because \(F(a)\) gives the total probability that accumulates by adding up the probabilities \(p(b)\) as \(b\) runs from \(-\infty\) to \(a\).
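
A minimal sketch that reproduces both rows of the table by enumeration and accumulation:

```python
from itertools import product
from fractions import Fraction

rolls = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes

# pmf of M = maximum of the two dice
pmf = {a: Fraction(sum(max(r) == a for r in rolls), 36) for a in range(1, 7)}

# cdf F(a) = P(M <= a), accumulated from the pmf as b runs up to a
cdf, running = {}, Fraction(0)
for a in range(1, 7):
    running += pmf[a]
    cdf[a] = running

print(pmf)  # 1/36, 1/12 (=3/36), 5/36, 7/36, 1/4 (=9/36), 11/36
print(cdf)  # 1/36, 1/9 (=4/36), 1/4 (=9/36), 4/9 (=16/36), 25/36, 1
```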

Various distributions