Optimization for Machine Learning: Bayesian Inference and MAP Estimation

Slides from Politecnico di Torino on Optimization for Machine Learning, covering Bayesian inference, MAP estimation, and prior selection, and illustrating the use of a Beta prior with a Bernoulli likelihood.

LECTURE 2
Bayesian Inference
Posterior ∝ Likelihood × Prior
T. Bayes

Outline

  1. Bayes' rule
  2. Bayesian estimators
    • Example: estimating a Bernoulli parameter

Introduction

Bayesian inference is a mathematical procedure that applies probabilities to statistical problems. It provides the tools to update one's beliefs in light of the evidence provided by new data.

  • Bayes' Theorem

    p(A|B) = p(A) p(B|A) / p(B).

    A is some statement (e.g., "the subject is pregnant"), and B is some data or evidence (e.g., "the human chorionic gonadotropin (HCG) test is positive").

  • We cannot observe A directly, but we observe some related "clue" B:

    p(pregnant|HCG+) = p(pregnant) p(HCG+|pregnant) / p(HCG+).

    p(pregnant|HCG+): the probability that the subject is pregnant, given the information that the HCG test is positive.

  • Notice that the HCG test is not infallible!

    p(A|B) = p(A) p(B|A) / p(B)  →  p(pregnant|HCG+) = p(pregnant) p(HCG+|pregnant) / p(HCG+).

  • p(pregnant): the probability of being pregnant, before looking at any evidence: it is the a-priori plausibility of statement A, for instance based on age, nationality, etc.

  • p(HCG+|pregnant) (likelihood) expresses the probability of the test responding positively to a positive (pregnant) subject. It is a characteristic of the test in use (sensitivity).
  • p(HCG+) is the total plausibility of the evidence, whereby we have to consider all possible scenarios to ensure that the posterior is a proper probability distribution:

    p(HCG+) = p(HCG+|preg.)p(preg.) + p(HCG+|not preg.)p(not preg.)

  • Bayes' rule thus effectively updates our initial beliefs about a proposition with some observation, yielding a final measure of its plausibility, given the evidence (a numerical sketch follows below).
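As a rough numerical illustration of the update above, here is a minimal Python sketch; the prior, the test sensitivity, and the false-positive rate below are hypothetical numbers chosen for illustration, not values from the slides.

```python
# Hypothetical numbers, for illustration only: a 2% prior probability of
# pregnancy, 99% test sensitivity, and a 5% false-positive rate.
p_preg = 0.02                # prior p(pregnant)
p_pos_given_preg = 0.99      # likelihood p(HCG+ | pregnant), i.e., sensitivity
p_pos_given_not = 0.05       # p(HCG+ | not pregnant), false-positive rate

# Evidence via the law of total probability.
p_pos = p_pos_given_preg * p_preg + p_pos_given_not * (1 - p_preg)

# Bayes' rule: posterior = prior * likelihood / evidence.
p_preg_given_pos = p_preg * p_pos_given_preg / p_pos
print(f"p(pregnant | HCG+) = {p_preg_given_pos:.3f}")   # about 0.288
```

Even with a sensitive test, the posterior stays well below 1 because the prior is small and false positives are possible, which is exactly the point of the remark above.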

Bayes' Theorem

In pictures: [Venn diagram of two overlapping events, A and B]

  • p(A ∩ B) is the joint probability of A and B, often also denoted by p(A, B)

  • The conditional probability of A given B is defined as

    p(A|B) = p(A ∩ B) / p(B)

  • Since p(A ∩ B) = p(B ∩ A), we also have that

    p(B|A) = p(A ∩ B) / p(A) = p(A|B) p(B) / p(A),

    which is Bayes' rule.

Law of total probability

  • Let {B_i : i = 1, 2, 3, ...} be a finite or countably infinite partition of a sample space (i.e., a set of pairwise disjoint events whose union is the entire sample space), and each event B_i is measurable.
  • Then for any event A of the same probability space it holds that

    p(A) = Σ_i p(A ∩ B_i) = Σ_i p(A|B_i) p(B_i)

  • We can hence write Bayes' rule as

    p(B_k|A) = p(A|B_k) p(B_k) / p(A) = p(A|B_k) p(B_k) / Σ_i p(A|B_i) p(B_i).
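As a small sanity check of this formula, the following Python sketch applies it to a toy three-event partition; the values of p(B_i) and p(A|B_i) are arbitrary and chosen only for illustration.

```python
import numpy as np

# Toy partition {B1, B2, B3} of the sample space; numbers are arbitrary.
p_B = np.array([0.5, 0.3, 0.2])           # p(B_i), sums to 1
p_A_given_B = np.array([0.1, 0.4, 0.8])   # p(A | B_i)

# Law of total probability: p(A) = sum_i p(A | B_i) p(B_i).
p_A = np.sum(p_A_given_B * p_B)

# Bayes' rule for every element of the partition.
p_B_given_A = p_A_given_B * p_B / p_A
print(p_A)            # 0.33
print(p_B_given_A)    # posterior over the partition, sums to 1
```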

Bayes' rule for probability density functions (pdf)

  • Let θ ∈ R^n be a vector of parameters we are interested in.
  • We describe our assumptions and a-priori knowledge about θ, before observing any data, in the form of a prior probability distribution (or, prior belief) p(θ).
  • Let D denote a set of observed data related to θ. The statistical model relating D to θ is expressed by the conditional distribution p(D|θ), which is called the likelihood.
  • Bayes' rule for pdf then states that

    p(θ|D) = p(D|θ) p(θ) / p(D),

    where p(θ|D) is the posterior distribution, representing the updated state of knowledge about θ, after we see the data.

  • Since p(D) only acts as a normalization constant, we can state Bayes' rule as

    Posterior ∝ Likelihood × Prior.


  • The denominator p(D) can be expressed in terms of the prior and the likelihood, via the continuous version of the total probability rule

    p(D) = ∫ p(D|θ) p(θ) dθ.

  • The posterior distribution of θ can be used to infer information about θ, i.e., to find a suitable point estimate θ̂ of θ (more on this in the following slides).
  • The Bayesian approach is very suitable for on-line, recursive inference, since it can be applied recursively: as new data is gathered the current posterior becomes the new prior, and a new posterior is computed based on the new data, and so on ...

    p(θ) --[Bayes' rule, D1]--> p(θ|D1) --[Bayes' rule, D2]--> p(θ|D1, D2) --[Bayes' rule, D3]--> ...

    (prior → posterior, which becomes the new prior → posterior, which becomes the new prior → etc.)
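A minimal Python sketch of this recursive scheme, assuming (purely for illustration) a Bernoulli likelihood, a uniform prior, a grid discretization of θ, and two made-up data batches D1 and D2:

```python
import numpy as np

# Represent all densities as probability masses on a grid of theta values.
theta = np.linspace(0.0, 1.0, 1001)
posterior = np.ones_like(theta)
posterior /= posterior.sum()             # uniform prior p(theta)

batches = [np.array([1, 0, 1]), np.array([1, 1, 0, 1])]   # D1, D2 (made up)
for D in batches:
    k, n = D.sum(), D.size
    likelihood = theta**k * (1.0 - theta)**(n - k)   # p(D | theta), Bernoulli trials
    posterior = posterior * likelihood               # prior x likelihood
    posterior /= posterior.sum()                     # normalize (divide by p(D))
    # the normalized posterior now acts as the prior for the next batch

print(theta[np.argmax(posterior)])       # e.g., the current MAP estimate
```

Each pass through the loop is one application of Bayes' rule, with the previous posterior playing the role of the prior.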

Bayesian Estimators


  • In statistics, point estimation involves the use of sample data to calculate a single value (known as a statistic) which is to serve as a best guess or "best estimate" of an unknown population parameter θ.
  • Bayesian point-estimators are the central-tendency statistics of the posterior distribution (e.g., its mean, median, or mode):
    • The posterior mean, which minimizes the (posterior) risk (expected loss) for a squared-error loss function;
    • The posterior median, which minimizes the posterior risk for the absolute-value loss function;
    • The maximum a posteriori (MAP) estimate, which finds a maximum of the posterior distribution;
    • The Maximum Likelihood (ML) estimate, which coincides with the MAP under uniform prior probability.

Loss functions

  • For a given point estimate θ̂, a loss function ℓ measures the error in predicting θ via θ̂.
  • Typical loss functions are the following ones (assuming for simplicity θ to be scalar):

    Quadratic:       ℓ(θ − θ̂) = (θ − θ̂)²
    Absolute-value:  ℓ(θ − θ̂) = |θ − θ̂|
    Hit-or-miss:     ℓ(θ − θ̂) = 0 if |θ − θ̂| ≤ δ,  1 if |θ − θ̂| > δ
    Huber loss:      ℓ(θ − θ̂) = ½(θ − θ̂)² if |θ − θ̂| ≤ δ,  δ(|θ − θ̂| − δ/2) if |θ − θ̂| > δ
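For reference, the four losses above can be written as short Python functions; the tolerance delta and the evaluation points are arbitrary illustrative choices.

```python
import numpy as np

def quadratic(e):
    return e**2

def absolute(e):
    return np.abs(e)

def hit_or_miss(e, delta=0.1):
    # 0 if the error is within delta, 1 otherwise
    return np.where(np.abs(e) <= delta, 0.0, 1.0)

def huber(e, delta=0.1):
    # quadratic near zero, linear in the tails
    return np.where(np.abs(e) <= delta, 0.5 * e**2, delta * (np.abs(e) - delta / 2))

e = np.linspace(-1.0, 1.0, 5)   # a few error values theta - theta_hat
print(quadratic(e), absolute(e), hit_or_miss(e), huber(e), sep="\n")
```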

Estimation

  • Expected loss: how much do I pay, on average, if I guess θ by using the point estimate θ̂.
  • Bayesian estimators are defined by minimizing the expected loss, under the posterior conditional density:

    θ̂ = arg min_θ̂ ∫ ℓ(θ − θ̂) p(θ|D) dθ = arg min_θ̂ E{ℓ(θ − θ̂) | D}.

  • We next discuss three relevant special cases.

Minimum Mean-Square Error (MMSE)


  • For the quadratic loss ℓ(θ − θ̂) = (θ − θ̂)², we have

    E{ℓ(θ − θ̂) | D} = ∫ (θ − θ̂)² p(θ|D) dθ

  • To compute the minimum w.r.t. θ̂, we compute the derivative of the objective

    ∂/∂θ̂ ∫ (θ − θ̂)² p(θ|D) dθ = ∫ ∂/∂θ̂ (θ − θ̂)² p(θ|D) dθ = −∫ 2(θ − θ̂) p(θ|D) dθ

  • Setting the previous derivative to zero, we obtain

    ∫ 2(θ − θ̂) p(θ|D) dθ = 0  ⇔  θ̂ ∫ p(θ|D) dθ = ∫ θ p(θ|D) dθ  ⇔  θ̂ = ∫ θ p(θ|D) dθ = E{θ | D},

    since ∫ p(θ|D) dθ = 1.


  • Thus, we have the minimum:

    θ̂_MMSE = ∫ θ p(θ|D) dθ = E{θ | D},

    i.e., the mean of the posterior pdf p(θ|D).

  • This is called the minimum mean square error estimator (MMSE estimator), because it minimizes the average squared error.
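On a grid discretization of θ the posterior mean is just a weighted average; a minimal sketch, assuming an illustrative Beta(3, 2)-shaped posterior (not a posterior from the slides):

```python
import numpy as np

# Posterior represented as probability masses on a grid of theta values.
theta = np.linspace(0.0, 1.0, 1001)
w = theta**2 * (1.0 - theta)     # unnormalized Beta(3, 2)-shaped posterior p(theta | D)
w /= w.sum()                     # normalize to sum to 1

theta_mmse = np.sum(theta * w)   # posterior mean E{theta | D}
print(theta_mmse)                # approximately 0.6 for this posterior
```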

Minimum Absolute Error (MAE)


  • For the absolute-value loss ℓ(θ − θ̂) = |θ − θ̂|, we have

    E{ℓ(θ − θ̂) | D} = ∫ |θ − θ̂| p(θ|D) dθ

  • It can be shown that the minimum w.r.t. θ̂ is obtained when

    ∫_{−∞}^{θ̂} p(θ|D) dθ = ∫_{θ̂}^{+∞} p(θ|D) dθ

  • In words, the estimate θ̂ is the value which divides the probability mass into equal proportions:

    ∫_{−∞}^{θ̂} p(θ|D) dθ = 1/2,

    which is the definition of the median of the posterior pdf.

  • A well known fact, see, e.g., J.B.S. Haldane, "Note on the median of a multivariate distribution," Biometrika, vol. 35, pp. 414-417, 1948.
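Numerically, the posterior median can be read off the discretized posterior CDF; a minimal sketch, using the same illustrative Beta(3, 2)-shaped posterior as in the MMSE sketch:

```python
import numpy as np

# Same grid representation of the posterior as before.
theta = np.linspace(0.0, 1.0, 1001)
w = theta**2 * (1.0 - theta)     # unnormalized Beta(3, 2)-shaped posterior
w /= w.sum()

cdf = np.cumsum(w)                            # posterior CDF on the grid
theta_mae = theta[np.searchsorted(cdf, 0.5)]  # first grid point where CDF >= 1/2
print(theta_mae)                              # approximately 0.614 for this posterior
```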

Maximum A-Posteriori Estimator (MAP)


  • For the hit-or-miss loss

    ℓ(θ − θ̂) = 0 if |θ − θ̂| ≤ δ,  1 if |θ − θ̂| > δ,

    we have

    E{ℓ(θ − θ̂) | D} = ∫_{−∞}^{θ̂−δ} p(θ|D) dθ + ∫_{θ̂+δ}^{+∞} p(θ|D) dθ = 1 − ∫_{θ̂−δ}^{θ̂+δ} p(θ|D) dθ

  • This is minimized by maximizing

    ∫_{θ̂−δ}^{θ̂+δ} p(θ|D) dθ.

  • For small δ and suitable assumptions on p(θ|D) (e.g., smooth, quasi-concave), the maximum occurs at the maximum of p(θ|D).
  • Therefore, the estimator is the mode (the peak value) of the posterior pdf, hence the name Maximum A-Posteriori (MAP) estimator.
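On a grid, the MAP estimate is simply the grid point with the largest posterior value; a minimal sketch with the same illustrative Beta(3, 2)-shaped posterior (normalization does not affect the arg max):

```python
import numpy as np

theta = np.linspace(0.0, 1.0, 1001)
w = theta**2 * (1.0 - theta)       # unnormalized Beta(3, 2)-shaped posterior
theta_map = theta[np.argmax(w)]    # mode of the posterior
print(theta_map)                   # approximately 2/3 for this posterior
```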

MAP and Maximum Likelihood (ML) estimators


  • The MAP estimate is

    θ̂_MAP = arg max_θ p(θ|D)  [by Bayes' rule]  = arg max_θ p(D|θ) p(θ),

    where p(D|θ) is the likelihood, and p(θ) is the prior.

  • If the prior is uniform, i.e., p(θ) = const., then maximizing p(θ|D) is equivalent to maximizing the likelihood p(D|θ).
  • The maximum likelihood estimator is

    θ̂_ML = arg max_θ p(D|θ),

    and it is equivalent to the MAP estimator, under uniform prior.
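A quick numerical check of this equivalence, assuming (for illustration) a Bernoulli likelihood with k successes out of N trials and a flat prior on θ:

```python
import numpy as np

theta = np.linspace(0.0, 1.0, 1001)
k, N = 7, 10
likelihood = theta**k * (1.0 - theta)**(N - k)   # p(D | theta)
prior = np.ones_like(theta)                      # uniform prior p(theta)

theta_ml = theta[np.argmax(likelihood)]
theta_map = theta[np.argmax(likelihood * prior)]
print(theta_ml, theta_map)    # both equal k / N = 0.7 on this grid
```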

Estimators Summary


  • Probabilistic model:
    • p(θ): the prior
    • p(D|θ): the likelihood
    • p(θ|D) = const. × p(D|θ) p(θ): the posterior.
  • Estimators:
    • θ̂_MMSE is the mean of the posterior p(θ|D). It is the value which minimizes the expected quadratic loss.
    • θ̂_MAE is the median of the posterior p(θ|D). It is the value which minimizes the expected absolute loss.
    • θ̂_MAP is the maximum (i.e., peak, or mode) of the posterior p(θ|D). It is (approximately) the value which minimizes the expected hit-or-miss loss.
    • θ̂_ML is the maximum (i.e., peak, or mode) of the likelihood function p(D|θ).

Example: Bernoulli Experiment Outcome Probability


  • A Bernoulli experiment is an experiment whose outcome y is random and binary, say 0 (fail) or 1 (success).
  • The random outcome of a Bernoulli experiment is dictated by an underlying probability θ ∈ [0, 1], which is the success probability of the experiment, i.e.,

    p(y = 1) = θ.

  • We here assume that θ is unknown, and we want to estimate it from observed data.
  • Let D = {y1, ..., yN} be the observed outcomes of N such experiments. Here yi, i = 1, ..., N, are i.i.d. trials, each having distribution yi ~ Ber(θ), where Ber(x|θ) is the Bernoulli distribution

    Ber(x|θ) = θ^{1(x)} (1 − θ)^{1(1−x)} = θ for x = 1,  1 − θ for x = 0,

    where 1(x) is equal to one for x = 1 and is zero for x = 0.
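As a small simulation sketch of this setup, the code below draws N i.i.d. Bernoulli trials with an assumed "true" value of θ (chosen only for illustration) and computes the maximum likelihood estimate, which for this likelihood is the empirical success frequency k/N:

```python
import numpy as np

# Simulate N i.i.d. Bernoulli trials with an assumed "true" success probability.
rng = np.random.default_rng(0)
theta_true, N = 0.3, 100
y = rng.binomial(1, theta_true, size=N)   # observed data D = {y_1, ..., y_N}

# The likelihood p(D | theta) = theta^k (1 - theta)^(N - k), with k = sum(y_i),
# is maximized at the empirical frequency, so the ML estimate is k / N.
theta_ml = y.mean()
print(theta_ml)
```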

