PROBABILITY
Unit Structure:
1.0 Objectives
1.1 Introduction
1.2 A brief review of probability theory
1.2.1 Discrete random variables
1.2.2 Fundamental rules
1.2.3 Bayes rule
1.2.4 Independence and conditional independence
1.2.5 Continuous random variables
1.2.6 Quantiles
1.2.7 Mean and variance
1.3 Some common discrete distributions
1.3.1 Bernoulli distributions
1.3.2 Binomial distributions
1.3.3 Hypergeometric distribution
1.3.4 Poisson distribution
1.3.5 Multinomial distribution
1.4 Some common continuous distributions
1.4.1 Gaussian (normal) distribution
1.4.2 Degenerate pdf
1.4.3 The Laplace distribution
1.4.4 The gamma distribution
1.4.5 The beta distribution
1.5 Joint probability distributions
1.5.1 Covariance and correlation
1.5.2 The multivariate Gaussian
1.5.3 Multivariate Student t distribution
Track D: Machine Learning – II (Advanced Machine Learning)
1.6 Transformations of random variables
1.6.1 Linear transformations
1.6.2 General transformations
1.6.3 Central limit theorem
1.7 Monte Carlo approximation
1.7.1 Example: change of variables, the MC way
1.7.2 Example: estimating π by Monte Carlo integration
1.7.3 Accuracy of Monte Carlo approximation
1.8 Information theory
1.8.1 Entropy
1.8.2 KL divergence
1.8.3 Mutual information
1.9 References
1.10 Questions
1.0 OBJECTIVES
After completing this chapter, you will be able to understand probability
theory, some common discrete distributions, some common continuous
distributions, and joint probability distributions, as well as transformations
of random variables, Monte Carlo approximation, and information theory,
together with directed graphical models, mixture models, and the EM
algorithm: latent variable models, mixture models, parameter estimation
for mixture models, and the EM algorithm.
1.1 INTRODUCTION
Probability can have several meanings in everyday conversation.
Probability theory has been developed and applied in two major ways.
Games of chance involving coins, cards, dice, and roulette wheels are the
classic setting in which probabilities are interpreted as relative frequencies.
Although the outcome of any given trial in a game of chance cannot be
predicted with certainty, the results of many trials show some regularity.
As an example, saying that a coin lands heads with probability one-half
means, according to the relative frequency interpretation, that in a large
number of tosses the fraction of "heads" will be about one-half, while
implying nothing about the outcome of any given toss.
We all know that "the probability that a coin will land heads is 0.5". How
does that work? Probability can be interpreted in at least two different
ways. The frequentist interpretation is one. Probabilities, in this view,
describe the long run frequency of events. For example, the preceding
statement implies that if we flip the coin many times, we can anticipate
it to land heads around half of the time. The Bayesian interpretation of
probability is the alternative interpretation. Probability, according to this
viewpoint, is used to quantify our uncertainty about something; hence, it is
primarily tied to information rather than repeated trials (Jaynes 2003).
According to the Bayesian perspective, the preceding statement means
that we believe the coin is equally likely to land heads or tails on the next toss.
Probability is the degree of certainty that an unknown event will occur.
Eg: The probability of rain today is 0.3.
Event space: All -possible -outcomes space.
Eg: E = {1, 2, 3, 4, 5, 6} for a die; E = {H, T} for a coin.
Random variable: a variable whose values depend on the outcome of a
random phenomenon.
Two types of random variables: discrete and continuous.
1.2 A BRIEF REVIEW OF PROBABILITY THEORY
Mathematical study of random phenomena is known as probability theory.
A random event's outcome can take any of a number of different forms; it
cannot be predicted before it happens. The final result is thought to have
been determined by chance.
1.2.1 Discrete random variables
The probability that the event A occurs is denoted by the term p(A). A
may be the logical phrase "it will rain tomorrow," for example. We require
that 0 ≤ p(A) ≤ 1, where p(A)=0 indicates that the event will almost
certainly not occur, and p(A) = 1 indicates that the event will almost
certainly occur. The probability of the event "not A" is denoted by p(Ā),
which is defined as p(Ā) = 1 − p(A). We will frequently write A = 1 to
indicate that event A is true and A = 0 to indicate that event A is false.
We can extend the concept of binary events by introducing a discrete
random variable X that can take any value from a finite or countably
infinite set X. The probability that X = x is denoted by p(X = x), or
just p(x) for short. In this context, p() is called a probability mass
function, or pmf. This meets the properties 0 ≤ p(x) ≤ 1 and
Σ_{x∈X} p(x) = 1.
1.2.2 Fundamental rules
1. Probability of a union of two events.
Given two events, A and B, we define the probability of A or B as
follows:
p(A ∨ B) = p(A) + p(B) − p(A ∧ B)
= p(A) + p(B), if A and B are mutually exclusive
2. Joint probabilities
We define the probability of the joint event A and B as follows:
p(A, B) = p(A ∧ B) = p(A|B)p(B)
This is known as the product rule. Given a joint distribution on two
events p(A, B), the marginal distribution is defined as follows:
p(A) = Σ_b p(A, B = b) = Σ_b p(A|B = b) p(B = b)
where we are summing across all possible states of B. Similarly, we can
define p(B). This is also known as the sum rule or the total probability
rule.
3. Conditional probability
The conditional probability of event A given that event B is true is defined
as follows:
p(A|B) = p(A, B) / p(B), if p(B) > 0
1.2.3 Bayes rule
The Bayes rule, also known as the Bayes Theorem, is obtained by
combining the definition of conditional probability with the product and
sum rules:
p(A|B) = p(B|A) p(A) / p(B) = p(B|A) p(A) / Σ_a p(B|A = a) p(A = a)
Example.
Assume 15 men out of 300 and 25 women out of 1000 are good orators.
An orator is chosen at random. Determine the probability that the chosen
orator is a man, assuming there are equal numbers of men and women.
Solution:
Let there be 1000 men and 1000 women.
Let E1 and E2 be the events of choosing a man and a woman respectively.
Then,
P(E1) = 1000/2000 = 1/2, and P(E2) = 1000/2000 = 1/2
Let E be the event of choosing an orator. Then,
P(E|E1) = 50/1000 = 1/20, and P(E|E2) = 25/1000 = 1/40
The probability of selecting a man, given that the person selected is a
good orator, is
P(E1|E) = P(E|E1) · P(E1) / [P(E|E1) · P(E1) + P(E|E2) · P(E2)]
= (1/2 × 1/20) / {(1/2 × 1/20) + (1/2 × 1/40)}
= 2/3
Hence the required probability is 2/3.
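The worked example above can be sketched directly in code. This is a minimal illustration of Bayes' rule for two mutually exclusive hypotheses; the function name `bayes_posterior` is ours, and the numbers follow the solution above.

```python
# Bayes' rule for two mutually exclusive, exhaustive hypotheses A and B:
# P(A | E) = P(E|A) P(A) / (P(E|A) P(A) + P(E|B) P(B))
def bayes_posterior(prior_a, lik_a, prior_b, lik_b):
    return (prior_a * lik_a) / (prior_a * lik_a + prior_b * lik_b)

# P(orator | man) = 50/1000 = 1/20, P(orator | woman) = 25/1000 = 1/40,
# equal priors of 1/2 each:
p_man_given_orator = bayes_posterior(prior_a=0.5, lik_a=1 / 20,
                                     prior_b=0.5, lik_b=1 / 40)
print(p_man_given_orator)  # 2/3 ≈ 0.6667
```

Note how the denominator is exactly the sum rule applied to p(E).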
1.2.4 Independence and conditional independence
Let us first define conditional independence:
If two events A and B are conditionally independent given a third
event Y, then the occurrence or non-occurrence of A provides no
information about the occurrence or non-occurrence of B (given Y);
i.e., A and B are conditionally independent iff, given Y, knowledge of A's
occurrence provides no information on the likelihood of B occurring,
and vice versa.
In terms of probabilities:
1. P(A ∩ B|Y) = P(A|Y).P(B|Y)
2. P(A|B ∩Y) = P(A|Y)
Be clear that conditional independence does not imply independence, and
vice versa. Let's explore some of the basic definitions.
1. Independence:
For two events A and B, the occurrence of one event has no effect
whatsoever on the occurrence of the other:
P(A ∩ B) = P(A) · P(B)
P(A|B) = P(A)
2. Conditional Independence:
Sometimes two events are not independent on their own, but become
independent given a third event.
A is conditionally independent of B given C means
P(A|B, C) = P(A|C)
i.e., once C is observed, B has no impact on the value of A.
We say X and Y are unconditionally independent or marginally
independent, denoted X ⊥ Y, if we can represent the joint as the product of
the two marginals,
i.e., X ⊥ Y ⇐⇒ p(X, Y) = p(X)p(Y)
Theorem:
X ⊥ Y | Z iff there exist functions g and h such that
p(x, y|z) = g(x, z) h(y, z)
for all x, y, z such that p(z) > 0.
1.2.5 Continuous random variables
The two forms of random variables are continuous random variables and
discrete random variables. A random variable is one whose value depends
on the outcome of a random experiment. A discrete random variable is
defined at precise values, whereas a continuous random variable is
defined over a range of values.
For instance, consider how long it takes to finish a 60-minute test:
the possible values are all real numbers in the interval [0, 60].
A random variable with an uncountably infinite number of possible values
is referred to as a continuous random variable. As a result, the probability
that a continuous random variable takes any single exact value is zero.
A continuous random variable's features are described using the
probability density function and the cumulative distribution function.
The probabilities connected to a continuous random variable are expressed
using the probability density function (pdf) and the cumulative distribution
function (CDF). The formulas for these functions for continuous random
variables are given below.
pdf (probability density function):
We often use fX(x) to denote the pdf of X. It satisfies fX(x) ≥ 0; note
that fX(x) can be larger than 1, since it is a density rather than a
probability.
PDF of Continuous Random Variable
A function that estimates the probability that a continuous random
variable's value will fall within a given range of values is known as the
probability density function. Given that X is assumed to be a continuous
random variable, the formula for the pdf, f(x), is as follows:
f(x) = dF(x)/dx
where F(x) is the cumulative distribution function.
The pdf of a continuous random variable must meet the requirements
listed below in order to be valid:
∫_{−∞}^{∞} f(x) dx = 1. This specifies that the entire area under the
graph of the pdf must be equal to 1.
f(x) ≥ 0. This implies that a continuous random variable's probability
density function cannot be negative.
CDF of Continuous Random Variable
The probability density function can be integrated to obtain the cumulative
distribution function of a continuous random variable. It can be
characterised as the probability that the random variable X will have a
value less than or equal to a specific value x:
F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(t) dt
and the probability that X lies between two points a and b is
P(a ≤ X ≤ b) = F(b) − F(a) = ∫_{a}^{b} f(x) dx
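The exam-time example above can be made concrete. This is a hedged sketch that assumes, purely for illustration, that the finishing time is uniform on [0, 60]; the original text does not specify the distribution.

```python
# pdf and cdf of a Uniform(a, b) random variable: the 60-minute exam example.
def uniform_pdf(x, a=0.0, b=60.0):
    # Constant density 1/(b - a) inside [a, b], zero outside.
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x, a=0.0, b=60.0):
    # P(X <= x), obtained by integrating the pdf.
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

# P(a <= X <= b) = F(b) - F(a): probability of finishing in the first half.
p_first_half = uniform_cdf(30) - uniform_cdf(0)
print(p_first_half)  # 0.5
```

Note the pdf value 1/60 is less than 1 here, but for a narrow enough interval (say [0, 0.5]) the density would exceed 1 while the cdf still stays in [0, 1].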
1.2.6 Quantiles
The inverse of the cdf F, which we will refer to as F−1, exists since F is a
monotonically increasing function. If F is the cdf of X, then F−1(α) is
the value xα such that P(X ≤ xα) = α. Half of the probability mass is on
the left and half is on the right of F−1(0.5), making that value the median
of the distribution. The lower and upper quartiles are represented by the
values F−1(0.25) and F−1(0.75).
The tail area probabilities can also be calculated using the inverse cdf. For
example, if Φ is the cdf of the Gaussian distribution N(0, 1), then points
to the left of Φ−1(α/2) contain α/2 probability mass, as illustrated in
Figure 2.3(b). By symmetry, points to the right of Φ−1(1 − α/2) also contain
α/2 of the mass. Hence the central interval (Φ−1(α/2), Φ−1(1 − α/2))
contains 1 − α of the mass. If we set α = 0.05, the central 95% interval is
covered by the range (Φ−1(0.025), Φ−1(0.975)) = (−1.96, 1.96).
If the distribution is N(µ, σ²), then the 95% interval becomes
(µ − 1.96σ, µ + 1.96σ). This is sometimes approximated by writing µ ± 2σ.
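The quantile computations above can be reproduced with the standard library's `statistics.NormalDist` (available in Python 3.8+), whose `inv_cdf` method is exactly the F−1 described here.

```python
from statistics import NormalDist

# Central 95% interval of the standard normal N(0, 1):
std_normal = NormalDist(mu=0.0, sigma=1.0)
lo = std_normal.inv_cdf(0.025)
hi = std_normal.inv_cdf(0.975)
print(round(lo, 2), round(hi, 2))  # -1.96 1.96

# For N(mu, sigma^2), the interval shifts and scales to mu +/- 1.96*sigma:
d = NormalDist(mu=10.0, sigma=2.0)
print(round(d.inv_cdf(0.975), 2))  # 13.92, i.e. 10 + 1.96 * 2
```

The median is `inv_cdf(0.5)`, which for any normal distribution equals the mean.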
1.2.7 Mean and variance
A distribution's mean, or expected value, is its most well-known
characteristic and is denoted by the symbol µ. For discrete rvs it is
defined as
E[X] = Σ_{x∈X} x p(x)
and for continuous rvs as
E[X] = ∫_X x p(x) dx
The mean is not finite if this integral does not converge.
The variance is a measure of the "spread" of a distribution, denoted by σ².
This is defined as follows:
var[X] = E[(X − µ)²] = E[X²] − µ²
This gives us the beneficial outcome
E[X²] = µ² + σ²
The standard deviation is defined as
std[X] = √var[X]
It has the same units as X itself, which makes it helpful.
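The discrete definitions above can be sketched in a few lines; the pmf here is an arbitrary example chosen for illustration.

```python
# Mean and variance of a discrete rv from its pmf, using
# E[X] = sum_x x p(x) and var[X] = E[X^2] - mu^2.
pmf = {1: 0.25, 2: 0.25, 3: 0.2, 4: 0.15, 5: 0.15}  # example pmf

mu = sum(x * p for x, p in pmf.items())
var = sum(x * x * p for x, p in pmf.items()) - mu ** 2
std = var ** 0.5
print(mu, var)  # 2.7 and 1.91
```

The identity E[X²] = µ² + σ² can be checked by recomputing E[X²] directly.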
1.3 SOME COMMON DISCRETE DISTRIBUTIONS
There are numerous discrete probability distributions available for use in
different scenarios.
1.3.1. Bernoulli Distribution
When we do an experiment just once, this distribution is produced. There
are only two possible outcomes: success or failure. These kinds of trials
are known as Bernoulli trials, and they serve as the foundation for many of
the distributions detailed below. Let p represent the probability of success,
and 1 - p represent the probability of failure.
The pmf is given as
Ber(x|p) = p^x (1 − p)^(1−x), for x ∈ {0, 1}
One example of this would be a single coin flip: p is the probability of
getting a head, and 1 − p is the probability of getting a tail. Please take
note that how we define success and failure depends on the situation and
is subjective.
1.3.2. Binomial Distribution
This distribution is generated for random variables with just two possible
outcomes per trial. Let p represent the probability that an event will
succeed, which implies that 1 − p represents the probability that the event
will fail. We obtain the binomial distribution by repeating the experiment
and charting the probability of each possible number of successes.
The most typical illustration of the Binomial distribution is the calculation
of the probability of receiving a specific number of heads after tossing a
coin n times. Other real-world examples are a company's number of
productive sales calls or the efficacy of a medicine in treating a sickness.
The pmf is given as
Bin(x|n, p) = C(n, x) p^x (1 − p)^(n−x)
where
n is the number of trials,
p is the probability of success,
x is the number of successes.
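The binomial pmf above is easy to evaluate with the standard library's `math.comb` for the C(n, x) term.

```python
from math import comb

# Bin(x | n, p) = C(n, x) p^x (1 - p)^(n - x)
def binom_pmf(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Probability of exactly 5 heads in 10 fair coin tosses:
print(binom_pmf(5, 10, 0.5))  # 0.24609375

# Sanity check: the pmf sums to 1 over all possible x.
total = sum(binom_pmf(x, 10, 0.5) for x in range(11))
print(total)
```

With n = 1 this reduces to the Bernoulli pmf of the previous subsection.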
1.3.3. Hypergeometric Distribution
Think about the scenario where you draw a red marble from a box of
marbles of various colours. The event of drawing a red marble is a
success, whereas the event of not drawing one is a failure. Because each
marble drawn is not put back in the box, the probability of drawing a red
marble changes on each subsequent trial. The hypergeometric distribution
models the probability of a given number of successes over n trials, where
each trial is carried out without replacement. This contrasts with the
binomial distribution, where the probability of success stays the same
over the course of the trials.
The pmf is given as
P(X = x) = C(k, x) · C(N − k, n − x) / C(N, n)
where
k is the number of possible successes
x is the desired number of successes
N is the size of the population
n is the number of trials.
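A sketch of the hypergeometric pmf, again via `math.comb`. The marble counts (N = 20, k = 5 red, n = 4 draws) are illustrative numbers, not from the text.

```python
from math import comb

# P(X = x) = C(k, x) C(N - k, n - x) / C(N, n):
# x successes in n draws without replacement from a population of N
# items, of which k are successes.
def hypergeom_pmf(x, N, k, n):
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)

# Probability of drawing exactly 2 red marbles in 4 draws
# from a box of 20 marbles containing 5 red ones:
p_two_red = hypergeom_pmf(2, N=20, k=5, n=4)
print(p_two_red)  # 1050/4845 ≈ 0.2167

total = sum(hypergeom_pmf(x, 20, 5, 4) for x in range(5))
print(total)
```

As N grows with k/N held fixed, these probabilities approach the binomial pmf, since removing one marble barely changes the proportions.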
1.3.4. Poisson Distribution
The events that take place over a predetermined period of time or space
are described by this distribution. This could be illustrated with an
example. Think about the number of calls a customer service centre
receives per hour. The average number of calls made per hour can be
estimated, but the precise number and time of calls cannot be known.
Every instance of an event occurs independently of all other instances.
The pmf is given as
P(X = x) = λ^x e^(−λ) / x!
where
λ is the average number of times the event has occurred in a certain period
of time
x is the desired outcome
e is Euler's number
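The Poisson pmf can be sketched directly from the formula; the rate λ = 4 calls per hour is an illustrative number for the call-centre example.

```python
from math import exp, factorial

# P(X = x) = lambda^x e^(-lambda) / x!
def poisson_pmf(x, lam):
    return lam ** x * exp(-lam) / factorial(x)

# With an average of 4 calls per hour, the chance of a silent hour:
print(poisson_pmf(0, 4.0))  # e^-4 ≈ 0.0183

# The pmf sums to 1 (truncated at a large x, since support is infinite):
total = sum(poisson_pmf(x, 4.0) for x in range(100))
print(total)
```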
1.3.5. Multinomial Distribution
There are just two possible outcomes in the above distributions: success
and failure. The multinomial distribution, by contrast, describes random
variables with numerous possible outcomes. Because each potential
result is considered a distinct category, this is also frequently referred to as
a categorical distribution. Think about the case where you play a game n
times. We can calculate the probability that player 1 will win x1 times,
player 2 will win x2 times, and player k will win xk times using the
multinomial distribution.
The pmf is given as
P(x1, …, xk) = [n! / (x1! ⋯ xk!)] · p1^(x1) ⋯ pk^(xk)
where
n is the number of trials
p1, …, pk denote the probabilities of the outcomes x1, …, xk respectively.
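A sketch of the multinomial pmf for the game example; the win probabilities (0.5, 0.3, 0.2) and the six rounds are illustrative numbers of our choosing.

```python
from math import factorial

# P(x1, ..., xk) = n!/(x1! ... xk!) * p1^x1 * ... * pk^xk
def multinomial_pmf(xs, ps):
    n = sum(xs)
    coef = factorial(n)
    for x in xs:
        coef //= factorial(x)
    prob = 1.0
    for x, p in zip(xs, ps):
        prob *= p ** x
    return coef * prob

# Three players with win probabilities 0.5, 0.3, 0.2 play 6 rounds;
# probability that they win 3, 2 and 1 rounds respectively:
print(multinomial_pmf([3, 2, 1], [0.5, 0.3, 0.2]))  # 0.135
```

With k = 2 categories this reduces to the binomial pmf.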
1.4 SOME COMMON CONTINUOUS DISTRIBUTIONS
We present a few common univariate (one-dimensional) continuous
probability distributions in this section.
1.4.1 Gaussian (normal) distribution
The normal distribution, sometimes referred to as the Gaussian
distribution, is a probability distribution that is symmetric about the mean.
It shows that data close to the mean occur more frequently than data far
from the mean.
The normal distribution appears as a "bell curve" on a graph.
The normal distribution is defined by a number of important
characteristics and attributes.
The first thing to note is that the data's mean, median, and mode (the most
common observation) are all identical to one another. Furthermore, each
of these values represents the distribution's peak, or highest point. The
distribution then deviates symmetrically from the mean, with the standard
deviation serving as a measure of its width.
The normal distribution uses the formula below. Keep in mind that only
the mean (μ) and standard deviation (σ) are required:
f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))
where:
x = value of the variable or data being examined and f(x) the probability
function
μ = the mean
σ = the standard deviation
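The formula above translates directly into code; this is a minimal sketch of the normal pdf.

```python
from math import exp, pi, sqrt

# f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi))
def normal_pdf(x, mu, sigma):
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# The peak of the bell curve sits at the mean:
print(normal_pdf(0.0, 0.0, 1.0))  # 1/sqrt(2*pi) ≈ 0.3989

# And the curve is symmetric about the mean:
print(normal_pdf(-1.0, 0.0, 1.0) == normal_pdf(1.0, 0.0, 1.0))  # True
```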
1.4.2 Degenerate pdf
A degenerate random variable is a constant with probability 1, and its
distribution is known as a degenerate distribution (also known as a
constant distribution). In other words, there is just one possible value for
the random variable X.
In the limit σ² → 0, the Gaussian becomes an infinitely tall and infinitely
thin "spike" centred at μ:
lim_{σ²→0} N(x|μ, σ²) = δ(x − μ)
where δ is called a Dirac delta function, and is defined as
δ(x) = ∞ if x = 0, and δ(x) = 0 if x ≠ 0, with ∫ δ(x) dx = 1
The sifting property of delta functions is a helpful characteristic, since it
allows one term to be selected from a sum or integral:
∫ f(x) δ(x − μ) dx = f(μ)
since x − μ = 0 is the only case where the integrand is not zero.
The log probability of the Gaussian distribution decays only quadratically
with distance from the centre, which makes it susceptible to outliers. The
Student t distribution is a more robust distribution. Its pdf is
T(x|μ, σ², ν) ∝ [1 + (1/ν)((x − μ)/σ)²]^(−(ν+1)/2)
where
μ is the mean
σ2> 0 is the scale parameter
ν > 0 is called the degrees of freedom.
Some properties of the distribution: mean = μ, mode = μ,
var = νσ²/(ν − 2) (the mean is defined for ν > 1, the variance for ν > 2).
1.4.3 The Laplace distribution
The Laplace distribution, commonly referred to as the double-sided
exponential distribution, is another distribution with heavy tails. Its pdf is
Lap(x|μ, b) = (1/(2b)) exp(−|x − μ|/b)
where
μ is a location parameter
b > 0 is a scale parameter.
Here are some properties of this distribution:
mean = μ
mode = μ
var = 2b2
1.4.4 The gamma distribution
The gamma distribution is a flexible distribution for positive real-valued
rvs, with x > 0. It is defined in terms of two parameters, the shape a > 0
and the rate b > 0:
Ga(x|a, b) = (b^a / Γ(a)) · x^(a−1) e^(−bx)
where Γ(a) is the gamma function:
Γ(a) = ∫₀^∞ u^(a−1) e^(−u) du
There are several distributions which are just special cases of the gamma,
which we discuss below:
Exponential distribution
This is defined by
Expon(x|λ) = λ e^(−λx) = Ga(x|1, λ)
where λ is the rate parameter.
This distribution describes the times between events in a Poisson process,
i.e. a process in which events occur continuously and independently at a
constant average rate λ.
Erlang distribution
This is the same as the gamma distribution where a is an integer. It is
common to fix a = 2, yielding the one-parameter Erlang distribution:
Erlang(x|λ) = Ga(x|2, λ)
where λ is the rate parameter.
Chi-squared distribution
This is defined by
χ²(x|ν) = Ga(x|ν/2, 1/2)
This is the distribution of the sum of squared Gaussian random variables.
More precisely, if Zi ∼ N(0, 1) and S = Σ_{i=1}^{ν} Zi², then S ∼ χ²(ν).
Some properties of the gamma distribution: mean = a/b,
mode = (a − 1)/b, var = a/b².
1.4.5 The beta distribution
The beta distribution, which has support on [0, 1], is defined as follows:
Beta(x|a, b) = (1/B(a, b)) x^(a−1) (1 − x)^(b−1)
Here B(a, b) is the beta function:
B(a, b) = Γ(a)Γ(b) / Γ(a + b)
Some properties of the distribution: mean = a/(a + b),
mode = (a − 1)/(a + b − 2), var = ab/((a + b)²(a + b + 1)).
1.5 JOINT PROBABILITY DISTRIBUTIONS
We have mostly focused on modelling univariate probability distributions
up to this point. The more difficult task of creating joint probability
distributions on numerous related random variables is introduced in this
section, which will serve as the book's main focus. A joint probability
distribution, which describes the (stochastic) correlations between the
variables, takes the form p(x1, ..., xD) for a set of D > 1 variables. If each
variable is discrete, the joint distribution can be represented as a large,
multi-dimensional array with one variable per dimension. However, the
minimum number of parameters required to define such a model is
O(K^D), where K is the number of states for each variable.
1.5.1 Covariance and correlation
The covariance between two rvs, X and Y, gauges how closely (linearly)
X and Y are related:
cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]
Image source:
https://en.wikipedia.org/wiki/File:Correlation_examples.png
The image above shows a number of sets of (x, y) points, together with the
correlation coefficient of x and y for each set. Note that the correlation
(top row) indicates the noisiness and direction of a linear relationship, but
not its slope or many other aspects of nonlinear relationships (bottom row).
If x is a d-dimensional random vector, then its covariance matrix is the
following symmetric, positive definite matrix:
cov[x] = E[(x − E[x])(x − E[x])ᵀ]
Covariances can range from −∞ to +∞. Working with a normalised
measure with a finite upper bound is sometimes more convenient. The
(Pearson) correlation coefficient between X and Y is
corr[X, Y] = cov[X, Y] / √(var[X] var[Y])
A correlation matrix has the following structure: its (i, j)-th entry is
corr[Xi, Xj], so its diagonal entries are all 1.
For example, if X = (X1, X2, X3)ᵀ and Y = (Y1, Y2)ᵀ are random vectors,
then RXY is a 3×2 matrix whose (i, j)-th entry is E[Xi Yj].
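The covariance and correlation definitions above can be sketched from first principles; the two small data sets are illustrative.

```python
# Sample covariance and Pearson correlation, straight from the definitions:
# cov[X, Y] = E[(X - E[X])(Y - E[Y])], corr = cov / sqrt(var_x * var_y).
def mean(v):
    return sum(v) / len(v)

def cov(x, y):
    mx, my = mean(x), mean(y)
    return mean([(a - mx) * (b - my) for a, b in zip(x, y)])

def corr(x, y):
    return cov(x, y) / (cov(x, x) * cov(y, y)) ** 0.5

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]          # y is perfectly linear in x
print(corr(x, y))                  # 1.0
print(corr(x, [-v for v in y]))    # -1.0
```

As the figure caption above notes, the correlation captures only the noisiness and direction of a linear relationship: here it is ±1 regardless of the slope.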
1.5.2 The multivariate Gaussian
The joint probability density function that is most frequently used for
continuous variables is the multivariate Gaussian or multivariate normal
(MVN). The pdf of the MVN in D dimensions is defined by the following:
N(x|µ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp[−½ (x − µ)ᵀ Σ⁻¹ (x − µ)]
where
µ = E[x] is the mean vector
Σ = cov[x] is the D × D covariance matrix.
Sometimes we will work in terms of the concentration or precision matrix:
simply put, Λ = Σ⁻¹ is the inverse covariance matrix. The pdf integrates
to 1 due to the normalization constant (2π)^(−D/2) |Σ|^(−1/2).
1.5.3 Multivariate Student t distribution
The multivariate Student t distribution is a more robust alternative to the
MVN, and its pdf is given by
T(x|µ, Σ, ν) ∝ [1 + (1/ν)(x − µ)ᵀ Σ⁻¹ (x − µ)]^(−(ν+D)/2)
where
Σ is called the scale matrix (since it is not exactly the covariance matrix)
V = νΣ.
Compared to a Gaussian, this has fatter tails. The smaller ν is, the fatter
the tails; as ν → ∞, the distribution tends to a Gaussian. Some properties
of the distribution: mean = µ, mode = µ, cov = (ν/(ν − 2)) Σ (the
covariance is defined for ν > 2).
1.6 TRANSFORMATIONS OF RANDOM VARIABLES
What is the distribution of y if x is some random variable, x ∼ p(), and
y = f(x)? The following sections answer this question.
1.6.1 Linear transformations
If f() is a linear function, then:
y = f(x) = Ax + b
The mean and covariance of y in this situation can be easily derived as
follows. First, we have for the mean:
E[y] = E[Ax + b] = Aµ + b
where µ = E[x]. This is called the linearity of expectation.
If f() is a scalar-valued function, f(x) = aᵀx + b, the corresponding
result is
E[y] = aᵀµ + b
For the covariance, we have
cov[y] = cov[Ax + b] = AΣAᵀ
where Σ = cov[x]. We leave the proof of this as an exercise. If f() is
scalar valued, the result becomes
var[y] = var[aᵀx + b] = aᵀΣa
1.6.2 General transformations
If X is a discrete rv, the pmf for y can be obtained by summing the
probability mass over all the x's such that f(x) = y:
p_y(y) = Σ_{x : f(x) = y} p_x(x)
1.6.3 Central limit theorem
Now consider N random variables with pdfs (not necessarily Gaussian)
p(xi), each with mean µ and variance σ². We assume the variables are
independent and identically distributed (iid).
Let
S_N = Σ_{i=1}^{N} x_i
be the sum of the rvs. This transformation of rvs is simple yet very
common. One can show that, as N increases, the distribution of this sum
approaches a Gaussian, N(S_N | Nµ, Nσ²).
Hence the distribution of the quantity
Z_N = (S_N − Nµ) / (σ√N) = (x̄ − µ) / (σ/√N)
converges to the standard normal, where
x̄ = (1/N) Σ_{i=1}^{N} x_i
is the sample mean.
This is called the central limit theorem.
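The CLT can be illustrated by simulation. This sketch uses Unif(0, 1) draws (µ = 0.5, σ² = 1/12), an assumption made purely for illustration, and checks that the sample means concentrate around µ with spread σ/√N.

```python
import random

random.seed(0)
N, trials = 1000, 2000

# For each trial, compute the mean of N iid Unif(0, 1) draws.
sample_means = [sum(random.random() for _ in range(N)) / N
                for _ in range(trials)]

grand_mean = sum(sample_means) / trials
spread = (sum((m - grand_mean) ** 2 for m in sample_means) / trials) ** 0.5

print(grand_mean)  # close to mu = 0.5
print(spread)      # close to sigma/sqrt(N) = sqrt(1/12)/sqrt(1000) ≈ 0.0091
```

A histogram of `sample_means` would look bell-shaped even though each individual draw is uniform, which is exactly the content of the theorem.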
1.7 MONTE CARLO APPROXIMATION
In general, it might be challenging to compute the distribution of a
function of a rv using the change of variables formula. Here is a simple yet
effective alternative. We first draw S samples from the distribution,
denoted x1, ..., xS. We can then approximate the distribution of f(X) by
using the empirical distribution of {f(xs)}.
This is known as a Monte Carlo approximation, after the European city
famous for its plush gambling casinos. Although they were initially
created in the field of statistical physics, specifically during the
development of the atomic bomb, Monte Carlo techniques are now widely
used in both statistics and machine learning.
Any function of a random variable has an expected value, and we can use
Monte Carlo to approximate it: we simply draw samples, then compute the
arithmetic mean of the function applied to the samples. This can be
written as
E[f(X)] ≈ (1/S) Σ_{s=1}^{S} f(x_s)
where x_s ∼ p(X).
This process, known as Monte Carlo integration, has the benefit of
evaluating the function only at locations where there is non-zero
probability, as opposed to numerical integration, which is predicated on
evaluating the function at a fixed grid of points.
By varying the function f(), many quantities of interest can be
approximated, such as the mean E[X], the variance var[X], the
probability P(X ≤ c), and the median of X.
1.7.1 Example: change of variables, the MC way
Let y = x² and x ∼ Unif(−1, 1). By taking many samples from p(x),
squaring them, and then estimating the resulting empirical distribution, we
can approximate p(y), as shown in the image below.
Image source: Machine Learning: A Probabilistic Perspective, Kevin P.
Murphy, The MIT Press, Cambridge (2012).
(Figure: Monte Carlo integration is used to estimate π. The circle has red
crosses outside and blue points inside.)
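The change-of-variables example above can be sketched by sampling. As a check, the exact cdf of y = x² with x ∼ Unif(−1, 1) is P(Y ≤ c) = √c on [0, 1], so about half the samples should fall below 0.25.

```python
import random

random.seed(0)
S = 200_000

# Draw x ~ Unif(-1, 1) and push each sample through f(x) = x^2.
ys = [random.uniform(-1.0, 1.0) ** 2 for _ in range(S)]

# Empirical P(Y <= 0.25); the exact value is sqrt(0.25) = 0.5.
frac = sum(y <= 0.25 for y in ys) / S
print(frac)  # close to 0.5
```

The empirical density of `ys` piles up near 0, matching the spike in p(y) shown in Murphy's figure.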
1.7.2 Example: estimating π by Monte Carlo integration
MC approximation is useful for a wide range of applications, not only
statistical ones. Let's say we wish to estimate π. We know that the area of
a circle with radius r is equal to πr², but it is also equal to the following
definite integral:
I = ∫_{−r}^{r} ∫_{−r}^{r} I(x² + y² ≤ r²) dx dy
Hence π = I/r². Let us approximate this by Monte Carlo integration. Let
f(x, y) = I(x² + y² ≤ r²) be an indicator function that is 1 for points inside
the circle, and 0 outside, and let p(x) and p(y) be uniform distributions on
[−r, r], so p(x) = p(y) = 1/(2r). Then
I = (2r)(2r) ∫∫ f(x, y) p(x) p(y) dx dy ≈ (4r²/S) Σ_{s=1}^{S} f(x_s, y_s)
We find π̂ = 3.1416 with standard error 0.09.
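The estimator above is a few lines of code: sample points uniformly in the square [−r, r]², count the fraction landing inside the circle, and scale by the area of the square.

```python
import random

random.seed(0)
r, S = 1.0, 1_000_000

# f(x, y) = I(x^2 + y^2 <= r^2): count samples inside the circle.
inside = sum(
    random.uniform(-r, r) ** 2 + random.uniform(-r, r) ** 2 <= r * r
    for _ in range(S)
)

I_hat = (2 * r) * (2 * r) * inside / S   # estimate of the integral I
pi_hat = I_hat / (r * r)
print(pi_hat)  # close to 3.14159
```

With S = 10⁶ samples the standard error is far below the 0.09 quoted above (which corresponds to a much smaller sample size).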
1.7.3 Accuracy of Monte Carlo approximation
Example 1:
The probability that the actual return will be within one standard deviation
of the rate considered to be the most likely ("expected") is 68%. There is a
95% chance that it will be within two standard deviations, and a 99.7%
chance that it will be within three.
However, there is no guarantee that the outcome will be as expected, or
that actual movements will not exceed the most extreme predictions.
Example 2:
If we denote the exact mean by μ = E[f(X)], and the MC approximation
by μ̂, one can show that, with independent samples,
(μ̂ − μ) → N(0, σ²/S)
where
σ² = var[f(X)] = E[f(X)²] − E[f(X)]²
This is a consequence of the central limit theorem. Of course, σ² is
unknown in the above expression, but it can also be estimated by MC:
σ̂² = (1/S) Σ_{s=1}^{S} (f(x_s) − μ̂)²
Then we have
P(μ̂ − 1.96 σ̂/√S ≤ μ ≤ μ̂ + 1.96 σ̂/√S) ≈ 0.95
The term σ̂/√S is called the (numerical or empirical) standard error, and
is an estimate of our uncertainty about our estimate of μ.
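The standard error computation can be sketched on a toy problem of our choosing: estimate μ = E[X²] for X ∼ Unif(0, 1), whose exact value is 1/3.

```python
import random

random.seed(0)
S = 100_000

# MC samples of f(X) = X^2 with X ~ Unif(0, 1); exact mean is 1/3.
fs = [random.random() ** 2 for _ in range(S)]

mu_hat = sum(fs) / S
sigma_hat = (sum((f - mu_hat) ** 2 for f in fs) / S) ** 0.5
se = sigma_hat / S ** 0.5   # the (numerical) standard error

print(mu_hat)  # close to 1/3
print(se)      # a 95% CI for mu is mu_hat +/- 1.96 * se
```

Quadrupling S halves the standard error, which is the usual 1/√S convergence rate of Monte Carlo.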
1.8 INFORMATION THEORY
Data compression, also known as source coding, is the process of
encoding data compactly. Information theory is also concerned with how
to transmit and store data in a way that is robust to errors (a task known as
error correction or channel coding). Although at first glance this might
appear to have little to do with the concerns of probability theory, there is
actually a close connection. To see this, note that representing data
compactly requires allocating short code words to highly probable bit
strings and reserving longer code words for less probable bit strings.
Similarly, in natural language, frequent words (such as "a", "the", and
"and") tend to be much shorter than rare ones.
1.8.1 Entropy
A random variable's entropy, denoted H(X) or sometimes H(p), is a
measure of its uncertainty. In particular, for a discrete variable with K
states it is defined by
H(X) = − Σ_{k=1}^{K} p(X = k) log₂ p(X = k)
Typically, log base 2 is used, in which case the units are referred to as bits
(short for binary digits). If we use log base e, the units are referred to as
nats. For instance, for X ∈ {1, . . . , 5} with histogram distribution
p = [0.25, 0.25, 0.2, 0.15, 0.15], we find H = 2.2855. The uniform
distribution is the discrete distribution with the highest entropy. The
entropy is therefore maximum for a K-ary random variable if
p(x = k) = 1/K; in this case,