Introduction to Hidden Markov Model article offered primary understanding of the Hidden Markov Model. We also went by means of the introduction of the three principal issues of HMM (Evaluation, Learning and Decoding). In this Understanding Forward and Backward Algorithm in Hidden Markov Model article we’ll dive deep into the Evaluation Drawback. We’ll undergo the mathematical understanding & then will use Python and R to build the algorithms by ourself.

Hidden Markov Model is a Markov Chain which is especially used in problems with temporal sequence of knowledge. Markov Model explains that the subsequent step depends solely on the earlier step in a temporal sequence. In Hidden Markov Model the state of the system is hidden (invisible), nevertheless every state emits a logo at each time step. HMM works with both discrete and steady sequences of knowledge. (Right here we’ll solely see the instance of discrete knowledge)

Contents

## Primary Structure of HMM:

As we have now mentioned earlier, Hidden Markov Model (( theta )) has with following parameters :

- Set of M Hidden States (( S^M))
- A Transaction Chance Matrix (A)
- A sequence of T observations (( V^T))
- A Emission Chance Matrix (Also called Statement Probability) (B)
- An Initial Chance Distribution (( pi ))

In case you are not positive of any of above terminology, please refer my earlier article on Introduction to Hidden Markov Model:

Introduction to Hidden Markov Model

As we have now seen earlier, the Evaluation Drawback might be said as following,

[[[[

textGiven theta, V_T rightarrow textual contentEstimate p(V_T|theta)

textual contentWhere theta rightarrow s, v, a_ij,b_jk

]

## Answer:

- First we need to discover all potential sequences of the state ( S^M) the place M is the variety of Hidden States.
- Then from all these sequences of ( S^M), discover the chance of which sequence generated the seen sequence of symbols ( V^T)
- Mathematically, ( p(V_T|theta) ) may be estimated as,

[[[[

p(V^T|theta)= sum_r=1^R p(V^T|S_r^T)p(S_r^T)

textwhere S_r^T = s_1(1), s_2(2)… s_r(T)

]and R=Most Number of potential sequences of the hidden state

So, if there are M number of hidden state, then we will define R as :

[[[[

R=M^T

]

In an effort to compute the chance of the mannequin generated by the particular sequence of T visible symbols ( V^T), we should always take each conceivable sequence of hidden state, calculate the chance that they’ve produced (V^T) and then add up these chances.

### Query :

Query you may be having is find out how to proof that the above equation is legitimate? Let’s try to understand this in a unique means.

Keep in mind our instance? So here is the diagram of a selected sequence of 3 states. The transition between the Hidden Layers have been grayed out intentionally, we’ll come back to that in a second.

In case in the above instance we already know the sequence of the Hidden states (i.e sun, sun, cloud) which generated the three seen symbols pleased, sad & completely happy, then will probably be very straightforward to calculate the chance of the visible symbols/states given the hidden state. So we will write chance of ( V^T) given (S^T) as:

p(joyful, unhappy, glad | solar, sun, rain ) = p(comfortable|sun) x p(sad|sun) x p(glad|rain)

Mathematically,

[[[[

p(V^T|S_r^T)=prod_t=1^T p(v(t) | s(t))

]

Unfortunately we really do not know the precise sequence of hidden states which generated the seen symbols pleased, unhappy & comfortable.Hence we need to compute the chance of temper modifications glad, unhappy & completely happy by summing over all potential climate sequences, weighted by their chance (transition chance).

We now have the same state diagram, nevertheless now the transition chances have been given here.

We will calculate the joint chance of the sequence of visible symbol (V^T) generated by a selected sequences of hidden state (S^T) as:

p(completely satisfied,unhappy,comfortable,sun,sun,rain) = p(sun|preliminary state) x p(solar|sun) x p(cloud|sun) x p(pleased|solar) x x p(sad|solar) x p(completely happy|rain)

Mathematically,

[[[[p(V^T,S^T)=p(V^T | S^T)p(S^T)

]

Since we’re utilizing First-Order Markov model, we will say that the chance of a sequence of T hidden states is the multiplication of the chance of each transition.

[[[[

p(S^T)=prod_t=1^T p(s(t) | s(t-1))

]

Write the joint chance as following,

startalign

p(V^T,S^T) &=p(V^T | S^T)p(S^T)

&=prod_t=1^T p(v(t) | s(t)) prod_t=1^T p(s(t) | s(t-1))

endalign

]

As you’ll be able to see, we’re slowly getting near our unique equation. Just one extra step is left now. The above equation is for a selected sequence of hidden state that we thought may need generated the seen sequence of symbols/states. We will now compute the in all probability of all of the totally different potential sequences of hidden states by summing over all the joint chances of (V^T) and (S^T).

In our example, we have now a sequence of three seen symbols/states, we also have 2 totally different states to symbolize. So there could be (2^3 = 8) potential sequences. We will write them as:

p(joyful,unhappy,completely satisfied|model) = p(completely satisfied,unhappy,completely satisfied,solar,sun,solar) + p(pleased,unhappy,pleased,sun,solar,rain) + p(glad,sad,pleased,sun,rain,rain)+ . . .

We will write the generalized equation as:

[[[[startalign

p(V^T|theta) &=sum_textAll Seq of S p(V^T, S^T)

&=sum_textual contentAll Seq of S p(V^T | S^T)p(S^T)

&=sum_r=1^R prod_t=1^T p(v(t) | s(t)) prod_t=1^T p(s(t) | s(t-1))

&=sum_r=1^R prod_t=1^T p(v(t) | s(t)) p(s(t) | s(t-1))

endalign

]

Once more, R=Maximum Variety of potential sequences of the hidden state.

The above answer is straightforward, nevertheless the computation complexity is ( O(N^T.T) ), which could be very high for sensible situations. So even if we’ve derived the answer to the Analysis Drawback, we have to find an alternate which ought to be straightforward to compute.

We’ll a recursive dynamic programming strategy to overcome the exponential computation we had with the answer above. There are two such algorithms, Forward Algorithm and Backward Algorithm.

In Forward Algorithm (as the identify advised), we’ll use the computed chance on current time step to derive the chance of the subsequent time step. Hence the it is computationally extra environment friendly (O(N^2.T)).

We need to discover the answer of the next query to make the algorithm recursive:

Given a a sequence of Visible state (V^T) , what would be the chance that the Hidden Markov Model will probably be in a specific hidden state s at a specific time step t.

If we write the above query mathematically it could be extra simpler to know.

[[[[alpha_j(t) = p(v(1)…v(t),s(t)= j)

]

First, we’ll derive the equation utilizing just chance & then will remedy once more utilizing trellis diagram. So don’t fear in case you are not capable of absolutely understand the subsequent section, simply read alongside and come again after going by means of the trellis diagram.

## Answer utilizing Chances:

### When t = 1 :

Rewrite the above equation when t=1

[[[[startalign

alpha_j(1) &= p(v_k(1),s(1)= j)

&= p(v_k(1)|s(1)=j)p(s(1)=j)

&= pi_j p(v_k(1)|s(1)=j)

&= pi_j b_jk

textual contentwhere pi &= text initial distribution,

b_jkv(1) &= textual content Emission Chance at t = 1

endalign

]

### When t = 2 :

So we’ve the solution when t=1. Now lets rewrite the same when t=2. Our objective right here can be to provide you with an equation the place (alpha_j(1)) is part of it, in order that we will use recursion.

[[[[beginalign

alpha_j(2) &= p Huge( v_k(1),v_k(2),s(2)= j Huge)

&= colourBluesum_i=1^M p Massive( v_k(1),v_k(2),shadeBlues(1)= i, s(2)= j Huge)

&= sum_i=1^M p Huge( v_k(2) | s(2)= j, v_k(1),s(1)= i Huge) p Massive( v_k(1),s(2),s(1)= i Massive)

&= sum_i=1^M p Huge( v_k(2) | s(2)= j, shadePinkv_k(1), s(1)= i Huge) p Massive( s(2) | shadePinkv_k(1),s(1)= i Massive) p Massive(v_k(1),s(1)= i Huge)

&= sum_i=1^M p Massive( v_k(2) | s(2)= j Huge) p Huge(s(2) | s(1)= i Huge) p Huge(v_k(1),s(1)= i Huge)

&= colourDarkRedp Massive( v_k(2) | s(2)= j Huge) sum_i=1^M p Huge( s(2) | s(1)= i Huge) shadeBluep Massive( v_k(1),s(1)= i Massive)

&= shadeDarkRedb_jk v(2) sum_i=1^M a_i2 shadeBlue alpha_i(1)

textwhere a_i2 &= text Transition Chance

b_jk v(2) &= text Emission Chance at t=2

alpha_i(1) &= textual content Forward chance at t=1

endalign

]

Let me attempt to clarify some a part of it. We’ve simply used the Joint Chance Rule and have damaged the equation in totally different elements.

In Line 2 we have now added (s(1)=i) for which we have now added the summation since there are M totally different hidden states. The pink highlighted section in Line four may be removed. Finally line 6 has three elements that are highlighted in colours. Since ( p ( v_k(2) | s(2)= j ) ) doesn’t depend upon i, we will transfer it outdoors of the summation. The ultimate equation consists of ( alpha_i(1) ) which we have now already calculated when t=1.

### Generalized Equation :

Let’s generalize the equation now for any time step t+1:

[[[[beginalign

alpha_j(t+1) &= p Huge( v_k(1) … v_k(t+1),s(t+1)= j Huge)

&= colourBluesum_i=1^M pBig(v_k(1) … v_k(t+1),colourBlues(t)= i, s(t+1)= j Massive)

&= sum_i=1^M pBig(v_k(t+1) | s(t+1)= j, v_k(1) … v_k(t),s(t)= iBig)

& pBig(v_k(1)…v_k(t),s(t+1),s(t)= i Massive)

&= sum_i=1^M pBig(v_k(t+1) | s(t+1)= j, shadePinkv_k(1)…v_k(t), s(t)= iHuge)

& pBig(s(t+1) | shadePurplev_k(1)…v_k(t),s(t)= iBig) pBig(v_k(t),s(t)= iBig)

&= sum_i=1^M pBig(v_k(t+1) | s(t+1)= jBig) pBig(s(t+1) | s(t)= iBig) pBig(v_k(t),s(t)= iBig)

&= colourDarkRed s(t+1)= jBig) sum_i=1^M pBig(s(t+1) | s(t)= iBig) colourBluepBig(v_k(t),s(t)= iBig)

&= shadeDarkRedb_jk v(t+1) sum_i=1^M a_ij shadeBluealpha_i(t)

finishalign

]

The above equation follows the same derivation as we did for t=2. This equation will probably be really easy to implement using any programming language. We gained’t use recursion perform, simply use the pre-calculated values in a loop (Extra on this later).

## Intuition utilizing Trellis:

We’ll use Trellis Diagram to get the instinct behind the Forward Algorithm. I case you have not understood the derivation utilizing joint chance rule, this section will definitely make it easier to to know the equation.

I’m repeating the same query again right here:

Given a a sequence of Visible state (V^T) , what will be the chance that the Hidden Markov Model shall be in a specific hidden state s at a specific time step t.

### Step by Step Derivation:

Please refer the under Trellis diagram and assume the chance that the system/machine is at hidden state (s_1) at time ( (t-1) ) is ( alpha_1(t-1) ). The chance of transition to hidden state ( s_2 ) at time step t might be now written as,

alpha_1(t-1) a_12

]

Likewise, if we sum all the possibilities where the machine transition to state ( s_2 ) at time t from any state at time ((t-1)), it provides the whole chance that there will a transition from any hidden state at ((t-1)) to ( s_2 ) at time step t.

Mathematically,

[[[[

sum_i=1^M alpha_i(t-1) a_i2

]

Finally, we will say the chance that the machine is at hidden state ( s_2 ) at time t, after emitting first t number of seen symbol from sequence (V^T) is given however the next, (We simply multiply the emission chance to the above equation)

[[[[b_2k sum_i=1^M alpha_i(t-1) a_i2

]

Now we will prolong this to a recursive algorithm to seek out the chance that sequence (V^T) was generated by HMM (theta). Right here is the generalized version of the equation.

alpha_j(t)= begininstances

pi_jb_jk & textual content when t = 1

b_jk sum_i=1^M alpha_i(t-1) a_ij & textual content when t text higher than 1

endinstances

]

Right here (alpha_j(t)) is the chance that the machine shall be at hidden state (s_j) at time step t, after emitting first t visible sequence of symbols.

## Implementation of Forward Algorithm:

Now lets work on the implementation. We’ll use both Python and R for this.

### Knowledge:

In our instance we’ve got 2 Hidden States (A,B) and 3 Seen States (0,1,2) ( in R file, it is going to be (1,2,3) ). Assume that we already know our a and b.

[[[[A=

beginbmatrix

0.54 & 0.46

0.49 & zero.51

finishbmatrix

] [[[[

B= startbmatrix

zero.16 & zero.26 & zero.58

0.25 & zero.28 & zero.47

finishbmatrix

]

The data_python.csv & data_r.csv has two columns named, Hidden and Visible. The one distinction between the Python and R is just the beginning index of the Seen column. Python file has 0,1,2 where as R has 1,2,three.

### Python:

First Load the info.

import pandas as pd

import numpy as np

knowledge = pd.read_csv(‘data_python.csv’)

V = knowledge[‘Visible’].values

import pandas as pd

import numpy as np

knowledge = pd.read_csv(‘data_python.csv’)

V = knowledge[‘Visible’].values

Then set the values for transition chance, emission chances and preliminary distribution.

# Transition Chances

a = np.array(((zero.54, zero.46), (zero.49, 0.51)))

# Emission Chances

b = np.array(((0.16, 0.26, zero.58), (zero.25, 0.28, zero.47)))

# Equal Chances for the initial distribution

initial_distribution = np.array((0.5, 0.5))

# Transition Chances a = np.array(((zero.54, 0.46), (0.49, zero.51))) # Emission Chances b = np.array(((zero.16, 0.26, zero.58), (zero.25, zero.28, 0.47))) # Equal Chances for the initial distribution initial_distribution = np.array((0.5, 0.5)) |

In python the index begins from 0, therefore our t will start from zero to T-1.

Subsequent, we may have the ahead perform. Here we’ll retailer and return all of the (alpha_0(zero), alpha_1(zero) … alpha_0(T-1),alpha_1(T-1))

def forward(V, a, b, initial_distribution):

alpha = np.zeros((V.form[0], a.form[0]))

alpha[0, :] = initial_distribution * b[:V[:V[:V[:V[0]]for t in range(1, V.shape[0]):

for j in vary(a.shape[0]):

# Matrix Computation Steps

# ((1×2) . (1×2)) * (1)

# (1) * (1)

alpha[t, j] = alpha[t – 1].dot(a[:, j]) * b[j, V

alpha = forward(V, a, b, initial_distribution)

print(alpha)

def forward(V, a, b, initial_distribution): alpha = np.zeros((V.form[0], a.form[0])) alpha[0, :] = initial_distribution * b[:V[:V[:V[:V[0]] for t in vary(1, V.shape[0]): for j in vary(a.shape[0]): # Matrix Computation Steps # ((1×2) . (1×2)) * (1) # (1) * (1) alpha[t, j] = alpha[t – 1].dot(a[:, j]) * b[j, V |

First we’ll create the alpha matrix with 2 Columns and T Rows.

As per our equation multiply initial_distribution with the ( b_jkv(0) ) to calculate (alpha_0(0) , alpha_1(0) ). This will probably be a simple vector multiplication since both initial_distribution and ( b_kv(zero) ) are of similar measurement.

- We’ll loop by way of the time steps now, ranging from 1 ( keep in mind python index begins from 0 ).
- One other loop for each hidden step j.
- Use the same formulation for calculating the ( alpha ) values.
- Return all the alpha values.

### Output:

[[8.00000000e-002 1.25000000e-001] [[8.00000000e-002 1.25000000e-001] [2.71570000e-002 2.81540000e-002] [1.65069392e-002 1.26198572e-002] [8.75653677e-003 6.59378003e-003]

…

…

[8.25847348e-221 6.30684489e-221]
[4.37895921e-221 3.29723269e-221]
[1.03487332e-221 1.03485477e-221]
[6.18228050e-222 4.71794300e-222]]

[2.71570000e-002 2.81540000e-002]

[1.65069392e-002 1.26198572e-002]

[8.75653677e-003 6.59378003e-003]

…

…

[8.25847348e-221 6.30684489e-221]

[4.37895921e-221 3.29723269e-221]

[1.03487332e-221 1.03485477e-221]

[6.18228050e-222 4.71794300e-222]]

### R Code:

Right here is identical Forward Algorithm carried out in R. In the event you notice, we now have removed the 2nd for loop in R code. You are able to do the same in python too.

knowledge = learn.csv(“data_r.csv”)

a = matrix(c(zero.54, zero.49, zero.46, 0.51),nrow = 2,ncol = 2)

b = matrix(c(0.16, 0.25, 0.26, 0.28, 0.58, zero.47),nrow = 2,ncol = three)

initial_distribution = c(half, half)

forward = perform(v, a, b, initial_distribution)

T = size(v)

m = nrow(a)

alpha = matrix(0, T, m)

alpha[1, ] = initial_distribution*b[, v[1]]for(t in 2:T)

tmp = alpha[t-1, ] %*% a

alpha[t, ] = tmp * b[, v

return(alpha)

ahead(knowledge$Seen,a,b,initial_distribution)

1 2 three four 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | knowledge = read.csv(“data_r.csv”) a = matrix(c(0.54, 0.49, zero.46, zero.51),nrow = 2,ncol = 2) b = matrix(c(zero.16, 0.25, 0.26, 0.28, zero.58, zero.47),nrow = 2,ncol = 3) initial_distribution = c(1/2, 1/2) forward = perform(v, a, b, initial_distribution) T = size(v) m = nrow(a) alpha = matrix(0, T, m) alpha[1, ] = initial_distribution*b[, v[1]] for(t in 2:T) tmp = alpha[t-1, ] %*% a alpha[t, ] = tmp * b[, v |

Backward Algorithm is the time-reversed model of the Forward Algorithm. In Backward Algorithm we have to discover the chance that the machine shall be in hidden state ( s_i ) at time step t and will generate the remaining part of the sequence of the seen image (V^T).

## Derivation of Backward Algorithm:

Please discover the Derivation of the Backward Algorithm utilizing Chance Concept. The concepts are similar because the ahead algorithm.

[[[[beginalign

beta_i(t) &= p Huge( v_k(t+1) …. v_k(T) | s(t) = i Massive)

&= sum_j=zero^M pBig( v_k(t+1) …. v_k(T), s(t+1) = j | s(t) = i Massive)

&= sum_j=zero^M pBig( v_k(t+2) …. v_k(T) | v_k(t+1) , s(t+1) = j , s(t) = i Huge)

& p Huge( v_k(t+1) , s(t+1) = j | s(t) = i Huge)

&= sum_j=0^M pBig( v_k(t+2) …. v_k(T) | v_k(t+1) , s(t+1) = j , s(t) = i Massive)

& p Massive( v_k(t+1) | s(t+1) = j , s(t) = i Massive) p Massive( s(t+1) = j | s(t) = i Massive)

&= sum_j=0^M pBig( v_k(t+2) …. v_k(T) | s(t+1) = j Massive) p Massive( v_k(t+1) | s(t+1) = j Massive) & p Huge( s(t+1) = j | s(t) = i Huge)

&= sum_j=zero^M beta_j(t+1) b_jkv(t+1) a_ij

textual contentwhere a_i2 &= textual content Transition Chance

b_jk v(t+1) &= text Emission Chance at t=t+1

beta_i(t+1) &= text Backward chance at t=t+1

endalign

]

## Instinct using Trellis:

Here is the Trellis diagram of the Backward Algorithm. Mathematically, the algorithm could be written in following method:

[[[[beta_i(t)= startinstances

1 & text when t = T

sum_j=zero^M a_ij b_jkv(t+1)beta_j(t+1) & textual content when t text less than T

finishinstances

]

## Implementation of Backward Algorithm:

We’ll use the same knowledge file and parameters as outlined for Forward Algorithm.

### Python Code :

import pandas as pd

import numpy as np

knowledge = pd.read_csv(‘data_python.csv’)

V = knowledge[‘Visible’].values

# Transition Chances

a = np.array(((zero.54, 0.46), (zero.49, 0.51)))

# Emission Chances

b = np.array(((0.16, 0.26, 0.58), (0.25, 0.28, 0.47)))

def backward(V, a, b):

beta = np.zeros((V.form[0], a.form[0]))

# setting beta(T) = 1

beta[Vshape[Vshape[Vshape[Vshape[0] – 1]= np.ones((a.form[0]))

# Loop in backward approach from T-1 to

# Because of python indexing the actual loop shall be T-2 to 0

for t in vary(V.shape[0] – 2, -1, -1):

for j in vary(a.form[0]):

beta[t, j] = (beta[t + 1] * b[:V[:V[:V[:V[t + 1]]).dot(a[j, :])

return beta

beta = backward(V, a, b)

print(beta)

1

2

3

4

5

6

7

eight

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

import pandas as pd

import numpy as np

knowledge = pd.read_csv(‘data_python.csv’)

V = knowledge[‘Visible’].values

# Transition Chances

a = np.array(((0.54, zero.46), (0.49, 0.51)))

# Emission Chances

b = np.array(((0.16, zero.26, 0.58), (0.25, 0.28, 0.47)))

def backward(V, a, b):

beta = np.zeros((V.shape[0], a.form[0]))

# setting beta(T) = 1

beta[Vshape[Vshape[Vshape[Vshape[0] – 1]= np.ones((a.form[0]))

# Loop in backward method from T-1 to

# As a consequence of python indexing the actual loop can be T-2 to 0

for t in range(V.form[0] – 2, -1, -1):

for j in vary(a.form[0]):

beta[t, j] = (beta[t + 1] * b[:V[:V[:V[:V[t + 1]]).dot(a[j, :])

return beta

beta = backward(V, a, b)

print(beta)

### R Code :

knowledge = learn.csv(“data_r.csv”)

a = matrix(c(0.54, 0.49, 0.46, 0.51),nrow = 2,ncol = 2)

b = matrix(c(0.16, 0.25, zero.26, zero.28, 0.58, zero.47),nrow = 2,ncol = three)

backward = perform(V, A, B)

T = length(V)

m = nrow(A)

beta = matrix(1, T, m)

for(t in (T-1):1)

tmp = as.matrix(beta[t+1, ] * B[, V[t+1]])

beta[t, ] = t(A %*% tmp)

return(beta)

backward(knowledge$Seen,a,b)

1 2 three 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | knowledge = read.csv(“data_r.csv”) a = matrix(c(0.54, 0.49, 0.46, 0.51),nrow = 2,ncol = 2) b = matrix(c(0.16, 0.25, zero.26, 0.28, 0.58, zero.47),nrow = 2,ncol = 3) backward = perform(V, A, B) T = length(V) m = nrow(A) beta = matrix(1, T, m) for(t in (T-1):1) tmp = as.matrix(beta[t+1, ] * B[, V[t+1]]) beta[t, ] = t(A %*% tmp) return(beta) backward(knowledge$Visible,a,b) |

### Output :

[[5.30694627e-221 5.32373319e-221] [1.98173335e-220 1.96008747e-220] [3.76013005e-220 3.71905927e-220] [7.13445025e-220 7.05652279e-220]

…

…

[7.51699476e-002 7.44006456e-002]
[1.41806080e-001 1.42258480e-001]
[5.29400000e-001 5.23900000e-001]
[1.00000000e+000 1.00000000e+000]]

[[5.30694627e-221 5.32373319e-221] [1.98173335e-220 1.96008747e-220] [3.76013005e-220 3.71905927e-220] [7.13445025e-220 7.05652279e-220] … … [7.51699476e-002 7.44006456e-002] [1.41806080e-001 1.42258480e-001] [5.29400000e-001 5.23900000e-001] [1.00000000e+000 1.00000000e+000]] |

In our next article we’ll use both the ahead and backward algorithm to unravel the training drawback. Here I’ve offered a really detailed overview of the Forward and Backward Algorithm. The output of the program might not make lot of sense now, nevertheless next article will present extra insight.

Here is the link to the code and knowledge file in github.

Also, listed here are the listing of all of the articles in this collection:

- Introduction to Hidden Markov Model
- Forward and Backward Algorithm in Hidden Markov Model
- Derivation and implementation of Baum Welch Algorithm for Hidden Markov Model
- Implement Viterbi Algorithm in Hidden Markov Model utilizing Python and R

Be happy to publish any question you could have.