
HMM Theory & Algorithm Reference

1. Hidden Markov Model — Formal Definition

A Hidden Markov Model (HMM) is a probabilistic generative model for sequential data. The system is assumed to follow a Markov chain of hidden (unobservable) states, each of which emits an observable symbol according to a state-dependent probability distribution.

An HMM is fully specified by the parameter tuple λ = (A, B, π), where:

  • A (State Transition Matrix): A[i][j] = P(q_{t+1} = S_j | q_t = S_i). Each row sums to 1.
  • B (Emission / Observation Matrix): B[j][k] = P(O_t = v_k | q_t = S_j). Each row sums to 1.
  • π (Initial State Distribution): π[i] = P(q_1 = S_i). Elements sum to 1.

The model assumes the Markov property: the probability of transitioning to state Sj depends only on the current state Si, not on the history of states. It also assumes output independence: the observation at time t depends only on the current state at time t.
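Concretely, the tuple λ = (A, B, π) for a small model can be written down and validated directly. The following sketch uses invented numbers for a 2-state, 3-symbol model; nothing here reflects this tool's defaults:

```python
# Toy lambda = (A, B, pi) with N = 2 hidden states and M = 3 symbols.
# All values are made up for illustration.
A  = [[0.7, 0.3],
      [0.4, 0.6]]          # A[i][j] = P(q_{t+1} = S_j | q_t = S_i)
B  = [[0.5, 0.4, 0.1],
      [0.1, 0.3, 0.6]]     # B[j][k] = P(O_t = v_k | q_t = S_j)
pi = [0.8, 0.2]            # pi[i]   = P(q_1 = S_i)

def is_stochastic(rows, tol=1e-9):
    """True iff every row is a valid probability distribution."""
    return all(min(row) >= 0.0 and abs(sum(row) - 1.0) < tol for row in rows)

# Every row of A and B, and pi itself, must sum to 1.
assert is_stochastic(A) and is_stochastic(B) and is_stochastic([pi])
```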


2. Three Fundamental Problems of HMMs

Problem 1 — Evaluation

Given λ and an observation sequence O, compute P(O|λ).

Solved by the Forward Algorithm.

Problem 2 — Decoding

Given λ and O, find the most likely hidden state sequence Q*.

Solved by the Viterbi Algorithm.

Problem 3 — Learning

Given O, find λ* = argmax_λ P(O|λ).

Solved by the Baum–Welch Algorithm (this tool).


3. Forward Algorithm

The forward variable α_t(i) represents the probability of observing the partial sequence O_1, O_2, …, O_t and being in state S_i at time t.

Initialization:

α_1(i) = π_i · B[i][O_1]

Induction (t = 2, …, T):

α_t(j) = [ Σ_{i=1}^{N} α_{t−1}(i) · A[i][j] ] · B[j][O_t]

Termination:

P(O|λ) = Σ_{i=1}^{N} α_T(i)

Complexity: O(N²T), versus O(N^T · T) for brute-force enumeration of all state sequences.
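The three steps above map directly onto a few lines of Python. This is a minimal unscaled sketch with invented toy parameters, not this tool's implementation:

```python
def forward(A, B, pi, O):
    """Forward pass over a list of symbol indices O.
    Returns (alpha table, P(O|lambda)). Unscaled, so only
    suitable for short sequences (see Section 7)."""
    N, T = len(pi), len(O)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):                        # Initialization
        alpha[0][i] = pi[i] * B[i][O[0]]
    for t in range(1, T):                     # Induction
        for j in range(N):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j]
                              for i in range(N)) * B[j][O[t]]
    return alpha, sum(alpha[T - 1])           # Termination

# Illustrative 2-state, 2-symbol model:
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.5, 0.5], [0.1, 0.9]]
pi = [0.6, 0.4]
alpha, likelihood = forward(A, B, pi, [0, 1, 0])
```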

4. Backward Algorithm

The backward variable β_t(i) represents the probability of the partial observation sequence O_{t+1}, …, O_T, given state S_i at time t.

Initialization:

β_T(i) = 1   ∀ i

Induction (t = T−1, …, 1):

β_t(i) = Σ_{j=1}^{N} A[i][j] · B[j][O_{t+1}] · β_{t+1}(j)

Both α and β are computed in the E-Step of Baum–Welch.
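The backward recursion is symmetric to the forward one. A useful sanity check: P(O|λ) can also be recovered from β alone as Σ_i π_i · B[i][O_1] · β_1(i), and must match the forward result. The sketch below uses an invented toy model:

```python
def backward(A, B, O):
    """Backward pass: beta[t][i] = P(O_{t+1..T} | q_t = S_i, lambda).
    Unscaled, illustrative sketch."""
    N, T = len(A), len(O)
    beta = [[1.0] * N for _ in range(T)]      # Initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):            # Induction, t = T-1, ..., 1
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][O[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    return beta

# Illustrative model; P(O|lambda) recovered from beta alone:
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.5, 0.5], [0.1, 0.9]]
pi = [0.6, 0.4]
O  = [0, 1, 0]
beta = backward(A, B, O)
p = sum(pi[i] * B[i][O[0]] * beta[0][i] for i in range(len(pi)))
```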


5. Baum–Welch Algorithm (EM for HMMs)

The Baum–Welch algorithm is a special case of the Expectation–Maximization (EM) framework applied to HMMs. It iteratively alternates between computing the expected sufficient statistics (E-Step) and re-estimating the model parameters (M-Step).

E-Step: Compute Responsibilities

  • Compute α_t(i) via the forward pass
  • Compute β_t(i) via the backward pass
  • γ_t(i) = P(q_t = S_i | O, λ) = α_t(i) · β_t(i) / P(O|λ)
  • ξ_t(i,j) = P(q_t = S_i, q_{t+1} = S_j | O, λ) = α_t(i) · A[i][j] · B[j][O_{t+1}] · β_{t+1}(j) / P(O|λ)

M-Step: Re-estimate Parameters

Transition: Â[i][j] = Σ_{t=1}^{T−1} ξ_t(i,j) / Σ_{t=1}^{T−1} γ_t(i)
Emission: B̂[j][k] = Σ_{t : O_t = v_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j)
Initial: π̂[i] = γ_1(i)
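Putting the E-step and M-step together, one unscaled Baum–Welch iteration on a single sequence can be sketched as follows. This is illustrative only; a production implementation would additionally apply the scaling of Section 7:

```python
def baum_welch_step(A, B, pi, O):
    """One unscaled EM iteration on a single observation sequence O
    (a list of symbol indices). Returns (A_new, B_new, pi_new, P(O|lambda))."""
    N, M, T = len(pi), len(B[0]), len(O)
    # E-step: forward and backward tables (Sections 3-4)
    alpha = [[0.0] * N for _ in range(T)]
    beta  = [[1.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][O[0]]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j]
                              for i in range(N)) * B[j][O[t]]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][O[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    pO = sum(alpha[T - 1])
    # Responsibilities gamma_t(i) and xi_t(i, j)
    gamma = [[alpha[t][i] * beta[t][i] / pO for i in range(N)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][O[t + 1]] * beta[t + 1][j] / pO
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # M-step: re-estimate (A, B, pi) from the expected counts
    A_new = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    B_new = [[sum(gamma[t][j] for t in range(T) if O[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(M)] for j in range(N)]
    pi_new = gamma[0][:]
    return A_new, B_new, pi_new, pO

# Toy model and sequence, invented for the example:
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.5, 0.5], [0.1, 0.9]]
pi = [0.6, 0.4]
O  = [0, 1, 0, 0, 1]
A1, B1, pi1, p0 = baum_welch_step(A, B, pi, O)
_, _, _, p1 = baum_welch_step(A1, B1, pi1, O)  # likelihood after one update
```

Running a second step exposes the EM guarantee of Section 6: p1 is never smaller than p0.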

6. Convergence Guarantees

The EM framework guarantees that the log-likelihood is monotonically non-decreasing across iterations:

log P(O | λ^(n+1)) ≥ log P(O | λ^(n))

Convergence is to a local maximum (or saddle point) of the likelihood surface. The algorithm terminates when |Δ log L| < ε (the user-specified tolerance) or when the maximum iteration count is reached.

In practice, multiple random restarts are recommended to mitigate sensitivity to initialization.
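The termination rule and the restart strategy are independent of the HMM details, so they can be sketched with a generic driver. In the code below, `run_em`, `step`, and `toy_step` are illustrative names, and the toy surrogate merely stands in for one Baum–Welch iteration so that the stopping logic is runnable:

```python
import random

def run_em(step, init_params, tol=1e-6, max_iter=100):
    """Generic EM driver: `step` maps params -> (new_params, log_likelihood).
    Stops when |delta log L| < tol or when max_iter is reached."""
    params, prev_ll, ll = init_params, None, None
    for _ in range(max_iter):
        params, ll = step(params)
        if prev_ll is not None and abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return params, ll

def toy_step(x):
    """Surrogate for one EM iteration: halves the distance to the optimum
    at x = 3, with monotonically increasing log-likelihood -(x - 3)^2."""
    x_new = (x + 3.0) / 2.0
    return x_new, -(x_new - 3.0) ** 2

# Multiple random restarts: keep the run with the best final likelihood.
rng = random.Random(0)
best_params, best_ll = max(
    (run_em(toy_step, rng.uniform(-10.0, 10.0)) for _ in range(5)),
    key=lambda pair: pair[1])
```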

7. Numerical Stability — Scaling

For long observation sequences, the forward and backward variables underflow to zero in fixed-precision arithmetic. This implementation uses the scaling procedure of Rabiner (1989), recovering the log-likelihood from the scaling coefficients:

  • At each time step t, compute a scaling coefficient c_t
  • c_t = 1 / Σ_i α̃_t(i), applied to normalize α
  • The same coefficients are used to scale β
  • The log-likelihood is recovered as log P(O|λ) = −Σ_t log c_t

This avoids floating-point underflow while preserving exact computation of γ and ξ.
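A scaled forward pass along these lines might look as follows; it is a sketch of the scheme described above, not this tool's exact code. For short sequences its output agrees with the logarithm of the unscaled likelihood, while for long sequences it stays finite where the unscaled product would underflow:

```python
import math

def forward_scaled(A, B, pi, O):
    """Scaled forward pass: normalizes alpha at every step and recovers
    log P(O|lambda) = -sum_t log(c_t) from the scaling coefficients."""
    N, T = len(pi), len(O)
    log_likelihood = 0.0
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]
    for t in range(T):
        if t > 0:
            prev = alpha
            alpha = [sum(prev[i] * A[i][j] for i in range(N)) * B[j][O[t]]
                     for j in range(N)]
        c = 1.0 / sum(alpha)             # c_t = 1 / sum_i alpha~_t(i)
        alpha = [a * c for a in alpha]   # normalized alpha, sums to 1
        log_likelihood -= math.log(c)
    return log_likelihood

# Illustrative model: a length-1000 sequence would underflow unscaled,
# but the scaled pass still returns a finite log-likelihood.
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.5, 0.5], [0.1, 0.9]]
pi = [0.6, 0.4]
ll_long = forward_scaled(A, B, pi, [0, 1] * 500)
```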


8. Applications

  • 🧬 Bioinformatics: gene finding, protein family classification & sequence alignment.
  • 🗣️ Speech Recognition: acoustic modeling, phoneme recognition & large-vocabulary models.
  • 📈 Quantitative Finance: regime-switching detection, volatility modeling & market analysis.
  • 🌦️ Pattern Recognition: climate pattern analysis, POS tagging & handwriting recognition.

References

  • L. R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
  • A. P. Dempster, N. M. Laird, D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm,” JRSS-B, vol. 39, no. 1, pp. 1–38, 1977.
  • C. M. Bishop, Pattern Recognition and Machine Learning, Chapter 13: Sequential Data. Springer, 2006.