Tom Liu

I am a PhD student in the Department of Biostatistics, School of Public Health, University of Michigan. My research interests are in causal inference and randomized clinical trials.


Variational EM and LDA for MIMIC-IV Clinical Notes

This technical report details the Variational Expectation-Maximization (EM) algorithm and its application in Latent Dirichlet Allocation (LDA) for MIMIC-IV clinical notes.

We will:

  1. Discuss how Python (e.g., via Gensim) implements Variational EM within LDA.
  2. Present MIMIC-IV as an example data source, describing how unstructured clinical text can be modeled.
  3. Provide a Results section with our latest LDA topic assignment output for a small sample of data.

Below is the entire discussion in Markdown format for easy reference and reading.


1. Introduction

Latent Dirichlet Allocation (LDA) is a popular unsupervised algorithm for topic modeling, aiming to identify latent thematic clusters (topics) within a corpus of text documents. Each document is represented as a mixture of multiple topics, and each topic is characterized by a distribution over words in the vocabulary.

1.1 The Intractability Problem

Exact inference of the hidden (latent) topic structure in LDA is intractable due to the exponential number of possible topic assignments. As a result, approximate inference methods must be used. Two well-known approaches are:

  • Gibbs Sampling (a Markov Chain Monte Carlo technique)
  • Variational Inference (a deterministic alternative using Variational EM)

In this report, we focus on Variational EM, which scales well and often converges faster than sampling on large text datasets like MIMIC-IV.


2. Data Description: MIMIC-IV Clinical Notes

The MIMIC-IV database is a large, deidentified dataset containing:

  • Structured data: Diagnoses, labs, vitals, demographic info.
  • Unstructured data: Clinical notes (discharge summaries, radiology reports, etc.).

For topic modeling, we specifically look at free-text clinical notes. These can be long, contain medical abbreviations, and vary significantly in structure. Below is our workflow to prepare this text, followed by a short code sketch:

  1. Cleaning:
    • Remove de-identification placeholders (___), punctuation, numbers, and other domain-specific artifacts.
  2. Normalization:
    • Convert text to lowercase, unify spacing.
  3. Stopword Removal:
    • Exclude common words (e.g., “the,” “and,” “of”) and any custom domain tokens that do not aid in meaning (e.g., “left,” “right,” “rrrr”).
  4. Bag-of-Words Representation:
    • Transform each note into a histogram of word occurrences for downstream LDA processing.
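
To make this workflow concrete, here is a minimal preprocessing sketch in Python with Gensim. The regular expressions, the extra stopword tokens, and the two toy note strings are illustrative assumptions, not our exact pipeline.

```python
# Minimal sketch: clean, normalize, remove stopwords, and build a bag-of-words corpus.
# The regexes, custom stopwords, and toy notes below are illustrative placeholders.
import re

from gensim import corpora
from gensim.parsing.preprocessing import STOPWORDS

CUSTOM_STOPWORDS = STOPWORDS | {"left", "right", "rrrr"}  # example domain tokens

def preprocess(note: str) -> list[str]:
    note = note.lower()                      # normalization: lowercase
    note = re.sub(r"_+", " ", note)          # drop de-identification placeholders (___)
    note = re.sub(r"[^a-z\s]", " ", note)    # drop punctuation and numbers
    tokens = note.split()                    # unify spacing
    return [t for t in tokens if t not in CUSTOM_STOPWORDS and len(t) > 2]

# Toy examples standing in for real MIMIC-IV notes
notes = [
    "CT head without contrast performed on ___. No acute hemorrhage.",
    "Chest radiograph: small left pleural effusion, no pneumothorax.",
]
texts = [preprocess(n) for n in notes]

dictionary = corpora.Dictionary(texts)            # vocabulary: token -> id
corpus = [dictionary.doc2bow(t) for t in texts]   # per-note word-count histograms
```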

3. Theoretical Foundations

3.1 Latent Dirichlet Allocation (LDA)

LDA proposes a generative story for documents:

  1. Each corpus has a fixed number of topics $K$.
  2. Each document $d$ is associated with a topic proportion $\theta_d$.
  3. For each word $w$ in document $d$:
    • Draw a topic assignment $z$ from $\theta_d$.
    • Then draw word $w$ from the selected topic’s word distribution $\phi_{z}$.

Because these topic assignments and proportions are unobserved, LDA is a latent variable model.
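
Written out, the generative story above corresponds to the following model. We use $\alpha$ and $\beta$ for the Dirichlet priors on the document-topic and topic-word distributions, standard notation that the text above leaves implicit:

```latex
% LDA generative model and joint distribution (standard notation;
% \alpha and \beta are Dirichlet hyperparameters not named in the text above).
\begin{aligned}
&\phi_k \sim \mathrm{Dirichlet}(\beta), \quad k = 1,\dots,K;
 \qquad \theta_d \sim \mathrm{Dirichlet}(\alpha), \quad d = 1,\dots,D;\\
&z_{dn} \mid \theta_d \sim \mathrm{Multinomial}(\theta_d);
 \qquad w_{dn} \mid z_{dn}, \phi \sim \mathrm{Multinomial}(\phi_{z_{dn}});\\
&p(\mathbf{w}, \mathbf{z}, \theta, \phi \mid \alpha, \beta)
  = \prod_{k=1}^{K} p(\phi_k \mid \beta)
    \prod_{d=1}^{D} p(\theta_d \mid \alpha)
    \prod_{n=1}^{N_d} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid \phi_{z_{dn}}).
\end{aligned}
```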

3.2 Variational EM

The Expectation-Maximization (EM) algorithm in a latent-variable setting alternates:

  • E-step: Compute the posterior distribution over the hidden variables given the current parameter estimates.
  • M-step: Update the model parameters to maximize the expected complete-data log-likelihood under that distribution.

However, the exact posterior in LDA is intractable, so we use Variational EM:

  • Variational E-step:
    • We introduce a simpler distribution $q(\theta, z)$ to approximate the true posterior $p(\theta, z \mid w)$.
    • Update the parameters of $q$ to minimize the KL divergence between $q$ and the true posterior.
  • Variational M-step:
    • Update the global parameters (e.g., topic-word distributions $\phi$) by maximizing the evidence lower bound (ELBO) using the current $q$.

This two-step iteration is repeated until convergence. Variational EM is deterministic, unlike sampling methods such as Gibbs sampling, and often converges faster on large corpora.
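
For reference, the objective behind these updates can be written out explicitly. The mean-field factorization of $q$ below is the standard choice in the LDA literature and is an assumption on our part, since the text above does not spell out a specific form ($\gamma_d$ and $\varphi_{dn}$ denote variational parameters):

```latex
% Mean-field variational family and the evidence lower bound (ELBO).
\begin{aligned}
q(\theta, \mathbf{z} \mid \gamma, \varphi)
  &= \prod_{d=1}^{D} q(\theta_d \mid \gamma_d)
     \prod_{n=1}^{N_d} q(z_{dn} \mid \varphi_{dn}),\\
\log p(\mathbf{w} \mid \alpha, \beta)
  &\ge \mathbb{E}_{q}\big[\log p(\mathbf{w}, \mathbf{z}, \theta \mid \alpha, \beta)\big]
     - \mathbb{E}_{q}\big[\log q(\theta, \mathbf{z})\big]
   = \mathrm{ELBO}(\gamma, \varphi; \alpha, \beta),\\
\log p(\mathbf{w} \mid \alpha, \beta) - \mathrm{ELBO}(\gamma, \varphi; \alpha, \beta)
  &= \mathrm{KL}\big(q(\theta, \mathbf{z} \mid \gamma, \varphi)
     \,\big\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)\big).
\end{aligned}
```

Minimizing the KL divergence in the E-step and maximizing the ELBO in the M-step are thus two views of the same objective.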


4. How Python Implements Variational EM in LDA

In Python, a common implementation for LDA is via the Gensim library:

  • Gensim’s LdaModel documentation explains that the underlying approach is a variational Bayes (VB) method or online variational Bayes for large corpora.
  • The library:
    1. Initializes random or heuristic distributions for topics.
    2. Performs the Variational E-step by updating, for each document, the approximate posterior of topics for each word.
    3. In the M-step, updates the global topic-word probabilities $\phi$ to incorporate the newly estimated document-level topic assignments.
    4. Often uses an “online” variant of variational EM, processing mini-batches of documents instead of the entire corpus at once, to improve scalability.

Hence, Python (through Gensim) relies on variational EM to approximate the LDA posterior in a memory-efficient and iterative manner, suitable for large collections of MIMIC-IV notes.
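
As a hedged sketch of how these pieces map onto Gensim's API: the `dictionary` and `corpus` objects below come from the preprocessing sketch above, and the parameter values are illustrative rather than tuned recommendations.

```python
# Sketch of Gensim's (online) variational Bayes LDA.
# `corpus` and `dictionary` are the bag-of-words objects built earlier;
# all numeric settings here are illustrative, not recommendations.
from gensim.models import LdaModel

lda = LdaModel(
    corpus=corpus,        # bag-of-words corpus
    id2word=dictionary,   # maps token ids back to words
    num_topics=5,         # K
    chunksize=2000,       # mini-batch size for online variational Bayes
    update_every=1,       # update after each mini-batch; 0 gives batch variational EM
    passes=10,            # full sweeps over the corpus
    alpha="auto",         # learn an asymmetric document-topic prior
    iterations=100,       # max variational E-step iterations per document
    random_state=42,
)
```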


5. Application to Our Data

5.1 Setup

  1. Preprocess 5 sample MIMIC-IV notes (a small demonstration subset).
  2. Set number of topics (K).
  3. Train the LDA model with the Python approach described above (a minimal sketch follows this list):
    • Gensim’s LdaModel(corpus=..., num_topics=K, ...) function.
    • This function automatically performs the variational E- and M- steps under the hood.
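
A minimal end-to-end call for this setup might look like the sketch below; it reuses the `dictionary` and `corpus` objects from the preprocessing sketch and Gensim's default settings, so it is an illustration rather than our exact script.

```python
# Minimal sketch: train a 5-topic LDA model and inspect the output.
# `corpus` and `dictionary` come from the preprocessing sketch above.
from gensim.models import LdaModel

K = 5
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=K,
               passes=10, random_state=42)

# Topic-word distributions, in the same format as the output shown below
print("=== LDA Topics ===")
for topic_id, topic_str in lda.print_topics(num_topics=K, num_words=10):
    print(f"Topic {topic_id}: {topic_str}")

# Per-document topic mixture (theta_d) for the first note
print(lda.get_document_topics(corpus[0]))
```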

5.2 Results: Latest Topic Assignment Output

Below is an example of 5-topic LDA output (the top 10 words in each topic) after running variational EM on a sample dataset. It illustrates how each topic groups commonly co-occurring terms:

=== LDA Topics ===
Topic 0: 0.017*"head" + 0.012*"ct" + 0.011*"contrast" + 0.010*"images" + 0.010*"evidence" + 0.009*"hemorrhage" + 0.009*"acute" + 0.009*"axial" + 0.008*"mass" + 0.008*"technique"
Topic 1: 0.017*"normal" + 0.013*"cm" + 0.011*"ct" + 0.011*"contrast" + 0.011*"evidence" + 0.010*"abdomen" + 0.009*"pelvis" + 0.009*"within" + 0.009*"mm" + 0.008*"liver"
Topic 2: 0.039*"chest" + 0.020*"pleural" + 0.016*"effusion" + 0.016*"comparison" + 0.014*"pulmonary" + 0.013*"pneumothorax" + 0.012*"lung" + 0.011*"tube" + 0.010*"radiograph" + 0.009*"unchanged"
Topic 3: 0.018*"fracture" + 0.010*"comparison" + 0.009*"joint" + 0.008*"views" + 0.008*"seen" + 0.008*"femoral" + 0.007*"knee" + 0.007*"veins" + 0.007*"patient" + 0.006*"catheter"
Topic 4: 0.015*"spine" + 0.014*"breast" + 0.010*"disc" + 0.010*"narrowing" + 0.009*"spinal" + 0.009*"ultrasound" + 0.009*"mild" + 0.009*"lumbar" + 0.008*"comparison" + 0.008*"cm"

An interpretation might be:

  • Topic 0: Head/brain imaging (words “head,” “hemorrhage,” “acute,” “axial,” “ct”).
  • Topic 1: Abdominal/pelvis imaging references (words “abdomen,” “pelvis,” “cm,” “mm,” “liver”).
  • Topic 2: Chest imaging (words “chest,” “pleural,” “effusion,” “pulmonary,” “pneumothorax”).
  • Topic 3: Musculoskeletal or orthopedic imaging (“fracture,” “joint,” “knee,” “femoral”).
  • Topic 4: Spine imaging references (words “spine,” “disc,” “narrowing,” “lumbar,” “ultrasound”).

Note: This is an unsupervised grouping. The topics are discovered solely by analyzing word co-occurrences.


6. Conclusion

  1. LDA: A generative model that posits each clinical note in MIMIC-IV is a mixture of hidden topics, each topic being a word distribution.
  2. Variational EM: An efficient approximate inference technique. Python libraries like Gensim implement Variational Bayes to handle large corpora, iteratively refining the approximate posterior until convergence.
  3. Relevance: In MIMIC-IV, such topic modeling can uncover underlying themes across discharge summaries, radiology reports, and more—e.g., “HIV medication,” “ascites management,” “pulmonary diagnoses,” or “neurological imaging.” Even with small samples, the approach is pertinent because it:
    • Does not require labeled data.
    • Quickly yields interpretable clusters of text.

Next Steps:

  • Scale the approach to hundreds or thousands of MIMIC-IV notes.
  • Incorporate domain-specific tokens (such as ICD-10 codes) or perform concept extraction to further refine the topics.
  • Evaluate whether the discovered topics align with known clinical categories (a coherence-check sketch follows).
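
One concrete way to approach the last item is a topic-coherence check. The sketch below uses Gensim's CoherenceModel with the "c_v" measure, a reasonable default but an assumption on our part rather than something evaluated in this report.

```python
# Hedged sketch: quantitative topic-coherence check for the trained model.
# `lda`, `texts`, and `dictionary` are the objects built in the earlier sketches.
from gensim.models import CoherenceModel

cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
print("c_v coherence:", cm.get_coherence())
```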

References