Date of Award

Summer 8-2025

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computational and Data Sciences

First Advisor

Cyril Rakovski, Ph.D

Second Advisor

Adrian Vajiac, Ph.D

Third Advisor

Ehsan Yaghmaei, Ph.D

Fourth Advisor

Sheth Parthiv, MD

Abstract

Many clinically important variables are recorded only in medical notes, leaving gaps for prediction and confounding adjustment. We analyzed discharge notes for acute-thrombosis admissions in MIMIC-IV. To preserve privacy, we generated synthetic notes that mirror the content and structure of the original discharge notes and labelled them with reasoning steps and classifications using the DeepSeek-R1 API. We then fine-tuned 8–14 billion parameter versions of the Llama-3, Qwen-3 and Gemma-3 models with Group Relative Policy Optimization (GRPO). Qwen-3 achieved the highest accuracy on a hold-out set of synthetic notes and de-identified real medical notes.
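As a rough illustration of the group-relative idea behind GRPO, the sketch below standardizes the rewards of several completions sampled for the same prompt, so each completion is scored against its own group rather than against a learned value function. The function name, the reward values and the `eps` constant are illustrative assumptions; the abstract does not describe the reward design used in the dissertation.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled completions.

    Each advantage is (reward - group mean) / (group std + eps).
    `eps` is an assumed small constant that avoids division by zero
    when every completion in the group receives the same reward.
    """
    rewards = np.asarray(rewards, dtype=float)
    centered = rewards - rewards.mean()
    return centered / (rewards.std() + eps)

# Hypothetical rewards for four completions of one prompt,
# e.g. 1.0 when the predicted label is correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that score above the group mean receive a positive advantage and are reinforced; those below the mean receive a negative one.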

Variables extracted from the medical notes, including family history of clot and provoked versus unprovoked status, were combined with structured fields and passed to a Super Learner built from non-parametric, semi-parametric and tree-based learners. The ensemble attained the lowest negative log-likelihood (NLL) when predicting major bleeding and thrombosis recurrence within 3 and 6 months and mortality within 12 months. Adding covariates derived from the discharge notes reduced the loss by 0.2% to 1.7% across outcomes.
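The core of a Super Learner is a weighted combination of base learners chosen to minimize a cross-validated loss, here the NLL. The sketch below is a deliberately small stand-in: it assumes out-of-fold predicted probabilities from just two hypothetical learners and searches a grid of convex weights. The toy arrays and function names are assumptions for illustration, not the learners or data used in the dissertation.

```python
import numpy as np

def nll(y, p, eps=1e-12):
    """Mean negative log-likelihood (log loss) for binary outcomes."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def best_convex_weight(y, p_a, p_b, grid_size=101):
    """Grid-search the weight alpha in [0, 1] on learner A
    (learner B gets 1 - alpha) that minimizes the NLL."""
    alphas = np.linspace(0.0, 1.0, grid_size)
    losses = np.array([nll(y, a * p_a + (1 - a) * p_b) for a in alphas])
    i = int(np.argmin(losses))
    return alphas[i], losses[i]

# Toy out-of-fold predictions from two hypothetical base learners.
y = np.array([1, 0, 1, 0, 1, 0])
p_a = np.array([0.9, 0.4, 0.6, 0.1, 0.7, 0.5])
p_b = np.array([0.6, 0.2, 0.9, 0.4, 0.5, 0.1])

alpha, loss = best_convex_weight(y, p_a, p_b)
print(alpha, loss, nll(y, p_a), nll(y, p_b))
```

Because the grid includes the endpoints 0 and 1, the combined loss can never be worse than that of the better single learner on the same predictions.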

Causal effects of Vitamin K antagonists (VKA, primarily warfarin) versus Factor Xa inhibitors (apixaban, edoxaban, rivaroxaban) were estimated with targeted maximum likelihood estimation (TMLE), which uses initial Super Learner fits for the outcome and treatment mechanisms followed by a targeting step that achieves doubly robust, efficient inference. After adjustment for both structured and note-derived covariates, VKA increased the risk of major bleeding by 4.5% (95% CI 3.4% to 5.7%) at 3 months and 5.8% (95% CI 4.6% to 7.0%) at 6 months, and raised thrombosis recurrence by 3.2% (95% CI 1.8% to 4.6%) and 3.5% (95% CI 2.1% to 5.0%) over the same horizons. No significant difference in 12-month mortality was detected (95% CI -1.7% to 0.6%).
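The targeting step described above can be sketched on simulated data. This is a minimal illustration under stated assumptions, not the dissertation's analysis: there is a single binary confounder `W`, the initial outcome fit deliberately ignores `W`, the treatment mechanism is estimated by stratum proportions, and the estimand is the risk difference. All variable names and simulation parameters are hypothetical.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

# Simulated data; the true risk difference is 0.10 by construction.
rng = np.random.default_rng(0)
n = 20000
W = rng.binomial(1, 0.5, n)                      # binary confounder
A = rng.binomial(1, 0.3 + 0.4 * W)               # treatment depends on W
Y = rng.binomial(1, 0.2 + 0.1 * A + 0.3 * W)     # outcome depends on A and W

# Initial outcome fit that ignores W (deliberately misspecified).
Q1 = np.full(n, Y[A == 1].mean())
Q0 = np.full(n, Y[A == 0].mean())
QA = np.where(A == 1, Q1, Q0)

# Treatment mechanism estimated by stratum proportions.
g = np.where(W == 1, A[W == 1].mean(), A[W == 0].mean())

# Clever covariate for the risk difference.
H = A / g - (1 - A) / (1 - g)

# Targeting: fit epsilon in logit Q*(A,W) = logit Q(A,W) + epsilon * H
# by Newton's method on the logistic log-likelihood with an offset.
offset = logit(QA)
epsilon = 0.0
for _ in range(25):
    p = expit(offset + epsilon * H)
    score = np.sum(H * (Y - p))
    info = np.sum(H ** 2 * p * (1 - p))
    epsilon += score / info

# Updated counterfactual predictions and plug-in estimate.
Q1_star = expit(logit(Q1) + epsilon / g)
Q0_star = expit(logit(Q0) - epsilon / (1 - g))
ate_tmle = np.mean(Q1_star - Q0_star)
ate_naive = Y[A == 1].mean() - Y[A == 0].mean()
score_residual = np.mean(H * (Y - expit(offset + epsilon * H)))
print(ate_naive, ate_tmle, score_residual)
```

The naive difference in means is confounded by `W`, whereas the targeted estimate relies on the treatment mechanism being well estimated to correct the misspecified outcome fit, which is the doubly robust property the paragraph refers to.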

These findings show that large language model (LLM) extraction of variables from discharge summaries can improve risk prediction and strengthen confounding control in targeted learning. Incorporating such variables reveals lower short-term risks of bleeding and recurrence with Factor Xa inhibitors compared with VKA, while long-term survival remains similar.

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.

Available for download on Sunday, June 20, 2027
