Advances in NLP Algorithms on Unstructured Medical Notes Data and Approaches to Handling Class Imbalance Issues
Date of Award
Doctor of Philosophy (PhD)
Computational and Data Sciences
Chun Hsien Chiang
This dissertation comprises three research projects, each addressing one aspect of text classification on unstructured medical notes.
The first study compared deep learning models that are widely used in NLP tasks: the sequence models RNN, GRU, LSTM, and Bi-LSTM; CNN; and the more recent attention-based Transformer Encoder and BERT-Base. Each model was evaluated with and without pre-trained word embeddings. The Transformer Encoder was the best model for all tasks, and the CNN produced comparable performance when the classes were relatively balanced.
As an extension of the first study, the second study explored the effects of 20 text data augmentation methods on the same data to address class imbalance and small sample size. It also compared strategies that differ in the amount of augmentation applied. The Splitting Augmenter consistently improved model performance under all strategies for most tasks, with the largest improvements being 0.13 in F1 score and 0.34 in AUC-ROC. For highly imbalanced tasks, augmenting the minority class until the classes were balanced improved model performance by the largest margin. For the other tasks, the best-performing strategy was to augment the minority class until balanced and then augment both classes by an additional 10%.
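The two augmentation strategies named above can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: `split_augment` is a hypothetical stand-in for a splitting augmenter (it breaks one word into two tokens), `balance_then_extend` and its parameters are illustrative names, and the "additional 10%" is assumed here to mean 10% of each class's size after balancing.

```python
import random


def split_augment(text, rng):
    """Hypothetical splitting augmenter: break one longer word into two tokens."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if len(w) >= 4]
    if not candidates:
        return text  # nothing long enough to split
    i = rng.choice(candidates)
    cut = rng.randint(1, len(words[i]) - 1)
    words[i] = words[i][:cut] + " " + words[i][cut:]
    return " ".join(words)


def augment_class(texts, n_new, rng):
    """Create n_new augmented copies drawn at random from the original texts."""
    return [split_augment(rng.choice(texts), rng) for _ in range(n_new)]


def balance_then_extend(minority, majority, extra_fraction=0.0, seed=0):
    """Augment the minority class until balanced, then optionally grow both classes.

    extra_fraction=0.0 gives the "augment until balanced" strategy;
    extra_fraction=0.10 gives "balanced, then both classes +10%"
    (assumed to be 10% of each class's size after balancing).
    """
    rng = random.Random(seed)
    deficit = max(len(majority) - len(minority), 0)
    minority_out = minority + augment_class(minority, deficit, rng)
    majority_out = list(majority)
    if extra_fraction > 0:
        n_min = round(extra_fraction * len(minority_out))
        n_maj = round(extra_fraction * len(majority_out))
        minority_out += augment_class(minority, n_min, rng)
        majority_out += augment_class(majority, n_maj, rng)
    return minority_out, majority_out
```

In this sketch, augmented samples are always generated from the original notes rather than from earlier augmented copies, and in practice augmentation would be applied to the training split only so that evaluation data stays untouched.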
The third study aimed to predict suicidal or self-injurious events in the Orange County Jails, both to improve the efficiency of triage for health care services and to help prevent such events. It showed that the medical and mental health progress notes contain more information about inmates’ mental health state, as it pertains to suicidal or self-injurious tendencies, than the structured data available in the database. Two ways of incorporating information from the notes into model building were introduced, and under-sampling was used to effectively mitigate the impact of extremely imbalanced classes.
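The abstract does not say which under-sampling scheme was used. As a rough illustration only, the sketch below shows plain random under-sampling of the majority class in Python; the function name `random_undersample`, the `ratio` parameter, and the convention that label 1 marks the rare event are all assumptions made for this example.

```python
import random


def random_undersample(records, labels, ratio=1.0, seed=0):
    """Keep every positive (label 1) and a random subset of negatives (label 0).

    ratio is the number of negatives kept per positive; 1.0 yields balanced classes.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), int(round(ratio * len(pos))))
    kept = pos + rng.sample(neg, n_keep)
    rng.shuffle(kept)  # avoid an ordering where all positives come first
    return [records[i] for i in kept], [labels[i] for i in kept]
```

Under-sampling of this kind discards majority-class records, so it is normally applied to the training data only, with evaluation run on data that keeps the original class distribution.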
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.
H. Lu, "Advances in NLP algorithms on unstructured medical notes data and approaches to handling class imbalance issues," Ph.D. dissertation, Chapman University, Orange, CA, 2022. https://doi.org/10.36837/chapman.000414
Available for download on Friday, November 22, 2024
Community Health and Preventive Medicine Commons, Diagnosis Commons, Health Services Research Commons