Date of Award

Fall 12-2022

Document Type


Degree Name

Doctor of Philosophy (PhD)


Computational and Data Sciences

First Advisor

Cyril Rakovski

Second Advisor

Daniel Alpay

Third Advisor

Chun Hsien Chiang

Fourth Advisor

Alex Barrett


Abstract

This dissertation comprises three research projects, each addressing one aspect of text classification on unstructured medical notes.

The first study compared sequence deep learning models widely used in NLP tasks (RNN, GRU, LSTM, and Bi-LSTM), a CNN, and newer attention-based models (the Transformer Encoder and BERT-Base). Each model was evaluated with and without pre-trained word embeddings. The Transformer Encoder was the best model for all tasks, and the CNN produced comparable performance when the classes were relatively balanced.

As an extension of the first study, the second study explored the effects of 20 text data augmentation methods on the same data to address class imbalance and small sample size. It also investigated strategies that differ in the amount of augmentation applied. The results showed that the Splitting Augmenter consistently improved model performance under all strategies for most tasks, with largest improvements of 0.13 in F1 score and 0.34 in AUC-ROC. For highly imbalanced tasks, the strategy that augments the minority class until the classes are balanced improved model performance by the largest margin. For other tasks, the best-performing strategy was the one that augments the minority class until balanced and then augments both classes by an additional 10%.
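
The two augmentation strategies named above differ only in how many synthetic notes each class receives. The arithmetic can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the function name `augmentation_targets` and the parameter `extra_fraction` are hypothetical, and the sketch assumes that the additional 10% is measured against the class size after balancing.

```python
def augmentation_targets(n_minority, n_majority, extra_fraction=0.0):
    """Return how many synthetic samples to add to each class.

    Hypothetical helper. Assumes `extra_fraction` is applied to the
    class size reached after balancing (an assumption, not a detail
    stated in the abstract).
    """
    if n_minority > n_majority:
        raise ValueError("n_minority must not exceed n_majority")
    # Step 1: samples needed to bring the minority class up to parity.
    balance_gap = n_majority - n_minority
    # Step 2: optional extra augmentation applied to both classes.
    extra = round(n_majority * extra_fraction)
    return {"minority": balance_gap + extra, "majority": extra}


# Strategy A: augment the minority class until balanced.
balanced_only = augmentation_targets(200, 1000)
# Strategy B: balance, then augment both classes by an additional 10%.
balanced_plus_10 = augmentation_targets(200, 1000, extra_fraction=0.10)
```

With 200 minority and 1,000 majority notes, the first strategy adds 800 minority samples, and the second adds 900 minority and 100 majority samples, so both classes end at 1,100.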

The third study aimed to predict suicidal or self-injurious events in order to improve the efficiency of triage for health care services and to help prevent such events in the Orange County Jails. It showed that the medical and mental health progress notes contain more information about the inmates' mental health state, as it pertains to suicidal or self-injurious tendency, than the structured data available in the database. Two different ways of incorporating information from the notes into model building were introduced, and under-sampling was used to mitigate the impact of extremely imbalanced classes.
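
Under-sampling, as mentioned above, can be illustrated with a simple random variant that keeps every minority-class record and draws a subset of the majority class. The abstract does not say which under-sampling scheme or class ratio the study used, so the sketch below is generic: the function `undersample`, its `ratio` parameter, and the 0/1 label encoding are all assumptions made for illustration.

```python
import random


def undersample(records, labels, ratio=1.0, seed=0):
    """Randomly under-sample the majority class of a binary dataset.

    Hypothetical helper. `ratio` is the desired number of majority
    samples per minority sample; labels are assumed to be 0 or 1.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Whichever class is smaller is treated as the minority.
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Never request more majority samples than actually exist.
    keep_n = min(len(majority), round(len(minority) * ratio))
    # Keep all minority indices plus a random majority subset,
    # restoring the original record order.
    kept = sorted(minority + rng.sample(majority, keep_n))
    return [records[i] for i in kept], [labels[i] for i in kept]


# Toy data: 5 positive events among 100 records.
toy_labels = [1] * 5 + [0] * 95
toy_records = list(range(100))
sampled_records, sampled_labels = undersample(toy_records, toy_labels)
```

With the default `ratio=1.0`, the toy example keeps all 5 positive records and 5 randomly chosen negative ones. In practice the sampling would be applied to the training split only, so that evaluation still reflects the original class distribution.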

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.

Available for download on Friday, November 22, 2024