Date of Award

Summer 8-2021

Document Type


Degree Name

Master of Science (MS)


Computational and Data Sciences

First Advisor

Erik Linstead

Second Advisor

Elizabeth Stevens

Third Advisor

Elizabeth Davison


Advances in genetic sequencing methods for microbiomes over recent decades have made it possible to collect taxonomic and functional profiles of microbial communities, accelerating the discovery of the functional aspects of the microbiome and increasing clinicians' interest in applying these techniques with patients. These advances have coincided with software and hardware improvements in machine learning and deep learning. Together, they suggest further potential for progress in the diagnosis and treatment of human disease. The ability to classify a human microbiome profile into a disease category, and to identify the factors within the profile that differentiate diseased from healthy individuals, is valuable both for diagnosing disease and for understanding its pathology. This is particularly important for diseases of unknown etiology, where it could provide accurate diagnostic tools to clinicians who currently diagnose based on the limited research available or by exclusion. Human microbiome studies such as the Human Microbiome Project generate data that can help produce important findings related to health care and to disease diagnosis and treatment. Such data is highly sparse and has a large feature space relative to the number of samples, which makes it challenging to use in machine learning models, especially when the number of samples is much smaller than the number of features. Here, the inflammatory bowel disease (IBD) microbiome profiles from the Human Microbiome Project are used to classify disease. We show the use of dimensionality reduction and variational autoencoders (VAEs) to generate synthetic microbiome profiles as a potential way to address this issue and improve the performance of existing disease classification models.
Results are compared across various baseline machine learning models with traditional supervised and unsupervised dimensionality reduction techniques. We show that supplementing a dataset with VAE-generated artificial microbiome data improves classification results for small datasets that have a large feature space relative to sample size and highly imbalanced class sizes, and that this approach may be used to increase classification accuracy in microbiome-based diagnostic tools.

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.


