Date of Award

Fall 1-2021

Document Type


Degree Name

Doctor of Philosophy (PhD)


Computational and Data Sciences

First Advisor

Erik Linstead

Second Advisor

Elizabeth Stevens

Third Advisor

Ruben Ramirez-Padron


The use of machine learning has risen in recent years, though many areas remain unexplored due to a lack of data or of computational tools. This dissertation explores machine learning approaches in case studies involving image classification and natural language processing. It also presents a software library in the form of a two-way bridge connecting deep learning models in Keras with ones available in the Fortran programming language.

In Chapter 2, we explore the applicability of transfer learning, using models pre-trained on non-software-engineering data, to the problem of classifying software Unified Modeling Language (UML) diagrams, where data is scarce. Our experimental results show that training reacts positively to transfer learning with respect to sample size, even though the pre-trained model was not exposed to training instances from the software domain. We contrast the transferred network with other networks to show its advantage on training sets of different sizes.

Implementing artificial neural networks is commonly achieved via high-level programming languages like Python and easy-to-use deep learning libraries like Keras. These libraries come pre-loaded with a variety of network architectures, provide autodifferentiation, and support GPUs for fast and efficient computation. However, many large-scale scientific computation projects are written in Fortran, which makes them difficult to integrate with modern deep learning methods. To alleviate this problem, we introduce a software library, the Fortran-Keras Bridge (FKB), that connects environments where deep learning resources are plentiful with those where they are scarce. Chapter 3 describes several unique features offered by FKB, such as customizable layers, loss functions, and network ensembles.

In Chapter 4, Latent Dirichlet Allocation (LDA) is leveraged to analyze R and MATLAB source code from 10,051 R packages and 27,000 open source MATLAB modules in order to provide empirical insight into the topic space of scientific computing. This method is able to identify several generic programming concepts and, more importantly, concepts that are highly specific to scientific and high-performance computing applications. We are also able to directly compare these topics using document entropy and topic uniformity scoring.
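
As a rough illustration of the document entropy idea mentioned above, the sketch below computes the Shannon entropy of a single document's topic distribution. It assumes that document entropy is defined this way; the abstract does not give the exact formula used in the dissertation. The function names and the example distributions are hypothetical and are not taken from the dissertation's code or data.

```python
import math


def document_entropy(topic_weights):
    """Shannon entropy (in bits) of one document's topic distribution.

    Assumption: "document entropy" means the entropy of the per-document
    topic proportions produced by a topic model such as LDA.
    """
    total = sum(topic_weights)
    if total <= 0:
        raise ValueError("topic weights must sum to a positive value")
    entropy = 0.0
    for weight in topic_weights:
        if weight < 0:
            raise ValueError("topic weights must be non-negative")
        p = weight / total  # normalize so the weights form a distribution
        if p > 0:  # the limit of p * log(p) as p -> 0 is 0
            entropy -= p * math.log2(p)
    return entropy


def normalized_entropy(topic_weights):
    """Entropy scaled to [0, 1] by the maximum possible value, log2(K)."""
    num_topics = len(topic_weights)
    if num_topics < 2:
        return 0.0
    return document_entropy(topic_weights) / math.log2(num_topics)


# Hypothetical topic proportions for two documents over four topics.
focused = [0.97, 0.01, 0.01, 0.01]  # dominated by a single topic
mixed = [0.25, 0.25, 0.25, 0.25]    # spread evenly across topics

if __name__ == "__main__":
    print("focused:", document_entropy(focused))
    print("mixed:  ", document_entropy(mixed))
```

Under this assumed definition, a document concentrated on one topic scores near zero, while a document spread evenly across all topics reaches the maximum, which is why the normalized variant is convenient when comparing models with different numbers of topics.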

Available for download on Wednesday, June 01, 2022