Theses and Dissertations

Innovative Heuristics to Improve the Latent Dirichlet Allocation Methodology for Textual Analysis and a New Modernized Topic Modeling Approach

Jamie T. Zimmerman

Date of Award

6-2022

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Department of Operational Sciences

First Advisor

Lance E. Champagne, PhD

Abstract

Natural Language Processing is a complex method of data mining the vast trove of documents created and made available every day. Topic modeling seeks to identify the topics within textual corpora with limited human input into the process to speed analysis. Current topic modeling techniques used in Natural Language Processing have limitations in the pre-processing steps. This dissertation studies topic modeling techniques, those limitations in the pre-processing, and introduces new algorithms to gain improvements from existing topic modeling techniques while being competitive with computational complexity. This research introduces four contributions to the field of Natural Language Processing and topic modeling. First, this research identifies a requirement for a more robust “stopwords” list and proposes a heuristic for creating a more robust list. Second, a new dimensionality-reduction technique is introduced that exploits the number of words within a document to infer importance to word choice. Third, an algorithm is developed to determine the number of topics within a corpus and demonstrated using a standard topic modeling data set. These techniques produce a higher quality result from the Latent Dirichlet Allocation topic modeling technique. Fourth, a novel heuristic utilizing Principal Component Analysis is introduced that is capable of determining the number of topics within a corpus that produces stable sets of topic words.

AFIT Designator

AFIT-ENS-DS-22-J-059

DTIC Accession Number

AD1177713

Recommended Citation

Zimmerman, Jamie T., "Innovative Heuristics to Improve the Latent Dirichlet Allocation Methodology for Textual Analysis and a New Modernized Topic Modeling Approach" (2022). Theses and Dissertations. 5493.
https://scholar.afit.edu/etd/5493

Download

Included in

Data Science Commons

COinS

Theses and Dissertations

Innovative Heuristics to Improve the Latent Dirichlet Allocation Methodology for Textual Analysis and a New Modernized Topic Modeling Approach

Date of Award

Document Type

Degree Name

Department

First Advisor

Abstract

AFIT Designator

DTIC Accession Number

Recommended Citation

Included in

Search

Browse

Author Corner

Theses and Dissertations

Innovative Heuristics to Improve the Latent Dirichlet Allocation Methodology for Textual Analysis and a New Modernized Topic Modeling Approach

Author

Date of Award

Document Type

Degree Name

Department

First Advisor

Abstract

AFIT Designator

DTIC Accession Number

Recommended Citation

Included in

Share

Search

Browse

Author Corner