Digital forensics practitioners face a continual increase in the volume of data they must analyze, which exacerbates the problem of finding relevant information in a noisy domain. Current technologies make use of keyword based search to isolate relevant documents and minimize false positives with respect to investigative goals. Unfortunately, selecting appropriate keywords is a complex and challenging task. Latent Dirichlet Allocation (LDA) offers a possible way to relax keyword selection by returning topically similar documents. This research compares regular expression search techniques and LDA using the Real Data Corpus (RDC). The RDC, a set of over 2400 disks from real users, is first analyzed to craft effective tests. Three tests are executed with the results indicating that, while LDA search should not be used as a replacement to regular expression search, it does offer benefits. First, it is able to locate documents when few, if any, of the keywords exist within them. Second, it improves data browsing and deals with keyword ambiguity by segmenting the documents into topics.
Noel, G. E., & Peterson, G. L. (2014). Applicability of Latent Dirichlet Allocation to multi-disk search. Digital Investigation, 11(1), 43–56. https://doi.org/10.1016/j.diin.2014.02.001