Date of Award

3-2020

Document Type

Thesis

Degree Name

Master of Science in Cyber Operations

Department

Department of Electrical and Computer Engineering

First Advisor

Mark E. DeYoung, PhD

Abstract

Machine Learning (ML) is rapidly becoming integrated into critical aspects of cybersecurity, particularly in the area of network intrusion and anomaly detection. However, ML techniques require large volumes of data to be effective, and the available data is a critical aspect of the ML process for training, classification, and testing purposes. One solution to this problem is to generate synthetic data that is realistic, and one promising approach is to use ML itself to perform the data generation. With the ability to generate synthetic data comes the need to evaluate the “realness” of the generated data. This research focuses specifically on the problem of evaluating the evaluation criteria. Quantitative analysis of evaluation criteria is important so that future research can have quantitative evidence for the evaluation criteria it uses. The goal of this research is to provide a framework that can be used to inform and improve the process of generating synthetic semi-structured sequential data. A series of experiments is performed to evaluate a chosen set of metrics on discriminative ability and efficiency. This research shows that the choice of feature space in which distances are calculated is critical: the ability to discriminate between real and generated data hinges on that space. Additionally, the choice of metric significantly affects the sample distance distributions in a suitable feature space. There are three main contributions from this work. First, this work provides the first known framework for evaluating metrics for semi-structured sequential synthetic data generation. Second, this work provides a “black box” evaluation framework that is generator agnostic. Third, this research provides the first known evaluation of metrics for semi-structured sequential data.

AFIT Designator

AFIT-ENG-MS-20-M-048

DTIC Accession Number

AD1104221
