Deep Learning Seminar  /  April 22, 2021, 10:00 – 11:00

Representation of Categorical Variables for Machine Learning Based Anomaly Detection Using Embeddings

Bachelor's thesis; Speaker: Malte Silbernagel (Fraunhofer ITWM, Financial Mathematics department)

Abstract:

Most machine learning algorithms can handle only numerical data, so categorical values must be encoded as numeric values that represent the original data. This thesis discusses a neural network that learns a mapping of categorical values onto a two-dimensional manifold according to the neighborhood relationships between samples in the input space. As a byproduct of the learned mapping, a higher-dimensional embedding of the values is produced. The performance of the embedding and of the two-dimensional representation is then compared with that of the commonly used one-hot encoding.
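
To make the two encodings concrete, here is a minimal sketch in plain Python. It is not the thesis's method: the column `payment_types`, the helper names, and the randomly initialised lookup table are all assumptions for illustration. In the thesis the embedding is learned by a neural network, whereas here a random table only stands in for such learned weights.

```python
import random

def one_hot(values):
    """Encode each category as a unit vector; returns (vectors, vocabulary)."""
    vocab = sorted(set(values))
    index = {v: i for i, v in enumerate(vocab)}
    vectors = []
    for v in values:
        row = [0] * len(vocab)
        row[index[v]] = 1  # exactly one position is set per category
        vectors.append(row)
    return vectors, vocab

def embedding_lookup(values, vocab, table):
    """Replace each category by its dense row in a lookup table."""
    index = {v: i for i, v in enumerate(vocab)}
    return [table[index[v]] for v in values]

# Hypothetical categorical column (not taken from the thesis data).
payment_types = ["card", "cash", "transfer", "card"]

encoded, vocab = one_hot(payment_types)

# Assumption: a random 2-dimensional table standing in for learned weights.
rng = random.Random(0)
dim = 2
table = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in vocab]
embedded = embedding_lookup(payment_types, vocab, table)

print(vocab)       # categories in sorted order
print(encoded[0])  # one-hot vector for "card"
print(embedded[0]) # dense 2-d vector for "card"
```

The one-hot vectors grow with the number of categories and place every pair of categories equally far apart, while an embedding table keeps a fixed width and can place related categories close together once its entries are trained.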

This thesis proposes a neighborhood, Probability Hamming, whose embedding yields a more accurate classification of fraudulent and non-fraudulent data. Based on the best scores of the different downstream classifiers, this method increases accuracy by 3.66 percentage points over one-hot encoding.