Deep Learning Seminar  /  April 22, 2021, 10:00 – 11:00

Representation of Categorical Variables for Machine Learning Based Anomaly Detection Using Embeddings

(Bachelor Thesis) Speaker: Malte Silbernagel (Fraunhofer ITWM, Department »Financial Mathematics«)


Most machine learning algorithms can handle only numerical data, so categorical values must be encoded as numeric values that represent the original data. This thesis discusses a neural network that learns a mapping of the categorical values onto a two-dimensional manifold according to the neighborhood relationships between samples in the input space. As a byproduct of the learned mapping, a higher-dimensional embedding of the values is produced. The performance of this embedding and of the two-dimensional representation is then compared with that of the commonly used one-hot encoding.
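
To make the encoding step above concrete, the following minimal sketch contrasts one-hot encoding with a dense embedding lookup. It is an illustration only, not the network from the thesis: the category values are invented, and the embedding table is filled with random numbers where a trained model would hold learned vectors.

```python
import numpy as np

# Hypothetical categorical column (values invented for illustration).
values = ["card", "cash", "transfer", "card"]

# Map each distinct category to an integer index.
categories = sorted(set(values))
index = {category: i for i, category in enumerate(categories)}
codes = np.array([index[v] for v in values])

# One-hot encoding: one column per category, exactly one 1 per row.
one_hot = np.eye(len(categories))[codes]

# Embedding lookup: each category is assigned a dense vector.
# Assumption: the table is random here; in a neural network it would be learned.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(categories), 2))
embedded = embedding_table[codes]

print(one_hot.shape)   # one row per sample, one column per category
print(embedded.shape)  # one row per sample, two embedding dimensions
```

Multiplying the one-hot matrix by the embedding table gives the same result as the lookup, which is why an embedding layer can be read as a learned, lower-dimensional replacement for one-hot columns.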

This thesis proposes a neighborhood, Probability Hamming, whose embedding classifies fraudulent and non-fraudulent data more accurately. When the best scores of the different downstream classifiers are compared, this method improves accuracy by 3.66 percentage points over one-hot encoding.
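
As background on neighborhood relationships between categorical samples, the sketch below computes the standard Hamming distance, which counts the attributes in which two samples differ. This is not the Probability Hamming neighborhood of the thesis, which this abstract does not define; the sample records and the function name `hamming_distance` are hypothetical.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length samples differ."""
    if len(a) != len(b):
        raise ValueError("samples must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical categorical samples (payment type, country, channel).
samples = [
    ("card", "DE", "online"),
    ("card", "DE", "store"),
    ("cash", "FR", "store"),
]

# Pairwise distances: a smaller value means a closer neighbor in the input space.
distances = [[hamming_distance(s, t) for t in samples] for s in samples]

for row in distances:
    print(row)
```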