Database Systems

Machine Learning for Data Management

Seminar  INFO-4663a
Degree  Master

🙅 Canceled!

Too few people signed up for the discussion rounds to run as planned and cover a reasonable number of the topics. As a result, we regretfully have to call off this seminar.

This master seminar is held by Eberhard Hechler (formerly at IBM). Please contact Eberhard Hechler by e-mail (eberhard.hechler@t-online.de) for any questions regarding the scope, organizational matters, or signup (see below).

Content and Objectives

While data management and machine learning have long, largely separate histories, their interplay, in particular the use of machine learning to enhance selected data management domains, has enjoyed great interest in research and industry for several years.

In this seminar, we are going to discuss exciting approaches to augmenting three traditional data management domains with machine learning methods and algorithms. The first area applies embedding techniques from natural language processing (NLP) to relational database management systems, enabling semantic SQL programming. The second area uses locality sensitive hashing, including novel developments, for entity resolution, further improving and optimizing traditional similarity computation techniques in data management systems. Finally, the third area focuses on detecting, measuring, and mitigating data drift to address model fairness/bias and subsequent model drift.

The seminar offers the opportunity to engage with and understand selected research articles, building deeper and more applicable skills in your chosen topic by adding Python-based examples and visualizations. All seminar assignments explicitly invite you to review the current state of research and its application possibilities in industry.

Prerequisites

All seminar assignments are designed so that the described concepts and methods can be explained and visualized via examples programmed in Python using Kaggle datasets.

The seminar assignments are mainly derived from the lecture Utilizing Machine Learning in Data Management, held in WS 23/24 and WS 24/25. Having attended this lecture is advantageous but not required. Ideally, participants should have attended the lecture Datenbanksysteme 1 (DB1) or Tabular Database Systems (TaDa) and should have foundational knowledge of machine learning principles and methods.

The seminar is aimed at master's students and is limited to 10 participants.

📑 Topics and References

  1. Word embedding for semantic SQL in RDBMS

    Unsupervised neural network models using the Natural Language Processing (NLP) technique of word embedding (e.g., based on Word2Vec) can be used to augment relational database management systems (RDBMS), enabling semantic SQL programming (1st and 2nd articles). Describe the approach as outlined in both articles, including vector embeddings of DB rows that discover hidden inter-/intra-column relationships and enable semantic SQL functions (e.g., similarity between DB rows). Develop a Jupyter notebook with Python and visualize results of semantic queries on a churn dataset from Kaggle with different Word2Vec parameters.

    References:
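The core idea can be sketched in a few lines: "textify" each DB row into a token sequence and compare rows by cosine similarity of their vectors. This is only a minimal illustration, not the articles' actual method; toy co-occurrence vectors stand in for trained Word2Vec embeddings, and the table is a made-up stand-in for a Kaggle churn dataset.

```python
from collections import defaultdict
import math

# Toy relational table (a stand-in for a Kaggle churn dataset).
rows = [
    {"plan": "premium", "region": "north", "churn": "no"},
    {"plan": "premium", "region": "north", "churn": "no"},
    {"plan": "basic",   "region": "south", "churn": "yes"},
]

def textify(row):
    # Turn a DB row into a token sequence, the "sentence" fed to Word2Vec.
    return [f"{col}:{val}" for col, val in sorted(row.items())]

# Co-occurrence counts stand in for trained Word2Vec vectors here.
cooc = defaultdict(lambda: defaultdict(int))
vocab = set()
for row in rows:
    toks = textify(row)
    vocab.update(toks)
    for t in toks:
        for c in toks:
            if t != c:
                cooc[t][c] += 1

index = {tok: i for i, tok in enumerate(sorted(vocab))}

def token_vec(tok):
    v = [0.0] * len(index)
    for c, n in cooc[tok].items():
        v[index[c]] = float(n)
    return v

def row_vec(row):
    # Row embedding = sum of its token vectors.
    vecs = [token_vec(t) for t in textify(row)]
    return [sum(x) for x in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

sim_01 = cosine(row_vec(rows[0]), row_vec(rows[1]))  # identical rows
sim_02 = cosine(row_vec(rows[0]), row_vec(rows[2]))  # dissimilar rows
```

A semantic SQL function such as `similarity(row_a, row_b)` would wrap exactly this kind of cosine computation over learned row vectors.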

  2. Skip-gram with negative sampling (SGNS)

    Vector embeddings make it possible to capture the semantic similarity of data points. Word embedding models and corresponding training methods are essential to enable semantic SQL programming in RDBMSs. Implement the skip-gram with negative sampling (SGNS) method in Python (using a Jupyter notebook) on a simple set of data records, and visualize the learning phase and result by training several low-dimensional target and context records with various learning rates.

    References:
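A minimal SGNS update step might look as follows. This is a sketch under simplifying assumptions (one positive pair, hand-picked negatives, a tiny vocabulary), not a full training pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

V, D = 5, 2                                # vocabulary size, embedding dimension
W_t = rng.normal(scale=0.1, size=(V, D))   # target embeddings
W_c = rng.normal(scale=0.1, size=(V, D))   # context embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(target, context, negatives, lr=0.1):
    """One SGNS update: pull (target, context) together, push negatives apart."""
    t = W_t[target].copy()                 # pre-update target vector
    samples = [(context, 1.0)] + [(n, 0.0) for n in negatives]
    grad_t = np.zeros(D)
    for c, label in samples:
        g = sigmoid(t @ W_c[c]) - label    # d(log-loss)/d(score)
        grad_t += g * W_c[c]
        W_c[c] -= lr * g * t
    W_t[target] -= lr * grad_t

# Train a single positive pair (0, 1) against negatives {3, 4}.
for _ in range(500):
    sgns_step(0, 1, negatives=[3, 4])

pos = sigmoid(W_t[0] @ W_c[1])             # should approach 1
neg = sigmoid(W_t[0] @ W_c[3])             # should approach 0
```

Sweeping `lr` and plotting `pos`/`neg` per step gives exactly the learning-phase visualization the assignment asks for.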

  3. Continuous bag of words (CBOW)

    The goal of continuous bag of words is to train a neural network to predict a target word given a defined number of context words. This is a key method to enable semantic SQL programming in RDBMSs. Using a Jupyter notebook with Python, implement the CBOW algorithm based on sample data records. Visualize the learning behavior and adjustments of several target and context records with various learning rates, and depict the error reduction rate by training 1, 2, and 3 target words (records).

    References:

  4. Locality sensitive hashing (LSH) approaches for entity resolution

    Entity matching is a key capability of Master Data Management (MDM) systems. Traditional similarity computation techniques (e.g., rule-based matching, probabilistic matching, the Jaccard index) in the entity matching process are complemented with machine learning methods to enhance similarity computation. Describe how locality sensitive hashing (LSH) can be used for duplicate detection (1st and 2nd articles) within the blocking step of entity resolution, and visualize the family of hash functions for blocking with different parameters using a simple example. Use a Jupyter notebook with Python and determine blocks based on a sample dataset, e.g., from Kaggle.

    References:
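The blocking step can be sketched with MinHash signatures plus banding. The three records below are invented examples, not from any dataset, and the hash family parameters (32 hashes, 16 bands of 2 rows) are arbitrary choices to experiment with:

```python
import random
import zlib

random.seed(42)

def shingles(s, k=3):
    s = s.lower()
    return {s[i:i + k] for i in range(len(s) - k + 1)}

# A family of random hash functions h(x) = (a*x + b) mod p for MinHash.
P = 2**31 - 1
N_HASHES = 32
params = [(random.randrange(1, P), random.randrange(P)) for _ in range(N_HASHES)]

def minhash(sh):
    return [min((a * zlib.crc32(tok.encode()) + b) % P for tok in sh)
            for a, b in params]

def lsh_blocks(records, bands=16, rows=2):
    """Band the MinHash signature: equal band slices land in the same bucket."""
    assert bands * rows == N_HASHES
    buckets = {}
    for rid, text in records.items():
        sig = minhash(shingles(text))
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets.setdefault(key, set()).add(rid)
    # Buckets holding more than one record are the candidate blocks.
    return [ids for ids in buckets.values() if len(ids) > 1]

records = {
    1: "John Smith, 12 Main Street, Springfield",
    2: "Jon Smith, 12 Main St, Springfield",
    3: "Alice Jones, 98 Oak Avenue, Shelbyville",
}
candidates = lsh_blocks(records)
```

Varying `bands` and `rows` (with `bands * rows` fixed) shifts the S-curve of collision probability, which is exactly the parameter effect the assignment asks you to visualize.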

  5. Multi-probe locality sensitive hashing (LSH) approaches for entity resolution

    Multi-probe LSH aims to reduce the number of hash tables needed for high-dimensional similarity search, simplifying the application of LSH while keeping false positives low and still decreasing false negatives. As outlined in the 1st article, the key idea of the multi-probe LSH method is to use a carefully derived probing sequence to check multiple buckets that are likely to contain the nearest neighbors of a query object. Explain the key idea of multi-probe LSH indexing (article) and visualize multi-probe LSH with 2 hash tables using a Gaussian density distribution function in a Jupyter notebook with Python.

    References:

    Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search
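The probing idea can be sketched with p-stable (random projection) hashes: besides the query's own bucket, also probe neighboring buckets whose boundary the query sits closest to. This is a simplified single-table sketch with a naive one-step probing sequence, not the article's full query-directed perturbation algorithm:

```python
import numpy as np

rng = np.random.default_rng(7)
DIM, K, W = 4, 2, 1.0                 # data dim, hashes per table, bucket width
A = rng.normal(size=(K, DIM))
B = rng.uniform(0, W, size=K)

def raw(v):
    return (A @ v + B) / W            # real-valued hash positions

def bucket(v):
    return tuple(np.floor(raw(v)).astype(int))

def probe_sequence(q, n_probes):
    """Multi-probe idea: also check neighboring buckets the query is close to."""
    base = np.floor(raw(q)).astype(int)
    frac = raw(q) - base              # position inside each bucket, in [0, 1)
    # Perturb coordinates toward their nearest boundary, cheapest first.
    order = sorted(range(K), key=lambda i: min(frac[i], 1 - frac[i]))
    probes = [tuple(base)]
    for i in order:
        if len(probes) >= n_probes:
            break
        pert = base.copy()
        pert[i] += -1 if frac[i] < 0.5 else 1
        probes.append(tuple(pert))
    return probes

# Build one hash table over a few random points.
points = rng.normal(size=(20, DIM))
table = {}
for idx, p in enumerate(points):
    table.setdefault(bucket(p), []).append(idx)

q = points[0] + 0.05 * rng.normal(size=DIM)   # near-duplicate query
candidates = set()
for bkt in probe_sequence(q, n_probes=3):
    candidates.update(table.get(bkt, []))
```

With two such tables and Gaussian data, plotting which probed buckets capture the true neighbors gives the visualization the assignment describes.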

  6. Comparing ML model fairness metrics

    ML model fairness has become a predominant concern in developing and deploying ML models. Fairness aims to detect, measure, and mitigate favoritism and discrimination in ML models. Explain the key fairness/bias metrics, such as demographic parity (DP), equality of opportunity (EOp), equality of odds (EOd), predicted value parity (PVP), and accuracy parity (ACCP), and visualize the differences using sample probability distributions and a Kaggle dataset. Visualize (using Python) key properties and relationships between these metrics and describe under which circumstances these metrics can or cannot coexist.

    References:
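These metrics all reduce to group-conditional rates from the confusion matrix. A minimal sketch on invented labels, predictions, and a binary protected attribute (the max-over-rates formulation of the EOd gap is one common convention among several):

```python
def rates(y_true, y_pred, group, g):
    """Confusion-matrix-based rates restricted to protected group g."""
    idx = [i for i, a in enumerate(group) if a == g]
    yt = [y_true[i] for i in idx]
    yp = [y_pred[i] for i in idx]
    tp = sum(1 for t, p in zip(yt, yp) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(yt, yp) if t == 0 and p == 1)
    return {
        "DP":  sum(yp) / len(yp),                   # P(Yhat=1 | A=g)
        "TPR": tp / max(sum(yt), 1),
        "FPR": fp / max(len(yt) - sum(yt), 1),
        "PPV": tp / max(sum(yp), 1),
        "ACC": sum(1 for t, p in zip(yt, yp) if t == p) / len(yt),
    }

def gaps(y_true, y_pred, group):
    r0 = rates(y_true, y_pred, group, 0)
    r1 = rates(y_true, y_pred, group, 1)
    return {
        "DP gap":   abs(r0["DP"] - r1["DP"]),        # demographic parity
        "EOp gap":  abs(r0["TPR"] - r1["TPR"]),      # equality of opportunity
        "EOd gap":  max(abs(r0["TPR"] - r1["TPR"]),  # equality of odds
                        abs(r0["FPR"] - r1["FPR"])),
        "PVP gap":  abs(r0["PPV"] - r1["PPV"]),      # predicted value parity
        "ACCP gap": abs(r0["ACC"] - r1["ACC"]),      # accuracy parity
    }

# Tiny synthetic example: group 1 receives positive predictions more often.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
g = gaps(y_true, y_pred, group)
```

This example already shows that the metrics can disagree: the classifier is exactly accuracy-parity fair (ACCP gap 0) while violating DP, EOp, EOd, and PVP at the same time.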

  7. Model fairness via fairness reprogramming

    Using Python, implement the approach to detect and mitigate bias without retraining ML models via ML model reprogramming (2nd reference), including the corresponding loss functions, and explain how mutual information is used by the FairReprogram methodology to guarantee two key metrics: demographic parity (DP) and equality of odds (EOd). Illustrate the impact of the fairness trigger using a Kaggle dataset (e.g., the 1st reference) by changing some data records (features), and depict the adjustments of the fairness metrics before and after applying the fairness trigger.

    References:
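The trigger mechanism itself can be illustrated on a toy frozen model: a fixed input perturbation is appended to every sample while the model weights stay untouched. This sketch replaces FairReprogram's gradient-based trigger tuning and information-theoretic losses with a crude grid search over a synthetic dataset, purely to show the before/after DP gap:

```python
import numpy as np

rng = np.random.default_rng(3)

# A frozen "pre-trained" linear scorer whose first feature is a proxy
# for the protected attribute, inducing a demographic-parity gap.
w = np.array([2.0, 1.0])
n = 200
group = rng.integers(0, 2, size=n)
X = np.column_stack([group + 0.1 * rng.normal(size=n),   # proxy feature
                     rng.normal(size=n)])

def predict(X):
    return (X @ w > 1.0).astype(int)      # frozen model, never retrained

def dp_gap(pred, group):
    return abs(pred[group == 0].mean() - pred[group == 1].mean())

base_gap = dp_gap(predict(X), group)

# "Fairness trigger": a fixed perturbation added to every input, chosen
# here by grid search instead of the paper's gradient-based optimization.
best_trigger, best_gap = None, base_gap
for t0 in np.linspace(-2, 2, 41):
    trig = np.array([t0, 0.0])
    gap = dp_gap(predict(X + trig), group)
    if gap < best_gap:
        best_trigger, best_gap = trig, gap
```

Note that minimizing the DP gap alone can destroy accuracy; the actual method balances a fairness loss against a utility loss, which this sketch omits.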

  8. Model fairness via infinitesimal Jackknife

    Explain and visualize the Jackknife method (using a Jupyter notebook with Python) for estimating an unknown parameter of a statistical metric, e.g., one related to a data probability distribution. Describe an alternative approach to detect and mitigate unfairness via post-processing of pre-trained models, mitigating the influence of biased training data points without refitting (article). Describe the key idea of the infinitesimal Jackknife approach, using the Jackknife (Jacobian), and how violations of demographic parity (DP) and equality of odds (EOd) can be detected and mitigated in a post-hoc fashion.

    References:

    Fair Infinitesimal Jackknife: Mitigating the Influence of Biased Training Data Points Without Refitting
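The classical leave-one-out jackknife, which the infinitesimal variant linearizes, fits in a few lines (a sketch for any plug-in statistic; the sample data is invented):

```python
import math

def jackknife(data, stat):
    """Leave-one-out jackknife: bias-corrected estimate and standard error."""
    n = len(data)
    theta_hat = stat(data)
    loo = [stat(data[:i] + data[i + 1:]) for i in range(n)]   # leave-one-out
    theta_bar = sum(loo) / n
    bias = (n - 1) * (theta_bar - theta_hat)
    se = math.sqrt((n - 1) / n * sum((t - theta_bar) ** 2 for t in loo))
    return theta_hat - bias, se

mean = lambda xs: sum(xs) / len(xs)
data = [2.0, 4.0, 6.0, 8.0]
corrected, se = jackknife(data, mean)
```

For the sample mean the jackknife bias is exactly zero and the standard error reduces to the usual s/sqrt(n); the infinitesimal jackknife replaces the n refits with a first-order (Jacobian-based) approximation, which is what makes post-hoc fairness repair without refitting tractable.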

  9. Data and model drift via polynomial relations

    Describe the approach to detecting data and model drift, which occurs because of differing data distributions during model training and operationalization (article). Explain the detection of data drift using polynomial relations among features as outlined in the article, including the use of the Pearson correlation coefficient and the coefficient of determination to select features. Describe how the Bayesian information criterion (BIC) and the Bayes factor (BF) are used to compare a linear model with two different input feature sets. Use two binomial probability distributions to illustrate and visualize (using Python) how the Bayes factor compares prior and posterior probability distributions.

    References:

    Detecting model drift using polynomial relations
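The BIC comparison of two feature sets can be sketched on synthetic data (Gaussian-noise linear models; the BIC-to-Bayes-factor conversion below is the standard large-sample approximation, not the article's exact derivation):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 * x1 + 0.5 * rng.normal(size=n)   # y depends on x1 only

def fit_bic(X, y):
    """Least-squares fit; BIC under a Gaussian noise model."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = ((y - X @ beta) ** 2).sum()
    k, m = X.shape[1], len(y)
    return m * np.log(rss / m) + k * np.log(m)

bic_good = fit_bic(x1.reshape(-1, 1), y)  # informative feature set
bic_bad = fit_bic(x2.reshape(-1, 1), y)   # irrelevant feature set
# BIC approximation of the log Bayes factor of model 1 over model 2:
log_bf = 0.5 * (bic_bad - bic_good)
```

A large positive `log_bf` is strong evidence for the first feature set; in the drift setting, the learned polynomial relation failing on new data shows up as exactly such a BIC/BF shift.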

  10. Detecting and measuring data drift

    Detecting and measuring data drift is vital to maintain the accuracy and precision of ML models after deployment. Describe, illustrate, and compare key methods, such as the Jensen-Shannon distance (based on the Kullback-Leibler divergence), the Hellinger distance, the total variation distance, and the Kolmogorov-Smirnov (KS) statistic. Use Gaussian probability distributions and a Kaggle dataset to illustrate the differences and suitability of the methods regarding data drift metrics, e.g., changes of mean, variance, and single features. Illustrate (via Python) that the KS statistic for sufficiently large one-sample and two-sample empirical datasets approaches the Kolmogorov probability distribution.

    References:
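All four drift measures fit in a short sketch over discrete histograms (the reference and drifted histograms below are invented; KL uses natural log, so the JS distance lies in [0, sqrt(ln 2)], and KL assumes q > 0 wherever p > 0):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence (natural log); assumes q>0 where p>0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

def hellinger(p, q):
    return math.sqrt(0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                               for pi, qi in zip(p, q)))

def total_variation(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def ks_statistic(a, b):
    """Two-sample KS statistic: maximum gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    cdf = lambda s, x: sum(1 for v in s if v <= x) / len(s)
    return max(abs(cdf(a, x) - cdf(b, x)) for x in sorted(set(a) | set(b)))

p = [0.1, 0.4, 0.4, 0.1]                  # reference histogram
q = [0.3, 0.3, 0.2, 0.2]                  # drifted histogram

d_js = js_distance(p, q)
d_h = hellinger(p, q)
d_tv = total_variation(p, q)
d_ks = ks_statistic([1, 2, 2, 3], [1, 1, 2, 3, 3])
```

Applying these to sliding windows of a production feature against its training distribution is the standard way to turn them into drift monitors; the KS statistic works directly on raw samples, while the three f-divergence-based distances require binning.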