Detached Provenance Analysis

PhD Thesis, Universität Tübingen, 2020.

Data provenance is the research field of the algorithmic derivation of the source and processing history of data. In this work, the derivation of Where— and Why—provenance in sub—cell—level granularity is pursued for a rich SQL dialect. For example, we support the provenance analysis for individual elements of nested rows and/or arrays. The SQL dialect incorporates window functions and correlated subqueries.

We accomplish this goal using a novel method called detached provenance analysis. This method carries out a SQL—level rewrite of any user query Q, yielding (Q¹, Q²). Employing two queries facilitates a low—invasive provenance analysis, i.e. both queries can be evaluated using an unmodified DBMS as backend. The queries implement a split of responsibilities: Q¹ carries out a runtime analysis and Q² derives the actual data provenance. One drawback of this method is that a synchronization overhead between Q¹ and Q² is induced. Experiments quantify the overheads based on the TPC‒H benchmark and the PostgreSQL DBMS.

A second set of experiments carried out in row—level granularity compares our approach with the PERM approach (as described by B. Glavic et al.). The aggregated results show that basic queries (typically, a single SFW expression with aggregations) perform slightly better in the PERM approach while complex queries (nested SFW expressions and correlated subqueries) perform considerably better in our approach.