We explore new ways to derive the provenance (or lineage) of data items that flow through programs or queries. Once this provenance information has been derived, we know:
- exactly which input items led the program (or query) to emit which output items (Why and Where Provenance), as well as
- which program parts were involved in the computation of each single item (How Provenance).
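To make the three notions concrete, here is a minimal Python sketch. It is an illustration only, not the instrumentation technique developed in this project. The `orders` table, the `Tracked` class, and the helpers `cell` and `total_qty` are hypothetical. The sketch also assumes one common reading of the terms: Where provenance names the input cells an output value was computed from, Why provenance names the input cells inspected to decide that the output is emitted, and How provenance names the program steps involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tracked:
    """A value together with hand-maintained provenance annotations."""
    value: object
    where: frozenset = frozenset()  # input cells the value was computed from
    why: frozenset = frozenset()    # input cells inspected to decide on the output
    how: frozenset = frozenset()    # labels of the program steps involved


def cell(table, row, column, value):
    """Wrap an input value; a fresh cell depends only on itself."""
    return Tracked(value, where=frozenset({(table, row, column)}))


# Hypothetical input table orders(item, qty).
orders = [
    {"item": cell("orders", 0, "item", "apple"), "qty": cell("orders", 0, "qty", 3)},
    {"item": cell("orders", 1, "item", "pear"), "qty": cell("orders", 1, "qty", 7)},
    {"item": cell("orders", 2, "item", "apple"), "qty": cell("orders", 2, "qty", 5)},
]


def total_qty(rows, wanted):
    """Roughly: SELECT SUM(qty) FROM orders WHERE item = wanted."""
    total = 0
    where, why, how = frozenset(), frozenset(), frozenset()
    for row in rows:
        if row["item"].value == wanted:
            # The predicate inspected this item cell: Why provenance.
            why |= row["item"].where
            how |= {"filter"}
            # The qty cell flows into the result: Where provenance.
            total += row["qty"].value
            where |= row["qty"].where
            how |= {"sum"}
    return Tracked(total, where, why, how)


result = total_qty(orders, "apple")
print(result.value)
print(sorted(result.where))
print(sorted(result.why))
print(sorted(result.how))
```

In this sketch the total for `"apple"` depends on the `qty` cells of rows 0 and 2 (Where), was admitted because of the `item` cells of the same rows (Why), and involved the filter and sum steps (How). The bookkeeping is written by hand here, whereas the point of the project is to derive it automatically.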
Our exploration started with the analysis and instrumentation of Python programs used in scientific data processing (in the context of the ScienceCampus Tübingen). We are now adapting the resulting techniques so that they apply to the derivation of data provenance for relational queries, SQL in particular. This has the potential to yield very fine-grained provenance information for substantially larger SQL dialects than have been considered so far.
Tobias Müller • Pascal Engel
Proceedings of the 38th IEEE Int’l Conference on Data Engineering (ICDE 2022), Kuala Lumpur, Malaysia, May 2022.