Database Systems

Research

Current Research Projects

Recursive Computation Over Relational Data

Move your computation close to the data! This age-old mantra of the database community asserts that we can expect a SQL query engine with immediate access to the data to perform significantly better than an external processor to which we have to ship the data first. The lore holds up if the computation is query-like and primarily involves filtering, data (re-)combination, grouping, or aggregation.

It is much less clear how complex algorithms that rely on arbitrary iterative control flow or recursion can be efficiently evaluated inside the same SQL engines. With the advent of SQL:1999, contemporary database engines started to support forms of recursion. The associated language constructs, however, exhibit syntactic restrictions, are based on semantics that are tough to grasp for regular developers, or exhibit sobering runtime performance that often render iterative or recursive SQL impractical.

Rethinking Fixpoint Computation

SQL:1999 introduced WITH RECURSIVE—or recursive common table expressions (CTEs)—a true game changer which turned SQL into a Turing-complete programming language. WITH RECURSIVE is powerful and versatile, but proved to be notoriously hard to grasp and master.

In this line of work, we derive new CTE variants from the simple loop-based operational semantics of SQL:1999’s WITH RECURSIVE. In the absence of fixpoint-based semantics and monotonicity restrictions, these CTE variants enable a SQL authoring style that mimics imperative algorithms. This fresh look at CTEs has a beneficial impact on the readability and performance of iterative SQL queries.

UDF Compilation

“Move your computation close to the data” is an age-old mantra for developing efficient database-driven applications. However, following this advice requires a very intimate knowledge of the database query language at hand. In this branch of our research, we use techniques developed in other areas of computer science, such as programming languages, to enable you to write complex in-database computation without having to read the SQL standard from cover to cover.

Past Research Projects

Data Provenance for SQL

We explore new ways to derive the provenance (or lineage) of data items that flow through programs or queries. Once this provenance information has been derived, we know

  1. exactly which input items led the program (or query) to emit which output items (Why and Where Provenance), as well as
  2. which program parts were involved in the computation of each single item (How Provenance).

Our exploration started with the analysis and instrumentation of Python programs used in Scientific Data Processing (in the context of the ScienceCampus Tübingen). We now tweak and transfer the resulting techniques such that they apply to the derivation of data provenance for relational queries, SQL in particular. There is the potential to derive very fine-grained provenance information for substantially larger SQL dialects than were considered up to now.

Database Supported Haskell

Database-Supported Haskell, DSH for short, is a Haskell library for database-supported program execution. Using the DSH library, a relational database management system (RDBMS) can be used as a coprocessor for the Haskell programming language, especially for those program fragments that carry out data-intensive and data-parallel computations. Rather than embedding a relational language into Haskell, DSH turns idiomatic Haskell programs into SQL queries.

DSH in the Real World

We have used DSH for large scale data analysis. Specifically, in collaboration with researchers working in social and economic sciences, we used DSH to analyse the entire history of Wikipedia (terabytes of data) and a number of online forum discussions (gigabytes of data). Because of the scale of the data, it would be unthinkable to conduct the data analysis in Haskell without using the database-supported program execution technology featured in DSH. We have formulated several DSH queries directly in SQL as well and found that the equivalent DSH queries were much more concise, easier to write and maintain (mostly due to DSH’s support for nesting, Haskell’s abstraction facilities and the monad comprehension notation, see below). One long-term goal is to allow researchers who are not necessarily expert programmers or database engineers to conduct large scale data analysis themselves.

Ferry

Database-Supported Program Execution

With project Ferry we try to establish a connection between two somewhat distant shores: programming languages and database technology. Ferry explores how far we can push the idea of relational database engines that directly and seamlessly participate in program evaluation to support the super-fast execution of data-intensive programs written in a variety of (general purpose) programming languages. Ferry builds on technology developed in the context of our project Pathfinder.

Habitat

True Language-Level SQL Debugging

We present Habitat, a declarative observational debugger for SQL. Habitat facilitates true language-level (not: plan-level) debugging of, probably flawed, SQL queries that yield unexpected results. Users may mark arbitrary SQL subexpressions-ranging from literals, over fragments of predicates, to entire subquery blocks-to observe whether these evaluate as expected. The SQL query is transformed to additionally record the values of marked subexpressions during the process of evaluation. The notion of SQL subexpressions as functions of free variables allows to locate the observed values in the query evaluation process, in order to merge multiple observations into a single (nested) tabular display. This way a user of Habitat can explore the complex relationship of various observations. Some additional features furthermore ease the interpretation of large results. Currently we are working on a prototype of Habitat on top of the open source DBMS PostgreSQL. A first release of this prototype will soon be available for download on this website.

Pathfinder

A Purely Relational XQuery Processor

We believe that relational databases are the most researched and best engineered query processing infrastructures available today. They are able to efficiently query tons of data. By using a relational database as runtime environment for an XQuery processor we can port 30+ years of research to the XQuery domain and build a processor that is able to scale well with increasing input sizes. Pathfinder is a re-targetable query compiler that turns XQuery expressions into table algebra queries. While Pathfinder is tightly coupled with MonetDB we also provide a SQL code generator that allows any database to become a faithful XQuery processor.

Software/Hardware Codesign

These days software developers as well as hardware designers tend to live in their own worlds: Advancing abstraction let’s software developers forget about underlying hardware details.

Hardware designers in turn get out of touch with recent software development approaches.

As a consequence - even big companies having hardware as well as software development under one roof - are wasting valuable performance.

At the interface between hardware and software development occur opportunities to gain significant advantages through development of methods, best practices and tools.