A SQL Engine for dask
dask-sql adds a SQL query layer on top of dask.
This allows you to query and transform your dataframes using common SQL operations and enjoy the fast and scaling processing of
We use the timeseries random data from dask.datasets as an example, but any other data (from disk, S3, API, hdfs) can be used.
from dask_sql import Context from dask.datasets import timeseries # Create a context to hold the registered tables c = Context() # If you have a cluster of dask workers, # initialize it now # Load the data and register it in the context # This will give the table a name df = timeseries() c.create_table("timeseries", df) # Now execute an SQL query. The result is a dask dataframe # The query looks for the id with the highest x for each name # (this is just random test data, but you could think of looking # for outliers in the sensor data) result = c.sql(""" SELECT lhs.name, lhs.id, lhs.x FROM timeseries AS lhs JOIN ( SELECT name AS max_name, MAX(x) AS max_x FROM timeseries GROUP BY name ) AS rhs ON lhs.name = rhs.max_name AND lhs.x = rhs.max_x """) # Show the result... print(result.compute()) # ... or use it for any other dask calculation # (just an example, could also be done via SQL) print(result.x.mean().compute())
The API of
dask-sql is very similar to the one from blazingsql,
which makes interchanging distributed CPU and GPU calculation easy.
dask-sql is currently under development and does so far not understand all SQL commands.
We are actively looking for feedback, improvements and contributors!