dask-sql is a distributed SQL query engine in Python.
It allows you to query and transform your data using a mixture of
common SQL operations and Python code, and to scale the calculation easily
when you need it.
- Combine the power of Python and SQL: load your data with Python, transform it with SQL, enhance it with Python and query it with SQL - or the other way round (see the round-trip sketch after the example below). With dask-sql you can mix the well-known Python dataframe API of pandas and Dask with common SQL operations, to process your data in exactly the way that is easiest for you.
- Infinite Scaling: using the power of the great Dask ecosystem, your computations can scale as you need it - from your laptop to your super cluster - without changing a single line of SQL code. From k8s to cloud deployments, from batch systems to YARN - if Dask supports it, so will dask-sql (see the cluster sketch after this list).
- Your data - your queries: use Python user-defined functions (UDFs) in SQL without any performance drawback and extend your SQL queries with the large number of Python libraries, e.g. for machine learning, complicated input formats or complex statistics (see the UDF sketch after this list).
- Easy to install and maintain: dask-sql is just a pip/conda install away (pip install dask-sql, conda install -c conda-forge dask-sql, or a docker run if you prefer). No need for complicated cluster setups - dask-sql runs out of the box on your machine and can easily be connected to your computing cluster.
- Use SQL from wherever you like: dask-sql integrates with your Jupyter notebook, your normal Python module or can be used as a standalone SQL server from any BI tool. It even integrates natively with Apache Hue (see the server sketch after the example below).
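To make the scaling point concrete, here is a minimal sketch of pointing dask-sql at a cluster: since dask-sql executes on whatever Dask scheduler is active, creating a dask.distributed.Client before running queries is all that is needed. The scheduler address below is a placeholder.

```python
from dask.distributed import Client
from dask_sql import Context

# Connect to an existing Dask scheduler (placeholder address).
# Calling Client() without arguments would start a local cluster instead.
client = Client("tcp://my-scheduler:8786")

# Every query issued through this context now runs on the cluster -
# the SQL itself stays exactly the same.
c = Context()
```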
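And a minimal sketch of the UDF feature, using the register_function API from the dask-sql documentation; the function, table and column names are made up for illustration:

```python
import numpy as np
import pandas as pd
from dask_sql import Context

c = Context()
c.create_table("my_data", pd.DataFrame({"x": [1.0, 2.0, 3.0]}))

# A plain Python function: it is called with the column data,
# so vectorized numpy code works without changes.
def my_exp(x):
    return np.exp(x)

# Register it under the SQL name "my_exp" together with its
# parameter and return types, then use it like any SQL function.
c.register_function(my_exp, "my_exp", [("x", np.float64)], np.float64)

result = c.sql("SELECT my_exp(x) AS y FROM my_data")
print(result.compute())
```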
For this example, we load some data from disk and query it with a SQL command from our Python code.
Any pandas or Dask dataframe can be used as input and
dask-sql understands a large number of formats (csv, parquet, json, …) and locations (s3, hdfs, gcs, …).
```python
import dask.dataframe as dd
from dask_sql import Context

# Create a context to hold the registered tables
c = Context()

# Load the data and register it in the context -
# this gives the table a name that we can use in queries
df = dd.read_csv("...")
c.create_table("my_data", df)

# Now execute a SQL query. The result is again a Dask dataframe.
# The aggregated column is aliased as "x", so that it can be
# accessed via the dataframe API below.
result = c.sql("""
    SELECT
        my_data.name,
        SUM(my_data.x) AS x
    FROM my_data
    GROUP BY my_data.name
""")

# Show the (still lazy) result
print(result)

# Trigger the computation and show the data...
print(result.compute())

# ... or use it for any other Dask calculation
print(result.x.mean().compute())
```
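Because the query result is an ordinary Dask dataframe, it can be enhanced with Python and registered as a table again - the Python/SQL round trip described above. A short sketch continuing the example (the new table name and the threshold are arbitrary):

```python
# Enhance the result with Python ...
result = result.assign(x_doubled=result.x * 2)

# ... register it as a new table ...
c.create_table("my_summary", result)

# ... and keep querying it with SQL.
print(c.sql("SELECT name FROM my_summary WHERE x_doubled > 10").compute())
```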
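Finally, the same context can be exposed as a standalone SQL server for BI tools (the dask-sql server speaks the presto wire protocol, and can also be started from the command line with dask-sql-server). A sketch under the assumption that run_server behaves as described in the dask-sql documentation; host and port are example values:

```python
from dask_sql.server.app import run_server

# Serve the tables registered on c (from the example above),
# so that any presto client or BI tool can connect.
run_server(context=c, host="0.0.0.0", port=8080)
```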
The API of dask-sql is very similar to the one of blazingsql,
which makes it easy to switch between distributed CPU and GPU calculations.
dask-sql is currently under development and does not yet understand all SQL commands.
We are actively looking for feedback, improvements and contributors!