dask-sql
dask-sql is a distributed SQL query engine in Python. It allows you to query and transform your data using a mixture of common SQL operations and Python code, and also to scale up the calculation easily if you need it.
- Combine the power of Python and SQL: load your data with Python, transform it with SQL, enhance it with Python and query it with SQL - or the other way round. With dask-sql you can mix the well-known Python dataframe API of pandas and Dask with common SQL operations, to process your data in exactly the way that is easiest for you.
- Infinite scaling: using the power of the great Dask ecosystem, your computations can scale as you need it - from your laptop to your super cluster - without changing a single line of SQL code. From k8s to cloud deployments, from batch systems to YARN - if Dask supports it, so will dask-sql.
- Your data - your queries: use Python user-defined functions (UDFs) in SQL without any performance drawback, and extend your SQL queries with the large number of Python libraries, e.g. for machine learning, complicated input formats or complex statistics.
- Easy to install and maintain: dask-sql is just a pip/conda install away (or a docker run, if you prefer). No need for complicated cluster setups - dask-sql will run out of the box on your machine and can easily be connected to your computing cluster.
- Use SQL from wherever you like: dask-sql integrates with your Jupyter notebook, your normal Python module, or can be used as a standalone SQL server from any BI tool. It even integrates natively with Apache Hue.
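Getting started is a single command. A sketch of the install options mentioned above, using the package name as published on PyPI and conda-forge:

```shell
# Install from PyPI ...
pip install dask-sql

# ... or from the conda-forge channel
conda install -c conda-forge dask-sql
```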
Example
For this example, we load some data from disk and query it with a SQL command from our Python code.
Any pandas or Dask dataframe can be used as input, and dask-sql
understands a large number of formats (csv, parquet, json, ...) and locations (s3, hdfs, gcs, ...).
import dask.dataframe as dd
from dask_sql import Context
# Create a context to hold the registered tables
c = Context()
# Load the data and register it in the context
# This will give the table a name that we can use in queries
df = dd.read_csv("...")
c.create_table("my_data", df)
# Now execute a SQL query. The result is again a Dask dataframe.
result = c.sql("""
    SELECT
        my_data.name,
        SUM(my_data.x) AS sum_x
    FROM
        my_data
    GROUP BY
        my_data.name
""")

# result is lazy: printing it only shows the structure ...
print(result)

# ... trigger the computation to show the actual values ...
print(result.compute())

# ... or use it in any other Dask calculation
print(result.sum_x.mean().compute())
The API of dask-sql is very similar to that of BlazingSQL,
which makes it easy to interchange distributed CPU and GPU calculations.
Note
dask-sql is currently under development and does not yet understand all SQL commands.
We are actively looking for feedback, improvements and contributors!