dask-sql
Contents
dask-sql¶
dask-sql
is a distributed SQL query engine in Python.
It allows you to query and transform your data using a mixture of
common SQL operations and Python code and also scale up the calculation easily
if you need it.
Combine the power of Python and SQL: load your data with Python, transform it with SQL, enhance it with Python and query it with SQL - or the other way round. With
dask-sql
you can mix the well known Python dataframe API of pandas andDask
with common SQL operations, to process your data in exactly the way that is easiest for you.Infinite Scaling: using the power of the great
Dask
ecosystem, your computations can scale as you need it - from your laptop to your super cluster - without changing any line of SQL code. From k8s to cloud deployments, from batch systems to YARN - ifDask
supports it, so willdask-sql
.Your data - your queries: Use Python user-defined functions (UDFs) in SQL without any performance drawback and extend your SQL queries with the large number of Python libraries, e.g. machine learning, different complicated input formats, complex statistics.
Easy to install and maintain:
dask-sql
is just a pip/conda install away (or a docker run if you prefer).Use SQL from wherever you like:
dask-sql
integrates with your jupyter notebook, your normal Python module or can be used as a standalone SQL server from any BI tool. It even integrates natively with Apache Hue.GPU Support:
dask-sql
has support for running SQL queries on CUDA-enabled GPUs by utilizing RAPIDS libraries like cuDF , enabling accelerated compute for SQL.
Example¶
For this example, we use some data loaded from disk and query it with a SQL command.
dask-sql
accepts any pandas, cuDF, or dask dataframe as input and is able to read data directly from a variety of storage formats (CSV, Parquet, JSON) and file systems (S3, hdfs, gcs):
import dask.datasets
from dask_sql import Context
# create a context to register tables
c = Context()
# create a table and register it in the context
df = dask.datasets.timeseries()
c.create_table("timeseries", df)
# execute a SQL query; the result is a "lazy" Dask dataframe
result = c.sql("""
SELECT
name, SUM(x) as "sum"
FROM
timeseries
GROUP BY
name
""")
# actually compute the query...
result.compute()
# ...or use it for another computation
result["sum"].mean().compute()
import dask.datasets
from dask_sql import Context
# create a context to register tables
c = Context()
# create a table and register it in the context
df = dask.datasets.timeseries()
c.create_table("timeseries", df, gpu=True)
# execute a SQL query; the result is a "lazy" Dask dataframe
result = c.sql("""
SELECT
name, SUM(x) as "sum"
FROM
timeseries
GROUP BY
name
""")
# actually compute the query...
result.compute()
# ...or use it for another computation
result["sum"].mean().compute()
Note
dask-sql
is currently under development and does so far not understand all SQL commands.
We are actively looking for feedback, improvements and contributors!