After Installation, you can start querying your data using SQL.
Run the following code in an interactive Python session, a Python script or a Jupyter Notebook.
0. Cluster Setup¶
If you just want to try out
dask-sql quickly, this step can be skipped.
However, the real magic of
dask-sql) comes from the ability to scale the computations over multiple cores and/or machines.
For local development and testing, a Distributed
LocalCluster (or, if using GPUs, a Dask-CUDA
LocalCUDACluster) can be deployed and a client connected to it like so:
from distributed import Client, LocalCluster cluster = LocalCluster() client = Client(cluster)
from dask_cuda import LocalCUDACluster from distributed import Client cluster = LocalCUDACluster() client = Client(cluster)
There are several options for deploying clusters depending on the platform being used and the resources available; see Dask - Deploying Clusters for more information.
1. Data Loading¶
Before querying the data, you need to create a
dask data frame containing the data.
dask understands many different input formats and sources.
In this example, we do not read in external data, but use test data in the form of random event time series:
import dask.datasets df = dask.datasets.timeseries()
Read more on the data input part in Data Loading and Input.
2. Data Registration¶
If we want to work with the data in SQL, we need to give the data frame a unique name.
We do this by registering the data in an instance of a
from dask_sql import Context c = Context() c.create_table("timeseries", df)
from dask_sql import Context c = Context() c.create_table("timeseries", df, gpu=True)
From now on, the data is accessible as the
timeseries table of this context.
It is possible to register multiple data frames in the same context.
If you plan to query the same data multiple times, it might make sense to persist the data before:
c.create_table("timeseries", df, persist=True)
c.create_table("timeseries", df, persist=True, gpu=True)
3. Run your queries¶
Now you can go ahead and query the data with normal SQL!
result = c.sql(""" SELECT name, SUM(x) AS "sum" FROM timeseries WHERE x > 0.5 GROUP BY name """) result.compute()
dask-sql understands a large fraction of SQL commands, but there are still some missing.
Have a look into the SQL Syntax description for more information.
If you are using
dask-sql from a Jupyter notebook, you might be interested in the
sql magic function:
c.ipython_magic() %%sql SELECT name, SUM(x) AS "sum" FROM timeseries WHERE x > 0.5 GROUP BY name
If you have found an SQL feature, which is currently not supported by
please raise an issue on our issue tracker.