Quickstart
Contents
Quickstart¶
After Installation, you can start querying your data using SQL.
Run the following code in an interactive python session, a python script or a jupyter notebook.
0. Cluster Setup¶
If you just want to try out dask-sql
quickly, this step can be skipped.
However, the real magic of dask
(and dask-sql
) comes from the ability to scale the computations over multiple cores and/or machines.
For local development and testing, a Distributed LocalCluster
(or, if using GPUs, a Dask-CUDA LocalCUDACluster
) can be deployed and a client connected to it like so:
from distributed import Client, LocalCluster
cluster = LocalCluster()
client = Client(cluster)
from dask_cuda import LocalCUDACluster
from distributed import Client
cluster = LocalCUDACluster()
client = Client(cluster)
There are several options for deploying clusters depending on the platform being used and the resources available; see Dask - Deploying Clusters for more information.
1. Data Loading¶
Before querying the data, you need to create a dask
data frame containing the data.
dask
understands many different input formats and sources.
In this example, we do not read in external data, but use test data in the form of random event time series:
import dask.datasets
df = dask.datasets.timeseries()
Read more on the data input part in Data Loading and Input.
2. Data Registration¶
If we want to work with the data in SQL, we need to give the data frame a unique name.
We do this by registering the data in an instance of a Context
:
from dask_sql import Context
c = Context()
c.create_table("timeseries", df)
from dask_sql import Context
c = Context()
c.create_table("timeseries", df, gpu=True)
From now on, the data is accessible as the timeseries
table of this context.
It is possible to register multiple data frames in the same context.
Hint
If you plan to query the same data multiple times, it might make sense to persist the data before:
c.create_table("timeseries", df, persist=True)
c.create_table("timeseries", df, persist=True, gpu=True)
3. Run your queries¶
Now you can go ahead and query the data with normal SQL!
result = c.sql("""
SELECT
name, SUM(x) AS "sum"
FROM timeseries
WHERE x > 0.5
GROUP BY name
""")
result.compute()
dask-sql
understands a large fraction of SQL commands, but there are still some missing.
Have a look into the SQL Syntax description for more information.
If you are using dask-sql
from a Jupyter notebook, you might be interested in the sql
magic function:
c.ipython_magic()
%%sql
SELECT
name, SUM(x) AS "sum"
FROM timeseries
WHERE x > 0.5
GROUP BY name
Note
If you have found an SQL feature, which is currently not supported by dask-sql
,
please raise an issue on our issue tracker.