Quickstart
After Installation, you can start querying your data using SQL.
Run the following code in an interactive Python session, a Python script, or a Jupyter notebook.
0. Cluster Setup
If you just want to try out dask-sql quickly, you can skip this step at first.
However, the real magic of dask (and dask-sql) comes from the ability to scale computations across multiple machines.
There are many ways to set up a dask cluster.
For local development and testing, you can set up a distributed version of dask with:
from dask.distributed import Client
client = Client()
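If you want more control over the local cluster, the keyword arguments of Client are forwarded to the local cluster it creates. The following is a minimal sketch; the worker count, the thread count, and the choice to keep the workers inside the current process (processes=False) are assumptions made for illustration, not recommendations.

```python
from dask.distributed import Client

# Assumption: two single-threaded workers living inside the current process.
# These values are illustrative only; pick what fits your machine.
client = Client(n_workers=2, threads_per_worker=1, processes=False)

# Ask the scheduler how many workers have joined the cluster.
n_workers = len(client.scheduler_info()["workers"])
print(n_workers)

# Shut the local cluster down again when you are done with it.
client.close()
```

Running the workers in-process keeps the sketch usable from a plain script without a main guard; for real workloads, separate worker processes are often the better choice.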
1. Data Loading
Before querying the data, you need to create a dask data frame containing the data.
dask understands many different input formats and sources.
In this example, we do not read in external data, but use test data in the form of random event time series.
import dask.datasets
df = dask.datasets.timeseries()
Read more on the data input part in Data Loading and Input.
2. Data Registration
If we want to work with the data in SQL, we need to give the data frame a unique name.
We do this by registering the data at an instance of a Context.
Typically, you only have a single context per application.
from dask_sql import Context
c = Context()
c.create_table("timeseries", df)
From now on, the data is accessible as the “timeseries” table of this context. It is possible to register multiple data frames at the same context.
Hint
If you plan to query the same data multiple times, it might make sense to persist the data before registering it:
df = df.persist()
c.create_table("timeseries", df)
3. Run your queries
Now you can go ahead and query the data with normal SQL!
result = c.sql("""
    SELECT
        name, SUM(x) AS "sum"
    FROM timeseries
    WHERE x > 0.5
    GROUP BY name
""")
result.compute()
dask-sql understands a large fraction of SQL commands, but some are still missing.
See the SQL Syntax description for more information.
If you are using dask-sql from a Jupyter notebook, you might be interested in the sql magic function:
c.ipython_magic()
%%sql
SELECT
    name, SUM(x) AS "sum"
FROM timeseries
WHERE x > 0.5
GROUP BY name
Note
If you have found an SQL feature that is not currently supported by dask-sql, please raise an issue on our issue tracker.