Quickstart

After Installation, you can start querying your data using SQL.

Run the following code in an interactive Python session, a Python script or a Jupyter Notebook.

0. Cluster Setup

If you just want to try out dask-sql quickly, this step can be skipped. However, the real magic of dask (and dask-sql) comes from the ability to scale the computations over multiple cores and/or machines. For local development and testing, a Distributed LocalCluster (or, if using GPUs, a Dask-CUDA LocalCUDACluster) can be deployed and a client connected to it like so:

from distributed import Client, LocalCluster

cluster = LocalCluster()
client = Client(cluster)

There are several options for deploying clusters depending on the platform being used and the resources available; see Dask - Deploying Clusters for more information.

1. Data Loading

Before querying the data, you need to create a dask data frame containing the data. dask understands many different input formats and sources. In this example, we do not read in external data, but use test data in the form of random event time series:

import dask.datasets

df = dask.datasets.timeseries()

Read more on the data input part in Data Loading and Input.

2. Data Registration

If we want to work with the data in SQL, we need to give the data frame a unique name. We do this by registering the data in an instance of a Context:

from dask_sql import Context

c = Context()
c.create_table("timeseries", df)

From now on, the data is accessible as the timeseries table of this context. It is possible to register multiple data frames in the same context.

Hint

If you plan to query the same data multiple times, it might make sense to persist the data before:

c.create_table("timeseries", df, persist=True)

3. Run your queries

Now you can go ahead and query the data with normal SQL!

result = c.sql("""
    SELECT
        name, SUM(x) AS "sum"
    FROM timeseries
    WHERE x > 0.5
    GROUP BY name
""")
result.compute()

dask-sql understands a large fraction of SQL commands, but there are still some missing. Have a look into the SQL Syntax description for more information.

If you are using dask-sql from a Jupyter notebook, you might be interested in the sql magic function:

c.ipython_magic()

%%sql
SELECT
    name, SUM(x) AS "sum"
FROM timeseries
WHERE x > 0.5
GROUP BY name

Note

If you have found an SQL feature, which is currently not supported by dask-sql, please raise an issue on our issue tracker.