SQL Syntax

dask-sql understands SQL in (mostly) Presto SQL syntax. Not every valid SQL operator and keyword is implemented in dask-sql yet, but a large fraction is. Have a look at our issue tracker to find out what is still missing.

dask-sql understands queries for data retrieval (SELECT), queries on metadata information (SHOW and DESCRIBE), queries for table creation (CREATE TABLE) and machine learning (CREATE MODEL and PREDICT). In the following, general information on these queries is given - the sub-pages give details on each of the implemented keywords or operations. The information on these pages applies to all ways SQL queries can be handed over to dask-sql: via Python (API Documentation), the SQL server (SQL Server) or the command line (Command Line Tool).
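For illustration, here is a minimal sketch of the Python route; the table name "df" and the example data are made up for this snippet:

import pandas as pd
import dask.dataframe as dd
from dask_sql import Context

# Register an example dask dataframe as the table "df" ...
c = Context()
df = dd.from_pandas(pd.DataFrame({"name": ["a", "b"], "x": [1, 2]}), npartitions=1)
c.create_table("df", df)

# ... and query it; the result is a (lazy) dask dataframe.
result = c.sql('SELECT "name" FROM "df"')
print(result.compute())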

General

Data in dask-sql is - similar to most SQL systems - grouped in named tables, which consist of columns (with names and data types) and rows. The tables are in turn grouped into schemas. For simplicity, there exists only a single schema, named "schema".

For many queries, it is necessary to refer to a schema, table, or column. Identifiers can be specified with double quotes or without quotes (if there is no ambiguity with SQL keywords). Casing is preserved, with or without quotes.

SELECT
    "date", "name"
FROM
    "df"

"date" definitely needs quotation marks (as DATE is also an SQL keyword), but name and df can also be specified without quotation marks.

To prevent ambiguities, the full table identifier can be used:

SELECT
    "df"."date"
FROM
    "schema"."df"

In many cases however, the bare name is enough:

SELECT
    "date"
FROM
    "df"

String literals get single quotes:

SELECT 'string literal'

Note

dask-sql can only understand a single SQL query per call to Context.sql. Therefore, do not add a semicolon after the query.
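For example (a sketch; the queries are just placeholders):

from dask_sql import Context

c = Context()
c.sql("SELECT 'first query'")   # one statement per call, no trailing semicolon
c.sql("SELECT 'second query'")  # a second statement needs a second call
# c.sql("SELECT 1; SELECT 2")   # multiple statements in one call do NOT work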

Some SQL statements, like CREATE MODEL WITH and CREATE TABLE WITH, expect a list of key-value arguments, which resembles (not accidentally) a Python dictionary. They have the form

(
    key = value
    [ , ... ]
)

with an arbitrary number of key-value pairs, always enclosed in parentheses. The keys are unquoted (similar to Python's dict constructor). A value can be any valid SQL literal (e.g. 3, 4.2, 'string'), a key-value parameter list itself, a list (ARRAY), a set (MULTISET), or a dictionary (written with MAP).

This means the following is a valid key-value parameter list:

(
    first_argument = 3,
    second_argument = MULTISET [ 1, 1, 2, 3 ],
    third_argument = (
        sub_argument_1 = ARRAY [ 1, 2, 3 ],
        sub_argument_2 = 'a string'
    )
)

Please note that, in contrast to Python, no trailing comma is allowed after the last argument.
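As an illustration of how such a parameter list is used in practice, here is a sketch of a CREATE TABLE WITH call via the Python API; the keys location, format and persist are described on the CREATE TABLE sub-page, and the file path is a placeholder:

from dask_sql import Context

c = Context()
# The keys are unquoted, string values use single quotes,
# and there is no comma after the last pair.
c.sql("""
    CREATE TABLE "my_table" WITH (
        location = '/path/to/data.csv',
        format = 'csv',
        persist = True
    )
""")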

Implemented Types

dask-sql needs to map between SQL and dask (Python) types. For this, it uses the following mapping:

From Python Type      To SQL Type
np.bool_              BOOLEAN
np.datetime64         TIMESTAMP
np.float32            FLOAT
np.float64            DOUBLE
np.int16              SMALLINT
np.int32              INTEGER
np.int64              BIGINT
np.int8               TINYINT
np.object_            VARCHAR
np.uint16             SMALLINT
np.uint32             INTEGER
np.uint64             BIGINT
np.uint8              TINYINT
pd.BooleanDtype       BOOLEAN
pd.Int16Dtype         SMALLINT
pd.Int32Dtype         INTEGER
pd.Int64Dtype         BIGINT
pd.Int8Dtype          TINYINT
pd.StringDtype        VARCHAR
pd.UInt16Dtype        SMALLINT
pd.UInt32Dtype        INTEGER
pd.UInt64Dtype        BIGINT
pd.UInt8Dtype         TINYINT

From SQL Type     To Python Type
BIGINT            pd.Int64Dtype
BOOLEAN           pd.BooleanDtype
CHAR(*)           pd.StringDtype
DATE              np.dtype("<M8[ns]")
DECIMAL(*)        np.float64
DOUBLE            np.float64
FLOAT             np.float32
INTEGER           pd.Int32Dtype
INTERVAL          np.dtype("<m8[ns]")
SMALLINT          pd.Int16Dtype
TIME(*)           np.dtype("<M8[ns]")
TIMESTAMP(*)      np.dtype("<M8[ns]")
TINYINT           pd.Int8Dtype
VARCHAR           pd.StringDtype
VARCHAR(*)        pd.StringDtype
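The mapping can be inspected directly by registering a dataframe and running a metadata query such as DESCRIBE (see above). A small sketch with made-up column names; the exact output layout may differ:

import pandas as pd
import dask.dataframe as dd
from dask_sql import Context

c = Context()
df = dd.from_pandas(
    pd.DataFrame({
        "i": pd.array([1, 2], dtype="Int64"),  # pd.Int64Dtype -> BIGINT
        "f": [1.5, 2.5],                       # np.float64    -> DOUBLE
        "s": ["x", "y"],                       # np.object_    -> VARCHAR
    }),
    npartitions=1,
)
c.create_table("df", df)

# Metadata queries also return dask dataframes and need .compute()
print(c.sql('DESCRIBE "df"').compute())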

Limitations

dask-sql is still in early development, therefore some limitations exist:

Not all operations and aggregations are implemented yet.

Note

Whenever you find an operation, keyword, or functionality that is not yet implemented, please raise an issue at our issue tracker describing your use case.

Dask/pandas and SQL treat null values (or NaN) differently when sorting, grouping and joining. dask-sql tries to follow the SQL standard as much as possible, so results might differ from what you expect from Dask/pandas.

Apart from those functional limitations, there is one operation which needs special care: ORDER BY. Normally, dask-sql calls create a dask dataframe, which is only computed when you call its .compute() method. Due to internal constraints, this is currently not the case for ORDER BY. Including this operation will trigger a computation of the full dataframe already when calling Context.sql().
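In code, the difference looks roughly like this (a sketch with placeholder data):

import pandas as pd
import dask.dataframe as dd
from dask_sql import Context

c = Context()
c.create_table("df", dd.from_pandas(pd.DataFrame({"x": [3, 1, 2]}), npartitions=1))

lazy = c.sql('SELECT "x" FROM "df"')                  # lazy: nothing computed yet
ordered = c.sql('SELECT "x" FROM "df" ORDER BY "x"')  # computes the full data already here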

Warning

There is a subtle but important difference between adding LIMIT 10 to your SQL query and calling sql(...).head(10). The data inside Dask is partitioned to distribute it over the cluster. head will only return the first N elements from the first partition - even if N is larger than the partition size. As a benefit, calling .head(N) is typically faster than computing the full data sample with .compute(). LIMIT, on the other hand, will always return the first N elements - no matter over how many partitions they are scattered - but it also needs to precompute the first partition to find out whether it has to look into all the data or not.
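A sketch of the two variants, with placeholder data:

import pandas as pd
import dask.dataframe as dd
from dask_sql import Context

c = Context()
c.create_table("df", dd.from_pandas(pd.DataFrame({"x": range(100)}), npartitions=10))

# LIMIT: always the first 10 rows, however many partitions they span.
first_10 = c.sql('SELECT "x" FROM "df" LIMIT 10').compute()

# head: only looks into the first partition - typically faster, but may
# return fewer rows than requested if that partition is small.
first_10_head = c.sql('SELECT "x" FROM "df"').head(10)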