SQL Syntax¶
dask-sql understands SQL in (mostly) presto SQL syntax.
So far, not every valid SQL operator and keyword is already
implemented in dask-sql, but a large fraction of it.
Have a look into our issue tracker
to find out what is still missing.
dask-sql understands queries for data retrieval (SELECT), queries on metadata information (SHOW and DESCRIBE), queries for table creation (CREATE TABLE) and machine learning (CREATE MODEL and PREDICT).
In the following, general information for these queries are given - the sub-pages give details on each of the implemented keywords or operations.
The information on these pages apply to all ways SQL queries can be handed over to dask-sql: via Python (API Documentation), the SQL server (SQL Server) or the command line (Command Line Tool).
General¶
Data in dask-sql is - similar to most SQL systems - grouped in named tables, which consist of columns (with names and data types) and rows.
The tables are again grouped into schemas.
For simplicity, there only exists a single schema, named “schema”.
For many queries, it is necessary to refer to a schema, table, or column. Identifiers can be specified with double quotes or without quotes (if there is no ambiguity with SQL keywords). Casing will be kept (with or without quotes).
SELECT
"date", "name"
FROM
"df"
"date" definitely needs quotation marks (as DATE is also an SQL keyword), but name and df can also be specified without quotation marks.
To prevent ambiguities, the full table identifier can be used:
SELECT
"df"."date"
FROM
"schema"."df"
In many cases however, the bare name is enough:
SELECT
"date"
FROM
"df"
String literals get single quotes:
SELECT 'string literal'
Note
dask-sql can only understand a single SQL query per call to Context.sql.
Therefore, there should also be no semicolons after the query.
Some SQL statements, like CREATE MODEL WITH and CREATE TABLE WITH expect a list of key-value arguments,
which resemble (not accidentally) a Python dictionary. They are in the form
(
key = value
[ , ... ]
)
with an arbitrary number of key-value pairs and always are enclosed in brackets. The keys are (similar to Pythons dict constructor) unquoted.
A value can be any valid SQL literal (e.g. 3, 4.2, 'string'), a key-value parameter list itself
or a list (ARRAY) or a set (MULTISET) (or another way of writing dictionaries with MAP).
This means, the following is a valid key-value parameter list:
(
first_argument = 3,
second_argument = MULTISET [ 1, 1, 2, 3 ],
third_argument = (
sub_argument_1 = ARRAY [ 1, 2, 3 ],
sub_argument_2 = 'a string'
)
)
Please note that, in contrast to python, no comma is allowed after the last argument.
Query Types and Reference¶
Implemented Types¶
dask-sql needs to map between SQL and dask (python) types.
For this, it uses the following mapping:
From Python Type |
To SQL Type |
|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
From SQL Type |
To Python Type |
|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Limitatons¶
dask-sql is still in early development, therefore exist some limitations:
Not all operations and aggregations are implemented already.
Note
Whenever you find a not already implemented operation, keyword or functionality, please raise an issue at our issue tracker with your use-case.
Dask/pandas and SQL treat null-values (or nan) differently on sorting, grouping and joining.
dask-sql tries to follow the SQL standard as much as possible, so results might be different to what you expect from Dask/pandas.
Apart from those functional limitations, there is a operation which need special care: ORDER BY`.
Normally, dask-sql calls create a dask data frame, which gets only computed when you call the .compute() member.
Due to internal constraints, this is currently not the case for ORDER BY.
Including this operation will trigger a calculation of the full data frame already when calling Context.sql().
Warning
There is a subtle but important difference between adding LIMIT 10 to your SQL query and calling sql(...).head(10).
The data inside dask is partitioned, to distribute it over the cluster.
head will only return the first N elements from the first partition - even if N is larger than the partition size.
As a benefit, calling .head(N) is typically faster than calculating the full data sample with .compute().
LIMIT on the other hand will always return the first N elements - no matter on how many partitions they are scattered -
but will also need to precalculate the first partition to find out, if it needs to have a look into all data or not.