Data Retrieval¶
Query data from already created tables. The SELECT
call follows mostly the standard SQL conventions,
including all typical ingredients (such as WHERE
, GROUP BY
, ORDER BY
etc.).
SELECT [ ALL | DISTINCT ]
* | <expression> [ [ AS ] <alias> ] [ , ... ]
[ FROM <from> [ , ... ] ]
[ WHERE <filter-condition> ]
[ GROUP BY <group-by> ]
[ HAVING <having-condition> ]
[ UNION [ ALL | DISTINCT ] <select> ]
[ ORDER BY <order-by> [ ASC | DESC ] [ , ... ] ]
[ LIMIT <end> ]
[ OFFSET <start> ]
Note
If you would like to help, please see [our issue tracker](https://github.com/nils-braun/dask-sql/issues/43).
Example:
SELECT
name, SUM(x) AS s
FROM
data
WHERE
y < 3 AND x > 0.5
GROUP BY
name
HAVING
SUM(x) < 5
UNION SELECT
'myself' AS name, 42 AS s
ORDER BY
s
LIMIT 100
Also (all kind of) joins and (complex) subqueries are possible:
SELECT
lhs.name, lhs.id, lhs.x
FROM
data AS lhs
JOIN
(
SELECT
name AS max_name,
MAX(x) AS max_x
FROM timeseries
GROUP BY name
) AS rhs
ON
lhs.name = rhs.max_name AND
lhs.x = rhs.max_x
For complex queries with many subqueries, it might be beneficial to use WITH
for temporary table definitions:
WITH tmp AS (
SELECT MAX(b) AS maxb from df GROUP BY a
)
SELECT
maxb
FROM tmp
Implemented operations¶
The following list includes all operations understood and implemented in dask-sql
.
Scalar functions can be used to turn a column (or multiple) into a column of the same length (such as x + y
or sin(x)
)
whereas aggregation functions can only be used in GROUP BY
clauses, as they
turn a column into a single value.
For more information on the semantic of the different functions, please have a look into the
Apache Calcite documentation.
Scalar Functions¶
Binary Operations: AND
, OR
, >
, >=
, <
, <=
, =
, <>
, +
, -
, /
, *
Unary Math Operations: ABS
, ACOS
, ASIN
, ATAN
, ATAN2
, CBRT
, CEIL
, COS
, COT
, DEGREES
, EXP
, FLOOR
, LOG10
, LN
, POWER
, RADIANS
, ROUND
, SIGN
, SIN
, TAN
, TRUNCATE
String operations: LIKE
, SIMILAR TO
, ||
, CHAR_LENGTH
, UPPER
, LOWER
, POSITION
, TRIM
, OVERLAY
, SUBSTRING
, INITCAP
Date operations: EXTRACT
, YEAR
, QUARTER
, MONTH
, WEEK
, DAYOFYEAR
, DAYOFMONTH
, DAYOFWEEK
, HOUR
, MINUTE
, SECOND
, LOCALTIME
, LOCALTIMESTAMP
, CURRENT_TIME
, CURRENT_DATE
, CURRENT_TIMESTAMP
Note
Due to a bug/inconsistency in Apache Calcite, both the CURRENTTIME
and LOCALTIME
return a time without timezone and are therefore the same functionality.
Special Operations: CASE
, NOT
, IS NULL
, IS NOT NULL
, IS TRUE
, IS NOT TRUE
, IS FALSE:
, IS NOT FALSE
, IS UNKNOWN
, IS NOT UNKNOWN
, EXISTS
, RAND
, RAND_INTEGER
Example:
SELECT
SIN(x)
FROM "data"
WHERE MONTH(t) = 4
Note
It is also possible to implement custom functions. See Custom Functions and Aggregations.
Aggregations¶
ANY_VALUE
, AVG
, BIT_AND
, BIT_OR
, BIT_XOR
, COUNT
, EVERY
, MAX
, MIN
, SINGLE_VALUE
, STDDEV_POP
, STDDEV_SAMP
, SUM
, VAR_POP
, VAR_SAMP
, VARIANCE
Example:
SELECT
SUM(x)
FROM "data"
GROUP BY y
Note
It is also possible to implement custom aggregations. See Custom Functions and Aggregations.
Windowing/Over¶
ROW_NUMBER
, SUM
, AVG
, COUNT
, MAX
, MIN
, SINGLE_VALUE
, FIRST_VALUE
, LAST_VALUE
Example:
SELECT
y,
SUM(x) OVER (PARTITION BY z ORDER BY a NULLS FIRST)
FROM "data"
Note
Again, it is also possible to implement custom windowing functions.
Table Functions¶
TABLESAMPLE SYSTEM
and TABLESAMPLE BERNOULLI
:
Example:
SELECT * FROM "data" TABLESAMPLE BERNOULLI (20) REPEATABLE (42)
TABLESAMPLE
allows to draw random samples from the given table and should be the preferred way
to select samples. BERNOULLI
will select a row in the original table with a probability
given by the number in the brackets (in percentage). The optional flag REPEATABLE
defines
the random seed to use.
SYSTEM
is similar, but acts on partitions (so blocks of data) and is therefore much more
inaccurate and should only ever be used on really large data samples where BERNOULLI
is not
fast enough (which is very unlikely).