Query data from already created tables. The
SELECT call follows mostly the standard SQL conventions,
including all typical ingredients (such as
ORDER BY etc.).
SELECT [ ALL | DISTINCT ] * | <expression> [ [ AS ] <alias> ] [ , ... ] [ FROM <from> [ , ... ] ] [ WHERE <filter-condition> ] [ GROUP BY <group-by> ] [ HAVING <having-condition> ] [ UNION [ ALL | DISTINCT ] <select> ] [ ORDER BY <order-by> [ ASC | DESC ] [ , ... ] ] [ LIMIT <end> ] [ OFFSET <start> ]
If you would like to help, please see [our issue tracker](https://github.com/dask-contrib/dask-sql/issues/43).
SELECT name, SUM(x) AS s FROM data WHERE y < 3 AND x > 0.5 GROUP BY name HAVING SUM(x) < 5 UNION SELECT 'myself' AS name, 42 AS s ORDER BY s LIMIT 100
Also (all kind of) joins and (complex) subqueries are possible:
SELECT lhs.name, lhs.id, lhs.x FROM data AS lhs JOIN ( SELECT name AS max_name, MAX(x) AS max_x FROM timeseries GROUP BY name ) AS rhs ON lhs.name = rhs.max_name AND lhs.x = rhs.max_x
For complex queries with many subqueries, it might be beneficial to use
for temporary table definitions:
WITH tmp AS ( SELECT MAX(b) AS maxb from df GROUP BY a ) SELECT maxb FROM tmp
The following list includes all operations understood and implemented in
Scalar functions can be used to turn a column (or multiple) into a column of the same length (such as
x + y or
whereas aggregation functions can only be used in
GROUP BY clauses, as they
turn a column into a single value.
For more information on the semantic of the different functions, please have a look into the
Apache Calcite documentation.
Unary Math Operations:
Due to a bug/inconsistency in Apache Calcite, both the
LOCALTIME return a time without timezone and are therefore the same functionality.
IS NOT NULL,
IS NOT TRUE,
IS NOT FALSE,
IS NOT UNKNOWN,
SELECT SIN(x) FROM "data" WHERE MONTH(t) = 4
It is also possible to implement custom functions. See Custom Functions and Aggregations.
SELECT SUM(x) FROM "data" GROUP BY y
Statistical Aggregation Function which takes two columns as input are follows:
SELECT REGR_COUNT(y,x), REGR_SXX(y,x), COVAR_POP(y,x) FROM "data" GROUP BY z
It is also possible to implement custom aggregations. See Custom Functions and Aggregations.
SELECT y, SUM(x) OVER (PARTITION BY z ORDER BY a NULLS FIRST) FROM "data"
Again, it is also possible to implement custom windowing functions.
TABLESAMPLE SYSTEM and
SELECT * FROM "data" TABLESAMPLE BERNOULLI (20) REPEATABLE (42)
TABLESAMPLE allows to draw random samples from the given table and should be the preferred way
to select samples.
BERNOULLI will select a row in the original table with a probability
given by the number in the brackets (in percentage). The optional flag
the random seed to use.
SYSTEM is similar, but acts on partitions (so blocks of data) and is therefore much more
inaccurate and should only ever be used on really large data samples where
BERNOULLI is not
fast enough (which is very unlikely).