Data Retrieval

Query data from already created tables. The SELECT call follows mostly the standard SQL conventions, including all typical ingredients (such as WHERE, GROUP BY, ORDER BY etc.).

SELECT [ ALL | DISTINCT ]
    * | <expression> [ [ AS ] <alias> ] [ , ... ]
    [ FROM <from> [ , ... ] ]
    [ WHERE <filter-condition> ]
    [ GROUP BY <group-by> ]
    [ HAVING <having-condition> ]
    [ UNION [ ALL | DISTINCT ] <select> ]
    [ ORDER BY <order-by> [ ASC | DESC ] [ , ... ] ]
    [ LIMIT <end> ]
    [ OFFSET <start> ]

Note

If you would like to help, please see [our issue tracker](https://github.com/dask-contrib/dask-sql/issues/43).

Example:

SELECT
    name, SUM(x) AS s
FROM
    data
WHERE
    y < 3 AND x > 0.5
GROUP BY
    name
HAVING
    SUM(x) < 5
UNION SELECT
    'myself' AS name, 42 AS s
ORDER BY
    s
LIMIT 100

Also (all kind of) joins and (complex) subqueries are possible:

SELECT
     lhs.name, lhs.id, lhs.x
  FROM
     data AS lhs
  JOIN
     (
           SELECT
              name AS max_name,
              MAX(x) AS max_x
           FROM timeseries
           GROUP BY name
     ) AS rhs
  ON
     lhs.name = rhs.max_name AND
     lhs.x = rhs.max_x

For complex queries with many subqueries, it might be beneficial to use WITH for temporary table definitions:

WITH tmp AS (
    SELECT MAX(b) AS maxb from df GROUP BY a
)
SELECT
    maxb
FROM tmp

Implemented operations

The following list includes all operations understood and implemented in dask-sql. Scalar functions can be used to turn a column (or multiple) into a column of the same length (such as x + y or sin(x)) whereas aggregation functions can only be used in GROUP BY clauses, as they turn a column into a single value. For more information on the semantic of the different functions, please have a look into the Apache Calcite documentation.

Scalar Functions

Binary Operations: AND, OR, >, >=, <, <=, =, <>, +, -, /, *

Unary Math Operations: ABS, ACOS, ASIN, ATAN, ATAN2, CBRT, CEIL, COS, COT, DEGREES, EXP, FLOOR, LOG10, LN, POWER, RADIANS, ROUND, SIGN, SIN, TAN, TRUNCATE

String operations: LIKE, SIMILAR TO, ||, CHAR_LENGTH, UPPER, LOWER, POSITION, TRIM, OVERLAY, SUBSTRING, INITCAP

Date operations: EXTRACT, YEAR, QUARTER, MONTH, WEEK, DAYOFYEAR, DAYOFMONTH, DAYOFWEEK, HOUR, MINUTE, SECOND, LOCALTIME, LOCALTIMESTAMP, CURRENT_TIME, CURRENT_DATE, CURRENT_TIMESTAMP

Note

Due to a bug/inconsistency in Apache Calcite, both the CURRENTTIME and LOCALTIME return a time without timezone and are therefore the same functionality.

Special Operations: CASE, NOT, IS NULL, IS NOT NULL, IS TRUE, IS NOT TRUE, IS FALSE:, IS NOT FALSE, IS UNKNOWN, IS NOT UNKNOWN, EXISTS, RAND, RAND_INTEGER

Example:

SELECT
    SIN(x)
FROM "data"
WHERE MONTH(t) = 4

Note

It is also possible to implement custom functions. See Custom Functions and Aggregations.

Aggregations

ANY_VALUE, AVG, BIT_AND, BIT_OR, BIT_XOR, COUNT, EVERY, MAX, MIN, SINGLE_VALUE, STDDEV_POP, STDDEV_SAMP, SUM, VAR_POP, VAR_SAMP, VARIANCE

Example:

SELECT
    SUM(x)
FROM "data"
GROUP BY y

Statistical Aggregation Function which takes two columns as input are follows:

REGR_COUNT, REGR_SXX, REGR_SYY, COVAR_POP, COVAR_SAMP

SELECT
    REGR_COUNT(y,x),
    REGR_SXX(y,x),
    COVAR_POP(y,x)
FROM "data"
GROUP BY z

Note

It is also possible to implement custom aggregations. See Custom Functions and Aggregations.

Windowing/Over

ROW_NUMBER, SUM, AVG, COUNT, MAX, MIN, SINGLE_VALUE, FIRST_VALUE, LAST_VALUE

Example:

SELECT
    y,
    SUM(x) OVER (PARTITION BY z ORDER BY a NULLS FIRST)
FROM "data"

Note

Again, it is also possible to implement custom windowing functions.

Table Functions

TABLESAMPLE SYSTEM and TABLESAMPLE BERNOULLI:

Example:

SELECT * FROM "data" TABLESAMPLE BERNOULLI (20) REPEATABLE (42)

TABLESAMPLE allows to draw random samples from the given table and should be the preferred way to select samples. BERNOULLI will select a row in the original table with a probability given by the number in the brackets (in percentage). The optional flag REPEATABLE defines the random seed to use. SYSTEM is similar, but acts on partitions (so blocks of data) and is therefore much more inaccurate and should only ever be used on really large data samples where BERNOULLI is not fast enough (which is very unlikely).