# Data Retrieval

## Contents

# Data Retrieval¶

Query data from already created tables. The `SELECT`

call follows mostly the standard SQL conventions,
including all typical ingredients (such as `WHERE`

, `GROUP BY`

, `ORDER BY`

etc.).

```
SELECT [ ALL | DISTINCT ]
* | <expression> [ [ AS ] <alias> ] [ , ... ]
[ FROM <from> [ , ... ] ]
[ WHERE <filter-condition> ]
[ GROUP BY <group-by> ]
[ HAVING <having-condition> ]
[ UNION [ ALL | DISTINCT ] <select> ]
[ ORDER BY <order-by> [ ASC | DESC ] [ , ... ] ]
[ LIMIT <end> ]
[ OFFSET <start> ]
```

Note

If you would like to help, please see [our issue tracker](https://github.com/dask-contrib/dask-sql/issues/43).

Example:

```
SELECT
name, SUM(x) AS s
FROM
data
WHERE
y < 3 AND x > 0.5
GROUP BY
name
HAVING
SUM(x) < 5
UNION SELECT
'myself' AS name, 42 AS s
ORDER BY
s
LIMIT 100
```

Also (all kind of) joins and (complex) subqueries are possible:

```
SELECT
lhs.name, lhs.id, lhs.x
FROM
data AS lhs
JOIN
(
SELECT
name AS max_name,
MAX(x) AS max_x
FROM timeseries
GROUP BY name
) AS rhs
ON
lhs.name = rhs.max_name AND
lhs.x = rhs.max_x
```

For complex queries with many subqueries, it might be beneficial to use `WITH`

for temporary table definitions:

```
WITH tmp AS (
SELECT MAX(b) AS maxb from df GROUP BY a
)
SELECT
maxb
FROM tmp
```

## Implemented operations¶

The following list includes all operations understood and implemented in `dask-sql`

.
Scalar functions can be used to turn a column (or multiple) into a column of the same length (such as `x + y`

or `sin(x)`

)
whereas aggregation functions can only be used in `GROUP BY`

clauses, as they
turn a column into a single value.
For more information on the semantic of the different functions, please have a look into the
Apache Calcite documentation.

### Scalar Functions¶

Binary Operations: `AND`

, `OR`

, `>`

, `>=`

, `<`

, `<=`

, `=`

, `<>`

, `+`

, `-`

, `/`

, `*`

Unary Math Operations: `ABS`

, `ACOS`

, `ASIN`

, `ATAN`

, `ATAN2`

, `CBRT`

, `CEIL`

, `COS`

, `COT`

, `DEGREES`

, `EXP`

, `FLOOR`

, `LOG10`

, `LN`

, `POWER`

, `RADIANS`

, `ROUND`

, `SIGN`

, `SIN`

, `TAN`

, `TRUNCATE`

String operations: `LIKE`

, `SIMILAR TO`

, `||`

, `CHAR_LENGTH`

, `UPPER`

, `LOWER`

, `POSITION`

, `TRIM`

, `OVERLAY`

, `SUBSTRING`

, `INITCAP`

Date operations: `EXTRACT`

, `YEAR`

, `QUARTER`

, `MONTH`

, `WEEK`

, `DAYOFYEAR`

, `DAYOFMONTH`

, `DAYOFWEEK`

, `HOUR`

, `MINUTE`

, `SECOND`

, `LOCALTIME`

, `LOCALTIMESTAMP`

, `CURRENT_TIME`

, `CURRENT_DATE`

, `CURRENT_TIMESTAMP`

Note

Due to a bug/inconsistency in Apache Calcite, both the `CURRENTTIME`

and `LOCALTIME`

return a time without timezone and are therefore the same functionality.

Special Operations: `CASE`

, `NOT`

, `IS NULL`

, `IS NOT NULL`

, `IS TRUE`

, `IS NOT TRUE`

, `IS FALSE:`

, `IS NOT FALSE`

, `IS UNKNOWN`

, `IS NOT UNKNOWN`

, `EXISTS`

, `RAND`

, `RAND_INTEGER`

Example:

```
SELECT
SIN(x)
FROM "data"
WHERE MONTH(t) = 4
```

Note

It is also possible to implement custom functions. See Custom Functions and Aggregations.

### Aggregations¶

`ANY_VALUE`

, `AVG`

, `BIT_AND`

, `BIT_OR`

, `BIT_XOR`

, `COUNT`

, `EVERY`

, `MAX`

, `MIN`

, `SINGLE_VALUE`

, `STDDEV_POP`

, `STDDEV_SAMP`

, `SUM`

, `VAR_POP`

, `VAR_SAMP`

, `VARIANCE`

Example:

```
SELECT
SUM(x)
FROM "data"
GROUP BY y
```

Statistical Aggregation Function which takes two columns as input are follows:

`REGR_COUNT`

, `REGR_SXX`

, `REGR_SYY`

, `COVAR_POP`

, `COVAR_SAMP`

```
SELECT
REGR_COUNT(y,x),
REGR_SXX(y,x),
COVAR_POP(y,x)
FROM "data"
GROUP BY z
```

Note

It is also possible to implement custom aggregations. See Custom Functions and Aggregations.

### Windowing/Over¶

`ROW_NUMBER`

, `SUM`

, `AVG`

, `COUNT`

, `MAX`

, `MIN`

, `SINGLE_VALUE`

, `FIRST_VALUE`

, `LAST_VALUE`

Example:

```
SELECT
y,
SUM(x) OVER (PARTITION BY z ORDER BY a NULLS FIRST)
FROM "data"
```

Note

Again, it is also possible to implement custom windowing functions.

### Table Functions¶

`TABLESAMPLE SYSTEM`

and `TABLESAMPLE BERNOULLI`

:

Example:

```
SELECT * FROM "data" TABLESAMPLE BERNOULLI (20) REPEATABLE (42)
```

`TABLESAMPLE`

allows to draw random samples from the given table and should be the preferred way
to select samples. `BERNOULLI`

will select a row in the original table with a probability
given by the number in the brackets (in percentage). The optional flag `REPEATABLE`

defines
the random seed to use.
`SYSTEM`

is similar, but acts on partitions (so blocks of data) and is therefore much more
inaccurate and should only ever be used on really large data samples where `BERNOULLI`

is not
fast enough (which is very unlikely).