API Documentation¶
-
class
dask_sql.
Context
¶ Main object to communicate with
dask_sql
. It holds a store of all registered data frames (= tables) and can convert SQL queries to dask data frames. The tables in these queries are referenced by the name, which is given when registering a dask dataframe.Example
from dask_sql import Context c = Context() # Register a table c.register_dask_table(df, "my_table") # Now execute an SQL query. The result is a dask dataframe result = c.sql("SELECT a, b FROM my_table") # Trigger the computation (or use the data frame for something else) result.compute()
Usually, you will only ever have a single context in your program.
See also
-
register_aggregation
(f, name, parameters, return_type)¶ Register a custom aggregation with the given name. The aggregation can be used (with this name) in every SQL queries from now on - but only for aggregation operations (no scalar function calls). This means, if you register a aggregation “fagg”, you can now call
SELECT fagg(y) FROM df GROUP BY x
Please note that you can always only have one function with the same name; no matter if it is an aggregation or scalar function.
For the registration, you need to supply both the list of parameter and parameter types as well as the return type. Use numpy dtypes if possible.
More information: Custom Functions and Aggregations
Example
The following code registers a new aggregation “fagg”, which computes the sum of a column and uses it on the
y
column.fagg = dd.Aggregation("fagg", lambda x: x.sum(), lambda x: x.sum()) c.register_aggregation(fagg, "fagg", [("x", np.float64)], np.float64) sql = "SELECT fagg(y) FROM df GROUP BY x" df_result = c.sql(sql)
- Parameters
f (
dask.dataframe.Aggregate
) – The aggregate to register. See the dask documentation for more information.name (
str
) – Under which name should the new aggregate be addressable in SQLparameters (
List[Tuple[str, type]]
) –A list ot tuples of parameter name and parameter type. Use numpy dtypes if possible.
return_type (
type
) – The return type of the function
See also
-
register_dask_table
(df, name)¶ Registering a dask table makes it usable in SQL queries. The name you give here can be used as table name in the SQL later.
Please note, that the table is stored as it is now. If you change the table later, you need to re-register.
Example
This code registers a data frame as table “data” and then uses it in a query.
c.register_dask_table(df, "data") df_result = c.sql("SELECT a, b FROM data")
- Parameters
df (
dask.dataframe.DataFrame
) – The data frame to registername – (
str
): Under which name should the new table be addressable
-
register_function
(f, name, parameters, return_type)¶ Register a custom function with the given name. The function can be used (with this name) in every SQL queries from now on - but only for scalar operations (no aggregations). This means, if you register a function “f”, you can now call
SELECT f(x) FROM df
Please note that you can always only have one function with the same name; no matter if it is an aggregation or scalar function.
For the registration, you need to supply both the list of parameter and parameter types as well as the return type. Use numpy dtypes if possible.
More information: Custom Functions and Aggregations
Example
This example registers a function “f”, which calculates the square of an integer and applies it to the column
x
.def f(x): return x ** 2 c.register_function(f, "f", [("x", np.int64)], np.int64) sql = "SELECT f(x) FROM df" df_result = c.sql(sql)
- Parameters
f (
Callable
) – The function to registername (
str
) – Under which name should the new function be addressable in SQLparameters (
List[Tuple[str, type]]
) –A list ot tuples of parameter name and parameter type. Use numpy dtypes if possible.
return_type (
type
) – The return type of the function
See also
-
sql
(sql)¶ Query the registered tables with the given SQL. The SQL follows approximately the postgreSQL standard - however, not all operations are already implemented. In general, only select statements (no data manipulation) works.
For more information, see SQL Syntax.
Example
In this example, a query is called using the registered tables and then executed using dask.
result = c.sql("SELECT a, b FROM my_table") print(result.compute())
- Parameters
sql (
str
) – The query string to executedebug (
bool
) – Turn on printing of debug information.
- Returns
the created data frame of this query.
- Return type
dask.dataframe.DataFrame
-
-
dask_sql.
run_server
(context=None, host='0.0.0.0', port=8080)¶ Run a HTTP server for answering SQL queries using
dask-sql
. It uses the Presto Wire Protocol for communication. This means, it has a single POST endpoint v1/statement, which answers SQL queries (as string in the body) with the output as a JSON (in the format described in the documentation above). Every SQL expression thatdask-sql
understands can be used here.Note
The presto protocol also includes some statistics on the query in the response. These statistics are currently only filled with placeholder variables.
- Parameters
context (
dask_sql.Context
) – If set, use this context instead of an empty one.host (
str
) – The host interface to listen on (defaults to all interfaces)port (
int
) – The port to listen on (defaults to 8080)
Example
It is possible to run an SQL server by using the CLI script in
dask_sql.server.app
or by calling this function directly in your user code:from dask_sql import run_server # Create your pre-filled context c = Context() ... run_server(context=c)
After starting the server, it is possible to send queries to it, e.g. with the presto CLI or via sqlalchemy (e.g. using the PyHive package):
from sqlalchemy.engine import create_engine engine = create_engine('presto://localhost:8080/') import pandas as pd pd.read_sql_query("SELECT 1 + 1", con=engine)
Of course, it is also possible to call the usual
CREATE TABLE
commands.