Configuration in Dask-SQL

Configuration in Dask-SQL

dask-sql supports a list of configuration options to configure behavior of certain operations. dask-sql uses Dask’s config module and configuration options can be specified with YAML files, via environment variables, or directly, either through the dask.config.set method or the config_options argument in the dask_sql.Context.sql() method.

Configuration Reference

sql.aggregate.split_out   1

Number of output partitions from an aggregation operation

sql.aggregate.split_every   None

Number of branches per reduction step from an aggregation operation.

sql.identifier.case_sensitive   True

Whether sql identifiers are considered case sensitive while parsing.

sql.join.broadcast   None

If boolean, it determines whether all joins should use the broadcast join algorithm. If float, it's a value denoting dask's likelihood of selecting a broadcast join based codepath over a shuffle based join. Concretely, dask will select a broadcast based join algorithm if small_table.npartitions < log2(big_table.npartitions) * broadcast_bias Note: Forcing a broadcast join might lead to perf issues or OOM errors in cases where the broadcasted table is too large to fit on a single worker.

sql.limit.check-first-partition   True

Whether or not to check the first partition length when computing a LIMIT without an OFFSET on a table with a relatively simple Dask graph (i.e. only IO and/or partition-wise layers); checking partition length triggers a Dask graph computation which can be slow for complex queries, but can signicantly reduce memory usage when querying a small subset of a large table. Default is ``true``.

sql.optimize   True

Whether the first generated logical plan should be further optimized or used as is.

sql.predicate_pushdown   True

Whether to try pushing down filter predicates into IO (when possible).

sql.sort.topk-nelem-limit   1000000

Total number of elements below which dask-sql should attempt to apply the top-k optimization (when possible). ``nelem`` is defined as the limit or ``k`` value times the number of columns. Default is 1000000, corresponding to a LIMIT clause of 1 million in a 1 column table.