Configuration in Dask-SQL
``dask-sql`` supports a list of configuration options to configure the behavior of certain operations. ``dask-sql`` uses Dask's config module, and configuration options can be specified with YAML files, via environment variables, or directly, either through the ``dask.config.set`` method or the ``config_options`` argument in the ``dask_sql.Context.sql()`` method.
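As a brief sketch of the direct approaches, options can be set globally or temporarily via ``dask.config.set``; the per-query ``config_options`` usage is shown as a comment, since it assumes an existing ``dask_sql.Context`` with a registered table:

```python
import dask.config

# Set an option globally for the whole session
dask.config.set({"sql.identifier.case_sensitive": False})

# Or set it temporarily, only within the context manager's scope
with dask.config.set({"sql.aggregate.split_out": 4}):
    print(dask.config.get("sql.aggregate.split_out"))  # 4

# Equivalently, for a single query (assuming a Context `c` with a
# registered table `df`):
# c.sql(
#     "SELECT a, SUM(b) FROM df GROUP BY a",
#     config_options={"sql.aggregate.split_out": 4},
# )
```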
Configuration Reference
- ``sql.aggregate.split_out`` (default: ``1``): Number of output partitions from an aggregation operation.
- ``sql.aggregate.split_every`` (default: ``None``): Number of branches per reduction step from an aggregation operation.
- ``sql.identifier.case_sensitive`` (default: ``True``): Whether SQL identifiers are considered case sensitive while parsing.
- ``sql.join.broadcast`` (default: ``None``): If a boolean, determines whether all joins should use the broadcast join algorithm. If a float, it denotes Dask's bias toward selecting a broadcast-based join over a shuffle-based join; concretely, Dask will select a broadcast join if ``small_table.npartitions < log2(big_table.npartitions) * broadcast_bias``. Note: forcing a broadcast join might lead to performance issues or out-of-memory errors when the broadcasted table is too large to fit on a single worker.
- ``sql.limit.check-first-partition`` (default: ``True``): Whether or not to check the first partition's length when computing a LIMIT without an OFFSET on a table with a relatively simple Dask graph (i.e. only IO and/or partition-wise layers). Checking partition length triggers a Dask graph computation, which can be slow for complex queries, but can significantly reduce memory usage when querying a small subset of a large table.
- ``sql.optimize`` (default: ``True``): Whether the first generated logical plan should be further optimized or used as is.
- ``sql.predicate_pushdown`` (default: ``True``): Whether to try pushing down filter predicates into IO (when possible).
- ``sql.dynamic_partition_pruning`` (default: ``True``): Whether to apply the dynamic partition pruning optimizer rule.
- ``sql.sort.topk-nelem-limit`` (default: ``1000000``): Total number of elements below which dask-sql should attempt to apply the top-k optimization (when possible). ``nelem`` is defined as the limit (``k``) value times the number of columns; the default of 1000000 corresponds to a LIMIT of 1 million on a one-column table.
- ``sql.mappings.decimal_support`` (default: ``"pandas"``): Decides how to handle decimal scalars and columns. ``"pandas"`` handling treats decimal scalars as floats and decimal columns as ``float64`` columns, while ``"cudf"`` handling treats decimal scalars as ``decimal.Decimal`` objects and decimal columns as ``cudf.Decimal128Dtype`` columns, handling precision and scale accordingly. ``"cudf"`` should be used when working with decimal columns on GPU.
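The dotted option names above map to nested keys in a Dask YAML configuration file (for example, a file placed in ``~/.config/dask/``, which Dask's config module loads automatically); the same nesting applies to environment variables following Dask's usual convention (e.g. ``DASK_SQL__AGGREGATE__SPLIT_OUT=4``). A sketch covering a few of the options above:

```yaml
# dask-sql options nest under the top-level "sql" key
sql:
  aggregate:
    split_out: 4        # more output partitions for large aggregations
    split_every: 8
  identifier:
    case_sensitive: false
  limit:
    check-first-partition: true
```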