Configuration in Dask-SQL
Contents
Configuration in Dask-SQL¶
dask-sql
supports a list of configuration options to configure behavior of certain operations.
dask-sql
uses Dask’s config
module and configuration options can be specified with YAML files, via environment variables,
or directly, either through the dask.config.set method
or the config_options
argument in the dask_sql.Context.sql()
method.
Configuration Reference¶
-
sql.aggregate.split_out
1 ¶ Number of output partitions from an aggregation operation
-
sql.aggregate.split_every
None ¶ Number of branches per reduction step from an aggregation operation.
-
sql.identifier.case_sensitive
True ¶ Whether sql identifiers are considered case sensitive while parsing.
-
sql.join.broadcast
None ¶ If boolean, it determines whether all joins should use the broadcast join algorithm. If float, it's a value denoting dask's likelihood of selecting a broadcast join based codepath over a shuffle based join. Concretely, dask will select a broadcast based join algorithm if small_table.npartitions < log2(big_table.npartitions) * broadcast_bias Note: Forcing a broadcast join might lead to perf issues or OOM errors in cases where the broadcasted table is too large to fit on a single worker.
-
sql.limit.check-first-partition
True ¶ Whether or not to check the first partition length when computing a LIMIT without an OFFSET on a table with a relatively simple Dask graph (i.e. only IO and/or partition-wise layers); checking partition length triggers a Dask graph computation which can be slow for complex queries, but can signicantly reduce memory usage when querying a small subset of a large table. Default is ``true``.
-
sql.optimize
True ¶ Whether the first generated logical plan should be further optimized or used as is.
-
sql.predicate_pushdown
True ¶ Whether to try pushing down filter predicates into IO (when possible).
-
sql.dynamic_partition_pruning
True ¶ Whether to apply the dynamic partition pruning optimizer rule.
-
sql.optimizer.verbose
False ¶ The dynamic partition pruning optimizer rule can sometimes result in extremely long c.explain() outputs which are not helpful to the user. Setting this option to true allows the user to see the entire output, while setting it to false truncates the output. Default is false.
-
sql.fact_dimension_ratio
None ¶ Ratio of the size of the dimension tables to fact tables. Parameter for dynamic partition pruning and join reorder optimizer rules.
-
sql.max_fact_tables
None ¶ Maximum number of fact tables to allow in a join. Parameter for join reorder optimizer rule.
-
sql.preserve_user_order
None ¶ Whether to preserve user-defined order of unfiltered dimensions. Parameter for join reorder optimizer rule.
-
sql.filter_selectivity
None ¶ Constant to use when determining the number of rows produced by a filtered relation. Parameter for join reorder optimizer rule.
-
sql.sort.topk-nelem-limit
1000000 ¶ Total number of elements below which dask-sql should attempt to apply the top-k optimization (when possible). ``nelem`` is defined as the limit or ``k`` value times the number of columns. Default is 1000000, corresponding to a LIMIT clause of 1 million in a 1 column table.
-
sql.mappings.decimal_support
pandas ¶ Decides how to handle decimal scalars/columns. ``"pandas"`` handling will treat decimals scalars and columns as floats and float64 columns, respectively, while ``"cudf"`` handling treats decimal scalars as ``decimal.Decimal`` objects and decimal columns as ``cudf.Decimal128Dtype`` columns, handling precision/scale accordingly. Default is ``"pandas"``, but ``"cudf"`` should be used if attempting to work with decimal columns on GPU.