Configuration in Dask-SQL
Contents
Configuration in Dask-SQL¶
dask-sql supports a list of configuration options to configure behavior of certain operations.
dask-sql uses Dask’s config
module and configuration options can be specified with YAML files, via environment variables,
or directly, either through the dask.config.set method
or the config_options argument in the dask_sql.Context.sql() method.
Configuration Reference¶
-
sql.aggregate.split_out1 ¶ Number of output partitions from an aggregation operation
-
sql.aggregate.split_everyNone ¶ Number of branches per reduction step from an aggregation operation.
-
sql.identifier.case_sensitiveTrue ¶ Whether sql identifiers are considered case sensitive while parsing.
-
sql.join.broadcastNone ¶ If boolean, it determines whether all joins should use the broadcast join algorithm. If float, it's a value denoting dask's likelihood of selecting a broadcast join based codepath over a shuffle based join. Concretely, dask will select a broadcast based join algorithm if small_table.npartitions < log2(big_table.npartitions) * broadcast_bias Note: Forcing a broadcast join might lead to perf issues or OOM errors in cases where the broadcasted table is too large to fit on a single worker.
-
sql.limit.check-first-partitionTrue ¶ Whether or not to check the first partition length when computing a LIMIT without an OFFSET on a table with a relatively simple Dask graph (i.e. only IO and/or partition-wise layers); checking partition length triggers a Dask graph computation which can be slow for complex queries, but can signicantly reduce memory usage when querying a small subset of a large table. Default is ``true``.
-
sql.optimizeTrue ¶ Whether the first generated logical plan should be further optimized or used as is.
-
sql.predicate_pushdownTrue ¶ Whether to try pushing down filter predicates into IO (when possible).
-
sql.dynamic_partition_pruningTrue ¶ Whether to apply the dynamic partition pruning optimizer rule.
-
sql.optimizer.verboseFalse ¶ The dynamic partition pruning optimizer rule can sometimes result in extremely long c.explain() outputs which are not helpful to the user. Setting this option to true allows the user to see the entire output, while setting it to false truncates the output. Default is false.
-
sql.fact_dimension_ratioNone ¶ Ratio of the size of the dimension tables to fact tables. Parameter for dynamic partition pruning and join reorder optimizer rules.
-
sql.max_fact_tablesNone ¶ Maximum number of fact tables to allow in a join. Parameter for join reorder optimizer rule.
-
sql.preserve_user_orderNone ¶ Whether to preserve user-defined order of unfiltered dimensions. Parameter for join reorder optimizer rule.
-
sql.filter_selectivityNone ¶ Constant to use when determining the number of rows produced by a filtered relation. Parameter for join reorder optimizer rule.
-
sql.sort.topk-nelem-limit1000000 ¶ Total number of elements below which dask-sql should attempt to apply the top-k optimization (when possible). ``nelem`` is defined as the limit or ``k`` value times the number of columns. Default is 1000000, corresponding to a LIMIT clause of 1 million in a 1 column table.
-
sql.mappings.decimal_supportpandas ¶ Decides how to handle decimal scalars/columns. ``"pandas"`` handling will treat decimals scalars and columns as floats and float64 columns, respectively, while ``"cudf"`` handling treats decimal scalars as ``decimal.Decimal`` objects and decimal columns as ``cudf.Decimal128Dtype`` columns, handling precision/scale accordingly. Default is ``"pandas"``, but ``"cudf"`` should be used if attempting to work with decimal columns on GPU.