Contents

Configuration in Dask-SQL

Configuration in Dask-SQL¶

dask-sql supports a list of configuration options to configure behavior of certain operations. dask-sql uses Dask’s config module and configuration options can be specified with YAML files, via environment variables, or directly, either through the dask.config.set method or the config_options argument in the dask_sql.Context.sql() method.

Configuration Reference¶

sql.aggregate.split_out 1 ¶: Number of output partitions from an aggregation operation

sql.aggregate.split_every None ¶: Number of branches per reduction step from an aggregation operation.

sql.identifier.case_sensitive True ¶: Whether sql identifiers are considered case sensitive while parsing.

sql.join.broadcast None ¶: If boolean, it determines whether all joins should use the broadcast join algorithm. If float, it's a value denoting dask's likelihood of selecting a broadcast join based codepath over a shuffle based join. Concretely, dask will select a broadcast based join algorithm if small_table.npartitions < log2(big_table.npartitions) * broadcast_bias Note: Forcing a broadcast join might lead to perf issues or OOM errors in cases where the broadcasted table is too large to fit on a single worker.

sql.limit.check-first-partition True ¶: Whether or not to check the first partition length when computing a LIMIT without an OFFSET on a table with a relatively simple Dask graph (i.e. only IO and/or partition-wise layers); checking partition length triggers a Dask graph computation which can be slow for complex queries, but can signicantly reduce memory usage when querying a small subset of a large table. Default is ``true``.

sql.optimize True ¶: Whether the first generated logical plan should be further optimized or used as is.

sql.predicate_pushdown True ¶: Whether to try pushing down filter predicates into IO (when possible).

sql.dynamic_partition_pruning True ¶: Whether to apply the dynamic partition pruning optimizer rule.

sql.optimizer.verbose False ¶: The dynamic partition pruning optimizer rule can sometimes result in extremely long c.explain() outputs which are not helpful to the user. Setting this option to true allows the user to see the entire output, while setting it to false truncates the output. Default is false.

sql.fact_dimension_ratio None ¶: Ratio of the size of the dimension tables to fact tables. Parameter for dynamic partition pruning and join reorder optimizer rules.

sql.max_fact_tables None ¶: Maximum number of fact tables to allow in a join. Parameter for join reorder optimizer rule.

sql.preserve_user_order None ¶: Whether to preserve user-defined order of unfiltered dimensions. Parameter for join reorder optimizer rule.

sql.filter_selectivity None ¶: Constant to use when determining the number of rows produced by a filtered relation. Parameter for join reorder optimizer rule.

sql.sort.topk-nelem-limit 1000000 ¶: Total number of elements below which dask-sql should attempt to apply the top-k optimization (when possible). ``nelem`` is defined as the limit or ``k`` value times the number of columns. Default is 1000000, corresponding to a LIMIT clause of 1 million in a 1 column table.

sql.mappings.decimal_support pandas ¶: Decides how to handle decimal scalars/columns. ``"pandas"`` handling will treat decimals scalars and columns as floats and float64 columns, respectively, while ``"cudf"`` handling treats decimal scalars as ``decimal.Decimal`` objects and decimal columns as ``cudf.Decimal128Dtype`` columns, handling precision/scale accordingly. Default is ``"pandas"``, but ``"cudf"`` should be used if attempting to work with decimal columns on GPU.

How does it work?

dask-sql documentation

Configuration in Dask-SQL

Contents

Configuration in Dask-SQL¶

Configuration Reference¶