dask-sql
========

``dask-sql`` is a distributed SQL query engine in Python.
It allows you to query and transform your data using a mixture of
common SQL operations and Python code, and to scale up the computation easily when you need to.

* **Combine the power of Python and SQL**: load your data with Python, transform it with SQL,
  enhance it with Python and query it with SQL - or the other way round.
  With ``dask-sql`` you can mix the well-known Python dataframe API of ``pandas`` and ``Dask`` with common SQL operations,
  to process your data in exactly the way that is easiest for you.
* **Infinite Scaling**: using the power of the ``Dask`` ecosystem, your computations can scale as you need -
  from your laptop to your super cluster - without changing a single line of SQL code. From k8s to cloud deployments,
  from batch systems to YARN - if ``Dask`` supports it, so will ``dask-sql``.
* **Your data - your queries**: use Python user-defined functions (UDFs) in SQL without any performance drawback
  and extend your SQL queries with the large number of Python libraries, e.g. for machine learning,
  complicated input formats or complex statistics.
* **Easy to install and maintain**: ``dask-sql`` is just a pip/conda install away (or a ``docker run`` if you prefer).
* **Use SQL from wherever you like**: ``dask-sql`` integrates with your Jupyter notebook, your normal Python module,
  or can be used as a standalone SQL server from any BI tool. It even integrates natively with Apache Hue.
* **GPU Support**: ``dask-sql`` supports running SQL queries on CUDA-enabled GPUs by utilizing RAPIDS
  libraries like cuDF, enabling accelerated compute for SQL.

Example
-------

For this example, we use some data loaded from disk and query it with a SQL command.
``dask-sql`` accepts any pandas, cuDF, or Dask dataframe as input and can read data directly from a variety
of storage formats (CSV, Parquet, JSON) and file systems (S3, HDFS, GCS):

.. tabs::

    .. group-tab:: CPU

        .. code-block:: python

            import dask.datasets
            from dask_sql import Context

            # create a context to register tables
            c = Context()

            # create a table and register it in the context
            df = dask.datasets.timeseries()
            c.create_table("timeseries", df)

            # execute a SQL query; the result is a "lazy" Dask dataframe
            result = c.sql("""
                SELECT
                    name, SUM(x) as "sum"
                FROM
                    timeseries
                GROUP BY
                    name
            """)

            # actually compute the query...
            result.compute()

            # ...or use it for another computation
            result["sum"].mean().compute()

    .. group-tab:: GPU

        .. code-block:: python

            import dask.datasets
            from dask_sql import Context

            # create a context to register tables
            c = Context()

            # create a table and register it in the context
            df = dask.datasets.timeseries()
            c.create_table("timeseries", df, gpu=True)

            # execute a SQL query; the result is a "lazy" Dask dataframe
            result = c.sql("""
                SELECT
                    name, SUM(x) as "sum"
                FROM
                    timeseries
                GROUP BY
                    name
            """)

            # actually compute the query...
            result.compute()

            # ...or use it for another computation
            result["sum"].mean().compute()

.. toctree::
    :maxdepth: 1
    :caption: Contents:

    installation
    quickstart
    sql
    data_input
    custom
    machine_learning
    best_practices
    api
    server
    cmd
    fugue
    how_does_it_work
    configuration

.. note::

    ``dask-sql`` is currently under development and does not yet understand all SQL commands.
    We are actively looking for feedback, improvements and contributors!