How DuckDB transformed my data science workflow
Exploratory data analysis (EDA) is essential in the data science workflow. As a data scientist, I need a tool that can effectively handle a variety of challenges:
- Collect: preparing datasets from diverse sources (flat files: CSV, JSON, Parquet; databases: MySQL, Postgres; cloud storage: GCS, S3).
- Transform: filtering, aggregating, and sorting data, joining multiple tables, pivoting tables, and creating views to simplify complex queries.
- Fast and memory-efficient: analytic workflows take time; a fast, lean engine saves both time and money.
- Rich features: support for advanced data types such as array, list, map, struct, geometry (for geographic information systems), and full-text search.
- Easy to integrate: support for popular programming languages like Python, R, Java, Rust, TypeScript/JavaScript, and Julia. Integration with different Python libraries (Pandas, Arrow, and Polars) is also a plus: their objects, stored in ordinary variables, can be queried directly with SQL.
DuckDB is an open-source database that provides these functionalities with efficiency, reliability, and ease of use.
For more technical details, see: Why DuckDB
The source code (Python only) for this post can be found on GitHub: quangtiencs/duckdb-tutorial.
1. Multiple sources and easy integration
DuckDB supports two types of connection:
- In memory: data is not persisted to disk. Suitable for ad hoc queries.
- Persisted to disk: given a file, all tables, indexes, constraints, views, etc., will be stored in a single file.
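For example, here is a minimal sketch of opening each type of connection in Python (the file name analysis.duckdb is just an illustrative placeholder):
import duckdb

# In-memory database (the default): nothing is persisted to disk
con = duckdb.connect()

# Persisted database: all tables, indexes, views, etc. live in a single file
# con = duckdb.connect("analysis.duckdb")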
DuckDB quickly loads data from diverse sources without boilerplate code. Even with no configuration, DuckDB auto-detects types and dialects for some data sources (e.g., CSV, JSON):
query_result = con.query("""
CREATE OR REPLACE TABLE titanic AS
SELECT * FROM 'titanic.csv';
SELECT *
FROM titanic
LIMIT 7
""")
query_result
┌─────────────┬──────────┬────────┬───────────┬───┬───────┬──────────────────┬─────────┬─────────┬──────────┐
│ PassengerId │ Survived │ Pclass │ Lname │ … │ Parch │ Ticket │ Fare │ Cabin │ Embarked │
│ int64 │ int64 │ int64 │ varchar │ │ int64 │ varchar │ double │ varchar │ varchar │
├─────────────┼──────────┼────────┼───────────┼───┼───────┼──────────────────┼─────────┼─────────┼──────────┤
│ 1 │ 0 │ 3 │ Braund │ … │ 0 │ A/5 21171 │ 7.25 │ NULL │ S │
│ 2 │ 1 │ 1 │ Cumings │ … │ 0 │ PC 17599 │ 71.2833 │ C85 │ C │
│ 3 │ 1 │ 3 │ Heikkinen │ … │ 0 │ STON/O2. 3101282 │ 7.925 │ NULL │ S │
│ 4 │ 1 │ 1 │ Futrelle │ … │ 0 │ 113803 │ 53.1 │ C123 │ S │
│ 5 │ 0 │ 3 │ Allen │ … │ 0 │ 373450 │ 8.05 │ NULL │ S │
│ 6 │ 0 │ 3 │ Moran │ … │ 0 │ 330877 │ 8.4583 │ NULL │ Q │
│ 7 │ 0 │ 1 │ McCarthy │ … │ 0 │ 17463 │ 51.8625 │ E46 │ S │
├─────────────┴──────────┴────────┴───────────┴───┴───────┴──────────────────┴─────────┴─────────┴──────────┤
│ 7 rows 13 columns (9 shown) │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────┘
We can easily convert the query_result object to a pandas DataFrame with its .df() method:
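# Materialize the query result as a pandas DataFrame
titanic_df = query_result.df()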
And more…
- Load from MySQL or PostgreSQL
- Load from remote files: httpfs
- Load from Google Cloud Storage
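For instance, a minimal sketch of reading a remote Parquet file over HTTP with the httpfs extension (the URL below is a placeholder, not a real dataset):
con.query("""
INSTALL httpfs;
LOAD httpfs;
-- placeholder URL, for illustration only
SELECT *
FROM 'https://example.com/data/titanic.parquet'
LIMIT 5;
""")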
2. Simplified Pivot Statement
The PIVOT statement helps us reorganize data in a table: it creates a new table whose columns are derived from the distinct values in the original table. DuckDB supports both the SQL standard PIVOT syntax and a simplified PIVOT syntax. The simplified PIVOT syntax is more intuitive. Try it!
PIVOT [dataset]
ON [columns]
USING [values]
GROUP BY [rows]
ORDER BY [columns-with-order-directions]
LIMIT [limit-of-rows];
For example, suppose we want to turn the gender values into new columns and count the passengers in each survival class:
pivot_result = con.query("""
PIVOT titanic ON Sex USING COUNT(*) GROUP BY Survived
""")
print(pivot_result.df().to_markdown(tablefmt='psql'))
+----+------------+----------+--------+
| | Survived | female | male |
|----+------------+----------+--------|
| 0 | 0 | 16 | 86 |
| 1 | 1 | 40 | 14 |
+----+------------+----------+--------+
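For comparison, here is a sketch of the same pivot written in the SQL standard PIVOT syntax that DuckDB also accepts (modeled on the DuckDB documentation; shown as an illustrative sketch):
pivot_standard = con.query("""
FROM titanic
PIVOT (
    COUNT(*)
    FOR Sex IN ('female', 'male')
    GROUP BY Survived
)
""")
print(pivot_standard.df().to_markdown(tablefmt='psql'))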
3. Join syntax is less messy compared with Pandas DataFrames
Joining two tables is a common task in data analysis. Although the Pandas pd.merge API works well, it becomes verbose once you have multiple tables (DataFrames).
Consider this example (adapted from the pandas documentation):
import pandas as pd
left_df = pd.DataFrame(
{
"key": ["K0", "K1", "K2", "K3"],
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
}
)
print(left_df.to_markdown(tablefmt='psql'))
right_df = pd.DataFrame(
{
"key": ["K0", "K1", "K2", "K3"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
}
)
print(right_df.to_markdown(tablefmt='psql'))
pandas_join_result = pd.merge(left_df, right_df, on="key")
print(pandas_join_result.to_markdown(tablefmt='psql'))
+----+-------+-----+-----+
| | key | A | B |
|----+-------+-----+-----|
| 0 | K0 | A0 | B0 |
| 1 | K1 | A1 | B1 |
| 2 | K2 | A2 | B2 |
| 3 | K3 | A3 | B3 |
+----+-------+-----+-----+
+----+-------+-----+-----+
| | key | C | D |
|----+-------+-----+-----|
| 0 | K0 | C0 | D0 |
| 1 | K1 | C1 | D1 |
| 2 | K2 | C2 | D2 |
| 3 | K3 | C3 | D3 |
+----+-------+-----+-----+
+----+-------+-----+-----+-----+-----+
| | key | A | B | C | D |
|----+-------+-----+-----+-----+-----|
| 0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K1 | A1 | B1 | C1 | D1 |
| 2 | K2 | A2 | B2 | C2 | D2 |
| 3 | K3 | A3 | B3 | C3 | D3 |
+----+-------+-----+-----+-----+-----+
Sometimes a join condition needs some preprocessing of the join columns; in Pandas, we would create a temporary column with .apply (on a pd.Series), which also leads to verbose code (see the sketch after the next example).
The DuckDB JOIN statement is more concise. Luckily, we can query Pandas pd.DataFrame objects directly as DuckDB tables!
result = con.query("""
SELECT *
FROM left_df JOIN right_df ON left_df.key=right_df.key
""")
print(result.df().to_markdown(tablefmt='psql'))
+----+-------+-----+-----+---------+-----+-----+
| | key | A | B | key_1 | C | D |
|----+-------+-----+-----+---------+-----+-----|
| 0 | K0 | A0 | B0 | K0 | C0 | D0 |
| 1 | K1 | A1 | B1 | K1 | C1 | D1 |
| 2 | K2 | A2 | B2 | K2 | C2 | D2 |
| 3 | K3 | A3 | B3 | K3 | C3 | D3 |
+----+-------+-----+-----+---------+-----+-----+
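And because a join condition is just a SQL expression, the preprocessing mentioned earlier can often be written inline, with no temporary columns. A minimal sketch (the lower() normalization is purely illustrative):
result = con.query("""
SELECT *
FROM left_df
JOIN right_df
  ON lower(left_df.key) = lower(right_df.key)
""")
print(result.df().to_markdown(tablefmt='psql'))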
4. Geospatial analytics
DuckDB provides spatial data management and analysis capabilities through its spatial extension. It allows users to store and manipulate geographic data, including points, lines, and polygons, and to perform spatial queries.
The extension supports reading over 50 different geospatial formats. We can quickly load the Singapore map (in GeoJSON format):
con.query("""
INSTALL spatial;
LOAD spatial;
CREATE TABLE singapore_geometry AS
SELECT *
FROM ST_Read('singapore.json');
""")
geometry_df = con.query("""
SELECT COUNTRY, NAME_1, geom
FROM singapore_geometry
""")
geometry_df
┌───────────┬────────────┬─────────────────────────────────────────────────────────────────────────────────────────────┐
│ COUNTRY │ NAME_1 │ geom │
│ varchar │ varchar │ geometry │
├───────────┼────────────┼─────────────────────────────────────────────────────────────────────────────────────────────┤
│ Singapore │ Central │ MULTIPOLYGON (((103.8478 1.2222, 103.8489 1.2231, 103.8517 1.2203, 103.8504 1.2141, 103.8… │
│ Singapore │ East │ MULTIPOLYGON (((104.0242 1.3656, 104.0386 1.3569, 104.0364 1.3533, 104.0333 1.3536, 104.0… │
│ Singapore │ North │ MULTIPOLYGON (((103.8282 1.3665, 103.8271 1.3627, 103.8251 1.3635, 103.8243 1.3618, 103.8… │
│ Singapore │ North-East │ MULTIPOLYGON (((103.8959 1.3374, 103.8889 1.3336, 103.8851 1.3409, 103.8773 1.3443, 103.8… │
│ Singapore │ West │ MULTIPOLYGON (((103.7346 1.175, 103.7419 1.1742, 103.7389 1.1664, 103.7336 1.1669, 103.73… │
└───────────┴────────────┴─────────────────────────────────────────────────────────────────────────────────────────────┘
Now let's run a spatial query to check which region (Central, East, North, North-East, West) contains the point at latitude 1.2966, longitude 103.7764:
which_region_df = con.query("""
SELECT COUNTRY, NAME_1
FROM singapore_geometry
WHERE ST_Contains(geom, ST_Point(103.7764, 1.2966))
""")
print(which_region_df.df().to_markdown(tablefmt='psql'))
+----+-----------+----------+
| | COUNTRY | NAME_1 |
|----+-----------+----------|
| 0 | Singapore | Central |
+----+-----------+----------+
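The extension also ships functions for manipulating geometries. As a sketch, we could rank the regions by the planar distance (in degrees of longitude/latitude) between their centroids and the same point, using the spatial extension's ST_Centroid and ST_Distance functions:
distance_df = con.query("""
SELECT NAME_1,
       ST_Distance(ST_Centroid(geom), ST_Point(103.7764, 1.2966)) AS dist_degrees
FROM singapore_geometry
ORDER BY dist_degrees
""")
print(distance_df.df().to_markdown(tablefmt='psql'))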
5. Benchmark
DuckDB is fast and memory-efficient. Here are some benchmarks comparing DuckDB and Polars:
- DuckDB (database-like ops benchmark): https://duckdblabs.github.io/db-benchmark/
- Polars (TPC-H benchmark): https://pola.rs/posts/benchmarks/