How DuckDB transformed my data science workflow
Exploratory data analysis (EDA) is essential in the data science workflow. As a data scientist, I need a tool that can effectively handle a variety of challenges:
- Collect: preparing datasets from diverse sources (flat files: CSV, JSON, Parquet; databases: MySQL, Postgres; cloud storage: GCS, S3).
- Transform: filtering, aggregating, and sorting data, joining multiple tables, pivoting tables, and creating views to simplify complex queries.
- Fast and memory-efficient: analytic workflows take time; a fast, lean engine saves both time and money.
- Rich features: support for advanced data types such as array, list, map, struct, geometry (for geographic information systems), and full-text search.
- Easy to integrate: support for popular programming languages like Python, R, Java, Rust, TypeScript/JavaScript, and Julia. Integration with different Python libraries (Pandas, Arrow, and Polars) is also a plus: their objects, stored in ordinary variables, can be queried directly with SQL.
DuckDB is an open-source database that provides these functionalities with efficiency, reliability, and ease of use.
For more technical details, see: Why DuckDB
The source code (Python only) for this post can be found on GitHub: quangtiencs/duckdb-tutorial.
1. Multiple sources and easy integration
DuckDB supports two types of connection:
- In memory: data is not persisted to disk. Suitable for ad hoc queries.
- Persisted to disk: given a file, all tables, indexes, constraints, views, etc., will be stored in a single file.
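For example, here is a minimal sketch of opening each type of connection in Python (the file name analysis.duckdb is just an illustrative placeholder):
import duckdb

# In-memory database (the default): nothing is persisted to disk
con = duckdb.connect()

# Persisted database: all tables, indexes, views, etc. live in a single file
# con = duckdb.connect("analysis.duckdb")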
DuckDB quickly loads data from diverse sources without boilerplate code. Even with no configuration, DuckDB auto-detects types and dialects for some data sources (e.g., CSV, JSON):
query_result = con.query("""
CREATE OR REPLACE TABLE titanic AS
SELECT * FROM 'titanic.csv';
SELECT *
FROM titanic
LIMIT 7
""")
query_result
┌─────────────┬──────────┬────────┬───────────┬───┬───────┬──────────────────┬─────────┬─────────┬──────────┐
│ PassengerId │ Survived │ Pclass │ Lname │ … │ Parch │ Ticket │ Fare │ Cabin │ Embarked │
│ int64 │ int64 │ int64 │ varchar │ │ int64 │ varchar │ double │ varchar │ varchar │
├─────────────┼──────────┼────────┼───────────┼───┼───────┼──────────────────┼─────────┼─────────┼──────────┤
│ 1 │ 0 │ 3 │ Braund │ … │ 0 │ A/5 21171 │ 7.25 │ NULL │ S │
│ 2 │ 1 │ 1 │ Cumings │ … │ 0 │ PC 17599 │ 71.2833 │ C85 │ C │
│ 3 │ 1 │ 3 │ Heikkinen │ … │ 0 │ STON/O2. 3101282 │ 7.925 │ NULL │ S │
│ 4 │ 1 │ 1 │ Futrelle │ … │ 0 │ 113803 │ 53.1 │ C123 │ S │
│ 5 │ 0 │ 3 │ Allen │ … │ 0 │ 373450 │ 8.05 │ NULL │ S │
│ 6 │ 0 │ 3 │ Moran │ … │ 0 │ 330877 │ 8.4583 │ NULL │ Q │
│ 7 │ 0 │ 1 │ McCarthy │ … │ 0 │ 17463 │ 51.8625 │ E46 │ S │
├─────────────┴──────────┴────────┴───────────┴───┴───────┴──────────────────┴─────────┴─────────┴──────────┤
│ 7 rows 13 columns (9 shown) │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────┘
We can easily convert the query_result object to a pandas DataFrame with its .df() method:
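# Materialize the query result as a pandas DataFrame
titanic_df = query_result.df()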
And more…
- Load from MySQL or PostgreSQL
- Load from remote files: httpfs
- Load from Google Cloud Storage
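For instance, a minimal sketch of reading a remote Parquet file over HTTP with the httpfs extension (the URL below is a placeholder, not a real dataset):
con.query("""
INSTALL httpfs;
LOAD httpfs;
-- placeholder URL, for illustration only
SELECT *
FROM 'https://example.com/data/titanic.parquet'
LIMIT 5;
""")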
2. Simplified Pivot Statement
The PIVOT statement helps us reorganize data in a table: it creates a new table whose columns are derived from the distinct values in the original table. DuckDB supports both the SQL standard PIVOT syntax and a simplified PIVOT syntax. The simplified PIVOT syntax is more intuitive. Try it!
PIVOT [dataset]
ON [columns]
USING [values]
GROUP BY [rows]
ORDER BY [columns-with-order-directions]
LIMIT [limit-of-rows];
For example, suppose we want to turn the gender values into new columns and count the passengers in each survival class:
pivot_result = con.query("""
PIVOT titanic ON Sex USING COUNT(*) GROUP BY Survived
""")
print(pivot_result.df().to_markdown(tablefmt='psql'))
+----+------------+----------+--------+
| | Survived | female | male |
|----+------------+----------+--------|
| 0 | 0 | 16 | 86 |
| 1 | 1 | 40 | 14 |
+----+------------+----------+--------+
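For comparison, here is a sketch of the same pivot written in the SQL standard PIVOT syntax that DuckDB also accepts (modeled on the DuckDB documentation; shown as an illustrative sketch):
pivot_standard = con.query("""
FROM titanic
PIVOT (
    COUNT(*)
    FOR Sex IN ('female', 'male')
    GROUP BY Survived
)
""")
print(pivot_standard.df().to_markdown(tablefmt='psql'))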
3. Join syntax is less messy compared with Pandas DataFrames
Joining two tables is a common task in data analysis. Although the Pandas pd.merge API works well, it becomes verbose once you have multiple tables (DataFrames).
Consider this example (adapted from the pandas documentation):
import pandas as pd
left_df = pd.DataFrame(
{
"key": ["K0", "K1", "K2", "K3"],
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
}
)
print(left_df.to_markdown(tablefmt='psql'))
right_df = pd.DataFrame(
{
"key": ["K0", "K1", "K2", "K3"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
}
)
print(right_df.to_markdown(tablefmt='psql'))
pandas_join_result = pd.merge(left_df, right_df, on="key")
print(pandas_join_result.to_markdown(tablefmt='psql'))
+----+-------+-----+-----+
| | key | A | B |
|----+-------+-----+-----|
| 0 | K0 | A0 | B0 |
| 1 | K1 | A1 | B1 |
| 2 | K2 | A2 | B2 |
| 3 | K3 | A3 | B3 |
+----+-------+-----+-----+
+----+-------+-----+-----+
| | key | C | D |
|----+-------+-----+-----|
| 0 | K0 | C0 | D0 |
| 1 | K1 | C1 | D1 |
| 2 | K2 | C2 | D2 |
| 3 | K3 | C3 | D3 |
+----+-------+-----+-----+
+----+-------+-----+-----+-----+-----+
| | key | A | B | C | D |
|----+-------+-----+-----+-----+-----|
| 0 | K0 | A0 | B0 | C0 | D0 |
| 1 | K1 | A1 | B1 | C1 | D1 |
| 2 | K2 | A2 | B2 | C2 | D2 |
| 3 | K3 | A3 | B3 | C3 | D3 |
+----+-------+-----+-----+-----+-----+
Sometimes a join condition needs some preprocessing of the join columns; in Pandas, we would create a temporary column with .apply (on a pd.Series), which also leads to verbose code (see the sketch after the next example).
The DuckDB JOIN statement is more concise. Luckily, we can query Pandas pd.DataFrame objects directly as DuckDB tables!
result = con.query("""
SELECT *
FROM left_df JOIN right_df ON left_df.key=right_df.key
""")
print(result.df().to_markdown(tablefmt='psql'))
+----+-------+-----+-----+---------+-----+-----+
| | key | A | B | key_1 | C | D |
|----+-------+-----+-----+---------+-----+-----|
| 0 | K0 | A0 | B0 | K0 | C0 | D0 |
| 1 | K1 | A1 | B1 | K1 | C1 | D1 |
| 2 | K2 | A2 | B2 | K2 | C2 | D2 |
| 3 | K3 | A3 | B3 | K3 | C3 | D3 |
+----+-------+-----+-----+---------+-----+-----+
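And because a join condition is just a SQL expression, the preprocessing mentioned earlier can often be written inline, with no temporary columns. A minimal sketch (the lower() normalization is purely illustrative):
result = con.query("""
SELECT *
FROM left_df
JOIN right_df
  ON lower(left_df.key) = lower(right_df.key)
""")
print(result.df().to_markdown(tablefmt='psql'))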
4. Geospatial analytics
DuckDB provides spatial data management and analysis capabilities through its spatial extension. It allows users to store and manipulate geographic data, including points, lines, and polygons, and to perform spatial queries.
The extension supports reading over 50 different geospatial formats. We can quickly load the Singapore map (in GeoJSON format):
con.query("""
INSTALL spatial;
LOAD spatial;
CREATE TABLE singapore_geometry AS
SELECT *
FROM ST_Read('singapore.json');
""")
geometry_df = con.query("""
SELECT COUNTRY, NAME_1, geom
FROM singapore_geometry
""")
geometry_df
┌───────────┬────────────┬─────────────────────────────────────────────────────────────────────────────────────────────┐
│ COUNTRY │ NAME_1 │ geom │
│ varchar │ varchar │ geometry │
├───────────┼────────────┼─────────────────────────────────────────────────────────────────────────────────────────────┤
│ Singapore │ Central │ MULTIPOLYGON (((103.8478 1.2222, 103.8489 1.2231, 103.8517 1.2203, 103.8504 1.2141, 103.8… │
│ Singapore │ East │ MULTIPOLYGON (((104.0242 1.3656, 104.0386 1.3569, 104.0364 1.3533, 104.0333 1.3536, 104.0… │
│ Singapore │ North │ MULTIPOLYGON (((103.8282 1.3665, 103.8271 1.3627, 103.8251 1.3635, 103.8243 1.3618, 103.8… │
│ Singapore │ North-East │ MULTIPOLYGON (((103.8959 1.3374, 103.8889 1.3336, 103.8851 1.3409, 103.8773 1.3443, 103.8… │
│ Singapore │ West │ MULTIPOLYGON (((103.7346 1.175, 103.7419 1.1742, 103.7389 1.1664, 103.7336 1.1669, 103.73… │
└───────────┴────────────┴─────────────────────────────────────────────────────────────────────────────────────────────┘
Now let's run a spatial query to check which region (Central, East, North, North-East, West) contains the point at latitude 1.2966, longitude 103.7764:
which_region_df = con.query("""
SELECT COUNTRY, NAME_1
FROM singapore_geometry
WHERE ST_Contains(geom, ST_Point(103.7764, 1.2966))
""")
print(which_region_df.df().to_markdown(tablefmt='psql'))
+----+-----------+----------+
| | COUNTRY | NAME_1 |
|----+-----------+----------|
| 0 | Singapore | Central |
+----+-----------+----------+
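The extension also ships functions for manipulating geometries. As a sketch, we could rank the regions by the planar distance (in degrees of longitude/latitude) between their centroids and the same point, using the spatial extension's ST_Centroid and ST_Distance functions:
distance_df = con.query("""
SELECT NAME_1,
       ST_Distance(ST_Centroid(geom), ST_Point(103.7764, 1.2966)) AS dist_degrees
FROM singapore_geometry
ORDER BY dist_degrees
""")
print(distance_df.df().to_markdown(tablefmt='psql'))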
5. Benchmark
DuckDB is fast and memory-efficient. Here are some benchmarks comparing DuckDB and Polars:
- DuckDB (database-like ops benchmark): https://duckdblabs.github.io/db-benchmark/
- Polars (TPC-H benchmark): https://pola.rs/posts/benchmarks/