New York City Taxi and For-Hire Vehicle Data
On August 12th, 2016, Mark Litwintschik
published a blog post called
1.1 Billion Taxi Rides with MapD & 8 Nvidia Pascal
Titan Xs
.
We reproduced the approach provided by this post to
create a benchmark using Ibis
with
New York City Taxi and For-Hire Vehicle Dataset
.
The data acquisition and cleaning used by Mark Litwintschik was described in his another blog post on 13th Febrary, 2016, called A Billion Taxi Rides in Redshift .
The mentioned blog post was based on another blog post published by Todd W. Schneider called Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance . The repository that contains the scripts and explanations about how to download and clean the data can be accessed at @toddwschneider's GitHub repo .
Also, both the two Mark's blog post, mentioned before,
give more extra details to prepare the data that should
be used to load the data to OmniSciDB
database.
All the code used to replicate the data acquisition and cleaning, mentioned before, and the current benchmark scripts are stored into Ibis-benchmark repo . For information about how to replicate this benchmark, check out the README.md file.
This benchmark aims to show the performance in time for
some expressions using Ibis with
OmniSciDB
and Pandas
.
Before start, there are some termilogies used in this document that is explained here to help the reading.
IPC
: Inter Process Communication. OmniSciDB uses Apache Thrift for IPC.Cursor
: Cursors are used by database programmers to process individual rows returned by database system queries.CPU
: Central Processing Unit.GPU
: Graphics Processing Unit.Ibis
: Python Data Analysis Productivity Framework.
This benchmark measures the execution time (in
seconds) consumed by expressions using
Ibis OmniSciDB
backend, with
OmniSciDB CPU
and
OmniSciDB CPU
, and pure
Pandas
.
Ibis OmniSciDB
backend can be used
with three differents execution types:
IPC CPU
,
IPC GPU
and
Cursor
.
The expressions used by this benchmark was based on the queries used by 1.1 Billion Taxi Rides with MapD & 8 Nvidia Pascal Titan Xs blog post :
-
t[t.cab_type, t.count()].group_by(t.cab_type)
-
t[ t.passenger_count, t.total_amount.mean().name('total_amount_mean') ].group_by(t.passenger_count)
-
t[ t.passenger_count, t.pickup_datetime.year().name('pickup_year'), t.count().name('pickup_year_count'), ].group_by([t.passenger_count, t.pickup_year]),
-
t[ t.passenger_count, t.pickup_datetime.year().name('pickup_year'), t.trip_distance.cast(int).name('distance'), t.count().name('the_count'), ] .group_by([t.passenger_count, t.pickup_year, t.distance]) .order_by([t.pickup_year, t.the_count], asc=False),
Where t
is the trip
table that contains
New York City Taxi and For-Hire Vehicle Data
.
The time reported here doesn't considerer the time
for data loading, eg. Pandas
read_csv
or
OmniSciDB
COPY FROM
Some considerations about this benchmark:
-
Ibis OmniSciDB
backend, usingcursor
execution type, presents a better performance thanPandas
and otherIbis OmniSciDB
execution types for all expressions. -
Ibis OmniSciDB
backend usingIPC GPU
presents a better performance than usingIPC CPU
for all expressions. -
Pandas
presents a better performance thanIbis OmniSciDB
backend usingIPC
execution types for simple expressions.
This benchmarking have been ran in a computer with the following specification:
Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 48 On-line CPU(s) list: 0-47 Thread(s) per core: 2 Core(s) per socket: 24 Socket(s): 1 NUMA node(s): 4 Vendor ID: AuthenticAMD CPU family: 23 Model: 8 Model name: AMD Ryzen Threadripper 2970WX 24-Core Processor Stepping: 2 CPU MHz: 1717.658 CPU max MHz: 3000.0000 CPU min MHz: 2200.0000 BogoMIPS: 5987.92 Virtualization: AMD-V L1d cache: 32K L1i cache: 64K L2 cache: 512K L3 cache: 8192K NUMA node0 CPU(s): 0-5,24-29 NUMA node1 CPU(s): 12-17,36-41 NUMA node2 CPU(s): 6-11,30-35 NUMA node3 CPU(s): 18-23,42-47 MemTotal: 131947616 kB
GPU Information
+-----------------------------------------------------------------------------+ | NVIDIA-SMI 430.50 Driver Version: 430.50 CUDA Version: 10.1 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | |===============================+======================+======================| | 0 Quadro RTX 8000 Off | 00000000:41:00.0 Off | Off | | 34% 47C P0 65W / 260W | 0MiB / 48601MiB | 0% Default | +-------------------------------+----------------------+----------------------+ | 1 Quadro RTX 8000 Off | 00000000:42:00.0 Off | Off | | 35% 53C P0 1W / 260W | 0MiB / 48600MiB | 0% Default | +-------------------------------+----------------------+----------------------+
Disk Partition Information
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT sda 8:0 0 3.7T 0 disk └─sda1 8:1 0 3.7T 0 part /work nvme0n1 259:0 0 953.9G 0 disk ├─nvme0n1p1 259:1 0 100G 0 part ├─nvme0n1p2 259:2 0 1K 0 part └─nvme0n1p5 259:3 0 853.9G 0 part /