New York City Taxi and For-Hire Vehicle Data

On August 12th, 2016, Mark Litwintschik published a blog post called 1.1 Billion Taxi Rides with MapD & 8 Nvidia Pascal Titan Xs . We reproduced the approach provided by this post to create a benchmark using Ibis with New York City Taxi and For-Hire Vehicle Dataset .

The data acquisition and cleaning used by Mark Litwintschik was described in his another blog post on 13th Febrary, 2016, called A Billion Taxi Rides in Redshift .

The mentioned blog post was based on another blog post published by Todd W. Schneider called Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance . The repository that contains the scripts and explanations about how to download and clean the data can be accessed at @toddwschneider's GitHub repo .

Also, both the two Mark's blog post, mentioned before, give more extra details to prepare the data that should be used to load the data to OmniSciDB database.

All the code used to replicate the data acquisition and cleaning, mentioned before, and the current benchmark scripts are stored into Ibis-benchmark repo . For information about how to replicate this benchmark, check out the README.md file.

This benchmark aims to show the performance in time for some expressions using Ibis with OmniSciDB and Pandas.

Terminology

Before start, there are some termilogies used in this document that is explained here to help the reading.

  • IPC: Inter Process Communication. OmniSciDB uses Apache Thrift for IPC.
  • Cursor: Cursors are used by database programmers to process individual rows returned by database system queries.
  • CPU: Central Processing Unit.
  • GPU: Graphics Processing Unit.
  • Ibis: Python Data Analysis Productivity Framework.
Methodology

This benchmark measures the execution time (in seconds) consumed by expressions using Ibis OmniSciDB backend, with OmniSciDB CPU and OmniSciDB CPU, and pure Pandas.

Ibis OmniSciDB backend can be used with three differents execution types: IPC CPU, IPC GPU and Cursor.

The expressions used by this benchmark was based on the queries used by 1.1 Billion Taxi Rides with MapD & 8 Nvidia Pascal Titan Xs blog post :

  • t[t.cab_type, t.count()].group_by(t.cab_type)
  • t[
      t.passenger_count, t.total_amount.mean().name('total_amount_mean')
    ].group_by(t.passenger_count)
                                      
  • t[
      t.passenger_count,
      t.pickup_datetime.year().name('pickup_year'),
      t.count().name('pickup_year_count'),
    ].group_by([t.passenger_count, t.pickup_year]),
                                      
  • t[
      t.passenger_count,
      t.pickup_datetime.year().name('pickup_year'),
      t.trip_distance.cast(int).name('distance'),
      t.count().name('the_count'),
    ]
    .group_by([t.passenger_count, t.pickup_year, t.distance])
    .order_by([t.pickup_year, t.the_count], asc=False),
                                      

Where t is the trip table that contains New York City Taxi and For-Hire Vehicle Data .

The time reported here doesn't considerer the time for data loading, eg. Pandas read_csv or OmniSciDB COPY FROM

Results
Summary

Some considerations about this benchmark:

  • Ibis OmniSciDB backend, using cursor execution type, presents a better performance than Pandas and other Ibis OmniSciDB execution types for all expressions.
  • Ibis OmniSciDB backend using IPC GPU presents a better performance than using IPC CPU for all expressions.
  • Pandas presents a better performance than Ibis OmniSciDB backend using IPC execution types for simple expressions.

Hardware information

This benchmarking have been ran in a computer with the following specification:

  Architecture:        x86_64
  CPU op-mode(s):      32-bit, 64-bit
  Byte Order:          Little Endian
  CPU(s):              48
  On-line CPU(s) list: 0-47
  Thread(s) per core:  2
  Core(s) per socket:  24
  Socket(s):           1
  NUMA node(s):        4
  Vendor ID:           AuthenticAMD
  CPU family:          23
  Model:               8
  Model name:          AMD Ryzen Threadripper 2970WX 24-Core Processor
  Stepping:            2
  CPU MHz:             1717.658
  CPU max MHz:         3000.0000
  CPU min MHz:         2200.0000
  BogoMIPS:            5987.92
  Virtualization:      AMD-V
  L1d cache:           32K
  L1i cache:           64K
  L2 cache:            512K
  L3 cache:            8192K
  NUMA node0 CPU(s):   0-5,24-29
  NUMA node1 CPU(s):   12-17,36-41
  NUMA node2 CPU(s):   6-11,30-35
  NUMA node3 CPU(s):   18-23,42-47
  MemTotal:            131947616 kB
                        

GPU Information

  +-----------------------------------------------------------------------------+
  | NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
  |-------------------------------+----------------------+----------------------+
  | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
  | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
  |===============================+======================+======================|
  |   0  Quadro RTX 8000     Off  | 00000000:41:00.0 Off |                  Off |
  | 34%   47C    P0    65W / 260W |      0MiB / 48601MiB |      0%      Default |
  +-------------------------------+----------------------+----------------------+
  |   1  Quadro RTX 8000     Off  | 00000000:42:00.0 Off |                  Off |
  | 35%   53C    P0     1W / 260W |      0MiB / 48600MiB |      0%      Default |
  +-------------------------------+----------------------+----------------------+
                        

Disk Partition Information

  NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
  sda           8:0    0   3.7T  0 disk
  └─sda1        8:1    0   3.7T  0 part /work
  nvme0n1     259:0    0 953.9G  0 disk
  ├─nvme0n1p1 259:1    0   100G  0 part
  ├─nvme0n1p2 259:2    0     1K  0 part
  └─nvme0n1p5 259:3    0 853.9G  0 part /