ibis a pythonic query language

import ibis.omniscidb, dask, intake, sqlalchemy, pandas, pyarrow as arrow, altair, h5py as hdf5

The design of ibis, and other open source projects like intake and dask, begin with the pioneering "blaze" ecosystem provided interfaces to store, describe, query, and process data. "blaze" had ambitious goals and is now defunct, but its influences resonate throughout the scientific python community development.

dask emerged directly from "blaze"; meanwhile contributors can now be found working on projects like ibis and arrow. ibis took the role of providing a generic query language for data in python, that is independent of the computation.

In this document we'll describe the different ibis backend architectures and what it means to be independent of compute.

The design of ibis backends.

Currently, ibis supports >10 backends.

>>> dir(ibis)
[...HDFS...WebHDFS...bigquery...clickhouse...hdf5...impala...omniscidb...pandas...pyspark...spark...sql...sqlite...]

A backend takes an ibis query expression and applies computation, and the query is independent of the computation. A backend implementation, that can be queried with ibis, has one of the three following architectures.

  1. Direct execution backends - pandas and hdf5.
  2. Expression generating backends that create sqlalchemy expressions.
  3. String generating backends - ibis.bigquery and ibis.omniscidb

In the next few sections we'll unravel some of the different capiabilities of each approach.

A data-driven history of ibis compatability.

The table below looks at over 2000 issues in the ibis project. It provides an annual looked at the issues tagged in ibis for different backends over 6 years.

omnisci spark postgres bigquery pandas sqlite impala kudu geospatial clickhouse mysql sqlalchemy
year
2015 2 2 25 52 17
2016 3 2 4 3
2017 1 21 15 49 10 15 8 10
2018 31 10 71 35 8 17 9 2 2
2019 33 22 17 12 32 1 4 7 1 2 5
2020 38 3 4 2 4 1 2 1 3 4 4

We note an early focus ibis.sqlite, sqlalchemy and ibis.impala. Later, work began on the pandas backend rounding out the three different types of backgrounds. From this point, improvements were made to these key backends as ibis.clickhouse, ibis.spark and "postgres". Recently, For the past 3 years, Quansight, in partnership with OmniSci, added the ibis.omniscidb string generating backend. Since Quansight Labs has taken on a role as a community maintainer for ibis. This collaboration introduced geospatial functionality to ibis, and we have on going efforts to introduce SQL Server support

Currently, there is an ongoing effort to a . What backends what you like to see? dask? altair?

Learn more about working Geospatial Data in ibis.

ibis direct execution.

ibis direct execution backends like pandas and hdf5 operate on conventional in-memory python objects. pandas is the gold standard for structured data in python, and inspires the api for ibis.

pd = ibis.pandas.connect({'A': pandas.util.testing.makeDataFrame()})

pd is an ibis backend based off pandas.DataFrame objects.

expression = pd.table('A').head()

expression is an ibis query, that has expression.compile and expression.execute methods. We'll recognize the execute method when we return pandas.DataFrames from ibis expression. The compile method does not trigger any computation, rather it constructs an intermediate form that is interpretted by a backend.

>>> assert isinstance(expression.compile(), ibis.expr.types.TableExpr)

In the case of direction execution backends, the expression compiles to an the original ibis expression. And the computation is carried out based on a set of recipes defined in ibis.

In general, we would typically do this work directly in pandas, however this work is practical in mocking tests for expressions independent of backends.

Learn more about the HDF5 backend in the tests.

ibis expression generating backends.

db = ibis.sqlite.connect('lahmansbaseballdb.sqlite')
expression = db.table('halloffame').head()

Expression generating backends operate on SQL databases that interoperator with sqlalchemy.

>>> assert isinstance(expression.compile(), sqlalchemy.sql.Select)

In the case of expression generating backends, the intermediate representation is a sqlalchemy object. sqlalchemy is The Database Toolkit for Python, and ibis leverages it compatability with traditional SQL databases.

ibis string generating backends.

pip install --upgrade ibis-framework[omniscidb]

String generating backends allow ibis to interface with big data systems that manage their own computation. For example, we may connect to an example omnisci database.

omnisci = ibis.omniscidb.connect(host='metis.omnisci.com', user='demouser', password='HyperInteractive', port=443, database='omnisci', protocol='https')

omnisci is described as a string generating backend because the intermediate representation of the query is a flavor of SQL.

expression = omnisci.table('upstream_reservoir').head()

A string generating expression compiles to ibis.omniscidb flavored SQL, while ibis.bigquery may have a different string representatin.

>>> expression.compile()
'SELECT *\nFROM upstream_reservoir\nLIMIT 5'



Major credit goes to [@xmnlab] in his heroic PR to introduce ibis.omniscidb into ibis. You can watch the drama play out in this Github Issue. If you'd like to learn more about OmniSci and ibis.omniscidb checkout the following links.

Conclusion

We'd like to thank the maintainers of the ibis for their and effort in supporting the ibis community.

Comments

Comments powered by Disqus