ibis - an idiomatic flavor of SQL for Python programmers.

import ibis

import ibis, pandas, dask, pytest, base64

ibis is a mature open source project that has been in development for ~5 years with 1349 ⭐️ on Github. It provides an interface to SQL for Python programmers and bridges the gap between remote storage & execution systems. These features provide authors the ability to:

  1. Write backend independent SQL expressions in Python
  2. Access different database connections (eg. sqlite, omnisci, pandas)
  3. Visually confirm their SQL queries with directed acyclic graphs (DAG)

ibis is an alternative approach using databases that relies on Python rather than SQL experience. Typically, people would have to learn an entirely new syntax, or flavor of SQL, to perform simple tasks. Now, those with Python abilities can avoid a new learning curve by using ibis for composing and executing expressions in SQL. Using familiar Python syntaxes similar to pandas and dask. ibis assists in formation of SQL expressions by providing visual feedback about each Python object.

This blog focuses on writing SQL expressions in Python and how to compose queries visually using ibis. We'll demonstrate this by using SQLite and Sean Lahman’s baseball database.

Connecting to a database.

To get started, we’ll need to establish a database connection. ibis makes it easy to create different connections. Let's go ahead and do this now.

client = ibis.sqlite.connect('lahmansbaseballdb.sqlite')


The client variable above represents our connection to the database. The baseball database we are using has 29 tables, which we can see by running the following code:

>>> client.list_tables()
[...'appearances'...'halloffame', 'homegames', 'leagues', 'managers',...]


It is essential to use the correct ibis client for the file types being used.


ID playerID yearid votedBy ballots needed votes inducted category needed_note
0 1 cobbty01 1936 BBWAA 226 170 222 Y Player None
1 2 ruthba01 1936 BBWAA 226 170 215 Y Player None
2 3 wagneho01 1936 BBWAA 226 170 215 Y Player None
3 4 mathech01 1936 BBWAA 226 170 205 Y Player None
4 5 johnswa01 1936 BBWAA 226 170 189 Y Player None

Selecting and visualizing tables.

Now that we have a connection we can create a data frame from ibis expressions similar to the example above. To demonstrate some of ibis capabilities, we’ll examine information in the halloffame and appearances tables of the database. Let's create those tables now.

halloffame, appearances = client.table('halloffame'), client.table('appearances')


A useful feature of ibis is its ability to representat of our SQL query in a DAG using graphviz. dask users will be familiar with this helpful feature.

At the moment, our table doesn’t hold any data; instead, they are expressions represent operations applied to the data where its located - in the SQLite database. We can prove this by asserting its instance like this:

>>> assert isinstance(halloffame, ibis.expr.types.TableExpr)

sample = halloffame.head()

Looking at the sample expression we can reveal the SQL query we constructed.

>>> str(sample.compile())
'SELECT t0."ID", t0."playerID",...FROM base.halloffame AS t0...LIMIT ? OFFSET ?'

And we can now execute the expression to return a tidy pandas.DataFrame.

>>> assert isinstance(sample.execute(), pandas.DataFrame)

sample.execute()

Filtering and selecting data.

As mentioned earlier, ibis uses familiar pandas syntax to build SQL queries. Tho show how to filter and select data we will filter the players from our halloffame table by applying our boolean logic to the halloffame expression.

condition = halloffame.category == 'Player'
players = halloffame[condition]

Joining ibis tables.

If we want a single view of the halloffame players and their appearances, we need to join the two tables. To do this, we’ll perform an inner join based on the playerID column in our new players table.

condition = players.playerID == appearances.playerID

The players and appearances tables share columns, but SQL doesn't handle overlapping columns very well. Specifically in our data ('playerID' and 'ID') are in both tables, we will filter the columns from the appearances table, because they appear in the players table. Then we will compose the join. We do this by using a Python list comprehension.

columns = [x for x in appearances.columns if x not in ('playerID', 'ID')]

Now to compose the join.

unmaterialized = players.join(appearances[columns], condition)


Materializing the join

Joins, in databases, require a materialized view to proceed with building expressions. Just remember to use the expressions materialize method otherwise ibis will throw an error; demonstrated by the test below.

>>> with pytest.raises(ibis.IbisError): unmaterialized.distinct()


join = unmaterialized.materialize().distinct()  # like 🐼.DataFrame.drop_duplicates

The code above completes the join, and a sample of our joined data is shared in the table below.

ID playerID yearid votedBy ballots needed votes inducted category needed_note yearID teamID team_ID lgID G_all GS G_batting G_defense G_p G_c G_1b G_2b G_3b G_ss G_lf G_cf G_rf G_of G_dh G_ph G_pr
0 2861 aaronha01 1982 BBWAA 415 312 406 Y Player None 1871 BS1 1 NA 31 31 31 31 0 0 0 16 0 15 0 0 0 0 0 0 0
1 3744 abbotji01 2005 BBWAA 516 387 13 N Player None 1871 BS1 1 NA 31 31 31 31 0 0 0 16 0 15 0 0 0 0 0 0 0
2 147 adamsba01 1937 BBWAA 201 151 8 N Player None 1871 BS1 1 NA 31 31 31 31 0 0 0 16 0 15 0 0 0 0 0 0 0
3 260 adamsba01 1938 BBWAA 262 197 11 N Player None 1871 BS1 1 NA 31 31 31 31 0 0 0 16 0 15 0 0 0 0 0 0 0
4 385 adamsba01 1939 BBWAA 274 206 11 N Player None 1871 BS1 1 NA 31 31 31 31 0 0 0 16 0 15 0 0 0 0 0 0 0

ibis supports other join strategies as well. Those are listed below:

>>> [x for x in dir(players) if 'join' in x]
['anti_join', 'any_inner_join', 'any_left_join', 'asof_join', 'cross_join', 'inner_join', 'join', 'left_join', 'outer_join', 'semi_join']


Executing an expression

In this section, we'll expand the join in a pandas.DataFrame. We'll answer the question:

How many pitchers have been inducted into the hall of fame?

yearID 1936 1937 1938 1939 1942 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
inducted
N 105.0 106.0 114.0 99.0 67.0 89.0 97.0 34.0 100.0 117.0 96 80.0 69.0 77.0 50.0 60.0 102.0 - 150 - 131 - 75.0 - 87.0 - 47.0 77.0 47.0 44.0 45.0 48.0 43.0 43.0 42.0 36.0 30.0 33.0 36.0 53.0 59.0 38.0 40.0 44.0 26.0 39.0 40.0 26.0 43.0 39.0 42.0 42.0 35.0 32.0 38.0 38.0 35.0 29.0 25.0 25.0 28.0 30.0 27.0 31.0 30.0 25.0 28.0 30.0 24.0 21.0 25.0 31.0 26.0 37.0 33.0 30.0 30.0 31.0 29.0
Y 5.0 3.0 1.0 7.0 1.0 9.0 11.0 4.0 2.0 3.0 - 2.0 2.0 4.0 3.0 6.0 2.0 1 - 1 - 2 3.0 4 6.0 1 1.0 2.0 3.0 4.0 3.0 7.0 5.0 5.0 4.0 3.0 4.0 3.0 2.0 2.0 3.0 2.0 3.0 3.0 5.0 4.0 3.0 2.0 1.0 3.0 2.0 4.0 3.0 1.0 2.0 3.0 1.0 2.0 3.0 4.0 3.0 3.0 1.0 2.0 2.0 2.0 2.0 2.0 1.0 3.0 1.0 2.0 2.0 1.0 3.0 4.0 2.0 3.0 6.0

Some hitters may have also pitched, so we’ll need to filter out those appearances. join.G_p represents the numbers of games a player played as a pitcher and we select players who played mostly as pitchers, or pitched more than 100 games.

pitchers = join[join.G_p > 100]

Group the pitchers based on a condition.

condition = [pitchers.inducted, pitchers.yearID]

And count them annually.

grouped_pitchers = pitchers.groupby(condition).count()

df = grouped_pitchers.execute().set_index('inducted yearID'.split())


What next?

That's it! In future posts, we’ll explore more backends and visualize more ibis objects. If you’d like to contribute to ibis, please take a look at OpenTeams or the ibis contributing guide.

Comments

Comments powered by Disqus