import ibis

import ibis, pandas, dask, pytest, base64

ibis is a mature open source project that has been in development for ~5 years with 1349 ⭐️ on Github. It provides an interface to SQL for Python programmers and bridges the gap between remote storage & execution systems. These features provide authors the ability to:

Write backend independent SQL expressions in Python
Access different database connections (eg. sqlite, omnisci, pandas)
Visually confirm their SQL queries with directed acyclic graphs (DAG)

ibis is an alternative approach using databases that relies on Python rather than SQL experience. Typically, people would have to learn an entirely new syntax, or flavor of SQL, to perform simple tasks. Now, those with Python abilities can avoid a new learning curve by using ibis for composing and executing expressions in SQL. Using familiar Python syntaxes similar to pandas and dask. ibis assists in formation of SQL expressions by providing visual feedback about each Python object.

This blog focuses on writing SQL expressions in Python and how to compose queries visually using ibis. We'll demonstrate this by using SQLite and Sean Lahman’s baseball database.

Connecting to a database.¶

To get started, we’ll need to establish a database connection. ibis makes it easy to create different connections. Let's go ahead and do this now.

client = ibis.sqlite.connect('lahmansbaseballdb.sqlite')

The client variable above represents our connection to the database. The baseball database we are using has 29 tables, which we can see by running the following code:

>>> client.list_tables()
[...'appearances'...'halloffame', 'homegames', 'leagues', 'managers',...]

It is essential to use the correct ibis client for the file types being used.

	ID	playerID	yearid	votedBy	ballots	needed	votes	inducted	category	needed_note
0	1	cobbty01	1936	BBWAA	226	170	222	Y	Player	None
1	2	ruthba01	1936	BBWAA	226	170	215	Y	Player	None
2	3	wagneho01	1936	BBWAA	226	170	215	Y	Player	None
3	4	mathech01	1936	BBWAA	226	170	205	Y	Player	None
4	5	johnswa01	1936	BBWAA	226	170	189	Y	Player	None

Selecting and visualizing tables.¶

Now that we have a connection we can create a data frame from ibis expressions similar to the example above. To demonstrate some of ibis capabilities, we’ll examine information in the halloffame and appearances tables of the database. Let's create those tables now.

halloffame, appearances = client.table('halloffame'), client.table('appearances')

A useful feature of ibis is its ability to representat of our SQL query in a DAG using graphviz. dask users will be familiar with this helpful feature.

At the moment, our table doesn’t hold any data; instead, they are expressions represent operations applied to the data where its located - in the SQLite database. We can prove this by asserting its instance like this:

>>> assert isinstance(halloffame, ibis.expr.types.TableExpr)

sample = halloffame.head()

Looking at the sample expression we can reveal the SQL query we constructed.

>>> str(sample.compile())
'SELECT t0."ID", t0."playerID",...FROM base.halloffame AS t0...LIMIT ? OFFSET ?'

And we can now execute the expression to return a tidy pandas.DataFrame.

>>> assert isinstance(sample.execute(), pandas.DataFrame)

sample.execute()

Filtering and selecting data.¶

As mentioned earlier, ibis uses familiar pandas syntax to build SQL queries. Tho show how to filter and select data we will filter the players from our halloffame table by applying our boolean logic to the halloffame expression.

condition = halloffame.category == 'Player'
players = halloffame[condition]

Joining `ibis` tables.¶

If we want a single view of the halloffame players and their appearances, we need to join the two tables. To do this, we’ll perform an inner join based on the playerID column in our new players table.

condition = players.playerID == appearances.playerID

The players and appearances tables share columns, but SQL doesn't handle overlapping columns very well. Specifically in our data ('playerID' and 'ID') are in both tables, we will filter the columns from the appearances table, because they appear in the players table. Then we will compose the join. We do this by using a Python list comprehension.

columns = [x for x in appearances.columns if x not in ('playerID', 'ID')]

Now to compose the join.

unmaterialized = players.join(appearances[columns], condition)

Materializing the `join`¶

Joins, in databases, require a materialized view to proceed with building expressions. Just remember to use the expressions materialize method otherwise ibis will throw an error; demonstrated by the test below.

>>> with pytest.raises(ibis.IbisError): unmaterialized.distinct()

join = unmaterialized.materialize().distinct()  # like 🐼.DataFrame.drop_duplicates

The code above completes the join, and a sample of our joined data is shared in the table below.

	ID	playerID	yearid	votedBy	ballots	needed	votes	inducted	category	needed_note	yearID	teamID	team_ID	lgID	G_all	GS	G_batting	G_defense	G_2b	G_ss
0	2861	aaronha01	1982	BBWAA	415	312	406	Y	Player	None	1871	BS1	1	NA	31	31	31	31	16	15
1	3744	abbotji01	2005	BBWAA	516	387	13	N	Player	None	1871	BS1	1	NA	31	31	31	31	16	15
2	147	adamsba01	1937	BBWAA	201	151	8	N	Player	None	1871	BS1	1	NA	31	31	31	31	16	15
3	260	adamsba01	1938	BBWAA	262	197	11	N	Player	None	1871	BS1	1	NA	31	31	31	31	16	15
4	385	adamsba01	1939	BBWAA	274	206	11	N	Player	None	1871	BS1	1	NA	31	31	31	31	16	15

ibis supports other join strategies as well. Those are listed below:

>>> [x for x in dir(players) if 'join' in x]
['anti_join', 'any_inner_join', 'any_left_join', 'asof_join', 'cross_join', 'inner_join', 'join', 'left_join', 'outer_join', 'semi_join']

Executing an expression¶

In this section, we'll expand the join in a pandas.DataFrame. We'll answer the question:

How many pitchers have been inducted into the hall of fame?

yearID	1936	1937	1938	1939	1942	1945	1946	1947	1948	1949	1950	1951	1952	1953	1954	1955	1956	1957	1958	1959	1960	1961	1962	1963	1964	1965	1966	1967	1968	1969	1970	1971	1972	1973	1974	1975	1976	1977	1978	1979	1980	1981	1982	1983	1984	1985	1986	1987	1988	1989	1990	1991	1992	1993	1994	1995	1996	1997	1998	1999	2000	2001	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016	2017	2018
inducted
N	105.0	106.0	114.0	99.0	67.0	89.0	97.0	34.0	100.0	117.0	96	80.0	69.0	77.0	50.0	60.0	102.0	-	150	-	131	-	75.0	-	87.0	-	47.0	77.0	47.0	44.0	45.0	48.0	43.0	43.0	42.0	36.0	30.0	33.0	36.0	53.0	59.0	38.0	40.0	44.0	26.0	39.0	40.0	26.0	43.0	39.0	42.0	42.0	35.0	32.0	38.0	38.0	35.0	29.0	25.0	25.0	28.0	30.0	27.0	31.0	30.0	25.0	28.0	30.0	24.0	21.0	25.0	31.0	26.0	37.0	33.0	30.0	30.0	31.0	29.0
Y	5.0	3.0	1.0	7.0	1.0	9.0	11.0	4.0	2.0	3.0	-	2.0	2.0	4.0	3.0	6.0	2.0	1	-	1	-	2	3.0	4	6.0	1	1.0	2.0	3.0	4.0	3.0	7.0	5.0	5.0	4.0	3.0	4.0	3.0	2.0	2.0	3.0	2.0	3.0	3.0	5.0	4.0	3.0	2.0	1.0	3.0	2.0	4.0	3.0	1.0	2.0	3.0	1.0	2.0	3.0	4.0	3.0	3.0	1.0	2.0	2.0	2.0	2.0	2.0	1.0	3.0	1.0	2.0	2.0	1.0	3.0	4.0	2.0	3.0	6.0

Some hitters may have also pitched, so we’ll need to filter out those appearances. join.G_p represents the numbers of games a player played as a pitcher and we select players who played mostly as pitchers, or pitched more than 100 games.

pitchers = join[join.G_p > 100]

Group the pitchers based on a condition.

condition = [pitchers.inducted, pitchers.yearID]

And count them annually.

grouped_pitchers = pitchers.groupby(condition).count()

df = grouped_pitchers.execute().set_index('inducted yearID'.split())

What next?¶

That's it! In future posts, we’ll explore more backends and visualize more ibis objects. If you’d like to contribute to ibis, please take a look at OpenTeams or the ibis contributing guide.