ibis - an idiomatic flavor of SQL for Python programmers.
import ibis
import ibis, pandas, dask, pytest, base64
ibis
is a mature open source project that has been in development for ~5
years with 1349 ⭐️ on Github.
It provides an interface to SQL for Python programmers and
bridges the gap between remote storage & execution systems.
These features provide authors the ability to:
- Write backend independent SQL expressions in Python
- Access different database connections (eg. sqlite, omnisci, pandas)
- Visually confirm their SQL queries with directed acyclic graphs (DAG)
ibis
is an alternative approach using databases that relies on Python rather than SQL experience.
Typically, people would have to learn an entirely new syntax,
or flavor of SQL, to perform simple tasks.
Now, those with Python abilities can avoid a new learning curve
by using ibis
for composing and executing expressions in SQL.
Using familiar Python syntaxes similar to pandas
and dask
.
ibis
assists in formation of SQL expressions by providing visual feedback about each Python object.
This blog focuses on writing SQL expressions in Python and how to compose queries visually using ibis
.
We'll demonstrate this by using SQLite and Sean Lahman’s baseball database.
Connecting to a database.¶
To get started, we’ll need to establish a database connection. ibis
makes it easy to create different connections.
Let's go ahead and do this now.
client = ibis.sqlite.connect('lahmansbaseballdb.sqlite')
The client
variable above represents our connection to the database.
The baseball database we are using has 29 tables,
which we can see by running the following code:
>>> client.list_tables()
[...'appearances'...'halloffame', 'homegames', 'leagues', 'managers',...]
It is essential to use the correct
ibis
client
for the file types being used.
ID | playerID | yearid | votedBy | ballots | needed | votes | inducted | category | needed_note | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | cobbty01 | 1936 | BBWAA | 226 | 170 | 222 | Y | Player | None |
1 | 2 | ruthba01 | 1936 | BBWAA | 226 | 170 | 215 | Y | Player | None |
2 | 3 | wagneho01 | 1936 | BBWAA | 226 | 170 | 215 | Y | Player | None |
3 | 4 | mathech01 | 1936 | BBWAA | 226 | 170 | 205 | Y | Player | None |
4 | 5 | johnswa01 | 1936 | BBWAA | 226 | 170 | 189 | Y | Player | None |
Selecting and visualizing tables.¶
Now that we have a connection we can create a data frame from ibis expressions similar to the example above.
To demonstrate some of ibis
capabilities,
we’ll examine information in the halloffame
and appearances
tables of the database.
Let's create those tables now.
halloffame, appearances = client.table('halloffame'), client.table('appearances')
A useful feature of ibis is its ability to representat of our SQL query in a DAG using graphviz
.
dask
users will be familiar with this helpful feature.
At the moment, our table doesn’t hold any data; instead, they are expressions represent operations applied to the data where its located - in the SQLite database. We can prove this by asserting its instance like this:
>>> assert isinstance(halloffame, ibis.expr.types.TableExpr)
sample = halloffame.head()
Looking at the sample
expression we can reveal the SQL query we constructed.
>>> str(sample.compile())
'SELECT t0."ID", t0."playerID",...FROM base.halloffame AS t0...LIMIT ? OFFSET ?'
And we can now execute the expression to return a tidy pandas.DataFrame
.
>>> assert isinstance(sample.execute(), pandas.DataFrame)
sample.execute()
Filtering and selecting data.¶
As mentioned earlier, ibis uses familiar pandas syntax to build SQL queries.
Tho show how to filter and select data we will filter
the players
from our halloffame
table by applying
our boolean logic to the halloffame
expression.
condition = halloffame.category == 'Player'
players = halloffame[condition]
Joining ibis
tables.¶
If we want a single view of the halloffame
players and their appearances
,
we need to join the two tables.
To do this, we’ll perform an inner join based on the playerID
column in our new players
table.
condition = players.playerID == appearances.playerID
The players
and appearances
tables share columns, but SQL doesn't handle overlapping columns very well.
Specifically in our data ('playerID' and 'ID')
are in both tables, we will filter the
columns from the appearances
table, because they appear in the players
table.
Then we will compose the join. We do this by using a Python list comprehension.
columns = [x for x in appearances.columns if x not in ('playerID', 'ID')]
Now to compose the join.
unmaterialized = players.join(appearances[columns], condition)
Materializing the join
¶
Joins, in databases, require a materialized view to proceed with building expressions.
Just remember to use the expressions materialize method otherwise
ibis
will throw an error; demonstrated by the test below.
>>> with pytest.raises(ibis.IbisError): unmaterialized.distinct()
join = unmaterialized.materialize().distinct() # like 🐼.DataFrame.drop_duplicates
The code above completes the join
, and a sample of our joined data is shared in the table below.
ID | playerID | yearid | votedBy | ballots | needed | votes | inducted | category | needed_note | yearID | teamID | team_ID | lgID | G_all | GS | G_batting | G_defense | G_p | G_c | G_1b | G_2b | G_3b | G_ss | G_lf | G_cf | G_rf | G_of | G_dh | G_ph | G_pr | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2861 | aaronha01 | 1982 | BBWAA | 415 | 312 | 406 | Y | Player | None | 1871 | BS1 | 1 | NA | 31 | 31 | 31 | 31 | 0 | 0 | 0 | 16 | 0 | 15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 3744 | abbotji01 | 2005 | BBWAA | 516 | 387 | 13 | N | Player | None | 1871 | BS1 | 1 | NA | 31 | 31 | 31 | 31 | 0 | 0 | 0 | 16 | 0 | 15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 147 | adamsba01 | 1937 | BBWAA | 201 | 151 | 8 | N | Player | None | 1871 | BS1 | 1 | NA | 31 | 31 | 31 | 31 | 0 | 0 | 0 | 16 | 0 | 15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 260 | adamsba01 | 1938 | BBWAA | 262 | 197 | 11 | N | Player | None | 1871 | BS1 | 1 | NA | 31 | 31 | 31 | 31 | 0 | 0 | 0 | 16 | 0 | 15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 385 | adamsba01 | 1939 | BBWAA | 274 | 206 | 11 | N | Player | None | 1871 | BS1 | 1 | NA | 31 | 31 | 31 | 31 | 0 | 0 | 0 | 16 | 0 | 15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ibis
supports other join strategies as well. Those are listed below:
>>> [x for x in dir(players) if 'join' in x]
['anti_join', 'any_inner_join', 'any_left_join', 'asof_join', 'cross_join', 'inner_join', 'join', 'left_join', 'outer_join', 'semi_join']
Executing an expression¶
In this section, we'll expand the join in a pandas.DataFrame
. We'll answer the question:
How many pitchers have been inducted into the hall of fame?
yearID | 1936 | 1937 | 1938 | 1939 | 1942 | 1945 | 1946 | 1947 | 1948 | 1949 | 1950 | 1951 | 1952 | 1953 | 1954 | 1955 | 1956 | 1957 | 1958 | 1959 | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | 1969 | 1970 | 1971 | 1972 | 1973 | 1974 | 1975 | 1976 | 1977 | 1978 | 1979 | 1980 | 1981 | 1982 | 1983 | 1984 | 1985 | 1986 | 1987 | 1988 | 1989 | 1990 | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
inducted | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
N | 105.0 | 106.0 | 114.0 | 99.0 | 67.0 | 89.0 | 97.0 | 34.0 | 100.0 | 117.0 | 96 | 80.0 | 69.0 | 77.0 | 50.0 | 60.0 | 102.0 | - | 150 | - | 131 | - | 75.0 | - | 87.0 | - | 47.0 | 77.0 | 47.0 | 44.0 | 45.0 | 48.0 | 43.0 | 43.0 | 42.0 | 36.0 | 30.0 | 33.0 | 36.0 | 53.0 | 59.0 | 38.0 | 40.0 | 44.0 | 26.0 | 39.0 | 40.0 | 26.0 | 43.0 | 39.0 | 42.0 | 42.0 | 35.0 | 32.0 | 38.0 | 38.0 | 35.0 | 29.0 | 25.0 | 25.0 | 28.0 | 30.0 | 27.0 | 31.0 | 30.0 | 25.0 | 28.0 | 30.0 | 24.0 | 21.0 | 25.0 | 31.0 | 26.0 | 37.0 | 33.0 | 30.0 | 30.0 | 31.0 | 29.0 |
Y | 5.0 | 3.0 | 1.0 | 7.0 | 1.0 | 9.0 | 11.0 | 4.0 | 2.0 | 3.0 | - | 2.0 | 2.0 | 4.0 | 3.0 | 6.0 | 2.0 | 1 | - | 1 | - | 2 | 3.0 | 4 | 6.0 | 1 | 1.0 | 2.0 | 3.0 | 4.0 | 3.0 | 7.0 | 5.0 | 5.0 | 4.0 | 3.0 | 4.0 | 3.0 | 2.0 | 2.0 | 3.0 | 2.0 | 3.0 | 3.0 | 5.0 | 4.0 | 3.0 | 2.0 | 1.0 | 3.0 | 2.0 | 4.0 | 3.0 | 1.0 | 2.0 | 3.0 | 1.0 | 2.0 | 3.0 | 4.0 | 3.0 | 3.0 | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 1.0 | 3.0 | 1.0 | 2.0 | 2.0 | 1.0 | 3.0 | 4.0 | 2.0 | 3.0 | 6.0 |
Some hitters may have also pitched, so we’ll need to filter out those appearances.
join.G_p
represents the numbers of games a player played as a pitcher and we select
players who played mostly as pitchers, or pitched more than 100 games.
pitchers = join[join.G_p > 100]
Group the pitchers
based on a condition
.
condition = [pitchers.inducted, pitchers.yearID]
And count them annually.
grouped_pitchers = pitchers.groupby(condition).count()
df = grouped_pitchers.execute().set_index('inducted yearID'.split())
What next?¶
That's it! In future posts, we’ll explore more backends and visualize more ibis objects. If you’d like to contribute to ibis, please take a look at OpenTeams or the ibis contributing guide.
Comments
Comments powered by Disqus