Using Ibis with Impala¶
One goal of Ibis is to provide an integrated Python API for an Impala cluster without requiring you to switch back and forth between Python code and the Impala shell (where one would be using a mix of DDL and SQL statements).
If you find an Impala task that you cannot perform with Ibis, please get in touch on the GitHub issue tracker.
While interoperability between the Hadoop / Spark ecosystems and pandas / the PyData stack is overall poor (but improving), we also show some ways that you can use pandas with Ibis and Impala.
The Impala client object¶
To use Ibis with Impala, you first must connect to a cluster using the
ibis.impala.connect function, optionally supplying an HDFS connection:
import ibis
hdfs = ibis.hdfs_connect(host=webhdfs_host, port=webhdfs_port)
client = ibis.impala.connect(host=impala_host, port=impala_port, hdfs_client=hdfs)
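In practice the host and port values often come from the environment rather than being hard-coded. The following is a minimal sketch of that idea. The environment variable names (IMPALA_HOST, IMPALA_PORT, WEBHDFS_HOST, WEBHDFS_PORT) and both helper functions are hypothetical, and the fallback ports 21050 and 50070 are the usual Impala HiveServer2 and Hadoop 2 WebHDFS defaults, which may differ on your cluster. Only ibis.hdfs_connect and ibis.impala.connect, with the arguments shown above, come from this guide.

```python
import os


def connection_params(env=None):
    """Collect Impala and WebHDFS connection settings from a mapping.

    The variable names are hypothetical; adapt them to your deployment.
    """
    env = os.environ if env is None else env
    return {
        "impala_host": env.get("IMPALA_HOST", "localhost"),
        "impala_port": int(env.get("IMPALA_PORT", "21050")),
        "webhdfs_host": env.get("WEBHDFS_HOST", "localhost"),
        "webhdfs_port": int(env.get("WEBHDFS_PORT", "50070")),
    }


def connect_from_env(env=None):
    """Build the client with the same two calls as above, using env settings."""
    import ibis  # imported lazily so connection_params works without Ibis installed

    params = connection_params(env)
    hdfs = ibis.hdfs_connect(
        host=params["webhdfs_host"], port=params["webhdfs_port"]
    )
    return ibis.impala.connect(
        host=params["impala_host"],
        port=params["impala_port"],
        hdfs_client=hdfs,
    )
```

Keeping the parameter lookup separate from the connection call makes the settings easy to inspect or test without a running cluster.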
You can accomplish many tasks directly through the client object, but we additionally provide APIs to streamline tasks involving a single Impala table or database.
If you’re doing analytics on a single table, you can get going by using the
table method on the client:
table = client.table(table_name, database=db_name)
Database and Table objects¶
client.database: Create a Database object for a given database name that can be used for exploring and manipulating the objects (tables, functions, views, etc.) inside
client.table: Create a table expression that references a particular table in the database
The client's table method allows you to create an Ibis table expression
referencing a physical Impala table:
In : table = client.table('functional_alltypes', database='ibis_testing')
While you can get by fine with only table and client objects, Ibis has a notion of a “database object” that simplifies interactions with a single Impala database. It also gives you IPython tab completion of table names, provided they are valid Python variable names:
In : db = client.database('ibis_testing')
In : db
Out: ImpalaDatabase('ibis_testing')
In : table = db.functional_alltypes
In : db.list_tables()
Out:
['alltypes',
 'functional_alltypes',
 'tpch_customer',
 'tpch_lineitem',
 'tpch_nation',
 'tpch_orders',
 'tpch_part',
 'tpch_partsupp',
 'tpch_region',
 'tpch_region_avro',
 'tpch_supplier']
So, given db = client.database(db_name), these two lines of code are equivalent:
table1 = client.table(table_name, database=db_name)
table2 = db.table(table_name)
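The attribute-style access and tab completion described above can be pictured with Python's __getattr__ and __dir__ hooks. The class below is a hypothetical stand-in written for illustration, not the Ibis implementation: it returns a qualified name string where a real database object would return a table expression.

```python
class DatabaseSketch:
    """Stand-in for a database object; not the Ibis implementation."""

    def __init__(self, name, table_names):
        self.name = name
        self._table_names = list(table_names)

    def list_tables(self):
        return sorted(self._table_names)

    def table(self, table_name):
        if table_name not in self._table_names:
            raise KeyError(table_name)
        # A real database object would return a table expression here.
        return f"{self.name}.{table_name}"

    def __getattr__(self, attr):
        # Called only when normal attribute lookup fails, so real
        # attributes and methods always take precedence over table names.
        if attr.startswith("_"):
            raise AttributeError(attr)
        try:
            return self.table(attr)
        except KeyError:
            raise AttributeError(attr) from None

    def __dir__(self):
        # Advertise only table names that are valid Python identifiers,
        # since only those can be reached with attribute syntax.
        names = [t for t in self._table_names if t.isidentifier()]
        return sorted(set(super().__dir__()) | set(names))
```

With this sketch, db.functional_alltypes and db.table('functional_alltypes') return the same value, while a table named '2015-logs' stays reachable only through db.table because it is not a valid identifier.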
ImpalaTable is a Python subclass of the more general Ibis TableExpr
that has additional Impala-specific methods. So you can use it interchangeably
with any code expecting a TableExpr.
Like all table expressions in Ibis,
ImpalaTable has a
schema method you
can use to examine its schema:
ImpalaTable.schema: Get the schema for this table (if one is known)
While the client has a
drop_table method you can use to drop tables, the
table itself has a method
drop that you can use:
table.drop()
Expression execution and asynchronous queries¶
Ibis expressions have an
execute method which compiles and runs the
expressions on Impala or whichever backend is being referenced.
In : fa = db.functional_alltypes
In : expr = fa.double_col.sum()
In : expr.execute()
Out: 331785.00000000006
For longer-running queries, if you press Control-C (or whatever triggers the
KeyboardInterrupt on your system), Ibis will attempt to cancel the
query in progress.
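The cancel-on-interrupt behavior can be pictured as a polling loop wrapped in a KeyboardInterrupt handler. This is a sketch of the general pattern, not Ibis's internal code: cancel() is an assumed method name, and InterruptedQuery is a hypothetical stand-in that simulates Control-C so the example runs without a cluster.

```python
import time


def wait_or_cancel(query, poll_interval=0.1):
    """Block until `query` finishes; cancel it if the user interrupts.

    `query` is any object with is_finished(), get_result() and cancel();
    the cancel() name is an assumption of this sketch.
    """
    try:
        while not query.is_finished():
            time.sleep(poll_interval)
    except KeyboardInterrupt:
        query.cancel()  # best effort: ask the backend to stop the work
        raise  # re-raise so the caller still sees the interrupt
    return query.get_result()


class InterruptedQuery:
    """Stand-in that simulates Control-C arriving on the second poll."""

    def __init__(self):
        self.polls = 0
        self.cancelled = False

    def is_finished(self):
        self.polls += 1
        if self.polls == 2:
            raise KeyboardInterrupt
        return False

    def get_result(self):
        raise RuntimeError("query was cancelled")

    def cancel(self):
        self.cancelled = True
```

Re-raising after the cancel request keeps the usual Control-C semantics: the caller's code stops, and the backend is told to stop as well.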
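As of Ibis 0.5.0, there is an explicit asynchronous API (async has been a reserved keyword since Python 3.7, so the call below is valid syntax only on earlier Python versions):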
In : query = expr.execute(async=True)
With the returned
AsyncQuery object, you have various methods available to
check on the status of the executing expression:
In : import time
In : while not query.is_finished():
....:     time.sleep(1)
....:
In : query.is_finished()
Out: True
In : query.get_result()
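A bare polling loop like the one above never gives up on a stuck query. The following is a minimal sketch of a polling helper with a timeout, assuming only that the query handle exposes is_finished() and get_result() as shown above. The helper wait_for and the ThreadedQuery stand-in are hypothetical names introduced for illustration; ThreadedQuery runs a plain Python function on a worker thread so the example works without a cluster.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def wait_for(query, timeout=10.0, poll_interval=0.05):
    """Poll `query` until it finishes or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while not query.is_finished():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"query not finished after {timeout} seconds")
        time.sleep(poll_interval)
    return query.get_result()


class ThreadedQuery:
    """Stand-in for an asynchronous query handle, backed by a worker thread."""

    def __init__(self, func, *args):
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._future = self._pool.submit(func, *args)
        # Accept no further work; the submitted job keeps running.
        self._pool.shutdown(wait=False)

    def is_finished(self):
        return self._future.done()

    def get_result(self):
        return self._future.result()
```

Using time.monotonic for the deadline avoids surprises if the system clock is adjusted while the query is running.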