Basics: Aggregation, filtering, limits

Now that you have ibis installed and connecting to your Impala, let’s get our feet wet.

In [1]:
import ibis
import os
hdfs_port = os.environ.get('IBIS_WEBHDFS_PORT', 50070)
hdfs = ibis.hdfs_connect(host='quickstart.cloudera', port=hdfs_port)
con = ibis.impala.connect(host='quickstart.cloudera', database='ibis_testing',
                          hdfs_client=hdfs)

Accessing an Impala table

The optional database paramter allows to fully qualify the table.

In [2]:
table = con.table('functional_alltypes')
table = con.table('functional_alltypes', database='ibis_testing')

Column selection and basic expressions

Selecting columns from a table works just like pandas

In [3]:
col = table.double_col

# alternately
col2 = table['bigint_col']

Table columns are equipped with a variety of math operations and other methods to assist in writing your analytics. For example:

In [4]:
expr = col.log2() - 1

Some operations transform arrays to arrays, while others aggregate, like sum and mean:

In [5]:
expr2 = expr.sum()

The methods that are available on columns depend on the type of the column. For example, you won’t see the substr or upper methods for strings on numeric columns:

In [6]:
substr_expr = table.string_col.upper().substr(0, 2)

Notice that printing the expressions to the console does not actually do anything, and shows a graphical representation of the expression you’ve built.

Note: don’t worry too much about the details of the expression tree that’s outputted, it’s very likely to change over time

In [7]:
expr2
Out[7]:
../../_images/notebooks_tutorial_2-Basics-Aggregate-Filter-Limit_14_0.png

We can also execute an expression by calling execute on the Impala connection object

In [8]:
con.execute(col.sum())
Out[8]:
331785.00000000006

There’s a shortcut to make this a little more convenient in interactive use

Interactive mode

Many Ibis expressions can be immediately executed against the database, and it may improve your productivity to have the executed for you whenever you try to print the expression in the console / IPython notebook.

To do this, we have an interactive mode available, which can be turned on/off like so:

In [9]:
ibis.options.interactive = True

Now, any expressions you write will be executed right away

In [10]:
table.limit(10)
Out[10]:
../../_images/notebooks_tutorial_2-Basics-Aggregate-Filter-Limit_21_0.png

You can select a row range with slicing syntax:

In [11]:
table.double_col.sum()
Out[11]:
../../_images/notebooks_tutorial_2-Basics-Aggregate-Filter-Limit_23_0.png

Don’t worry about the syntax here, but expressions resulting in tabular output will come back as a pandas DataFrame by default:

In [12]:
metrics = [table.double_col.sum().name('total')]
expr = table.group_by('string_col').aggregate(metrics)
expr
Out[12]:
../../_images/notebooks_tutorial_2-Basics-Aggregate-Filter-Limit_25_0.png

Verbose mode and logging

You may want to see or log all the Impala queries that Ibis executes under the hood. To do that, use the options.verbose option:

In [13]:
ibis.options.verbose = True

metrics = [table.double_col.sum().name('total')]
expr = table.group_by('string_col').aggregate(metrics)
expr
SELECT `string_col`, sum(`double_col`) AS `total`
FROM ibis_testing.`functional_alltypes`
GROUP BY 1
LIMIT 10000
Out[13]:
../../_images/notebooks_tutorial_2-Basics-Aggregate-Filter-Limit_27_1.png
In [14]:
queries = []
def logger(x):
    queries.append(x)

ibis.options.verbose_log = logger

expr.execute()
expr.execute()

queries
Out[14]:
['SELECT `string_col`, sum(`double_col`) AS `total`\nFROM ibis_testing.`functional_alltypes`\nGROUP BY 1\nLIMIT 10000',
 'SELECT `string_col`, sum(`double_col`) AS `total`\nFROM ibis_testing.`functional_alltypes`\nGROUP BY 1\nLIMIT 10000']
In [15]:
from __future__ import print_function
ibis.options.verbose_log = lambda x: print(x)
ibis.options.verbose = False

Aggregation basics

One of the essential table API functions is aggregate. Aggregation involves the following

  • One or more named aggregate expressions, or reductions
  • Zero or more grouping expressions or column names

This ends up working very similarly to pandas’s groupby mechanism.

Let’s start with a simple reduction:

In [16]:
metric = table.double_col.sum()

As you saw above, you can execute this immediately and obtain a value:

In [17]:
metric
Out[17]:
../../_images/notebooks_tutorial_2-Basics-Aggregate-Filter-Limit_33_0.png

The reduced column can be more complex; for example, you could count the number of null values in a column like so:

In [18]:
table.double_col.isnull().sum()
Out[18]:
../../_images/notebooks_tutorial_2-Basics-Aggregate-Filter-Limit_35_0.png

To aggregate a table, potentially with grouping keys, we have to give the reduction a name and call aggregate

In [19]:
metric = metric.name('double_total')
expr = table.aggregate([metric])
result = con.execute(expr)
result
Out[19]:
double_total
0 331785.0

The result here is actually a pandas DataFrame with 1 row and just the one column. We can add another metric and add a grouping key

In [20]:
metric2 = (table.bigint_col + 1).log10().max().name('some_metric')

expr = table.aggregate([metric, metric2], by=['string_col'])
expr
Out[20]:
../../_images/notebooks_tutorial_2-Basics-Aggregate-Filter-Limit_39_0.png

We provide a convenience group_by, a la pandas, to make this a little more composable:

In [21]:
expr = (table.group_by('string_col')
        .aggregate([metric, metric2]))
expr
Out[21]:
../../_images/notebooks_tutorial_2-Basics-Aggregate-Filter-Limit_41_0.png

You can also group by named column expressions

In [22]:
keys = [table.timestamp_col.hour().name('hour'), 'string_col']

expr = table.group_by(keys).aggregate([metric])

# Top 10 by double_total, more on this later
expr.sort_by([('double_total', False)]).limit(10)
Out[22]:
../../_images/notebooks_tutorial_2-Basics-Aggregate-Filter-Limit_43_0.png

In most cases, an aggregation by itself can be evaluated:

In [23]:
table.double_col.mean()
Out[23]:
../../_images/notebooks_tutorial_2-Basics-Aggregate-Filter-Limit_45_0.png

This can also be done in simple cases along with group_by:

In [24]:
table.group_by('string_col').double_col.mean()
Out[24]:
../../_images/notebooks_tutorial_2-Basics-Aggregate-Filter-Limit_47_0.png

Many reduction functions have a default expression name, unlike many other Ibis expressions (for now!), to make some common analyses easier:

In [25]:
d = table.double_col

(table.group_by('string_col')
 .aggregate([d.sum(), d.mean(), d.min(), d.max()]))
Out[25]:
../../_images/notebooks_tutorial_2-Basics-Aggregate-Filter-Limit_49_0.png

Of course, for this particular case you can always use summary

In [26]:
table.group_by('string_col').double_col.summary()
Out[26]:
../../_images/notebooks_tutorial_2-Basics-Aggregate-Filter-Limit_51_0.png

Filtering

Filtering tables works similarly to pandas

In [27]:
table.aggregate([table.bigint_col.max().name('bigint_max'),
                 table.bigint_col.min().name('bigint_min'),
                 table.int_col.max().name('int_max'),
                 table.int_col.min().name('int_min')])
Out[27]:
../../_images/notebooks_tutorial_2-Basics-Aggregate-Filter-Limit_53_0.png
In [28]:
table.count()
Out[28]:
../../_images/notebooks_tutorial_2-Basics-Aggregate-Filter-Limit_54_0.png
In [29]:
table[table.bigint_col > 50].count()
Out[29]:
../../_images/notebooks_tutorial_2-Basics-Aggregate-Filter-Limit_55_0.png

Filters can be composed using & (and), | (or), and other logical array operators

In [30]:
cond1 = table.bigint_col > 50
cond2 = table.int_col.between(2, 7)
table[cond1 | cond2].count()
Out[30]:
../../_images/notebooks_tutorial_2-Basics-Aggregate-Filter-Limit_57_0.png

There’s a filter function that allows you to pass a list of conditions (that are all required to hold):

In [31]:
table.filter([cond1, cond2]).count()
Out[31]:
../../_images/notebooks_tutorial_2-Basics-Aggregate-Filter-Limit_59_0.png

Note this is the same as &-ing the boolean conditions yourself:

In [32]:
table[cond1 & cond2].count()
Out[32]:
../../_images/notebooks_tutorial_2-Basics-Aggregate-Filter-Limit_61_0.png

Limits

Like in SQL, the limit function selects the indicated numbers of rows from the beginning of a table’s result set

In [33]:
table.limit(2)
Out[33]:
../../_images/notebooks_tutorial_2-Basics-Aggregate-Filter-Limit_63_0.png