# Basics: Aggregation, filtering, limits¶

Now that you have ibis installed and connecting to your Impala, let’s get our feet wet.

import ibis
import os
hdfs_port = os.environ.get('IBIS_WEBHDFS_PORT', 50070)
hdfs = ibis.hdfs_connect(host='quickstart.cloudera', port=hdfs_port)
con = ibis.impala.connect(host='quickstart.cloudera', database='ibis_testing',
hdfs_client=hdfs)


## Accessing an Impala table¶

The optional database paramter allows to fully qualify the table.

table = con.table('functional_alltypes')
table = con.table('functional_alltypes', database='ibis_testing')


## Column selection and basic expressions¶

Selecting columns from a table works just like pandas

col = table.double_col

# alternately
col2 = table['bigint_col']


Table columns are equipped with a variety of math operations and other methods to assist in writing your analytics. For example:

expr = col.log2() - 1


Some operations transform arrays to arrays, while others aggregate, like sum and mean:

expr2 = expr.sum()


The methods that are available on columns depend on the type of the column. For example, you won’t see the substr or upper methods for strings on numeric columns:

substr_expr = table.string_col.upper().substr(0, 2)


Notice that printing the expressions to the console does not actually do anything, and shows a graphical representation of the expression you’ve built.

Note: don’t worry too much about the details of the expression tree that’s outputted, it’s very likely to change over time

expr2

ref_0
DatabaseTable[table]
name: ibis_testing.functional_alltypes
schema:
id : int32
bool_col : boolean
tinyint_col : int8
smallint_col : int16
int_col : int32
bigint_col : int64
float_col : float
double_col : double
date_string_col : string
string_col : string
timestamp_col : timestamp
year : int32
month : int32

sum = Sum[double]
Subtract[array(double)]
Log2[array(double)]
double_col = Column[array(double)] 'double_col' from table ref_0
Literal[int8]
1
None


We can also execute an expression by calling execute on the Impala connection object

con.execute(col.sum())

331785.0


There’s a shortcut to make this a little more convenient in interactive use

## Interactive mode¶

Many Ibis expressions can be immediately executed against the database, and it may improve your productivity to have the executed for you whenever you try to print the expression in the console / IPython notebook.

To do this, we have an interactive mode available, which can be turned on/off like so:

ibis.options.interactive = True


Now, any expressions you write will be executed right away

table.limit(10)

    id bool_col  tinyint_col  smallint_col  int_col  bigint_col  float_col  0  900     True            0             0        0           0        0.0
1  901    False            1             1        1          10        1.1
2  902     True            2             2        2          20        2.2
3  903    False            3             3        3          30        3.3
4  904     True            4             4        4          40        4.4
5  905    False            5             5        5          50        5.5
6  906     True            6             6        6          60        6.6
7  907    False            7             7        7          70        7.7
8  908     True            8             8        8          80        8.8
9  909    False            9             9        9          90        9.9

double_col date_string_col string_col           timestamp_col  year  month
0         0.0        04/01/09          0 2009-04-01 00:00:00.000  2009      4
1        10.1        04/01/09          1 2009-04-01 00:01:00.000  2009      4
2        20.2        04/01/09          2 2009-04-01 00:02:00.100  2009      4
3        30.3        04/01/09          3 2009-04-01 00:03:00.300  2009      4
4        40.4        04/01/09          4 2009-04-01 00:04:00.600  2009      4
5        50.5        04/01/09          5 2009-04-01 00:05:00.100  2009      4
6        60.6        04/01/09          6 2009-04-01 00:06:00.150  2009      4
7        70.7        04/01/09          7 2009-04-01 00:07:00.210  2009      4
8        80.8        04/01/09          8 2009-04-01 00:08:00.280  2009      4
9        90.9        04/01/09          9 2009-04-01 00:09:00.360  2009      4


You can select a row range with slicing syntax:

table.double_col.sum()

331785.0


Don’t worry about the syntax here, but expressions resulting in tabular output will come back as a pandas DataFrame by default:

metrics = [table.double_col.sum().name('total')]
expr = table.group_by('string_col').aggregate(metrics)
expr

  string_col    total
0          6  44238.0
1          1   7373.0
2          4  29492.0
3          8  58984.0
4          7  51611.0
5          9  66357.0
6          2  14746.0
7          5  36865.0
8          3  22119.0
9          0      0.0


## Verbose mode and logging¶

You may want to see or log all the Impala queries that Ibis executes under the hood. To do that, use the options.verbose option:

ibis.options.verbose = True

metrics = [table.double_col.sum().name('total')]
expr = table.group_by('string_col').aggregate(metrics)
expr

SELECT string_col, sum(double_col) AS total
FROM ibis_testing.functional_alltypes
GROUP BY 1
LIMIT 10000

  string_col    total
0          6  44238.0
1          1   7373.0
2          4  29492.0
3          8  58984.0
4          7  51611.0
5          2  14746.0
6          9  66357.0
7          5  36865.0
8          3  22119.0
9          0      0.0

queries = []
def logger(x):
queries.append(x)

ibis.options.verbose_log = logger

expr.execute()
expr.execute()

queries

['SELECT string_col, sum(double_col) AS totalnFROM ibis_testing.functional_alltypesnGROUP BY 1nLIMIT 10000',
'SELECT string_col, sum(double_col) AS totalnFROM ibis_testing.functional_alltypesnGROUP BY 1nLIMIT 10000']

from __future__ import print_function
ibis.options.verbose_log = lambda x: print(x)
ibis.options.verbose = False


## Aggregation basics¶

One of the essential table API functions is aggregate. Aggregation involves the following

• One or more named aggregate expressions, or reductions
• Zero or more grouping expressions or column names

This ends up working very similarly to pandas’s groupby mechanism.

metric = table.double_col.sum()


As you saw above, you can execute this immediately and obtain a value:

metric

331785.0


The reduced column can be more complex; for example, you could count the number of null values in a column like so:

table.double_col.isnull().sum()

0


To aggregate a table, potentially with grouping keys, we have to give the reduction a name and call aggregate

metric = metric.name('double_total')
expr = table.aggregate([metric])
result = con.execute(expr)
result

double_total
0 331785.0

The result here is actually a pandas DataFrame with 1 row and just the one column. We can add another metric and add a grouping key

metric2 = (table.bigint_col + 1).log10().max().name('some_metric')

expr = table.aggregate([metric, metric2], by=['string_col'])
expr

  string_col  double_total  some_metric
0          6       44238.0     1.785330
1          4       29492.0     1.612784
2          7       51611.0     1.851258
3          1        7373.0     1.041393
4          8       58984.0     1.908485
5          2       14746.0     1.322219
6          5       36865.0     1.707570
7          0           0.0     0.000000
8          9       66357.0     1.959041
9          3       22119.0     1.491362


We provide a convenience group_by, a la pandas, to make this a little more composable:

expr = (table.group_by('string_col')
.aggregate([metric, metric2]))
expr

  string_col  double_total  some_metric
0          6       44238.0     1.785330
1          4       29492.0     1.612784
2          7       51611.0     1.851258
3          1        7373.0     1.041393
4          8       58984.0     1.908485
5          2       14746.0     1.322219
6          9       66357.0     1.959041
7          5       36865.0     1.707570
8          0           0.0     0.000000
9          3       22119.0     1.491362


You can also group by named column expressions

keys = [table.timestamp_col.hour().name('hour'), 'string_col']

expr = table.group_by(keys).aggregate([metric])

# Top 10 by double_total, more on this later
expr.sort_by([('double_total', False)]).limit(10)

   hour string_col  double_total
0     1          9       13544.1
1     2          9       13180.5
2     0          9       13089.6
3     3          9       13089.6
4     4          9       12271.5
5     1          8       12039.2
6     2          8       11716.0
7     0          8       11635.2
8     3          8       11635.2
9     4          8       10908.0


In most cases, an aggregation by itself can be evaluated:

table.double_col.mean()

45.450000000000003


This can also be done in simple cases along with group_by:

table.group_by('string_col').double_col.mean()

  string_col  mean(double_col)
0          1              10.1
1          6              60.6
2          8              80.8
3          4              40.4
4          2              20.2
5          9              90.9
6          7              70.7
7          5              50.5
8          3              30.3
9          0               0.0


Many reduction functions have a default expression name, unlike many other Ibis expressions (for now!), to make some common analyses easier:

d = table.double_col

(table.group_by('string_col')
.aggregate([d.sum(), d.mean(), d.min(), d.max()]))

  string_col      sum  mean   min   max
0          1   7373.0  10.1  10.1  10.1
1          8  58984.0  80.8  80.8  80.8
2          2  14746.0  20.2  20.2  20.2
3          5  36865.0  50.5  50.5  50.5
4          0      0.0   0.0   0.0   0.0
5          6  44238.0  60.6  60.6  60.6
6          4  29492.0  40.4  40.4  40.4
7          7  51611.0  70.7  70.7  70.7
8          9  66357.0  90.9  90.9  90.9
9          3  22119.0  30.3  30.3  30.3


Of course, for this particular case you can always use summary

table.group_by('string_col').double_col.summary()

  string_col  count  nulls   min   max      sum  mean  approx_nunique
0          6    730      0  60.6  60.6  44238.0  60.6               1
1          1    730      0  10.1  10.1   7373.0  10.1               1
2          8    730      0  80.8  80.8  58984.0  80.8               1
3          9    730      0  90.9  90.9  66357.0  90.9               1
4          2    730      0  20.2  20.2  14746.0  20.2               1
5          3    730      0  30.3  30.3  22119.0  30.3               1
6          5    730      0  50.5  50.5  36865.0  50.5               1
7          4    730      0  40.4  40.4  29492.0  40.4               1
8          7    730      0  70.7  70.7  51611.0  70.7               1
9          0    730      0   0.0   0.0      0.0   0.0               1


## Filtering¶

Filtering tables works similarly to pandas

table.aggregate([table.bigint_col.max().name('bigint_max'),
table.bigint_col.min().name('bigint_min'),
table.int_col.max().name('int_max'),
table.int_col.min().name('int_min')])

   bigint_max  bigint_min  int_max  int_min
0          90           0        9        0

table.count()

7300

table[table.bigint_col > 50].count()

2920


Filters can be composed using & (and), | (or), and other logical array operators

cond1 = table.bigint_col > 50
cond2 = table.int_col.between(2, 7)
table[cond1 | cond2].count()

5840


There’s a filter function that allows you to pass a list of conditions (that are all required to hold):

table.filter([cond1, cond2]).count()

1460


Note this is the same as &-ing the boolean conditions yourself:

table[cond1 & cond2].count()

1460


## Limits¶

Like in SQL, the limit function selects the indicated numbers of rows from the beginning of a table’s result set

table.limit(2)

     id bool_col  tinyint_col  smallint_col  int_col  bigint_col  float_col  0  1510     True            0             0        0           0        0.0
1  1511    False            1             1        1          10        1.1

double_col date_string_col string_col       timestamp_col  year  month
0         0.0        06/01/09          0 2009-06-01 00:00:00  2009      6
1        10.1        06/01/09          1 2009-06-01 00:01:00  2009      6