Additional Analytics Tools¶
Setup¶
In [1]:
import ibis
import os
hdfs_port = os.environ.get('IBIS_WEBHDFS_PORT', 50070)
hdfs = ibis.hdfs_connect(host='quickstart.cloudera', port=hdfs_port)
con = ibis.impala.connect(host='quickstart.cloudera', database='ibis_testing',
                          hdfs_client=hdfs)
ibis.options.interactive = True
Frequency tables¶
Ibis provides the value_counts
API, just like pandas, for computing
a frequency table of a table column or array expression. You may already
have seen it used earlier in the tutorial.
In [2]:
lineitem = con.table('tpch_lineitem')
orders = con.table('tpch_orders')
items = (orders.join(lineitem, orders.o_orderkey == lineitem.l_orderkey)
         [lineitem, orders])
items.o_orderpriority.value_counts()
Out[2]:

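For readers coming from pandas, the same operation can be sketched there directly; this is an illustrative analogue with made-up priority values, not the tutorial's TPC-H data:

```python
import pandas as pd

# Made-up sample standing in for the o_orderpriority column
priorities = pd.Series(['1-URGENT', '3-MEDIUM', '1-URGENT',
                        '5-LOW', '3-MEDIUM', '1-URGENT'])

# pandas' value_counts produces the same kind of frequency table
counts = priorities.value_counts()
print(counts)
```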
This can be customized, of course:
In [3]:
freq = (items.group_by(items.o_orderpriority)
        .aggregate([items.count().name('nrows'),
                    items.l_extendedprice.sum().name('total $')]))
freq
Out[3]:

Binning and histograms¶
Numeric array expressions (columns with numeric type and other array
expressions) have bucket
and histogram
methods which produce
different kinds of binning. These produce category values (the computed
bins) that can be used in grouping and other analytics.
Let’s have a look at a few examples.
I’ll use the summary
function to see the general distribution of
lineitem prices in the order data joined above:
In [4]:
items.l_extendedprice.summary()
Out[4]:

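If you know pandas, summary() plays a role similar to Series.describe(); here is a sketch of that parallel using a handful of made-up prices (not the actual lineitem data):

```python
import pandas as pd

# Made-up prices standing in for l_extendedprice
prices = pd.Series([901.0, 5000.5, 12000.0, 75000.25, 104000.0])

# describe() gives count, mean, std, min, quartiles, and max
stats = prices.describe()
print(stats)
```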
Alright then, now suppose we want to split the item prices up into some buckets of our choosing:
In [5]:
buckets = [0, 5000, 10000, 50000, 100000]
The bucket
function creates a bucketed category from the prices:
In [6]:
bucketed = items.l_extendedprice.bucket(buckets).name('bucket')
Let’s have a look at the value counts:
In [7]:
bucketed.value_counts()
Out[7]:

The buckets we wrote down define 4 buckets numbered 0 through 3. The
NaN is a pandas NULL value (since that’s how pandas represents nulls
in numeric arrays), so don’t worry too much about that. Since the
bucketing ends at 100000, we see there are 4122 values that are over
100000. These can be included in the bucketing with include_over:
In [8]:
bucketed = (items.l_extendedprice
            .bucket(buckets, include_over=True)
            .name('bucket'))
bucketed.value_counts()
Out[8]:

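The same bucketing idea can be sketched in pandas with pd.cut; this is an illustration with made-up prices, not the ibis implementation. Values above the last edge come back as NaN unless the edges are extended upward, which mirrors the effect of include_over:

```python
import numpy as np
import pandas as pd

buckets = [0, 5000, 10000, 50000, 100000]
prices = pd.Series([1200.0, 7500.0, 30000.0, 95000.0, 150000.0])

# Without an upper edge, 150000.0 falls outside every bin and becomes NaN
binned = pd.cut(prices, bins=buckets)
print(binned.isna().sum())

# Adding an open-ended top bin mirrors include_over=True
binned_over = pd.cut(prices, bins=buckets + [np.inf])
print(binned_over.isna().sum())
```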
The bucketed
object here has a special *category* type:
In [9]:
bucketed.type()
Out[9]:
Category(cardinality=5)
Category values can have either a known or unknown *cardinality*. In
this case, there are either 4 or 5 buckets, depending on how we used the
bucket
function.
Labels can be assigned to the buckets at any time using the label
function:
In [10]:
bucket_counts = bucketed.value_counts()
labeled_bucket = (bucket_counts.bucket
                  .label(['0 to 5000', '5000 to 10000', '10000 to 50000',
                          '50000 to 100000', 'Over 100000'])
                  .name('bucket_name'))
expr = (bucket_counts[labeled_bucket, bucket_counts]
        .sort_by('bucket'))
expr
Out[10]:

Nice, huh?
histogram
is a linear (fixed-size bin) equivalent:
In [11]:
t = con.table('functional_alltypes')
d = t.double_col
tier = d.histogram(10).name('hist_bin')
expr = (t.group_by(tier)
        .aggregate([d.min(), d.max(), t.count()])
        .sort_by('hist_bin'))
expr
Out[11]:

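The histogram(10) call above partitions the column's value range into 10 equal-width bins. A rough pandas sketch of the same idea, using made-up data rather than ibis internals:

```python
import pandas as pd

values = pd.Series(range(100))  # 0..99

# Ten equal-width bins with integer bin codes, like d.histogram(10)
tier = pd.cut(values, bins=10, labels=False).rename('hist_bin')

# min, max, and count per bin, like the grouped aggregate above
summary = values.groupby(tier).agg(['min', 'max', 'count'])
print(summary)
```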
Filtering in aggregations¶
Suppose that you want to compute an aggregation with a subset of the
data for only one of the metrics / aggregates in question, and the
complete data set with the other aggregates. To support this, most
aggregation functions accept a where
argument. Let me show it to you in
action:
In [12]:
t = con.table('functional_alltypes')
d = t.double_col
s = t.string_col
cond = s.isin(['3', '5', '7'])
metrics = [t.count().name('# rows total'),
           cond.sum().name('# selected'),
           d.sum().name('total'),
           d.sum(where=cond).name('selected total')]
color = (t.float_col
         .between(3, 7)
         .ifelse('red', 'blue')
         .name('color'))
t.group_by(color).aggregate(metrics)
Out[12]:
  color  # rows total  # selected     total  selected total
0  blue          4380         730  199071.0         51611.0
1   red          2920        1460  132714.0         58984.0
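The where= pattern corresponds to masking the data before aggregating; here is a pandas sketch of the same idea using a small made-up frame, not the tutorial's functional_alltypes table:

```python
import pandas as pd

df = pd.DataFrame({
    'string_col': ['1', '3', '5', '7', '9'],
    'double_col': [10.0, 20.0, 30.0, 40.0, 50.0],
})

cond = df['string_col'].isin(['3', '5', '7'])

metrics = {
    'n_rows_total': len(df),
    'n_selected': int(cond.sum()),                   # booleans sum to a count
    'total': df['double_col'].sum(),                 # unfiltered aggregate
    'selected_total': df['double_col'][cond].sum(),  # like d.sum(where=cond)
}
print(metrics)
```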