Advanced Topics: Additional Filtering

The filtering examples we’ve shown to this point have been pretty simple, either comparisons between columns or fixed values, or set filter functions like isin and notin.

Ibis supports a number of richer analytical filters that can involve one or more of:

  • Aggregates computed from the same or other tables
  • Conditional aggregates (in SQL-speak these are similar to “correlated subqueries”)
  • “Existence” set filters (equivalent to the SQL EXISTS and NOT EXISTS keywords)

Setup

In [1]:
import ibis
import os
hdfs_port = os.environ.get('IBIS_WEBHDFS_PORT', 50070)
hdfs = ibis.hdfs_connect(host='quickstart.cloudera', port=hdfs_port)
con = ibis.impala.connect(host='quickstart.cloudera', database='ibis_testing',
                          hdfs_client=hdfs)
ibis.options.interactive = True

Using scalar aggregates in filters

In [2]:
table = con.table('functional_alltypes')
table.limit(5)
Out[2]:
../../_images/notebooks_tutorial_7-Advanced-Topics-ComplexFiltering_4_0.png

We could always compute some aggregate value from the table and use that in another expression, or we can use a data-derived aggregate in the filter. Take the average of a column for example:

In [3]:
table.double_col.mean()
Out[3]:
../../_images/notebooks_tutorial_7-Advanced-Topics-ComplexFiltering_6_0.png

You can use this expression as a substitute for a scalar value in a filter, and the execution engine will combine everything into a single query rather than having to access Impala multiple times:

In [4]:
cond = table.bigint_col > table.double_col.mean()
expr = table[cond & table.bool_col].limit(5)
expr
Out[4]:
../../_images/notebooks_tutorial_7-Advanced-Topics-ComplexFiltering_8_0.png

Conditional aggregates

Suppose that we wish to filter using an aggregate computed conditional on some other expressions holding true. Using the TPC-H datasets, suppose that we want to filter customers based on the following criteria: Orders such that their amount exceeds the average amount for their sales region over the whole dataset. This can be computed any numbers of ways (such as joining auxiliary tables and filtering post-join)

Again, from prior examples, here are the joined up tables with all the customer data:

In [5]:
region = con.table('tpch_region')
nation = con.table('tpch_nation')
customer = con.table('tpch_customer')
orders = con.table('tpch_orders')

fields_of_interest = [customer,
                      region.r_name.name('region'),
                      orders.o_totalprice,
                      orders.o_orderdate.cast('timestamp').name('odate')]

tpch = (region.join(nation, region.r_regionkey == nation.n_regionkey)
        .join(customer, customer.c_nationkey == nation.n_nationkey)
        .join(orders, orders.o_custkey == customer.c_custkey)
        [fields_of_interest])

tpch.limit(5)
Out[5]:
   c_custkey              c_name                            c_address  \
0     133885  Customer#000133885                         EVK76gJeTspw
1      36934  Customer#000036934                0rsf17ITLCTceWGXzpe9n
2      41483  Customer#000041483                          LNzmnS7zhAH
3     148102  Customer#000148102           sqd6wCIoHZ6qMdVveUERJraOO1
4      19864  Customer#000019864  juCsfGD12 ySsWfOrSp5TQyHbxmJMuYKp3k

   c_nationkey          c_phone c_acctbal c_mktsegment  \
0           24  34-956-563-4308   6723.55   AUTOMOBILE
1           10  20-879-918-2113   5928.79    FURNITURE
2            6  16-613-774-2621   6913.32    HOUSEHOLD
3            6  16-318-581-1666   8906.57     BUILDING
4           12  22-890-475-8535   4600.19    HOUSEHOLD

                                           c_comment       region  \
0  mptotes use blithely furiously regular escapad...      AMERICA
1  usual dependencies serve blithely deposits. ir...  MIDDLE EAST
2  structions. blithely final pearls boost around...       EUROPE
3  eas. deposits integrate along the slyly ironic...       EUROPE
4  careful braids. unusual, silent requests after...         ASIA

  o_totalprice      odate
0    160843.35 1992-06-22
1     78307.91 1996-04-19
2    103237.90 1994-10-12
3    201463.59 1997-09-12
4    166098.86 1995-09-12

In this particular case, filtering based on the conditional average o_totalprice by region requires creating a table view (similar to the self-join examples from earlier) that can be treated as a distinct table entity in the expression. This would not be required if we were computing a conditional statistic from some other table. So this is a little more complicated than some other cases would be:

In [6]:
t2 = tpch.view()
conditional_avg = t2[(t2.region == tpch.region)].o_totalprice.mean()

Once you’ve done this, you can use the conditional average in a filter expression

In [7]:
amount_filter = tpch.o_totalprice > conditional_avg
tpch[amount_filter].limit(10)
Out[7]:
   c_custkey              c_name                                c_address  \
0     133885  Customer#000133885                             EVK76gJeTspw
1     148102  Customer#000148102               sqd6wCIoHZ6qMdVveUERJraOO1
2      19864  Customer#000019864      juCsfGD12 ySsWfOrSp5TQyHbxmJMuYKp3k
3      57892  Customer#000057892  MxXEs3,3mbP0wLenKMByJ4bgQHN1cFMgzFXJE5i
4      86575  Customer#000086575                       zM8VpTc8mN8FH3hhSB
5      49127  Customer#000049127      7IWyxOa2IsnbukRaMT0VPHPWZx9 cgKjWDj
6      63037  Customer#000063037          YdZC4nk04tmsxK6XvoSEaoF2fF5H64J
7      96217  Customer#000096217  l2QpgaSPRJDXYO,HDUgD8aGfJajIem6eZpLGEBy
8     110077  Customer#000110077                  WYMxagzV8I4oZ70p,udJLpo
9      95935  Customer#000095935                               6N81QQGoEY

   c_nationkey          c_phone c_acctbal c_mktsegment  \
0           24  34-956-563-4308   6723.55   AUTOMOBILE
1            6  16-318-581-1666   8906.57     BUILDING
2           12  22-890-475-8535   4600.19    HOUSEHOLD
3            4  14-994-636-3663   -224.11    FURNITURE
4           14  24-285-938-5455   2829.32     BUILDING
5           18  28-188-101-1358    -43.40    HOUSEHOLD
6            1  11-233-858-3620   1899.70     BUILDING
7            8  18-247-268-4415   2648.95     BUILDING
8           13  23-554-808-3156   3028.96    FURNITURE
9           13  23-546-903-7783   6959.40    HOUSEHOLD

                                           c_comment       region  \
0  mptotes use blithely furiously regular escapad...      AMERICA
1  eas. deposits integrate along the slyly ironic...       EUROPE
2  careful braids. unusual, silent requests after...         ASIA
3   es around the silent theodolites wake pending id  MIDDLE EAST
4  en hockey players. even foxes grow slyly above...       AFRICA
5  ag permanently. carefully regular deposits caj...         ASIA
6  are doggedly even deposits. slyly slow ideas h...      AMERICA
7        uickly ironic excuses. blithely careful dep         ASIA
8   pending platelets against the slyly silent de...  MIDDLE EAST
9  ithely brave packages haggle furiously after t...  MIDDLE EAST

  o_totalprice      odate
0    160843.35 1992-06-22
1    201463.59 1997-09-12
2    166098.86 1995-09-12
3    306441.92 1996-01-03
4    158056.07 1993-11-27
5    279866.40 1992-09-04
6    294134.21 1998-01-27
7    228231.98 1992-01-03
8    229979.86 1998-05-03
9    164732.95 1992-10-13

By looking at the table sizes before and after applying the filter you can see the relative size of the subset taken.

In [8]:
tpch.count()
Out[8]:
1500000
In [9]:
tpch[amount_filter].count()
Out[9]:
711969

Or even group by year and compare before and after:

In [10]:
tpch.schema()
Out[10]:
ibis.Schema {
  c_custkey     int64
  c_name        string
  c_address     string
  c_nationkey   int32
  c_phone       string
  c_acctbal     decimal(12, 2)
  c_mktsegment  string
  c_comment     string
  region        string
  o_totalprice  decimal(12, 2)
  odate         timestamp
}
In [11]:
year = tpch.odate.year().name('year')

pre_sizes = tpch.group_by(year).size()
post_sizes = tpch[amount_filter].group_by(year).size().view()

percent = ((post_sizes['count'] / pre_sizes['count'].cast('double'))
           .name('fraction'))

expr = (pre_sizes.join(post_sizes, pre_sizes.year == post_sizes.year)
        [pre_sizes.year,
         pre_sizes['count'].name('pre_count'),
         post_sizes['count'].name('post_count'),
         percent])
expr
Out[11]:
   year  pre_count  post_count  fraction
0  1994     227597      108087  0.474905
1  1996     228626      108757  0.475698
2  1992     227089      107815  0.474770
3  1998     133623       63551  0.475599
4  1993     226645      107703  0.475206
5  1995     228637      108315  0.473742
6  1997     227783      107741  0.472998

“Existence” filters

Some filtering involves checking for the existence of a particular value in a column of another table, or amount the results of some value expression. This is common in many-to-many relationships, and can be performed in numerous different ways, but it’s nice to be able to express it with a single concise statement and let Ibis compute it optimally.

Here’s some examples:

  • Filter down to customers having at least one open order
  • Find customers having no open orders with 1-URGENT status
  • Find stores (in the stores table) having the same name as a vendor (in the vendors table).

We’ll go ahead and solve the first couple of these problems using the TPC-H tables to illustrate the API:

In [12]:
customer = con.table('tpch_customer')
orders = con.table('tpch_orders')
In [13]:
orders.limit(5)
Out[13]:
../../_images/notebooks_tutorial_7-Advanced-Topics-ComplexFiltering_23_0.png

We introduce the any reduction:

In [14]:
has_open_orders = ((orders.o_orderstatus == 'O') &
                   (customer.c_custkey == orders.o_custkey)).any()

This is now a valid filter:

In [15]:
customer[has_open_orders].limit(10)
Out[15]:
../../_images/notebooks_tutorial_7-Advanced-Topics-ComplexFiltering_27_0.png

For the second example, in which we want to find customers not having any open urgent orders, we write down the condition that they do have some first:

In [16]:
has_open_urgent_orders = ((orders.o_orderstatus == 'O') &
                          (orders.o_orderpriority == '1-URGENT') &
                          (customer.c_custkey == orders.o_custkey)).any()

Now, we can negate this condition and use it as a filter:

In [17]:
customer[-has_open_urgent_orders].count()
Out[17]:
../../_images/notebooks_tutorial_7-Advanced-Topics-ComplexFiltering_31_0.png

In this case, it is true that customer.c_custkey has no duplicate values, but that need not be the case. There could be multiple copies of any given value in either table column being compared, and the behavior will be the same (existence or non-existence is verified).