Design

Primary Goals

  1. Type safety
  2. Expressiveness
  3. Composability
  4. Familiarity

Flow of Execution

  1. The user writes an expression
  2. Each method or function call builds a new expression
  3. Expressions are type checked as they are created
  4. Some optimizations are applied to expressions as the user builds them
  5. Backend-specific rewrites are applied
  6. Expressions are compiled
  7. The SQL string generated by the compiler is sent to the database and executed (this step is skipped for the pandas backend)
  8. The database returns some data that is then turned into a pandas DataFrame by ibis

Expressions

Expressions are the main user-facing component of ibis. The base class of all expressions in ibis is the Expr class.

Expressions provide the user-facing API, which is defined in ibis/expr/api.py.

Type System

Ibis’s type system consists of a set of rules for specifying the types of inputs to Node subclasses. Upon construction of a Node subclass, ibis performs validation of every input to the node based on the rule that was used to declare the input.

Rules are defined in ibis/expr/rules.py.
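The idea of validating every input at construction time can be sketched without ibis itself. The following is a minimal toy model, not the code in ibis/expr/rules.py: the Rule class, its validate method, the Node base class shown here, and the Repeat operation are all hypothetical names chosen for illustration.

```python
class Rule:
    """Hypothetical stand-in for an ibis rule: checks the type of one input."""

    def __init__(self, name, accepted_type):
        self.name = name
        self.accepted_type = accepted_type

    def validate(self, value):
        if not isinstance(value, self.accepted_type):
            raise TypeError(
                f"argument {self.name!r} expected {self.accepted_type.__name__}, "
                f"got {type(value).__name__}"
            )
        return value


class Node:
    """Toy node base class: validates each input against its declared rule."""

    input_type = []  # one Rule per positional input

    def __init__(self, *args):
        if len(args) != len(self.input_type):
            raise TypeError(f"expected {len(self.input_type)} arguments")
        # Validation happens here, when the node is constructed.
        self.args = tuple(
            rule.validate(arg) for rule, arg in zip(self.input_type, args)
        )


class Repeat(Node):
    # Hypothetical operation declared purely in terms of rules.
    input_type = [Rule("string", str), Rule("times", int)]


ok = Repeat("ab", 3)  # both inputs satisfy their rules

try:
    Repeat("ab", "3")  # the second input violates its rule
except TypeError as exc:
    error = str(exc)
```

In this sketch an invalid input is rejected as soon as the node is built, which mirrors the "type checked as you create them" step in the flow of execution.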

The Expr Class

Expressions are a thin but important abstraction over operations, containing only type information and shape information, i.e., whether they are tables, columns, or scalars.

Examples of expressions include Int64Column, StringScalar, and TableExpr.

Here’s an example of each type of expression:

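import ibis
t = ibis.table([('a', 'int64')])  # an unbound table with a single int64 column
int64_column = t.a  # a column expression
type(int64_column)
string_scalar = ibis.literal('some_string_value')  # a scalar expression
type(string_scalar)
table_expr = t.mutate(b=t.a + 1)  # a table expression with a new column b
type(table_expr)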

The Node Class

Node subclasses make up the core set of operations of ibis. Each node corresponds to a particular operation.

Most nodes are defined in the operations module.

Examples of nodes include Add and Sum.

Nodes have two important members (and often these are the only members defined):

  1. input_type: a list of rules
  2. output_type: a rule or method

The input_type member is a list of rules that defines the types of the inputs to the operation. This is sometimes called the signature.

The output_type member is a rule or a method that defines the output type of the operation. This is sometimes called the return type.

An example of input_type/output_type usage is the Log class:

class Log(Node):

    input_type = [
        rules.double(),
        rules.double(name='base', optional=True)
    ]
    output_type = rules.shape_like_arg(0, 'double')

This class describes an operation called Log that takes one required argument, a double scalar or column, and one optional argument named base, which is also a double scalar or column. The base argument defaults to None when it is not provided, so that the expression behaves as the underlying database does.

These objects are instantiated when you use ibis APIs:

import ibis
t = ibis.table([('a', 'double')])
log_1p = (1 + t.a).log()  # an Add and a Log are instantiated here
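To make the role of output_type more concrete, here is a small self-contained sketch of how a rule like shape_like_arg could behave: the output has a fixed type, and its shape is copied from one of the inputs. This is a toy model written for illustration, not ibis's implementation; Value, ToyLog, to_expr, and this version of shape_like_arg are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Value:
    """Toy expression: carries only a type name and a shape."""

    dtype: str
    shape: str  # "scalar" or "column"


def shape_like_arg(index, dtype):
    """Build an output rule: the given dtype, with the shape of args[index]."""

    def output_type(args):
        return Value(dtype=dtype, shape=args[index].shape)

    return output_type


class ToyLog:
    """Hypothetical analogue of the Log node shown above."""

    # staticmethod keeps the rule from being bound to the instance
    output_type = staticmethod(shape_like_arg(0, "double"))

    def __init__(self, arg: Value, base: Optional[Value] = None):
        # base stays None unless the caller supplies it
        for value in (arg, base):
            if value is not None and value.dtype != "double":
                raise TypeError("ToyLog expects double inputs")
        self.args = (arg, base)

    def to_expr(self):
        # The result type is derived from the inputs by the output rule.
        return self.output_type(self.args)


column_log = ToyLog(Value("double", "column")).to_expr()
scalar_log = ToyLog(
    Value("double", "scalar"), base=Value("double", "scalar")
).to_expr()
```

Here the logarithm of a column is a column and the logarithm of a scalar is a scalar, without the node having to spell out either case.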

Expressions vs Operations: Why are they different?

Separating expressions from their underlying operations makes it easy to generically describe and validate the inputs to particular nodes. In the log example, it doesn’t matter which operation (node) the double-valued arguments come from; they only need to satisfy the requirement denoted by the rule.

Separation of the Node and Expr classes also allows the API to be tied to the physical type of the expression rather than the particular operation, making it easy to define the API in terms of types rather than specific operations.

Furthermore, operations often have an output type that depends on the input type. An example of this is the greatest function, which takes the maximum of all of its arguments. Another example is CASE statements, whose THEN expressions determine the output type of the expression.

This allows ibis to provide only the APIs that make sense for a particular type, even when an operation yields a different output type depending on its input. Concretely, this means that you cannot perform operations that don’t make sense, like computing the average of a string column.
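The following toy sketch illustrates both points: methods live on the expression type, so a string column simply has no mean method, and an operation such as greatest returns an expression whose type follows its inputs. The classes and the greatest function here are simplified stand-ins written for this example, not the ibis classes of the same names.

```python
class Expr:
    """Toy expression: records the name of the operation that produced it."""

    def __init__(self, op):
        self.op = op


class NumericScalar(Expr):
    pass


class NumericColumn(Expr):
    def mean(self):
        # Averaging is only offered on numeric columns.
        return NumericScalar("Mean")


class StringColumn(Expr):
    def upper(self):
        return StringColumn("Upper")


def greatest(*exprs):
    """The output type depends on the (common) type of the inputs."""
    expr_types = {type(expr) for expr in exprs}
    if len(expr_types) != 1:
        raise TypeError("greatest() needs arguments of a single type")
    return expr_types.pop()("Greatest")


prices = NumericColumn("Column")
names = StringColumn("Column")

average = prices.mean()                    # allowed: numeric column
can_average_names = hasattr(names, "mean") # False: no such API on strings
largest_name = greatest(names, names.upper())
largest_price = greatest(prices, prices)
```

Because the API is attached to the type rather than to the operation, the result of greatest on string columns offers string methods, and the same operation on numeric columns offers numeric ones.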

Compilation

The next major component of ibis is the compilers.

The first few versions of ibis directly generated strings, but the compiler infrastructure was generalized to support compilation of SQLAlchemy based expressions.

The compiler works by translating the different pieces of a SQL expression into a string or SQLAlchemy expression.

The main pieces of a SELECT statement are:

  1. The set of column expressions (select_set)
  2. WHERE clauses (where)
  3. GROUP BY clauses (group_by)
  4. HAVING clauses (having)
  5. LIMIT clauses (limit)
  6. ORDER BY clauses (order_by)
  7. DISTINCT clauses (distinct)

Each of these pieces is translated into a SQL string and finally assembled by the instance of the ExprTranslator subclass specific to the backend being compiled. For example, the ImpalaExprTranslator is one of the subclasses that will perform this translation.
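As a rough illustration of the assembly step, the sketch below joins already-translated pieces into a single SELECT statement. It is a simplified, hypothetical helper: assemble_select and its table parameter are assumptions made for this example, and it does not reproduce how ExprTranslator subclasses work internally.

```python
def assemble_select(
    table,
    select_set,
    where=(),
    group_by=(),
    having=(),
    order_by=(),
    limit=None,
    distinct=False,
):
    """Join pre-translated SQL fragments in the order SQL requires."""
    parts = ["SELECT DISTINCT" if distinct else "SELECT"]
    parts.append(", ".join(select_set))
    parts.append(f"FROM {table}")
    if where:
        # Multiple predicates are combined with AND.
        parts.append("WHERE " + " AND ".join(where))
    if group_by:
        parts.append("GROUP BY " + ", ".join(group_by))
    if having:
        parts.append("HAVING " + " AND ".join(having))
    if order_by:
        parts.append("ORDER BY " + ", ".join(order_by))
    if limit is not None:
        parts.append(f"LIMIT {limit}")
    return " ".join(parts)


sql = assemble_select(
    table="t",
    select_set=["a", "sum(b) AS total"],
    where=["b > 0"],
    group_by=["a"],
    having=["sum(b) > 10"],
    order_by=["a"],
    limit=5,
)
```

Each keyword argument corresponds to one of the pieces listed above; clauses that are empty are simply left out of the final string.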

Note

While ibis was designed with an explicit goal of first-class SQL support, ibis can target other systems such as pandas.

Execution

We presumably want to do something with our compiled expressions. This is where execution comes in.

This is the least complex part of ibis, mostly requiring only that ibis correctly handle whatever the database hands back.

By and large, the execution of compiled SQL is handled by the database to which ibis sends the SQL.

However, once the data arrives from the database we need to convert that data to a pandas DataFrame.

The Query class, with its _fetch() method, provides a way for ibis SQLClient objects to do any additional processing necessary after the database returns results to the client.
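The kind of post-processing involved can be sketched with Python's built-in sqlite3 module standing in for the database. This is not the Query class or its _fetch() method; fetch_columns is a hypothetical helper that reshapes the rows returned by a DB-API cursor into named columns, a form that a DataFrame constructor such as pandas.DataFrame can accept.

```python
import sqlite3


def fetch_columns(cursor):
    """Reshape DB-API rows into a dict mapping column name to values."""
    # cursor.description holds one tuple per result column; item 0 is the name.
    names = [description[0] for description in cursor.description]
    rows = cursor.fetchall()
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


# An in-memory database plays the role of the backend.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)", [(1, "x"), (2, "y")])

# The compiled SQL string is executed by the database, not by the client.
cursor = con.execute("SELECT a, b FROM t ORDER BY a")
columns = fetch_columns(cursor)
con.close()
```

The database does the actual query work; the client side only has to turn what comes back into a convenient in-memory structure.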