API Reference

Creating connections

These methods are in the ibis module namespace and are your main point of entry to Ibis.

hdfs_connect([host, port, protocol, …]) Connect to HDFS

Impala client

These methods are available on the Impala client object after connecting to your HDFS cluster (ibis.hdfs_connect) and connecting to Impala with ibis.impala.connect.

connect([host, port, database, timeout, …]) Create an ImpalaClient for use with Ibis.
ImpalaClient.close() Close Impala connection and drop any temporary objects
ImpalaClient.database([name]) Create a Database object for a given database name that can be used for exploring and manipulating the objects (tables, functions, views, etc.) inside it

Database methods

ImpalaClient.set_database(name) Set the default database scope for client
ImpalaClient.create_database(name[, path, force]) Create a new Impala database
ImpalaClient.drop_database(name[, force]) Drop an Impala database
ImpalaClient.list_databases([like]) List databases in the Impala cluster.
ImpalaClient.exists_database(name) Checks if a given database exists
ImpalaDatabase.create_table(table_name[, obj]) Dispatch to ImpalaClient.create_table.
ImpalaDatabase.drop([force]) Drop the database
ImpalaDatabase.namespace(ns) Creates a derived Database instance for collections of objects having a common prefix.
ImpalaDatabase.table(name) Return a table expression referencing a table in this database

Table methods

The ImpalaClient object itself has many helper utility methods, but most table-level methods are found on ImpalaTable.

ImpalaClient.database([name]) Create a Database object for a given database name that can be used for exploring and manipulating the objects (tables, functions, views, etc.) inside it
ImpalaClient.table(name[, database]) Create a table expression that references a particular table in the database
ImpalaClient.sql(query) Convert a SQL query to an Ibis table expression
ImpalaClient.raw_sql(query[, results]) Execute a given query string.
ImpalaClient.list_tables([like, database]) List tables in the current (or indicated) database.
ImpalaClient.exists_table(name[, database]) Determine if the indicated table or view exists
ImpalaClient.drop_table(table_name[, …]) Drop an Impala table
ImpalaClient.create_table(table_name[, obj, …]) Create a new table in Impala using an Ibis table expression.
ImpalaClient.insert(table_name[, obj, …]) Insert into existing table.
ImpalaClient.truncate_table(table_name[, …]) Delete all rows from, but do not drop, an existing table
ImpalaClient.get_schema(table_name[, database]) Return a Schema object for the indicated table and database
ImpalaClient.cache_table(table_name[, …]) Caches a table in cluster memory in the given pool.
ImpalaClient.load_data(table_name, path[, …]) Wraps the LOAD DATA DDL statement.
ImpalaClient.get_options() Return current query options for the Impala session
ImpalaClient.set_options(options)
ImpalaClient.set_compression_codec(codec)

The best way to interact with a single table is through the ImpalaTable object you get back from ImpalaClient.table.

ImpalaTable.add_partition(spec[, location]) Add a new table partition, creating any new directories in HDFS if necessary.
ImpalaTable.alter([location, format, …]) Change settings and parameters of the table.
ImpalaTable.alter_partition(spec[, …]) Change settings and parameters of an existing partition
ImpalaTable.column_stats() Return results of SHOW COLUMN STATS as a pandas DataFrame
ImpalaTable.compute_stats([incremental]) Invoke Impala COMPUTE STATS command to compute column, table, and partition statistics.
ImpalaTable.describe_formatted() Return parsed results of DESCRIBE FORMATTED statement
ImpalaTable.drop() Drop the table from the database
ImpalaTable.drop_partition(spec) Drop an existing table partition
ImpalaTable.files() Return results of SHOW FILES statement
ImpalaTable.insert([obj, overwrite, …]) Insert into Impala table.
ImpalaTable.invalidate_metadata()
ImpalaTable.load_data(path[, overwrite, …]) Wraps the LOAD DATA DDL statement.
ImpalaTable.metadata() Return parsed results of DESCRIBE FORMATTED statement
ImpalaTable.partition_schema() For partitioned tables, return the schema (names and types) for the partition columns
ImpalaTable.partitions() Return a pandas.DataFrame giving information about this table’s partitions.
ImpalaTable.refresh()
ImpalaTable.rename(new_name[, database]) Rename table inside Impala.
ImpalaTable.stats() Return results of SHOW TABLE STATS as a DataFrame.

Creating views is also possible:

ImpalaClient.create_view(name, expr[, database]) Create an Impala view from a table expression
ImpalaClient.drop_view(name[, database, force]) Drop an Impala view
ImpalaClient.drop_table_or_view(name[, …]) Attempt to drop a relation that may be a view or table

Accessing data formats in HDFS

ImpalaClient.avro_file(hdfs_dir, avro_schema) Create a (possibly temporary) table to read a collection of Avro data.
ImpalaClient.delimited_file(hdfs_dir, schema) Interpret delimited text files (CSV / TSV / etc.) as an Ibis table.
ImpalaClient.parquet_file(hdfs_dir[, …]) Make indicated parquet file in HDFS available as an Ibis table.

Executing expressions

ImpalaClient.execute(expr[, params, limit]) Compile and execute Ibis expression using this backend client interface, returning results in-memory in the appropriate object type
ImpalaClient.disable_codegen([disabled]) Turn off or on LLVM codegen in Impala query execution

PostgreSQL client

The PostgreSQL client is accessible through the ibis.postgres namespace.

Use ibis.postgres.connect with a SQLAlchemy-compatible connection string to create a client.

connect([host, user, password, port, …]) Create an Ibis client located at user:password@host:port connected to a PostgreSQL database named database.
PostgreSQLClient.database([name]) Connect to a database called name.
PostgreSQLClient.list_tables([like, …]) List tables/views in the current (or indicated) database.
PostgreSQLClient.list_databases()
PostgreSQLClient.table(name[, database, schema]) Create a table expression that references a particular table called name in a PostgreSQL database called database.

SQLite client

The SQLite client is accessible through the ibis.sqlite namespace.

Use ibis.sqlite.connect to create a SQLite client.

connect([path, create]) Create an Ibis client connected to a SQLite database.
SQLiteClient.attach(name, path[, create]) Connect another SQLite database file
SQLiteClient.database([name]) Create a Database object for a given database name that can be used for exploring and manipulating the objects (tables, functions, views, etc.) inside it
SQLiteClient.list_tables([like, database, …]) List tables/views in the current (or indicated) database.
SQLiteClient.table(name[, database]) Create a table expression that references a particular table in the SQLite database
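
SQLiteClient.attach is described above only as connecting another SQLite database file. As a rough illustration of what attaching means at the SQLite level, the sketch below uses Python's standard sqlite3 module rather than Ibis. The file names, the alias other, and the table t are hypothetical, and whether Ibis issues exactly this statement internally is an assumption.

```python
import os
import sqlite3
import tempfile


def attach_demo():
    """Attach a second SQLite file under an alias and query it."""
    tmpdir = tempfile.mkdtemp()
    main_path = os.path.join(tmpdir, "main.db")    # hypothetical file names
    other_path = os.path.join(tmpdir, "other.db")

    # Populate the second database file with one small table.
    other = sqlite3.connect(other_path)
    other.execute("CREATE TABLE t (x INTEGER)")
    other.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
    other.commit()
    other.close()

    # Connect to the main file, then attach the second one as "other".
    con = sqlite3.connect(main_path)
    con.execute("ATTACH DATABASE ? AS other", (other_path,))
    rows = con.execute("SELECT x FROM other.t ORDER BY x").fetchall()
    con.close()
    return rows


rows = attach_demo()
print(rows)
```

Once attached, tables in the second file are addressed through the alias, which is presumably what the database argument of SQLiteClient.table and SQLiteClient.list_tables selects.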

HDFS

Client objects have an hdfs attribute you can use to interact directly with HDFS.

HDFS.ls(hdfs_path[, status]) Return contents of directory
HDFS.chmod(hdfs_path, permissions) Change permissions of a file or directory
HDFS.chown(hdfs_path[, owner, group]) Change owner (and/or group) of a file or directory
HDFS.get(hdfs_path[, local_path, overwrite]) Download remote file or directory to the local filesystem
HDFS.head(hdfs_path[, nbytes, offset]) Retrieve the requested number of bytes from a file
HDFS.put(hdfs_path, resource[, overwrite, …]) Write file or directory to HDFS
HDFS.put_tarfile(hdfs_path, local_path[, …]) Write contents of tar archive to HDFS directly without having to decompress it locally first
HDFS.rm(path) Delete a single file
HDFS.rmdir(path) Delete a directory and all its contents
HDFS.size(hdfs_path) Return total size of file or directory
HDFS.status(path)

Top-level expression APIs

These methods are available directly in the ibis module namespace.

case() Similar to the .case method on array expressions, create a case builder that accepts self-contained boolean expressions (as opposed to expressions which are to be equality-compared with a fixed value expression)
literal(value[, type]) Create a scalar expression from a Python value.
schema([pairs, names, types]) Validate and return an Ibis Schema object
table(schema[, name]) Create an unbound Ibis table for creating expressions.
timestamp(value) Returns a timestamp literal if value is likely coercible to a timestamp
where(boolean_expr, true_expr, false_null_expr) Equivalent to the ternary expression: if X then Y else Z
ifelse(arg, true_expr, false_expr) Shorthand for implementing ternary expressions
coalesce(*args) Compute the first non-null value(s) from the passed arguments in left-to-right order.
greatest(*args) Compute the largest value (row-wise, if any arrays are present) among the supplied arguments.
least(*args) Compute the smallest value (row-wise, if any arrays are present) among the supplied arguments.
negate(arg) Negate a numeric expression
desc(expr) Create a sort key (when used in sort_by) by the passed array expression or column name.
now() Compute the current timestamp
NA
null() Create a NULL/NA scalar
expr_list(exprs)
row_number() Analytic function for the current row number, starting at 0.
window([preceding, following, group_by, …]) Create a window clause for use with window (analytic and aggregate) functions.
range_window([preceding, following, …]) Create a window clause for use with window (analytic and aggregate) functions.
trailing_window(rows[, group_by, order_by]) Create a trailing window for use with aggregate window functions.
cumulative_window([group_by, order_by]) Create a cumulative window clause for use with aggregate window functions.
trailing_range_window(preceding, order_by[, …]) Create a trailing time window for use with aggregate window functions.
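
The coalesce entry above says it returns the first non-null value(s) in left-to-right order. The following plain-Python sketch models that rule, with None standing in for NULL and lists standing in for columns. It is not Ibis code, and the helper name coalesce_rows is hypothetical.

```python
def coalesce(*args):
    """Return the first argument that is not None, or None if all are."""
    for value in args:
        if value is not None:
            return value
    return None


def coalesce_rows(*columns):
    """Apply coalesce row-wise across equally long lists."""
    return [coalesce(*row) for row in zip(*columns)]


a = [1, None, None]
b = [None, 2, None]
c = [9, 9, 9]

result = coalesce_rows(a, b, c)
print(result)
```

Each output row takes its value from the left-most column that is not null in that row.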

General expression methods

Expr.compile([limit, params]) Compile the expression for its execution target, for verification
Expr.equals(other[, cache])
Expr.execute([limit, params]) If this expression is based on physical tables in a database backend, execute it against that backend.
Expr.pipe(f, *args, **kwargs) Generic composition function to enable expression pipelining.
Expr.verify() Returns True if expression can be compiled to its attached client

Table methods

TableExpr.add_column(expr[, name]) Add indicated column expression to table, producing a new table.
TableExpr.aggregate([metrics, by, having]) Aggregate a table with a given set of reductions, with grouping expressions, and post-aggregation filters.
TableExpr.count() Returns the computed number of rows in the table expression
TableExpr.distinct() Compute set of unique rows/tuples occurring in this table
TableExpr.info([buf]) Similar to pandas DataFrame.info.
TableExpr.filter(predicates) Select rows from table based on boolean expressions
TableExpr.get_column(name) Get a reference to a single column from the table
TableExpr.get_columns(iterable) Get multiple columns from the table
TableExpr.group_by([by]) Create an intermediate grouped table expression, pending some group operation to be applied with it.
TableExpr.groupby([by]) Create an intermediate grouped table expression, pending some group operation to be applied with it.
TableExpr.limit(n[, offset]) Select the first n rows at beginning of table (may not be deterministic depending on implementation and presence of a sorting).
TableExpr.mutate([exprs]) Convenience function for table projections involving adding columns
TableExpr.projection(exprs) Compute new table expression with the indicated column expressions from this table.
TableExpr.relabel(substitutions[, replacements]) Change table column names, otherwise leaving table unaltered
TableExpr.schema() Get the schema for this table (if one is known)
TableExpr.set_column(name, expr) Replace an existing column with a new expression
TableExpr.sort_by(sort_exprs) Sort table by the indicated column expressions and sort orders (ascending/descending)
TableExpr.union(right[, distinct]) Form the table set union of two table expressions having identical schemas.
TableExpr.view() Create a new table expression that is semantically equivalent to the current one, but is considered a distinct relation for evaluation purposes.
TableExpr.join(right[, predicates, how]) Perform a relational join between two tables.
TableExpr.cross_join(**kwargs) Perform a cross join (cartesian product) amongst a list of tables, with optional set of prefixes to apply to overlapping column names
TableExpr.inner_join(other[, predicates]) Perform a relational join between two tables.
TableExpr.left_join(other[, predicates]) Perform a relational join between two tables.
TableExpr.outer_join(other[, predicates]) Perform a relational join between two tables.
TableExpr.semi_join(other[, predicates]) Perform a relational join between two tables.
TableExpr.anti_join(other[, predicates]) Perform a relational join between two tables.
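
Several of the join methods above share the same one-line summary, so the difference between them is easy to miss. The sketch below models the usual relational meaning of inner, semi, and anti joins on lists of dictionaries. It is plain Python rather than Ibis, the sample rows are made up, and it assumes a single equality predicate on one key column.

```python
left = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c"},
]
right = [
    {"id": 1, "score": 10},
    {"id": 1, "score": 20},
    {"id": 4, "score": 30},
]


def inner_join(left_rows, right_rows, key):
    """Every matching pair of rows, with columns from both sides."""
    return [
        {**lrow, **rrow}
        for lrow in left_rows
        for rrow in right_rows
        if lrow[key] == rrow[key]
    ]


def semi_join(left_rows, right_rows, key):
    """Left rows with at least one match; left columns only, no duplication."""
    keys = {rrow[key] for rrow in right_rows}
    return [lrow for lrow in left_rows if lrow[key] in keys]


def anti_join(left_rows, right_rows, key):
    """Left rows with no match on the right."""
    keys = {rrow[key] for rrow in right_rows}
    return [lrow for lrow in left_rows if lrow[key] not in keys]


inner = inner_join(left, right, "id")
semi = semi_join(left, right, "id")
anti = anti_join(left, right, "id")
print(inner, semi, anti, sep="\n")
```

Note that the inner join repeats the left row once per match, while the semi join returns it once.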

Grouped table methods

GroupedTableExpr.aggregate([metrics])
GroupedTableExpr.count([metric_name]) Convenience function for computing the group sizes (number of rows per group) given a grouped table.
GroupedTableExpr.having(expr) Add a post-aggregation result filter (like the having argument in aggregate), for composability with the group_by API
GroupedTableExpr.mutate([exprs]) Returns a table projection with analytic / window functions applied.
GroupedTableExpr.order_by(expr) Expressions to use for ordering data for a window function computation.
GroupedTableExpr.over(window) Add a window clause to be applied to downstream analytic expressions
GroupedTableExpr.projection(exprs) Like mutate, but do not include existing table columns
GroupedTableExpr.size([metric_name]) Convenience function for computing the group sizes (number of rows per group) given a grouped table.

Generic value methods

Scalar or column methods

ValueExpr.between(lower, upper) Check if the input expr falls between the lower/upper bounds passed.
ValueExpr.cast(target_type) Cast value(s) to indicated data type.
ValueExpr.coalesce() Compute the first non-null value(s) from the passed arguments in left-to-right order.
ValueExpr.fillna(fill_value) Replace any null values with the indicated fill value
ValueExpr.isin(values) Check whether the value expression is contained within the indicated list of values.
ValueExpr.notin(values) Like isin, but checks whether this expression’s value(s) are not contained in the passed values.
ValueExpr.nullif(null_if_expr) Set values to null if they match/equal a particular expression (scalar or array-valued).
ValueExpr.hash([how]) Compute an integer hash value for the indicated value expression.
ValueExpr.isnull() Returns true if values are null
ValueExpr.notnull() Returns true if values are not null
ValueExpr.over(window) Turn an aggregation or full-sample analytic operation into a windowed operation.
ValueExpr.typeof() Return the data type of the argument according to the current backend
ValueExpr.case() Create a new SimpleCaseBuilder to chain multiple if-else statements.
ValueExpr.cases(case_result_pairs[, default]) Create a case expression in one shot.
ValueExpr.substitute(value[, replacement, else_]) Substitute (replace) one or more values in a value expression
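
As an informal illustration of fillna, nullif, and cases as summarized above, here is a plain-Python model that uses None for NULL and a list for a column. It is not Ibis code; in particular, sending a null input to the default branch in cases is an assumption borrowed from SQL CASE semantics.

```python
def fillna(values, fill_value):
    """Replace null entries with fill_value."""
    return [fill_value if v is None else v for v in values]


def nullif(values, null_if):
    """Turn entries equal to null_if into nulls."""
    return [None if v == null_if else v for v in values]


def cases(values, case_result_pairs, default=None):
    """Map each value through (case, result) pairs, else use default."""
    mapping = dict(case_result_pairs)
    return [mapping.get(v, default) for v in values]


values = ["a", None, "b", "a"]

filled = fillna(values, "?")
nulled = nullif(values, "a")
mapped = cases(values, [("a", 1), ("b", 2)], default=0)
print(filled, nulled, mapped, sep="\n")
```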

Column methods

ColumnExpr.distinct() Compute set of unique values occurring in this array.
ColumnExpr.count([where]) Compute cardinality / sequence size of expression.
ColumnExpr.min([where])
ColumnExpr.max([where])
ColumnExpr.approx_median([where])
ColumnExpr.approx_nunique([where])
ColumnExpr.group_concat([sep, where]) Concatenate values using the indicated separator (comma by default) to produce a string
ColumnExpr.nunique([where])
ColumnExpr.summary([exact_nunique, prefix]) Compute a set of summary metrics from the input value expression
ColumnExpr.value_counts([metric_name]) Compute a frequency table for this value expression
ColumnExpr.first()
ColumnExpr.last()
ColumnExpr.dense_rank() Compute position of first element within each equal-value group in sorted order, ignoring duplicate values.
ColumnExpr.rank() Compute position of first element within each equal-value group in sorted order.
ColumnExpr.lag([offset, default])
ColumnExpr.lead([offset, default])
ColumnExpr.cummin() Cumulative min.
ColumnExpr.cummax() Cumulative max.
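
The rank and dense_rank summaries above differ only in the phrase "ignoring duplicate values". The sketch below shows that difference in plain Python. It is not Ibis code, and it assumes zero-based positions, in line with row_number being described as starting at 0.

```python
def rank(values):
    """Position of the first element of each equal-value group in sorted order."""
    first_pos = {}
    for pos, v in enumerate(sorted(values)):
        first_pos.setdefault(v, pos)  # keep only the first position seen
    return [first_pos[v] for v in values]


def dense_rank(values):
    """Same idea, but positions are counted over distinct values only."""
    pos = {v: i for i, v in enumerate(sorted(set(values)))}
    return [pos[v] for v in values]


values = [10, 20, 20, 30]

ranks = rank(values)
dense = dense_rank(values)
print(ranks, dense)
```

With duplicates present, rank leaves a gap after the tied group, while dense_rank continues with the next consecutive integer.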

General numeric methods

Scalar or column methods

NumericValue.abs() Absolute value
NumericValue.ceil() Round up to the nearest integer value greater than or equal to this value
NumericValue.floor() Round down to the nearest integer value less than or equal to this value
NumericValue.sign()
NumericValue.exp()
NumericValue.sqrt()
NumericValue.log([base]) Perform the logarithm using a specified base
NumericValue.ln() Natural logarithm
NumericValue.log2() Logarithm base 2
NumericValue.log10() Logarithm base 10
NumericValue.round([digits]) Round values either to integer or indicated number of decimal places.
NumericValue.nullifzero() Set values to NULL if they are equal to zero.
NumericValue.zeroifnull()
NumericValue.add(other)
NumericValue.sub(other)
NumericValue.mul(other)
NumericValue.div(other)
NumericValue.pow(other)
NumericValue.rdiv(other)
NumericValue.rsub(other)

Column methods

NumericColumn.sum([where])
NumericColumn.mean([where])
NumericColumn.std([where, how]) Compute standard deviation of numeric array
NumericColumn.var([where, how]) Compute variance of numeric array
NumericColumn.cumsum() Cumulative sum.
NumericColumn.cummean() Cumulative mean.
NumericColumn.bottomk(k[, by])
NumericColumn.topk(k[, by])
NumericColumn.bucket(buckets[, closed, …]) Compute a discrete binning of a numeric array
NumericColumn.histogram([nbins, binwidth, …]) Compute a histogram with fixed width bins
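
For the cumulative methods listed above, a small plain-Python model may help. It is not Ibis code: it treats a column as a list and ignores any window ordering or grouping that a backend would apply.

```python
from itertools import accumulate


def cumsum(values):
    """Running total of the values seen so far."""
    return list(accumulate(values))


def cummean(values):
    """Running total divided by the number of values seen so far."""
    return [total / (i + 1) for i, total in enumerate(accumulate(values))]


values = [2, 4, 6, 8]

sums = cumsum(values)
means = cummean(values)
print(sums, means)
```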

Integer methods

Scalar or column methods

IntegerValue.convert_base(from_base, to_base) Convert number (as integer or string) from one base to another
IntegerValue.to_timestamp([unit]) Convert integer UNIX timestamp (at some resolution) to a timestamp type

String methods

All string operations are valid on either scalar or array values.

StringValue.convert_base(from_base, to_base) Convert number (as integer or string) from one base to another
StringValue.length() Compute length of strings
StringValue.lower() Convert string to all lowercase
StringValue.upper() Convert string to all uppercase
StringValue.reverse() Reverse string
StringValue.ascii_str()
StringValue.strip() Remove whitespace from left and right sides of string
StringValue.lstrip() Remove whitespace from left side of string
StringValue.rstrip() Remove whitespace from right side of string
StringValue.capitalize() Return a capitalized version of input string
StringValue.contains(substr) Determine if indicated string is exactly contained in the calling string.
StringValue.like(patterns) Wildcard fuzzy matching function equivalent to the SQL LIKE directive.
StringValue.to_timestamp(format_str[, timezone]) Parses a string and returns a timestamp.
StringValue.parse_url(extract[, key]) Returns the portion of a URL corresponding to a part specified by ‘extract’. Can optionally specify a key to retrieve an associated value if the extract parameter is ‘QUERY’
StringValue.substr(start[, length]) Pull substrings out of each string value by position and maximum length.
StringValue.left(nchars) Return left-most up to N characters from each string.
StringValue.right(nchars) Return up to nchars starting from end of each string.
StringValue.repeat(n) Returns the argument string repeated n times
StringValue.find(substr[, start, end]) Returns position (0 indexed) of first occurrence of substring, optionally after a particular position (0 indexed)
StringValue.translate(from_str, to_str) Returns string with set of ‘from’ characters replaced by set of ‘to’ characters.
StringValue.find_in_set(str_list) Returns position (0 indexed) of first occurrence of argument within a list of strings.
StringValue.join(strings) Joins a list of strings together using the calling string as a separator
StringValue.replace(pattern, replacement) Replaces each exact occurrence of pattern with given replacement string.
StringValue.lpad(length[, pad]) Returns string of given length by truncating (on right) or padding (on left) original string
StringValue.rpad(length[, pad]) Returns string of given length by truncating (on right) or padding (on right) original string
StringValue.rlike(pattern) Search string values using a regular expression.
StringValue.re_search(pattern) Search string values using a regular expression.
StringValue.re_extract(pattern, index) Returns specified index, 0 indexed, from string based on regex pattern given
StringValue.re_replace(pattern, replacement) Replaces match found by regex with replacement string.
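
The lpad and rpad summaries above combine truncation and padding in one sentence. The plain-Python sketch below spells out that behaviour for a single string. It is not Ibis code, and it assumes the pad string is non-empty and defaults to a space.

```python
def lpad(s, length, pad=" "):
    """Truncate on the right, or pad on the left, to exactly `length` characters."""
    if len(s) >= length:
        return s[:length]
    missing = length - len(s)
    fill = (pad * missing)[:missing]  # repeat pad, then cut to the exact size
    return fill + s


def rpad(s, length, pad=" "):
    """Truncate on the right, or pad on the right, to exactly `length` characters."""
    if len(s) >= length:
        return s[:length]
    missing = length - len(s)
    fill = (pad * missing)[:missing]
    return s + fill


padded_left = lpad("ibis", 7, "*")
padded_right = rpad("ibis", 6, "-")
truncated = lpad("ibis", 2)
multi = lpad("hi", 5, "xy")
print(padded_left, padded_right, truncated, multi)
```

Both functions truncate from the right; they differ only in the side on which padding is added.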

Timestamp methods

All timestamp operations are valid on either scalar or array values.

TimestampValue.strftime(format_str) Format timestamp according to the passed format string.
TimestampValue.year()
TimestampValue.month()
TimestampValue.day()
TimestampValue.day_of_week Namespace expression containing methods for extracting information about the day of the week of a TimestampValue or DateValue expression.
TimestampValue.hour()
TimestampValue.minute()
TimestampValue.second()
TimestampValue.millisecond()
TimestampValue.truncate(unit) Zero out smaller-size units beyond indicated unit.
TimestampValue.time() Return a Time node for a Timestamp. We can then perform certain operations on this node without actually instantiating the underlying structure (which is inefficient in pandas/numpy)
TimestampValue.date() Return a Date node for a Timestamp. We can then perform certain operations on this node without actually instantiating the underlying structure (which is inefficient in pandas/numpy)
TimestampValue.add(other)
TimestampValue.radd(other)
TimestampValue.sub(right)
TimestampValue.rsub(right)
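
To make "zero out smaller-size units beyond indicated unit" concrete, here is a sketch of truncation using Python's standard datetime type. It is not Ibis code, and the unit names used here ("year", "month", and so on) are hypothetical; the unit strings Ibis accepts may differ.

```python
from datetime import datetime

# Hypothetical unit names for this sketch, ordered from largest to smallest.
_FIELDS = ["year", "month", "day", "hour", "minute", "second"]
_MINIMUMS = {"month": 1, "day": 1, "hour": 0, "minute": 0, "second": 0}


def truncate(ts, unit):
    """Reset every field smaller than `unit` to its minimum value."""
    keep = _FIELDS.index(unit) + 1
    parts = {}
    for i, name in enumerate(_FIELDS):
        parts[name] = getattr(ts, name) if i < keep else _MINIMUMS[name]
    return datetime(**parts)  # microseconds default to 0


ts = datetime(2015, 7, 19, 13, 45, 30, 123456)

to_hour = truncate(ts, "hour")
to_month = truncate(ts, "month")
print(to_hour, to_month)
```

Months and days reset to 1 rather than 0, since those fields have no zero value.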

Date methods

DateValue.strftime(format_str) Format timestamp according to the passed format string.
DateValue.year()
DateValue.month()
DateValue.day()
DateValue.day_of_week Namespace expression containing methods for extracting information about the day of the week of a TimestampValue or DateValue expression.
DateValue.truncate(unit) Zero out smaller-size units beyond indicated unit.
DateValue.add(other)
DateValue.radd(other)
DateValue.sub(right)
DateValue.rsub(right)

Day of week methods

DayOfWeek.index() Get the index of the day of the week.
DayOfWeek.full_name() Get the name of the day of the week.

Time methods

TimeValue.between(lower, upper[, timezone]) Check if the input expr falls between the lower/upper bounds passed.
TimeValue.truncate(unit) Zero out smaller-size units beyond indicated unit.
TimeValue.hour()
TimeValue.minute()
TimeValue.second()
TimeValue.millisecond()
TimeValue.add(other)
TimeValue.radd(other)
TimeValue.sub(right)
TimeValue.rsub(right)

Interval methods

IntervalValue.to_unit(target_unit)
IntervalValue.years Extract the number of years from an IntervalValue expression.
IntervalValue.quarters Extract the number of quarters from an IntervalValue expression.
IntervalValue.months Extract the number of months from an IntervalValue expression.
IntervalValue.weeks Extract the number of weeks from an IntervalValue expression.
IntervalValue.days Extract the number of days from an IntervalValue expression.
IntervalValue.hours Extract the number of hours from an IntervalValue expression.
IntervalValue.minutes Extract the number of minutes from an IntervalValue expression.
IntervalValue.seconds Extract the number of seconds from an IntervalValue expression.
IntervalValue.milliseconds Extract the number of milliseconds from an IntervalValue expression.
IntervalValue.microseconds Extract the number of microseconds from an IntervalValue expression.
IntervalValue.nanoseconds Extract the number of nanoseconds from an IntervalValue expression.
IntervalValue.add(other)
IntervalValue.radd(other)
IntervalValue.sub(other)
IntervalValue.mul(other)
IntervalValue.rmul(other)
IntervalValue.floordiv(other)
IntervalValue.negate() Negate a numeric expression

Boolean methods

BooleanValue.ifelse(true_expr, false_expr) Shorthand for implementing ternary expressions
BooleanColumn.any()
BooleanColumn.all()
BooleanColumn.cumany()
BooleanColumn.cumall()

Category methods

Category is a logical type with either a known or unknown cardinality. Values are represented semantically as integers starting at 0.

CategoryValue.label(labels[, nulls]) Format a known number of categories as strings
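
Since category values are represented as integers starting at 0, labelling amounts to indexing into a list of strings. The sketch below models that in plain Python. It is not Ibis code, and treating nulls as the label used for null categories is an assumption about that parameter.

```python
def label(codes, labels, nulls=None):
    """Map integer category codes to strings; null codes get `nulls`."""
    return [nulls if code is None else labels[code] for code in codes]


codes = [0, 2, None, 1]
names = ["low", "mid", "high"]

labelled = label(codes, names, nulls="unknown")
print(labelled)
```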