Extending the Record Layer

As the name suggests, the Record Layer is meant to sit above the FoundationDB key-value store and possibly below higher level layers such as query language parsers. Within the layer itself, there are a number of extension points. Generally, concepts that have several variants or involve significant implementation tradeoffs are defined and accessed through such an extension point.

Objects that need to be known by the Record Layer core are generally discovered using ServiceLoader. This allows them to be introduced by additional .jar files in the class path.

NOTE: These extension interfaces have varying levels of maturity. Our @API stability annotations are meant to make this level of maturity clearer to users of the Record Layer. Anyone writing extensions for the Record Layer which, for whatever reason, are not appropriate to submit back to the core source, is still strongly encouraged to create GitHub issues or forum posts noting what they used so that their extensions can be better supported in a compatible way as the system evolves.

Indexing data

A (secondary) index in the Record Layer is a subspace of the record store uniquely associated with the index. This subspace is updated when records are inserted or updated in the same transaction so that it is always consistent with the (primary) record data.

Within the meta-data, an index consists of:

  • A unique name. The name is used for meta-data operations. There is a separate, more compact, unique identifier that is the actual part of the index’s subspace prefix key.
  • An index type. An index type is a unique string used to look up an index maintainer in a registry.
  • A key expression. A key expression is evaluated against a stored record to produce a list of tuples for each index entry for the same record.
  • Various index options. These options can tweak index behavior by, for example, adding uniqueness constraints.

The most basic index type is value. A value index follows the secondary index pattern in the FoundationDB samples more or less exactly: each index entry consists of a key-value pair within the index’s subspace whose key suffix is the indexed value concatenated with the record’s primary key. To find matching records as part of a query, a range scan is performed against the index’s subspace with the target value as a prefix and the primary keys that follow it used to retrieve the actual records.

The most basic key expression type is field. It gets a field value from the record and returns a singleton list with the singleton tuple of that value. If the field were repeated, the list would be longer. If a concat expression were used, the tuple would be longer.

The first consideration in designing a new way of indexing data is therefore determining whether this is a new way of storing the data or just a new way of extracting the data from the record and then storing it in the standard way. In the former case, we require a new index maintainer. In the latter case, we require a new key expression. For example, the Record Layer’s indexes on ordinal rank require a skip list optimized for FoundationDB’s key space which is maintained by the RankIndexMaintainer. In contrast, collation indexes are maintained exactly like standard value indexes once the collation has been applied; this is handled by CollateFunctionKeyExpression.

An index maintainer is an implementation of the IndexMaintainer abstract class. The objects in the service loader are the IndexMaintainerFactory instances which are associated with one or more index types and produce index maintainers with specific state for each use. Most index maintainers extend StandardIndexMaintainer, which has methods for things like determining which index entries have in fact changed to avoid removing and then adding right back the same entry.

A key expression is an implementation of the KeyExpression interface. The most general form of a key expression requires changes to the core because it must implement toProto for serialization with the meta-data and be known to fromProto for the other direction.

A special kind of key expression meant specifically for implementations outside the core is a FunctionKeyExpression. These are produced by instances of FunctionKeyExpression.Factory found through the service loader. A function key expression adds:

  • a name, through which the expression can be found and with which it can be stored in the meta-data’s protobuf.
  • an argument, another key expression, which is evaluated as part of evaluating the function key expression.

The FunctionKeyExpression’s evaluateFunction method is called with the record being saved, so it can do whatever it wants to get data from there. If the function just needs some field(s) from the records, it is a best practice to specify those as part of the argument expression.

Record functions

A RecordFunction is evaluated by a record store against an FDBRecord. A function is identified by name.

A StoreRecordFunction is evaluated by the record store itself, so new ones can only be added by changes to the core or a new record store subclass.

An IndexRecordFunction is evaluated by an index maintainer. The canEvaluateRecordFunction method is used to choose the index maintainer and then evaluateRecordFunction to do the actual evaluation.

An example of a record function implemented by an index is rank. The index maintains a skip list of the indexed value at save time. Then, at query time, rank can be evaluated more efficiently. Something similar could be done with other data structures that store some kind of state from which relationships between a given record and the other index records can be calculated.

Aggregate functions

An IndexAggregateFunction is like an aggregate function in SQL, but it is evaluated with the assistance of an index. The index maintains aggregates (or partial aggregates) so that it can efficiently calculate the aggregate without having to actually scan records. The index does not typically store enough information to associate individual records back with the aggregate data and so is more like a materialized view of an aggregate scan. An IndexAggregateFunction has an operand, which is like the GROUP BY expression in SQL, separating the aggregates into several bins.

An IndexAggregateFunction is evaluated by an index maintainer. The canEvaluateAggregateFunction method is used to choose the index maintainer and then evaluateAggregateFunction to do the actual evaluation.

An ordinary value index can implement the min and max aggregate functions. The index stores values in order (possibly grouped by earlier fields on a compound index). The min is just the smallest such value and the max the largest.

The remaining index aggregate functions in the core, such as count and sum, are implemented using FoundationDB’s atomic mutations. This gives the added benefit that updates to the total do not cause transaction conflicts.

Query components

A QueryComponent is a tri-valued (true, false, unknown, as in SQL) predicate used to implement filtering in queries. It is invoked with the record and a context of parameter bindings.

Query execution plan elements

A query plan is a tree of RecordQueryPlan. The plan’s execute method returns a RecordCursor of queried records. For the root of the tree, these records are the result of the query. For the leaves, the plan element will open a cursor that scans records from the record store, usually through a secondary index. The intermediate plan elements execute their children, take the cursors that result and filter, buffer, or combine them in some way.

A RecordCursor is an asynchronous form of iterator based on CompletableFuture. Execution of a tree of RecordQueryPlan produces a tree of RecordCursor. A new RecordQueryPlan implementation might require a new implementation of RecordCursor.

Since FoundationDB is optimized for throughput more than individual latency, it is important to keep a reasonable amount of database work outstanding. The core set of RecordCursor implementations include mapPipelined and flatMapPipelined, which turn input cursor elements into futures and sub-streams, respectively, keeping a specified number of elements outstanding ahead of where the cursor itself has generated results.

RecordQueryPlan.execute and RecordCursors must also support continuations, which are opaque byte strings used to identify a position in the stream at which the cursor can continue again, possibly in a different transaction. Since all the normal leaf elements and cursors support continuations, implementing them for a new cursor usually means combining the continuation state of child cursors when generating a continuation and then splitting the combined continuation into initial continuations for the child sub-plans when resuming execution.

Query planner rules

Making the query planner extensible is being tracked as issue #274.