Extending the Record Layer
As the name suggests, the Record Layer is meant to sit above the FoundationDB key-value store and possibly below higher-level layers such as query language parsers. Within the layer itself, there are a number of extension points. Generally, concepts that have several variants or involve significant implementation tradeoffs are defined and accessed through such an extension point.
Objects that need to be known by the Record Layer core are generally discovered using `ServiceLoader`. This allows them to be introduced by additional .jar files in the class path.
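As a sketch of the discovery mechanism, using a made-up `GreetingFactory` interface rather than a real Record Layer extension point:

```java
import java.util.ServiceLoader;

// Hypothetical extension-point interface; the Record Layer's real factories
// (e.g. index maintainer factories) are discovered the same way.
interface GreetingFactory {
    String greeting();
}

public class DiscoveryDemo {
    public static int countProviders() {
        int n = 0;
        // ServiceLoader scans META-INF/services entries on the class path,
        // so dropping an extra .jar into the class path adds providers.
        for (GreetingFactory f : ServiceLoader.load(GreetingFactory.class)) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // With no provider registered, the loader simply finds nothing.
        System.out.println("providers found: " + countProviders());
    }
}
```

A provider .jar would register itself by shipping a `META-INF/services/GreetingFactory` file naming its implementation class.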
NOTE: These extension interfaces have varying levels of maturity. Our `@API` stability annotations are meant to make this level of maturity clearer to users of the Record Layer. Anyone writing extensions for the Record Layer which, for whatever reason, are not appropriate to submit back to the core source, is still strongly encouraged to create GitHub issues or forum posts noting what they used so that their extensions can be better supported in a compatible way as the system evolves.
Indexes

A (secondary) index in the Record Layer is a subspace of the record store uniquely associated with the index. This subspace is updated when records are inserted or updated in the same transaction so that it is always consistent with the (primary) record data.
Within the meta-data, an index consists of:
- A unique name. The name is used for meta-data operations. There is a separate, more compact unique identifier that actually forms part of the index’s subspace key prefix.
- An index type. An index type is a unique string used to look up an index maintainer in a registry.
- A key expression. A key expression is evaluated against a stored record to produce a list of tuples, one for each index entry for that record.
- Various index options. These options can tweak index behavior by, for example, adding uniqueness constraints.
The most basic index type is `value`. A value index follows the secondary index pattern in the FoundationDB samples more or less exactly: each index entry consists of a key-value pair within the index’s subspace whose key suffix is the indexed value concatenated with the record’s primary key. To find matching records as part of a query, a range scan is performed against the index’s subspace with the target value as a prefix, and the primary keys that follow it are used to retrieve the actual records.
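The layout can be sketched with plain JDK types; a sorted map stands in for the ordered key-value store and delimited strings stand in for tuple encodings, so all names here are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// A toy model of a value index: keys are (indexed value, primary key) pairs
// ordered lexicographically, standing in for tuples under the index subspace.
public class ValueIndexDemo {
    // TreeMap stands in for the ordered key-value store.
    private final TreeMap<String, Boolean> index = new TreeMap<>();

    // NUL separator sorts below printable characters; the real layer uses
    // tuple encoding, which preserves order without such tricks.
    private static String entryKey(String indexedValue, String primaryKey) {
        return indexedValue + "\u0000" + primaryKey;
    }

    public void save(String indexedValue, String primaryKey) {
        index.put(entryKey(indexedValue, primaryKey), Boolean.TRUE);
    }

    // Range scan with the target value as a prefix; the suffix of each
    // matching key is the primary key used to fetch the record itself.
    public List<String> lookup(String value) {
        List<String> primaryKeys = new ArrayList<>();
        String prefix = value + "\u0000";
        for (String key : index.tailMap(prefix).keySet()) {
            if (!key.startsWith(prefix)) break;
            primaryKeys.add(key.substring(prefix.length()));
        }
        return primaryKeys;
    }

    public static void main(String[] args) {
        ValueIndexDemo idx = new ValueIndexDemo();
        idx.save("blue", "rec2");
        idx.save("red", "rec1");
        idx.save("blue", "rec7");
        System.out.println(idx.lookup("blue")); // primary keys of matching records
    }
}
```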
The most basic key expression type is `field`. It gets a field value from the record and returns a singleton list with the singleton tuple of that value. If the field were repeated, the list would be longer. If a `concat` expression were used, the tuple would be longer.
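A toy version of these two expression shapes, with records modeled as maps rather than protobuf messages (names are illustrative, not the `KeyExpression` API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy key expressions: the result in each case is a list of tuples,
// represented here as List<List<Object>>.
public class KeyExpressionDemo {
    // field: one tuple per value; a repeated field fans out into more tuples.
    public static List<List<Object>> field(Map<String, ?> record, String name) {
        List<List<Object>> result = new ArrayList<>();
        Object value = record.get(name);
        if (value instanceof List) {
            for (Object v : (List<?>) value) result.add(List.of(v));
        } else {
            result.add(List.of(value));
        }
        return result;
    }

    // concat of two single-valued fields: same number of tuples, longer tuples.
    public static List<List<Object>> concat(Map<String, ?> record, String a, String b) {
        return List.of(List.of(record.get(a), record.get(b)));
    }

    public static void main(String[] args) {
        Map<String, Object> record = Map.of("color", "red", "tags", List.of("a", "b"));
        System.out.println(field(record, "color"));           // [[red]]
        System.out.println(field(record, "tags"));            // [[a], [b]] - longer list
        System.out.println(concat(record, "color", "color")); // [[red, red]] - longer tuple
    }
}
```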
The first consideration in designing a new way of indexing data is therefore determining whether this is a new way of storing the data or just a new way of extracting the data from the record and then storing it in the standard way. In the former case, we require a new index maintainer. In the latter case, we require a new key expression. For example, the Record Layer’s indexes on ordinal rank require a skip list optimized for FoundationDB’s key space, which is maintained by the `RankIndexMaintainer`. In contrast, collation indexes are maintained exactly like standard value indexes once the collation has been applied; this is handled by a function key expression rather than a new index maintainer.
Index maintainers

An index maintainer is an implementation of the `IndexMaintainer` abstract class. The objects in the service loader are `IndexMaintainerFactory` instances, which are associated with one or more index types and produce index maintainers with specific state for each use. Most index maintainers extend `StandardIndexMaintainer`, which has methods for things like determining which index entries have actually changed, to avoid removing and then immediately re-adding the same entry.
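The entry-diffing idea can be sketched as plain set arithmetic (illustrative names, not the actual `StandardIndexMaintainer` API):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the entry diffing done on record update: only entries that
// actually changed are removed or added, so an unchanged entry is never
// deleted and immediately re-inserted.
public class IndexUpdateDemo {
    public static Set<String> toRemove(Set<String> oldEntries, Set<String> newEntries) {
        Set<String> remove = new HashSet<>(oldEntries);
        remove.removeAll(newEntries); // old entries no longer produced by the record
        return remove;
    }

    public static Set<String> toAdd(Set<String> oldEntries, Set<String> newEntries) {
        Set<String> add = new HashSet<>(newEntries);
        add.removeAll(oldEntries);    // entries the updated record newly produces
        return add;
    }

    public static void main(String[] args) {
        Set<String> oldEntries = Set.of("blue/rec1", "tagA/rec1");
        Set<String> newEntries = Set.of("blue/rec1", "tagB/rec1");
        System.out.println("remove: " + toRemove(oldEntries, newEntries)); // [tagA/rec1]
        System.out.println("add:    " + toAdd(oldEntries, newEntries));    // [tagB/rec1]
    }
}
```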
Key expressions

A key expression is an implementation of the `KeyExpression` interface. The most general form of a key expression requires changes to the core, because it must implement `toProto` for serialization with the meta-data and be known to `fromProto` for the other direction.
A special kind of key expression, meant specifically for implementations outside the core, is the `FunctionKeyExpression`. These are produced by instances of `FunctionKeyExpression.Factory` found through the service loader. A function key expression adds:
- a name, through which the expression can be found and with which it can be stored in the meta-data’s protobuf.
- an argument, another key expression, which is evaluated as part of evaluating the function key expression.
The `evaluateFunction` method is called with the record being saved, so it can do whatever it needs to get data from it. If the function just needs some field(s) from the record, it is a best practice to specify those as part of the argument expression.
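A minimal sketch of the shape of such a function, with made-up names and plain lists in place of the Record Layer's tuple types:

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Sketch of a function key expression: a named function applied to the
// tuples produced by its argument expression.
public class FunctionKeyDemo {
    // The argument is itself a key expression; its result is the tuple list
    // the function transforms into the tuples actually indexed.
    public static List<List<Object>> evaluateFunction(
            Function<List<Object>, List<Object>> fn, List<List<Object>> argumentTuples) {
        return argumentTuples.stream().map(fn).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Argument expression result: say, the "name" field of the record.
        List<List<Object>> arg = List.of(List.of("Alice"));
        // A collation-like function: index a case-folded form of the value.
        List<List<Object>> entries =
                evaluateFunction(t -> List.of(((String) t.get(0)).toLowerCase()), arg);
        System.out.println(entries); // [[alice]]
    }
}
```

Because the function only touches what its argument expression produced, the meta-data records exactly which fields the index depends on.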
Record functions

A `RecordFunction` is evaluated by a record store against an `FDBRecord`. A function is identified by name.
A `StoreRecordFunction` is evaluated by the record store itself, so new ones can only be added by changes to the core or by a new record store subclass.
An `IndexRecordFunction` is evaluated by an index maintainer. The `canEvaluateRecordFunction` method is used to choose the index maintainer and then `evaluateRecordFunction` does the actual evaluation.
An example of a record function implemented by an index is `rank`. The index maintains a skip list of the indexed values at save time. Then, at query time, rank can be evaluated efficiently. Something similar could be done with other data structures that store some kind of state from which relationships between a given record and the other indexed records can be calculated.
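A conceptual stand-in using a sorted in-memory map (the real index keeps a probabilistic skip list in FoundationDB keys, but the contract is the same: report how many indexed values sort before a given one without scanning all records at query time):

```java
import java.util.TreeMap;

// Toy rank "index": state maintained at save time, queried at read time.
public class RankDemo {
    // value -> number of records with that value
    private final TreeMap<Long, Integer> counts = new TreeMap<>();

    public void save(long value) {
        counts.merge(value, 1, Integer::sum);
    }

    public long rank(long value) {
        // headMap: all strictly smaller values; with a skip list this is a
        // logarithmic traversal rather than a full scan.
        return counts.headMap(value).values().stream()
                .mapToLong(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        RankDemo scores = new RankDemo();
        for (long v : new long[] {100, 250, 250, 900}) scores.save(v);
        System.out.println(scores.rank(900)); // 3 records score below 900
    }
}
```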
An `IndexAggregateFunction` is like an aggregate function in SQL, but it is evaluated with the assistance of an index. The index maintains aggregates (or partial aggregates) so that it can efficiently calculate the aggregate without having to actually scan records. The index does not typically store enough information to associate individual records back with the aggregate data, and so is more like a materialized view of an aggregate scan. An `IndexAggregateFunction` has an operand, which is like the `GROUP BY` expression in SQL, separating the aggregates into several bins.
An `IndexAggregateFunction` is evaluated by an index maintainer. The `canEvaluateAggregateFunction` method is used to choose the index maintainer and then `evaluateAggregateFunction` does the actual evaluation.
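The idea can be sketched with a per-group map of partial sums (illustrative, not the actual index maintainer API):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of an index-maintained aggregate with a GROUP BY-like operand:
// partial sums are kept per group at save time, so evaluating the aggregate
// never scans the records themselves.
public class GroupedSumDemo {
    private final Map<String, Long> sums = new HashMap<>(); // group -> running total

    public void save(String group, long value) {
        // Updated in the same "transaction" as the record itself.
        sums.merge(group, value, Long::sum);
    }

    public long evaluateAggregateFunction(String group) {
        return sums.getOrDefault(group, 0L);
    }

    public static void main(String[] args) {
        GroupedSumDemo idx = new GroupedSumDemo();
        idx.save("us", 10);
        idx.save("eu", 5);
        idx.save("us", 7);
        System.out.println(idx.evaluateAggregateFunction("us")); // 17, no record scan
    }
}
```

Note that, as the text says, the map cannot answer which records contributed to a bin, only the bin's total.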
For example, a `value` index can implement the `min` and `max` aggregate functions. The index stores values in order (possibly grouped by earlier fields of a compound index); the min is just the smallest such value and the max the largest.
The remaining index aggregate functions in the core, such as `sum`, are implemented using FoundationDB’s atomic mutations. This has the added benefit that updates to the total do not cause transaction conflicts.
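FoundationDB's `ADD` mutation treats the stored value as a little-endian integer and applies the addition server-side, so writers never read the running total and therefore never conflict on it. A sketch of that little-endian arithmetic on 8-byte values:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Models the arithmetic of an atomic ADD on an 8-byte little-endian value.
// In the real database, this addition happens on the server at commit time,
// without the transaction reading the current total.
public class AtomicAddDemo {
    public static byte[] add(byte[] stored, long operand) {
        long current = ByteBuffer.wrap(stored).order(ByteOrder.LITTLE_ENDIAN).getLong();
        return ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN)
                .putLong(current + operand).array();
    }

    public static long decode(byte[] value) {
        return ByteBuffer.wrap(value).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }

    public static void main(String[] args) {
        byte[] total = new byte[8]; // starts at zero
        total = add(total, 5);      // one transaction's contribution
        total = add(total, 7);      // another, with no read of the total
        System.out.println(decode(total)); // 12
    }
}
```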
Query filter components

A `QueryComponent` is a tri-valued (`true`, `false`, or `unknown`, as in SQL) predicate used to implement filtering in queries. It is invoked with the record and a context of parameter bindings.
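One way to model three-valued logic in Java is a `Boolean` that may be `null` for unknown; this sketch (not the actual `QueryComponent` API) shows a field predicate and an SQL-style `AND`:

```java
import java.util.Map;

// Boolean.TRUE, Boolean.FALSE, or null for UNKNOWN (e.g. a missing field),
// following SQL's three-valued logic.
public class TriValuedDemo {
    public static Boolean fieldEquals(Map<String, ?> record, String field, Object value) {
        Object actual = record.get(field);
        if (actual == null) return null;   // field absent: UNKNOWN
        return actual.equals(value);
    }

    // SQL AND: FALSE dominates, then UNKNOWN, then TRUE.
    public static Boolean and(Boolean a, Boolean b) {
        if (Boolean.FALSE.equals(a) || Boolean.FALSE.equals(b)) return false;
        if (a == null || b == null) return null;
        return true;
    }

    public static void main(String[] args) {
        Map<String, Object> record = Map.of("color", "red");
        System.out.println(fieldEquals(record, "color", "red"));   // true
        System.out.println(and(fieldEquals(record, "color", "red"),
                               fieldEquals(record, "size", "L"))); // null (UNKNOWN)
    }
}
```

A query filter then keeps only records for which the predicate is definitely `true`, as in SQL's `WHERE`.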
Query execution plan elements
A query plan is a tree of `RecordQueryPlan` objects. A plan’s `execute` method returns a `RecordCursor` of queried records. For the root of the tree, these records are the result of the query. At the leaves, the plan element opens a cursor that scans records from the record store, usually through a secondary index. The intermediate plan elements execute their children, take the resulting cursors, and filter, buffer, or combine them in some way.
A `RecordCursor` is an asynchronous form of iterator based on `CompletableFuture`. Executing a tree of `RecordQueryPlan`s produces a tree of `RecordCursor`s. A new `RecordQueryPlan` implementation might therefore require a new implementation of `RecordCursor`.
Since FoundationDB is optimized for throughput more than for individual latency, it is important to keep a reasonable amount of database work outstanding. The core set of `RecordCursor` implementations includes `mapPipelined` and `flatMapPipelined`, which turn input cursor elements into futures and sub-streams, respectively, keeping a specified number of elements outstanding ahead of where the cursor itself has generated results.
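A simplified sketch of the pipelining idea (not the real `RecordCursor.mapPipelined` implementation): start up to a fixed number of futures ahead of the consumer and drain them in order:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Keep roughly `pipelineSize` futures in flight ahead of the consumer,
// so the database always has work outstanding.
public class PipelineDemo {
    public static <T, R> List<R> mapPipelined(List<T> input,
            Function<T, CompletableFuture<R>> fn, int pipelineSize) {
        ArrayDeque<CompletableFuture<R>> inFlight = new ArrayDeque<>();
        List<R> results = new ArrayList<>();
        for (T item : input) {
            inFlight.add(fn.apply(item));              // start work eagerly
            if (inFlight.size() > pipelineSize) {
                results.add(inFlight.remove().join()); // drain the oldest when full
            }
        }
        while (!inFlight.isEmpty()) {                  // drain the tail, in order
            results.add(inFlight.remove().join());
        }
        return results;
    }

    public static void main(String[] args) {
        List<Integer> doubled = mapPipelined(List.of(1, 2, 3, 4),
                n -> CompletableFuture.supplyAsync(() -> n * 2), 2);
        System.out.println(doubled); // [2, 4, 6, 8]
    }
}
```

Draining in FIFO order preserves the input ordering even when the futures complete out of order.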
`RecordCursor`s must also support continuations: opaque byte strings that identify a position in the stream at which the cursor can be resumed, possibly in a different transaction. Since all the normal leaf plan elements and cursors support continuations, implementing them for a new cursor usually means combining the continuation state of child cursors when generating a continuation, and then splitting that combined continuation into initial continuations for the child sub-plans when resuming execution.
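A sketch of one way a parent cursor might frame two child continuations into a single opaque byte string (the framing here is ad hoc length-prefixing; a real cursor is free to use any serialization it likes):

```java
import java.nio.charset.StandardCharsets;

// Pack each child's opaque bytes with a length prefix on the way out,
// and split them back apart on resume.
public class ContinuationDemo {
    public static byte[] combine(byte[] left, byte[] right) {
        byte[] out = new byte[4 + left.length + right.length];
        out[0] = (byte) (left.length >>> 24);   // 4-byte big-endian length prefix
        out[1] = (byte) (left.length >>> 16);
        out[2] = (byte) (left.length >>> 8);
        out[3] = (byte) left.length;
        System.arraycopy(left, 0, out, 4, left.length);
        System.arraycopy(right, 0, out, 4 + left.length, right.length);
        return out;
    }

    public static byte[][] split(byte[] combined) {
        int n = ((combined[0] & 0xFF) << 24) | ((combined[1] & 0xFF) << 16)
              | ((combined[2] & 0xFF) << 8) | (combined[3] & 0xFF);
        byte[] left = new byte[n];
        byte[] right = new byte[combined.length - 4 - n];
        System.arraycopy(combined, 4, left, 0, n);
        System.arraycopy(combined, 4 + n, right, 0, right.length);
        return new byte[][] {left, right};
    }

    public static void main(String[] args) {
        byte[] combined = combine("posA".getBytes(StandardCharsets.UTF_8),
                                  "posB".getBytes(StandardCharsets.UTF_8));
        byte[][] parts = split(combined);
        System.out.println(new String(parts[0], StandardCharsets.UTF_8)); // posA
        System.out.println(new String(parts[1], StandardCharsets.UTF_8)); // posB
    }
}
```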
Query planner rules
Making the query planner extensible is being tracked as issue #274.