Class Config
java.lang.Object
com.apple.foundationdb.async.hnsw.Config
Configuration settings for a
HNSW.-
Nested Class Summary
Nested Classes -
Field Summary
FieldsModifier and TypeFieldDescriptionstatic final booleanstatic final intstatic final booleanstatic final booleanstatic final intstatic final intstatic final intstatic final doublestatic final intstatic final intstatic final Metricstatic final intstatic final doublestatic final intstatic final booleanstatic final boolean -
Method Summary
Modifier and TypeMethodDescriptionbooleanintMaximum size of the search queues (one independent queue per layer) that are used during the insertion of a new node.intgetM()This attribute (namedMby the HNSW paper) is the connectivity value for all nodes stored on any layer.doubleIf sampling is necessary (currently iffisUseRaBitQ()istrue), this attribute represents the probability of theStorageAdapter.SUBSPACE_PREFIX_SAMPLESsubspace to be further aggregated (rolled-up) when a new vector is inserted.intMaximum number of concurrent neighborhood fetches during modification operations when the neighbors are pruned.intMaximum number of concurrent node fetches during search and modification operations.The metric that is used to determine distances between vectors.intgetMMax()This attribute (namedM_maxby the HNSW paper) is the maximum connectivity value for nodes stored on a layer greater than0.intgetMMax0()This attribute (namedM_max0by the HNSW paper) is the maximum connectivity value for nodes stored on layer0.intThe number of dimensions used.intNumber of bits per dimensions iffisUseRaBitQ()is set totrue, ignored otherwise.doubleIf sampling is necessary (currently iffisUseRaBitQ()istrue), this attribute represents the probability of a vector being inserted to also be written into theStorageAdapter.SUBSPACE_PREFIX_SAMPLESsubspace.intIf sampling is necessary (currently iffisUseRaBitQ()istrue), this attribute represents the threshold (being a number of vectors) that when reached causes the stats maintenance logic to compute the actual statistics (currently the centroid of the vectors that have been inserted to far).inthashCode()booleanIndicator that iftruecauses the insert logic of the HNSW to be seeded using a hash of the primary key of the record that is inserted.booleanIndicator to signal if, during the insertion of a node, the set of nearest neighbors of that node is to be extended by the actual neighbors of those neighbors to form a set of candidates that the new node may be connected to during the insert operation.booleanIndicator to signal if, during the insertion of a node, candidates that have been discarded due to not satisfying the select-neighbor heuristic may get added back in to pad the set of neighbors if the new node would otherwise have too few neighbors (seem).booleanIndicator if all layers except layer0use inlining.booleanIndicator if we should RaBitQ quantization.toString()
-
Field Details
-
DEFAULT_DETERMINISTIC_SEEDING
public static final boolean DEFAULT_DETERMINISTIC_SEEDING- See Also:
-
DEFAULT_METRIC
-
DEFAULT_USE_INLINING
public static final boolean DEFAULT_USE_INLINING- See Also:
-
DEFAULT_M
public static final int DEFAULT_M- See Also:
-
DEFAULT_M_MAX_0
public static final int DEFAULT_M_MAX_0- See Also:
-
DEFAULT_M_MAX
public static final int DEFAULT_M_MAX- See Also:
-
DEFAULT_EF_CONSTRUCTION
public static final int DEFAULT_EF_CONSTRUCTION- See Also:
-
DEFAULT_EXTEND_CANDIDATES
public static final boolean DEFAULT_EXTEND_CANDIDATES- See Also:
-
DEFAULT_KEEP_PRUNED_CONNECTIONS
public static final boolean DEFAULT_KEEP_PRUNED_CONNECTIONS- See Also:
-
DEFAULT_SAMPLE_VECTOR_STATS_PROBABILITY
public static final double DEFAULT_SAMPLE_VECTOR_STATS_PROBABILITY- See Also:
-
DEFAULT_MAINTAIN_STATS_PROBABILITY
public static final double DEFAULT_MAINTAIN_STATS_PROBABILITY- See Also:
-
DEFAULT_STATS_THRESHOLD
public static final int DEFAULT_STATS_THRESHOLD- See Also:
-
DEFAULT_USE_RABITQ
public static final boolean DEFAULT_USE_RABITQ- See Also:
-
DEFAULT_RABITQ_NUM_EX_BITS
public static final int DEFAULT_RABITQ_NUM_EX_BITS- See Also:
-
DEFAULT_MAX_NUM_CONCURRENT_NODE_FETCHES
public static final int DEFAULT_MAX_NUM_CONCURRENT_NODE_FETCHES- See Also:
-
DEFAULT_MAX_NUM_CONCURRENT_NEIGHBOR_FETCHES
public static final int DEFAULT_MAX_NUM_CONCURRENT_NEIGHBOR_FETCHES- See Also:
-
-
Method Details
-
isDeterministicSeeding
public boolean isDeterministicSeeding()Indicator that iftruecauses the insert logic of the HNSW to be seeded using a hash of the primary key of the record that is inserted. That can be useful for testing. IfisDeterministicSeedingisfalse, we useSystem.nanoTime()for seeding. -
getMetric
The metric that is used to determine distances between vectors. -
getNumDimensions
public int getNumDimensions()The number of dimensions used. All vectors must have exactly this number of dimensions. -
isUseInlining
public boolean isUseInlining()Indicator if all layers except layer0use inlining. One entire layer is fully managed using either the compact or the inlining layout. If inlining is used, each node is persisted as a key/value pair per neighbor which includes the vectors of the neighbors but not the vector for itself. If inlining is not used, and therefore the compact layout is used instead, each node is persisted as exactly one key/value pair per node which stores its own vector but specifically excludes the vectors of the neighbors.If a layer uses the compact storage layout, each vector is stored with the node and therefore is stored exactly once. During a nearest neighbor search, a fetch of the neighborhood of a node incurs a fetch (get) for each of the neighbors of that node.
If a layer uses the inlining storage layout, a vector of a node is stored with the neighboring information of an adjacent node pointing to this node and is therefore is stored multiple times (once per neighbor). During a nearest neighbor search, however, the neighboring vectors of a vector can all be fetched in one range scan.
Choosing which storage format is right for the use case depends on some factors:
-
Inlining leaves uses more storage in the database as vectors are stored multiple times. However, since
inlining is only supported for layers greater than
0, the overhead of storing more data should be acceptable in most use cases. -
Inlining should outperform the compact storage model during search. Tests have shown a latency
improvement of searches of roughly 5% - 10%. Insert performance is slightly decreased due to higher
number of bytes that need to be written. Note that this overhead can be mitigated by using
isUseRaBitQ()to drastically reduce the sizes of vectors on disk.
-
Inlining leaves uses more storage in the database as vectors are stored multiple times. However, since
inlining is only supported for layers greater than
-
getM
public int getM()This attribute (namedMby the HNSW paper) is the connectivity value for all nodes stored on any layer. While by no means enforced or even enforceable, we strive to create and maintain exactlymneighbors for a node. Due to insert/delete operations it is possible that the actual number of neighbors a node references is not exactlymat any given time. -
getMMax
public int getMMax()This attribute (namedM_maxby the HNSW paper) is the maximum connectivity value for nodes stored on a layer greater than0. We will never create more thatmMaxneighbors for a node. That means that we even prune the neighbors of a node if the actual number of neighbors would otherwise exceedmMax. Note that this attribute must be greater than or equal tom. -
getMMax0
public int getMMax0()This attribute (namedM_max0by the HNSW paper) is the maximum connectivity value for nodes stored on layer0. We will never create more thatmMax0neighbors for a node that is stored on that layer. That means that we even prune the neighbors of a node if the actual number of neighbors would otherwise exceedmMax0. Note that this attribute must be greater than or equal tomMax. -
getEfConstruction
public int getEfConstruction()Maximum size of the search queues (one independent queue per layer) that are used during the insertion of a new node. IfefConstructionis set to1, the search naturally follows a greedy approach (monotonous descent), whereas a high number forefConstructionallows for a more nuanced search that can tolerate (false) local minima. -
isExtendCandidates
public boolean isExtendCandidates()Indicator to signal if, during the insertion of a node, the set of nearest neighbors of that node is to be extended by the actual neighbors of those neighbors to form a set of candidates that the new node may be connected to during the insert operation. -
isKeepPrunedConnections
public boolean isKeepPrunedConnections()Indicator to signal if, during the insertion of a node, candidates that have been discarded due to not satisfying the select-neighbor heuristic may get added back in to pad the set of neighbors if the new node would otherwise have too few neighbors (seem). -
getSampleVectorStatsProbability
public double getSampleVectorStatsProbability()If sampling is necessary (currently iffisUseRaBitQ()istrue), this attribute represents the probability of a vector being inserted to also be written into theStorageAdapter.SUBSPACE_PREFIX_SAMPLESsubspace. The vectors in that subspace are continuously aggregated until a totalstatsThresholdhas been reached. -
getMaintainStatsProbability
public double getMaintainStatsProbability()If sampling is necessary (currently iffisUseRaBitQ()istrue), this attribute represents the probability of theStorageAdapter.SUBSPACE_PREFIX_SAMPLESsubspace to be further aggregated (rolled-up) when a new vector is inserted. The vectors in that subspace are continuously aggregated until a totalstatsThresholdhas been reached. -
getStatsThreshold
public int getStatsThreshold()If sampling is necessary (currently iffisUseRaBitQ()istrue), this attribute represents the threshold (being a number of vectors) that when reached causes the stats maintenance logic to compute the actual statistics (currently the centroid of the vectors that have been inserted to far). -
isUseRaBitQ
public boolean isUseRaBitQ()Indicator if we should RaBitQ quantization. SeeRaBitQuantizerfor more details. -
getRaBitQNumExBits
public int getRaBitQNumExBits()Number of bits per dimensions iffisUseRaBitQ()is set totrue, ignored otherwise. If RaBitQ encoding is used, a vector is stored using roughly25 + numDimensions * (numExBits + 1) / 8bytes. -
getMaxNumConcurrentNodeFetches
public int getMaxNumConcurrentNodeFetches()Maximum number of concurrent node fetches during search and modification operations. -
getMaxNumConcurrentNeighborhoodFetches
public int getMaxNumConcurrentNeighborhoodFetches()Maximum number of concurrent neighborhood fetches during modification operations when the neighbors are pruned. -
toBuilder
-
equals
-
hashCode
public int hashCode() -
toString
-