Class EncodedRealVector

java.lang.Object
com.apple.foundationdb.rabitq.EncodedRealVector
All Implemented Interfaces:
RealVector

public class EncodedRealVector extends Object implements RealVector
Wire/storage representation of a RaBitQ-quantized vector.

Each component is encoded into numExBits + 1 bits (one sign bit plus numExBits magnitude "extra bits", following the RaBitQ paper's terminology), tightly bit-packed for storage. Three per-vector calibration constants (fAddEx, fRescaleEx, fErrorEx) accompany the integer codes; they are produced by RaBitQuantizer during encoding and consumed by RaBitDistanceEstimator during distance evaluation. The encoded form is much smaller than the equivalent DoubleRealVector: the packed codes take (numExBits + 1) / 64 of the bytes of the double components, plus a fixed 25 bytes for the type tag and the three calibration doubles. It still supports fast distance estimation against RaBitQ-encoded queries.

This class implements RealVector so it composes with the rest of the linear-algebra surface, but it is fundamentally an opaque encoded blob, not a dense numeric vector. Two consequences are worth knowing:

  • getData() lazily reconstructs an approximate dense double[] from the codes plus calibration constants (see computeData()). This is a best-effort dequantization, not the original vector — round-tripping through encoded form is lossy by design.
  • withData(double[]) returns a fresh DoubleRealVector, not another EncodedRealVector, because re-encoding requires the quantizer's full per-call state (rotation seed, calibration sweep). As a result, the arithmetic methods inherited from RealVector (add, subtract, multiply, normalize) produce ordinary double vectors and discard the encoding.

The wire format produced by getRawData() is a leading VectorType.RABITQ type byte, the three calibration doubles in big-endian order, then the bit-packed integer codes. fromBytes(byte[], int, int) parses the format back, given the dimensionality and numExBits (which are not stored in the byte array — the caller is expected to know them from surrounding context, typically the quantizer's configuration).
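
To make the size claim concrete, the sketch below computes the serialized size from the layout just described (one type byte, three 8-byte doubles, then the packed codes rounded up to whole bytes) and compares it with the 8 bytes per component of a dense double array. EncodedSizeSketch and its methods are hypothetical helpers for illustration, not part of the library.

```java
/** Hypothetical helper, not part of the library: sizes derived from the documented wire layout. */
final class EncodedSizeSketch {
    /** One type byte plus three 8-byte calibration doubles. */
    static final int HEADER_BYTES = 1 + 3 * Double.BYTES;

    /** Bytes needed for the bit-packed codes: ceil(numDimensions * (numExBits + 1) / 8). */
    static int packedCodeBytes(int numDimensions, int numExBits) {
        long bits = (long) numDimensions * (numExBits + 1);
        return (int) ((bits + 7) / 8);
    }

    /** Total size of the serialized encoded vector. */
    static int encodedBytes(int numDimensions, int numExBits) {
        return HEADER_BYTES + packedCodeBytes(numDimensions, numExBits);
    }

    /** Size of the raw double components alone (any type tag is ignored here). */
    static int denseDoubleBytes(int numDimensions) {
        return numDimensions * Double.BYTES;
    }

    public static void main(String[] args) {
        int numDimensions = 768;
        int numExBits = 4;
        System.out.println("encoded: " + encodedBytes(numDimensions, numExBits) + " bytes");
        System.out.println("dense:   " + denseDoubleBytes(numDimensions) + " bytes");
    }
}
```

For 768 dimensions and numExBits = 4 this gives 505 bytes for the encoded form against 6144 bytes of double components.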

  • Constructor Details

    • EncodedRealVector

      public EncodedRealVector(int numExBits, @Nonnull int[] encoded, double fAddEx, double fRescaleEx, double fErrorEx)
      Constructs an encoded vector from the raw outputs of RaBitQuantizer. The encoded array is stored by reference (no defensive copy); callers must not mutate it after handing it over.
      Parameters:
      numExBits - number of magnitude bits per component (not counting the sign bit)
      encoded - per-dimension integer codes; ownership transfers to this vector
      fAddEx - the additive term (getAddEx())
      fRescaleEx - the multiplicative rescale (getRescaleEx())
      fErrorEx - the per-vector error bound (getErrorEx())
  • Method Details

    • getEncodedData

      @Nonnull public int[] getEncodedData()
      Returns the underlying integer-code array (no copy). Each entry is the signed code for one component, packed into numExBits + 1 bits when serialized. Callers must not mutate the returned array.
      Returns:
      the per-dimension code array
    • getAddEx

      public double getAddEx()
      Returns the per-vector additive term used during distance estimation. See fAddEx.
    • getRescaleEx

      public double getRescaleEx()
      Returns the per-vector multiplicative rescale used during distance estimation. See fRescaleEx.
    • getErrorEx

      public double getErrorEx()
      Returns the per-vector error bound used during distance estimation and dequantization. See fErrorEx.
    • equals

      public final boolean equals(Object o)
      Two encoded vectors compare equal iff they have identical code arrays and identical calibration constants (fAddEx, fRescaleEx, fErrorEx). The dequantized representation is intentionally not consulted — equality is on the encoding, not on the (lossy) reconstruction.
      Overrides:
      equals in class Object
    • hashCode

      public int hashCode()
      Returns the memoized hash code, computed from the code array and the three calibration doubles — consistent with equals(Object).
      Overrides:
      hashCode in class Object
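
      The equals/hashCode contract described above can be sketched as follows. This is an illustration under assumptions, not the library's code: EncodingEqualitySketch is a hypothetical stand-in class, and comparing the doubles with Double.compare (rather than ==) is a choice made for this sketch.

```java
import java.util.Arrays;
import java.util.Objects;

/** Hypothetical stand-in showing equality on the encoding only; not the library's class. */
final class EncodingEqualitySketch {
    private final int[] codes;
    private final double fAddEx;
    private final double fRescaleEx;
    private final double fErrorEx;

    EncodingEqualitySketch(int[] codes, double fAddEx, double fRescaleEx, double fErrorEx) {
        this.codes = codes;
        this.fAddEx = fAddEx;
        this.fRescaleEx = fRescaleEx;
        this.fErrorEx = fErrorEx;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof EncodingEqualitySketch)) {
            return false;
        }
        EncodingEqualitySketch other = (EncodingEqualitySketch) o;
        // Only the codes and calibration constants take part; no dequantized data is consulted.
        return Arrays.equals(codes, other.codes)
                && Double.compare(fAddEx, other.fAddEx) == 0
                && Double.compare(fRescaleEx, other.fRescaleEx) == 0
                && Double.compare(fErrorEx, other.fErrorEx) == 0;
    }

    @Override
    public int hashCode() {
        // Built from the same four inputs as equals, so equal objects hash equally.
        return Objects.hash(Arrays.hashCode(codes), fAddEx, fRescaleEx, fErrorEx);
    }
}
```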
    • computeHashCode

      public int computeHashCode()
      Computes the hash from scratch, backing the memoizing hashCodeSupplier.
      Returns:
      the hash code
    • getNumDimensions

      public int getNumDimensions()
      Description copied from interface: RealVector
      Returns the number of elements in the vector, i.e. the number of dimensions.
      Specified by:
      getNumDimensions in interface RealVector
      Returns:
      the number of dimensions
    • getEncodedComponent

      public int getEncodedComponent(int dimension)
      Returns the raw integer code for the component at the given dimension. Unlike getComponent(int), this does not trigger dequantization, which makes it useful for distance estimators that operate directly on the encoded representation.
      Parameters:
      dimension - the zero-based dimension index
      Returns:
      the integer code for that dimension
      Throws:
      IndexOutOfBoundsException - if dimension is out of range
    • getComponent

      public double getComponent(int dimension)
      Gets the component of this object at the specified dimension.

      The dimension is a zero-based index. For a 3D vector, for example, dimension 0 might correspond to the x-component, 1 to the y-component, and 2 to the z-component. This method provides direct access to the underlying data element.

      Reads from the lazily reconstructed dense form (see getData()); the first call materializes the full double[] via computeData(). If you only need the raw integer code, prefer getEncodedComponent(int), which skips the reconstruction.

      Specified by:
      getComponent in interface RealVector
      Parameters:
      dimension - the zero-based index of the component to retrieve.
      Returns:
      the component at the specified dimension, as a primitive double.
    • getData

      @Nonnull public double[] getData()
      Returns the lazily reconstructed dense form of this encoded vector. The reconstruction is approximate (see computeData() for details) and memoized so repeated calls are cheap.
      Specified by:
      getData in interface RealVector
      Returns:
      the data array of type double[], never null.
    • withData

      @Nonnull public RealVector withData(@Nonnull double[] data)
      Returns a new vector of the same precision and length as the receiver but with the given component data. Implementations decide whether the returned vector aliases data (immutable subtypes typically do; mutable subtypes copy through their existing storage).

      Returns a fresh DoubleRealVector carrying data, not a re-encoded EncodedRealVector, because re-encoding requires the quantizer's per-call state (rotation seed, calibration sweep). This means the inherited arithmetic methods (add, subtract, multiply, normalize) all drop back to ordinary double-precision results.

      Specified by:
      withData in interface RealVector
      Parameters:
      data - the components for the new vector; length must match this vector's dimensionality
      Returns:
      a non-null vector with the given data
    • computeData

      @Nonnull public double[] computeData()
      Reconstructs an approximate dense representation of the original (rotated, residual-form) vector from the integer codes plus the three calibration constants. Backs the memoizing supplier behind getData().

      The math, in summary:

      1. Un-shift the codes by cB = (1 << numExBits) - 0.5 to recover a symmetric-range vector z.
      2. Estimate the per-vector confidence weight ρ from the ratio of fErrorEx to (ε₀-scaled) ||z|| * fRescaleEx, clamped to [0, 1].
      3. Return z scaled by -0.5 * fRescaleEx * ρ.
      The reconstruction is lossy by design — RaBitQ trades reconstruction fidelity for compact storage and fast distance estimation.
      Returns:
      the reconstructed dense components
      Throws:
      com.google.common.base.VerifyException - if the denominator that scales the error estimate is zero (a degenerate parameter combination that should never occur for a well-formed encoding)
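
      Steps 1 and 3 of the summary can be sketched as below. Step 2 is not spelled out in enough detail here to reproduce, so ρ is taken as a parameter and only clamped to [0, 1]. DequantizeSketch is a hypothetical name, and reading "un-shift" as subtracting cB from each code is an assumption of this sketch rather than something the summary states.

```java
/** Hypothetical sketch of steps 1 and 3 of the reconstruction; not the library's implementation. */
final class DequantizeSketch {
    /**
     * @param numExBits  number of magnitude bits per component
     * @param codes      per-dimension integer codes
     * @param fRescaleEx the multiplicative rescale constant
     * @param rho        confidence weight from step 2, supplied by the caller in this sketch
     */
    static double[] reconstruct(int numExBits, int[] codes, double fRescaleEx, double rho) {
        // Step 1: center the codes around zero (assumed to mean subtracting cB).
        double cB = (1 << numExBits) - 0.5;
        // Step 2 (partial): only the clamp to [0, 1] is shown here.
        double clampedRho = Math.max(0.0, Math.min(1.0, rho));
        // Step 3: scale z by -0.5 * fRescaleEx * rho.
        double scale = -0.5 * fRescaleEx * clampedRho;
        double[] out = new double[codes.length];
        for (int i = 0; i < codes.length; i++) {
            double z = codes[i] - cB;
            out[i] = scale * z;
        }
        return out;
    }
}
```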
    • getRawData

      @Nonnull public byte[] getRawData()
      Returns the memoized wire-format serialization of this vector (see computeRawData() for the format).
      Specified by:
      getRawData in interface RealVector
      Returns:
      a non-null byte array containing the raw data.
    • computeRawData

      @Nonnull protected byte[] computeRawData()
      Serializes this encoded vector into a byte array in the RaBitQ wire format. Layout (all multi-byte values big-endian):
      1. 1 byte — VectorType.RABITQ ordinal as a type tag.
      2. 8 bytes — fAddEx.
      3. 8 bytes — fRescaleEx.
      4. 8 bytes — fErrorEx.
      5. ceil(numDimensions * (numExBits + 1) / 8) bytes — the per-dimension integer codes, tightly bit-packed by packEncodedComponents(int, ByteBuffer).
      Note that the dimensionality and numExBits are not stored in the byte stream; fromBytes(byte[], int, int) requires both as explicit arguments.
      Returns:
      the serialized form; never null
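
      A minimal sketch of the layout above, using only java.nio. Several details are assumptions made for illustration: the class name WireWriteSketch, the placeholder TYPE_TAG value (the real tag comes from VectorType.RABITQ), codes being non-negative values that fit in numExBits + 1 bits, and the most-significant-bit-first packing order, which the layout description does not specify.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch of the documented layout; not the library's implementation. */
final class WireWriteSketch {
    /** Placeholder tag value; the real one is derived from VectorType.RABITQ. */
    static final byte TYPE_TAG = 0x7F;

    static byte[] serialize(int numExBits, int[] codes,
                            double fAddEx, double fRescaleEx, double fErrorEx) {
        int bitsPerCode = numExBits + 1;
        int packedBytes = (int) (((long) codes.length * bitsPerCode + 7) / 8);

        // Pack each code into bitsPerCode bits, most significant bit first (assumed order).
        byte[] packed = new byte[packedBytes];
        int bitIndex = 0;
        for (int code : codes) {
            for (int b = bitsPerCode - 1; b >= 0; b--) {
                if (((code >>> b) & 1) != 0) {
                    packed[bitIndex >>> 3] |= (byte) (0x80 >>> (bitIndex & 7));
                }
                bitIndex++;
            }
        }

        // ByteBuffer is big-endian by default, matching the documented byte order.
        ByteBuffer buffer = ByteBuffer.allocate(1 + 3 * Double.BYTES + packedBytes);
        buffer.put(TYPE_TAG);
        buffer.putDouble(fAddEx);
        buffer.putDouble(fRescaleEx);
        buffer.putDouble(fErrorEx);
        buffer.put(packed);
        return buffer.array();
    }
}
```

      With numExBits = 0 and codes {1, 0, 1}, the packed section is the single byte 0xA0 and the whole array is 26 bytes long.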
    • toHalfRealVector

      @Nonnull public HalfRealVector toHalfRealVector()
      Converts this object into a RealVector of Half precision floating-point numbers.

      As this is an abstract method, implementing classes are responsible for defining the specific conversion logic from their internal representation to a RealVector using Half objects to serialize and deserialize the vector. If this object already is a HalfRealVector this method should return this.

      Computed from the lazily reconstructed dense form (see getData()); the resulting HalfRealVector is memoized.

      Specified by:
      toHalfRealVector in interface RealVector
      Returns:
      a non-null HalfRealVector containing the Half precision floating-point representation of this object.
    • toFloatRealVector

      @Nonnull public FloatRealVector toFloatRealVector()
      Converts this object into a RealVector of single precision floating-point numbers.

      As this is an abstract method, implementing classes are responsible for defining the specific conversion logic from their internal representation to a RealVector using floating point numbers to serialize and deserialize the vector. If this object already is a FloatRealVector this method should return this.

      Computed from the lazily reconstructed dense form (see getData()); the resulting FloatRealVector is memoized.

      Specified by:
      toFloatRealVector in interface RealVector
      Returns:
      a non-null FloatRealVector containing the single precision floating-point representation of this object.
    • toDoubleRealVector

      @Nonnull public DoubleRealVector toDoubleRealVector()
      Converts this vector into a DoubleRealVector.

      This method provides a way to obtain a double-precision floating-point representation of the vector. If the vector is already an instance of DoubleRealVector, this method may return the instance itself. Otherwise, it will create a new DoubleRealVector containing the same elements, which may involve a conversion of the underlying data type.

      Returns a fresh DoubleRealVector carrying the reconstructed dense components. Unlike the half/float conversions, this one is not memoized — the underlying double[] reconstruction is already memoized by getData(), so wrapping it again is cheap.

      Specified by:
      toDoubleRealVector in interface RealVector
      Returns:
      a non-null DoubleRealVector representation of this vector.
    • toImmutable

      @Nonnull public EncodedRealVector toImmutable()
      Returns this — instances of this class are already immutable.
      Specified by:
      toImmutable in interface RealVector
      Returns:
      a non-null immutable vector with the same components as this vector
    • l2SquaredNorm

      public double l2SquaredNorm()
      Returns the squared L2 norm Σ_i this[i]^2. Implementations typically memoize this since the value is reused by RealVector.l2Norm() and several distance helpers.

      Computed from the reconstructed dense form (the dot product of the dequantized vector with itself) and memoized.

      Specified by:
      l2SquaredNorm in interface RealVector
      Returns:
      the squared L2 norm of this vector
    • fromBytes

      @Nonnull public static EncodedRealVector fromBytes(@Nonnull byte[] vectorBytes, int numDimensions, int numExBits)
      Deserializes an encoded vector from the wire format produced by computeRawData(). The dimensionality and numExBits are not stored in the byte stream and must be supplied by the caller (typically from the encoder's configuration).
      Parameters:
      vectorBytes - the serialized form; must start with the VectorType.RABITQ type tag
      numDimensions - number of components in the encoded vector
      numExBits - number of magnitude bits per component (sign bit implicit)
      Returns:
      a freshly allocated encoded vector
      Throws:
      com.google.common.base.VerifyException - if the leading type tag is not VectorType.RABITQ
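
      The parsing direction can be sketched in the same spirit. Everything named here is hypothetical: WireParseSketch, its Parsed holder, and the placeholder TYPE_TAG value. The sketch assumes codes are packed most-significant-bit first, and it throws IllegalArgumentException instead of VerifyException so that it needs no dependency beyond the JDK.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch of parsing the documented layout; not the library's implementation. */
final class WireParseSketch {
    /** Placeholder tag value; the real one is derived from VectorType.RABITQ. */
    static final byte TYPE_TAG = 0x7F;

    /** Simple holder for the parsed fields. */
    static final class Parsed {
        final double fAddEx;
        final double fRescaleEx;
        final double fErrorEx;
        final int[] codes;

        Parsed(double fAddEx, double fRescaleEx, double fErrorEx, int[] codes) {
            this.fAddEx = fAddEx;
            this.fRescaleEx = fRescaleEx;
            this.fErrorEx = fErrorEx;
            this.codes = codes;
        }
    }

    static Parsed parse(byte[] bytes, int numDimensions, int numExBits) {
        int bitsPerCode = numExBits + 1;
        int headerBytes = 1 + 3 * Double.BYTES;
        int packedBytes = (int) (((long) numDimensions * bitsPerCode + 7) / 8);
        if (bytes.length < headerBytes + packedBytes) {
            throw new IllegalArgumentException("byte array too short for the given shape");
        }
        ByteBuffer buffer = ByteBuffer.wrap(bytes); // big-endian by default
        if (buffer.get() != TYPE_TAG) {
            throw new IllegalArgumentException("unexpected type tag");
        }
        double fAddEx = buffer.getDouble();
        double fRescaleEx = buffer.getDouble();
        double fErrorEx = buffer.getDouble();

        // Unpack bitsPerCode bits per dimension, most significant bit first (assumed order).
        int[] codes = new int[numDimensions];
        int bitIndex = headerBytes * 8;
        for (int i = 0; i < numDimensions; i++) {
            int code = 0;
            for (int b = 0; b < bitsPerCode; b++) {
                int bit = (bytes[bitIndex >>> 3] >>> (7 - (bitIndex & 7))) & 1;
                code = (code << 1) | bit;
                bitIndex++;
            }
            codes[i] = code;
        }
        return new Parsed(fAddEx, fRescaleEx, fErrorEx, codes);
    }
}
```

      Because the shape is not stored in the bytes, the same array decodes differently under different arguments: a packed byte of 0xA0 yields codes {1, 0, 1} for 3 dimensions with numExBits = 0, and {2, 2} for 2 dimensions with numExBits = 1.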