Class KMeans

java.lang.Object
com.apple.foundationdb.kmeans.KMeans

public final class KMeans extends Object
A restartable Lloyd-style k-means++ implementation intended for LOCAL cluster restructuring (SPFresh-style). Self-contained and stateless: a single static fit entry point runs the whole algorithm and returns a KMeans.Result describing the chosen partitioning.

The algorithm assumes the external metric is ordinary L2, but the k-means core optimizes the standard squared-L2 objective internally:

  • assignment uses squared distance,
  • centroid update uses arithmetic mean,
  • k-means++ initialization weights candidate centroids by squared distance to the nearest already-chosen centroid, and
  • restart selection compares the per-restart geometric SSE.

Optional soft size balancing biases assignment by the running projected cluster size (see KMeans.SizePenalty); set lambda == 0 or sizePenalty == null to disable.

After the main Lloyd loop exits — whether via convergence (changed == 0) or by hitting maxIterationsfit(java.util.SplittableRandom, com.apple.foundationdb.linear.DistanceEstimator, com.apple.foundationdb.util.Lens<V, com.apple.foundationdb.linear.RealVector>, com.apple.foundationdb.util.Lens<C, com.apple.foundationdb.linear.RealVector>, java.util.List<V>, int, int, int, double, com.apple.foundationdb.kmeans.KMeans.SizePenalty) runs one final assignment pass against the latest centroids using the same scoring as the loop. This guarantees that the returned KMeans.Result.assignment(), KMeans.Result.clusterSizes(), and KMeans.Result.distances() are mutually consistent with KMeans.Result.clusterCentroids(), regardless of how the loop terminated.

  • Method Details

    • overflowQuadraticPenalty

      @Nonnull public static KMeans.SizePenalty overflowQuadraticPenalty()
      Returns a KMeans.SizePenalty that penalizes only overflow above the target cluster size, quadratically: (max(0, projected - target))^2 / max(1, target). A reasonable default for soft size balancing.
      Returns:
      a non-null overflow-quadratic penalty function
    • fit

      public static <V, C> KMeans.Result<C> fit(@Nonnull SplittableRandom random, @Nonnull DistanceEstimator distanceEstimator, @Nonnull Lens<V,RealVector> vectorLens, @Nonnull Lens<C,RealVector> centroidLens, @Nonnull List<V> vectors, int k, int maxIterations, int maxRestarts, double lambda, @Nullable KMeans.SizePenalty sizePenalty)
      Fits k clusters using a restartable Lloyd-style squared-L2 k-means with k-means++ initialization. Returns the best (lowest geometric SSE) result across all restarts.

      Per restart, the algorithm runs at most maxIterations Lloyd iterations of (assignment, centroid update). Empty clusters detected during the centroid update are reseeded with the data point currently farthest from any centroid. After the main loop, a final assignment pass against the produced centroids is run so the returned partitioning is self-consistent.

      When lambda > 0 the assignment step becomes order-dependent (each vector's size-balance penalty is evaluated against the running projected sizes of all clusters assigned before it). To stop a fixed input ordering from biasing the final partitioning, fit reshuffles the assignment order via Fisher–Yates at the top of every iteration whenever lambda > 0. With lambda == 0 the assignment is order-independent and shuffling is skipped as wasted work.

      Type Parameters:
      V - caller's input vector representation
      C - caller's centroid representation
      Parameters:
      random - the random source; consumed by k-means++ initialization, restart seeding, and per-iteration shuffling (when active)
      distanceEstimator - the distance estimator. Its metric must be either EUCLIDEAN_METRIC or COSINE_METRIC; other metrics are rejected with an UnsupportedOperationException
      vectorLens - lens that extracts a RealVector from each input element
      centroidLens - lens that wraps a RealVector as the caller's centroid type
      vectors - the input points to cluster; must contain at least k elements
      k - number of clusters; must be at least 2
      maxIterations - maximum Lloyd iterations per restart; must be at least 1
      maxRestarts - number of additional restarts beyond the first run; must be non-negative. The total number of runs is maxRestarts + 1; the run with the lowest geometric SSE wins
      lambda - strength of size-balancing bias added to the assignment score; must be non-negative. 0 disables balancing
      sizePenalty - penalty function applied to projected cluster sizes during assignment; null disables balancing. Ignored when lambda == 0
      Returns:
      the best partitioning found across all restarts, ranked by geometric SSE