com.apple.foundationdb.kmeans.KMeans

public final class KMeans extends Object

A restartable Lloyd-style k-means++ implementation intended for LOCAL cluster restructuring (SPFresh-style). Self-contained and stateless: a single static fit entry point runs the whole algorithm and returns a KMeans.Result describing the chosen partitioning.

The algorithm assumes the external metric is ordinary L2, but the k-means core optimizes the standard squared-L2 objective internally:

assignment uses squared distance,
centroid update uses arithmetic mean,
k-means++ initialization weights candidate centroids by squared distance to the nearest already-chosen centroid, and
restart selection compares the per-restart geometric SSE.

Optional soft size balancing biases assignment by the running projected cluster size (see KMeans.SizePenalty); set lambda == 0 or sizePenalty == null to disable.

After the main Lloyd loop exits — whether via convergence (changed == 0) or by hitting maxIterations — fit(java.util.SplittableRandom, com.apple.foundationdb.linear.DistanceEstimator, com.apple.foundationdb.util.Lens<V, com.apple.foundationdb.linear.RealVector>, com.apple.foundationdb.util.Lens<C, com.apple.foundationdb.linear.RealVector>, java.util.List<V>, int, int, int, double, com.apple.foundationdb.kmeans.KMeans.SizePenalty) runs one final assignment pass against the latest centroids using the same scoring as the loop. This guarantees that the returned KMeans.Result.assignment(), KMeans.Result.clusterSizes(), and KMeans.Result.distances() are mutually consistent with KMeans.Result.clusterCentroids(), regardless of how the loop terminated.

Nested Class Summary

Nested Classes

Modifier and Type

Class

Description

static final record

KMeans.Result<C>

Outcome of a fit call: the chosen centroids, the per-vector cluster assignment, the per-cluster sizes, the per-vector objective contributions, and the total objective.

static interface

KMeans.SizePenalty

Soft size-balancing penalty hook.
Method Summary

Modifier and Type

Method

Description

static <V, C> KMeans.Result<C>

fit(SplittableRandom random, DistanceEstimator distanceEstimator, Lens<V,RealVector> vectorLens, Lens<C,RealVector> centroidLens, List<V> vectors, int k, int maxIterations, int maxRestarts, double lambda, KMeans.SizePenalty sizePenalty)

Fits k clusters using a restartable Lloyd-style squared-L2 k-means with k-means++ initialization.

static KMeans.SizePenalty

overflowQuadraticPenalty()

Returns a KMeans.SizePenalty that penalizes only overflow above the target cluster size, quadratically: (max(0, projected - target))^2 / max(1, target).

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Method Details
- overflowQuadraticPenalty
  
  @Nonnull public static KMeans.SizePenalty overflowQuadraticPenalty()
  
  Returns a KMeans.SizePenalty that penalizes only overflow above the target cluster size, quadratically: (max(0, projected - target))^2 / max(1, target). A reasonable default for soft size balancing.
  
  Returns:
  
  a non-null overflow-quadratic penalty function
- fit
  
  public static <V, C> KMeans.Result<C> fit(@Nonnull SplittableRandom random, @Nonnull DistanceEstimator distanceEstimator, @Nonnull Lens<V,RealVector> vectorLens, @Nonnull Lens<C,RealVector> centroidLens, @Nonnull List<V> vectors, int k, int maxIterations, int maxRestarts, double lambda, @Nullable KMeans.SizePenalty sizePenalty)
  
  Fits k clusters using a restartable Lloyd-style squared-L2 k-means with k-means++ initialization. Returns the best (lowest geometric SSE) result across all restarts.
  Per restart, the algorithm runs at most maxIterations Lloyd iterations of (assignment, centroid update). Empty clusters detected during the centroid update are reseeded with the data point currently farthest from any centroid. After the main loop, a final assignment pass against the produced centroids is run so the returned partitioning is self-consistent.
  When lambda > 0 the assignment step becomes order-dependent (each vector's size-balance penalty is evaluated against the running projected sizes of all clusters assigned before it). To stop a fixed input ordering from biasing the final partitioning, fit reshuffles the assignment order via Fisher–Yates at the top of every iteration whenever lambda > 0. With lambda == 0 the assignment is order-independent and shuffling is skipped as wasted work.
  
  Type Parameters:
  
  V - caller's input vector representation
  
  C - caller's centroid representation
  
  Parameters:
  
  random - the random source; consumed by k-means++ initialization, restart seeding, and per-iteration shuffling (when active)
  
  distanceEstimator - the distance estimator. Its metric must be either EUCLIDEAN_METRIC or COSINE_METRIC; other metrics are rejected with an UnsupportedOperationException
  
  vectorLens - lens that extracts a RealVector from each input element
  
  centroidLens - lens that wraps a RealVector as the caller's centroid type
  
  vectors - the input points to cluster; must contain at least k elements
  
  k - number of clusters; must be at least 1. k == 1 is handled in closed form (the single centroid is the mean of all vectors), bypassing iteration and restarts
  
  maxIterations - maximum Lloyd iterations per restart; must be at least 1
  
  maxRestarts - number of additional restarts beyond the first run; must be non-negative. The total number of runs is maxRestarts + 1; the run with the lowest geometric SSE wins
  
  lambda - strength of size-balancing bias added to the assignment score; must be non-negative. 0 disables balancing
  
  sizePenalty - penalty function applied to projected cluster sizes during assignment; null disables balancing. Ignored when lambda == 0
  
  Returns:
  
  the best partitioning found across all restarts, ranked by geometric SSE

Class KMeans

Nested Class Summary

Method Summary

Methods inherited from class java.lang.Object

Method Details

overflowQuadraticPenalty

fit