Class KMeans
fit entry point
runs the whole algorithm and returns a KMeans.Result describing the chosen partitioning.
The algorithm assumes the external metric is ordinary L2, but the k-means core optimizes the standard squared-L2 objective internally:
- assignment uses squared distance,
- centroid update uses arithmetic mean,
- k-means++ initialization weights candidate centroids by squared distance to the nearest already-chosen centroid, and
- restart selection compares the per-restart geometric SSE.
Optional soft size balancing biases assignment by the running projected cluster size
(see KMeans.SizePenalty); set lambda == 0 or sizePenalty == null to disable.
After the main Lloyd loop exits — whether via convergence (changed == 0) or by hitting
maxIterations — fit(java.util.SplittableRandom, com.apple.foundationdb.linear.DistanceEstimator, com.apple.foundationdb.util.Lens<V, com.apple.foundationdb.linear.RealVector>, com.apple.foundationdb.util.Lens<C, com.apple.foundationdb.linear.RealVector>, java.util.List<V>, int, int, int, double, com.apple.foundationdb.kmeans.KMeans.SizePenalty) runs one final assignment pass against the latest
centroids using the same scoring as the loop. This guarantees that the returned
KMeans.Result.assignment(), KMeans.Result.clusterSizes(), and
KMeans.Result.distances() are mutually consistent with KMeans.Result.clusterCentroids(),
regardless of how the loop terminated.
-
Nested Class Summary
Nested ClassesModifier and TypeClassDescriptionstatic final recordOutcome of afitcall: the chosen centroids, the per-vector cluster assignment, the per-cluster sizes, the per-vector objective contributions, and the total objective.static interfaceSoft size-balancing penalty hook. -
Method Summary
Modifier and TypeMethodDescriptionstatic <V,C> KMeans.Result <C> fit(SplittableRandom random, DistanceEstimator distanceEstimator, Lens<V, RealVector> vectorLens, Lens<C, RealVector> centroidLens, List<V> vectors, int k, int maxIterations, int maxRestarts, double lambda, KMeans.SizePenalty sizePenalty) Fitskclusters using a restartable Lloyd-style squared-L2 k-means with k-means++ initialization.static KMeans.SizePenaltyReturns aKMeans.SizePenaltythat penalizes only overflow above the target cluster size, quadratically:(max(0, projected - target))^2 / max(1, target).
-
Method Details
-
overflowQuadraticPenalty
Returns aKMeans.SizePenaltythat penalizes only overflow above the target cluster size, quadratically:(max(0, projected - target))^2 / max(1, target). A reasonable default for soft size balancing.- Returns:
- a non-null overflow-quadratic penalty function
-
fit
public static <V,C> KMeans.Result<C> fit(@Nonnull SplittableRandom random, @Nonnull DistanceEstimator distanceEstimator, @Nonnull Lens<V, RealVector> vectorLens, @Nonnull Lens<C, RealVector> centroidLens, @Nonnull List<V> vectors, int k, int maxIterations, int maxRestarts, double lambda, @Nullable KMeans.SizePenalty sizePenalty) Fitskclusters using a restartable Lloyd-style squared-L2 k-means with k-means++ initialization. Returns the best (lowest geometric SSE) result across all restarts.Per restart, the algorithm runs at most
maxIterationsLloyd iterations of (assignment, centroid update). Empty clusters detected during the centroid update are reseeded with the data point currently farthest from any centroid. After the main loop, a final assignment pass against the produced centroids is run so the returned partitioning is self-consistent.When
lambda > 0the assignment step becomes order-dependent (each vector's size-balance penalty is evaluated against the running projected sizes of all clusters assigned before it). To stop a fixed input ordering from biasing the final partitioning,fitreshuffles the assignment order via Fisher–Yates at the top of every iteration wheneverlambda > 0. Withlambda == 0the assignment is order-independent and shuffling is skipped as wasted work.- Type Parameters:
V- caller's input vector representationC- caller's centroid representation- Parameters:
random- the random source; consumed by k-means++ initialization, restart seeding, and per-iteration shuffling (when active)distanceEstimator- the distance estimator. Its metric must be eitherEUCLIDEAN_METRICorCOSINE_METRIC; other metrics are rejected with anUnsupportedOperationExceptionvectorLens- lens that extracts aRealVectorfrom each input elementcentroidLens- lens that wraps aRealVectoras the caller's centroid typevectors- the input points to cluster; must contain at leastkelementsk- number of clusters; must be at least 2maxIterations- maximum Lloyd iterations per restart; must be at least 1maxRestarts- number of additional restarts beyond the first run; must be non-negative. The total number of runs ismaxRestarts + 1; the run with the lowest geometric SSE winslambda- strength of size-balancing bias added to the assignment score; must be non-negative.0disables balancingsizePenalty- penalty function applied to projected cluster sizes during assignment;nulldisables balancing. Ignored whenlambda == 0- Returns:
- the best partitioning found across all restarts, ranked by geometric SSE
-