Seeding

Authors: F. Cazals and G. Carriere

Introduction

This package implements seeding methods, i.e. initialization methods for clustering algorithms, such as k-means. These methods provide seeds, namely initial positions for cluster centers. Ideally, such seeds identify the clusters ahead of the usual Lloyd iterations to solve the k-means problem. As shown in the seminal paper [12] on k-means++, a smart seeding strategies actually provides guarantees without Lloyd iterations.

The topic was revisited in a series of papers, including [130] [21] and our own work [35].

This package provides implementations of these recent, methods, which can also be used to provide initial positions for components in mixture models.

Algorithms: selection and reselection

We provide implementations for two kind of methods. Selection methods select seeds using no prior information. Reselection methods select refined seeds using a set of previous seeds as prior information. Reselection methods are meant to be used after a first selection to improve the seeding quality for an added runtime cost. All selection and reselection methods in this package have static runtime as a function of the dataset size, i.e. the methods are not convergence based.

Let denote the distance from a data point to its closest seed, then the quality of a seeding is evaluated as

$\begin{equation}\phi = \sum_{x\in \calX} D(x)^2. \end{equation}$

Selection methods

The seed selection methods provided in this package include:

Random seed selection.

Seeds are obtained by selecting data points uniformly at random.

Minimax seed selection.

The first seed is obtained by selecting a single data point uniformly at random. The following seeds are selected iteratively from the data points. At each step the new seed is the data point $x \in \calX$ with maximal .

Greedy K-means++ seed selection.

Greedy K-means++ is a variant of K-means++, an initialization scheme for K-means described in [12].

K-means++ obtains the first seed by selecting a single data point uniformly at random. The following seeds are selected iteratively from the data points. At each step the new seed is selected randomly from the data points, where the probability of selecting a specific data point $x \in \calX$ is equal to

$\begin{equation}\frac{D(x')^2}{\sum_{x \in \calX} D(x)^2}. \end{equation}$

Greedy K-means++ is a simple improvement on K-means++, in which multiple candidates data points are sampled at each step, and the one candidate that minimizes $\phi$ the most is selected as new seed.

We provide the amount of seed candidates to sample as parameter, such that sampling one candidate reduces the method to "classical" K-means++.

Reselection methods

The seed reselection methods provided in this package include:

Greedy K-means++ seed reselection

Greedy K-means++ reselection follows the reselection scheme described in [35].

The method obtains new seeds from a selection of already selected seeds, by iteratively replacing the previous seeds one by one. At each reselection step , the -th seed is removed and a replacement seed is selected randomly from a set of candidate data points as in Greedy K-means++ selection.

Greedy K-means++ seed reselection using centers of mass SSE.

Greedy K-means++ reselection with centers of mass SSE is a variant of greedy K-means++ reselection, differing on the seed selection metric from the pool of candidates.

Let $D_C{x}$ denote the distance from a data point to the center of mass of all data points sharing the same closest seed. Then in the variant, the replacement seed is selected from the set of candidate data points as the candidate that minimized $\phi_C = \sum_{x\in \calX} D_C(x)^2$ the most.

Implementation and functionalities

Selection methods

The package provides three functor classes for implementing selectors. The Cost type should be a functor class implementing the squared distance to be used. For example, in Euclidean K-means the Cost functor should output the squared Euclidean distance between two points. The three classes are :

SBL::GT::T_Seed_selector_random<PointType, VectorType> implements random seed selection.
SBL::GT::T_Seed_selector_minimax<PointType, VectorType, Cost> implements minimax seed selection.
SBL::GT::T_Seed_selector_k_means_plus_plus<PointType, VectorType, Cost> implements greedy K-means++ seed selection. SBL::GT::T_Seed_selector_k_means_plus_plus The operator () function of this functor admits an additional parameter n_local_trials, corresponding to the amount of candidate data points so sample at each selection step

In regards to seeding quality, using SBL::GT::T_Seed_selector_k_means_plus_plus with n_local_trials set to is preferable.

Reselection methods

The package provides the functor class SBL::GT::T_Seed_reselector<PointType, VectorType, Cost, Candidate_metric_calculator> to implement reselectors. The operator () function of this functor admits an additional parameter n_local_trials as in the greedy K-means++ selector functor. The package also provides two functor classes to be used as template parameter for Candidate_metric_calculator, implementing the different reselection methods:

SBL::GT::T_seed_inertia_calculator<PointType, VectorType, Cost> as candidate metric calculator implements Greedy K-means++ seed reselection
SBL::GT::T_com_inertia_calculator<PointType, VectorType, Cost, Center_of_mass> as candidate metric calculator implements Greedy K-means++ seed reselection using centers of mass SSE. Center_of_mass should be any of the center of mass classes provided in SBL::GT::Cluster_engines or any class providing the same interface.

In regards to seeding quality, using SBL::GT::T_com_inertia_calculator as Candidate_metric_calculator, and setting n_local_trials to is preferable [35].

Models

The functor classes defined in this package are templated to adapt to different contexts. In the following we describe the template parameters shared by these classes :

PointType : Should be a data object representing a single point such that the dataset is a set of PointType objects. This parameter allows both for using different libraries (Eigen, CGAL...), or handling different spaces (periodic, spherical...).
VectorType : Should be a data object representing a single vector, as in the difference between two PointType.
Cost : Should be a functor class implementing the squared distance to be used. For example in Euclidean K-means the Cost functor should output the squared Euclidean distance between two PointType.
Candidate_metric_calculator : Should be a functor class implementing the metric to be used to rank candidates datapoints during seed replacement. The functor should take a dataset and a set of seeds, and output a value proportional to the quality of the seeding.
Center_of_mass : Should be any of the center of mass classes provided in SBL::GT::Cluster_engines or any class providing the same interface. This template parameter implements calculation for the center of mass of a set of data points of type PointType.

Examples

The (re)selectors defined above are used in the implementation of SBL::GT::T_Cluster_engine_k_means algorithm – package Cluster_engines .