Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.
User Manual

Cluster_spherical

Authors: F. Cazals and A. Commaret and L. Goldenberg

Introduction: the spherical cluster model

Let a parametric cluster model be a cluster model depending on parameters to be estimated. As opposed to clusters defined with respect to points, such as in k-means or k-medoids, parametric cluster models aim at providing insights on the geometry of the point set modeled.

This package presents the first method to compute spherical clusters [cazals2026modeling] , which were introduced in the context of subspace clustering [193] .

As illustrated by the logo of this package, a spherical cluster is defined by a ball, based on

  • inliers: points in the ball incur a null cost;
  • outliers: points outside the ball pay a cost equal to the power distance with respect to the ball.

See also Fig. fig-CM-examples-Iris for spherical clusters obtained on the classical Iris 4D dataset, and the sections below for the precise mathematical definitions.

Spherical cluster models for the three clusters of the classical Iris 4D dataset: (Left) Setosa (Middle) Versicolor (Rigth) Virginica. Spherical cluster models for $\eta=0.7$, and points projected onto the first two principal components of each dataset. Inliers (resp. outliers) presented as blue (resp. orange) dots. Note that outliers in 4D may project near the cluster center in 2D. Each plot also features the center of mass and the projection median.

Spherical clusters as parametric clusters


Cluster identification as an optimization problem. Let $D$ be a set of $n$ points in $\Rd$. We consider a partition of $D$ into $k$ clusters $C_1, \dots, C_k$, with $D_\ell$ the set of points associated to cluster $C_\ell$.

Take $C_\ell$ for $\ell \in \intrange{1}{k}$ and suppose $D_\ell$ is known. Cluster $C_\ell$ is described by the parameter set $\theta_\ell = (\theta_{\ell,1}, ..., \theta_{\ell,r})$ and a function $d_\ell: (x,C_\ell(\theta_\ell)) \mapsto d_\ell(x,C_\ell(\theta_\ell))$, that is some distance from a point to the cluster. We call the description of $C_\ell$ by the function $d_\ell$ a parametric cluster model.

Cluster optimization is concerned with the identification of the optimal parameters in the following sense:


Cluster optimization. Let $C_\ell$ be a parametric cluster. Cluster optimization is the optimization problem seeking the cluster parameters minimizing the dispersion

\begin{equation}\min_{\theta_\ell} \kmfunc, \text{ with } \kmfunc = \sum_{x \in D_\ell} d_\ell(x,C_\ell(\theta_\ell))^2.
\end{equation}


Spherical clusters. Recall that the unbiased variance estimate for distances within cluster $D_\ell$ of center $c_\ell$ satisfies

\begin{equation}\stdevdist* =  \frac{1}{n-1}\sum_{x_i\in D_\ell} \vvnorm{x_i - c_\ell}^2
\end{equation}

To define the spherical cluster model [193] , also recall that the power of a point $x$ with respect to a sphere $S(c,r)$ is defined by $\powerps{x}{S} = \vvnorm{x-c}^2 - r^2$.

We define:

(Spherical cluster) Let $\eta \in ]0,1[$ be is a hyperparameter, and let $c_\ell$ be apoint called the cluster center. Given the set $D_\ell$, the distance function associated to the spherical cluster $C_\ell(c_\ell)$ reads as

\begin{equation}d(x,C_\ell(c_\ell))^2
:=  \max \left(0, \vvnorm{x-c_\ell}^2 - \eta \stdevdist*  \right)  = \max \left(0, \powerps{x}{S(c_\ell, \sqrt{\eta} \stdevdist)}\right).
\end{equation}


The rationale of this definition is that one wishes to find the center covering as many inliers as possible while minimizing the cost of outliers–points outside the spherical cluster.

Thus, the spherical cluster model depends both on inliers and outliers.

Spherical clusters are well defined

For a fixed data set $D_\ell$, the previous optimization problem requires minimizing the functional

\begin{equation}\Feta{c}  := \sum_{x_i \in D_\ell} \max \left(0, \| x_i-c \|^2 - \frac{\eta}{n-1}\sum_{x_j\in D_\ell} \|x_j - c\|^2  \right).
\end{equation}

To study the previous function, for each $x_i \in D_\ell$, let

\begin{equation}\ftildeeta{c}  := \| x_i-c \|^2 - \eta \frac{1}{n-1}\sum_{x_j\in D_\ell} \|x_j - c\|^2.
\end{equation}

so that one gets

\begin{equation}\Feta{c} = \sum_{x_i \in D_\ell} \max(0, f_{\eta,x_i}(c)).
\end{equation}

The following is proved [cazals2026modeling] :


Theorem. (Strict convexity of $F$). Let $D_\ell$ be a set of $n$ points of $\Rd$, with at least two distinct points, and assume that $0 < \eta < 1 - \frac{1}{n}$. Then the associated $F$ map is $2(1 - \etaprime)$-strongly convex on $\Rd$, and is a fortiori strictly convex. Its minimization problem admits exactly one solution in $\Rd$.

While this theorem establishes the uniqueness of the SC center, the main difficulty to compute it is that the functional to be minimized is not smooth, see Fig. fig-opt-codim.

Minima of $F$ on cells of various dimensions. Data point in orange, minima in red. Selected level sets (in dotted-lines) are also reported.

Algorithms

Exact solver

An exact solver for spherical clusters, denoted $\text{\algoexact}$, is developed in [cazals2026modeling]. It actually minimizes a function of the form

\begin{equation}F = \sum_{i} \max(0, \vvnorm{x - c_i}^2 - R^2_i),
\end{equation}

and constructs a finite sequence of points $(x_n)$ by induction. The last point is the optimum of $\Feta$. This sequence is also called the trajectory in the sequel.

This algorithm relies on three sub-routines called $\text{\algtele}$, $\text{\algline}$, $\text{\algsphere}$.

The reader is referred to [cazals2026modeling] for the details.

Heuristics based on BFGS and L-BFGS

While the functional to be optimized in not smooth, it is known that BFGS works well in practice for non-differentiable functions [131].

We challenge $\text{\algoexact}$ with $\text{\algobfgs}$ and $\text{\algolbfgs}$ respectively, using the BFGS and L-BFGS-B solvers provided by scipy.optimize.. Note that the latter uses an approximation of the Hessian–as opposed to a $O(d^2)$ sized matrix.

Genericity and robustness

Our algorithm is implemented in python and numpy. The reader is referred to [cazals2026modeling] for a detailed discussion of numerics and genericity assumptions.

Modeling point clouds with spherical clusters

Spherical cluster: illustrations on a toy 2D dataset. (LEFT) Trajectories from five different starting points, with $\eta = 0.5$ (Line/Sphere descents in blue/orange). (RIGHT) Evolution of the cluster center for nine $\eta$ in arithmetic progression from 0.1 to 0.9.

Statistics of interest

Modeling point clouds with the SC model requires varying $\eta$. To this end, we define:

  • SC square radius. The square radius with respect to which the power distance is computed, that is $R^2 = \eta \hat{\sigma}^2$.
  • Projection plot. The plot of all points (data points, center of mass, SC centers) onto the first two principal directions. Inliers (resp. outliers) are displayed in blue (resp. orange). The number of outliers identified by our cluster model, is denoted $\numoutliersSC$. Similarly $\numoutliersCOM$ stand for the number be outliers defined with respect to a sphere of the same radius centered at the center of mass.
  • Dual plot. Reports $\Feta$ and $R2$ as a function of $\eta$. (NB: $R2$ values are represented negated on this plot.)
  • Stacked barplot. The plot function of $\eta$ counting the number of steps of each type (line descent, sphere descent, teleportation) in $\text{\algoexact}$.
  • Time ratio plots. The plots for $\timeexact / \timebfgs$ and $\timeexact / \timelbfgs$, comparing the running times of $\text{\algoexact}$ against those of \algobfgs and $\text{\algolbfgs}$ respectively.
  • Average outlier cost plots. The plots $\Feta(\optexact)/ \numoutliersSC$ and $\Feta(\optexact)/ \numoutliersCOM$.
  • Outlier ratio plot. The plot $\numoutliersCOM/\numoutliersSC$.

Fitting spherical cluster models

The provided script sbl-sc-model.py implements two modes:

  • Run mode: computes cluster models for a set of input files in txt format
  • Statistics mode: processes all output found in a directory and delivers a compilation-ready latex report with all the aforementioned plots.


Main options. The main options of the script sbl-sc-model.py are as follows:

–idpath string: Input directory path
–ifpaths string: File paths for the clusters to be analyzed – each in matrix format n x d
–odpath string: Output directory path –relative or absolute (default: results)
–view string: View plots during the calculation (default: False)
–verbose string: Verbose level: 0:none 1:main step 2:with intermediate info 3:with full details
–log string: Log file (default: False)
–mode M_MODE string: Mode: run stats run-stats glob visu
–no-par string: Turn parallel calculation off (default: False)
–Fnames string: File containing a list of input filenames
–eta string: List of eta values
–etan string: Num of values equally spaced\; nargs=n or nargs=[a,b,n]
–etag string: Gap to sample eta values\; nargs=gap or nargs=[a,b,gap]
–cells string: Report statistics on the cells traversed during the optim and dump the trajectory file
–rand-init string: Use a random initialization
–bfgs string: Heuristic uses bfgs/BFGS or lbfgs/LBFGS (default: lbfgs)
–dsname string: Dataset name–for plots at the latex report at the whole dataset level (default: NoName)



Single file mode. Fitting a SC model on one or several files, for one or several values of $\eta$ is done as follows:

sbl-sc-model.py --ifpaths iris_setosa.txt iris_versicolor.txt iris_virginica.txt --eta .3 .5 .7  --odpath results-Iris-4D

The calculation delivers a projection plot files, together with an xml file. Here is the output for virginica subset of the Iris-4D dataset, with $\eta=0.5$:

<statistics>
<fname>iris_virginica</fname>
<n>50</n>
<d>4</d>
<algo-xBFGS>BFGS</algo-xBFGS>
<mu>1</mu>
<eta>0.5</eta>
<Exact-converged>True</Exact-converged>
<Exact-sanity>True</Exact-sanity>
<Exact-value>26.24517843427694</Exact-value>
<Exact-R2>0.4482605447117279</Exact-R2>
<Outlier-SESC-ave-cost>0.8466186591702239</Outlier-SESC-ave-cost>
<Outlier-SESC-num>31</Outlier-SESC-num>
<Outlier-COM-ave-cost>0.8207792662258733</Outlier-COM-ave-cost>
<Outlier-COM-num>50</Outlier-COM-num>
<Exact-time>0.04823482397478074</Exact-time>
<xBFGS-value>26.245178434277097</xBFGS-value>
<xBFGS-R2>0.4482605489741717</xBFGS-R2>
<xBFGS-time>0.02992666099453345</xBFGS-time>
<Ratio-values-exact-xBFGS>0.999999999999994</Ratio-values-exact-xBFGS>
<Ratio-time-exact-xBFGS>1.6117676470352489</Ratio-time-exact-xBFGS>
<Distance-exact-xBFGS>9.967294856861747e-08</Distance-exact-xBFGS>
<Distance-exact-PM>0.13782260415531594</Distance-exact-PM>
<traj-segments><LINEDESCENT>5</LINEDESCENT></traj-segments>
</statistics>


Multiple files mode. Assume that the directory $DATASET contains a list of clusters in txt format, and that the file $DATASET/clusters.txt lists these individual clusters by filename (Nb: filename not filepath). The calculations and statistics are easily obtained as follows:

sbl-sc-model.py -F $DATASET/clusters.txt --etag 0.1 --bfgs bfgs --odpath $DATASET/results-bfgs
sbl-sc-model.py -F $DATASET/clusters.txt --idpath $DATASET/results-bfgs --dsname MyDataset

Implementation and classes

The python implementation of [cazals2026modeling] provides the following modules / classes.


Solvers and associated statistics. The package proposes three solvers handling the non-smooth convex optimization introduced in [cazals2026modeling] :

  • SBL::SESC_solver_Clarke::SESC_solver_Clarke : the exact solver developed in [cazals2026modeling] . Stores statistics in an instance of the class SBL::SESC_solver_Clarke::SESC_stats_Clarke .
  • SBL::SESC_solver_approx_BFGS::SESC_solver_BFGS : approx solver resorting to BFGS or L-BFGS . Stores statistics in an instance of the class SBL::SESC_solver_approx_BFGS::SESC_stats_BFGS .
  • SBL::SESC_solver_approx_grid::SESC_solver_grid : an approx solver using a grid discretization of the ambient space – only of interest for low dimensional spaces. Stores its statistics in an instance of the class SBL::SESC_solver_approx_grid::SESC_stats_grid .

These solvers make a consistent use of the class SBL::SESC_tools::SESC_tools , which provides various geometric calculations involved in the definition of the optimization problem.


Spherical cluster model. The class SBL::SESC_model::SESC_model serves as a data class to store all relevant pieces of information for a spherical cluster. This class itself relies on :

  • SBL::SESC_model::Params_run_SESC : the class monitoring a single calculations, following the framework provided in the package Script_design .
  • SBL::SESC_model::SESC_model : class storing all statistics, including those delivered by the three aforementioned solvers, namely SBL::SESC_solver_Clarke::SESC_solver_Clarke , SBL::SESC_solver_approx_BFGS::SESC_solver_BFGS , and SBL::SESC_solver_approx_grid::SESC_solver_grid .


Large scale experiments. To compute spherical cluster models of collections of datasets (each given as a $(n,d)$ dataset in matrix format), the following classes are provided:

  • SBL::SESC_prod_perfile::SESC_prod_perfile : for a given dataset, implement all statistics and plots presented in [cazals2026modeling] ;
  • SBL::SESC_prod_dataset::SESC_prod_dataset : repeats the previous for all files in a dataset;
  • SBL::SESC_tex_report::SESC_tex_report : assembles a ready-to-compile latex report, for all files in a given dataset.


Visualization in 2D. As mentioned above, the –cells options triggers the dump of a trajectory file – .npz suffix. To ease our understanding of the spherical cluster model, the class SBL::SESC_visu::SESC_visu provides two visualization modes in 2D, which exploit these trajectory files:

  • Case 1, at least two values of $\eta$ : show the spheres + the cluster centers (one per eta, each being the endpoint of a trajectory).
  • Case 2, one unique value of $eta$: display the spheres and the corresponding trajectory.


Main script: sbl-sc-model.py. The design of this script follows the framework proposed in the package Script_design. The Spherical cluster model can be confronted to other cluster analysis provided in the package Cluster_analysis.