Template C++ / Python API for developping structural bioinformatics applications.
User Manual
Cluster_spherical
Authors:F. Cazals and A. Commaret and L. Goldenberg
Introduction: the spherical cluster model
Let a parametric cluster model be a cluster model depending on parameters to be estimated. As opposed to clusters defined with respect to points, such as in k-means or k-medoids, parametric cluster models aim at providing insights on the geometry of the point set modeled.
This package presents the first method to compute spherical clusters[cazals2026modeling] , which were introduced in the context of subspace clustering [193] .
As illustrated by the logo of this package, a spherical cluster is defined by a ball, based on
inliers: points in the ball incur a null cost;
outliers: points outside the ball pay a cost equal to the power distance with respect to the ball.
See also Fig. fig-CM-examples-Iris for spherical clusters obtained on the classical Iris 4D dataset, and the sections below for the precise mathematical definitions.
Spherical cluster models for the three clusters of the classical Iris 4D dataset: (Left) Setosa (Middle) Versicolor (Rigth) Virginica. Spherical cluster models for , and points projected onto the first two principal components of each dataset. Inliers (resp. outliers) presented as blue (resp. orange) dots. Note that outliers in 4D may project near the cluster center in 2D. Each plot also features the center of mass and the projection median.
Spherical clusters as parametric clusters
Cluster identification as an optimization problem. Let be a set of points in . We consider a partition of into clusters , with the set of points associated to cluster .
Take for and suppose is known. Cluster is described by the parameter set and a function , that is some distance from a point to the cluster. We call the description of by the function a parametric cluster model.
Cluster optimization is concerned with the identification of the optimal parameters in the following sense:
Cluster optimization. Let be a parametric cluster. Cluster optimization is the optimization problem seeking the cluster parameters minimizing the dispersion
Spherical clusters. Recall that the unbiased variance estimate for distances within cluster of center satisfies
To define the spherical cluster model [193] , also recall that the power of a point with respect to a sphere is defined by .
We define:
(Spherical cluster) Let be is a hyperparameter, and let be apoint called the cluster center. Given the set , the distance function associated to the spherical cluster reads as
The rationale of this definition is that one wishes to find the center covering as many inliers as possible while minimizing the cost of outliers–points outside the spherical cluster.
Thus, the spherical cluster model depends both on inliers and outliers.
Spherical clusters are well defined
For a fixed data set , the previous optimization problem requires minimizing the functional
To study the previous function, for each , let
so that one gets
The following is proved [cazals2026modeling] :
Theorem. (Strict convexity of ). Let be a set of points of , with at least two distinct points, and assume that . Then the associated map is -strongly convex on , and is a fortiori strictly convex. Its minimization problem admits exactly one solution in .
While this theorem establishes the uniqueness of the SC center, the main difficulty to compute it is that the functional to be minimized is not smooth, see Fig. fig-opt-codim.
Minima of on cells of various dimensions. Data point in orange, minima in red. Selected level sets (in dotted-lines) are also reported.
Algorithms
Exact solver
An exact solver for spherical clusters, denoted , is developed in [cazals2026modeling]. It actually minimizes a function of the form
and constructs a finite sequence of points by induction. The last point is the optimum of . This sequence is also called the trajectory in the sequel.
This algorithm relies on three sub-routines called , , .
The reader is referred to [cazals2026modeling] for the details.
Heuristics based on BFGS and L-BFGS
While the functional to be optimized in not smooth, it is known that BFGS works well in practice for non-differentiable functions [131].
We challenge with and respectively, using the BFGS and L-BFGS-B solvers provided by scipy.optimize.. Note that the latter uses an approximation of the Hessian–as opposed to a sized matrix.
Genericity and robustness
Our algorithm is implemented in python and numpy. The reader is referred to [cazals2026modeling] for a detailed discussion of numerics and genericity assumptions.
Modeling point clouds with spherical clusters
Spherical cluster: illustrations on a toy 2D dataset. (LEFT) Trajectories from five different starting points, with (Line/Sphere descents in blue/orange). (RIGHT) Evolution of the cluster center for nine in arithmetic progression from 0.1 to 0.9.
Statistics of interest
Modeling point clouds with the SC model requires varying . To this end, we define:
SC square radius. The square radius with respect to which the power distance is computed, that is .
Projection plot. The plot of all points (data points, center of mass, SC centers) onto the first two principal directions. Inliers (resp. outliers) are displayed in blue (resp. orange). The number of outliers identified by our cluster model, is denoted . Similarly stand for the number be outliers defined with respect to a sphere of the same radius centered at the center of mass.
Dual plot. Reports and as a function of . (NB: values are represented negated on this plot.)
Stacked barplot. The plot function of counting the number of steps of each type (line descent, sphere descent, teleportation) in .
Time ratio plots. The plots for and , comparing the running times of against those of \algobfgs and respectively.
Average outlier cost plots. The plots and .
Outlier ratio plot. The plot .
Fitting spherical cluster models
The provided script sbl-sc-model.py implements two modes:
Run mode: computes cluster models for a set of input files in txt format
Statistics mode: processes all output found in a directory and delivers a compilation-ready latex report with all the aforementioned plots.
Main options. The main options of the script sbl-sc-model.py are as follows:
–idpathstring: Input directory path –ifpathsstring: File paths for the clusters to be analyzed – each in matrix format n x d –odpathstring: Output directory path –relative or absolute (default: results) –viewstring: View plots during the calculation (default: False) –verbosestring: Verbose level: 0:none 1:main step 2:with intermediate info 3:with full details –logstring: Log file (default: False) –mode M_MODEstring: Mode: run stats run-stats glob visu –no-parstring: Turn parallel calculation off (default: False) –Fnamesstring: File containing a list of input filenames –etastring: List of eta values –etanstring: Num of values equally spaced\; nargs=n or nargs=[a,b,n] –etagstring: Gap to sample eta values\; nargs=gap or nargs=[a,b,gap] –cellsstring: Report statistics on the cells traversed during the optim and dump the trajectory file –rand-initstring: Use a random initialization –bfgsstring: Heuristic uses bfgs/BFGS or lbfgs/LBFGS (default: lbfgs) –dsnamestring: Dataset name–for plots at the latex report at the whole dataset level (default: NoName)
Single file mode. Fitting a SC model on one or several files, for one or several values of is done as follows:
Multiple files mode. Assume that the directory $DATASET contains a list of clusters in txt format, and that the file $DATASET/clusters.txt lists these individual clusters by filename (Nb: filename not filepath). The calculations and statistics are easily obtained as follows:
The python implementation of [cazals2026modeling] provides the following modules / classes.
Solvers and associated statistics. The package proposes three solvers handling the non-smooth convex optimization introduced in [cazals2026modeling] :
SBL::SESC_solver_Clarke::SESC_solver_Clarke : the exact solver developed in [cazals2026modeling] . Stores statistics in an instance of the class SBL::SESC_solver_Clarke::SESC_stats_Clarke .
SBL::SESC_solver_approx_BFGS::SESC_solver_BFGS : approx solver resorting to BFGS or L-BFGS . Stores statistics in an instance of the class SBL::SESC_solver_approx_BFGS::SESC_stats_BFGS .
SBL::SESC_solver_approx_grid::SESC_solver_grid : an approx solver using a grid discretization of the ambient space – only of interest for low dimensional spaces. Stores its statistics in an instance of the class SBL::SESC_solver_approx_grid::SESC_stats_grid .
These solvers make a consistent use of the class SBL::SESC_tools::SESC_tools , which provides various geometric calculations involved in the definition of the optimization problem.
Spherical cluster model. The class SBL::SESC_model::SESC_model serves as a data class to store all relevant pieces of information for a spherical cluster. This class itself relies on :
SBL::SESC_model::Params_run_SESC : the class monitoring a single calculations, following the framework provided in the package Script_design .
SBL::SESC_model::SESC_model : class storing all statistics, including those delivered by the three aforementioned solvers, namely SBL::SESC_solver_Clarke::SESC_solver_Clarke , SBL::SESC_solver_approx_BFGS::SESC_solver_BFGS , and SBL::SESC_solver_approx_grid::SESC_solver_grid .
Large scale experiments. To compute spherical cluster models of collections of datasets (each given as a dataset in matrix format), the following classes are provided:
SBL::SESC_prod_perfile::SESC_prod_perfile : for a given dataset, implement all statistics and plots presented in [cazals2026modeling] ;
SBL::SESC_prod_dataset::SESC_prod_dataset : repeats the previous for all files in a dataset;
SBL::SESC_tex_report::SESC_tex_report : assembles a ready-to-compile latex report, for all files in a given dataset.
Visualization in 2D. As mentioned above, the –cells options triggers the dump of a trajectory file – .npz suffix. To ease our understanding of the spherical cluster model, the class SBL::SESC_visu::SESC_visu provides two visualization modes in 2D, which exploit these trajectory files:
Case 1, at least two values of : show the spheres + the cluster centers (one per eta, each being the endpoint of a trajectory).
Case 2, one unique value of : display the spheres and the corresponding trajectory.
Main script: sbl-sc-model.py. The design of this script follows the framework proposed in the package Script_design. The Spherical cluster model can be confronted to other cluster analysis provided in the package Cluster_analysis.