Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.
|
Authors: F. Cazals and T. Dreyfus
Spatial search engines are data structures used to locate points amongst a dataset of points. In the SBL, such engines are particularly used for querying the existence of a molecular conformation amongst a database of conformations, or for building a nearest neighbor graph connecting conformations.
In general, a molecular conformation is represented by a point in a d-dimensional space, with d equal to 3 times the number of atoms, see Conformations .
Hence, spatial search engines must be efficient to handle large collection of points in high dimensional spaces.
This package is designed to handle a large variety of spatial search algorithms, with the ability to switch from one implementation to another one depending on the context / the needs.
The focus has been place on data structures working in general metric spaces, as opposed e.g. to kd-trees, in order to accommodate general molecular distance measures.
A huge litterature exists for such data structures and algorithms, see [155], [94] or [159] .
The following concepts are of prime importance:
One option to design exact or approximate searches based upon spatial data structure which either split the ambient space of the dataset are the notions of exact search and of defeatist style search, see [64] .
In short, in a defeatist style search, one travels down a binary partition and collect the best candidate neighbor(s) at each level. This strategy is doomed to fail to report exact NN, as the NN may be located on the side of the partition not visited. Nevertheless, this flow can be partially fixed, as explained below.
As a final general comment, it is worth mentioning that high dimensional spaces are prone to measure concentration phenomea [88] , which basically stipulate that all distances from a given point tend to concentrate (Gaussian distribution) around a value. For molecular structures, one may consult [168] .
Because of this phenomenom, exact nearest neighbors are often irrelevant, which in turns justifies the use of approximate schemes.
We provide three particular algorithms :
For each algorithm, we describe the algorithm, the time complexity for building the points database, and the time complexity for searching the nearest point to a query. The time complexity for computing the distance between two points is denoted .
We implemented the naive algorithm, consisting simply on a database stored in a vector. Building a database of points is done linearly in , and searching a point in the database is done linearly in .
LAESA is an approximated spatial search algorithm based on the triangular inequality [131] . The algorithm for building the database consists on randomly picking pivots and computing only distances from any point of the database to the pivots. Hence, for a database of points and pivots, the time complexity of building the database is .
When a point is searched in the database, distances to the pivots are computed instead of distances to the points of the database, reducing the number of distance computation. Let be the query point, the pivot and a point in the database. Using the triangular inequality, the distance has a lower bound :
Hence, the lower bound can be generalized for any pivots by taking the maximum :
This lower bound is used to find a nearest neighbor candidate that minimizes . The time complexity is then in .
Note that this algorithm is particularly useful when the distance computation takes time since it reduces largely the number of times the distance is computed. Note also that the accuracy of this algorithm largely depends on how the pivots are selected.The base strategy consists on selecting a first point as pivot, then selecting the next pivot as far as possible from all previously selected pivots.
Proximity tree: data structure and construction. In short, a metric tree is a binary tree recursively splitting the dataset as follows [178] and [139] :
The data structure supporting this construction is as follows:
The following points are of prime importance [178]
An incremental construction is also provided, for the cases where the samples are not provided at once. Assuming that a metric tree is provided, the insertion of a new sample goes as follows:
The complexities are as follows:
Construction cost for points: .
Defeatist style search:
Proximity forest. When used for defeatist style searches, the intrinsic difficulty mentioned in section Pre-requisites can be reduced by working with several proximity trees, yielding a proximity forest.
In a first step, each proximity tree is searched ad yields an approximate nearest neighbor; in a second step, the best candidate is retained.
All algorithms are implemented as a spatial search engine, i.e an object following a C++ concept. This concept helps to implement new algorithms and to switch easily between the different algorithms. The concept requires a number of methods that can be divided in 3 functionnalities :
void insert(const Point& p);//Insert a point p in the engine database. void insert(InputIterator begin, InputIterator end);//Insert a range of points in the engine database. void clear();//Clear the database. void reset();//Clear the database and re-insert all the points in the database. OutputIterator get_database(OutputIterator out)const;//Collects the database; unsigned get_number_of_points()const;//Returns the database size; const Point& get_point(unsigned i)const;//Returns the ith inserted point
void set_query_type(NN_query_type query_type);//Set the query type (KNN or by range). void set_number_of_neighbors(unsigned k);//Set k when the mode is KNN. unsigned get_number_of_neighbors();//Returns k when the mode is KNN. void set_range(const FT& range);//Set the range when the mode is by range. const FT& get_range()const;//Returns the range when the mode is by range.
FT get_distance(const Point& p, const Point& q)const;//Returns the distance between p and q. int get_nearest_neighbor(const Point& p, bool self_allowed)const;//Returns the index of the nearest neighbor of p, //possibly excluding p itself from the database. std::pair<OutputIterator, unsigned> k_nearest_neighbors(unsigned k, const Point& p, bool self_allowed, OutputIterator out)const;//Returns the KNN of p. OutputIterator range_nearest_neighbors(const Point& p, bool self_allowed, OutputIterator out)const;//Returns all the neighbors of p under a predefined range. const Point& operator()(const Point& p, bool self_allowed)const;//Alias for get_nearest_neighbor. OutputIterator operator()(const Point& p, bool self_allowed, OutputIterator out)const;//Perform KNN or by range following the predefined mode.
The different models of the concept implemented in the SBL are :
The FLANN library is a collection of spatial search engines and is very helpful for comparing different algorithms [132] .
We provide an additional search engine model to wrap the FLANN library. This model is implemented with the class SBL::GT::T_ANN_FLANN_wrapper< DistanceFunction , GetPoint , IndexParams , SearchParams >.
The parameter GetPoint is a functor transforming an input point with the format used in the distance calculations, into a point data structure inthe FLANN library, i.e a matrix.
The parameters IndexParams and SearchParams are classes defining the FLANN algorithms used for building and querying the database. They are set by default to the classes flann::LinearIndexParams and flann::SearchParams performing a linear building and a linear search, defining the naive search engine.
This package provides also a class for comparing any pair of algorithms implemented following the spatial search engine concept : SBL::GT::T_ANN_compare_algorithms< GeometricKernelD > . It is initialized with a database of points of a given dimension uniformly generated in a box of a given size. Then, three methods are provided for comparing the algorithms :
void compare_construction(Algorithm1& algo_1, Algorithm2& algo_2, std::ostream& out = std::cout);
void compare_nearest_query(Algorithm1& algo_1, Algorithm2& algo_2, std::ostream& out = std::cout);
void compare_dynamical(Algorithm1& algo_1, Algorithm2& algo_2, std::ostream& out = std::cout);
Finally, this package provides a SBL module SBL::Modules::T_Spatial_search_module< ModuleTraits , ApproximatedSpatialSearchEngine > . It takes as input a container of points and build a spatial search engine. The parameter ModuleTraits has to define the classes :
The parameter ApproximatedSpatialSearchEngine is one of the approximated spatial search engine and is set by default to the proximity forest data structure. An exact spatial search engine is also used and is set to the exact metric tree algorithm.
The main option of the module T_Spatial_search_module is:
exact-search
bool(= false): Uses the naive search rather than the ANN.
This example shows how to use the naive search using a custom Point type. It creates a grid of 2D points and builds the database, then query the nearest neighbors of the point (50,50).
This example shows how to use the LAESA algorithm with a custom Point type. It creates a grid of 2D points and builds the database, then query the nearest neighbors of the point (50,50), using KNN and using the search by range.
This example shows how to use the proximity forest algorithm using a custom Point type. It creates a grid of 2D points and builds the database, then query the nearest neighbors of the point (50,50), using KNN and using the search by range.
This example shows how to use the algorithm comparator for comparing the naive algorithm with the proximity forest algorithm. In particular, it loads an input file listing 3D points in the xyz formatand uses them for building the comparator database. It then compares the construction time, the query time and the dynamical construction time of the two algorithms.
A classical application of k-nearest neighbors is regression [21]. Consider a database of points, each endowed with a response value – a real number. To estimate the height at an arbitrary query point q, one can:
Using the data structures provided in this package, two applications are provided to do so:
As an illustration, the following command line loads a DB of 1960 2D points and the associated response values (labels 0 or 1 in this case), and performs an estimate for 2000 query points, using 149 nearest neighbors:
sbl-knn-regressor-euclidean.exe -p demos/data/points2D_coords.txt -r demos/data/points2D_heights.txt -q demos/data/query2D_coords.txt -s -o results -k 149