Structural Bioinformatics Library
Template C++ / Python API for developing structural bioinformatics applications.
User Manual

Spatial_search

Authors: F. Cazals and T. Dreyfus

Introduction

Spatial search engines are data structures used to locate points amongst a dataset of points. In the SBL, such engines are notably used for checking whether a molecular conformation already exists in a database of conformations, or for building a nearest neighbor graph connecting conformations.

In general, a molecular conformation is represented by a point in a d-dimensional space, with d equal to 3 times the number of atoms, see Conformations.

Hence, spatial search engines must be efficient at handling large collections of points in high dimensional spaces.

This package is designed to handle a large variety of spatial search algorithms, with the ability to switch from one implementation to another depending on the context and the needs.

The focus has been placed on data structures working in general metric spaces, as opposed e.g. to kd-trees, so as to accommodate general molecular distance measures.

Pre-requisites

A huge literature exists on such data structures and algorithms, see [155], [94] or [159].

The following concepts are of prime importance:

  • Exact or approximate queries: the exact nearest neighbor(s) is (are) returned, or only approximations thereof.
  • Static or dynamic searches: the dataset to be searched is fixed or evolves over time.
  • KNN versus range mode: in the KNN mode, the (approximate) k-nearest neighbors of a query point are returned; in the range mode, the points located within a distance range are returned.
  • Euclidean versus metric spaces: the space accommodating the points has the structure of a Euclidean space, or only that of a metric space. In the former case, hyper-planes can be used to split the ambient space. In the latter, only the triangle inequality can be used to speed up searches.

For spatial data structures splitting the ambient space or the dataset, two options to perform searches are the exact search and the defeatist style search, see [64].

In short, in a defeatist style search, one travels down a binary partition and collects the best candidate neighbor(s) at each level. This strategy is doomed to fail to report the exact NN, as the NN may be located on the side of the partition not visited. Nevertheless, this flaw can be partially fixed, as explained below.

As a final general comment, it is worth mentioning that high dimensional spaces are prone to measure concentration phenomena [88], which basically stipulate that all distances from a given point tend to concentrate (Gaussian distribution) around a single value. For molecular structures, one may consult [168].

Because of this phenomenon, exact nearest neighbors are often irrelevant, which in turn justifies the use of approximate schemes.

Algorithms

We provide three algorithms:

  • a naive algorithm performing a linear scan on the database,
  • the LAESA algorithm [131],
  • the proximity forest algorithm from [178] and [139].

For each algorithm, we describe the method, the time complexity for building the database of points, and the time complexity for finding the nearest point to a query. The time complexity for computing the distance between two points is denoted $O(D)$.

Naive algorithm

The naive algorithm simply stores the database in a vector. Building a database of $N$ points takes $O(N)$ time, and answering a nearest neighbor query by a linear scan takes $O(DN)$ time.
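For reference, here is a self-contained sketch of such a linear scan over a vector of d-dimensional points, in plain C++ and with the Euclidean distance for illustration. It mimics the behavior of the SBL class described below, which is templated by the distance functor, but is not that class itself.

#include <cstddef>
#include <limits>
#include <vector>

// Linear scan: each query evaluates the distance to all N database points, hence O(DN).
std::size_t nearest_by_linear_scan(const std::vector< std::vector<double> >& db,
                                   const std::vector<double>& query)
{
    double best = std::numeric_limits<double>::max();
    std::size_t best_index = 0;
    for (std::size_t i = 0; i < db.size(); ++i) {
        double d2 = 0;                                 // squared Euclidean distance: O(D)
        for (std::size_t j = 0; j < query.size(); ++j) {
            const double diff = db[i][j] - query[j];
            d2 += diff * diff;
        }
        if (d2 < best) { best = d2; best_index = i; }
    }
    return best_index;                                 // index of the nearest database point
}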

LAESA algorithm

LAESA is an approximate spatial search algorithm based on the triangle inequality [131]. Building the database consists in picking pivots (see below for the selection strategy) and computing only the distances from each point of the database to these pivots. Hence, for a database of $N$ points and $P$ pivots, the time complexity of building the database is $O(DPN)$.

When querying the database, distances to the pivots are computed instead of distances to all points of the database, thus reducing the number of distance computations. Let $x$ be the query point, $p$ a pivot and $q$ a point in the database. Using the triangle inequality, the distance $d(x, q)$ admits the lower bound:

$ d(x, q) \geq \mid d(x, p) - d(p, q)\mid $

This lower bound $g(x, q)$ is tightened by taking the maximum over all pivots:

$ g(x, q) = \max_{p}\mid d(x, p) - d(p, q)\mid $

This lower bound is used to find a nearest neighbor candidate, namely the point $q$ minimizing $g(x, q)$. The time complexity of a query is then $O(DP + N)$.

Note that this algorithm is particularly useful when computing the distance is expensive, since it greatly reduces the number of distance evaluations. Note also that the accuracy of this algorithm largely depends on how the pivots are selected. The base strategy consists in selecting a first point as pivot, then repeatedly selecting as the next pivot the point furthest from all previously selected pivots.
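To make the candidate selection concrete, the following sketch implements a query using the pivot based lower bound, in plain C++ with a generic distance functor. The table of pivot-to-point distances is assumed to have been filled at construction time; this illustrates the principle rather than the SBL implementation.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// dist_pivot_to_point[p][q] stores d(pivot_p, point_q), precomputed at construction time.
// A query computes d(x, p) for each pivot (O(DP)), then scans the lower bounds without
// any further call to the distance functor.
template <class Point, class Distance>
std::size_t laesa_candidate(const Point& x,
                            const std::vector<Point>& pivots,
                            const std::vector< std::vector<double> >& dist_pivot_to_point,
                            const Distance& distance)
{
    std::vector<double> dist_x_pivot(pivots.size());
    for (std::size_t p = 0; p < pivots.size(); ++p)
        dist_x_pivot[p] = distance(x, pivots[p]);

    const std::size_t n = dist_pivot_to_point.empty() ? 0 : dist_pivot_to_point[0].size();
    double best_bound = std::numeric_limits<double>::max();
    std::size_t best_index = 0;
    for (std::size_t q = 0; q < n; ++q) {
        double g = 0;                                  // g(x, q) = max_p |d(x,p) - d(p,q)|
        for (std::size_t p = 0; p < pivots.size(); ++p)
            g = std::max(g, std::fabs(dist_x_pivot[p] - dist_pivot_to_point[p][q]));
        if (g < best_bound) { best_bound = g; best_index = q; }
    }
    return best_index;                                 // candidate minimizing the lower bound
}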

Proximity trees and forests

Proximity tree: data structure and construction. In short, a metric tree is a binary tree recursively splitting the dataset as follows [178], [139]:

  • pick a pivot in the dataset and estimate the median distance $\mu$ from points of the dataset to this pivot.
  • build a binary tree assigning samples whose distance is smaller (resp. larger) than $\mu$ to the left (resp. right) subtree.

The data structure supporting this construction is as follows:

  • Points from the dataset are stored as pivots within internal nodes of the tree, or within leaves.
  • Each leaf is equipped with a bucket that can contain up to $b$ points, including the pivot.

The following points are of prime importance [178]:

  • A proximity tree can be subject to exact or approximate searches. Approximate searches use the defeatist style. Exact searches use a pruning strategy, visiting one or both subtrees only when such visit(s) stand a chance to improve the $k$ nearest neighbors reported.
  • A proximity tree is best built using randomization, in particular to choose the pivot which conditions the binary partitioning at each level.

An incremental construction is also provided, for the cases where the samples are not available all at once. Given an existing metric tree, the insertion of a new sample $x$ goes as follows (a sketch is given after the list):

  • if the current node is not a leaf, compute the distance between $x$ and the current pivot $p$: if the distance is smaller (resp. larger) than the value $\mu$, descend into the left (resp. right) subtree.
  • if the current node is a leaf, add $x$ to the bucket. If the bucket then contains more than $b$ points, split it using the construction sketched above.
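The following is a minimal sketch of this incremental insertion, using the convention above (distance to the pivot smaller than $\mu$ goes to the left subtree). The node layout is simplified with respect to the SBL data structure, and the bucket splitting step is elided.

#include <cstddef>
#include <memory>
#include <vector>

// Simplified node: internal nodes store a pivot and the median distance mu,
// leaves store a bucket of at most b points.
template <class Point>
struct Metric_tree_node {
    bool is_leaf = true;
    Point pivot;
    double mu = 0;
    std::vector<Point> bucket;
    std::unique_ptr<Metric_tree_node> left, right;
};

template <class Point, class Distance>
void insert(Metric_tree_node<Point>& node, const Point& x,
            const Distance& distance, std::size_t b)
{
    if (!node.is_leaf) {
        // Descend: distance to the pivot smaller than mu goes left, larger goes right.
        if (distance(x, node.pivot) < node.mu)
            insert(*node.left, x, distance, b);
        else
            insert(*node.right, x, distance, b);
        return;
    }
    node.bucket.push_back(x);
    if (node.bucket.size() > b) {
        // Split the leaf: pick a pivot, estimate the median distance mu,
        // and dispatch the bucket to two new leaves (omitted in this sketch).
    }
}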

The complexities are as follows:

  • Construction cost for $N$ points: $O(N \log N)$.

  • Defeatist style search: $O(D \log N)$.

  • Exact search: linear in the worst case, since all nodes may be visited.

Proximity forest. When used for defeatist style searches, the intrinsic difficulty mentioned in section Pre-requisites can be reduced by working with several proximity trees, yielding a proximity forest.

In a first step, each proximity tree is searched and yields an approximate nearest neighbor; in a second step, the best candidate is retained.
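The following sketch illustrates this two-step strategy, assuming each tree exposes a defeatist search returning the index of its candidate; the method name defeatist_nearest is hypothetical. Within the SBL, this combination is handled by the class SBL::GT::T_ANN_meta described below.

#include <cstddef>
#include <limits>
#include <vector>

// Query each tree of the forest, then keep the candidate closest to the query point x.
template <class Tree, class Point, class Distance>
std::size_t forest_nearest(const std::vector<Tree>& forest,
                           const std::vector<Point>& db,
                           const Point& x, const Distance& distance)
{
    double best = std::numeric_limits<double>::max();
    std::size_t best_index = 0;
    for (const Tree& tree : forest) {
        const std::size_t candidate = tree.defeatist_nearest(x);  // hypothetical per-tree search
        const double d = distance(x, db[candidate]);
        if (d < best) { best = d; best_index = candidate; }
    }
    return best_index;
}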

Implementation

Engine concept

All algorithms are implemented as spatial search engines, i.e. objects following a C++ concept. This concept makes it easy to implement new algorithms and to switch between the different algorithms. The concept requires a number of methods that can be divided into three functionalities:

  • Database manipulation
void insert(const Point& p);//Inserts a point p in the engine database.
void insert(InputIterator begin, InputIterator end);//Inserts a range of points in the engine database.
void clear();//Clears the database.
void reset();//Clears the internal data structures and re-inserts all the points of the database.
OutputIterator get_database(OutputIterator out)const;//Collects the database.
unsigned get_number_of_points()const;//Returns the database size.
const Point& get_point(unsigned i)const;//Returns the i-th inserted point.
  • Parametrization
void set_query_type(NN_query_type query_type);//Sets the query type (KNN or range mode).
void set_number_of_neighbors(unsigned k);//Sets k when the mode is KNN.
unsigned get_number_of_neighbors();//Returns k when the mode is KNN.
void set_range(const FT& range);//Sets the range used in range mode.
const FT& get_range()const;//Returns the range used in range mode.
  • Queries
FT get_distance(const Point& p, const Point& q)const;//Returns the distance between p and q.
int get_nearest_neighbor(const Point& p, bool self_allowed)const;//Returns the index of the nearest neighbor of p,
                                                                 //possibly excluding p itself from the database.
std::pair<OutputIterator, unsigned> k_nearest_neighbors(unsigned k, const Point& p, bool self_allowed, OutputIterator out)const;//Returns the KNN of p.
OutputIterator range_nearest_neighbors(const Point& p, bool self_allowed, OutputIterator out)const;//Returns all the neighbors of p within the predefined range.
const Point& operator()(const Point& p, bool self_allowed)const;//Returns the nearest neighbor of p as a point (cf get_nearest_neighbor).
OutputIterator operator()(const Point& p, bool self_allowed, OutputIterator out)const;//Performs a KNN or range query, following the predefined mode.

The different models of the concept implemented in the SBL are:

  • SBL::GT::T_NN_linear_scan< DistanceFunction > : the naive spatial search engine; the parameter DistanceFunction is the distance functor taking as input two points and returning their distance (a usage sketch follows this list).
  • SBL::GT::T_NN_metric_tree< DistanceFunction , SplitterFunction > : an implementation of an exact metric tree; the parameter SplitterFunction is the functor deciding how points are split when building the tree, see SBL::GT::T_Splitter_default< DistanceFunction >.
  • SBL::GT::T_ANN_metric_space_LAESA< DistanceFunction , ExcludedPivots > : the LAESA algorithm; the additional parameter ExcludedPivots is a boolean tag indicating whether pivots should be excluded from the nearest neighbor candidates (default is false).
  • SBL::GT::T_ANN_metric_space_proximity_tree< DistanceFunction , SplitterFunction > : representation of a single tree of the proximity forest; the parameter SplitterFunction plays the same role as previously described, but is also used for selecting which subtree (left or right) should be visited when making a query, see SBL::GT::T_Splitter_default< DistanceFunction >.
  • SBL::GT::T_ANN_meta< RandomizedANN , IsLowerPoint > : representation of a collection of randomized algorithms, such as proximity trees; the parameter RandomizedANN is a randomized spatial search engine, such as a proximity tree, and the parameter IsLowerPoint is the less-than comparator for points (by default, the natural comparator is used if it exists).
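For illustration, a minimal usage sketch based on the concept methods listed above, instantiating the linear scan model with a simple Euclidean distance functor. The header path, the typedefs exposed by the distance functor, and the fact that k_nearest_neighbors reports indices are assumptions to be checked against the reference manual.

#include <cmath>
#include <cstddef>
#include <iostream>
#include <iterator>
#include <vector>
// Header path is an assumption; adapt it to your SBL installation.
#include <SBL/GT/NN_linear_scan.hpp>

// Euclidean distance functor over std::vector<double>; the Point and FT typedefs are assumptions.
struct Euclidean_distance {
    typedef std::vector<double> Point;
    typedef double FT;
    FT operator()(const Point& p, const Point& q) const {
        FT d2 = 0;
        for (std::size_t i = 0; i < p.size(); ++i) d2 += (p[i] - q[i]) * (p[i] - q[i]);
        return std::sqrt(d2);
    }
};

int main() {
    typedef SBL::GT::T_NN_linear_scan<Euclidean_distance> Engine;
    Engine engine;
    std::vector<Euclidean_distance::Point> db = { {0, 0}, {1, 0}, {0, 1}, {5, 5} };
    engine.insert(db.begin(), db.end());                   // Database manipulation
    engine.set_number_of_neighbors(2);                     // Parametrization (KNN mode)
    std::vector<unsigned> knn;                             // assumed: neighbor indices are reported
    engine.k_nearest_neighbors(2, db[3], false, std::back_inserter(knn));
    for (unsigned i : knn) std::cout << "neighbor index: " << i << "\n";
    return 0;
}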

FLANN Wrapper

The FLANN library is a collection of spatial search engines and is very helpful for comparing different algorithms [132] .

We provide an additional search engine model to wrap the FLANN library. This model is implemented with the class SBL::GT::T_ANN_FLANN_wrapper< DistanceFunction , GetPoint , IndexParams , SearchParams >.

The parameter GetPoint is a functor transforming an input point, in the format used for the distance calculations, into the point data structure of the FLANN library, i.e. a matrix.

The parameters IndexParams and SearchParams are classes defining the FLANN algorithms used for building and querying the database. By default, they are set to the classes flann::LinearIndexParams and flann::SearchParams, which perform a linear construction and a linear search, i.e. the naive search engine.
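As an illustration, the sketch below outlines a possible GetPoint functor together with the default instantiation of the wrapper; it reuses the Euclidean_distance functor from the engine example above. The SBL header path and the exact signature expected for GetPoint are assumptions to be checked against the reference manual.

#include <algorithm>
#include <vector>
#include <flann/flann.hpp>
// SBL header path is an assumption; adapt it to your installation.
#include <SBL/GT/ANN_FLANN_wrapper.hpp>

// Hypothetical converter: copies a point into a 1 x d FLANN matrix.
// In this sketch the allocated buffer is not released; a real functor should manage it.
struct Get_FLANN_point {
    flann::Matrix<double> operator()(const std::vector<double>& p) const {
        double* data = new double[p.size()];
        std::copy(p.begin(), p.end(), data);
        return flann::Matrix<double>(data, 1, p.size());
    }
};

// With the default parameters, the wrapper reproduces the naive engine:
// linear index construction (flann::LinearIndexParams) and linear search (flann::SearchParams).
typedef SBL::GT::T_ANN_FLANN_wrapper<Euclidean_distance,
                                     Get_FLANN_point,
                                     flann::LinearIndexParams,
                                     flann::SearchParams> FLANN_engine;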

Comparing Algorithms

This package also provides a class for comparing any pair of algorithms implementing the spatial search engine concept: SBL::GT::T_ANN_compare_algorithms< GeometricKernelD >. It is initialized with a database of points of a given dimension, uniformly generated in a box of a given size. Three methods are then provided for comparing the algorithms (see also the driver sketched after the list):

  • comparing the time for constructing the database:
void compare_construction(Algorithm1& algo_1, Algorithm2& algo_2, std::ostream& out = std::cout);
  • comparing the time for finding the nearest neighbor of a given number of query points:
void compare_nearest_query(Algorithm1& algo_1, Algorithm2& algo_2, std::ostream& out = std::cout);
  • comparing the time for querying while building the database:
void compare_dynamical(Algorithm1& algo_1, Algorithm2& algo_2, std::ostream& out = std::cout);
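For illustration, the following generic driver calls the three comparison methods listed above; it is written generically against the comparator and two engine models, since the constructor of SBL::GT::T_ANN_compare_algorithms (number of points, dimension, box size) is not detailed here.

#include <iostream>

// Runs the three comparisons on two engines following the spatial search engine concept.
template <class Comparator, class Algorithm1, class Algorithm2>
void run_all_comparisons(Comparator& comparator, Algorithm1& algo_1, Algorithm2& algo_2)
{
    comparator.compare_construction(algo_1, algo_2, std::cout);   // database construction times
    comparator.compare_nearest_query(algo_1, algo_2, std::cout);  // nearest neighbor query times
    comparator.compare_dynamical(algo_1, algo_2, std::cout);      // queries while building the database
}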

Module

Finally, this package provides an SBL module SBL::Modules::T_Spatial_search_module< ModuleTraits , ApproximatedSpatialSearchEngine >. It takes as input a container of points and builds a spatial search engine. The parameter ModuleTraits has to define the following classes:

  • Distance : the distance functor
  • Points_container : an STL container of points (e.g. a vector)

The parameter ApproximatedSpatialSearchEngine is one of the approximate spatial search engines and is set by default to the proximity forest data structure. An exact spatial search engine is also used internally and is set to the exact metric tree algorithm.
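As an illustration, a minimal sketch of such a traits class is given below, reusing the Euclidean_distance functor from the engine example above; any additional requirement of the module framework is not covered by this sketch.

#include <vector>

// Minimal traits class providing the two types required by T_Spatial_search_module.
struct Spatial_search_module_traits {
    typedef Euclidean_distance                     Distance;          // distance functor
    typedef std::vector<Euclidean_distance::Point> Points_container;  // STL container of points
};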

The main option of the module T_Spatial_search_module is:
exact-search bool(= false): Uses the naive search rather than the ANN.

Examples

Naive

This example shows how to use the naive search with a custom Point type. It creates a grid of 2D points, builds the database, then queries the nearest neighbors of the point (50,50).

LAESA

This example shows how to use the LAESA algorithm with a custom Point type. It creates a grid of 2D points, builds the database, then queries the nearest neighbors of the point (50,50), both in KNN mode and in range mode.

Proximity forest

This example shows how to use the proximity forest algorithm with a custom Point type. It creates a grid of 2D points, builds the database, then queries the nearest neighbors of the point (50,50), both in KNN mode and in range mode.

Comparing naive to proximity forest

This example shows how to use the algorithm comparator for comparing the naive algorithm with the proximity forest algorithm. In particular, it loads an input file listing 3D points in the xyz format and uses them for building the comparator database. It then compares the construction time, the query time and the dynamical construction time of the two algorithms.

Applications

A classical application of k-nearest neighbors is regression [21]. Consider a database of points, each endowed with a response value, i.e. a real number. To estimate the response value at an arbitrary query point q, one can:

  • compute the k-nearest neighbors of q,
  • average the response values over these k nearest neighbors, as sketched below.
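For concreteness, here is a self-contained sketch of this estimate in plain C++, using a linear scan for the neighbor search; any engine from this package can be substituted for that step.

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// k-NN regression: average the response values of the k nearest database points.
double knn_regression_estimate(const std::vector< std::vector<double> >& points,
                               const std::vector<double>& responses,
                               const std::vector<double>& query,
                               std::size_t k)
{
    // (squared distance, index) pairs; a linear scan is used for simplicity.
    std::vector< std::pair<double, std::size_t> > by_distance;
    for (std::size_t i = 0; i < points.size(); ++i) {
        double d2 = 0;
        for (std::size_t j = 0; j < query.size(); ++j) {
            const double diff = points[i][j] - query[j];
            d2 += diff * diff;
        }
        by_distance.push_back(std::make_pair(d2, i));
    }
    std::sort(by_distance.begin(), by_distance.end());

    const std::size_t kk = std::min(k, by_distance.size());
    double sum = 0;
    for (std::size_t i = 0; i < kk; ++i)
        sum += responses[by_distance[i].second];
    return kk ? sum / kk : 0.0;                        // average response over the k nearest neighbors
}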

Based on the data structures of this package, two applications are provided to do so:

  • sbl-knn-regressor-euclidean.exe : the input points are d-dimensional points in a Euclidean space; the k NN are therefore reported for the Euclidean distance.
  • sbl-knn-regressor-angular.exe : the input points are d-dimensional points on the d-dimensional flat torus; the k NN are therefore reported for the distance on the flat torus.

As an illustration, the following command line loads a DB of 1960 2D points and the associated response values (labels 0 or 1 in this case), and performs an estimate for 2000 query points, using 149 nearest neighbors:

sbl-knn-regressor-euclidean.exe  -p demos/data/points2D_coords.txt -r demos/data/points2D_heights.txt   -q demos/data/query2D_coords.txt -s -o results -k 149