Structural Bioinformatics Library
Template C++ / Python API for developping structural bioinformatics applications.
|
Authors: F. Cazals and T. Dreyfus
The Union-Find (UF) algorithm maintains dynamically the connected components of an evolving graph, called independent sets in the sequel, upon addition of new edges, thanks to a Union-Find data structure (UF DS). The vertices of this graph may carry a piece of information called the attribute in the sequel.
Union-Find is based on the following three operations:
The key building blocks of the reference UF algorithm are as follows:
Details can be found e.g. in the book Introduction to algorithms, by Cormen-Leiserson-Rivest-Stein [60] .
In addition to the aforementioned basic Union-Find operations, we also provide the following extensions:
We provide two algorithms to process a list of edges, each of them possibly triggering a union_set operation.
An example scenario using these functionalities is given in the SBL::CADS::T_Union_find_specializes section.
A linked list of leaders is maintained. (Addition and removal from this list is done in constant time since the vertex data structure actually stores its location in that list.)
Storing the leaders allows :
Scenarios using this functionality are given in section SBL::CADS::T_Union_find_contains section and SBL::CADS::Union_find_default.
Each vertex u is endowed with a linked list of all its children, i.e all the vertices having u as father. (As for leader, each vertex knows it location in that list.)
This extra information allows iterating recursively over all the descendants of a vertex u, i.e all the vertices v for which there is a path in the graph from u to v. In particular, starting from a leader allows visiting its independent set.
Combining this functionality with the iteration on leaders allows visiting the whole Union-Find data structure.
A scenario using this functionality is given in the SBL::CADS::T_Union_find_contains section.
Internally, the topology of the UF DS (omitting the maintenance of the ranks) is modified by the following operations:
Along a sequence of updates, we stack the aforementioned two operations. To rewind, it is sufficient to unstack these operations, and undo each of them.
A scenario using this functionality is given in section SBL::CADS::T_Union_find_contains.
The Union-Find data structure has been designed to handle any attribute type, without sacrificing the efficiency. There are three ways to represent an attribute in the Union-Find data structure:
This example presents the SBL::CADS::Union_find_default class, with a graph encoded in a file. In particular, when the n vertices of the graph are represented by the n first not negative integers , the vertices of the graph can be used as the vertices of the UF DS.
The example reads a graph from a file, and dumps the leaders together with the size of their independent set.
This example presents the SBL::CADS::T_Union_find_contains class, using the Rewind functionality. It fills the Union-Find data structure with n vertices representing proteins, makes an arbitrary union of them, rewinds the Union-Find data structure to the last insertion, and re-makes a new arbitrary union of the vertices.
Since we need to attach an attribute to the Union-Find vertices, but we do not need to access a vertex from its associated data, we use the SBL::CADS::T_Union_find_contains class instantiated with the Protein const reference data type. (Note that instantiating directly with the data type will copy the data instead of copying the const reference to the data).
This example presents the SBL::CADS::T_Union_find_contains_with_map class. It uses the last example, but uses directly the data as input of Union-Find methods.
This example presents the the SBL::CADS::T_Union_find_specializes class.
Consider a set of proteins with a set of known contacts between them. First, we use the Union-Find data structure for maintaining the independent sets of proteins upon processing the contacts. We then check whether two specific proteins belong to the same independent set.
We use the SBL::CADS::T_Union_find_specializes class instantiated with the Protein_base data type. In doing so, the SBL::CADS::T_Union_find_vertex_data_structure_specializes type directly represents the data.
Note that we need to create new instances of the base data types, but we cast the pointers to the base data type to pointers to the vertex type: this way, the data are not copied.