Clustering
This page gives an introduction to Clustering as an analytical tool and includes:
An Introduction to Clustering and Clustering Methods
Digital Chemistry Clustering Tools
How to get more Information and Evaluation Software
An Introduction to Clustering and Clustering Methods

Clustering is a data analysis technique in which a set of data items is segmented into groups, called clusters, in which members of the same cluster are similar to each other. Clustering is distinct from classification, in that there are no pre-determined characteristics used to define the membership of a cluster, although items in the same cluster are likely to have many characteristics in common.

Clustering has many potential uses, but Digital Chemistry specialises in clustering chemical structures, for example when screening combinatorial or Markush compound libraries in the search for new active pharmaceuticals. Vast quantities of data can be screened at speed, saving significant time and money in identifying drug leads.

Clustering Methods

Many different methods and algorithms have been developed for cluster analysis. They divide into two main types: non-hierarchical and hierarchical.

In non-hierarchical methods, clusters form around centroids, the number of which can be specified by the user. All clusters rank equally and there is no particular relationship between them. In hierarchical methods, the clusters are arranged in a hierarchy in which smaller clusters are contained within larger ones: the bottom of the hierarchy consists of individual objects in "singleton" clusters, while the top consists of one cluster containing all the objects in the dataset. Such hierarchies can be built either from the bottom up (agglomerative) or from the top down (divisive). A set of non-overlapping clusters (called a partition) can then be selected from the hierarchy.
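The bottom-up (agglomerative) case and the later selection of a partition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses group-average linkage on 2-D points with Euclidean distance, the function names (`build_hierarchy`, `select_partition`) are hypothetical, and the naive pair search is not optimised. It is not Digital Chemistry's implementation.

```python
from itertools import combinations
from math import dist

def group_average(a, b, points):
    """Mean pairwise distance between the members of clusters a and b."""
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

def build_hierarchy(points):
    """Agglomerative clustering: return every level of the hierarchy,
    from n singleton clusters down to one cluster holding all objects."""
    clusters = [(i,) for i in range(len(points))]  # bottom level: singletons
    levels = [list(clusters)]
    while len(clusters) > 1:
        # Find and merge the two clusters that are closest on average.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: group_average(pair[0], pair[1], points))
        merged = tuple(sorted(a + b))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        levels.append(list(clusters))
    return levels

def select_partition(levels, k):
    """Select the partition (level) that contains exactly k clusters."""
    return next(level for level in levels if len(level) == k)

# Illustrative data: two tight pairs and one outlying point.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.3, 4.9), (9.0, 0.0)]
levels = build_hierarchy(points)
partition = select_partition(levels, 3)
```

Building `levels` is the expensive step; once it exists, `select_partition` can be called for any number of clusters without re-clustering.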

The concepts of non-hierarchical and hierarchical methods are shown in the diagrams below.

[Figure: Non-Hierarchical Clustering]
[Figure: Hierarchical Clustering]
Within each type of clustering described above there are a number of algorithms one could use, each with its own merits. Four are offered by Digital Chemistry, as described below.
Digital Chemistry Clustering Tools
Digital Chemistry clustering tools are particularly good at handling very large datasets and can cluster faster than other available applications. The methods available are shown below:
[Figure: clustering methods available from Digital Chemistry]

All clustering methods have their own advantages and disadvantages. For example, some are inherently faster than others, while some have been found in empirical experiments to perform better at separating known active and inactive compounds or at predicting property values.

More detailed information is given below about which type might be most suitable for you. If you have Digital Chemistry clustering software, however, you don't need to choose one in preference to another, since all of the methods shown above are included*: you can use any or all of them whenever you want.

Which Method is Fastest?

K-means is generally the fastest of the methods available, with time requirements that increase linearly with the dataset size. However, it can only be used to generate a single set of clusters, the number of which is specified by the user; its speed also depends on the number of clusters requested.
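A minimal sketch of a standard K-means iteration shows where the linear behaviour comes from: each pass compares every item with every centroid, so the work per pass is proportional to the dataset size multiplied by the number of clusters. The code assumes 2-D points and Euclidean distance; the `seed` and `max_iter` parameters are hypothetical conveniences, and this is a textbook version rather than Digital Chemistry's implementation.

```python
import random
from math import dist

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    """Return (labels, centroids) for k clusters requested by the user."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # randomly selected starting centroids
    labels = []
    for _ in range(max_iter):
        # Assignment step: n items x k centroids distance calculations.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            new_centroids.append(mean(members) if members else centroids[c])
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids = k_means(points, k=2)
```

The result is a single partition with exactly `k` clusters; asking for a different number of clusters means running the method again.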

Divisive K-means is the next fastest, with n log n time requirements (where n is the dataset size). Ward and Group-Average are the slowest, with time requirements that increase with the square of the dataset size. For all the hierarchical methods, the time-consuming part is building the hierarchy; the separate step of extracting partitions with the required number of clusters is very fast.
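To make the difference between these growth rates concrete, the short calculation below compares n, n log n and n squared for a given dataset size. It is arithmetic only: constant factors, the number of clusters and hardware are ignored, the base-2 logarithm is an arbitrary choice, and the function name `relative_cost` is hypothetical.

```python
from math import log2

def relative_cost(n):
    """Growth terms for a dataset of n items (constant factors ignored)."""
    return {
        "K-means": n,                        # linear
        "Divisive K-means": n * log2(n),     # n log n
        "Ward / Group-Average": n ** 2,      # quadratic
    }

small = relative_cost(1_024)
large = relative_cost(1_048_576)  # 1,024 times more items

for method in small:
    growth = large[method] / small[method]
    print(f"{method}: cost grows by a factor of {growth:,.0f}")
```

For a 1,024-fold increase in dataset size, the linear term grows 1,024-fold, the n log n term 2,048-fold, and the quadratic term more than a million-fold.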

Which Method Gives the Best Clusters?

Empirical experiments suggest that the hierarchical methods generally produce "better" clusters in terms of their ability to bring compounds with similar properties into the same cluster, with Ward's method and Divisive K-means performing best.

The K-means method depends on randomly selected starting centroids for the clusters, which means that different runs with the same data may produce different results. It is, however, more appropriate for extremely large datasets for which the hierarchical methods are unacceptably slow.
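When comparing two runs, note that cluster numbers are arbitrary labels: two runs can number the clusters differently and still describe the same grouping. The helper below is a hedged sketch of one way to check whether two runs really differ; the function names and the example label lists are made up for illustration.

```python
def as_partition(labels):
    """Convert a list of cluster labels into a set of member-index sets."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, set()).add(index)
    return {frozenset(members) for members in groups.values()}

def same_partition(labels_a, labels_b):
    """True if both labelings group the items identically,
    regardless of how the clusters are numbered."""
    return as_partition(labels_a) == as_partition(labels_b)

run_1 = [0, 0, 1, 1, 2]
run_2 = [2, 2, 0, 0, 1]  # same grouping, clusters numbered differently
run_3 = [0, 1, 1, 1, 2]  # a genuinely different grouping
```

Here `same_partition(run_1, run_2)` is true and `same_partition(run_1, run_3)` is false.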

* Empirical experiments have shown that all four other methods are markedly superior to the Jarvis-Patrick non-hierarchical method, which became popular for cheminformatics use in the 1980s; the Jarvis-Patrick method is therefore no longer offered by Digital Chemistry.

Implementing Digital Chemistry Clustering Tools
Digital Chemistry offers the clustering methods listed below as toolkits, as command-line applications and as part of the Clustering Web Service. Versions of Ward's, Group Average and Divisive K-Means are also available to run on parallel processors to speed up the processing of very large datasets.
  • Group Average (Hierarchical, agglomerative)
  • Ward's Method (Hierarchical, agglomerative)
  • Divisive K-Means (Hierarchical, divisive)
  • K-Means (Non-hierarchical)
Also included are some exclusive tools to help you best interpret the results of your clustering:
  • Automated partitioning in a clustering hierarchy, giving you the 'best' clusters and minimising the need for more specialist knowledge.
  • Assignment of new items to existing clusters without the need for reprocessing. This can lead to huge time savings, especially if you have a large dataset or compound library that is regularly updated with new items or structures.
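The idea behind assigning new items without reprocessing can be sketched as a nearest-centroid lookup. This page does not say how Digital Chemistry's tools make the assignment, so the rule below (closest centroid by Euclidean distance) and the function name are assumptions for illustration only.

```python
from math import dist

def assign_to_cluster(item, centroids):
    """Return the index of the existing cluster whose centroid is
    closest to the new item; no re-clustering is performed."""
    return min(range(len(centroids)), key=lambda c: dist(item, centroids[c]))

# Centroids kept from an earlier clustering run (illustrative values).
centroids = [(0.0, 0.5), (10.0, 10.5)]

new_items = [(1.0, 1.0), (9.0, 12.0)]
assignments = [assign_to_cluster(item, centroids) for item in new_items]
```

Each new item costs one distance calculation per existing cluster, which is why this is far cheaper than clustering the whole updated dataset again.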
Digital Chemistry clustering is available in three formats: toolkits, command-line applications and the Clustering Web Service.
Digital Chemistry supports the following operating systems; the full hardware and software requirements for Digital Chemistry products are listed separately.
  • Windows
  • SUN Solaris
  • Linux
How to get more Information and Evaluation Software
If you would like any more information about Digital Chemistry's software or if you would like to request an evaluation copy please contact us.
  For general enquiries, contact:
T: +44 (0)113 2181851
F: +44 (0)113 2181869
E: info@digitalchemistry.co.uk

The Iron Shed
Harewood House Estate
Harewood
Leeds LS17 9LF
United Kingdom