|
This page gives an introduction to clustering as
an analytical tool and includes: |
- An Introduction to Clustering and Clustering Methods
- Digital Chemistry Clustering Tools
- How to get more Information and Evaluation Software |
| An Introduction to Clustering and Clustering Methods |
Clustering is a data analysis technique in which a
set of data items is segmented into groups, called
clusters, such that members of the same cluster are
similar to each other. Clustering is distinct from
classification in that no pre-determined characteristics
are used to define the membership of a cluster, although
items in the same cluster are likely to have many
characteristics in common.
There are many potential uses for clustering, but
Digital Chemistry specialises in clustering chemical
structures, for example in the screening of
combinatorial or Markush compound libraries in the
search for new active pharmaceuticals. Very large
quantities of data can be screened at speed, saving
significant time and money in identifying drug
leads. |
| Clustering Methods |
Many different methods and algorithms have been
developed for cluster analysis; these divide into
two main types: non-hierarchical and hierarchical.
In non-hierarchical clustering, clusters form around
centroids, the number of which can be specified by
the user. All clusters rank equally and there is no
particular relationship between them. In hierarchical
clustering, the
clusters are arranged in hierarchies, in which
smaller clusters are contained within larger ones;
the bottom of the hierarchy consists of individual
objects in "singleton" clusters, while
the top of it consists of one cluster containing
all the objects in the dataset. Such hierarchies
can be built either from the bottom up (agglomerative)
or the top downwards (divisive). A set of non-overlapping
clusters (called a partition) can then be selected
from the hierarchy.
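As an illustration of the bottom-up (agglomerative) approach and of selecting a partition from the hierarchy, here is a minimal Python sketch using the group-average criterion. It is a naive teaching example written for this page, not Digital Chemistry's implementation: the function names `group_average_hierarchy` and `extract_partition`, the use of Euclidean distance, and the sample points are all assumptions, and no attempt is made to be efficient.

```python
from itertools import combinations
from math import dist


def group_average_hierarchy(points):
    """Build a bottom-up hierarchy and return the merges in the order made.

    Each merge is (members_a, members_b, average_distance), where members
    are sorted tuples of indices into `points`.
    """
    clusters = [(i,) for i in range(len(points))]  # start from singletons
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average pairwise distance.
        best = None
        for a, b in combinations(clusters, 2):
            total = sum(dist(points[i], points[j]) for i in a for j in b)
            avg = total / (len(a) * len(b))
            if best is None or avg < best[2]:
                best = (a, b, avg)
        a, b, avg = best
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, avg))
    return merges


def extract_partition(n_items, merges, n_clusters):
    """Replay the recorded merges until only `n_clusters` clusters remain."""
    clusters = {(i,) for i in range(n_items)}
    for a, b, _ in merges:
        if len(clusters) <= n_clusters:
            break
        clusters -= {a, b}
        clusters.add(tuple(sorted(a + b)))
    return sorted(clusters)


# Hypothetical 2-D data: two tight pairs and one outlying point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
merges = group_average_hierarchy(points)
partition = extract_partition(len(points), merges, 3)
print(partition)
```

Building the list of merges is the expensive step; once it exists, a partition with any number of clusters can be read off by replaying the merges, without recomputing any distances.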
The concepts of non-hierarchical and hierarchical
methods are shown in the diagrams below. |
| Non-Hierarchical Clustering |
|
| Hierarchical Clustering |
|
| Within each type of clustering described above there
are a number of clustering algorithms one could use,
each with its own merits. Four are offered by Digital
Chemistry, as described below. |
| Digital Chemistry Clustering Tools |
| Digital
Chemistry clustering tools are particularly good
at handling very large datasets and can cluster
faster than other available applications. The methods
available are shown below: |
 |
|
All clustering methods have their own advantages and
disadvantages. For example, some are inherently
faster than others, and some have been found in empirical
experiments to perform better at separating known
active and inactive compounds or at predicting
property values.
More detailed information is given below about
which type might be most suitable for you, but
if you have Digital Chemistry clustering software
you don't need to choose one in preference to another, since all of the methods
shown above are included*:
you can use any or all of them whenever you want. |
| Which Method is Fastest? |
K-means is generally the fastest of the methods available,
with time requirements that increase linearly
with the dataset size, but it can only be used
to generate a single set of clusters, the number
of which is specified by the user (speed also
depends on the number of clusters requested).
Divisive K-means is the next fastest, with n log n
time requirements (where n is the dataset size).
Ward and Group-Average are the slowest, with time
requirements that increase with the square of
the dataset size. For all the hierarchical methods,
the time-consuming part is building the hierarchy.
The separate step of extracting partitions with
the required number of clusters is very fast. |
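To make the top-down idea behind Divisive K-means more concrete, here is a minimal Python sketch that repeatedly splits one cluster into two with a small two-centroid K-means. The source does not describe how Digital Chemistry's method chooses which cluster to split or how it seeds each split, so both choices here are assumptions made for illustration: the cluster with the largest within-cluster sum of squares is split next, and each split is seeded deterministically from the first point and the point farthest from it. The function names and sample data are hypothetical, and for brevity the sketch stops once the requested number of clusters exists rather than building the full hierarchy.

```python
from math import dist


def mean(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    return tuple(sum(x) / len(points) for x in zip(*points))


def sse(points):
    """Within-cluster sum of squared distances to the centroid."""
    centre = mean(points)
    return sum(dist(p, centre) ** 2 for p in points)


def two_means(points, max_iter=50):
    """Split one cluster in two; returns (left, right), right may be empty."""
    # Assumed seeding: the first point and the point farthest from it.
    c0 = points[0]
    c1 = max(points, key=lambda p: dist(p, c0))
    left, right = list(points), []
    for _ in range(max_iter):
        new_left = [p for p in points if dist(p, c0) <= dist(p, c1)]
        new_right = [p for p in points if dist(p, c0) > dist(p, c1)]
        if (new_left, new_right) == (left, right):
            break  # assignments stopped changing
        left, right = new_left, new_right
        if not right:
            break
        c0, c1 = mean(left), mean(right)
    return left, right


def divisive_k_means(points, n_clusters):
    """Split top-down until `n_clusters` clusters exist or nothing can be split."""
    clusters = [list(points)]  # top of the hierarchy: everything in one cluster
    while len(clusters) < n_clusters:
        # Assumed policy: split the cluster with the largest SSE next.
        target = max(clusters, key=sse)
        left, right = two_means(target)
        if not right:
            break  # a singleton or a set of duplicates cannot be split
        clusters.remove(target)
        clusters += [left, right]
    return clusters


# Hypothetical 2-D data forming three well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (20.0, 0.0), (20.0, 1.0)]
clusters = divisive_k_means(points, 3)
for members in clusters:
    print(members)
```

Each split only touches the points in the cluster being divided, which is the intuition behind the n log n behaviour quoted above; this sketch makes no claim to achieve that bound itself.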
| Which Method Gives the Best Clusters? |
Empirical
experiments suggest that the hierarchical methods
generally produce "better" clusters
in terms of their ability to bring compounds with
similar properties into the same cluster, with
Ward's method and Divisive K-means performing
best.
The K-means method depends on randomly selected
centroids for the clusters, which means that different
runs with the same data may produce different results;
it is, however, more appropriate for extremely
large datasets for which the hierarchical methods
are unacceptably slow.
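The dependence on randomly selected centroids can be explored with a minimal Python sketch of textbook (Lloyd-style) K-means. This is an assumption about the general algorithm, not a description of Digital Chemistry's implementation; the `seed` parameter, the helper `as_partition` and the sample points are hypothetical. The script runs the same data with several seeds and counts how many distinct partitions result.

```python
import random
from math import dist


def k_means(points, k, seed, max_iter=100):
    """Lloyd-style K-means with randomly selected starting centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels


def as_partition(labels):
    """Describe a result as a set of index groups, ignoring label order."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, set()).add(index)
    return frozenset(frozenset(g) for g in groups.values())


# Hypothetical 2-D data without a clear-cut three-cluster structure.
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
partitions = {as_partition(k_means(points, k=3, seed=s)[1]) for s in range(20)}
print(len(partitions), "distinct partitions from 20 seeds")
```

Fixing the seed makes a run repeatable, which is one common way of working around this behaviour when results need to be compared.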
* Empirical experiments have shown that all four
methods offered are markedly superior to the
Jarvis-Patrick non-hierarchical method, which became
popular for cheminformatics use in the 1980s; the
Jarvis-Patrick method is therefore no longer offered by Digital Chemistry. |
| Implementing Digital Chemistry Clustering Tools |
| Digital
Chemistry offers the clustering methods listed
below as toolkits, command-line applications and as part of the Clustering Web Service. Versions of Ward's, Group Average and Divisive K-Means are also available to run on parallel processors to speed up the processing of very large datasets. |
- Group Average (Hierarchical, agglomerative)
- Ward's Method (Hierarchical, agglomerative)
- Divisive K-Means (Hierarchical, divisive)
- K-Means (Non-hierarchical)
|
| Also
included are some exclusive tools to help you best
interpret the results of your clustering: |
- Automated partitioning in a clustering hierarchy,
giving you the 'best' clusters and minimising
the need for more specialist knowledge.
- Assignment of new items to existing clusters
without the need for reprocessing. This can
lead to large time savings, especially if you
have a large dataset or compound library that
is being updated regularly with new items or
structures.
|
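One simple way to picture the assignment of new items to existing clusters is a nearest-centroid rule, sketched below in Python. The source does not say how Digital Chemistry's tool makes this assignment, so the rule itself, the function names and the sample data are all assumptions for illustration only.

```python
from math import dist


def centroid(members):
    """Coordinate-wise mean of a cluster's members."""
    return tuple(sum(x) / len(members) for x in zip(*members))


def assign_to_existing(new_items, clusters):
    """Return, for each new item, the index of the nearest cluster centroid."""
    centroids = [centroid(members) for members in clusters]
    return [
        min(range(len(centroids)), key=lambda c: dist(item, centroids[c]))
        for item in new_items
    ]


# Hypothetical existing clusters and newly arrived items.
clusters = [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 5.0), (5.0, 6.0)]]
new_items = [(0.5, 0.2), (4.0, 6.0)]
assignments = assign_to_existing(new_items, clusters)
print(assignments)  # [0, 1]
```

The existing clusters are not rebuilt: each new item costs one distance calculation per cluster, which is where the time saving over re-clustering the whole dataset comes from.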
| Digital Chemistry clustering is available in three
formats: toolkits, command-line applications and
the Clustering Web Service. |
|
| Digital Chemistry also supports the following operating
systems. A full list of hardware and software
requirements for Digital Chemistry products is
linked from this page. |
- Windows
- SUN Solaris
- Linux
|
|
| How to get more Information and Evaluation Software |
| If you would like more information about Digital
Chemistry's software, or would like to request
an evaluation copy, please contact us. |
|