Clustering, the grouping of a dataset based on
the similarities between compound structures, is
discussed elsewhere under Digital Chemistry
Clustering Tools. Its counterpart is Diversity
Analysis, the analysis of a dataset based on
measures of dissimilarity.
Diversity Analysis includes the calculation of
the structural or property diversity of datasets.
It can be used in diverse subset selection methods
that enable the extraction of subsets of maximally
dissimilar compounds from a dataset.
This is of particular importance in constructing
combinatorial libraries and for biological screening
programmes. Diversity Analysis is therefore widely
applied in the pharmaceutical industry, where
one needs to select a small subset of compounds
that best covers a very large dataset. Synthesising
only this small number of compounds to investigate
potential drug candidates across a vast chemical
space saves considerable cost and time.
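
The diverse subset selection described above can be illustrated with a minimal Python sketch. It assumes a greedy MaxMin strategy and Tanimoto dissimilarity on fingerprints stored as sets of on-bit indices. Neither choice is stated in the text, and the function names `tanimoto` and `maxmin_select` are hypothetical, so this is one possible approach rather than the method used by any particular tool.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def maxmin_select(fingerprints, k, seed_index=0):
    """Greedy MaxMin selection (assumed strategy, not taken from the text).

    Starting from a seed compound, repeatedly add the compound whose
    nearest already-selected neighbour is farthest away.
    Returns the indices of the selected compounds in pick order.
    """
    if not fingerprints or k <= 0:
        return []
    selected = [seed_index]
    seed = fingerprints[seed_index]
    # min_dist[i]: dissimilarity from compound i to its closest selected compound
    min_dist = [1.0 - tanimoto(fp, seed) for fp in fingerprints]
    target = min(k, len(fingerprints))
    while len(selected) < target:
        remaining = [i for i in range(len(fingerprints)) if i not in selected]
        candidate = max(remaining, key=lambda i: min_dist[i])
        selected.append(candidate)
        # Update nearest-selected distances with the newly added compound
        picked = fingerprints[candidate]
        for i, fp in enumerate(fingerprints):
            d = 1.0 - tanimoto(fp, picked)
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy dataset: three closely related compounds and one outlier (index 2)
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}]
subset = maxmin_select(fps, k=2)
```

With this toy dataset the second pick is the outlier at index 2, since it shares no bits with the seed compound.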
Diversity Analysis is assisted by the generation
of two general dataset representations, the centroid
fingerprint and the modal fingerprint. The centroid
fingerprint is a form of average fingerprint for
the whole dataset and can be used to calculate
a measure of the average dataset dissimilarity
and also to calculate the change in average dataset
dissimilarity if two datasets were to be merged. The
modal fingerprint is only applicable to datasets
in which the compounds are represented by fingerprints
and is very useful in analysing the frequency
of incidence of each element of a fingerprint
across the whole dataset.
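
The two dataset representations can be sketched in Python under some stated assumptions. Here the centroid fingerprint is taken to be the per-bit mean of binary fingerprints, and the modal fingerprint is taken to be the set of bits whose frequency reaches a chosen threshold. The text does not give exact definitions, so both are assumptions. The average dissimilarity function further assumes the cosine coefficient, for which the mean pairwise value can be obtained from a summed (centroid-like) vector of normalised fingerprints without comparing every pair. All function names are hypothetical.

```python
import math


def centroid_fingerprint(fingerprints):
    """Per-bit mean of equal-length binary fingerprints (assumed definition)."""
    n = len(fingerprints)
    return [sum(bits) / n for bits in zip(*fingerprints)]


def modal_fingerprint(fingerprints, threshold=0.5):
    """Bits set in at least `threshold` of the compounds (assumed definition)."""
    return [1 if f >= threshold else 0 for f in centroid_fingerprint(fingerprints)]


def mean_cosine_dissimilarity(fingerprints):
    """Mean pairwise cosine dissimilarity, computed from a summed vector.

    Uses the identity sum_{i,j} cos(i, j) = |sum of unit vectors|^2,
    so no explicit pairwise loop is needed. Assumes at least two
    fingerprints and no all-zero fingerprint.
    """
    n = len(fingerprints)
    if n < 2:
        raise ValueError("need at least two fingerprints")
    total = [0.0] * len(fingerprints[0])
    for fp in fingerprints:
        norm = math.sqrt(sum(b * b for b in fp))
        if norm == 0:
            raise ValueError("all-zero fingerprint has no cosine similarity")
        for j, b in enumerate(fp):
            total[j] += b / norm
    squared_length = sum(t * t for t in total)
    # Remove the n self-similarities (each equal to 1), then average
    mean_similarity = (squared_length - n) / (n * (n - 1))
    return 1.0 - mean_similarity


dataset = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 1, 1, 0]]
centroid = centroid_fingerprint(dataset)   # per-bit frequencies
modal = modal_fingerprint(dataset)         # bits present in half or more
diversity = mean_cosine_dissimilarity(dataset)
```

Under the same assumptions, the effect of merging two datasets could be estimated by calling `mean_cosine_dissimilarity` on each dataset and on their concatenation, then comparing the values.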