Diversity and similarity analysis

Diversity sorting - sorts a data set by diversity with a very fast algorithm (full sorting of 200,000 compounds in 12 hours)
Compound selection - selects compounds from an external database so as to maximize diversity
Similarity calculations - for each compound in an initial data set, calculates similarities to a selected data set

Similarity

Similarity is a quantitative measure of the resemblance between two molecules. To calculate similarity, ChemDBsoft divides a molecule into structural fragments (screens). The more structural fragments two molecules have in common, the higher their similarity. The fragments are spherical environments calculated for each atom in a molecule, up to a depth of two bonds. For example, for the molecule CH3-CH2-CH(OH)-CH3 the following screens are generated:

CH3-; -CH2-; -CH-; -OH; CH3CH2-; CH3CH2CH-; -CH2CH(OH)CH3; CH3CH-; CH(OH)

Bond types and ring sizes are taken into account when screens are calculated: a CH3-CH2 fragment is not the same as CH2=CH, and CH2 groups in 5- and 6-membered rings are different screens.
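The idea of atom-centred environments can be illustrated with a minimal Python sketch. It is not ChemDBsoft's actual screen format: the function name atom_environments, the tuple encoding and the toy molecule tables are assumptions made for illustration, hydrogens are left implicit and ring sizes are ignored.

```python
def atom_environments(atoms, bonds, max_depth=2):
    """Return the set of atom-centred environments up to max_depth bonds.

    atoms: list of element symbols (heavy atoms only, an assumption).
    bonds: dict mapping (i, j) atom index pairs to a bond order.
    Each environment is a hashable tuple:
    (centre element, depth, sorted (shell depth, bond order, element) items).
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for (i, j), order in bonds.items():
        neighbours[i].append((j, order))
        neighbours[j].append((i, order))

    screens = set()
    for centre in range(len(atoms)):
        screens.add((atoms[centre], 0, ()))      # the bare atom, depth 0
        seen = {centre}
        frontier = [centre]
        shell_items = []
        for depth in range(1, max_depth + 1):
            next_frontier = []
            for a in frontier:
                for b, order in neighbours[a]:
                    if b not in seen:
                        seen.add(b)
                        next_frontier.append(b)
                        shell_items.append((depth, order, atoms[b]))
            if not next_frontier:
                break                            # no atoms at this depth
            screens.add((atoms[centre], depth, tuple(sorted(shell_items))))
            frontier = next_frontier
    return screens


# Heavy-atom skeleton of CH3-CH2-CH(OH)-CH3: C0-C1-C2(-O4)-C3
butan_2_ol = atom_environments(
    ["C", "C", "C", "C", "O"],
    {(0, 1): 1, (1, 2): 1, (2, 3): 1, (2, 4): 1},
)
ethane = atom_environments(["C", "C"], {(0, 1): 1})
ethene = atom_environments(["C", "C"], {(0, 1): 2})
print(len(butan_2_ol), ethane == ethene)
```

Because the bond order is part of each shell item, the single-bonded and double-bonded two-carbon fragments produce different environment sets, which mirrors the CH3-CH2 versus CH2=CH distinction.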

Screens are stored in an internal database. If a new screen occurs during a calculation, it is added to the database automatically. The number of screens used for similarity calculations is unlimited.
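A screen store that grows automatically might look like the following Python sketch. The class ScreenDictionary and its method names are hypothetical and only illustrate the behaviour described; the internal database format of ChemDBsoft is not documented here.

```python
class ScreenDictionary:
    """Hypothetical stand-in for the internal screen database."""

    def __init__(self):
        self._index = {}

    def index_of(self, screen):
        # A screen seen for the first time gets the next free index;
        # a known screen returns the index it was given earlier.
        return self._index.setdefault(screen, len(self._index))

    def __len__(self):
        return len(self._index)


screens = ScreenDictionary()
ids = [screens.index_of(s) for s in ["CH3-", "-CH2-", "-CH-", "-OH", "CH3CH2-"]]
again = screens.index_of("-OH")   # already stored, so no new entry is created
print(ids, again, len(screens))
```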

Suppose the internal database contains N screens, which together fully describe the data set of interest. Each compound in the data set can then be represented as an N-dimensional vector V whose elements are 1 if a screen is present in the compound and 0 if it is absent. The similarity R(K,M) of two compounds K and M in the data set is calculated as the cosine between their vectors:

$$R(K,M) = \frac{\sum_{i=1}^{N} V_i(K)\,V_i(M)}{\sqrt{\sum_{i=1}^{N} V_i(K)\,\cdot\,\sum_{i=1}^{N} V_i(M)}}$$

The similarity between compounds K and M equals zero if the two structures have no screens in common, and one if all their screens are common.
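For binary vectors the sums in the cosine formula reduce to counts of screens, so the measure can be sketched in Python with plain sets. The function name cosine, the example screen sets and the handling of an empty set are assumptions made for illustration.

```python
import math


def cosine(screens_k, screens_m):
    """Cosine between two binary screen vectors, given as sets of screens."""
    if not screens_k or not screens_m:
        return 0.0  # assumption: an empty screen set shares nothing
    common = len(screens_k & screens_m)
    return common / math.sqrt(len(screens_k) * len(screens_m))


a = {"CH3-", "-CH2-", "-OH", "CH3CH2-"}
b = {"CH3-", "-CH2-", "-NH2", "CH3CH2-"}
c = {"C6H5-", "-Cl"}

print(cosine(a, b))  # three of four screens shared: 3 / sqrt(4 * 4) = 0.75
print(cosine(a, a))  # all screens common: 1.0
print(cosine(a, c))  # no screens common: 0.0
```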

The Tanimoto similarity coefficient:

$$R(K,M) = \frac{\sum_{i=1}^{N} V_i(K)\,V_i(M)}{\sum_{i=1}^{N} V_i(K) + \sum_{i=1}^{N} V_i(M) - \sum_{i=1}^{N} V_i(K)\,V_i(M)}$$

is frequently used instead of the cosine measure. It has been shown (J.D. Holliday, S.S. Ranade, P. Willett; Quant. Struct.-Act. Relat. 14, 501-506 (1995)) that the cosine and Tanimoto measures give very highly correlated similarities.
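The two measures can be compared side by side on the same screen sets. The following Python sketch uses hypothetical function names and example sets; it only illustrates the two formulas and is not a reproduction of the cited study.

```python
import math


def cosine(k, m):
    """Cosine measure for two sets of screens."""
    if not k or not m:
        return 0.0  # assumption: an empty screen set shares nothing
    return len(k & m) / math.sqrt(len(k) * len(m))


def tanimoto(k, m):
    """Tanimoto coefficient for two sets of screens."""
    if not k and not m:
        return 0.0  # assumption: two empty sets are treated as dissimilar
    common = len(k & m)
    return common / (len(k) + len(m) - common)


a = {"CH3-", "-CH2-", "-OH", "CH3CH2-"}
b = {"CH3-", "-CH2-", "-NH2", "CH3CH2-"}
c = {"C6H5-", "-Cl"}

for x, y in [(a, a), (a, b), (a, c)]:
    print(round(cosine(x, y), 3), round(tanimoto(x, y), 3))
```

Both measures give 1 for identical screen sets and 0 for sets with nothing in common. In between, the Tanimoto value is never larger than the cosine value for binary vectors (0.6 against 0.75 for the pair a, b), while the ordering of pairs is similar.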

ChemDBsoft uses the cosine measure throughout, because very fast algorithms exist for cosine-based similarity and diversity calculations.

The Similarity to dataset is defined as:

$$R = \frac{1}{K}\sum_{j=1}^{K} R_j$$

i.e. the average of the similarities R_j between the current structure and each of the K structures in the data set.
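As a worked illustration, the average can be computed directly from pairwise cosine similarities. This is a minimal Python sketch; the function names and the example screen sets are assumptions.

```python
import math


def cosine(k, m):
    """Cosine measure for two sets of screens."""
    if not k or not m:
        return 0.0  # assumption: an empty screen set shares nothing
    return len(k & m) / math.sqrt(len(k) * len(m))


def similarity_to_dataset(structure, dataset):
    """Average similarity of one structure to all K structures of a data set."""
    if not dataset:
        raise ValueError("the data set must contain at least one structure")
    return sum(cosine(structure, other) for other in dataset) / len(dataset)


a = {"CH3-", "-CH2-", "-OH", "CH3CH2-"}
b = {"CH3-", "-CH2-", "-NH2", "CH3CH2-"}
c = {"C6H5-", "-Cl"}

# Pairwise similarities of a to (a, b, c) are 1, 0.75 and 0, so R = 1.75 / 3.
print(similarity_to_dataset(a, [a, b, c]))
```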

Diversity

Diversity is a property of a dataset that characterizes the Similarity (8.3.2), or rather the dissimilarity, of the molecules it contains.

The dissimilarity of a pair of structures I and J (the diversity of the two compounds) is defined as:

D(I,J)=1-R(I,J)

where R is the similarity. A value of 1 for D(I,J) means totally different compounds; a value of 0 means identical compounds.

A Diversity matrix can be created for a data set of N compounds:

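$$|D| = \begin{pmatrix} 0 & D_{21} & D_{31} & \cdots & D_{N1} \\ D_{12} & 0 & D_{32} & \cdots & D_{N2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ D_{1N} & D_{2N} & D_{3N} & \cdots & 0 \end{pmatrix}$$

The matrix is symmetric (Dij = Dji). Its diagonal elements are 0, since D(I,I) = 1 - R(I,I) = 0 for a compound compared with itself.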

The Diversity of the dataset (or simply Diversity) is defined as:

$$D = \frac{\sum_{i \neq j} D_{ij}}{N(N-1)}$$

i.e. as the sum of the off-diagonal elements divided by the number of such elements. D ranges from zero (all compounds in the data set are identical) to one (no pair of compounds in the data set has any screen in common).
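The definition can be checked on a small example with a minimal Python sketch. The function names and screen sets are assumptions; because the matrix is symmetric, the sketch averages over unordered pairs, which gives the same value as summing all off-diagonal elements and dividing by N(N-1).

```python
import math
from itertools import combinations


def cosine(k, m):
    """Cosine measure for two sets of screens."""
    if not k or not m:
        return 0.0  # assumption: an empty screen set shares nothing
    return len(k & m) / math.sqrt(len(k) * len(m))


def dataset_diversity(dataset):
    """Mean pairwise dissimilarity D = 1 - R over all pairs i != j."""
    n = len(dataset)
    if n < 2:
        return 0.0  # assumption: diversity is undefined for fewer than two compounds
    total = sum(1.0 - cosine(x, y) for x, y in combinations(dataset, 2))
    return total / (n * (n - 1) / 2)


a = {"CH3-", "-CH2-", "-OH", "CH3CH2-"}
b = {"CH3-", "-CH2-", "-NH2", "CH3CH2-"}
c = {"C6H5-", "-Cl"}

# Pair dissimilarities: D(a,b) = 0.25, D(a,c) = 1, D(b,c) = 1, so D = 2.25 / 3.
print(dataset_diversity([a, b, c]))
```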

In ChemDBsoft, Diversity is calculated with the Diversity... command in the Tools item of the main menu.

Sorting according to similarity to structure

The command calculates the similarities (8.4.2) of the records in a dataset to a given structure, selected in the same or another dataset, and then sorts the records. It is executed when the icon is dragged from the source dataset window and dropped into the blinking region of the destination dataset (the same or another one), and the radio button Sort structures in <dataset 1> according to similarity of selected compound in <dataset 2> is selected in the Comparison type selection dialog (fig. 8-1) that then appears. The similarities of all compounds in the target database to the current structure in the source database are calculated. When the command finishes, the records in the destination database are sorted in descending order of similarity.

The similarity values are shown in the second column (called Match) of the table in the destination database window.

Sorting according to similarity to dataset

This command is executed when the icon is dragged from the source database window and dropped into the blinking region of the destination database (the same or another one), and the radio button Sort structures in <Dataset 1> according to similarity of <dataset 2> is selected in the Comparison type selection dialog (fig. 8-1) that then appears. The Similarities to dataset (8.4.2) of all compounds in the target database with respect to the source database are calculated. When the command finishes, the records in the destination database are sorted in descending order of similarity (fig. 8-9).

The procedure is useful for selecting compounds that might have a certain activity. Suppose one has a database of compounds with a particular kind of activity. Executing the command selects from the target database the compounds whose structures are most similar to the source database of active compounds.
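The ranking step can be sketched in Python as follows. The function names, the example "active" and "target" screen sets and the returned (index, match) pairs are assumptions for illustration; the sketch only shows sorting by average similarity in descending order.

```python
import math


def cosine(k, m):
    """Cosine measure for two sets of screens."""
    if not k or not m:
        return 0.0  # assumption: an empty screen set shares nothing
    return len(k & m) / math.sqrt(len(k) * len(m))


def rank_by_similarity_to_dataset(target, source):
    """Return (target index, match) pairs, most similar to `source` first."""
    if not source:
        raise ValueError("the source data set must not be empty")
    matches = [
        (i, sum(cosine(t, s) for s in source) / len(source))
        for i, t in enumerate(target)
    ]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Hypothetical source database of active compounds
actives = [
    {"CH3-", "-CH2-", "-OH", "CH3CH2-"},
    {"CH3-", "-CH2-", "-NH2", "CH3CH2-"},
]
# Hypothetical target database to be sorted
target = [
    {"C6H5-", "-Cl"},
    {"CH3-", "-CH2-", "-OH"},
    {"CH3-", "-Cl"},
]

ranking = rank_by_similarity_to_dataset(target, actives)
for index, match in ranking:
    print(index, round(match, 3))
```

The target compound that shares the most screens with the active set comes first, and the one that shares none comes last with a match of 0.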

Diversity sorting

The goal of Diversity calculation and sorting is to select a maximally diverse subset of a given size from a large pool of candidate molecules. The diverse data set can be used for screening, to reduce the cost of testing or selecting compounds.

The diversity sorting algorithm is:

1. From the initial data set, select the compound that is most dissimilar (8.4.2) to all other compounds.
2. Calculate the diversity (8.4.3) of the data set with the remaining compounds.
3. Select from the remaining compounds the structure that is most dissimilar to the selected data set (8.4.2).
4. Repeat steps 2-3 until all compounds are exhausted or the user aborts diversity sorting.

During this procedure the diversity calculated from the diversity matrix remains maximal for the number of compounds selected so far.
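One possible reading of the steps above is a greedy selection, sketched below in Python. This is an interpretation, not ChemDBsoft's fast algorithm: the function names are hypothetical, "most dissimilar to the selected data set" is taken to mean the lowest summed cosine similarity to the compounds already selected, and ties are broken by the lowest index.

```python
import math


def cosine(k, m):
    """Cosine measure for two sets of screens."""
    if not k or not m:
        return 0.0  # assumption: an empty screen set shares nothing
    return len(k & m) / math.sqrt(len(k) * len(m))


def diversity_sort(dataset):
    """Return the indices of `dataset` in greedy diversity order."""
    n = len(dataset)
    if n == 0:
        return []
    # Step 1: start with the compound least similar to all the others.
    totals = [
        sum(cosine(dataset[i], dataset[j]) for j in range(n) if j != i)
        for i in range(n)
    ]
    first = min(range(n), key=totals.__getitem__)
    order = [first]
    # Summed similarity of every remaining compound to the selected set.
    sim_to_selected = {
        i: cosine(dataset[i], dataset[first]) for i in range(n) if i != first
    }
    # Steps 2-4: repeatedly take the compound most dissimilar to the selection.
    while sim_to_selected:
        nxt = min(sim_to_selected, key=lambda i: (sim_to_selected[i], i))
        del sim_to_selected[nxt]
        order.append(nxt)
        for i in sim_to_selected:
            sim_to_selected[i] += cosine(dataset[i], dataset[nxt])
    return order


data = [
    {"A", "B", "C"},
    {"A", "B", "D"},
    {"A", "B", "C", "D"},
    {"X", "Y"},
]
order = diversity_sort(data)
print(order)
```

Each pass over the remaining compounds costs time proportional to their number, so the whole sort needs a number of similarity evaluations that grows with the square of the data set size.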

Note that if a data set contains duplicate structures whose screens occur rarely, the duplicates can be selected at the beginning of diversity sorting as "good" compounds. To avoid this problem, remove duplicates from the data set before diversity sorting.

The command is executed by dragging the icon from the database window and dropping it into the blinking area of the SAME database. Then select the command Diversity sorting of... in the Comparison type selection dialog (fig. 8-1) that appears.

Diversity sorting is a time-consuming procedure. The time required for the calculation grows as the square of the size of the data set to be sorted. Calculation time for 100,000 compounds is approximately 4 hours (Pentium II, 500 MHz). In on-line mode, the size of the subset is set by pressing the Cancel button in the Job control window: the subset of most diverse compounds is created when the Yes button is pressed in the dialog confirming the Cancel command. If the calculation is not aborted, it produces the full data set sorted by diversity.

The result of the calculation is the database sorted by diversity value and its graphical representation, the Diversity Plot (fig. 8-12).

It shows the diversity profile versus the number of selected compounds.

The usual XY-diagram operations are available: Zoom, Scroll, the Cursor position box and the right-button popup menu. The profile can, for example, be copied to the clipboard and inserted into a Microsoft Word document.

Double-clicking on the plot sets the database pointer to the record corresponding to the X-value of the mouse cursor.

The number of screens displayed is the total number of structural screens (8.4.2) in the data set from which compounds are selected. It can be used as a measure of the diversity of the dataset.

Selection of most diverse compounds

Suppose you want to expand a large stock database with the most dissimilar compounds from a tested dataset. The problem to be solved is to calculate how the diversity of the stock dataset changes as new compounds are added one by one, in order of dissimilarity to the stock. The result of the procedure is a sorted test data set whose first records are the most dissimilar to the stock database.

The compound selection algorithm is:

1. Calculate the diversity of the stock data set with each tested compound added.
2. Select from the tested compounds the structure that is most dissimilar to the stock data set.
3. Repeat steps 1-2 until all compounds of the test dataset are exhausted or the user cancels diversity sorting.
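The selection loop, together with the diversity curve it produces, can be sketched in Python as follows. This is an illustrative interpretation under stated assumptions: the function names and example data are hypothetical, the most dissimilar compound is taken to be the one with the lowest summed cosine similarity to the stock plus the compounds already added, and ties are broken by the lowest index.

```python
import math


def cosine(k, m):
    """Cosine measure for two sets of screens."""
    if not k or not m:
        return 0.0  # assumption: an empty screen set shares nothing
    return len(k & m) / math.sqrt(len(k) * len(m))


def select_for_stock(stock, candidates):
    """Order candidates by dissimilarity to the growing stock.

    Returns (order, curve): candidate indices in selection order and the
    diversity of stock plus selected compounds after each addition.
    """
    pool = list(stock)
    # Sum of similarities over all unordered pairs already in the pool.
    pair_sum = sum(
        cosine(pool[i], pool[j]) for i in range(len(pool)) for j in range(i)
    )
    # Summed similarity of each candidate to the current pool.
    sim_to_pool = {
        i: sum(cosine(c, s) for s in pool) for i, c in enumerate(candidates)
    }
    order, curve = [], []
    while sim_to_pool:
        nxt = min(sim_to_pool, key=lambda i: (sim_to_pool[i], i))
        pair_sum += sim_to_pool.pop(nxt)
        pool.append(candidates[nxt])
        n = len(pool)
        # Diversity = 1 - mean pairwise similarity of the enlarged pool.
        curve.append(1.0 - pair_sum / (n * (n - 1) / 2) if n > 1 else 0.0)
        order.append(nxt)
        for i in sim_to_pool:
            sim_to_pool[i] += cosine(candidates[i], candidates[nxt])
    return order, curve


stock = [{"A", "B", "C"}, {"A", "B", "D"}]
tested = [{"A", "B", "C", "D"}, {"X", "Y"}, {"A", "X"}]
order, curve = select_for_stock(stock, tested)
print(order, [round(d, 3) for d in curve])
```

The returned curve plays the role of the diversity profile: one diversity value of the enlarged stock per added compound.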

When calculation and sorting have finished, the Diversity plot is shown, presenting the change in the stock's diversity as a function of the number of compounds added from the tested dataset. There may be three kinds of diversity curves:

Type 1 – the new data set has low diversity relative to the stock, and it is recommended to reject it.

Type 2. Two cases are possible:

a) The new data set is highly diverse: diversities are large both among the compounds of the new data set and relative to each compound in the existing stock. This case is very rare; the whole new data set may be used.

b) The initial stock is not diverse. For example, it contains 15,000 compounds, 10,000 of which are benzhydrazone derivatives. Adding any compound that is not a benzhydrazone derivative will increase diversity. In this case the full stock content cannot be used for selecting new compounds. One needs to create a representative selection of the stock by diversity sorting and select the initial compounds with diversity 0.8-0.85.

Type 3 – the usual curve for a diverse stock. Depending on the task, one can select only the compounds up to the maximum of the diversity curve, or as many compounds as keep the diversity unchanged. For large stocks, only a very small fraction (of the order of 1%) of the proposed new compounds will increase diversity.

Note that if the data set contains structures that duplicate stock structures and whose screens occur rarely, the duplicates can be selected at the beginning of diversity sorting as "good" compounds. To avoid this problem, remove the duplicates from the proposed data set before compound selection.

To select the compounds most diverse relative to the stock database:

1. Find and remove duplicates in the stock database.
2. Make a representative dataset for the stock by diversity sorting.
3. Create a new database and import the proposed SDF file.
4. Find and remove duplicates in the tested database.
5. Drag the icon from the tested database and drop it into the blinking area of the stock.
6. Select the radio button Diversity sorting of <proposed database>, the dataset in <stock> being used as already selected one in the Comparison type selection dialog (fig. 8-1) that appears.
7. The calculation can be aborted by pressing the Cancel button in the Job control window. In this case a subset of the most diverse compounds can be created if the Yes button is pressed in the dialog confirming the Cancel command (fig. 8-11).


TimTec LLC 9270 Bay Plaza Boulevard Suite 606 Tampa, Florida 33619 T: 302-292-8500 F: 302-292-8520
