1 Introduction

In today’s drug discovery, the number of new drugs discovered each year has not kept pace with the enormously increasing investment in pharmaceutical R&D. A recent report shows that the number of new drugs approved per billion US dollars has halved approximately every 9 years since 1950 (). In response to this, finding new uses for existing drugs (known as drug repositioning) has been proposed, which offers opportunities for faster development times and reduced risk (). This is because the repositioning candidates should already have passed through development stages and efficacy/toxicity tests for their original indications. Many repositioning success stories offer great promise for the feasibility and effectiveness of the drug repositioning strategy. For example, GlaxoSmithKline received approval to market bupropion hydrochloride branded as WELLBUTRIN® for depression in 1985 and as ZYBAN® for smoking cessation in 1997 (). Although repositioning existing drugs for alternative indications is not new, it is only recently that large-scale computational methods are being developed and used ().

Computational drug repositioning has become a new frontier (; ; ) in today’s drug discovery research. Recent methods focus on systematically exploring novel drug-disease therapeutic relationships from large-scale molecular data, such as transcriptomics, genome-wide association study (GWAS), and target screening data. For instance, with the availability of the Connectivity Map (CMap) (), which is a comprehensive reference collection of ranked gene expression profiles produced by different drug candidates, several approaches have been developed to leverage such drug molecular information. Iorio et al. used gene expression profiles of drugs in the CMap to compute drug pairwise similarity and the resulting drug-drug network to explore repositioning opportunities for known drugs (). Hu and Agarwal () compared the gene expression profiles of drugs with those of diseases and identified the correlation/anti-correlation between drugs and diseases. They further showed that the anti-correlation relationships in the resulting disease-drug network can suggest new therapeutic uses for existing drugs. In addition to the genomic data, other drug-related information has also been investigated in similarity-based approaches, which assume that similar drugs are indicated for similar diseases. For instance, Campillos et al. () used drug adverse effects to identify novel drug-target relationships (off-target interactions), which further connected drugs to new uses. Li et al. () integrated disease, gene/protein and drug connectivity information based on protein interaction networks and literature mining. Chiang and Butte () presented a ‘Guilt by Association’ (GBA) approach to predict novel drug uses based on the known treatment relationships between drugs and diseases. Gottlieb et al. () developed a computational method called PREDICT where the drug pairwise similarity was measured by similarities of chemical structures, side effects, and drug targets. 
These computed similarities were then used as features of a logistic regression classifier for predicting the novel associations between drugs and diseases. Li et al. () built a causal network (CauseNet)–a layered drug-target-pathway-gene-disease causal inference network–to identify new therapeutic uses of existing drugs.

In this paper, we describe our previous approach () in more detail for identifying new uses of an existing drug through its relationship to similar drugs (see Figure 1), along with additional experimental results. More specifically, we represent the relationships between drugs and their target proteins as a bipartite graph. As shown in Figure 1, drug d1 is known for treating disease s1 and d2 for s2. If the drug pair (d1, d2) obtains a high similarity score, we predict that they can be repositioned into each other’s therapeutic area. That is, drug d1 is predicted for disease s2 treatment and d2 for s1. In order to validate our predictions, we perform a cross-validation experiment by comparing the predicted drug-disease pairs against known drug uses. In addition, we search for evidence in both published biomedical literature and current clinical trials to support our predictions.

Figure 1 

Overview of the bipartite graph-based method for drug repositioning.

Our method is most closely related to the GBA and PREDICT approaches mentioned above in that all three identify a drug’s potential new uses through its similarity to existing drugs. Unlike the GBA approach, which relied on a single measure, both PREDICT and our method use multiple features (e.g., drug chemical structure and target profile) in computing drug pairwise similarity. More importantly, unlike PREDICT and other similarity-based methods, we adopted a novel bipartite graph-based method when considering common drug target proteins and their interaction information. This method assumes that two objects are similar if they are related to similar objects. By applying it to our data, we are able to boost target similarity by making use of the corresponding interaction information and to obtain target similarity scores for drug pairs in cases where no common targets can be found. In other words, this approach allows us to take advantage of information, such as indirect protein interactions, that is implicitly embedded in complex biological systems. Lastly, we differ from both approaches with respect to the dataset assembled. The drug-disease connections in this study were obtained only from public sources (GBA used a private source) and include various kinds of diseases (only OMIM diseases were included in PREDICT).

2 Methods

In our method, a drug’s potential new indications are identified via its similar drugs. For example, two drugs dx and dy are found to be similar, and dx is known to be used for treating disease s; thus dy is potentially useful for disease s treatment. In drug discovery, a drug’s chemical structure and its target profile are two important features evidently associated with its therapeutic use. Hence, when computing pairwise similarity between a drug pair dx and dy, we combine the similarities of their chemical structures SIMchem(dx, dy) and target profiles SIMtarget(dx, dy).

2.1 Similarity of Drug Chemical Structures

Our method for calculating the drug chemical structure similarity SIMchem(dx, dy) is based on the 2D chemical fingerprint descriptor of each drug’s chemical structure in PubChem (; ). That is, each drug is represented by a binary fingerprint f(dx) in which each bit indicates the presence of a predefined chemical structure fragment. The pairwise chemical similarity between two drugs dx and dy is computed as the Tanimoto coefficient of their fingerprints:

(1)
$$\mathrm{SIM}_{\mathrm{chem}}(d_x, d_y) = \frac{f(d_x) \cdot f(d_y)}{|f(d_x)| + |f(d_y)| - f(d_x) \cdot f(d_y)}$$

where |f(dx)| and |f(dy)| are the numbers of structure fragments present in drugs dx and dy, respectively, and f(dx) · f(dy), the dot product of the two fingerprints, is the number of structure fragments shared by the two drugs.
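As an illustration of Eq. (1), the following minimal Python sketch computes the Tanimoto coefficient for two binary fingerprints. The function name `tanimoto`, the 8-bit toy fingerprints, and the choice to return 0 when both fingerprints are empty are assumptions made for this example; they are not real PubChem fingerprints and not part of the original pipeline, which used PubChem's own similarity service.

```python
def tanimoto(fx, fy):
    """Tanimoto coefficient of two equal-length binary fingerprints (Eq. 1)."""
    if len(fx) != len(fy):
        raise ValueError("fingerprints must have the same length")
    shared = sum(a & b for a, b in zip(fx, fy))  # dot product f(dx) . f(dy)
    total = sum(fx) + sum(fy) - shared           # |f(dx)| + |f(dy)| - shared
    # Assumption: two empty fingerprints are given similarity 0.
    return shared / total if total else 0.0


# Toy 8-bit fingerprints (hypothetical, for illustration only).
f_dx = [1, 1, 0, 1, 0, 0, 1, 0]
f_dy = [1, 0, 0, 1, 0, 1, 1, 0]
print(tanimoto(f_dx, f_dy))  # 3 shared bits / (4 + 4 - 3) = 0.6
```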

2.2 Similarity of drug target profiles

Our method for calculating the drug target similarity SIMtarget(dx, dy) is based on both common target proteins and interactions between target proteins. The relationships between drugs and their target proteins can be represented as a bipartite graph G(V, E):

The node set of graph G, V(G) = {D, P}, consists of two types of object (i.e., the drug set D and protein set P).

The edge set of graph G, E(G) ⊆ D × P, consists of relationships between drugs and their target proteins.

Given a drug d, its target protein set is noted as P(d). Likewise, a protein’s linked drug set is noted as D(p).

Figure 2(A) shows an example bipartite graph, where there are four drugs D = {d1, d2, d3, d4}, two proteins P = {p1, p2}, and five links (proteins p1 and p2 are the targets of drugs {d1, d2} and {d2, d3, d4} respectively). In this example, P(d1) = {p1}, P(d2) = {p1, p2}, P(d3) = {p2}, and P(d4) = {p2}; while D(p1) = {d1, d2} and D(p2) = {d2, d3, d4}.
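The example graph of Figure 2(A) can be written down directly as two lookup tables, one per direction of the bipartite graph. This is only a sketch of one possible representation; the variable names `edges`, `P`, and `D` mirror the notation above but are otherwise choices made for this example.

```python
from collections import defaultdict

# Drug-target edges of the example graph in Figure 2(A).
edges = [("d1", "p1"), ("d2", "p1"), ("d2", "p2"), ("d3", "p2"), ("d4", "p2")]

P = defaultdict(set)  # P[d]: target proteins of drug d
D = defaultdict(set)  # D[p]: drugs linked to protein p
for drug, protein in edges:
    P[drug].add(protein)
    D[protein].add(drug)

print(sorted(P["d2"]))  # ['p1', 'p2']
print(sorted(D["p2"]))  # ['d2', 'd3', 'd4']
```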

Figure 2 

Bipartite graph models for computing drug target similarity.

Based on the bipartite graph, drug target similarity SIMtarget(dx, dy) can be computed by counting the number of common proteins shared by two drugs, i.e., |P(dx, dy)| where P(dx, dy) = P(dx) ∩ P(dy). Figure 2(B) shows a bipartite graph G’ where drug pairs are only connected if they share common target proteins. This is not ideal because no target protein stands alone in biological systems.

For better capturing the interactions between target proteins, we derived a bipartite graph model G2(V2, E2):

The node set of graph G2, V2 = {D2, P2} = {D × D, P × P}, consists of two types of object (i.e., the drug pair and protein pair). Let R(dx, dy) and R(pa, pb) represent the similarity of drug pair and protein pair, respectively.

The edge set of graph G2, E2 ⊆ D2 × P2, represents connections between drug pairs and protein pairs, which are derived from the edges in the original bipartite graph G.

Figure 2(C) shows the bipartite graph G2 derived from G. The drug pair set contains all possible combinations of any two drugs including self-pairs (e.g., {d1, d1} and {d2, d2}). Similarly, the protein pair set contains all possible protein combinations. An edge exists in G2 between a drug pair {d1, d2} and protein pair {p1, p2} if and only if their respective edges <d1, p1> and <d2, p2> exist in G. Such a G2 graph can capture a common target via edges between non-self-drug pairs and self-protein pairs (e.g., the edge between {d1, d2} and {p1, p1}). Also, it can capture the interaction information between two proteins via the node set of protein pairs.
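The edge rule for G2 can be sketched in a few lines of Python. The helper `pair` and the representation of unordered pairs as sorted tuples are assumptions made for this example; the edge list is the toy graph of Figure 2(A).

```python
from itertools import product

# Drug-target edges of the example graph G in Figure 2(A).
edges = [("d1", "p1"), ("d2", "p1"), ("d2", "p2"), ("d3", "p2"), ("d4", "p2")]


def pair(a, b):
    """Unordered pair as a sorted tuple, so (d2, d1) and (d1, d2) coincide."""
    return tuple(sorted((a, b)))


# Connect drug pair {da, db} to protein pair {pa, pb} whenever
# <da, pa> and <db, pb> are both edges of G.
g2_edges = {
    (pair(da, db), pair(pa, pb))
    for (da, pa), (db, pb) in product(edges, repeat=2)
}

print((("d1", "d2"), ("p1", "p1")) in g2_edges)  # True: shared target p1
print((("d1", "d3"), ("p1", "p2")) in g2_edges)  # True: d1-p1 and d3-p2
print((("d1", "d3"), ("p1", "p1")) in g2_edges)  # False: d3 does not target p1
```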

Given the G2 graph model, we can iteratively compute the pairwise similarity of drug pairs R2k+1(dx, dy) and protein pairs R2k+2(pa, pb) as follows:

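(2)
$$
\begin{cases}
R_{2k+1}(d_x, d_y) = \dfrac{1}{|P(d_x)|\,|P(d_y)|} \displaystyle\sum_{i=1}^{|P(d_x)|} \sum_{j=1}^{|P(d_y)|} R_{2k}\big(P_i(d_x), P_j(d_y)\big) \\[2ex]
R_{2k+2}(p_a, p_b) = \dfrac{1}{|D(p_a)|\,|D(p_b)|} \displaystyle\sum_{i=1}^{|D(p_a)|} \sum_{j=1}^{|D(p_b)|} R_{2k+1}\big(D_i(p_a), D_j(p_b)\big)
\end{cases}
$$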

As can be seen in equation (2), the drug pairwise similarity R2k+1(dx, dy) depends on the similarities of protein pairs that are connected to the drug pair (dx, dy) in the G2 graph. In turn, the protein pairwise similarity R2k+2(pa, pb) also depends on the drug pairwise similarities. The iterative calculation is initialized with the protein pairwise similarity R0(pa, pb):

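(3)
$$
R_0(p_a, p_b) =
\begin{cases}
1 & \text{if } a = b \\
0.5 & \text{if } p_a \text{ interacts with } p_b \text{ when } a \neq b \\
0 & \text{otherwise}
\end{cases}
$$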

where R0(pa, pb) is set as 1 if the pair is self-paired (i.e., a = b) and is set as 0.5 if protein pa interacts with pb.

To demonstrate our G2 graph model, we use the example data in Figure 2 and assume the two proteins p1 and p2 interact with each other. In Table 1, we compare SIMtarget(dx, dy) computed with the proposed G2 method against the simple method of counting the number of common target proteins |P(dx, dy)| and its variant using Pearson’s correlation. As can be seen, using either Pearson’s correlation or the proposed G2 method allows one to capture and compare the different strengths of drug similarity. For instance, both methods find that the drug pair (d3, d4) has the highest similarity, as the two drugs share exactly the same target protein p2. Moreover, the G2 method is able to take into account the fact that p1 interacts with p2 and produces nonzero similarity scores for the two drug pairs (d1, d3) and (d1, d4), which are assigned zero similarity by the two other methods.

Table 1

Comparison of target similarities calculated by different methods.

Method        (d1, d2)  (d1, d3)  (d1, d4)  (d2, d3)  (d2, d4)  (d3, d4)
|P(dx, dy)|   1.00      0.00      0.00      1.00      1.00      1.00
Pearson       0.71      0.00      0.00      0.71      0.71      1.00
R1(dx, dy)    0.75      0.50      0.50      0.75      0.75      1.00
R3(dx, dy)    0.85      0.71      0.71      0.85      0.85      1.00
R5(dx, dy)    0.91      0.83      0.83      0.91      0.91      1.00
R7(dx, dy)    0.95      0.90      0.90      0.95      0.95      1.00

Also, one can see from Table 1 that the similarity scores in our G2 method increase monotonically as k becomes larger. This is because of the propagation of similarities from protein pairs to drug pairs and vice versa. For example, the fact that proteins p1 and p2 share one common drug d2 contributes to the similarity of the protein pair (p1, p2), which in turn further increases the similarity between drugs that share target proteins p1 and/or p2.
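The iteration of Eqs. (2) and (3) on the Figure 2 example can be sketched in Python as follows. The function names are hypothetical. The sketch also makes one assumption that Eq. (2) does not state explicitly: self-pairs such as (d2, d2) and (p1, p1) are kept at similarity 1 in every iteration. With this assumption the sketch gives the same values as the R1 to R7 rows of Table 1; it is not necessarily the implementation used on the full data set.

```python
from itertools import product

# Figure 2(A) example: targets of each drug and drugs of each target.
P = {"d1": {"p1"}, "d2": {"p1", "p2"}, "d3": {"p2"}, "d4": {"p2"}}
D = {"p1": {"d1", "d2"}, "p2": {"d2", "d3", "d4"}}
interactions = {frozenset(("p1", "p2"))}  # p1 is assumed to interact with p2


def r0(pa, pb):
    """Initial protein-pair similarity, Eq. (3)."""
    if pa == pb:
        return 1.0
    return 0.5 if frozenset((pa, pb)) in interactions else 0.0


def propagate(neighbors, prev):
    """One half-step of Eq. (2): average `prev` over all neighbor pairs."""
    new = {}
    for x, y in product(neighbors, repeat=2):
        if x == y:
            new[x, y] = 1.0  # assumption: self-pairs stay at 1
            continue
        pairs = list(product(neighbors[x], neighbors[y]))
        new[x, y] = sum(prev[a, b] for a, b in pairs) / len(pairs) if pairs else 0.0
    return new


def drug_similarity(k):
    """Return R_{2k+1} for all ordered drug pairs."""
    r_prot = {(a, b): r0(a, b) for a, b in product(D, repeat=2)}  # R_0
    r_drug = propagate(P, r_prot)                                 # R_1
    for _ in range(k):
        r_prot = propagate(D, r_drug)  # next even step (protein pairs)
        r_drug = propagate(P, r_prot)  # next odd step (drug pairs)
    return r_drug


r5 = drug_similarity(2)
print(round(r5["d1", "d2"], 2), round(r5["d1", "d3"], 2))  # 0.91 0.83
```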

In theory, the similarity of drug target profiles should be calculated as:

(4)
$$\mathrm{SIM}_{\mathrm{target}}(d_x, d_y) = \lim_{k \to \infty} R_{2k+1}(d_x, d_y)$$

Because the iteration converges rapidly and the relative rankings stabilize, as discussed in Jeh and Widom (), we set SIMtarget(dx, dy) = R5(dx, dy) when performing this iterative method on large-scale real data.

2.3 Drug Pairwise Similarity

The final drug pairwise similarity SIM(dx, dy) score is derived by summing up the weighted chemical similarity and target similarity as shown in Eq. (5), which readily integrates drug chemical structure, drug target, and target interaction in one score ranging from 0 to 1.

(5)
$$\mathrm{SIM}(d_x, d_y) = (1 - \lambda)\,\mathrm{SIM}_{\mathrm{chem}}(d_x, d_y) + \lambda\,\mathrm{SIM}_{\mathrm{target}}(d_x, d_y)$$

where λ (0 < λ < 1) is a predefined constant for weighting the target similarity.
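As a small worked example of Eq. (5), the sketch below combines the two similarity scores reported for the Fluoxetine/Citalopram pair in Table 2 (SIMchem = 0.66, SIMtarget = 0.53) using λ = 0.8, the value selected in Section 4.2. The function name `combined_similarity` is hypothetical, and the check accepts the closed interval [0, 1] as an assumption so that the single-feature settings λ = 0 and λ = 1 mentioned in the Discussion can also be evaluated.

```python
def combined_similarity(sim_chem, sim_target, lam=0.8):
    """Weighted sum of Eq. (5); lam weights the target similarity."""
    # Assumption: the endpoints 0 and 1 are allowed (single-feature baselines).
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * sim_chem + lam * sim_target


score = combined_similarity(0.66, 0.53)
print(round(score, 3))  # 0.2 * 0.66 + 0.8 * 0.53 = 0.556
```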

2.4 Evaluation of Repositioning Candidates

To assess our method, we first compare the repositioning candidates and their predicted uses with their known uses extracted from the National Drug File-Reference Terminology (NDF-RT). Second, we look for evidence supporting our predictions in published literature and in ongoing investigations. More specifically, given a drug dx and its predicted uses Sx = {sx1, sx2, … }, we search for occurrences of the drug-disease pair (dx, sxi) in PubMed and ClinicalTrials.gov. For literature validation, we require the drug-disease pair (dx, sxi) to be co-mentioned in more than two PubMed abstracts. For trial validation, if the drug-disease pair (dx, sxi) is co-mentioned in a clinical trial, we conclude that the drug dx is being investigated for disease sxi.

3 Materials

The essential information involved in our study includes approved drug uses, drug chemical structures, target proteins, and protein interactions. We collected and integrated all these different types of information from publicly accessible resources.

3.1 Approved Drug List and Target Protein Information

From DrugBank (), a widely used public database of drug data, we collected 1,007 approved small-molecule drugs with their corresponding target protein information. Furthermore, we mapped these drugs to several other key drug resources, including RxNorm, PubChem (; ), and UMLS, in order to extract other drug-related information. For instance, we extracted chemical structures of the 1,007 drugs from PubChem and used its Score Matrix Service to calculate chemical similarity scores for the 1,007 × 1,007 drug pairs. To facilitate collecting target protein information, we mapped target proteins to the UniProt Knowledgebase (), a central knowledge base that provides comprehensive information on proteins. In the end, we extracted 3,152 relationships between 1,007 drugs and 775 proteins.

3.2 Drug-Disease Treatment Relationships

We obtained a drug’s known use(s) by extracting treatment relationships between drugs and diseases from the National Drug File-Reference Terminology (NDF-RT), which is part of the NLM’s Unified Medical Language System (UMLS). One issue with the NDF-RT data set is that it does not manage drug name variants. For instance, the disease ‘Breast Neoplasm’ can be treated by the drugs ‘Tamoxifen’, ‘FULVESTRANT 50MG/ML INJ, SYRINGE, 5ML’, and ‘CAPECITABINE 150MG TAB’. We overcame this issue by normalizing the various drug names to their active ingredients and subsequently mapping ingredient names to unique concept identifiers in UMLS. As a result, the normalized treatment relationships in the above example were ‘Tamoxifen’-‘Breast Neoplasm’, ‘Fulvestrant’-‘Breast Neoplasm’, and ‘Capecitabine’-‘Breast Neoplasm’. From the normalized NDF-RT data set, we were able to extract therapeutic uses for 799 of the 1,007 drugs, which constituted a gold standard set of 3,250 treatment relationships between 799 drugs and 719 diseases.

3.3 Protein-Protein Interactions

We extracted protein-protein interaction information from the Human Protein Reference Database (HPRD) (), which contains curated proteomic information pertaining to human proteins. In this study, we used 39,240 binary interactions between 9,673 human proteins in HPRD.

4 Results

4.1 Drug Pairs Known for the Same Therapeutic Uses

In this study, we built our method on the basis that similar drugs are indicated for similar diseases and conditions. To confirm this and to show the strength of our proposed method in boosting the target similarity, we took 177 cardiovascular drugs from our data (e.g., ‘Doxazosin’ and ‘Terazosin’ are known to treat hypertension) and compared their pairwise chemical/target similarities with those of 4,000 randomly selected drug pairs. In Figure 3A, we show the chemical similarity (SIMchem) and target similarity (SIMtarget computed by Pearson’s correlation) for the 4,066 drug pairs known for treating cardiovascular diseases. As a comparison, we show in Figure 3B the similarities of 4,000 randomly selected drug pairs. It is clear that compared to the random pairs, the drug pairs with similar therapeutic uses have significantly enriched chemical similarity and target similarity (t-test P value < 2.2 × 10^-16).

Figure 3 

Scatter plots of SIMchem and SIMtarget computed by Pearson’s correlation.

In addition to using Pearson’s correlation for computing the target similarity, we show in Figure 4A and 4B two similar scatter plots using the proposed G2 method. By comparison, we can see that our method significantly boosts the SIMtarget of drugs with the same therapeutic uses (t-test P value = 7.67 × 10^-8) (Figure 4A vs. 3A) while having no significant effect on random pairs (t-test P value = 0.21) (Figure 4B vs. 3B). This suggests that our G2 method works selectively by only boosting the similarity of related drugs.

Figure 4 

Scatter plots of (SIMchem) and (SIMtarget) computed by our G2 method.

4.2 Cross Validation Using Known Drug Uses

To assess our method in predicting novel indications, we used the known therapeutic uses of 799 drugs as the gold standard (see Section 3.2). For each drug, we removed its known uses and attempted to recover them through its top N similar drugs found by our method. To show the performance over the entire dataset of 799 drugs, we plotted ROC curves using both sensitivity and specificity. Five plots are shown in Figure 5, each of which represents a different strategy for measuring the drug pairwise similarity: 1) the number of overlapping target proteins (|P(dx, dy)|), 2) Pearson’s correlation of drug targets (Pearson), 3) drug target similarity using our G2 method (SIMtarget), 4) chemical structure similarity alone (SIMchem), and 5) the linear combination of SIMchem and SIMtarget (with λ = 0.8 determined empirically). We calculated overall sensitivity and specificity trade-offs by varying N, the number of similar drugs, from 1 to 798. As can be seen, our combination method achieved the best performance with an area under the ROC curve (AUC) of 0.888, better than relying on drug target profile (best AUC = 0.876) or chemical structure similarity (AUC = 0.852) alone. Furthermore, we see that when only using drug target profiles, the performance of our G2 method was substantially higher (AUC of 0.876) than using Pearson’s correlation (AUC of 0.842) or simply counting the overlap (AUC of 0.838). Such results suggest that our method is able to better capture interactions between target proteins through iteratively propagating similarities from protein pairs to drug pairs and vice versa.
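The cross-validation procedure can be sketched as follows for a single value of N. The toy drugs, diseases, and similarity scores are hypothetical. The way counts are pooled over drugs (a predicted use is the union of the known uses of the top N neighbors, and true/false positives and negatives are summed across all drugs) is an assumption made for this illustration; the text above does not spell out these details.

```python
def evaluate(sim, known_uses, n):
    """Leave-one-drug-out sensitivity and specificity for top-n neighbors."""
    diseases = set().union(*known_uses.values())
    tp = fp = fn = tn = 0
    for drug, truth in known_uses.items():
        # Rank all other drugs by similarity to the held-out drug.
        others = [d for d in known_uses if d != drug]
        top = sorted(others, key=lambda d: sim[frozenset((drug, d))], reverse=True)[:n]
        # Predicted uses: union of the known uses of the top-n similar drugs.
        predicted = set().union(*(known_uses[d] for d in top))
        tp += len(predicted & truth)
        fn += len(truth - predicted)
        fp += len(predicted - truth)
        tn += len(diseases - predicted - truth)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: three drugs, three diseases.
known_uses = {"a": {"s1"}, "b": {"s1", "s2"}, "c": {"s3"}}
sim = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.1,
    frozenset(("b", "c")): 0.2,
}
print(evaluate(sim, known_uses, n=1))  # (0.5, 0.4)
```

Repeating the call for N = 1, 2, ... and plotting sensitivity against 1 − specificity would trace out an ROC curve of the kind shown in Figure 5.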

Figure 5 

ROC curves of different drug-pairwise similarity strategies.

We compared our method with the guilt-by-association (GBA) () and PREDICT () methods. The GBA approach assumes that if two diseases share similar therapies, then other drugs that are currently used for only one of the two diseases may also be therapeutic for the other. We applied the GBA approach to the 799 drugs and their known uses in NDF-RT in our data. The GBA approach obtained a sensitivity of 0.74 and specificity of 0.85, which would lie below the red ROC curve if plotted in Figure 5. By comparison, the best cut-off point on the red curve (our combination method) corresponds to a sensitivity of 0.77 and specificity of 0.92 (N = 20). Not only does our method outperform the GBA approach, it is also able to rank its prediction results (the GBA approach cannot), an important feature for prioritizing drug repositioning candidates in practice. Gottlieb et al. () evaluated drug use prediction through cross validation on a gold standard set of 1933 associations between 593 drugs and 313 diseases. As reported in Gottlieb et al. (), they obtained an AUC of 0.90. For direct comparison, we applied our method to their data and achieved a comparable AUC of 0.89.

4.3 Evidence Validation Using Clinical Trials and Biomedical Literature

After cross validation, we further evaluated the validity of our novel drug use prediction by searching the predicted drug-disease pairs against the trials in ClinicalTrials.gov and scientific abstracts in PubMed. For example, given a drug ‘Fluoxetine’, our method would predict 6 indications based on its most similar drug ‘Citalopram’. Two of the predicted uses are known uses (i.e., ‘Depressive Disorder’ and ‘Obsessive-Compulsive Disorder’), thus leaving the other 4 as novel predictions: ‘Alcoholism’, ‘Diabetic Neuropathies’, ‘Tobacco Use Disorder’, and ‘Dementia’. When searching for their evidence, we found that the ‘Alcoholism’ use is indicated in a clinical trial (NCT00027378), which was conducted to study Fluoxetine in treating adolescents with alcohol use disorder and major depression and that the other three uses have been investigated with study results published in the literature (; ; ). Table 2 shows 5 examples of novel drug uses predicted by our method and similar drugs supporting these predictions in our method as well as their supporting statements in trials/publications.

Table 2

Examples of repositioned drugs predicted by our method.

Drug: Fluoxetine
Novel use: Alcoholism
Similar drugs supporting this prediction: Citalopram; SIMchem = 0.66; SIMtarget = 0.53 (Common target P31645 (Sodium-dependent serotonin transporter))
Evidence in clinical trial/literature: NCT00027378: Study fluoxetine (Prozac) versus a placebo in the treatment of adolescents with alcohol use disorder and major depression

Drug: Ramipril
Novel use: Rheumatoid Arthritis
Similar drugs supporting this prediction: Enalapril; SIMchem = 0.92; SIMtarget = 1 (Common target P12821 (Angiotensin-converting enzyme))
Evidence in clinical trial/literature: NCT00273533: Evaluate that Ramipril improves vascular function and reduces markers of low-grade chronic inflammation and oxidative stress in patients with Rheumatoid Arthritis

Drug: Sildenafil
Novel use: Brain Ischemia
Similar drugs supporting this prediction: Pentoxifylline; SIMchem = 0.38; SIMtarget = 0.19 (Common target O76074 (cGMP-specific 3’,5’-cyclic phosphodiesterase))
Evidence in clinical trial/literature: PMID 20436396: Study the therapeutic effect of sildenafil citrate on cerebral vasospasm in rat model

Drug: Carbidopa
Novel use: Prostatic Neoplasms
Similar drugs supporting this prediction: Flutamide; SIMchem = 0.42; SIMtarget = 0.33 (Interaction between Carbidopa’s target P20711 (Aromatic-L-amino-acid decarboxylase) and Flutamide’s target P10275 (Androgen receptor))
Evidence in clinical trial/literature: PMID 16895983: Treatment of nude mice containing prostate neuroendocrine cancer with carbidopa plus amiloride and flumazenil leading to significant reductions in tumor growth

Drug: Pimecrolimus
Novel use: Graft-versus-host disease (GVHD)
Similar drugs supporting this prediction: Tacrolimus; SIMchem = 0.89; SIMtarget = 0.78 (Common target P62942 (FK506-binding protein 1A); plus interaction between Pimecrolimus’s target P42345 (Serine/threonine-protein kinase mTOR) and the common target P62942)
Evidence in clinical trial/literature: PMID 20723118: Treatment of disfiguring chronic GVHD in a child with topical pimecrolimus: case report

With λ = 0.8 and N = 20 (the setting that gave the best performance in the cross-validation experiments), our method predicted 30,872 novel indications for the 1,007 drugs. Of these predicted novel uses, 8,564 (~30%) can be found in the literature. In addition, 1,340 of these predictions can be found in clinical trials. In fact, our predicted uses are 5 times more likely to be found in a trial than drug uses not predicted by our method (chi-square test P value < 2.2 × 10^-16). Hence, we conclude that the novel uses predicted by our method are significantly enriched in both scientific literature and clinical trials.

5 Discussion and Conclusion

Computational drug repositioning offers promise for discovering new uses of existing drugs as drug-related molecular, chemical, and clinical information has increased over the past decade and become broadly accessible. In this study, our method was developed based on the hypothesis that a drug can be repositioned to another drug’s therapeutic area if the two drugs share similar molecular and/or chemical properties. We confirmed this by comparing drug pairs with similar therapeutic uses vs. randomly selected pairs, as shown in Figures 3 and 4. From the same set of figures, we can also see that although target similarity somewhat correlates with chemical similarity (correlation coefficients ~ 0.3), many drug pairs with similar therapeutic uses share common targets but do not have similar chemical structures and vice versa. This suggests that either similarity may play its own role in finding similar drugs. Indeed, as shown by our results (Figure 5), using either one can already result in good performance. Moreover, Figure 5 shows that a relatively higher AUC score was obtained with λ = 1 (i.e., using only target similarity) vs. λ = 0 (i.e., using only chemical structure similarity), which suggests weighting the former higher than the latter when combining the two. Indeed, we found empirically that the best performance was achieved when λ was set to 0.8 on our data, confirming our belief that the two similarities can complement each other in identifying similar drugs.

According to Figure 5, overall the proposed bipartite graph based method produced significantly better results than the baseline method of considering the overlap of common drug targets (our AUC = 0.876 vs. overlap AUC = 0.838). In particular, when no common target protein exists between two drugs, this method became critical in establishing the target similarity. For instance, as shown in Table 2, the predicted drug use (Prostatic Neoplasms) for ‘Carbidopa’ would not be found if only common target proteins were considered.

Our method shares some of the same limitations as other drug repositioning methods. First, our method relies on existing knowledge of drug-disease, drug-target, and protein-protein relationships. Unfortunately, such information is currently incomplete in existing resources, thus limiting the predictive power of our method. Second, like any similarity-based approach, our method would fail to identify any reusable drugs for a disease if no current treatment is available for that disease. This is because our predicted indications are based on the known uses of other drugs. Lastly, in this work we limit our method to approved small molecules with known target proteins. Hence, this excludes drugs that are not small molecules (e.g., Rituximab) or whose protein targets are not yet known (e.g., Mannitol).

In conclusion, we developed a systematic method for mining potential new drug indications by exploring both chemical and molecular features in similar drugs. The proposed bipartite graph model successfully boosted target similarity by iteratively integrating explicit evidence (common target proteins shared by drugs) and implicit evidence (common drugs shared by target proteins). Furthermore, we found evidence from the literature and clinical trials for many of the novel indications predicted by our method. Note that with significantly fewer features, we were able to obtain similar results to PREDICT. It is possible that adding additional features such as side effects, gene sequences, and disease phenotypes could further improve our performance. We plan to investigate this issue in future work.