QSAR Modeling useful in Anti-Cancer Drug Discovery: Prediction of V600 EBRAF-Dependent P-ERK using Monte Carlo Method

Quantitative structure−activity relationship (QSAR) modeling is one of the major computer aided modeling employed in medicinal chemistry. It is used for devel - oping relationships between the effects (e.g. activities and properties of interest) of a series of molecules with their structural properties. The aim of this work was to develop QSAR models to predict the inhibition of V600 EBRAF-dependent extracellular regulated kinase (ERK) phosphorylation in WM266.4 melanoma cell lines (IC 50 p ERK) using the CORAL software (http://www.insilico.eu/coral). These models make use of descriptors based on simplified molecular input-line entry system (SMILES), optimized with the Monte Carlo method. The statistical quality of the newly built models was satisfactory on three random splits of data.


Introduction
Mitogen-activated protein kinase (MAPK) modulates extracellular signals to control several cell functions (from proliferation and survival to differentiation and senescence) [1,2] . One of the most studied MAPK pathways is the extracellular signal-regulated kinase (ERK) pathway. ERK is a subgroup of MAPKs that is activated by external factors such as growth factors and mitogens [3] . The cascade of ERK signalling is triggered by the activation of receptor tyrosine kinases and flows through RAS GTPase [4] . RAF kinases are key components of this pathway: once RAS is turned on, it recruits and activates proteins necessary for the propagation of growth factor and other receptor signals, such as RAF. There are three RAF proteins in mammals, ARAF, BRAF, and CRAF, and they can all activate MAP kinase (MEK) just upstream of ERK [5] . The over-expression or mutations of the components of the RAS-RAF-MEK-ERK-MAP kinase pathway has been found to occur in up to 30 % of human cancers [6] . In different cancer types, the mutation of RAS or RAF causes the hyper activation of ERK followed by un-2 most studied inhibitor for this class of kinases. It was approved by FDA in 2005 for the treatment of renal cell carcinoma and in 2007 for the treatment of hepatocellular carcinoma, and is still undergoing multiple clinical trials in other types of cancer [11,12] .
The experimental measurement of the inhibition activity of chemicals is difficult, expensive and time-consuming. QSAR based analysis can be used as a tool to screen or filter anti-cancer drug candidates, before they are subjected to more intensive calculations, such as docking, or to experimental in vitro measurement of activity and finally under in vivo conditions [13] . In this study, a model was built to predict the half maximal inhibitory concentration (IC 50 ) of V600 EBRAF-dependent ERK phosphorylation in WM266.4 cells. This model made use of descriptors based on simplified molecular input-line entry system (SMILES) [14][15][16][17] , calculated with the CORAL software. Then, chemical information obtained from the calculation of these descriptors was used for drawing up hypothesis of molecular design of Sorafenib derivatives.
The aim was to obtain molecules able to inhibit 50% of ERK phosphorylation in WM266.4 melanoma cells at lower concentrations compared to the template molecule.

Method Data
The dataset of 142 chemical structures used in this study was taken from the literature [18,19] . All these compounds shared a common scaffold ( Figure 2). The dataset contained chemical structures represented as SMILES strings, and experimental values relative to the inhibition of the phosphorylation of extracellular signal-regulated kinases (p-ERK). The selected endpoint was IC 50 of V600 EBRAF-dependent [20] ERK phosphorylation in WM266.4 melanoma cells (IC 50 , pERK) [1,14] . IC 50 values were expressed as negative decimal logarithm (pIC 50 ). In this study ACD ⁄ ChemSketch [21] , was used to build up SMILES of each structure.

CORAL and Optimal descriptors
CORAL is freely available standalone application software for building up regression or classification QSAR models based on the Monte Carlo technique [22] . The calculations are repeated several times for various splits into training, calibration and external validation sets. Optimal descriptors for constructing QSAR models are based on graph or SMILES. The molecular graph includes hydrogen suppressed graph (HSG), hydrogen filled graph (HFG), and graph of atomic orbital's (GAO).
The SMILES-based optimal descriptors include a group of local and one of global attributes. Sk, SSk, and SSSk are local descriptors represented by a sequence of atoms and bonds present in the SMILES string [23] .
A more detailed explanation of CORAL descriptors has been provided by [24] . All the SMILES-based local descriptors and BOND, NOSP and HALO global attributes were selected to build the regression models. Data were randomly split three times into training (60%), calibration (20%) and validation sets (21%). Table 1 The training and calibration sets were used for building the QSAR model and the validation set was used as external set for estimating the predictive potential of the developed model [25] . In order to develop a model with a good predictive potential, the preferable parameters of the Monte Carlo optimization, the threshold (T*) and the number of epochs (N*) that give the best statistics for the calibration set should be defined.
The threshold is a tool for classifying codes as both rare (and thus likely less reliable features, probably introducing noise into the model) and common features, which are used by the model and labelled as active. The optimal descriptors are calculated with the correlation weights (CW) only of active features, excluding those related to rare ones.
The Nepoch is the number of cycles (sequence of modifications of correlation weights for all codes involved in model development) for the optimization [26] . For modelling activity of potential p-ERK inhibitors the descriptor of correlation weights (DCW) was calculated as follows: DCW (Threshold, Nepoch) =∑ CW (Sk) + ∑ CW (SSk) + ∑ CW (SSSk) + ∑ CW (BOND) + ∑ CW (HALO) + ∑ CW (NOSP) (1) where CW(SAk) is correlation weight for a molecular or structural feature (SAk) extracted from k-th SMILES; The endpoint pIC 50 is a function of DCW according to the following equation: Endpoint= C0 + C1 x DCW (T, Nepoch) (2) Where C0 and C1 are the intercept and the slope for the training and calibration set, respectively.
The role of molecular features or structural attributes  (4) can be used to classify the active attributes, and the Defect SMILES defines the domain of applicability for SMILES [26] .

Results and discussion
The models for three random splits are the following: Split 01 pIC 50 = 1.4942 (± 0.0237) + 0.0379 (± 0.0001) * DCW (3, 3) (07) Split 02 pIC 50 = -1.0822 (± 0.0222) + 0.0342 (± 0.0001) *DCW (5, 4) (08) Split 03 pIC 50 = -1.8219 (± 0.0236) + 0.05103 (± 0.0002) *DCW (2, 6) (09) Figure 3(a), Figure 3     The values of the parameters of statistical analysis of p-ERK inhibition models built on the three random distributions of data into training, calibration, and validation sets are collected in Table 2. n is the number of compounds in each set; r² is the coefficient of determination; q² is cross validated r 2 ; MAE is mean absolute error; s is the root-mean-square error; F is the Fischer F ratio; T and N are the preferable values for Threshold and Number of epochs, respectively. CR_p^2 indicates if the models developed are obtained by chance or not, based on the Y-randomization test. For an acceptable model, CR_p^2 should be greater than 0.5 [27,28] . r 2 range from 0.78 to 0.92, and q 2 range from 0.75 to 0.92 (for training, calibration and test sets). According to the criteria defined by Tropsha [26][27][28][29] , a model has high predictive power if the following conditions are fulfilled: q² > 0.5 r² > 0.6 (r² -r²0)/r² < 0.1 or (r² -rʹ0²)/r² < 0.1 0.85 ≤ k ≤ 1.15 or 0.85 ≤ kʹ ≤ 1.15 |r0 2 −rʹ0 2 | < 0.3 Where r0² and rʹ0² are squared correlation coefficients for regression through the origin, calculated between predicted versus experimental values and between experimental versus predicted values, k and kʹ are the slopes in the former and later cases respectively; q² is calculated for the training sets, while all other criteria are calculated for the validation sets [26] . Table 3 contains the values for the criteria iii, iv, and v. Y-randomization test for all models showed that these are not chance correlations, since the CR_p^2 is larger than 0.5 ( Table 2). All the models successfully fulfill the criteria proposed by Tropsha for predictive ability. 86 out of 142 dataset's compounds assigned to the training set are used to develop the QSAR models, 28 compounds are included in the calibration, and 28 compounds, not involved in the model development, are kept out for the external validation . Based on r² and q² values, the three models of equations (07), (08) and (09) have similar statistical behaviours. Indicating that the random selection does not affect the results. The predictability of the models was also checked calculating the statistical parameters r m 2 , average r m 2 , and ∆r m 2 on the calibration sets (Table  3), as proposed in the literature [29,30] . According to the literature, a model has predictive potential if r m 2 > 0.5, average r m 2 > 0.5, and ∆r m 2 < 0.2 [31] . The presented results reveal that the predict ability of all models is very good, especially for the model of equation (09), with q² and r m 2 values equal to e 0.92 and 0.88, respectively (Table 3). The numerical values of (r 2 -rʹ0 2 )/r 2 ; (r 2 -r0 2 )/r 2 ; k; kʹ and |r0 2 -rʹ0 2 |.

Split
(r²-r0 2 )/r 2 (r 2 -rʹ0 2 )/r 2 |r0 2 -rʹ0 2 | k kʹ  The list of all SAk, with the correlation weights for the three splits from the Monte Carlo optimization process of the built QSAR model is given in Table S2,  The mechanistic interpretation obtained from the analysis of SAk can be used in the search and computer aided design of novel p-ERK inhibitors candidates with desired pIC 50 values. Figure 4 shows the structures of proposed kinase inhibitors derivatives, the structure Sorafenib was used as a template for molecular design by added the SMILES notation c...S......., BOND10000000 and c...O....... defined as promoters of increase.
The molecule E has a higher value of pIC 50 in comparison to other molecules.   50 values calculated by using model of split 2 for kinase inhibitors derivatives designed by using the results of QSAR model obtained in this study. Supplementary materials section contains technical details of the models for splits examined in this work.

Conclusion
In this work three QSAR models were developed to predict V600 EBRAF-dependent ERK phosphorylation of 142 molecules as kinase inhibitors, using SMILES based optimal descriptors. The models showed acceptable predictive capability on three random split of data. The analysis of the structural features (obtained from SMILES) with their positive and negative correlation weights allowed designing possible pERK inhibitors with an increased activity compared to Surafenib.
The activity modulating effect of the structural features may serve to draw up preliminary hypothesis of novel anticancer drug candidates, and also to provide a better understanding the drugs mechanisms of action. Table S1. Experimental and calculated of pIC50 p-ERK inhibitors for three splits; training (Tr); calibration (C); and validation (V) sets.