## Machine Learning

In the development of a machine-learning model to predict binding affinity, for instance, the goal is to determine the relative weights (β_{j}) of the explanatory variables so that the predicted values (f_{i}) come as close as possible to the experimental values (y_{i}). In equation 1 below, we have the response variable (f) expressed as a function of the explanatory variables (x_{j}),

$$f(x_1,...,x_N)= \beta_0 + \sum_{j=1}^N\beta_jx_j \text{ (Eq. 1).}$$

where N indicates the number of explanatory variables and β_{0} represents the regression constant.

#### Ordinary Linear Regression

Among the supervised machine learning techniques, the oldest method is ordinary linear regression. The idea behind ordinary linear regression is to minimize the cost function known as the residual sum of squares (RSS). Some authors call this cost function the sum of squared residuals (SSR) (Bell, 2014; Bruce and Bruce, 2017). Below we have the equation for the RSS,

$$RSS= \sum_{i=1}^M(y_i-f(x_1,...,x_N))^2 \text{ (Eq. 2).}$$

where M is the number of observations, y_{i} is the experimental value, and f_{i} is the predicted value. The RSS is the sum of the squared differences between the experimental values (y_{i}) and the predicted values (f_{i}). The regression method optimizes the weights (β_{j}) in equation (1) to minimize the RSS.
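As a minimal, self-contained sketch of this idea (the data below are synthetic, not from SAnDReS), scikit-learn's `LinearRegression` fits the weights of equation 1 by minimizing the RSS of equation 2:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: M = 50 observations, N = 3 explanatory variables,
# generated from known weights beta = (2.0, -1.0, 0.5) and beta_0 = 1.5
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 1.5 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

model = LinearRegression()  # minimizes the RSS of Eq. 2
model.fit(X, y)

print(model.intercept_)  # estimate of beta_0 (close to 1.5)
print(model.coef_)       # estimates of beta_1..beta_3

# RSS of the fitted model
rss = np.sum((y - model.predict(X)) ** 2)
```

With noise-free data the recovered weights would match the generating weights exactly; here they are close, and the residual RSS reflects only the added noise.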

#### Least Absolute Shrinkage and Selection Operator (Lasso)

The Lasso method adds a term involving the sum of the absolute values of the relative weights to the RSS equation (Tibshirani, 1996), as indicated below,

$$RSS= \sum_{i=1}^M(y_i-f(x_1,...,x_N))^2+\lambda_1\sum_{j=1}^N|\beta_j| \text{ (Eq. 3).}$$

In equation 3, the term λ_{1} ≥ 0 is a coefficient responsible for controlling the strength of the penalty: the larger the value of λ_{1}, the greater the shrinkage. We call this additional term, added to the original RSS equation, the penalty term. The Lasso method carries out L1 regularization. It can generate sparse models with fewer coefficients when compared with the ordinary linear regression method; some coefficients can be exactly zero. When we increase the penalty, the coefficient values move closer to zero. This behavior is ideal for producing models with fewer explanatory variables.
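This sparsity effect can be sketched with scikit-learn's `Lasso` on synthetic data (its `alpha` parameter plays the role of λ_{1}, up to a constant scaling of the objective):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
# Only the first two of the five variables actually matter
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.5)  # alpha corresponds to lambda_1 in Eq. 3
lasso.fit(X, y)
print(lasso.coef_)  # coefficients of the irrelevant variables shrink to exactly 0
```

The L1 penalty drives the three irrelevant coefficients to exactly zero, while the two informative coefficients survive (shrunken toward zero).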

#### Ridge

In the Ridge method (Tikhonov, 1963), we follow the same principle of adding a penalty term to the original RSS expression (equation 2). The penalty term takes the form of a sum of the squared weights, as indicated below,

$$RSS= \sum_{i=1}^M(y_i-f(x_1,...,x_N))^2+\lambda_2\sum_{j=1}^N|\beta_j|^2 \text{ (Eq. 4).}$$

In the above equation, λ_{2}≥ 0 is the regularization parameter. The Ridge method performs L2 regularization.
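A sketch of the contrast with Lasso, using scikit-learn's `Ridge` on the same kind of synthetic data (here `alpha` plays the role of λ_{2}):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=10.0)  # alpha corresponds to lambda_2 in Eq. 4
ridge.fit(X, y)
print(ridge.coef_)  # all coefficients shrink toward 0, but none reaches exactly 0
```

Unlike the L1 penalty, the L2 penalty shrinks every weight proportionally, so Ridge does not produce sparse models.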

#### Elastic Net

The idea behind the Elastic Net method is to combine the Lasso and the Ridge regression methods (Zou and Hastie, 2005), as indicated below,

$$RSS= \sum_{i=1}^M(y_i-f(x_1,...,x_N))^2+\lambda_1\sum_{j=1}^N|\beta_j|+\lambda_2\sum_{j=1}^N|\beta_j|^2 \text{ (Eq. 5).}$$

In the above equation, the terms λ_{1} ≥ 0 and λ_{2} ≥ 0 are the two regularization parameters.
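A sketch with scikit-learn's `ElasticNet`; note that scikit-learn parameterizes the combined penalty through `alpha` (overall strength) and `l1_ratio` (mix between the L1 and L2 terms), so the mapping to λ_{1} and λ_{2} of equation 5 differs only by constant scaling factors:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# l1_ratio=0.5 gives equal weight to the L1 and L2 penalty terms
enet = ElasticNet(alpha=0.5, l1_ratio=0.5)
enet.fit(X, y)
print(enet.coef_)  # irrelevant coefficients are driven to (or very near) zero
```

The L1 component still produces sparsity, while the L2 component stabilizes the estimates when explanatory variables are correlated.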

#### SAnDReS for Machine Learning

The use of machine-learning methods to study biological systems is not new. For instance, we can find applications of artificial neural networks as old as 1985 (Nanard & Nanard, 1985). Considering the application of supervised machine-learning techniques to the prediction of ligand-binding affinity, we have studies dating back to 1994 (Hirst *et al*., 1994a; Hirst *et al*., 1994b).

So, what is new about SAnDReS? SAnDReS (Xavier *et al*., 2016) makes use of supervised machine-learning techniques to generate polynomial equations to predict ligand-binding affinity, which allows improvement of native scoring functions. SAnDReS allows training a model to make it specific for a biological system. Let us consider the HIV-1 Protease system (Pintro & de Azevedo, 2017): we could take a standard scoring function, such as the PLANTS score (Korb *et al*., 2009), and fine-tune its terms to predict log(Ki) for the HIV-1 Protease (Pintro & de Azevedo, 2017). We could say that we are integrating computational systems biology and machine-learning techniques to improve the predictive power of scoring functions, which gives you the flexibility to test different scenarios for the biological system you are interested in.

*Schematic diagram illustrating the development of a target-based scoring function to predict log(Ki) for the HIV-1 Protease (Pintro & de Azevedo, 2017).*

We could think that we have the Protein Sequence Space (Smith, 1970) and the Chemical Space with all potential binders to elements of the Protein Sequence Space. SAnDReS (Xavier *et al*., 2016) allows the construction of a third space, which we call the Scoring Function Space (Heck *et al*., 2017), where we find infinite mathematical functions to predict ligand-binding affinity. SAnDReS applies machine-learning techniques to explore this Scoring Function Space, finding the function that predicts the experimental binding affinity as closely as possible.

SAnDReS (Xavier *et al*., 2016) has a flexible interface that allows testing the predictive power of regression models generated by machine-learning techniques such as Linear Regression, Least Absolute Shrinkage and Selection Operator (Lasso), Ridge, Elastic Net, Stochastic Gradient Descent Regressor, and Support Vector Regression. All these methods are available from the scikit-learn library (Pedregosa *et al*., 2011) and implemented as an intuitive workflow in SAnDReS.

The SAnDReS (Xavier *et al*., 2016) project has over 25,000 lines of Python code and is able to automatically carry out docking simulations using AutoDock4 (Morris *et al.,* 1998), AutoDock Vina (Trott & Olson, 2010), and MolDock (Thomsen & Christensen, 2006) without any worries about input files. But the soul of the program is its machine-learning box, which allows you to build a target-based scoring function for the biological system you are interested in. SAnDReS uses the scikit-learn library (Pedregosa *et al*., 2011) to build hundreds of polynomial equations where the explanatory variables are taken from the original dataset, and determines the relative weight for each explanatory variable in the following polynomial equation,

$$f(x_1,...,x_N)= log(K) = \alpha_0 + \sum_{i=1}^N\alpha_ix_i+ \sum_{i=1}^{N-1}\sum_{j>i}^N\beta_{ij}x_ix_j+\sum_{i=1}^N\omega_ix_i^2$$

where α_{i}, β_{ij}, and ω_{i} are the relative weights for the explanatory variables (x_{i}, x_{j}), and f(x_{1}, x_{2},...,x_{N}) is the response variable. N is the number of explanatory variables and α_{0} is the regression constant. The term log(K) represents the log of the inhibition constant (K).

Taking N = 3, we have the following polynomial equation:

$$f(x_1,x_2,x_3) = log(K) = \alpha_0 + \alpha_1x_1 + \alpha_2x_2 + \alpha_3x_3 + \beta_{12}x_1x_2 + \beta_{13}x_1x_3 + \beta_{23}x_2x_3 + \omega_1x_1^2+ \omega_2x_2^2 + \omega_3x_3^2$$

Considering that the above equation has 9 candidate terms besides the constant α_{0} (each treated as an independent variable), we have a total of 2⁹ − 1 = 511 possible polynomial equations. We do not consider the equation log(K) = α_{0}.
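The enumeration behind this count can be sketched with the standard library (a toy count, not the SAnDReS implementation):

```python
from itertools import combinations

# With N = 3 explanatory variables, the polynomial above has
# 3 linear + 3 cross + 3 squared = 9 candidate terms besides alpha_0.
terms = ["x1", "x2", "x3", "x1*x2", "x1*x3", "x2*x3", "x1^2", "x2^2", "x3^2"]

# Every non-empty subset of the 9 terms defines one candidate equation
n_equations = sum(1 for k in range(1, len(terms) + 1)
                  for _ in combinations(terms, k))
print(n_equations)  # 2**9 - 1 = 511
```

Each subset of terms corresponds to one polynomial model whose weights are then fitted by the regression methods described above.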

**References**

Bell, J. Machine Learning. Hands-On for Developers and Technical Professionals; John Wiley and Sons: Indianapolis, 2015.

Bruce, P.; Bruce, A. Practical Statistics for Data Scientists. 50 Essential Concepts; O’Reilly Media: Sebastopol, 2017.

Heck GS, Pintro VO, Pereira RR, de Ávila MB, Levin NMB, de Azevedo WF. Supervised Machine Learning Methods Applied to Predict Ligand-Binding Affinity. Curr Med Chem. 2017; 24(23): 2459–70.

Hirst JD, King RD, Sternberg MJ. Quantitative structure-activity relationships by neural networks and inductive logic programming. I. The inhibition of dihydrofolate reductase by pyrimidines. J Comput Aided Mol Des. 1994a; 8(4): 405–20.

Hirst JD, King RD, Sternberg MJ. Quantitative structure-activity relationships by neural networks and inductive logic programming. II. The inhibition of dihydrofolate reductase by triazines. J Comput Aided Mol Des. 1994b; 8(4): 421–32.

Korb O, Stützle T, Exner TE. Empirical scoring functions for advanced protein-ligand docking with PLANTS. J Chem Inf Model. 2009; 49(1): 84–96.

Morris G, Goodsell D, Halliday R, Huey R, Hart W, Belew R, Olson A. Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function. J Comput Chem. 1998; 19: 1639–62.

Nanard M, Nanard J. A user-friendly biological workstation. Biochimie. 1985; 67(5): 429–32.

Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Verplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011; 12: 2825–30.

Pintro VO, de Azevedo WF. Optimized Virtual Screening Workflow. Towards Target-Based Polynomial Scoring Functions for HIV-1 Protease. Comb Chem High Throughput Screen. 2017; 20(9): 820–27.

Smith JM. Natural selection and the concept of a protein space. Nature. 1970; 225(5232): 563–4.

Thomsen R, Christensen MH. MolDock: a new technique for high-accuracy molecular docking. J Med Chem. 2006; 49: 3315–21.

Tibshirani R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Series B Stat. Methodol. 1996; 58(1): 267–88.

Tikhonov AN. On the regularization of ill-posed problems. Dokl. Akad. Nauk SSSR. 1963; 153: 49–52 (Russian). MR 0162378.

Trott O, Olson AJ. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem. 2010; 31(2): 455–61.

Xavier MM, Heck GS, de Avila MB, Levin NM, Pintro VO, Carvalho NL, de Azevedo WF Jr. SAnDReS a Computational Tool for Statistical Analysis of Docking Results and Development of Scoring Functions. Comb Chem High Throughput Screen. 2016; 19(10): 801–12.

Zou H, Hastie T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Series B Stat. Methodol. 2005; 67(2): 301–20.