Analyzing the Effects of Pretreatment Diversity on HCV Drug Treatment Responsiveness Using Bayesian Partition methods.

Traditional therapies for Hepatitis C Virus (HCV) often yield unsatisfactory results. The reason for this may lie in the mechanism of drug resistance of the HCV virus. Despite doing a plain vanilla comparison between the treated and untreated groups, this paper takes a detour and investigates the drug resistance mechanism to interferon plus ribavirin combined therapy by comparing pretreatment sequence data between response and non-response patients in the NS5A region for genotype 1a HCV virus. We use Bayesian probabilistic models to detect single mutation or mutation combinations, and infer interaction structures between these mutations, to investigate the drug resistance combinations differences between those patients. We hope to decipher, at least partially, the reason behind the unsatisfactory results received from interferon plus ribavirin therapy.


AUTHOR SUMMARY
HCV treatment results have been historically suboptimal[1-3]. HCV drug resistance, which further hinders the treatment effects, is caused by mutations of viral proteins that disrupt the drugs' binding but do not affect the viral survival. Due to the high rate and low fidelity of HCV replication, resistant strains quickly become dominant in a viral population under the selection pressure of a drug. M.J. Donlin et al indicate that pretreatment sequence diversity correlates with response effects[15]. We incorporate this idea and use a Bayesian approach to look into the pretreatment sequences diversity of HCV virus between response and non-response groups, under a combined treatment of interferon and ribavirin.


Introduction
As a single-strand RNA virus, Hepatitis C virus (HCV) has been categorized into at least six genotypes each with several subtypes. Spreading in different regions, the genotypes will have dissimilar response patterns to interferon-based therapy. In clinic, less than 20% of chronic patients show sustained response with interferon monotherapy treatment [1] , while the therapy with the IFN and ribavirin combination showed significant improvement in This is an Open access article distributed under the terms of Creative Commons Attribution 4.0 International License. response rates [2,3] . An accurate interpretation of the mechanism behind the antiviral resistance to IFN therapy will be a key factor for developing better treatment strategies.
Many studies have found that variations in HCV sequences play a role in response to IFNbased therapies, especially for the variations in the NS5A region [4,5] . NS5A is a nonstructural protein which leads to the resistance of IFN treatment by blocking the function of an important mediator in IFN response, dsRNA dependent protein kinase (PKR) [6,7] . In 1995, an "interferon sensitivity determining region" (ISDR amino acid [aa] 2209-2248) defined by Enomoto et al. in NS5A, is enriched with mutations related to resistance to IFN [8] . This finding has been confirmed by several studies [9][10][11] . Moreover, some other study showed evidence that in PKR binding domain (PKRbd aa 2209-274) of NS5A mutations can hamper the viral replication [12] . However, conflicting results about these two regions are also found, which complicate the role of NS5A in response to IFN. In addition to ISDR and PKRbd, other domains in NS5A have also been concerned in resistance to virus, such as proline-rich region (PRR aa 2283-2327), AR1 (aa 2144-2185) and AR2 (aa 2221-2272) of nuclear localization signal (NLS), and the variable region 3 (V3 aa 2356-2379) in the C terminus [13] .
This study will concentrate on NS5A region particular for HCV genotype 1a. In NS5A region there are 1344 base pairs, linking to 448 amino acids. The goal of this paper is to compare the pretreatment sequence patterns between those patients who respond positively to the treatment and those who don't, and then infer the possible mutation positions that may affect the treatment effects.
In this study, we employed Bayesian models, which are originally proposed by Zhang et al. for investigating mutation interactions of HIV caused by a certain drug treatment [14] . Based on the Bayesian variable partition (BVP) model, we first used Metropolis-Hastings algorithm on the data of the interferon treatment to sort out mutations associated with drug resistance, and then applied a recursive model selection (RMS) procedure on the selected mutation positions to infer the dependence structure with the interacting effects.

Methods
Here we first employed the Bayesian variable partition (BVP) model [15] to search for the mutation positions. After detecting the interaction mutation positions, we further applied the Recursive Model Selection [16] (RMS) on selected mutations by BVP to infer more detailed dependence structure among the interacting positions.

Bayesian Variable Partition Model
The response group and non-response group can be represented as two data matrices A=[A 1 , …,A m ] (of dimension N A ×m) and B=[B 1 ,…,B m ] (of dimension N B ×m), respectively (each row is a sequence, each column is a position of protein HCV RT). Here N A or N A denotes the number of sequences in response or non-response group respectively, and m denotes the number of positions. The m positions can be partitioned into four sets: set 1 contains positions independent with each other sharing the same distribution in response and nonresponse groups; set 2 contains positions independent with each other but with different distributions in two groups; set 3 contains positions dependent with the same distribution in two groups; set 4 contains positions dependent with different distributions in two groups. These four sets are corresponding to the four hypotheses in the result section. Let I=(I 1 , …,I m ) indicate the membership of the positions with I j =1,2,3 and 4, respectively, and A (1) and B (1) denote the sequences in l th set from two groups. Our goal is to infer the sets of positions with different distributions in two groups (that is I j = 2 or 4).
Assume that there are c j possible values (amino acid types) at position j, and let Θ 1 ={(θ j1 ,θ j2 ,…,θ j cj ):I j =1} be the amino acid frequencies of each position in set 1 in both groups, thus, the likelihood of (A (1) , B (1) ) is (1) where {n j1 ,…,n j cj } are number of sequences taking k th value in (A (1) , B (1) ). Assume a Dirichlet prior on Θ 1 , that is, Θ 1~D irichlet(α) where α(α 1 ,…, α c j ) By integrating out Θ 1 , we have the marginal probability: (2) where |α| is the sum of all elements in α, and N=N A +N B . Different to positions in set 1, two priors, Dirichlet (β A ) and Dirichlet (β B ), are used on the amino acid frequencies of each positions in set 2 in group A and B, respectively. By integrating out frequencies, we obtain Positions in set 3 and 4 influence the resistance statuses through interactions. Thus, each amino acid combination over set 3 or 4 represents a potential mutation. Assume there are c (3) and c (4) possible value combinations for set 3 and 4, respectively, we use Dirichlet(γ) prior on the combination frequencies in set 3 and use two priors, Dirichlet (δ A ) and Dirichlet(δ B ), on the combination frequencies in set 4 for response group and non-response group. By integrating out frequencies, we obtain (5) Combining formulas from (1) to (7), we have the posterior distribution of I as (8) In this study, we assume most positions should be in set 1 or set 3 in prior (i.e. unassociated with drug resistance), P(I i =2)=P(I i =4)=0.01, and . We further set the parameters for all Dirichlet priors to 0.5. A Markov chain Monte Carlo (MCMC) algorithm [17] can be designed to sample from this posterior distribution so as to infer which variables are associated with the treatment status. More details on BVP can be found in [18] .

Recursive Model Selection
We applied the RMS procedure to infer the detailed dependence structure among the interacting positions generated by BVP. Our strategy is to recursively apply a model selection of two classes of cruder models, that is the chain-dependence model and the Vdependence model, until the data do not support more detailed models.
We say that a group of variables X G follow a chain-dependence model if the index set G can be partitioned into three subsets U, V, and W such that X U and X W are independent given X V , such as X U →X V →X W . The joint distribution of a chain-dependence model is (9) We say that a group of variables X G follow a V-dependence model if X U and X W are mutually independent, that is X U →X W ←X V . The joint distribution of a V-dependence model is (10) In these two models, only set W is allowed to be empty, in which case these models become the saturated model. We use a model indicator to imply the membership of the L positions with for the chain-dependence model and for the V-dependence model. Let denote the set partition, the posterior distribution of I CV and is (11) We set equal prior probability for I CV and is An MCMC algorithm is designed to simulate from (11) and to find the optimal model type and variable partition. More details on RMS can be found in [13] . The procedure is applied recursively until only single-variable nodes are available.
BVP model and RMS procedure were utilized sequentially to the data of response and nonresponse patients. For the comparison, we applied BVP to 47 response datasets versus 29 non-response dataset, and contrived the difference between these two, recognizing that there exist different pretreatment patterns that we should account for differently. The detailed accession numbers for total 76 sequences are showed in Table 1, 2. More information about these sequences can be accessed in the reference [19][20] .

Results
As M.J. Donlin et al. indicated, pretreatment sequence diversity correlates with response effects [15] . On observing this finding, our analysis is to test the following four proposed hypotheses. Hypothesis 1: the positions are independent with each other, and the probability distribution of the pretreatment sequences of response and non-response groups is the same; Hypothesis 2: the positions are independent with each other, and the pretreatment sequences of response and non-response groups have different probability distributions; Hypothesis 3: the positions are dependent, and the distribution of the pretreatment sequences of response and non-response groups is the same; Hypothesis 4: the positions are dependent, and the pretreatment sequences of response and non-response groups have different probability distributions.
We applied Bayesian Variable Partition (BVP) model and Recursive Model Selection (RMS) procedure to the pretreatment sequences of response (47 sequences) and nonresponse (29 sequences) samples, as described in detail in the methods section. We run the proposed method with multiple random restarts, and multiple chains will converge to different multiple local modes. So given different results, we consider all of them meaningful. We did not do any heuristic or subjective selection. Table 3 shows the results of the analysis, namely, positions which have 95% or more probability for us to infer one of our four hypotheses. As shown in Table 3 While analyzing single positions as above is helpful, a lot of positions are not mutating independently. [ Figure 3] shows the interacting positions detected by BVP in response samples and [ Figure 4] shows the interacting positions detected by BVP in non-response samples. Table 4 and Table 5 show the dependence structure inferred by RMS in detail, for the response and non-response group respectively. At position 285, we found that the frequency of amino acid E is 13.8% in non-response samples and 8.5% in the response samples. A more significant result was found at position 199, where the frequency of amino acid L decrease from 100% to 87.2%, from non-response samples to response samples. Similar yet less significant patterns were found at position 226, where the amino acid M decrease from 20.75 to 14.9%, from non-response samples to response samples. For dependent positions, we observed similar results, as shown jointly in Table 4 and Table 5. To show the prediction power of the reported mutations by our method, an SVM classifier was used to conduct a classification on the non-response and response sequences. The ROC curves and corresponding AUC values were showed in Figure 5. The SVM was implemented by libsvm [16] , and the hyper-parameters were tuned to be optimal by the grid search. The kernel function applied was Radial basis function, which gave best results comparing to linear, polynomial, and sigmoid kernel functions. We chose four mutation positions combinations to illustrate the prediction power of our findings: Single positions significant under Fisher test also reveal differences between response and non-response samples in terms of the frequency of amino acid. At position 48, the frequency of amino acid R is 100% in non-response samples, while only 70.2% in response samples. At position 81, the frequency of amino acid R is about 15% higher in response samples. These results, combined with the more reliable evidence from Table 4 and Table 5, gave us a relatively complete picture of the differences between non-response and response samples.

Discussion
Utilizing the pioneering method proposed by Zhang et al [14] , which employs Bayesian statistical modeling, we were able to detect and analyze, the complex interactions of mutations of the HCV protease and reverse transcriptase.
While this analysis helps present a relatively comprehensive picture of the different pretreatment structures of non-response and response patients, it admittedly omits many other factors that possibly influence HCV virus mutations. Despite all the possibilities that may emerge, this study has not only confirmed the original findings of HCV drug resistance but also demonstrated the long-puzzled selection pattern of HCV drug treatment effects. We are positive that the method and results presented here will make a stimulation of new and more accurate ways to decipher the myths behind drug resistance of HCV and other related diseases.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.    Table 1 Accession numbers for 47 response sequences.

Response Sequences
Accession number Accession number Accession number  Table 2 Accession numbers for 29 non-response sequences  Table 3 Positions whose posterior probabilities of H2 or H4 are larger than 0.95.  Table 4 Detailed position interaction relations for positions for the pretreatment sequence of patients who respond to the treatment.  Table 5 Detailed position interaction relations for positions for the pretreatment sequence of patients who don't respond to the treatment.