Computational Identification of Phosphorylation Sites around Nuclear Localization Signal Sequence Reveals New Insight into Genes Associated With Human Diseases

Alterations in protein subcellular localization often contribute to the development of human diseases. Post-translational modifications, such as phosphorylation on the nuclear localization signal (NLS), may change a protein's localization. However, little is known about the frequency and local effects of phosphorylation near NLS sites. In this study, a computer program was developed to search various databases for proteins with NLS phosphorylation and for any diseases associated with the corresponding genes. A total of 308 NLS sequences were found in the NLSdb database, resulting in the identification of 13,448 NLS-containing proteins in the UniProt database. We cross-referenced these proteins with phosphorylation data available from the PhosphoSitePlus database and found that about 21% of these NLS-containing proteins have evidence of phosphorylation sites. After cross-referencing these results with a gene-disease association database, 138 disease-associated genes (1% of NLS-containing proteins) were identified as having phosphorylation sites on their NLS sequences. Further evaluation of the NLS phosphorylation status of these genes in clinical samples may lead to the development of new biomarkers for human diseases and shed new light on the pathogenesis of these gene-associated diseases.


Introduction
All eukaryotic cells have a complex endomembrane system and contain elaborate organelles that provide distinct compartments for different metabolic activities. Protein translation is confined to only one of these compartments, the cytosol, but proteins are needed for nearly all cellular functions. Thus, the translocation of proteins is a fundamental requirement for proteins to exert their functions in different organelles. In fact, approximately half of the proteins generated by a cell have to be transported across at least one cellular membrane to reach their functional destinations [1]. Subcellular localization is essential to protein function, as it determines the access of proteins to interacting partners and the post-translational modification machinery that allows the integration of proteins into functional biological networks.
Considering the importance of the subcellular localization of proteins, it is not surprising that the disruption of nuclear-cytoplasmic transport is responsible for many diseases and is a potent mechanism of resistance to drug treatments. Aberrant protein localization can be caused by mutation, altered expression of cargo proteins or transport receptors, or deregulation of components of the trafficking machinery [2]. For example, a number of the major oncogenes and tumor suppressors, such as p53, BRCA1, APC, retinoblastoma (Rb), β-catenin, NF-κB, survivin and cyclin D1, have been reported to have aberrant subcellular localization in various types of cancers [3,4]. The mislocalization of these proteins can alter their function so that their ability to suppress tumor cells is diminished, or their ability to induce cancer development, metastasis, or drug resistance is increased.
Nuclear transport is highly regulated by the nuclear pore complex (NPC) to ensure that proteins can enter the nucleus when their functions are required and exit into the cytoplasm when they are not needed. While proteins less than 40 kDa in size are free to traverse the NPC, larger proteins require active transport directed by nuclear localization signals (NLSs) or nuclear export signals (NESs). For nuclear entry, proteins must negotiate an NPC that is composed of over 30 different protein components called nucleoporins (Nups) [4].
Post-translational modification (PTM)-based modulation of the NLS binding affinity to import receptors is one of the best understood mechanisms that regulate the nuclear import of proteins [4][5][6][7]. In our previous study, we developed an effective algorithm to predict nuclear import activity, in which molecular interaction energy components (MIECs) were used to characterize the NLS-import receptor interaction, and a support vector regression (SVR) machine was used to learn the relationship between the characterized NLS-import receptor interaction and the corresponding nuclear import activity [8]. Based on our model, we developed a systematic framework to predict how potential PTMs, such as phosphorylation, regulate the nuclear import of human and yeast nuclear proteins. In this study, we developed a computation-based screening method to survey NLS sites for potential modifications, such as phosphorylation, that may lead to protein mislocalization in human disease cells. We hypothesize that the mislocalization of such proteins, not just their expression levels, can serve as a novel diagnostic marker or therapeutic target for human diseases.

Methods
A bioinformatics tool was developed in Python to search for phosphorylation sites on the nuclear localization signals of disease-associated genes. The program starts by reading through all rows of the database of nuclear localization signals (NLSdb, https://rostlab.org/services/nlsdb/). The database contains 114 experimentally determined NLSs that were obtained through an extensive literature search, extended to 308 experimental and potential NLSs using "in silico mutagenesis". This final set matched over 43% of all known nuclear proteins and matched no known non-nuclear proteins.
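The motif-scanning step described above can be sketched roughly as follows. This is a minimal illustration and not the authors' program: it assumes the NLS motifs are available as regular-expression strings, and the function name `find_nls_matches`, the second example motif, and the toy sequence are all hypothetical. PKKKRKV is the classical SV40 large T-antigen NLS and is used only as a familiar example.

```python
import re

def find_nls_matches(sequence, motifs):
    """Return (motif, start, end) for every motif hit.

    Positions are 1-based and inclusive, as is usual for protein residues.
    """
    hits = []
    for motif in motifs:
        # Wrapping the motif in a lookahead lets overlapping hits be reported.
        for m in re.finditer(f"(?=({motif}))", sequence):
            hits.append((motif, m.start(1) + 1, m.end(1)))
    return hits

# Hypothetical inputs for illustration only.
example_motifs = ["PKKKRKV", "KR[A-Z]{2}KR"]
toy_sequence = "MSTPKKKRKVEDSKRAAKRLS"

for motif, start, end in find_nls_matches(toy_sequence, example_motifs):
    print(motif, start, end)
```

Reporting residue coordinates rather than only a yes/no match is a deliberate choice here, because the later steps need to compare NLS positions with phosphorylation site positions.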
NLS sequences in NLSdb are used to identify the predicted nuclear proteins, with links to UniProt (http://www.uniprot.org/). UniProt provides the FASTA sequence for each protein, as well as the protein's accession ID, which is used to search the phosphorylation site database. PhosphoSitePlus (http://www.phosphosite.org) is an open, comprehensive, manually curated and interactive resource for studying experimentally observed post-translational modifications, primarily of human and mouse proteins. It encompasses 1,300,000 non-redundant modification sites, primarily phosphorylation, ubiquitinylation, and acetylation sites. If a phosphorylation site was found on the NLS, the phosphorylation site, the sequence, and the gene symbol were recorded.
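The check for whether a phosphorylation site falls on an NLS can be sketched as an interval test. The code below is an assumption-laden illustration rather than the published implementation: the site label format (`S15`, `T8`), the helper names `parse_site` and `sites_on_nls`, the toy data, and the optional `flank` parameter (for sites near rather than strictly on the NLS) are all hypothetical.

```python
import re

def parse_site(site):
    """Split a label such as 'S315' into residue letter and 1-based position."""
    m = re.fullmatch(r"([STY])(\d+)", site)
    if m is None:
        raise ValueError(f"not a phosphosite label: {site!r}")
    return m.group(1), int(m.group(2))

def sites_on_nls(sites, nls_spans, flank=0):
    """Return the site labels located within any NLS span.

    nls_spans holds (start, end) pairs, 1-based and inclusive.
    flank (an assumed extension) widens each span by that many residues.
    """
    selected = []
    for site in sites:
        _, pos = parse_site(site)
        if any(start - flank <= pos <= end + flank for start, end in nls_spans):
            selected.append(site)
    return selected

# Hypothetical inputs for illustration only.
toy_sites = ["S15", "T8", "Y40", "S21"]
toy_spans = [(4, 10), (14, 19)]

print(sites_on_nls(toy_sites, toy_spans))
print(sites_on_nls(toy_sites, toy_spans, flank=2))
```

With `flank=0` only sites inside the motif are kept, which corresponds to the "found on the NLS" criterion in the text; a non-zero flank would be one way to also capture sites adjacent to the signal.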
This gene symbol was then used to search the gene-disease association dataset (DisGeNET, http://www.disgenet.org). DisGeNET is a discovery platform integrating information on gene-disease associations (GDAs) from several public data sources and the literature [9][10][11]. The current version (DisGeNET v3.0) contains 429,111 associations between 17,181 genes and 14,619 diseases, disorders, and clinical or abnormal human phenotypes. A gene-disease score is provided to rank the strength of the associations based on the level of supporting evidence, taking into account the number and type of sources (level of curation, organisms), including the number of publications supporting the association. Searching through the DisGeNET database yielded a list of associated diseases, as well as the gene-disease score. These results were all written to a table that was further processed by hand to remove duplicates and filter by score (Table 1).
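The text states that duplicate removal and score filtering were done by hand. Purely as an illustration of the same post-processing, a programmatic equivalent might look like the sketch below; the function name `filter_associations`, the record layout (gene, disease, score), and the placeholder gene and disease names are assumptions, while the 0.5 cutoff is the one reported for Table 1.

```python
def filter_associations(records, min_score=0.5):
    """Deduplicate (gene, disease) pairs and keep those with score >= min_score.

    records is an iterable of (gene, disease, score) tuples.
    The result is sorted by descending score, then by gene and disease.
    """
    best = {}
    for gene, disease, score in records:
        key = (gene, disease)
        # For a repeated pair, keep the highest score seen.
        if key not in best or score > best[key]:
            best[key] = score
    kept = [(g, d, s) for (g, d), s in best.items() if s >= min_score]
    return sorted(kept, key=lambda r: (-r[2], r[0], r[1]))

# Placeholder records for illustration only.
toy_records = [
    ("GENE_A", "disease X", 0.72),
    ("GENE_A", "disease X", 0.72),  # exact duplicate row
    ("GENE_B", "disease Y", 0.01),  # below the cutoff
    ("GENE_C", "disease Z", 0.50),  # exactly at the cutoff, so kept
]

for row in filter_associations(toy_records):
    print(row)
```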

Genetic Variation
Note: 1 DisGeNET score was developed and calculated based on the number of sources that report the association, the type of curation of each of these sources, the animal models where the association has been studied, and the number of supporting publications from text-mining based sources as described in the literature [9][10][11].

Results and Discussion
Aberrant protein localizations are responsible for many diseases. Considerable effort has been devoted to developing reliable methods to predict the effect of mutations on the subcellular localization of disease-related proteins [12][13][14], and a substantial amount of experimental data has been collected on their mislocalization. For example, loss of the nuclear localization signal (NLS) due to a missense mutation within the NLS of the sex-determining region Y (SRY) protein has been shown to be associated with XY sex reversal in Swyer syndrome [15]. With the development of genomic and proteomic approaches, gene expression at both the mRNA and protein levels can be quantitatively analyzed in a global manner [16,17], but these "omic" methods are not sufficient to detect alterations in protein localization. Currently, there is no high-throughput technology available to screen for mislocalized proteins in a global way.
Changes in localization are often triggered by post-translational modifications. While it is currently estimated that 40 to 50% of eukaryotic proteins are phosphorylated, little is known about the frequency and local effects of phosphorylation near nuclear binding sites. NLSs tend to be conserved between proteins, and are thus sensitive to alterations [13]. Our previous study has shown that phosphorylation on NLS residues has a dramatic impact on the binding ability of nuclear proteins to nuclear import receptors, and therefore affects nuclear localization efficiency [8].
In this study, we investigated how frequently phosphorylation sites occur near the NLS, how they may be modified to affect protein localization, and the human diseases with which they are potentially associated. This was done with the NLS motif database NLSdb, the phosphorylation site database PhosphoSitePlus, and the gene-disease association database DisGeNET. The NLSdb website contained 308 potential NLS motifs, which matched over 43% of all known nuclear proteins and no currently known non-nuclear proteins. DisGeNET gave a score for each association, ranging from 0 to 1, that indicated the strength of the association. Our study aims to identify the NLS-containing proteins whose localization may be regulated by post-translational modifications such as phosphorylation. Using the Python language, we developed a bioinformatics module to survey phosphorylation sites on nuclear localization signals and potential disease associations. We first identified 13,448 NLS-containing proteins (Figure 1A). Next, we searched the phosphorylation site database for the NLS-containing proteins. Out of the 13,448 NLS-containing proteins, 2,815 (21%) were also found to have a phosphorylation site listed in the phosphorylation database (Figure 1B). Next, we cross-referenced the disease-associated entries for the proteins with potential phosphorylation sites at NLS residues.
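The proportions quoted for this screening funnel can be rechecked with a few lines of arithmetic. The sketch below uses only the counts reported in the text (13,448 NLS-containing proteins, 2,815 with a listed phosphorylation site, and 138 disease-associated proteins with NLS phosphorylation); the variable and function names are illustrative.

```python
def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)

# Counts as reported in the text.
nls_proteins = 13448
with_phosphosite = 2815
disease_hits = 138

phospho_pct = percent(with_phosphosite, nls_proteins)
disease_pct = percent(disease_hits, nls_proteins)

print(f"with phosphorylation site: {phospho_pct}%")
print(f"disease-associated NLS phosphorylation hits: {disease_pct}%")
```

Both values round to the figures given in the text, about 21% and about 1% of NLS-containing proteins respectively.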
Through this screening process, we identified 270 phosphorylation sites on the NLS sequences of 138 proteins that potentially contribute to disease development. These genes were found to be associated with many different types of diseases, including cancer, cardiovascular disease, and obesity. Interestingly, 32 out of the 138 (27.8%) proteins were known to be associated with cancers (Figure 2). These cancer-related genes are found in different types and stages of cancer (Figure 2). The fact that there are so many cancer-related genes on our list reflects the large number of publications and the volume of research on cancer in the literature. However, when we restricted associations to those with a score of 0.5 or higher, only 5% of the genes on the shortened list were cancer-associated genes, and the rest of the genes were strongly associated with other genetic diseases. This discrepancy may arise because protein mislocalization was not a factor used to calculate the score in the DisGeNET database. This means that a large number of genes could be potential diagnostic and/or prognostic markers if the localization of the gene products is taken into account. For example, the disease-association score for the metastasis associated 1 (MTA1) gene is only 0.01 in the prediction of breast neoplasm, because even in normal cells, MTA1 levels vary a great deal from tissue to tissue [18][19][20], and little association is found when looking at expression level alone. However, several studies have shown that MTA1 is located in the nucleus, cytoplasm, and the nuclear envelope [21], and further investigations are needed to identify the exact subcellular localizations of MTA1 proteins. We reviewed the subcellular localization patterns of the MTA family members and gave a comprehensive overview of their respective molecular activities in multiple contexts.
www.ommegaonline.org | J Bioinfo Proteomics Rev | Volume 3: Issue 1
Some associations identified in our analysis appear consistent with existing research, like that between p53 phosphorylation and localization. TP53 is a tumor suppressor gene, i.e., its activity stops the formation of tumors [22,23]. Under normal conditions, the p53 protein, which is encoded by the TP53 gene, is a labile and inactive protein. Cytoplasmic p53 interacts with MDM2, which serves as an E3 ubiquitin ligase and targets p53 for ubiquitin-proteasome-mediated degradation [23,24]. When cells are exposed to DNA damage and other stress, p53 accumulates in the nucleus and becomes active [22]. Previous studies report that phosphorylation of p53 at Ser315 inactivates p53 by enhancing its proteolytic degradation in the cytoplasm [25,26]. Although the mechanism by which Ser315 regulates cytoplasmic retention of p53 is largely unknown, these observations are consistent with our hypothesis that phosphorylation around the NLS may reduce its binding affinity to nuclear import receptors and therefore inhibit its nuclear translocation.
Overall, we developed a program to systematically survey for proteins whose cellular localizations may be altered due to phosphorylation of their nuclear localization signal sequences during the development of human diseases. Further experimental validation of their expression levels and localizations in clinical samples may lead to the identification of a new set of diagnostic and/or prognostic biomarkers for human diseases.

Figure 1: Flowchart of building the bioinformatics tool. A) Dataset preparation. The program starts with a list of NLS-containing proteins in NLSdb, finds additional information on UniProt, then narrows the list by searching for phosphorylation sites and disease associations. B) Graphical representation of the program's search algorithm for identification of the 138 disease-associated gene hits as defined in the text.

Figure 2: Distribution of potential disease-related proteins with phosphorylation sites in their NLS sequences among human diseases, including different types and stages of cancer.

Table 1: Selected top gene list with a 0.5 gene-disease score cutoff filter.