
PSORT-B v.1.1 Documentation

For documentation describing the current version of PSORT-B, please refer to PSORT-B v.2.0 Documentation.

A plain text version of the documentation below is available here.

This section contains documentation and references related to PSORT-B v.1.1, described in the manuscript: Jennifer L. Gardy, Cory Spencer, Ke Wang, Martin Ester, Gabor E. Tusnady, Istvan Simon, Sujun Hua, Katalin deFays, Christophe Lambert, Kenta Nakai and Fiona S.L. Brinkman (2003). PSORT-B: improving protein subcellular localization prediction for Gram-negative bacteria, Nucleic Acids Research 31(13):3613-17

1. History

Computational prediction of the subcellular localization of proteins is a valuable tool for genome analysis and annotation, since a protein's subcellular localization can provide clues regarding its function in an organism. For bacterial pathogens, the prediction of proteins on the cell surface is of particular interest due to the potential of such proteins to be primary drug or vaccine targets. A protein's subcellular localization is influenced by several features present within the protein's primary structure, such as the presence of a signal peptide or membrane-spanning alpha-helices.

Several algorithms have been developed to analyze single features such as these. The PSORT family of programs, however, analyzes several features at once, using information obtained from each analysis to generate an overall prediction of localization site. Developed by Kenta Nakai in 1991, PSORT is an algorithm that assigns a probable localization site to a protein given its amino acid sequence alone. Originally developed for prediction of protein localization in Gram-negative bacteria, PSORT was expanded into a suite of programs (PSORT, PSORT II, iPSORT) capable of handling proteins from all classes of organisms.

The Brinkman Laboratory headed development of PSORT-B, an updated version of the PSORT algorithm designed for Gram-negative bacterial proteins. PSORT-B includes new analytical modules designed to capitalize on new discoveries and observations in protein sorting, and benefits from a training dataset of over 1400 proteins of known localization. Its focus is on precision over recall to facilitate accurate predictions, at the expense of making fewer predictions than other methods may make. PSORT-B v.1.1 was released in July 2003 and has now been succeeded by PSORT-B v.2.0.

2. PSORT-B v.1.1 vs PSORT I
 

The original version of PSORT, still frequently used for prediction of prokaryotic localization sites, used a number of analyses arranged in an if/then rule-based format to determine which of four localization sites a protein might reside in: cytoplasm, periplasm, inner membrane or outer membrane (see the documentation available at the PSORT WWW server for a full explanation).

PSORT-B v.1.1, however:

  • uses updated versions of several of these analyses, as well as several novel analytical methods
  • utilizes a probabilistic system for determination of a final prediction, rather than a rule-based system
  • is capable of predicting five localization sites, rather than four (PSORT-B also recognizes extracellular proteins)
  • does not force a prediction, returning "Unknown" if no localization site can be assigned with confidence

PSORT-B v.1.1's precision has been measured at 97%, compared to PSORT I's 69%, and PSORT-B's web interface is able to handle the submission of multiple sequences.

3. PSORT-B v.1.1: Analytical Modules

PSORT-B v.1.1 consists of six analytical modules, each of which analyzes one biological feature known to influence or be characteristic of subcellular localization. The modules may act as a binary predictor, classifying a protein as either belonging or not belonging to a particular localization site, or they may be multi-category, able to assign a protein to one of several localization sites. All modules are capable of returning a negative prediction as well, such that a protein will not be forced into one of the localization sites.

 

3.1 SCL-BLAST, or SubCellular Localization BLAST, is a BLAST-P search against the current local database of proteins of known subcellular localization. An E-value cutoff of 10e-10 is used to ensure that returned HSPs represent true homologs, and an additional length restriction is placed on any subject matches - the length of the query:subject HSP must be within 80-120% of the length of the subject protein, thus reducing potential errors associated with the domain nature of proteins. Note that we are interested in examining other E-value cutoffs for this analysis. SCL-BLAST selects the top-scoring HSP from the list of results, and returns that protein's localization site as its prediction, along with the name of the top-scoring HSP and the associated E-value. SCL-BLAST is capable of assigning a protein to any one of the five localization sites.
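The two filters and the top-hit selection described in 3.1 can be sketched as follows. This is a minimal illustration, not PSORT-B's own code: the `Hit` record, its field names and the function names are hypothetical, "top-scoring" is assumed to mean the highest bit score, and the cutoff is written exactly as it appears in the text.

```python
from dataclasses import dataclass
from typing import List, Optional

E_VALUE_CUTOFF = 10e-10  # written as in the documentation above


@dataclass
class Hit:
    """One query:subject HSP (hypothetical record, not a PSORT-B type)."""
    subject_id: str
    localization: str
    e_value: float
    bit_score: float
    hsp_length: int      # length of the query:subject HSP
    subject_length: int  # full length of the subject protein


def passes_filters(hit: Hit) -> bool:
    """Apply the E-value cutoff and the 80-120% length restriction."""
    ratio = hit.hsp_length / hit.subject_length
    return hit.e_value <= E_VALUE_CUTOFF and 0.8 <= ratio <= 1.2


def scl_blast_predict(hits: List[Hit]) -> Optional[Hit]:
    """Return the top-scoring hit that passes both filters, or None.

    Assumption: "top-scoring" is taken to mean the highest bit score.
    """
    kept = [h for h in hits if passes_filters(h)]
    return max(kept, key=lambda h: h.bit_score, default=None)
```

A return value of `None` corresponds to the module making no prediction. Otherwise the hit's `localization`, `subject_id` and `e_value` are the three items the module reports.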

3.2 Motif Analysis relies on the observation that a protein's function is closely linked to its localization, and that several PROSITE motifs characteristic of specific functions can be used to infer specific localizations. Several potentially important motifs were used to scan a dataset of 853 proteins, and motifs with a false-positive rate of 0% were built into PSORT-B. Note that the 0% false-positive rate is based on the dataset of proteins of known localization that we currently have access to. We wish to emphasize that this does not necessarily mean that such motifs will always be 100% accurate. If you identify an incorrect prediction, please contact us. A submitted protein is scanned for the occurrence of any of these motifs, and, if one is found, the localization site associated with the motif is returned as the program's prediction. Motifs associated with each of the five localization sites are included in PSORT-B.

3.3 Outer Membrane Motif Analysis uses motifs generated from data mining techniques applied to a set of 425 beta-barrel proteins to classify a query protein as outer membrane or non-outer membrane (She et al, 2003). The Apriori algorithm was used to mine for short motifs found more often in outer membrane proteins than in proteins at the other four localization sites. Over 250 such motifs were identified, and a query protein is scanned for the co-occurrence of two or more of these motifs. If such a co-occurrence is found, a prediction of outer membrane is returned.

3.4 HMMTOP (Tusnady, 1998), a hidden Markov model-based method, is used to identify transmembrane alpha helices (TMHs), which can in turn identify proteins spanning the inner, or cytoplasmic, membrane. Our analyses have shown that when three or more TMHs are predicted in a protein, there is a 94% chance of that protein being an inner membrane protein. PSORT-B therefore returns a prediction of inner membrane if three or more TMHs are found.

3.5 SubLocC relies on the observation that proteins resident in different environments within a cell tend to have different overall amino acid composition. PSORT-B uses SubLoc (Hua, 2001), a support vector machine-based approach to differentiate cytoplasmic from non-cytoplasmic proteins.

3.6 A Signal Peptide directs a protein for export past the cytoplasmic membrane, and thus can be further used to differentiate cytoplasmic and non-cytoplasmic proteins. A hidden Markov model was trained on the dataset used to train the SignalP program, and is used to predict potential signal peptide cleavage sites. If a cleavage site with a high probability value is not found, the first 70 amino acids of the protein are passed to a support vector machine module trained on the same data. If the SVM is unable to recognize a signal peptide, the protein is predicted not to have one and is classified as cytoplasmic. However, a protein may possess a non-traditional signal peptide, so the results of this analysis carry less weight than do other modules when generating a final prediction.
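The two-stage control flow described in 3.6 can be sketched as below. This is an illustration only: the two predictors are passed in as placeholder callables rather than trained models, the function and parameter names are hypothetical, and the threshold for a "high probability value" is not stated in this documentation, so it is left as a caller-supplied argument.

```python
from typing import Callable, Optional

N_TERMINAL_WINDOW = 70  # residues passed to the SVM, per the text above


def detect_signal_peptide(
    sequence: str,
    hmm_cleavage_probability: Callable[[str], float],
    svm_has_signal_peptide: Callable[[str], bool],
    high_probability: float,
) -> Optional[str]:
    """Return which stage detected a signal peptide, or None.

    Both predictors and `high_probability` are placeholders supplied by
    the caller; they are not PSORT-B's trained models or settings.
    """
    # Stage 1: HMM looks for a high-probability cleavage site.
    if hmm_cleavage_probability(sequence) >= high_probability:
        return "hmm"
    # Stage 2: SVM examines only the first 70 amino acids.
    if svm_has_signal_peptide(sequence[:N_TERMINAL_WINDOW]):
        return "svm"
    # Neither stage recognized a signal peptide.
    return None
```

A result of `None` corresponds to the case in which the protein is predicted not to have a signal peptide.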

4. PSORT-B v.1.1: Final Prediction

In order to generate a final prediction, the results of each module are combined and assessed. A probabilistic method and 5-fold cross validation were used to assess the likelihood of a protein being at a specific localization given the prediction of a certain module. These likelihoods are used to generate a probability value for each of the five localization sites for a user's query protein.

PSORT-B v.1.1 returns a list of the five localization sites and the associated probability value for each, ranked in descending order. We consider 7.5 to be a good cutoff above which localization can be assigned, and our precision and recall values for PSORT-B v.1.1 are calculated using this cutoff.

The user must carefully inspect the results list and the probability values. Two localization sites may often have similar values, which could indicate a protein with distinct domains present in two localization sites, the inner membrane and the periplasm, for example. Additionally, if no prediction can be made, an even distribution of scores across the five sites will be visible.
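The ranking and cutoff step can be sketched in a few lines. This is a minimal illustration rather than PSORT-B's own code: the function names are hypothetical, and a score equal to the cutoff is assumed to count as a prediction.

```python
from typing import Dict, List, Optional, Tuple

DEFAULT_CUTOFF = 7.5  # cutoff recommended in the text above


def rank_sites(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (site, score) pairs in descending order of score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def final_prediction(
    scores: Dict[str, float], cutoff: float = DEFAULT_CUTOFF
) -> Tuple[str, Optional[float]]:
    """Return the top site and its score, or ("Unknown", None).

    Assumption: a score equal to the cutoff counts as a prediction.
    """
    site, score = rank_sites(scores)[0]
    if score >= cutoff:
        return site, score
    return "Unknown", None
```

With an even distribution of scores across the five sites, no site reaches the cutoff and the result is "Unknown". Passing a lower `cutoff` trades precision for recall.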

5. Using PSORT-B v.1.1

This section of the documentation will be updated as changes are made to the web interface. Please check back often for up-to-date instructions on program use.

5.1 Accessing PSORT-B

5.1.1 WWW Access: PSORT-B is available online at http://www.psort.org. The sequence submission form itself is located at http://www.psort.org/psortb.

5.2 Submitting a Sequence for Analysis

5.2.1 Sequence Submission: The sequence submission form can be found at http://www.psort.org/psortb/. Paste your protein sequence(s) from a Gram-negative bacterium into the box, or browse for a local FASTA format file on your computer containing the sequences. Select the desired output format from the dropdown menu. Output formats are described in section 5.3.1. Press the Submit button to begin the analysis.

5.2.2 Acceptable Organisms: PSORT-B presently only accepts protein sequences from Gram-negative bacteria. All protein sequences from Gram-positive and eukaryotic organisms must be analyzed using one of the other PSORT programs, available at the PSORT WWW Server.

5.2.3 Acceptable Formats: PSORT-B requires that a PROTEIN sequence be submitted in FASTA format. A FASTA format file contains a definition line, preceded by a ">" character and containing any information the user wishes to identify their sequence with. The definition line ends with a newline character, and is followed by the sequence information itself. A newline indicates the end of the sequence information. An example of FASTA format is shown below:

>gi|31562958|sp|Q8CWD2|BTUF_ECOL6
MAKSLFRALVALSFLAPLWLNAAPRVITLSPANTELAFAAGITPVGVSSYSDYPLQAQKIEQVSTWQGMN
LERIVALKPDLVIAWRGGNAERQVDQLASLGIKVMWVDATSIEQIANALRQLAPWSPQPDKAEQAAQSLL
DQYAQLKAQYADKPKKRVFLQFGINPPFTSGKESIQNQVLEVCGGENIFKDSRVPWPQVSREQVLARSPQ
AIVITGGPDQIPKIKQYWGEQLKIPVIPLTSDWFERASPRIILAAQQLCNALSQVD
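
A file in this format can be checked before submission with a short reader such as the one below. This is a minimal sketch and not part of PSORT-B; the function name `read_fasta` is hypothetical. It treats each ">" line as the start of a new record and joins sequence lines that wrap across several lines, as in the example above.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (definition_line, sequence) pairs from FASTA-formatted lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may wrap over several lines
    if header is not None:
        yield header, "".join(chunks)
```

For example, `list(read_fasta(open("proteins.fasta")))` would return one pair per sequence, where `proteins.fasta` is a placeholder file name.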

5.2.4 Number of Sequences Allowed: PSORT-B can handle both single and multiple sequences in the web-based submission form. For multiple sequence submission, submit a FASTA file containing all of the protein sequences. A maximum of 600,000 characters is allowed in the submit textbox.

5.2.5 Alternate Prediction Options: If your sequence does not meet the organism criteria, please use one of the other PSORT programs to analyze it. These programs are available at the PSORT WWW Server, and links can also be found at the www.psort.org index page. Other resources for subcellular localization prediction can be found on the Resources page.

5.2.6 Whole Genome Analysis: In order to reduce the load on the PSORT-B servers, precalculated results for whole bacterial genomes will be made available on the PSORT-B site, on the Genomes page.

5.3 Understanding the Output

5.3.1 Output Formats: PSORT-B allows the user to select one of three output formats from the sequence submission screen: Normal, Tab-delimited (terse format) and Tab-delimited (long format). Normal output is recommended for analysis of one or a few sequences, whereas tab-delimited output in either format is recommended for the analysis of a large number of sequences. The output formats are described below.

5.3.2 Normal Output: The Normal output option displays the results of each of PSORT-B's analytical modules, the localization scores for each of the 5 sites, as well as a final prediction and associated score (if one site scores above the 7.5 cutoff). The output appears in the format:

SeqID: gi|31562958|sp|Q8CWD2|BTUF_ECOL6
Analysis Report:  
HMMTOP Unknown [1 internal helix found]
Motif Unknown [No motifs found]
OMPMotif Unknown [No motifs found]
SCL-BLAST Periplasmic [Matched P37028: Periplasmic protein]
Signal Unknown [No signal peptide detected]
SubLocC Unknown [No details]
Localization Scores:
Periplasmic 8.950617
Inner Membrane 0.370370
Outer Membrane 0.308642
Cytoplasmic 0.246914
Extracellular 0.123457
Final Prediction:
Periplasmic 8.950617

SeqID returns whatever was found on the definition line of the FASTA format input file. The Analysis Report contains the results of each of PSORT-B's analytical modules. The module name is listed in the left-most column, the centre column contains the localization site predicted by that module (or "Unknown" if the module did not generate a prediction), and the right-most column contains comments related to the module's findings. In the Localization Scores area, the confidence value for each of the 5 localization sites is given, in descending order. If one of the sites has a score of 7.5 or greater, this site and its score are returned in the Final Prediction section.
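
A report in this layout can be read back into a simple structure for further processing. The following is a minimal sketch that assumes the output looks exactly like the example above (whitespace-separated columns, comments in square brackets); the function names and dictionary keys are hypothetical, and real output may differ in spacing.

```python
import re
from typing import Optional, Tuple

SECTIONS = ("Analysis Report:", "Localization Scores:", "Final Prediction:")
# module name, predicted site, bracketed comment
MODULE_LINE = re.compile(r"^(\S+)\s+(.+?)\s+\[(.*)\]$")


def split_site_score(line: str) -> Tuple[str, Optional[float]]:
    """Split 'Inner Membrane 0.370370' into ('Inner Membrane', 0.37037)."""
    parts = line.rsplit(None, 1)
    if len(parts) == 2:
        try:
            return parts[0], float(parts[1])
        except ValueError:
            pass
    return line, None  # no trailing number on this line


def parse_normal_report(text: str) -> dict:
    """Parse one Normal-format report into a dictionary."""
    report = {"seq_id": None, "modules": [], "scores": {}, "final": None}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("SeqID:"):
            report["seq_id"] = line[len("SeqID:"):].strip()
        elif line in SECTIONS:
            section = line
        elif section == "Analysis Report:":
            match = MODULE_LINE.match(line)
            if match:
                report["modules"].append(match.groups())
        elif section == "Localization Scores:":
            site, score = split_site_score(line)
            if score is not None:
                report["scores"][site] = score
        elif section == "Final Prediction:":
            report["final"] = split_site_score(line)
    return report
```

Site names that contain a space, such as "Inner Membrane", are kept intact because each score line is split only at its last run of whitespace.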

5.3.3 Tab-delimited (Terse Format) Output: Tab-delimited terse format output returns a list of inputted sequences, each one on a new line, with 3 columns: SeqId contains the information from the FASTA file definition line, Localization contains the final prediction of localization site (or "Unknown" if no site scored above 7.5), and Score contains the confidence value associated with this localization site. Tab characters occur between the columns, and, in the case of a multiple sequence submission, each sequence record is separated by newline characters. This format can be easily read into a spreadsheet, using a program such as MS Excel.
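
Besides a spreadsheet, terse output can be read with a short script. The following is a minimal sketch based only on the three-column description above. The function name is hypothetical, and it assumes there is no header row and that the Score field may not be a number when the localization is "Unknown"; neither detail is specified here.

```python
from typing import List, Optional, Tuple


def read_terse(text: str) -> List[Tuple[str, str, Optional[float]]]:
    """Parse terse output into (seq_id, localization, score) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        seq_id, localization, score_text = fields[:3]
        try:
            score = float(score_text)
        except ValueError:
            score = None  # assumption: score may be absent or non-numeric
        rows.append((seq_id, localization, score))
    return rows
```

Collecting the sequences without a confident prediction is then a matter of filtering the rows whose localization is "Unknown".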

5.3.4 Tab-delimited (Long Format) Output: Tab-delimited long format output returns a list of inputted sequences, each one on a new line, and with all of the information from the PSORT-B results placed into columns. The SeqId, module results and comments from the analysis report, localizations and scores, and the final prediction and score are each placed into their own column.

6. Limitations

PSORT-B is designed to emphasize precision (or specificity) over recall (or sensitivity), and as a result, some classes of proteins are not predicted well. The following issues must be considered when performing an analysis using the current version of PSORT-B:

6.1 Integral membrane proteins with 1-2 transmembrane helices: A large number of integral membrane proteins contain fewer than 3 transmembrane helices (TMHs). PSORT-B's HMMTOP module, however, will only identify proteins with 3 or more TMHs as inner membrane-localized in order to reduce the number of false positive inner membrane predictions. Thus, a protein with 1-2 helices and no other localization information may not yield a confident prediction. Examination of the HMMTOP results in the PSORT-B analysis report will help to identify such proteins.

6.2 Proteins resident at multiple localization sites: Many proteins can exist at multiple localization sites. Examples of such proteins include integral membrane proteins with large periplasmic domains, or autotransporters, which contain an outer membrane pore domain and a cleaved extracellular domain. The current version of PSORT-B handles this situation by providing a list of localization scores for each of the 5 sites - proteins with multiple localizations will typically show a distribution of localization scores favouring two sites, rather than one. It is important to examine the distribution of localization scores carefully in order to determine if your submitted protein may have multiple localization sites. In later versions of the program, as more proteins with multiple localization sites are incorporated into PSORT-B, other methods of returning multiple localization predictions will be explored.

6.3 Lipoproteins: The current version of PSORT-B does not detect lipoprotein motifs. Lipoproteins in the SCL-BLAST database, however, are annotated as such.

6.4 Precision vs. Recall: PSORT-B version 1.1 has been designed to yield as high a precision level as possible, at the expense of recall. Programs which make predictions at all costs often provide incorrect or incomplete results, which can be propagated through annotated databases, datasets and reports in the literature. We believe that a confident prediction is more valuable than any prediction, and we have designed the program to this end. Note, however, that a user may choose to use their own reduced cutoff score in generating final predictions.