PSORT-B Submit Sequences | Updates | Resources | Contact  

PSORTb v.3.0 Documentation

This section contains documentation and references related to PSORTb v.3.0. When using PSORTb please cite:

PSORTb v3.0: N.Y. Yu, J.R. Wagner, M.R. Laird, G. Melli, S. Rey, R. Lo, P. Dao, S.C. Sahinalp, M. Ester, L.J. Foster, F.S.L. Brinkman (2010) PSORTb 3.0: Improved protein subcellular localization prediction with refined localization subcategories and predictive capabilities for all prokaryotes, Bioinformatics 26(13):1608-1615

A plain text version of the documentation is available here.

1. History

Computational prediction of the subcellular localization of proteins is a valuable tool for genome analysis and annotation, since a protein's subcellular localization can provide clues regarding its function in an organism. For bacterial pathogens, the prediction of proteins on the cell surface is of particular interest due to the potential of such proteins to be primary drug or vaccine targets. A protein's subcellular localization is influenced by several features present within the protein's primary structure, such as the presence of a signal peptide or membrane-spanning alpha-helices.

Several algorithms have been developed to analyze single features such as these; however, the PSORT family of programs analyzes several features at once, using information obtained from each analysis to generate an overall prediction of localization site. Developed by Kenta Nakai in 1991, PSORT is an algorithm that assigns a probable localization site to a protein given an amino acid sequence alone. Originally developed for prediction of protein localization in Gram-negative bacteria, PSORT was expanded into a suite of programs (PSORT, PSORT II, iPSORT) capable of handling proteins from all classes of organisms.

The Brinkman Laboratory headed development of PSORTb, an updated version of the PSORT algorithm with significantly higher accuracy. PSORTb includes new analytical modules designed to capitalize on new discoveries and observations in protein sorting, and benefits from a training dataset of over 11600 proteins of known localization. Its focus is on precision over recall: it aims to make accurate predictions, at the expense of making fewer predictions than other methods may make. PSORT-B v.1.1 was released in July 2003, followed by PSORTb v.2.0 in 2004; both have now been succeeded by PSORTb v.3.0.

2. PSORTb v.3.0 vs. previous versions of PSORT and PSORT-B

The original version of PSORT, still frequently used for prediction of prokaryotic localization sites, used a number of analyses arranged in an if/then rule-based format to determine which of four localization sites a protein might be resident at - cytoplasm, periplasm, inner or outer membrane (see the documentation available at the PSORT WWW server for a full explanation).

The PSORTb algorithm, however:

  • uses updated versions of several of these analyses, as well as several novel analytical methods
  • utilizes a probabilistic system for determination of a final prediction, rather than a rule-based system
  • is capable of predicting all localization sites (PSORT I does not predict extracellular proteins)
  • does not force a prediction, returning a prediction of "Unknown" if no prediction is made
  • displays a 28% increase in precision (% of correct predictions) relative to PSORT I

Furthermore, PSORTb v.2.0 offers several improvements over v.1.1:

  • prediction of Gram-positive proteins added
  • increased coverage (more predictions are made)
  • automated flagging of proteins with potential multiple localization sites

PSORTb v.3.0 has the following improvements over v.2.0:

  • Prediction capability for the domain of Archaea implemented
  • Prediction capability for bacteria whose Gram-stains do not reflect classical physical structures. For example, organisms that stain Gram-negative but have no outer membrane, as well as organisms that stain Gram-positive but have an outer membrane.
  • Sub-category localization predictions added (predicts flagellar, fimbrial, type III secretion apparatus, host-associated, and spore localizations)
  • Increased recall and coverage (more predictions are made for each bacterial genome)
  • Simplified software installation process, if local installation is preferred over using the web. There are fewer packages to install, since HMMTOP and its associated license are no longer required.
  • Web server now allows batch sequence processing with the option of returning results by email
  • Motifs are updated and ones that are no longer 100% specific are removed, improving software precision

Note also the change in name between PSORT-B v.1.1 and PSORTb v.2.0 / v.3.0 - the hyphen was eliminated in order to avoid conflicts during pattern matching or other searches.

3. PSORTb v.3.0: Analytical Modules

PSORTb v.3.0 consists of multiple analytical modules, each of which analyzes one biological feature known to influence or be characteristic of subcellular localization. A module may act as a binary predictor, classifying a protein as either belonging or not belonging to a particular localization site, or it may be multi-category, able to assign a protein to one of several localization sites. When analyzing a Gram-negative organism (organism with two cell membranes), possible localization sites are: cytoplasm, cytoplasmic membrane, periplasm, outer membrane and extracellular space. Gram-positive and archaeal localization sites include: cytoplasm, cytoplasmic membrane, cell wall and extracellular space. The new version also offers prediction options for specialized organisms, such as those that stain Gram-negative but have no outer membrane or cell wall (e.g. Mycoplasma spp.), as well as organisms that stain Gram-positive but have an outer membrane (e.g. Deinococcus radiodurans). All modules are capable of returning a negative prediction as well, such that a protein will not be forced into one of the localization sites.

3.1 SCL-BLAST & SCL-BLASTe: SCL-BLAST, or SubCellular Localization BLAST, is a BLAST-P search against the current local database of proteins of known subcellular localization. An E-value cutoff of 10e-9 is used to ensure that returned HSPs represent true homologs, and an additional length restriction is placed on any subject matches: the length of the query:subject HSP must be within 80-120% of the length of the subject protein, thus reducing potential errors associated with the domain nature of proteins. SCL-BLAST selects the top-scoring HSP from the list of results, and returns that protein's localization site as its prediction, along with the name of the top-scoring HSP and the associated E-value. SCL-BLAST is capable of assigning a protein to any one of the possible localization sites. SCL-BLASTe is a specialized implementation of this analysis, in which a user's query protein is checked to see if it is an exact match to a protein in the SCL-BLAST database. If an exact match is found (100% similarity and within 1 aa in length), the protein is immediately predicted as residing at that localization site, and is not passed to subsequent modules.
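The two SCL-BLAST filters described above can be sketched in Python as follows. This is only an illustration, not PSORTb's own code: the Hsp record, its field names and the function names are hypothetical, and the sketch assumes that the filters are applied before the top-scoring HSP is chosen. The cutoff is written as it appears in the text; note that Python reads the literal 10e-9 as 1e-8.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Cutoff copied from the text ("10e-9"). Python evaluates this literal as 1e-8,
# so treat the exact value as an assumption and adjust it if needed.
EVALUE_CUTOFF = 10e-9


@dataclass
class Hsp:
    """Hypothetical record for one BLAST hit against the localization database."""
    subject_id: str       # identifier of the database protein
    localization: str     # known localization of the database protein
    evalue: float
    bit_score: float
    hsp_length: int       # length of the query:subject alignment
    subject_length: int   # full length of the database protein


def passes_filters(hsp: Hsp, evalue_cutoff: float = EVALUE_CUTOFF) -> bool:
    """Apply the E-value cutoff and the 80-120% length restriction."""
    ratio = hsp.hsp_length / hsp.subject_length
    return hsp.evalue <= evalue_cutoff and 0.8 <= ratio <= 1.2


def scl_blast_prediction(hsps: Iterable[Hsp]) -> Optional[Hsp]:
    """Return the top-scoring HSP that survives both filters, or None."""
    kept = [h for h in hsps if passes_filters(h)]
    return max(kept, key=lambda h: h.bit_score) if kept else None
```

A hit that covers only a single domain of a long database protein fails the length ratio even when its E-value is very low, which is the kind of error the restriction is meant to reduce.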

3.2 Support Vector Machines (SVMs) are machine learning-based classifiers trained to classify a protein as belonging or not belonging to the set of proteins at a specific localization site. PSORTb v.3.0 contains 13 SVMs, one for each of the localization sites (5 Gram-negative, 4 Gram-positive and 4 archaeal). Trained using frequent sequences mined from proteins resident at a specific localization site, each SVM will examine a query protein and determine whether it does or does not belong at the localization site in question. If the SVM determines that the protein belongs to that particular site, that result is returned. Otherwise, an unknown prediction is returned.

3.3 Motif & Profile Analysis relies on the observation that a protein's function is closely linked to its localization, and that several PROSITE motifs characteristic of specific functions can be used to infer specific localizations. Several potentially important motifs were used to scan our current dataset, and motifs with a false-positive rate of 0% were built into PSORTb, as were expanded versions of the motifs termed "profiles". Note that the 0% false-positive rate is based on the dataset of proteins of known localization that we currently have access to. We wish to emphasize that this does not necessarily mean that such motifs and profiles will always be 100% accurate. If you identify an incorrect prediction, please contact us. A submitted protein is scanned for the occurrence of any of these motifs or profiles, and, if found, the localization site associated with the motif/profile is returned as the program's prediction. Motifs associated with each of the possible localization sites are included in PSORTb.

3.4 Outer Membrane Motif Analysis uses motifs generated from data mining techniques applied to a set of 425 beta-barrel proteins to classify a query protein as outer membrane or non-outer membrane (She et al, 2003). The A Priori algorithm was used to mine for short motifs found more often in outer membrane proteins than in proteins at the other four localization sites. Over 250 such motifs were identified, and a query protein is scanned for the co-occurrence of two or more of these motifs. A prediction of outer membrane is returned if successful.
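The co-occurrence rule can be illustrated with a short Python sketch. It is an assumption-laden illustration only: the motifs are treated as plain substrings, the function names are hypothetical, and the motif strings in the usage lines are invented placeholders rather than motifs from the PSORTb set.

```python
from typing import Iterable, List


def matched_motifs(sequence: str, motifs: Iterable[str]) -> List[str]:
    """Return the distinct motifs that occur anywhere in the sequence."""
    return sorted({m for m in motifs if m in sequence})


def outer_membrane_motif_prediction(
    sequence: str, motifs: Iterable[str], min_motifs: int = 2
) -> str:
    """Predict outer membrane only when two or more motifs co-occur."""
    found = matched_motifs(sequence, motifs)
    return "OuterMembrane" if len(found) >= min_motifs else "Unknown"


# Placeholder motifs, invented for this example only.
example_motifs = ["GYSF", "NWIA", "QQQQ"]
two_hits = outer_membrane_motif_prediction("AAGYSFAANWIAAA", example_motifs)
one_hit = outer_membrane_motif_prediction("AAGYSFAAA", example_motifs)
```

Requiring two motifs rather than one trades some recall for precision, in keeping with the program's overall design.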

3.5 ModHMM was derived from PRODIV-HMM (Viklund and Elofsson, 2004), a hidden Markov model-based method that identifies transmembrane alpha helices, which in turn identifies proteins spanning the cytoplasmic membrane. Our analyses have shown that when three or more TMHs are predicted in a protein, there is a >95% chance of that protein being an inner membrane protein. PSORTb uses a modified version of PRODIV-HMM to identify transmembrane helices and returns a prediction of cytoplasmic membrane if 3 or more are found.

3.6 A Signal Peptide directs a protein for export past the cytoplasmic membrane, and thus can be further used to differentiate cytoplasmic and non-cytoplasmic proteins. A hidden Markov model was trained on the dataset used to train the SignalP program, and is used to predict potential signal peptide cleavage sites. If a cleavage site with a high probability value is not found, the first 70 amino acids of the protein are passed to a support vector machine module trained on the same data. If the SVM is unable to recognize a signal peptide, the protein is predicted not to have one and is classified as cytoplasmic. However, a protein may possess a non-traditional signal peptide, so the results of this analysis carry less weight than do other modules when generating a final prediction.
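The two-stage fallback described above (hidden Markov model first, then an SVM on the first 70 residues) can be sketched as follows. The two predictors are passed in as callables because their internals are not described here; the function name, the parameter names and the probability threshold are all hypothetical, since the text does not state what counts as a "high probability value".

```python
from typing import Callable


def has_signal_peptide(
    sequence: str,
    hmm_cleavage_probability: Callable[[str], float],
    svm_detects_signal: Callable[[str], bool],
    probability_threshold: float,
) -> bool:
    """Return True if either stage recognizes a signal peptide."""
    # Stage 1: accept a confidently predicted cleavage site from the HMM.
    if hmm_cleavage_probability(sequence) >= probability_threshold:
        return True
    # Stage 2: fall back to the SVM, which only sees the first 70 residues.
    return svm_detects_signal(sequence[:70])
```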

4. PSORTb v.3.0: Final Prediction

In order to generate a final prediction, the results of each module are combined and assessed. A probabilistic method and 5-fold cross validation were used to assess the likelihood of a protein being at a specific localization given the prediction of a certain module. These likelihoods are used to generate a probability value for each of the five localization sites for a user's query protein.

PSORTb v.3.0 returns a list of the five localization sites and the associated probability value for each. We consider 7.5 to be a good cutoff above which a single localization can be assigned, and our precision and recall values for the program are calculated using this cutoff.

In certain cases, two localization sites may both exhibit high scores, which may indicate a protein with domains present in neighbouring localization sites. In cases where a localization site has a score of at least 4.5 (for Gram-negative) or 5.0 (for Gram-positive) but no more than 7.49, the result returned to the user will say "Unknown - This protein may have multiple localization sites". In cases like these, we recommend you examine the long format output of the program's prediction to draw your own conclusion.

For organisms with specialized structures, such as those that stain Gram-negative but have no cell walls, the predictor may predict a cell wall localization but the final result will say "Unknown - predicted localization does not exist". This may mean that the protein is a surface protein or that it is a false prediction. In cases like this, we recommend you examine the long format output of the program's prediction to draw your own conclusion.
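The cutoff rules in this section can be summarized in a short Python sketch. It is not the PSORTb implementation: the function and constant names are hypothetical, only the Gram-negative and Gram-positive thresholds stated above are covered, and the flag text follows the wording shown in the sample output in section 5.4.2.

```python
from typing import Dict, Optional, Tuple

SINGLE_SITE_CUTOFF = 7.5
# Lower bound of the "multiple localization" band, per Gram class.
MULTI_SITE_FLOOR = {"negative": 4.5, "positive": 5.0}
MULTIPLE_SITES_FLAG = "Unknown (This protein may have multiple localization sites)"


def final_prediction(
    scores: Dict[str, float], gram: str
) -> Tuple[str, Optional[float]]:
    """Turn per-site scores into a final call using the documented cutoffs."""
    site, score = max(scores.items(), key=lambda item: item[1])
    if score >= SINGLE_SITE_CUTOFF:
        return site, score
    if score >= MULTI_SITE_FLOOR[gram]:
        return MULTIPLE_SITES_FLAG, None
    return "Unknown", None


# Scores taken from the Gram-negative sample output in section 5.4.2.
sample_scores = {
    "Cytoplasmic": 0.00,
    "CytoplasmicMembrane": 0.00,
    "Periplasm": 0.00,
    "OuterMembrane": 5.87,
    "Extracellular": 4.13,
}
sample_call = final_prediction(sample_scores, "negative")
```

With these scores no site reaches 7.5, but the top score falls inside the Gram-negative band, so the sketch returns the multiple-localization flag, matching the sample output.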

5. Using PSORTb v.3.0

This section of the documentation will be updated as changes are made to the web interface. Please check back often for up-to-date instructions on program use.

5.1 Accessing PSORTb

5.1.1 WWW Access: PSORTb is available online at http://www.psort.org. The sequence submission form for the current version of the program is located at http://www.psort.org/psortb. The older version 2 of the program is accessible at http://www.psort.org/psortb2/index.html.

5.1.2 Standalone PSORTb: PSORTb is also available as a standalone program to run in a Linux environment. The file, as well as instructions for installation, is available at the PSORTb Downloads page.

5.2 Submitting a Sequence for Analysis on the WWW

5.2.1 Sequence Submission: The sequence submission form can be found at http://www.psort.org/psortb/. One or more sequences can be pasted into the text box, or the "upload from file" option can be used to analyze a file of one or more sequences stored on your computer. When using the text box option, please note that a maximum of 600,000 characters can be pasted into the box.

5.2.2 Selecting Gram Stain: PSORTb v.3.0 performs different analyses depending on the class of organism. You are required to choose the appropriate Gram-stain and organism domain (Bacteria or Archaea) for your sequences. Not sure which option to select? Our Genomes page lists the classifications we used when we analyzed sequenced genomes. If your organism is not found there, try the NCBI Taxonomy Browser which provides a rough taxonomy for many bacterial species which may be helpful (for example, there is an association between proteobacteria and Gram-negative stain properties) or see the authoritative Bergey's Manual for Gram-stain properties for your microbe of interest.

5.2.3 Selecting Gram Stain - Advanced: There are some organisms whose Gram stains do not accurately reflect their cellular structure. Two additional analysis options are provided for these organisms by PSORTb v.3.0 -- "Positive with outer membrane" and "Negative without outer membrane". By selecting "Advanced" in the Gram stain option, users can choose to analyze organisms that stain Gram-positive but also have an outer membrane, such as Deinococcus radiodurans, Mycobacterium spp., and the Veillonellaceae family of the Firmicutes phylum. The latter option allows users to analyze organisms that stain Gram-negative but have no outer membrane, such as organisms of the Tenericutes phylum, e.g. Mycoplasma spp.

5.2.4 Receiving results: Results from PSORTb can be received in two ways: by email or by on-screen display. We recommend the email method, because a submission made this way is not subject to the limit of 100 proteins per submission that applies to the web display mode. Because of processing constraints and the large demand on the service, we limit users to 50 submissions per 24 hours. However, there is no practical limit to the number of proteins per submission when using the email method, so if you have a large number of proteins to analyze, please batch them up and use this method.

5.2.5 Acceptable Organisms: PSORTb v.3.0 accepts protein sequences from Gram-negative and Gram-positive bacteria as well as Archaea. All protein sequences from eukaryotic organisms must be analyzed using a different tool. See the Resources page for possible options.

5.2.6 Acceptable Formats: PSORTb requires that a PROTEIN sequence be submitted in FASTA format.

A sequence within a FASTA sequence file consists of three parts:

  • A title line, which must begin with a `>' symbol, and may be followed by any type of text
  • A newline character at the end of the title line
  • The sequence itself, which continues until the end of file or the next `>' is reached

An example of FASTA format is shown below:

>gi|31562958|sp|Q8CWD2|BTUF_ECOL6
MAKSLFRALVALSFLAPLWLNAAPRVITLSPANTELAFAAGITPVGVSSYSDYPLQAQKIEQVSTWQGMN
LERIVALKPDLVIAWRGGNAERQVDQLASLGIKVMWVDATSIEQIANALRQLAPWSPQPDKAEQAAQSLL
DQYAQLKAQYADKPKKRVFLQFGINPPFTSGKESIQNQVLEVCGGENIFKDSRVPWPQVSREQVLARSPQ
AIVITGGPDQIPKIKQYWGEQLKIPVIPLTSDWFERASPRIILAAQQLCNALSQVD

For more information, see the description at NCBI or contact us.
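For readers who want to check a file before submitting it, the three-part structure described above can be read with a few lines of Python. This is a minimal sketch with a hypothetical function name; it keeps the whole title line as the record key and does not validate that the residues are amino acids.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each title line (without the leading '>') to its full sequence."""
    records: Dict[str, str] = {}
    title = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            title = line[1:]
            records[title] = ""
        elif title is not None:
            # Sequence lines continue until the next '>' or the end of input.
            records[title] += line
    return records


# First and last lines of the example record shown above.
example = [
    ">gi|31562958|sp|Q8CWD2|BTUF_ECOL6",
    "MAKSLFRALVALSFLAPLWLNAAPRVITLSPANTELAFAAGITPVGVSSYSDYPLQAQKIEQVSTWQGMN",
    "AIVITGGPDQIPKIKQYWGEQLKIPVIPLTSDWFERASPRIILAAQQLCNALSQVD",
]
records = parse_fasta(example)
```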

5.2.7 Whole Genome Analysis: In order to reduce the load on the PSORTb servers, pre-calculated results for whole bacterial genomes are available on the PSORTb site, on the Genomes page.

5.3 Submitting a Sequence for Analysis to Standalone PSORTb

5.3.1 Sequence File: One or more sequences in FASTA format can be submitted to standalone PSORTb, provided they are all contained within one file (e.g. mysequences.txt) and are all from the same Gram class of organism. If you have both Gram-negative and Gram-positive sequences you wish to analyze, they must be divided into two files and run separately.

5.3.2 Command line syntax: Standalone PSORTb contains several options and arguments, which are described below. The most basic command, however, which will be sufficient for most instances, is:

$ psort [-p|-n|-a] mysequences.txt > mysequences.out

  • psort calls the PSORTb program
  • -p (Gram-positive) or -n (Gram-negative) or -a (Archaea) tells the program which predictive model to use
  • mysequences.txt is the name of your FASTA file containing the sequences to be analyzed
  • > mysequences.out sends the output to a new file that will be created called mysequences.out. If no > is used, the output will be written to the terminal display.

    Usage: psort [-p|-n|-a] [OPTIONS] [SEQFILE]
    Runs psort on the sequence file SEQFILE. If SEQFILE isn't provided then sequences will be read from STDIN.

    --help, -h        Displays usage information
    --positive, -p    Gram positive bacteria
    --negative, -n    Gram negative bacteria
    --archaea, -a     Archaea

    --cutoff, -c      Sets a cutoff value for reported results
    --divergent, -d   Sets a cutoff value for the multiple localization flag
    --matrix, -m      Specifies the path to the pftools installation. If not set, defaults to the value of the PSORT_PFTOOLS environment variable.
    --format, -f      Specifies sequence format (default is FASTA)
    --output, -o      Specifies the format for the output (default is 'normal'). Value can be one of: terse, long or normal
    --root, -r        Specify PSORT_ROOT for running local copies. If not set, defaults to the value of the PSORT_ROOT environment variable.
    --server, -s      Specifies the PSORT server to use
    --verbose,        Be verbose while running
    --version,        Print the version of PSORTb used

5.3.3 Help: Typing psort -h at the command prompt will bring up a list of available options and usage instructions.
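When standalone PSORTb is driven from a script, the basic command from 5.3.2 can be assembled programmatically. The sketch below uses only options listed in the usage text; the helper names are hypothetical, and passing an option and its value as two separate arguments (for example --output terse) is an assumption about how the program parses its command line.

```python
import subprocess
from typing import List, Optional

ORGANISM_FLAGS = {"positive": "-p", "negative": "-n", "archaea": "-a"}
OUTPUT_FORMATS = {"normal", "terse", "long"}


def build_psort_command(
    organism: str,
    seqfile: str,
    output_format: Optional[str] = None,
    cutoff: Optional[float] = None,
) -> List[str]:
    """Assemble the argument list for a standalone psort run."""
    if organism not in ORGANISM_FLAGS:
        raise ValueError(f"unknown organism class: {organism}")
    command = ["psort", ORGANISM_FLAGS[organism]]
    if output_format is not None:
        if output_format not in OUTPUT_FORMATS:
            raise ValueError(f"unknown output format: {output_format}")
        command += ["--output", output_format]
    if cutoff is not None:
        command += ["--cutoff", str(cutoff)]
    command.append(seqfile)
    return command


def run_psort(command: List[str], outfile: str) -> None:
    """Run psort and write its standard output to outfile."""
    with open(outfile, "w") as handle:
        subprocess.run(command, stdout=handle, check=True)
```

For example, build_psort_command("negative", "mysequences.txt") yields the same arguments as the basic command shown in 5.3.2, and run_psort(command, "mysequences.out") plays the role of the shell redirection.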

5.4 Understanding the Output

5.4.1 Output Formats: PSORTb allows the user to select one of three output formats from the sequence submission screen: Normal, Tab-delimited (terse format) and Tab-delimited (long format). Normal output is recommended for analysis of one or a few sequences, whereas tab-delimited output in either format is recommended for the analysis of a large number of sequences. The output formats are described below. If you would like to try the examples given below for yourself, input sequences are below:

Gram-positive input sequence:

>SAK_BPP42
MLKRSLLFLTVLLLLFSFSSITNEVSASSSFDKGKYKKGDDASYFEPTGPYLMVNVTGVDGKRNELLSPR
YVEFPIKPGTTLTKEKIEYYVEWALDATAYKEFRVVELDPSAKIEVTYYDKNKKKEETKSFPITEKGFVV
PDLSEHIKNPGFNLITKVVIEKK

Gram-negative input sequence:

>NP_949347.1
MQGHHFGGDMSNSEAIDNTTAKLRLAQSSSLLALALLIGSAPAQAADTDWGWLAIGAPAATAQGWTGKGV
VIGVVDTGIDFSHPALSGRAFDYNYGSFVAGSNHPHATHVAGIIGATDINRGMEGVAPDVRFSSMKIFTG
AGGSYLGDAAVADAYDGAIGSGVRIFNNSWGSSDSIANFTSREELLAHEPLLVGAFTRAVNADAVLVWST
GNDGRSQPSWQAAAPYYIQELKANWIAVTSVGENGTIASYANACGVAKAWCLAAPGGDFNPGIYSTIPGK
DYGYMSGTSMAAPYVTGATAIARQMFPKASGAQLAQIVLQTSRDIGAPGIDDVYGWGLLAVDNIVDTINP
RGAALFASAAWGRFTTLSAIGNTVLDRISDLRNGRGDVVTAPLAFAGQNGAFSQSGSNPRNAYAADLAAA
PQPSPLGFGSVWARGLAGRATLSGSASSPQTTADISGGLLGFDLVNNQNLLVGIAGGGSNTNLTASGISD
KAGAQAWHVLGYAAAMYGPAFVNVAGGWNSFDQSYQRRVIPGTAGTVFASTISAAQSSSTDVAYFFQGRG
GWTFQTEVGRIEPYVHGATRNQSFGGFSETNASIFSLSVPSASLSEAEYGAGVRWACAPIKTVDQRVAVA
PTIDLAYVRFTNDGPIQVETNLLGTSVVGQTAALGADAIRVAAGLSLTSLAGISGSFGYTGTVRDAATAH
TVSGGLSIKF

Archaeal input sequence:

>YP_001689002.1
MFEFITDEDERGQVGIGTLIVFIAMVLVAAIAAGVLINTAGYLQSKGSATGEEASAQVSNRINIVSAYGN
VNNEKVDYVNLTVRQAAGADNINLTKSTIQWIGPDRATTLTYSSNSPSSLGENFTTESIKGSSADVLVDQ
SDRIKVIMYASGVSSNLGAGDEVQLTVTTQYGSKTTYWAQVPESLKDKNA

5.4.2 Normal Output: The Normal output option displays the results of each of PSORTb's analytical modules, the localization scores for each of the possible sites, as well as a final prediction and associated score (if one site scores above the 7.5 cutoff). Below are examples of Gram-positive, Gram-negative and archaeal output, using the input sequences given in 5.4.1. Descriptions of the output fields can be found beneath each output example.

Gram-positive sample output:

SeqID: SAK_BPP42
Analysis Report:  
CMSVM+ Unknown [No details]
CWSVM+ Unknown [No details]
CytoSVM+ Unknown [No details]
ECSVM+ Extracellular [No details]
ModHMM+ Unknown [1 internal helix found]
Motif+ Unknown [No motifs found]
Profile+ Unknown [No matches to profiles found]
SCL-BLAST+ Extracellular [matched 134189: Extracellular protein]
SCL-BLASTe+ Unknown [No matches against database]
Signal+ Non-cytoplasmic [Signal peptide detected]
Localization Scores:
Cytoplasmic 0.0
CytoplasmicMembrane 0.0
Cellwall 0.2
Extracellular 9.98
Final Prediction:
Extracellular 9.98

SeqID returns whatever was found on the title line of the FASTA format input file.

The Analysis Report contains the results of each of PSORTb's analytical modules. The module name is listed in the left-most column, the centre column contains the localization site predicted by that module (or "Unknown" if the module did not generate a prediction), and the right-most column contains comments related to the modules' findings. The modules in the Gram-positive version are as follows:

  • CMSVM+: The Gram-positive version of the support vector machine trained to identify cytoplasmic membrane proteins. Returns cytoplasmic membrane or unknown.
  • CWSVM+: The support vector machine trained to identify cell wall proteins (Gram-positive and Archaea). Returns cell wall or unknown.
  • CytoSVM+: The Gram-positive version of the support vector machine trained to identify cytoplasmic proteins. Returns cytoplasmic or unknown.
  • ECSVM+: The Gram-positive version of the support vector machine trained to identify extracellular proteins. Returns extracellular or unknown.
  • ModHMM+: Predicts transmembrane helices within the sequence. The presence of 3 or more transmembrane helices causes the module to return a prediction of cytoplasmic membrane, otherwise unknown is returned. The Details column returns the number of predicted helices.
  • Motif+: Searches the sequence for Gram-positive motifs indicative of a specific localization site. If a match occurs, the localization site associated with that motif is reported, otherwise unknown is returned. The details column returns a link to the motif in PROSITE.
  • Profile+: Searches the sequence for Gram-positive profiles indicative of a specific localization site. If a match occurs, the localization site associated with that profile is reported, otherwise unknown is returned. The details column returns a link to the profile in PROSITE.
  • SCL-BLAST+: Performs a BLASTP search against the Gram-positive subset of the current PSORTdb dataset. If a match is found, its associated localization site is returned and a link to that protein's record at NCBI is provided in the Details column.
  • SCL-BLASTe+: Like SCL-BLAST, but only returns a match if the query and subject have 100% similarity and are within 1aa in length of each other. If a match is found, its associated localization site is returned and a link to that protein's record at NCBI is provided in the Details column.
  • Signal+: Searches the sequence for the presence of a Gram-positive cleavable N-terminal signal peptide. If a signal peptide is detected, the module returns a prediction of non-cytoplasmic, otherwise a result of unknown is returned.

In the Localization Scores area, the confidence values for each of the localization sites are given. If one of the sites has a score of 7.5 or greater, this site and its score are returned in the Final Prediction section. If two sites have high scores, a flag of "This protein may have multiple localization sites" is also returned in the Final Prediction field.
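A Normal-format report like the one above can be split into its three areas with a small parser. The sketch below is hypothetical and handles a single record only; it assumes the layout shown in the sample (a "SeqID:" line, the three section headings, module lines ending in a bracketed comment, and score lines of the form "site value") and tolerates any amount of whitespace between fields.

```python
import re

MODULE_LINE = re.compile(r"^(\S+)\s+(.*?)\s+\[(.*)\]$")
SCORE_LINE = re.compile(r"^(\S+)\s+(\d+(?:\.\d+)?)$")
SECTIONS = ("Analysis Report:", "Localization Scores:", "Final Prediction:")


def parse_normal_report(text: str) -> dict:
    """Parse one Normal-format record into modules, scores and final lines."""
    report = {"seq_id": None, "modules": [], "scores": {}, "final": []}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("SeqID:"):
            report["seq_id"] = line[len("SeqID:"):].strip()
        elif line in SECTIONS:
            section = line
        elif section == "Analysis Report:":
            match = MODULE_LINE.match(line)
            if match:
                # (module name, predicted site or "Unknown", comment)
                report["modules"].append(match.groups())
        elif section == "Localization Scores:":
            match = SCORE_LINE.match(line)
            if match:
                report["scores"][match.group(1)] = float(match.group(2))
        elif section == "Final Prediction:":
            report["final"].append(line)
    return report
```

The final prediction lines are kept as raw text because they may hold either a site with a score or an "Unknown" message.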

Gram-negative sample output (to illustrate multiple localization):

SeqID: NP_949347.1
Analysis Report:  
CMSVM- Unknown [No details]
CytoSVM- Unknown [No details]
ECSVM- Extracellular [No details]
ModHMM- Unknown [No internal helices found]
Motif- Unknown [No motifs found]
OMPMotif- Unknown [No motifs found]
OMSVM- OuterMembrane [No details]
PPSVM- Unknown [No details]
Profile- Unknown [No matches to profiles found]
SCL-BLAST- OuterMembrane, Extracellular [matched 3646417: Outer membrane (Autotransporter)]
SCL-BLASTe- Unknown [No matches against database]
Signal- Non-cytoplasmic [Signal peptide detected]
Localization Scores:
Cytoplasmic 0.00
CytoplasmicMembrane 0.00
Periplasm 0.00
OuterMembrane 5.87
Extracellular 4.13
Final Prediction:
Unknown (This protein may have multiple localization sites)

The modules that differ from those described for the Gram-positive version of PSORTb are listed below:

  • CMSVM-: The Gram-negative version of the support vector machine trained to identify cytoplasmic membrane proteins. Returns cytoplasmic membrane or unknown.
  • CytoSVM-: The Gram-negative version of the support vector machine trained to identify cytoplasmic proteins. Returns cytoplasmic or unknown.
  • ECSVM-: The Gram-negative version of the support vector machine trained to identify extracellular proteins. Returns extracellular or unknown.
  • ModHMM-: Predicts transmembrane helices within the sequence. The presence of 3 or more transmembrane helices causes the module to return a prediction of cytoplasmic membrane, otherwise unknown is returned. The Details column returns the number of predicted helices.
  • Motif-: Searches the sequence for Gram-negative motifs indicative of a specific localization site. If a match occurs, the localization site associated with that motif is reported, otherwise unknown is returned. The details column returns a link to the motif in PROSITE.
  • OMPMotif-: Searches the sequence for Gram-negative outer membrane protein motifs. If a match occurs, outer membrane is reported, otherwise unknown is returned. The details column returns the numerical identifiers of the motifs found.
  • OMSVM-: The support vector machine trained to identify outer membrane proteins. Returns outer membrane or unknown (Gram-negative only).
  • PPSVM-: The support vector machine trained to identify periplasmic proteins. Returns periplasm or unknown (Gram-negative only).
  • Profile-: Searches the sequence for Gram-negative profiles indicative of a specific localization site. If a match occurs, the localization site associated with that profile is reported, otherwise unknown is returned. The details column returns a link to the profile in PROSITE.
  • SCL-BLAST-: Performs a BLASTP search against the Gram-negative subset of the current PSORTdb dataset. If a match is found, its associated localization site is returned and a link to that protein's record at NCBI is provided in the Details column.
  • SCL-BLASTe-: See above
  • Signal-: Searches the sequence for the presence of a Gram-negative cleavable N-terminal signal peptide. If a signal peptide is detected, the module returns a prediction of non-cytoplasmic, otherwise a result of unknown is returned.

For the Gram-stain - Advanced options, the output for "Gram-positive with outer membrane" is similar to the normal Gram-negative output (with predictions for periplasmic and outer membrane localizations). The output for the "Gram-negative without outer membrane" option is similar to the normal Gram-positive output, except that the cell wall localization is not predicted in the final output, since Mycoplasma spp. and most Tenericutes are more phylogenetically similar to Gram-positive organisms but lack a peptidoglycan cell wall. If the modules predict "cell wall" as a protein's localization, the final localization will be flagged as "Unknown - predicted localization does not exist". From what we have observed, proteins with this prediction sometimes have a surface (cytoplasmic membrane) localization. Users should use their own discretion when interpreting PSORTb prediction results in this case.

Archaeal sample output (to illustrate sub-category localization detection):

SeqID: YP_001689002.1
Analysis Report:  
CMSVM_a Unknown [No details]
CWSVM_a Unknown [No details]
CytoSVM_a Unknown [No details]
ECSVM_a Extracellular [No details]
ModHMM_a Unknown [1 internal helix found]
Motif_a Unknown [No motifs found]
Profile_a Unknown [No matches to profiles found]
SCL-BLAST_a Extracellular [matched 47117675: Flagellin B1 precursor]
Signal_a Non-Cytoplasmic [Signal peptide detected]
Localization Scores:
Cytoplasmic 0.00
CytoplasmicMembrane 0.00
Cellwall 0.02
Extracellular 9.98
Final Prediction:
Extracellular 9.98
Secondary localization(s): Flagellar

The modules in the archaeal version are as follows:

  • CMSVM_a: The archaeal version of the support vector machine trained to identify cytoplasmic membrane proteins. Returns cytoplasmic membrane or unknown.
  • CWSVM_a: The archaeal version of the support vector machine trained to identify cell wall proteins. Returns cell wall or unknown.
  • CytoSVM_a: The archaeal version of the support vector machine trained to identify cytoplasmic proteins. Returns cytoplasmic or unknown.
  • ECSVM_a: The archaeal version of the support vector machine trained to identify extracellular proteins. Returns extracellular or unknown.
  • ModHMM_a: Predicts transmembrane helices within the sequence. The presence of 3 or more transmembrane helices causes the module to return a prediction of cytoplasmic membrane, otherwise unknown is returned. The Details column returns the number of predicted helices.
  • Motif_a: Searches the sequence for Gram-positive motifs indicative of a specific localization site, with the ones not applicable to Archaea removed. If a match occurs, the localization site associated with that motif is reported, otherwise unknown is returned. The details column returns a link to the motif in PROSITE.
  • Profile_a: Searches the sequence for Gram-positive profiles indicative of a specific localization site, with the ones not applicable to Archaea removed. If a match occurs, the localization site associated with that profile is reported, otherwise unknown is returned. The details column returns a link to the profile in PROSITE.
  • SCL-BLAST_a: Performs a BLASTP search against the Gram-positive and archaeal subset of the current PSORTdb dataset. If a match is found, its associated localization site is returned and a link to that protein's record at NCBI is provided in the Details column.
  • SCL-BLASTe_a: Like SCL-BLAST, but only returns a match if the query and subject have 100% similarity and are within 1aa in length of each other. If a match is found, its associated localization site is returned and a link to that protein's record at NCBI is provided in the Details column.
  • Signal_a: Searches the sequence for the presence of a Gram-positive cleavable N-terminal signal peptide. If a signal peptide is detected, the module returns a prediction of non-cytoplasmic, otherwise a result of unknown is returned.

5.4.3 Tab-delimited (Terse Format) Output: Tab-delimited terse format output returns a list of inputted sequences, each one on a new line, with 3 columns: SeqId contains the information from the FASTA file definition line, Localization contains the final prediction of localization site (or "Unknown" if no site scored above 7.5), and Score contains the confidence value associated with this localization site. Tab characters occur between the columns, and, in the case of a multiple sequence submission, each sequence record is separated by newline characters. This format can be easily read into a spreadsheet, using a program such as MS Excel.
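Besides a spreadsheet, the terse format can be read with Python's csv module. The following is a minimal sketch with a hypothetical function name. Whether the file starts with a header row is not stated here, so that is left as a parameter, and a score that is not a number is stored as None rather than guessed.

```python
import csv
import io
from typing import List, Optional, TextIO, Tuple

TerseRow = Tuple[str, str, Optional[float]]


def read_terse(handle: TextIO, has_header: bool = False) -> List[TerseRow]:
    """Read (SeqId, Localization, Score) rows from terse tab-delimited output."""
    rows: List[TerseRow] = []
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the column names
    for fields in reader:
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        seq_id, localization, score_text = fields[0], fields[1], fields[2]
        try:
            score: Optional[float] = float(score_text)
        except ValueError:
            score = None
        rows.append((seq_id, localization, score))
    return rows


# Two rows built from the sample predictions shown in section 5.4.2.
terse_text = "SAK_BPP42\tExtracellular\t9.98\nYP_001689002.1\tExtracellular\t9.98\n"
terse_rows = read_terse(io.StringIO(terse_text))
```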

5.4.4 Tab-delimited (Long Format) Output: Tab-delimited long format output returns a list of inputted sequences, each one on a new line, and with all of the information from the PSORTb results placed into columns. The SeqId, module results and comments from the analysis report, localizations and scores, and the final prediction and score are each placed into their own column.

5.5 Options for Retrieving Results

5.5.1 View results via the web: PSORTb prediction results are displayed as a webpage, in the output format chosen by the user. This is the most convenient way to view results if you are only analyzing a few proteins.

5.5.2 Send results by email: PSORTb prediction results are sent to a user-provided email address in the output format chosen by the user. This method is suitable for analysis of a larger number of proteins, and when the output is to be transferred to another document or used for further analyses.

6. Limitations

PSORTb is designed to emphasize precision (or specificity) over recall (or sensitivity), and as a result, some classes of proteins are not predicted well. The following issues must be considered when performing an analysis using the current version of PSORTb:

6.1 Proteins resident at multiple localization sites: Many proteins can exist at multiple localization sites. Examples of such proteins include integral membrane proteins with large periplasmic domains, or autotransporters, which contain an outer membrane pore domain and a cleaved extracellular domain. The current version of PSORTb handles this situation by flagging proteins which show a distribution of localization scores favouring two sites, rather than one. It is important to examine the distribution of localization scores carefully in order to determine if your submitted protein may have multiple localization sites and if so, which two sites are involved.

6.2 Lipoproteins: The current version of PSORTb does not detect lipoprotein motifs.

6.3 Precision vs. Recall: PSORTb is designed to emphasize precision (or specificity) over recall (or sensitivity). Programs which make predictions at all costs often provide incorrect or incomplete results, which can be propagated through annotated databases, datasets and reports in the literature. We believe that a confident prediction is more valuable than any prediction, and we have designed the program to this end. Note, however, that a user may choose to use their own reduced cutoff score in generating final predictions.
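The trade-off can be made concrete with a small calculation. The sketch below is illustrative only: it takes precision as the fraction of predictions made that are correct and recall as the fraction of all proteins that receive a correct prediction, and the function name and the data in the usage lines are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def precision_recall(
    results: Iterable[Tuple[Optional[str], str]]
) -> Tuple[float, float]:
    """Compute (precision, recall) from (predicted, true) localization pairs.

    A predicted value of None stands for an "Unknown" result.
    """
    pairs = list(results)
    made = [(pred, true) for pred, true in pairs if pred is not None]
    correct = sum(1 for pred, true in made if pred == true)
    precision = correct / len(made) if made else 0.0
    recall = correct / len(pairs) if pairs else 0.0
    return precision, recall


# Hypothetical results for four proteins, one of them left as "Unknown".
demo = [
    ("Cytoplasmic", "Cytoplasmic"),
    ("Periplasm", "Cytoplasmic"),
    (None, "Extracellular"),
    ("OuterMembrane", "OuterMembrane"),
]
demo_precision, demo_recall = precision_recall(demo)
```

Returning "Unknown" for a doubtful protein leaves precision untouched but lowers recall. Lowering the cutoff turns some "Unknown" results into predictions, which can raise recall but puts precision at risk if the added predictions are wrong.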

 
