PSORTb v.2.0: Documentation

PSORTb v.2.0 Documentation

This section contains documentation and references related to PSORTb v.2.0. Documentation pertaining to the old version of PSORT-B (v.1.1) is still accessible here.

A plain text version of the documentation is available here.

1. History
2. PSORTb v.2.0 vs previous versions of PSORT and PSORT-B
3. PSORTb v.2.0: Analytical Modules
4. PSORTb v.2.0: Generating a Final Prediction
5. Using PSORTb v.2.0
6. Limitations of PSORTb v.2.0

1. History

Computational prediction of the subcellular localization of proteins is a valuable tool for genome analysis and annotation, since a protein's subcellular localization can provide clues regarding its function in an organism. For bacterial pathogens, the prediction of proteins on the cell surface is of particular interest due to the potential of such proteins to be primary drug or vaccine targets. A protein's subcellular localization is influenced by several features present within the protein's primary structure, such as the presence of a signal peptide or membrane-spanning alpha-helices.

Several algorithms have been developed to analyze single features such as these; however, the PSORT family of programs analyzes several features at once, using information obtained from each analysis to generate an overall prediction of localization site. Developed by Kenta Nakai in 1991, PSORT is an algorithm which assigns a probable localization site to a protein given an amino acid sequence alone. Originally developed for prediction of protein localization in Gram-negative bacteria, PSORT was expanded into a suite of programs (PSORT, PSORT II, iPSORT) capable of handling proteins from all classes of organisms.

The Brinkman Laboratory headed development of PSORT-B, an updated version of the PSORT algorithm designed for Gram-negative bacterial proteins. PSORT-B includes new analytical modules designed to capitalize on new discoveries and observations in protein sorting, and benefits from a training dataset of over 1400 proteins of known localization. Its focus is on precision over recall to facilitate accurate predictions, at the expense of not making as many predictions as other methods may make. PSORT-B v.1.1 was released in July 2003 and has now been succeeded by PSORTb v.2.0.

2. PSORTb v.2.0 vs previous versions of PSORT and PSORT-B

The original version of PSORT, still frequently used for prediction of prokaryotic localization sites, used a number of analyses arranged in an if/then rule-based format to determine which of four localization sites a protein might be resident at - cytoplasm, periplasm, inner or outer membrane (see the documentation available at the PSORT WWW server for a full explanation).

The PSORTb algorithm, however:

uses updated versions of several of these analyses, as well as several novel analytical methods
utilizes a probabilistic system for determination of a final prediction, rather than a rule-based system
is capable of predicting all localization sites (PSORT I does not predict extracellular proteins)
does not force a prediction, returning a prediction of "Unknown" if no prediction is made
displays a 28% increase in precision (% of correct predictions) relative to PSORT I

Furthermore, PSORTb v.2.0 offers several improvements over v.1.1:

prediction of Gram-positive proteins added
increased coverage (more predictions are made)
automated flagging of proteins with potential multiple localization sites

Note also the change in name between PSORT-B v.1.1 and PSORTb v.2.0 - the hyphen was eliminated in order to avoid conflicts during pattern matching or other searches.

3. PSORTb v.2.0: Analytical Modules

PSORTb v.2.0 consists of multiple analytical modules, each of which analyzes one biological feature known to influence or be characteristic of subcellular localization. The modules may act as a binary predictor, classifying a protein as either belonging or not belonging to a particular localization site, or they may be multi-category, able to assign a protein to one of several localization sites. When analyzing a Gram-negative organism, possible localization sites are: cytoplasm, cytoplasmic membrane, periplasm, outer membrane and extracellular space. Gram-positive localization sites include: cytoplasm, cytoplasmic membrane, cell wall and extracellular space. All modules are capable of returning a negative prediction as well, such that a protein will not be forced into one of the localization sites.

3.1 SCL-BLAST & SCL-BLASTe, or SubCellular Localization BLAST, is a BLAST-P search against the current local database of proteins of known subcellular localization. An E-value cutoff of 10e-10 is used to ensure that returned HSPs represent true homologs, and an additional length restriction is placed on any subject matches - the length of the query:subject HSP must be within 80-120% of the length of the subject protein, thus reducing potential errors associated with the domain nature of proteins. SCL-BLAST selects the top-scoring HSP from the list of results, and returns that protein's localization site as its prediction, along with the name of the top-scoring HSP and the associated E-value. SCL-BLAST is capable of assigning a protein to any one of the possible localization sites. SCL-BLASTe is a specialized implementation of this analysis, in which a user's query protein is checked to see if it is an exact match to a protein in the SCL-BLAST database. If an exact match is found (100% similarity and within 1aa length), the protein is immediately predicted as residing at that localization site, and is not passed to subsequent modules.
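
The two filters described above (E-value cutoff and the 80-120% length restriction) can be sketched as follows. This is only an illustration under stated assumptions: the `Hsp` record, its field names and the function names are hypothetical and not part of PSORTb, and whether the cutoffs are inclusive is assumed rather than documented.

```python
from dataclasses import dataclass
from typing import List, Optional

# E-value cutoff exactly as written in the text above.
E_VALUE_CUTOFF = 10e-10


@dataclass
class Hsp:
    """One BLAST HSP. Field names are illustrative, not PSORTb's own."""
    subject_id: str
    localization: str
    e_value: float
    score: float
    hsp_length: int
    subject_length: int


def passes_filters(hsp: Hsp) -> bool:
    """Apply the E-value cutoff and the 80-120% length restriction."""
    ratio = hsp.hsp_length / hsp.subject_length
    # Inclusive bounds are an assumption made for this sketch.
    return hsp.e_value <= E_VALUE_CUTOFF and 0.8 <= ratio <= 1.2


def scl_blast_prediction(hsps: List[Hsp]) -> Optional[Hsp]:
    """Return the top-scoring HSP that survives both filters, or None."""
    kept = [h for h in hsps if passes_filters(h)]
    if not kept:
        return None
    return max(kept, key=lambda h: h.score)
```

A hit covering only half of the subject protein is rejected by the length restriction even if its E-value is excellent, which is how the filter guards against matches to a single shared domain.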

3.2 Support Vector Machines (SVMs) are machine learning-based classifiers trained to classify a protein as belonging or not belonging to the set of proteins at a specific localization site. PSORTb v.2.0 contains 9 SVMs, one for each of the localization sites (5 Gram-negative and 4 Gram-positive). Trained using frequent sequences mined from proteins resident at a specific localization site, each SVM will examine a query protein and determine whether it does or does not belong at the localization site in question. If the SVM classifies the protein as belonging to that particular site, that result is returned. Otherwise, an unknown prediction is returned.

3.3 Motif & Profile Analysis relies on the observation that a protein's function is closely linked to its localization, and that several PROSITE motifs characteristic of specific functions can be used to infer specific localizations. Several potentially important motifs were used to scan our current dataset, and motifs with a false-positive rate of 0% were built into PSORTb, as were expanded versions of the motifs termed "profiles". Note that the 0% false-positive rate is based on the dataset of proteins of known localization that we currently have access to. We wish to emphasize that this does not necessarily mean that such motifs and profiles will always be 100% accurate. If you identify an incorrect prediction, please contact us. A submitted protein is scanned for the occurrence of any of these motifs or profiles, and, if found, the localization site associated with the motif/profile is returned as the program's prediction. Motifs associated with each of the possible localization sites are included in PSORTb.

3.4 Outer Membrane Motif Analysis uses motifs generated from data mining techniques applied to a set of 425 beta-barrel proteins to classify a query protein as outer membrane or non-outer membrane (She et al, 2003). The A Priori algorithm was used to mine for short motifs found more often in outer membrane proteins than in proteins at the other four localization sites. Over 250 such motifs were identified, and a query protein is scanned for the co-occurrence of two or more of these motifs. A prediction of outer membrane is returned if successful.
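
The co-occurrence rule described above can be illustrated with a short sketch. The three motifs below are invented placeholders written as regular expressions; they are NOT the motifs mined for PSORTb, and the function names are hypothetical.

```python
import re
from typing import Iterable, List

# Placeholder motifs for illustration only; not PSORTb's mined motifs.
EXAMPLE_MOTIFS = ["G.Y.F", "N.W..G", "Y..A.L"]


def motifs_found(sequence: str, motifs: Iterable[str]) -> List[str]:
    """Return every motif that occurs at least once in the sequence."""
    return [m for m in motifs if re.search(m, sequence)]


def omp_motif_prediction(
    sequence: str,
    motifs: Iterable[str] = EXAMPLE_MOTIFS,
    min_motifs: int = 2,
) -> str:
    """Predict outer membrane when two or more distinct motifs co-occur."""
    hits = motifs_found(sequence, motifs)
    return "OuterMembrane" if len(hits) >= min_motifs else "Unknown"
```

Requiring at least two distinct motifs, rather than one, is what keeps a single chance match from producing an outer membrane call.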

3.5 HMMTOP (Tusnady, 1998), a hidden Markov model-based method, is used to identify transmembrane alpha helices (TMHs), which can then be used to identify proteins spanning the cytoplasmic membrane. Our analyses have shown that when three or more TMHs are predicted in a protein, there is a 94% chance of that protein being a cytoplasmic (inner) membrane protein. PSORTb therefore returns a prediction of cytoplasmic membrane if 3 or more TMHs are found.

3.6 A Signal Peptide directs a protein for export past the cytoplasmic membrane, and thus can be further used to differentiate cytoplasmic and non-cytoplasmic proteins. A hidden Markov model was trained on the dataset used to train the SignalP program, and is used to predict potential signal peptide cleavage sites. If a cleavage site with a high probability value is not found, the first 70 amino acids of the protein are passed to a support vector machine module trained on the same data. If the SVM is unable to recognize a signal peptide, the protein is predicted not to have one and is classified as cytoplasmic. However, a protein may possess a non-traditional signal peptide, so the results of this analysis carry less weight than do other modules when generating a final prediction.
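
The two-stage decision described above (HMM first, then an SVM on the first 70 residues) might be sketched as below. The HMM and SVM are passed in as plain callables because the trained models themselves are not reproduced here; the function name, the parameter names and the 0.9 threshold for a "high probability" cleavage site are all assumptions made for illustration.

```python
from typing import Callable


def signal_module_call(
    sequence: str,
    hmm_cleavage_probability: Callable[[str], float],
    svm_detects_signal: Callable[[str], bool],
    hmm_threshold: float = 0.9,  # assumed value; the text gives no number
    n_terminal_length: int = 70,  # first 70 amino acids, as in the text
) -> str:
    """Two-stage signal peptide check following the description above."""
    # Stage 1: accept a confidently predicted cleavage site from the HMM.
    if hmm_cleavage_probability(sequence) >= hmm_threshold:
        return "Non-cytoplasmic"
    # Stage 2: fall back to the SVM, which sees only the N-terminal region.
    if svm_detects_signal(sequence[:n_terminal_length]):
        return "Non-cytoplasmic"
    # Neither stage found a signal peptide.
    return "Cytoplasmic"
```

Because a protein may carry a non-traditional signal peptide, the final branch is the weakest conclusion of the three, which is consistent with this module carrying less weight in the final prediction.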

4. PSORTb v.2.0: Generating a Final Prediction

In order to generate a final prediction, the results of each module are combined and assessed. A probabilistic method and 5-fold cross validation were used to assess the likelihood of a protein being at a specific localization given the prediction of a certain module. These likelihoods are used to generate a probability value for each of the five localization sites for a user's query protein.

PSORTb v.2.0 returns a list of the five localization sites and the associated probability value for each. We consider 7.5 to be a good cutoff above which a single localization can be assigned, and our precision and recall values for the program are calculated using this cutoff.

In certain cases, two localization sites may both exhibit high scores, which may indicate a protein with domains present in neighbouring localization sites. In cases where a localization site has a score between a lower bound (4.5 for Gram-negative, 5.0 for Gram-positive) and 7.49, the result returned to the user will say "Unknown - This protein may have multiple localization sites". In cases like these, we recommend you examine the long format output of the program's prediction to draw your own conclusion.
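
The cutoff logic described in this section can be summarized in a short sketch. It assumes the per-site scores have already been computed, treats both bounds as inclusive, and applies the multiple-localization test to the highest-scoring site; these choices and all names are assumptions, not PSORTb's actual code.

```python
from typing import Dict, NamedTuple, Optional

FINAL_CUTOFF = 7.5
# Lower bounds for the multiple-localization flag, per Gram class.
MULTI_SITE_LOWER_BOUND = {"negative": 4.5, "positive": 5.0}


class Prediction(NamedTuple):
    site: str
    score: Optional[float]
    may_have_multiple_sites: bool


def final_prediction(scores: Dict[str, float], gram: str) -> Prediction:
    """Turn a table of localization scores into a final call."""
    site, top = max(scores.items(), key=lambda item: item[1])
    if top >= FINAL_CUTOFF:
        return Prediction(site, top, False)
    if top >= MULTI_SITE_LOWER_BOUND[gram]:
        return Prediction("Unknown", None, True)
    return Prediction("Unknown", None, False)
```

For example, a Gram-negative protein scoring 5.87 for the outer membrane and 4.13 for the extracellular space does not reach 7.5, but its top score is above 4.5, so it would be reported as Unknown with the multiple-localization flag set.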

5. Using PSORTb v.2.0

This section of the documentation will be updated as changes are made to the web interface. Please check back often for up-to-date instructions on program use.

5.1 Accessing PSORTb

5.1.1 WWW Access: PSORTb is available online at http://www.psort.org. The sequence submission form for the current version of the program is located at http://www.psort.org/psortb2. The first release of the program is still accessible, at http://www.psort.org/psortb/v1index.html.

5.1.2 Standalone PSORTb: PSORTb is also available as a standalone program to run in a Linux environment. The file, as well as instructions for installation, is available at the PSORTb Downloads page.

5.2 Submitting a Sequence for Analysis on the WWW

5.2.1 Sequence Submission: The sequence submission form can be found at http://www.psort.org/psortb/. One or more sequences can be pasted into the text box, or the "upload from file" option can be used to analyze a file of one or more sequences stored on your computer. When using the text box option, please note that a maximum of 600,000 characters can be pasted into the box.

5.2.2 Selecting Gram Stain: PSORTb v.2.0 performs different analyses depending on the class of organism. You are required to choose the appropriate Gram-stain for your sequences. Not sure which option to select? Our Genomes page lists the classifications we used when we analyzed sequenced genomes. If your organism is not found there, try the NCBI Taxonomy Browser, which provides a rough taxonomy for many bacterial species that may be helpful (for example, there is an association between proteobacteria and Gram-negative stain properties), or see the authoritative Bergey's Manual for Gram-stain properties for your microbe of interest.

5.2.3 Acceptable Organisms: PSORTb v.2.0 accepts protein sequences from Gram-negative and Gram-positive bacteria. All protein sequences from Archaea and eukaryotic organisms must be analyzed using a different tool. See the Resources page for possible options.

5.2.4 Acceptable Formats: PSORTb requires that a PROTEIN sequence be submitted in FASTA format.

A sequence within a FASTA sequence file consists of three parts:

A title line, which must begin with a `>' symbol, and may be followed by any type of text

A newline character at the end of the title line

The sequence itself, which continues until the end of file or the next `>' is reached

An example of FASTA format is shown below:

>gi|31562958|sp|Q8CWD2|BTUF_ECOL6
MAKSLFRALVALSFLAPLWLNAAPRVITLSPANTELAFAAGITPVGVSSYSDYPLQAQKIEQVSTWQGMN
LERIVALKPDLVIAWRGGNAERQVDQLASLGIKVMWVDATSIEQIANALRQLAPWSPQPDKAEQAAQSLL
DQYAQLKAQYADKPKKRVFLQFGINPPFTSGKESIQNQVLEVCGGENIFKDSRVPWPQVSREQVLARSPQ
AIVITGGPDQIPKIKQYWGEQLKIPVIPLTSDWFERASPRIILAAQQLCNALSQVD

For more information, see the description at NCBI or contact us.
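
The three-part structure described above is simple enough to check before submission. The following is a minimal sketch of a FASTA reader written for illustration; it is not the parser PSORTb uses, and the function name is hypothetical.

```python
from typing import Iterator, List, Optional, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (title, sequence) pairs from FASTA-formatted text."""
    title: Optional[str] = None
    chunks: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:]
            chunks = []
        elif title is not None:
            # Sequence lines are joined until the next '>' or end of file.
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)
```

Running this over a file before submission gives a quick count of records and catches sequence text that appears before any title line, which the sketch silently skips.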

5.2.5 Whole Genome Analysis: In order to reduce the load on the PSORTb servers, precalculated results for whole bacterial genomes are available on the PSORTb site, on the Genomes page.

5.3 Submitting a Sequence for Analysis to Standalone PSORTb

5.3.1 Sequence File: One or more sequences in FASTA format can be submitted to standalone PSORTb, provided they are all contained within one file (e.g. mysequences.txt) and are all from the same Gram class of organism. If you have both Gram-negative and Gram-positive sequences you wish to analyze, they must be divided into two files and run separately.

5.3.2 Command line syntax: Standalone PSORTb contains several options and arguments, which are described below. The most basic command, however, which will be sufficient for most instances, is:

$ psort [-p|-n] mysequences.txt > mysequences.out

psort calls the PSORTb program

-p (Gram-positive) or -n (Gram-negative) tells the program which predictive model to use

mysequences.txt is the name of your FASTA file containing the sequences to be analyzed

> mysequences.out sends the output to a new file that will be created called mysequences.out. If no > is used, the output will be written to the terminal display.

Usage: psort [-p|-n] [OPTIONS] [SEQFILE]

Runs psort on the sequence file SEQFILE. If SEQFILE isn't provided then sequences will be read from STDIN.

  --help, -h         Displays usage information
  --positive, -p     Gram positive bacteria
  --negative, -n     Gram negative bacteria
  --cutoff, -c       Sets a cutoff value for reported results
  --divergent, -d    Sets a cutoff value for the multiple localization flag
  --hmmtop, -h       Specifies the path to the HMMTOP installation. If not set, defaults to the value of the PSORT_HMMTOP environment variable.
  --matrix, -m       Specifies the path to the pftools installation. If not set, defaults to the value of the PSORT_PFTOOLS environment variable.
  --format, -f       Specifies sequence format (default is FASTA)
  --output, -o       Specifies the format for the output (default is 'normal'). Value can be one of: terse, long or normal
  --root, -r         Specify PSORT_ROOT for running local copies. If not set, defaults to the value of the PSORT_ROOT environment variable.
  --server, -s       Specifies the PSort server to use
  --verbose, -v      Be verbose while running

5.3.3 Help: Typing psort -h at the command prompt will bring up a list of available options and usage instructions.
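
For batch work, the basic command from 5.3.2 can be wrapped in a small script. The sketch below only uses options listed in the usage text (-p, -n and -o); passing the output format as a separate argument after -o is an assumption, as are the function names, and the script expects a working standalone installation with psort on the PATH.

```python
import shutil
import subprocess
from typing import List


def build_psort_command(
    fasta_path: str, gram: str, output_format: str = "normal"
) -> List[str]:
    """Build the argument list for a standalone psort run."""
    if gram not in ("positive", "negative"):
        raise ValueError("gram must be 'positive' or 'negative'")
    if output_format not in ("normal", "terse", "long"):
        raise ValueError("output_format must be normal, terse or long")
    gram_flag = "-p" if gram == "positive" else "-n"
    # Assumption: the format value follows -o as its own argument.
    return ["psort", gram_flag, "-o", output_format, fasta_path]


def run_psort(fasta_path: str, gram: str, output_format: str = "normal") -> str:
    """Run psort and return what it writes to standard output."""
    if shutil.which("psort") is None:
        raise FileNotFoundError("psort executable not found on PATH")
    command = build_psort_command(fasta_path, gram, output_format)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

Because all sequences in one file must share a Gram class, a mixed dataset would need one call per file, for example one with gram="negative" and one with gram="positive".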

5.4 Understanding the Output

5.4.1 Output Formats: PSORTb allows the user to select one of three output formats from the sequence submission screen: Normal, Tab-delimited (terse format) and Tab-delimited (long format). Normal output is recommended for analysis of one or a few sequences, whereas tab-delimited output in either format is recommended for the analysis of a large number of sequences. The output formats are described below. If you would like to try the examples for yourself, the input sequences are:

Gram-positive input sequence:

>SAK_BPP42
MLKRSLLFLTVLLLLFSFSSITNEVSASSSFDKGKYKKGDDASYFEPTGPYLMVNVTGVDGKRNELLSPR
YVEFPIKPGTTLTKEKIEYYVEWALDATAYKEFRVVELDPSAKIEVTYYDKNKKKEETKSFPITEKGFVV
PDLSEHIKNPGFNLITKVVIEKK

Gram-negative input sequence:

>NP_949347.1
MQGHHFGGDMSNSEAIDNTTAKLRLAQSSSLLALALLIGSAPAQAADTDWGWLAIGAPAATAQGWTGKGV
VIGVVDTGIDFSHPALSGRAFDYNYGSFVAGSNHPHATHVAGIIGATDINRGMEGVAPDVRFSSMKIFTG
AGGSYLGDAAVADAYDGAIGSGVRIFNNSWGSSDSIANFTSREELLAHEPLLVGAFTRAVNADAVLVWST
GNDGRSQPSWQAAAPYYIQELKANWIAVTSVGENGTIASYANACGVAKAWCLAAPGGDFNPGIYSTIPGK
DYGYMSGTSMAAPYVTGATAIARQMFPKASGAQLAQIVLQTSRDIGAPGIDDVYGWGLLAVDNIVDTINP
RGAALFASAAWGRFTTLSAIGNTVLDRISDLRNGRGDVVTAPLAFAGQNGAFSQSGSNPRNAYAADLAAA
PQPSPLGFGSVWARGLAGRATLSGSASSPQTTADISGGLLGFDLVNNQNLLVGIAGGGSNTNLTASGISD
KAGAQAWHVLGYAAAMYGPAFVNVAGGWNSFDQSYQRRVIPGTAGTVFASTISAAQSSSTDVAYFFQGRG
GWTFQTEVGRIEPYVHGATRNQSFGGFSETNASIFSLSVPSASLSEAEYGAGVRWACAPIKTVDQRVAVA
PTIDLAYVRFTNDGPIQVETNLLGTSVVGQTAALGADAIRVAAGLSLTSLAGISGSFGYTGTVRDAATAH
TVSGGLSIKF

5.4.2 Normal Output: The Normal output option displays the results of each of PSORTb's analytical modules, the localization scores for each of the localization sites (5 for Gram-negative, 4 for Gram-positive), as well as a final prediction and associated score (if one site scores above the 7.5 cutoff). Below are examples of both Gram-positive and Gram-negative output, using the input sequences given in 5.4.1. Descriptions of the output fields can be found beneath each output example.

Gram-positive sample output:

SeqID: SAK_BPP42
  Analysis Report:
    CMSVM+         Unknown            [No details]
    CWSVM+         Unknown            [No details]
    CytoSVM+       Unknown            [No details]
    ECSVM+         Extracellular      [No details]
    HMMTOP         Unknown            [1 internal helix found]
    Motif+         Unknown            [No motifs found]
    Profile+       Unknown            [No matches to profiles found]
    SCL-BLAST+     Extracellular      [matched 134189: Extracellular protein]
    SCL-BLASTe+    Unknown            [No matches against database]
    Signal+        Non-cytoplasmic    [Signal peptide detected]
  Localization Scores:
    Cytoplasmic            0.0
    CytoplasmicMembrane    0.0
    Cellwall               0.2
    Extracellular          9.98
  Final Prediction:
    Extracellular          9.98

SeqID returns whatever was found on the title line of the FASTA format input file.

The Analysis Report contains the results of each of PSORTb's analytical modules. The module name is listed in the left-most column, the centre column contains the localization site predicted by that module (or "Unknown" if the module did not generate a prediction), and the right-most column contains comments related to the module's findings. The modules in the Gram-positive version are as follows:

CMSVM+: The Gram-positive version of the support vector machine trained to identify cytoplasmic membrane proteins. Returns cytoplasmic membrane or unknown.

CWSVM+: The support vector machine trained to identify cell wall proteins (Gram-positive only). Returns cell wall or unknown.

CytoSVM+: The Gram-positive version of the support vector machine trained to identify cytoplasmic proteins. Returns cytoplasmic or unknown.

ECSVM+: The Gram-positive version of the support vector machine trained to identify extracellular proteins. Returns extracellular or unknown.

HMMTOP: Predicts transmembrane helices within the sequence. The presence of 3 or more transmembrane helices causes the module to return a prediction of cytoplasmic membrane, otherwise unknown is returned. The Details column returns the number of predicted helices.

Motif+: Searches the sequence for Gram-positive motifs indicative of a specific localization site. If a match occurs, the localization site associated with that motif is reported, otherwise unknown is returned. The details column returns a link to the motif in PROSITE.

Profile+: Searches the sequence for Gram-positive profiles indicative of a specific localization site. If a match occurs, the localization site associated with that profile is reported, otherwise unknown is returned. The details column returns a link to the profile in PROSITE.

SCL-BLAST+: Performs a BLASTP search against the Gram-positive subset of the current PSORTdb dataset. If a match is found, its associated localization site is returned and a link to that protein's record at NCBI is provided in the Details column.

SCL-BLASTe+: Like SCL-BLAST, but only returns a match if the query and subject have 100% similarity and are within 1aa in length of each other. If a match is found, its associated localization site is returned and a link to that protein's record at NCBI is provided in the Details column.

Signal+: Searches the sequence for the presence of a Gram-positive cleavable N-terminal signal peptide. If a signal peptide is detected, the module returns a prediction of non-cytoplasmic, otherwise a result of unknown is returned.

In the Localization Scores area, the confidence value for each of the localization sites is given. If one of the sites has a score of 7.5 or greater, this site and its score are returned in the Final Prediction section. If two sites have high scores, a flag of "This protein may have multiple localization sites" is also returned in the Final Prediction field.

Gram-negative sample output (to illustrate multiple localization):

SeqID: NP_949347.1
  Analysis Report:
    CMSVM-         Unknown                         [No details]
    CytoSVM-       Unknown                         [No details]
    ECSVM-         Extracellular                   [No details]
    HMMTOP         Unknown                         [No internal helices found]
    Motif-         Unknown                         [No motifs found]
    OMPMotif-      Unknown                         [No motifs found]
    OMSVM-         OuterMembrane                   [No details]
    PPSVM-         Unknown                         [No details]
    Profile-       Unknown                         [No matches to profiles found]
    SCL-BLAST-     OuterMembrane, Extracellular    [matched 3646417: Outer membrane (Autotransporter)]
    SCL-BLASTe-    Unknown                         [No matches against database]
    Signal-        Non-cytoplasmic                 [Signal peptide detected]
  Localization Scores:
    Cytoplasmic            0.00
    CytoplasmicMembrane    0.00
    Periplasm              0.00
    OuterMembrane          5.87
    Extracellular          4.13
  Final Prediction:
    Unknown (This protein may have multiple localization sites)

The modules that differ from those described for the Gram-positive version of PSORTb are listed below:

CMSVM-: The Gram-negative version of the support vector machine trained to identify cytoplasmic membrane proteins. Returns cytoplasmic membrane or unknown.

CytoSVM-: The Gram-negative version of the support vector machine trained to identify cytoplasmic proteins. Returns cytoplasmic or unknown.

ECSVM-: The Gram-negative version of the support vector machine trained to identify extracellular proteins. Returns extracellular or unknown.

HMMTOP: See above.

Motif-: Searches the sequence for Gram-negative motifs indicative of a specific localization site. If a match occurs, the localization site associated with that motif is reported, otherwise unknown is returned. The details column returns a link to the motif in PROSITE.

OMPMotif-: Searches the sequence for Gram-negative outer membrane protein motifs. If a match occurs, outer membrane is reported, otherwise unknown is returned. The details column returns the numerical identifiers of the motifs found.

OMSVM-: The support vector machine trained to identify outer membrane proteins. Returns outer membrane or unknown (Gram-negative only).

PPSVM-: The support vector machine trained to identify periplasmic proteins. Returns periplasm or unknown (Gram-negative only).

Profile-: Searches the sequence for Gram-negative profiles indicative of a specific localization site. If a match occurs, the localization site associated with that profile is reported, otherwise unknown is returned. The details column returns a link to the profile in PROSITE.

SCL-BLAST-: Performs a BLASTP search against the Gram-negative subset of the current PSORTdb dataset. If a match is found, its associated localization site is returned and a link to that protein's record at NCBI is provided in the Details column.

SCL-BLASTe-: See above

Signal-: Searches the sequence for the presence of a Gram-negative cleavable N-terminal signal peptide. If a signal peptide is detected, the module returns a prediction of non-cytoplasmic, otherwise a result of unknown is returned.

In the Localization Scores area, the confidence value for each of the localization sites is given. If one of the sites has a score of 7.5 or greater, this site and its score are returned in the Final Prediction section. If two sites have high scores, a flag of "This protein may have multiple localization sites" is also returned in the Final Prediction field.

5.4.3 Tab-delimited (Terse Format) Output: Tab-delimited terse format output returns a list of inputted sequences, each one on a new line, with 3 columns: SeqId contains the information from the FASTA file definition line, Localization contains the final prediction of localization site (or "Unknown" if no site scored above 7.5), and Score contains the confidence value associated with this localization site. Tab characters occur between the columns, and, in the case of a multiple sequence submission, each sequence record is separated by newline characters. This format can be easily read into a spreadsheet, using a program such as MS Excel.
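
Besides a spreadsheet, the three-column terse format can be read with a few lines of code. The sketch below is an illustration under assumptions: it skips a header row if the first field reads "SeqID" (whether the real output contains such a row is not stated here), and it stores the score as None when the Score column is empty or not numeric. All names are hypothetical.

```python
from typing import List, NamedTuple, Optional


class TerseRecord(NamedTuple):
    seq_id: str
    localization: str
    score: Optional[float]


def parse_terse_output(text: str) -> List[TerseRecord]:
    """Parse tab-delimited terse output into a list of records."""
    records: List[TerseRecord] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # not a SeqId / Localization / Score row
        seq_id, localization, score_text = (f.strip() for f in fields)
        if seq_id.lower() == "seqid":
            continue  # assumed header row
        try:
            score: Optional[float] = float(score_text)
        except ValueError:
            score = None  # e.g. no score reported for an Unknown result
        records.append(TerseRecord(seq_id, localization, score))
    return records
```

With the records in hand, it is straightforward to count predictions per localization site or to list the sequences that were returned as Unknown.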

5.4.4 Tab-delimited (Long Format) Output: Tab-delimited long format output returns a list of inputted sequences, each one on a new line, and with all of the information from the PSORTb results placed into columns. The SeqId, module results and comments from the analysis report, localizations and scores, and the final prediction and score are each placed into their own column.

6. Limitations of PSORTb v.2.0

PSORTb is designed to emphasize precision (or specificity) over recall (or sensitivity), and as a result, some classes of proteins are not predicted well. The following issues must be considered when performing an analysis using the current version of PSORTb:

6.1 Proteins resident at multiple localization sites: Many proteins can exist at multiple localization sites. Examples of such proteins include integral membrane proteins with large periplasmic domains, or autotransporters, which contain an outer membrane pore domain and a cleaved extracellular domain. The current version of PSORTb handles this situation by flagging proteins which show a distribution of localization scores favouring two sites, rather than one. It is important to examine the distribution of localization scores carefully in order to determine if your submitted protein may have multiple localization sites and if so, which two sites are involved.

6.2 Lipoproteins: The current version of PSORTb does not detect lipoprotein motifs.

6.3 Precision vs. Recall: PSORTb has been designed to yield as high a precision level as possible, at the expense of recall. Programs which make predictions at all costs often provide incorrect or incomplete results, which can be propagated through annotated databases, datasets and reports in the literature. We believe that a confident prediction is more valuable than any prediction, and we have designed the program to this end. Note, however, that a user may choose to use their own reduced cutoff score in generating final predictions.
