For documentation describing the current version of PSORT-B, please
refer to PSORT-B v.2.0
Documentation.
A plain text version of the documentation below
is available here.
This
section contains documentation and references related to PSORT-B
v.1.1, described in the manuscript: Jennifer L. Gardy, Cory
Spencer, Ke Wang, Martin Ester, Gabor E. Tusnady, Istvan Simon,
Sujun Hua, Katalin deFays, Christophe Lambert, Kenta Nakai and
Fiona S.L. Brinkman (2003). PSORT-B: improving protein subcellular
localization prediction for Gram-negative bacteria, Nucleic
Acids Research 31(13):3613-17.
1. History
Computational
prediction of the subcellular localization of proteins is a
valuable tool for genome analysis and annotation, since a protein's
subcellular localization can provide clues regarding its function
in an organism. For bacterial pathogens, the prediction of proteins
on the cell surface is of particular interest due to the potential
of such proteins to be primary drug or vaccine targets. A protein's
subcellular localization is influenced by several features present
within the protein's primary structure, such as the presence
of a signal peptide or membrane-spanning alpha-helices.
Several algorithms have been developed to analyze single features
such as these; however, the PSORT family of programs analyzes
several features at once, using information obtained from each
analysis to generate an overall prediction of localization site.
Developed by Kenta Nakai in 1991, PSORT is an algorithm which
assigns a probable localization site to a protein given an amino
acid sequence alone. Originally developed for prediction of
protein localization in Gram-negative bacteria, PSORT was expanded
into a suite of programs (PSORT, PSORT II, iPSORT) capable of
handling proteins from all classes of organisms.
The
Brinkman Laboratory headed development of PSORT-B, an updated
version of the PSORT algorithm designed for Gram-negative bacterial
proteins. PSORT-B includes new analytical modules designed to
capitalize on new discoveries and observations in protein sorting,
and benefits from a training dataset of over 1400 proteins of
known localization. Its focus is on precision over recall, to
facilitate accurate predictions at the expense of making fewer
predictions than other methods may make. PSORT-B v.1.1
was released in July, 2003 and has now been succeeded by PSORT-B
v.2.0.
2. PSORT-B v.1.1 vs PSORT I
The
original version of PSORT, still frequently used for prediction
of prokaryotic localization sites, used a number of analyses
arranged in an if/then rule-based format to determine which
of four localization sites a protein might be resident at -
cytoplasm, periplasm, inner or outer membrane (see the documentation
available at the PSORT
WWW server for a full explanation).
PSORT-B
v.1.1, however:
- uses
updated versions of several of these analyses, as well as
several novel analytical methods
-
utilizes a probabilistic system for determination of a final
prediction, rather than a rule-based system
-
is capable of predicting five localization sites, rather
than four (PSORT-B also recognizes extracellular proteins)
- does not force
a prediction, returning a prediction of "Unknown"
if no prediction is made
PSORT-B v.1.1's precision has been measured
at 97%, compared to PSORT I's 69%, and PSORT-B's web interface
is able to handle the submission of multiple sequences.
3. PSORT-B v.1.1: Analytical Modules
PSORT-B v.1.1 consists of six analytical modules,
each of which analyzes one biological feature known to influence
or be characteristic of subcellular localization. The modules
may act as a binary predictor, classifying a protein as either
belonging or not belonging to a particular localization site,
or they may be multi-category, able to assign a protein to one
of several localization sites. All modules are capable of returning
a negative prediction as well, such that a protein will not
be forced into one of the localization sites.
3.1 SCL-BLAST, or SubCellular
Localization BLAST, is a BLAST-P search against the current
local database of proteins of known subcellular localization.
An E-value cutoff of 10e-10 is used to ensure that returned
HSPs represent true homologs, and an additional length restriction
is placed on any subject matches - the length of the query:subject
HSP must be within 80-120% of the length of the subject protein,
thus reducing potential errors associated with the domain
nature of proteins. Note that we are interested in examining
other E-value cutoffs for this analysis. SCL-BLAST selects
the top-scoring HSP from the list of results, and returns
that protein's localization site as its prediction, along
with the name of the top-scoring HSP and the associated E-value.
SCL-BLAST is capable of assigning a protein to any one of
the five localization sites.
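
The two SCL-BLAST filters described above can be sketched as follows. This is a minimal illustration, not the PSORT-B source: the `Hsp` record, its field names, and the choice of bit score as the ranking key are assumptions, and the cutoff is entered exactly as written in the text (`10e-10`, which Python reads as 1e-9).

```python
from dataclasses import dataclass
from typing import List, Optional

# Cutoff as written in the text; whether 1e-9 or 1e-10 was intended is not stated.
E_VALUE_CUTOFF = 10e-10


@dataclass
class Hsp:
    """Hypothetical record for one BLAST HSP; field names are illustrative."""
    subject_id: str
    localization: str
    e_value: float
    bit_score: float
    hsp_length: int       # length of the query:subject HSP
    subject_length: int   # full length of the subject protein


def passes_filters(hsp: Hsp) -> bool:
    """Apply the E-value cutoff and the 80-120% length restriction."""
    ratio = hsp.hsp_length / hsp.subject_length
    return hsp.e_value <= E_VALUE_CUTOFF and 0.8 <= ratio <= 1.2


def scl_blast_prediction(hsps: List[Hsp]) -> Optional[Hsp]:
    """Return the top-scoring HSP that passes both filters, or None."""
    kept = [h for h in hsps if passes_filters(h)]
    return max(kept, key=lambda h: h.bit_score) if kept else None


hits = [
    Hsp("A", "Periplasmic", 1e-30, 200.0, 250, 266),     # passes both filters
    Hsp("B", "Cytoplasmic", 1e-50, 300.0, 100, 400),     # HSP covers only 25%
    Hsp("C", "Outer Membrane", 1e-5, 50.0, 95, 100),     # E-value too high
]
best = scl_blast_prediction(hits)
print(best.localization if best else "Unknown")
```

Note that hit B has the best score but is rejected because its HSP covers only a single domain-sized fraction of the subject, which is the kind of error the length restriction is meant to reduce.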
3.2 Motif Analysis relies
on the observation that a protein's function is closely linked
to its localization, and that several PROSITE motifs characteristic
of specific functions can be used to infer specific localizations.
Several potentially important motifs were used to scan a dataset
of 853 proteins, and motifs with a false-positive rate of
0% were built into PSORT-B. Note that the 0% false-positive
rate is based on the dataset of proteins of known localization
that we currently have access to. We wish to emphasize that
this does not necessarily mean that such motifs will always
be 100% accurate. If you identify an incorrect prediction,
please contact us. A submitted
protein is scanned for the occurrence of any of these motifs,
and, if found, the localization site associated with the motif
is returned as the program's prediction. Motifs
associated with each of the five localization sites are
included in PSORT-B.
3.3 Outer Membrane Motif Analysis
uses motifs generated from data mining techniques
applied to a set of 425 beta-barrel proteins to classify a
query protein as outer membrane or non-outer membrane (She
et al., 2003). The A Priori algorithm was used to mine for
short motifs found more often in outer membrane proteins than
in proteins at the other four localization sites. Over 250
such motifs were identified, and a query protein is scanned
for the co-occurrence of two or more of these motifs. A prediction
of outer membrane is returned if successful.
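
The co-occurrence test can be illustrated with a short sketch. The motifs below are hypothetical placeholders (the mined motif set is not reproduced in this documentation), motifs are treated as plain substrings, and "two or more" is read as two or more distinct motifs; all three points are assumptions.

```python
from typing import Iterable, List

# Hypothetical placeholder motifs, for illustration only.
EXAMPLE_MOTIFS = ("GYN", "WTF", "YRF")


def matched_motifs(sequence: str, motifs: Iterable[str]) -> List[str]:
    """Return the distinct motifs that occur at least once in the sequence."""
    seq = sequence.upper()
    return [m for m in motifs if m in seq]


def is_outer_membrane(
    sequence: str,
    motifs: Iterable[str] = EXAMPLE_MOTIFS,
    min_hits: int = 2,
) -> bool:
    """Predict outer membrane when min_hits or more distinct motifs co-occur."""
    return len(matched_motifs(sequence, motifs)) >= min_hits


print(is_outer_membrane("AAGYNAAWTFAA"))  # two distinct motifs present
print(is_outer_membrane("AAGYNAAGYNAA"))  # one motif, repeated
```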
3.4 HMMTOP (Tusnady, 1998), a hidden Markov model-based method,
is used to identify transmembrane alpha helices (TMHs), which
can then be used to identify proteins spanning the inner,
or cytoplasmic, membrane. Our analyses have shown that when
three or more TMHs are predicted in a protein, there is a
94% chance of that protein being an inner membrane protein.
PSORT-B therefore returns a prediction of inner membrane
if 3 or more TMHs are found.
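
The decision rule applied to the HMMTOP result amounts to a single threshold, sketched below. The function name and the use of `None` for "no prediction" are illustrative assumptions; the helix count itself would come from running HMMTOP, which is not shown.

```python
from typing import Optional

MIN_HELICES = 3  # threshold stated in the text


def inner_membrane_rule(helix_count: int) -> Optional[str]:
    """Return "Inner Membrane" for 3 or more predicted TMHs, else None."""
    return "Inner Membrane" if helix_count >= MIN_HELICES else None


for count in (1, 2, 3, 12):
    print(count, inner_membrane_rule(count) or "Unknown")
```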
3.5 SubLocC relies on the
observation that proteins resident in different environments
within a cell tend to have different overall amino acid composition.
PSORT-B uses SubLoc (Hua,
2001), a support vector machine-based approach, to differentiate
cytoplasmic from non-cytoplasmic proteins.
3.6 A Signal Peptide
directs a protein for export past the cytoplasmic membrane,
and thus can be further used to differentiate cytoplasmic
and non-cytoplasmic proteins. A hidden Markov model was trained
on the dataset used to train the SignalP program, and is used
to predict potential signal peptide cleavage sites. If a cleavage
site with a high probability value is not found, the first
70 amino acids of the protein are passed to a support vector
machine module trained on the same data. If the SVM is unable
to recognize a signal peptide, the protein is predicted not
to have one and is classified as cytoplasmic. However, a protein
may possess a non-traditional signal peptide, so the results
of this analysis carry less weight than do other modules when
generating a final prediction.
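
The two-stage cascade can be sketched as below. The trained HMM and SVM are not available here, so they are passed in as callables; those callables, the function name, and the 0.9 probability threshold are all assumptions (the text says only that the probability must be "high"). The sketch reports whether a signal peptide was detected and leaves the mapping to a localization site to the caller.

```python
from typing import Callable

N_TERMINAL_WINDOW = 70  # residues passed to the SVM, per the text


def detect_signal_peptide(
    sequence: str,
    hmm_cleavage_probability: Callable[[str], float],
    svm_has_signal_peptide: Callable[[str], bool],
    hmm_threshold: float = 0.9,  # assumed value, not taken from PSORT-B
) -> bool:
    """Run the HMM first; fall back to the SVM on the first 70 residues."""
    if hmm_cleavage_probability(sequence) >= hmm_threshold:
        return True
    return svm_has_signal_peptide(sequence[:N_TERMINAL_WINDOW])


# Stand-in models for demonstration only.
weak_hmm = lambda seq: 0.1
never_svm = lambda window: False
print(detect_signal_peptide("M" * 100, weak_hmm, never_svm))
```

Keeping the two models as parameters makes the control flow testable without the trained classifiers.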
4. PSORT-B v.1.1: Final Prediction
In
order to generate a final prediction, the results of each module
are combined and assessed. A probabilistic method and 5-fold
cross validation were used to assess the likelihood of a protein
being at a specific localization given the prediction of a certain
module. These likelihoods are used to generate a probability
value for each of the five localization sites for a user's query
protein.
PSORT-B v.1.1 returns a list of the five localization sites
and the associated probability value for each, ranked in descending
order. We consider 7.5 to be a good cutoff
above which localization can be assigned, and our precision
and recall values for PSORT-B v.1.1 are calculated using this
cutoff.
The user must carefully inspect the results list and the probability
values. Two localization sites may often have similar values,
which could indicate a protein with distinct domains present
in two localization sites, the inner membrane and the periplasm,
for example. Additionally, if no prediction can be made, an
even distribution of scores across the five sites will be visible.
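
As an illustration of how the cutoff is applied, the sketch below ranks a set of localization scores and returns a final prediction only when the top score reaches 7.5. The function names are hypothetical, and the comparison is taken as "7.5 or greater". The scores are those of the example report shown later in this documentation; because those scores sum to 10, the "even distribution" case is modelled here as 2.0 per site, which is an inference rather than a documented value.

```python
from typing import Dict, List, Optional, Tuple

CUTOFF = 7.5


def ranked(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (site, score) pairs in descending order of score."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def final_prediction(
    scores: Dict[str, float], cutoff: float = CUTOFF
) -> Tuple[str, Optional[float]]:
    """Return the top site and score, or ("Unknown", None) below the cutoff."""
    site, score = ranked(scores)[0]
    return (site, score) if score >= cutoff else ("Unknown", None)


example = {
    "Periplasmic": 8.950617,
    "Inner Membrane": 0.370370,
    "Outer Membrane": 0.308642,
    "Cytoplasmic": 0.246914,
    "Extracellular": 0.123457,
}
even = {site: 2.0 for site in example}

print(final_prediction(example))
print(final_prediction(even))
```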
This section of the documentation will be updated
as changes are made to the web interface. Please check back
often for up-to-date instructions on program use.
5.1 Accessing PSORT-B
5.1.1
WWW Access: PSORT-B is available online at http://www.psort.org. The sequence submission form itself
is located at http://www.psort.org/psortb.
5.2 Submitting a Sequence for Analysis
5.2.1
Sequence Submission: The sequence submission form
can be found at http://www.psort.org/psortb/.
Paste your protein sequence(s) from a Gram-negative bacterium
into the box, or browse for a local FASTA format file on your
computer containing the sequences. Select the desired output
format from the dropdown menu. Output formats are described
in section 5.3.1. Press the Submit button to begin the analysis.
5.2.2
Acceptable Organisms: PSORT-B presently only accepts
protein sequences from Gram-negative bacteria. All protein
sequences from Gram-positive and eukaryotic organisms must
be analyzed using one of the other PSORT programs, available
at the PSORT
WWW Server.
5.2.3
Acceptable Formats: PSORT-B requires that a PROTEIN
sequence be submitted in FASTA format. A FASTA format file
contains a definition line, preceded by a ">"
character and containing any information the user wishes to
identify their sequence with. The definition line ends with
a newline character, and is followed by the sequence information
itself. A newline indicates the end of the sequence information.
An example of FASTA format is shown below:
>gi|31562958|sp|Q8CWD2|BTUF_ECOL6
MAKSLFRALVALSFLAPLWLNAAPRVITLSPANTELAFAAGITPVGVSSYSDYPLQAQKIEQVSTWQGMN
LERIVALKPDLVIAWRGGNAERQVDQLASLGIKVMWVDATSIEQIANALRQLAPWSPQPDKAEQAAQSLL
DQYAQLKAQYADKPKKRVFLQFGINPPFTSGKESIQNQVLEVCGGENIFKDSRVPWPQVSREQVLARSPQ
AIVITGGPDQIPKIKQYWGEQLKIPVIPLTSDWFERASPRIILAAQQLCNALSQVD
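
For users who want to check a file before submitting it, a FASTA file of this shape can be read with a few lines of Python. This is a minimal sketch, not part of PSORT-B; the function name `parse_fasta` is hypothetical, and the parser simply joins the sequence lines that follow each definition line.

```python
from typing import Iterator, List, Optional, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (definition line without '>', sequence) pairs from FASTA text."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


sample = ">seq1 first protein\nMAKSLF\nRALVAL\n>seq2\nMKK\n"
records = list(parse_fasta(sample))
for seq_id, sequence in records:
    print(seq_id, len(sequence))
```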
5.2.4
Number of Sequences Allowed: PSORT-B can handle both
single and multiple sequences in the web-based submission
form. For multiple sequence submission, submit a FASTA file
containing all of the protein sequences. A maximum of 600,000
characters is allowed in the submit textbox.
5.2.5
Alternate Prediction Options: If your sequence does
not meet the organism criteria, please use one of the other
PSORT programs to analyze it. These programs are available
at the PSORT
WWW Server, and links can also be found at the www.psort.org
index page. Other resources for subcellular localization prediction
can be found on the Resources
page.
5.2.6
Whole Genome Analysis: In order to reduce the load
on the PSORT-B servers, precalculated results for whole bacterial
genomes will be made available on the PSORT-B site, on the
Genomes page.
5.3 Understanding the Output
5.3.1
Output Formats: PSORT-B allows the user to select
one of three output formats from the sequence submission screen:
Normal, Tab-delimited (terse format) and Tab-delimited (long
format). Normal output is recommended for analysis of one
or a few sequences, whereas tab-delimited output in either
format is recommended for the analysis of a large number of
sequences. The output formats are described below.
5.3.2
Normal Output: The Normal output option displays
the results of each of PSORT-B's analytical modules, the localization
scores for each of the 5 sites, as well as a final prediction
and associated score (if one site scores above the 7.5 cutoff).
The output appears in the format:
| SeqID: gi|31562958|sp|Q8CWD2|BTUF_ECOL6 |
| Analysis Report: |
| HMMTOP | Unknown | [1 internal helix found] |
| Motif | Unknown | [No motifs found] |
| OMPMotif | Unknown | [No motifs found] |
| SCL-BLAST | Periplasmic | [Matched P37028: Periplasmic protein] |
| Signal | Unknown | [No signal peptide detected] |
| SubLocC | Unknown | [No details] |
| Localization Scores: |
| Periplasmic | 8.950617 |
| Inner Membrane | 0.370370 |
| Outer Membrane | 0.308642 |
| Cytoplasmic | 0.246914 |
| Extracellular | 0.123457 |
| Final Prediction: |
| Periplasmic | 8.950617 |
SeqID
returns whatever was found on the definition line of the FASTA
format input file. The Analysis Report contains the results
of each of PSORT-B's analytical modules. The module name is
listed in the left-most column, the centre column contains
the localization site predicted by that module (or "Unknown"
if the module did not generate a prediction), and the right-most
column contains comments related to the module's findings.
In the Localization Scores area, the confidence value for
each of the 5 localization sites is given, in descending order.
If one of the sites has a score of 7.5 or greater, this site
and its score are returned in the Final Prediction section.
5.3.3
Tab-delimited (Terse Format) Output: Tab-delimited
terse format output returns a list of inputted sequences,
each one on a new line, with 3 columns: SeqId contains the
information from the FASTA file definition line, Localization
contains the final prediction of localization site (or "Unknown"
if no site scored above 7.5), and Score contains the confidence
value associated with this localization site. Tab characters
occur between the columns, and, in the case of a multiple
sequence submission, each sequence record is separated by
newline characters. This format can be easily read into a
spreadsheet, using a program such as MS Excel.
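
The terse format can also be read programmatically. The sketch below uses Python's standard `csv` module and assumes three tab-separated columns per line, no header row, and a possibly empty or missing Score field for "Unknown" records; none of these details beyond the column order is confirmed by this documentation, and the names `TerseRecord` and `parse_terse` are hypothetical.

```python
import csv
import io
from typing import List, NamedTuple, Optional


class TerseRecord(NamedTuple):
    seq_id: str
    localization: str
    score: Optional[float]


def _to_float(value: str) -> Optional[float]:
    """Convert a score field to float, or None if it is empty or not numeric."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_terse(text: str) -> List[TerseRecord]:
    """Parse tab-delimited terse output into one record per sequence."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    records = []
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        score = _to_float(row[2]) if len(row) > 2 else None
        records.append(TerseRecord(row[0], row[1], score))
    return records


output = "gi|1|example_a\tPeriplasmic\t8.950617\ngi|2|example_b\tUnknown\t\n"
parsed = parse_terse(output)
for record in parsed:
    print(record.seq_id, record.localization, record.score)
```

`QUOTE_NONE` is used so that any quotation marks in a definition line are kept as literal characters.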
5.3.4
Tab-delimited (Long Format) Output: Tab-delimited
long format output returns a list of inputted sequences, each
one on a new line, and with all of the information from the
PSORT-B results placed into columns. The SeqId, module results
and comments from the analysis report, localizations and scores,
and the final prediction and score are each placed into their
own column.
PSORT-B is designed to emphasize precision
(or specificity) over recall (or sensitivity), and as a result,
some classes of proteins are not predicted well. The following
issues must be considered when performing an analysis using
the current version of PSORT-B:
6.1
Integral membrane proteins with 1-2 transmembrane helices:
A large number of integral membrane proteins contain
fewer than 3 helices. PSORT-B's HMMTOP module, however, will
only identify proteins with 3 or more TMHs as inner membrane-localized
in order to reduce the number of false positive inner membrane
predictions. Thus, a protein with 1-2 helices and no other
localization information may not yield a confident prediction.
Examination of the HMMTOP results in the PSORT-B analysis
report will help to identify such proteins.
6.2
Proteins resident at multiple localization sites:
Many proteins can exist at multiple localization sites. Examples
of such proteins include integral membrane proteins with large
periplasmic domains, or autotransporters, which contain an
outer membrane pore domain and a cleaved extracellular domain.
The current version of PSORT-B handles this situation by providing
a list of localization scores for each of the 5 sites - proteins
with multiple localizations will typically show a distribution
of localization scores favouring two sites, rather than one.
It is important to examine the distribution of localization
scores carefully in order to determine if your submitted protein
may have multiple localization sites. In later versions of
the program, as more proteins with multiple localization sites
are incorporated into PSORT-B, other methods of returning multiple
localization predictions will be explored.
6.3
Lipoproteins: The current version of PSORT-B does
not detect lipoprotein motifs. Lipoproteins in the SCL-BLAST
database, however, are annotated as such.
6.4
Precision vs. Recall: PSORT-B version 1.1 has been
designed to yield as high a precision level as possible, at
the expense of recall. Programs which make predictions at
all costs often provide incorrect or incomplete results, which
can be propagated through annotated databases, datasets and
reports in the literature. We believe that a confident prediction
is more valuable than any prediction, and we have designed
the program to this end. Note, however, that a user may choose
to use their own reduced cutoff score in generating final
predictions.
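
The trade-off a user accepts when lowering the cutoff can be examined on any set of proteins of known localization. The sketch below assumes precision is the fraction of predictions made that are correct and recall is the fraction of all proteins that receive a correct prediction; the function name, the site labels, and the three-protein dataset are made up for illustration.

```python
from typing import Dict, List, Tuple

Scored = Tuple[Dict[str, float], str]  # (localization scores, true site)


def precision_recall(results: List[Scored], cutoff: float) -> Tuple[float, float]:
    """Compute (precision, recall) when only scores >= cutoff count as predictions."""
    predicted = 0
    correct = 0
    for scores, true_site in results:
        site, score = max(scores.items(), key=lambda kv: kv[1])
        if score >= cutoff:
            predicted += 1
            if site == true_site:
                correct += 1
    precision = correct / predicted if predicted else 0.0
    recall = correct / len(results) if results else 0.0
    return precision, recall


# Made-up scores for three proteins whose true site is "Periplasmic".
data = [
    ({"Periplasmic": 9.0, "Cytoplasmic": 1.0}, "Periplasmic"),
    ({"Periplasmic": 6.0, "Cytoplasmic": 4.0}, "Periplasmic"),
    ({"Periplasmic": 2.0, "Cytoplasmic": 8.0}, "Periplasmic"),
]
for cutoff in (7.5, 5.0):
    print(cutoff, precision_recall(data, cutoff))
```

In this toy dataset, lowering the cutoff from 7.5 to 5.0 admits one additional prediction; on real data a lower cutoff will generally raise recall and may lower precision.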