PSORTm Docker Documentation

This section contains documentation and references related to PSORTm. The PSORTm manuscript has been published online and is pending proofs. Note that the initial online version of the paper was misformatted before the proof stage. We will add an update here once the final version of the paper is published.

Please use the citation below if you use this resource:

Peabody, M.A., Lau, W.Y.V., Hoad, G., Jia, B., Maguire, F., Gray, K.L., Beiko, R.G. & Brinkman, F.S.L. (2020). PSORTm: a bacterial and archaeal protein subcellular localization prediction tool for metagenomics data. Bioinformatics. doi: 10.1093/bioinformatics/btaa136

1. History

Computational prediction of the subcellular localization of proteins is a valuable tool for genome analysis and annotation, since a protein's subcellular localization can provide clues regarding its function in an organism. For bacterial pathogens, the prediction of proteins on the cell surface is of particular interest due to the potential of such proteins to be primary drug or vaccine targets. A protein's subcellular localization is influenced by several features present within the protein's primary structure, such as the presence of a signal peptide or membrane-spanning alpha-helices.

Several algorithms have been developed to analyze single features such as these; however, the PSORT family of programs analyzes several features at once, using information obtained from each analysis to generate an overall prediction of localization site. Developed by Kenta Nakai in 1991, PSORT is an algorithm that assigns a probable localization site to a protein given an amino acid sequence alone. Originally developed for prediction of protein localization in Gram-negative bacteria, PSORT was expanded into a suite of programs (PSORT, PSORT II, iPSORT) capable of handling proteins from all classes of organisms.

The Brinkman Laboratory headed development of PSORTb, an updated version of the PSORT algorithm with significantly higher accuracy. PSORTb includes new analytical modules designed to capitalize on new discoveries and observations in protein sorting, and benefits from a training dataset of over 11600 proteins of known localization. Its focus is on precision over recall to facilitate accurate predictions, at the expense of not making as many predictions as other methods. PSORT-B v.1.1 was released in July, 2003, with an updated version of PSORTb v.2.0 released in 2004, and has now been succeeded by PSORTb v.3.0.

2. Introduction to PSORTm

PSORTm was developed to support the analysis of metagenomics data. The user enters two files: 1) a file of predicted protein sequences (translated sequence reads), and 2) a file of taxonomic assignments (csv format). PSORTm will then sort the sequences into appropriate organism/cell envelope categories, and run each of these files of categorized sequences through the appropriate set of modules. Alternatively, users may choose to run PSORTm on a certain subset of sequences of the same organism/cell envelope category. For example, if researchers are interested in the subcellular localization of proteins of a certain taxon (e.g. Pseudomonas), they could take the reads assigned to Pseudomonas, and run these all by specifying the organism/cell envelope type (in this case by specifying Gram-negative). When all sequences are of a single category, only the file of protein sequences is required as input. PSORTm is based on the code of PSORTb v.3.0.2, with modifications (the signal peptide module is removed, and there is no length restriction in the SCL-BLAST module). We provide results in short or long tab-delimited formats, which optimally display results from large numbers of input protein sequences.

3. PSORTm: Analytical Pipeline

3.1. Cell Type Classification of Metagenomic Reads

A taxonomic-based classification tool was also incorporated into PSORTm to sort input sequences according to type of organism and cell envelope: Archaea, Gram-negative, Gram-positive, Gram-negative without an outer membrane, and Gram-positive with an outer membrane. This tool requires the user to provide an input file of reads along with their associated taxonomic classification and will provide output files of the reads sorted into the 5 aforementioned organism/cell envelope categories, as well as an additional file of reads that could not be categorized. The categorization scheme is derived from that used in the protein SCL database, PSORTdb (Peabody et al., 2016; Yu et al., 2011; Rey et al., 2005) which provides pre-computed PSORTb analyses of microbial genomes. This classification tool uses specific marker sequences (such as Omp85, an essential outer membrane protein diagnostic of classical Gram-negative bacteria) to categorize newly sequenced bacterial and archaeal genomes into organism/cell envelope categories so that the appropriate input option could be chosen. Using this tool, we could globally assign taxa to particular cell envelope categories. Then, when a given metagenomics read is evaluated in PSORTm, its taxonomic prediction is used to analyze the sequence using the appropriate modules for its cell envelope structure.
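As a rough illustration of the sorting step described above, the sketch below groups read IDs into organism/cell envelope bins, with an extra bin for reads that cannot be categorized. The `TAXON_TO_CATEGORY` table, its entries, and the category labels are assumptions made for this example only; PSORTm derives the real assignments from the PSORTdb categorization scheme.

```python
from collections import defaultdict

# Hypothetical lookup table for illustration; PSORTm's real assignments
# come from the PSORTdb-derived categorization scheme.
TAXON_TO_CATEGORY = {
    "Pseudomonas": "gram_negative",
    "Bacillus": "gram_positive",
    "Mycoplasma": "gram_negative_no_om",
    "Deinococcus": "gram_positive_with_om",
    "Halobacterium": "archaea",
}


def sort_reads(read_taxa):
    """Group read IDs by organism/cell envelope category.

    read_taxa maps read ID -> taxonomic name. Reads whose taxon is not in
    the lookup table are placed in the "undetermined" bin.
    """
    bins = defaultdict(list)
    for read_id, taxon in read_taxa.items():
        bins[TAXON_TO_CATEGORY.get(taxon, "undetermined")].append(read_id)
    return dict(bins)


example = {"r1.1": "Pseudomonas", "r1.2": "Bacillus", "r1.3": "UnknownTaxon"}
print(sort_reads(example))
```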

3.2. Subcellular Localization Prediction

PSORTm consists of multiple analytical modules for SCL prediction, each of which analyzes one biological feature known to influence or be characteristic of subcellular localization. The modules may act as a binary predictor, classifying a protein as either belonging or not belonging to a particular localization site, or they may be multi-category, able to assign a protein to one of several localization sites. When analyzing a Gram-negative organism (organism with two cell membranes), possible localization sites are: cytoplasm, cytoplasmic membrane, periplasm, outer membrane and extracellular space. Gram-positive and archaeal localization sites include: cytoplasm, cytoplasmic membrane, cell wall and extracellular space. The new version also offers prediction options for specialized organisms, such as those that stain Gram-negative but have no outer membrane or cell wall (e.g. Mycoplasma spp.), as well as organisms that stain Gram-positive but have an outer membrane (e.g. Deinococcus radiodurans). All modules are capable of returning a negative prediction as well, such that a protein will not be forced into one of the localization sites.

3.2.1 SCL-BLAST & SCL-BLASTe: SCL-BLAST, or SubCellular Localization BLAST, is a BLASTP search against the current local database of proteins of known subcellular localization. An E-value cutoff of 10e-9 is used to ensure that returned HSPs represent true homologs. SCL-BLAST selects the top-scoring HSP from the list of results, and returns that protein's localization site as its prediction, along with the name of the top-scoring HSP and the associated E-value. SCL-BLAST is capable of assigning a protein to any one of the possible localization sites. SCL-BLASTe is a specialized implementation of this analysis, in which a user's query protein is checked to see if it is an exact match to a protein in the SCL-BLAST database. If an exact match is found (100% similarity and within 1aa length), the protein is immediately predicted as residing at that localization site, and is not passed to subsequent modules.
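The selection logic described above can be sketched roughly as follows. This is not PSORTm's own code: the hit dictionaries and their keys (`subject`, `site`, `bitscore`, `evalue`) are hypothetical, the cutoff is taken literally as written in the text, and treating a hit exactly at the cutoff as passing is an assumption.

```python
EVALUE_CUTOFF = 10e-9  # cutoff taken literally as written in the text


def scl_blast_prediction(hits, cutoff=EVALUE_CUTOFF):
    """Return (site, subject, evalue) for the top-scoring hit, or None.

    hits is a list of dicts with the hypothetical keys
    "subject", "site", "bitscore" and "evalue".
    """
    passing = [h for h in hits if h["evalue"] <= cutoff]
    if not passing:
        return None  # no confident homolog, so no prediction is made
    top = max(passing, key=lambda h: h["bitscore"])
    return top["site"], top["subject"], top["evalue"]


def is_exact_match(query_len, subject_len, percent_similarity):
    """SCL-BLASTe criterion: 100% similarity and lengths within 1 aa."""
    return percent_similarity == 100.0 and abs(query_len - subject_len) <= 1


hits = [
    {"subject": "A", "site": "Cytoplasmic", "bitscore": 50.0, "evalue": 1e-12},
    {"subject": "B", "site": "Periplasmic", "bitscore": 80.0, "evalue": 1e-20},
    {"subject": "C", "site": "Extracellular", "bitscore": 200.0, "evalue": 1e-3},
]
print(scl_blast_prediction(hits))
```

Hit C has the highest bit score but fails the E-value filter, so it is never considered when the top-scoring hit is chosen.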

3.2.2 Support Vector Machines (SVMs) are machine learning-based classifiers trained to classify a protein as belonging or not belonging to the set of proteins at a specific localization site. PSORTm contains 13 SVMs, one for each of the localization sites (5 Gram-negative, 4 Gram-positive and 4 archaeal). Trained using frequent sequences mined from proteins resident at a specific localization site, each SVM will examine a query protein and determine whether it does or does not belong at the localization site in question. If the SVM classifies the protein as belonging to that particular site, that result is returned. Otherwise, an unknown prediction is returned.

3.2.3 Motif & Profile Analysis relies on the observation that a protein's function is closely linked to its localization, and that several PROSITE motifs characteristic of specific functions can be used to infer specific localizations. Several potentially important motifs were used to scan our current dataset, and motifs with a false-positive rate of 0% were built into PSORTm, as were expanded versions of the motifs termed "profiles". Note that the 0% false-positive rate is based on the dataset of proteins of known localization that we currently have access to. We wish to emphasize that this does not necessarily mean that such motifs and profiles will always be 100% accurate. If you identify an incorrect prediction, please contact us. A submitted protein is scanned for the occurrence of any of these motifs or profiles, and, if found, the localization site associated with the motif/profile is returned as the program's prediction. Motifs associated with each of the possible localization sites are included in PSORTm.

3.2.4 Outer Membrane Motif Analysis uses motifs generated from data mining techniques applied to a set of 425 beta-barrel proteins to classify a query protein as outer membrane or non-outer membrane (She et al., 2003). The Apriori algorithm was used to mine for short motifs found more often in outer membrane proteins than in proteins at the other four localization sites. Over 250 such motifs were identified, and a query protein is scanned for the co-occurrence of two or more of these motifs. A prediction of outer membrane is returned if two or more are found.
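The co-occurrence rule above can be illustrated with a small sketch. The three patterns in `EXAMPLE_MOTIFS` are invented placeholders written as regular expressions (the actual mined motifs are not listed in this document), and the function name and return labels are likewise assumptions.

```python
import re

# Invented placeholder patterns; the real mined motifs are not listed here.
EXAMPLE_MOTIFS = {1: r"G.Y", 2: r"F..W", 3: r"NN.Q"}


def omp_motif_prediction(sequence, motifs=EXAMPLE_MOTIFS, min_hits=2):
    """Return (prediction, matched motif IDs) for a protein sequence.

    "OuterMembrane" is returned only when at least min_hits distinct
    motifs occur in the sequence; otherwise "Unknown".
    """
    found = sorted(mid for mid, pat in motifs.items() if re.search(pat, sequence))
    site = "OuterMembrane" if len(found) >= min_hits else "Unknown"
    return site, found


print(omp_motif_prediction("AAGTYAAFKLWAA"))
print(omp_motif_prediction("AAGTYAA"))
```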

3.2.5 ModHMM was derived from PRODIV-HMM (Viklund and Elofsson, 2004), a hidden Markov model-based method that identifies transmembrane alpha helices, which in turn identifies proteins spanning the cytoplasmic membrane. Our analyses have shown that when three or more transmembrane helices (TMHs) are predicted in a protein, there is a >95% chance of that protein being a cytoplasmic (inner) membrane protein. PSORTm uses a modified version of PRODIV-HMM to identify transmembrane helices and returns a prediction of cytoplasmic membrane if 3 or more are found.

4. PSORTm: Final Prediction

In order to generate a final prediction, the results of each module are combined and assessed. A probabilistic method is used to assess the likelihood of a protein being at a specific localization given the prediction of a certain module. These likelihoods are used to generate a probability value for each of the five localization sites for a user's query protein.

PSORTm returns a list of the five localization sites and the associated probability value for each. We consider 7.5 to be a good cutoff above which a single localization can be assigned, and our precision and recall values for the program are calculated using this cutoff.

In certain cases, two localization sites may both exhibit high scores, which may indicate a protein with domains present in neighbouring localization sites. In cases where a localization site has a score of at least 4.5 (for Gram-negative) or 5.0 (for Gram-positive) but no higher than 7.49, the result returned to the user will say "Unknown - This protein may have multiple localization sites". In cases like these, we recommend you examine the long format output of the program's prediction to draw your own conclusion.
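The thresholds described in this section can be summarized in a short sketch. It is an interpretation of the text rather than PSORTm's actual code: the function name, the `"negative"`/`"positive"` keys, and the decision to look only at the top-scoring site are assumptions, and no lower bound is stated here for archaeal runs, so none is included.

```python
SINGLE_SITE_CUTOFF = 7.5
# Lower bounds of the "multiple localization sites" band, as given in the text.
MULTI_SITE_FLOOR = {"negative": 4.5, "positive": 5.0}

MULTI_SITE_MESSAGE = "Unknown - This protein may have multiple localization sites"


def final_prediction(scores, gram):
    """Return (result, top score) from a site -> score mapping.

    gram is "negative" or "positive" (hypothetical labels for this sketch).
    """
    site, top = max(scores.items(), key=lambda kv: kv[1])
    if top >= SINGLE_SITE_CUTOFF:
        return site, top
    if top >= MULTI_SITE_FLOOR[gram]:
        return MULTI_SITE_MESSAGE, top
    return "Unknown", top


print(final_prediction({"Cytoplasmic": 9.97, "Periplasmic": 0.01}, "negative"))
print(final_prediction({"OuterMembrane": 4.9, "Extracellular": 4.9}, "negative"))
print(final_prediction({"CellWall": 4.9, "Extracellular": 4.9}, "positive"))
```

The last two calls use the same top score but different Gram types, which shows how the Gram-specific lower bound changes the outcome.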

For organisms with specialized structures, such as those that stain Gram-negative but have no cell walls, the predictor may predict a cell wall localization but the final result will say "Unknown - predicted localization does not exist". This may mean that the protein is a surface protein or that it is a false prediction. In cases like this, we recommend you examine the long format output of the program's prediction to draw your own conclusion.

The final prediction for PSORTm is generated in the same way as for PSORTb v.3.0, but without reporting on signal peptides (signal peptide motif searches are excluded from PSORTm).

5. Using PSORTm

5.1 Accessing PSORTm

PSORTm is available for download in two different Docker containers. Which of these you choose to install will depend on how you want to input your metagenomic sequences:

5.1.1 Web Service Access: This container allows you to access PSORTm using a web browser

  • Prebuilt container: https://hub.docker.com/r/brinkmanlab/psortm
  • Container code (for custom build): https://github.com/brinkmanlab/psortm-docker

5.1.2 Command-line Access: This Docker container allows you to run PSORTm directly from the Linux command line

  • Prebuilt container: https://hub.docker.com/r/brinkmanlab/psortm_commandline
  • Container code (for custom build): https://github.com/brinkmanlab/psortm_commandline_docker

    5.2 Submitting Metagenomic Sequences for PSORTm Analysis

    PSORTm can be run either with a mixture of different organism types or with a single organism type. Here are the guidelines for using the PSORTm input form:

    5.2.1 Sequence Submission: One or more sequences in FASTA format can be submitted to this form, provided they are all contained within one file (e.g. mysequences.txt). Please upload this file from your computer to PSORTm using the 'Browse' button in the 'FASTA Protein Sequences' section.

    5.2.2 Acceptable Sequence Formats: PSORTm, like PSORTb, requires that PROTEIN sequences are submitted in FASTA format.

    A sequence within a FASTA sequence file consists of three parts:

    • A title line, which must begin with a '>' symbol, and may be followed by any type of text
    • A newline character at the end of the title line
    • The sequence itself, which continues until the end of file or the next '>' is reached

    An example of protein FASTA format is shown below:

    >read|001
    MAKSLFRALVALSFLAPLWLNAAPRVITLSPANTELAFAAGITPVGVSSY
    >read|002
    LERIVALKPDLVIAWRGGNAERQVDQLASLGIKVMWVDATSIEQIANALR
    >read|003
    DQYAQLKAQYADKPKKRVFLQFGINPPFTSGKESIQNQVLEVCGGENIFK
    >read|004
    AIVITGGPDQIPKIKQYWGEQLKIPVIPLTSDWFERASPRIILAAQQLCN

    For more information, see the description at NCBI or contact us.
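The three-part FASTA structure described above can be read with a few lines of code. The following is a minimal sketch, not part of PSORTm; the function name `parse_fasta` is an assumption, and sequences wrapped over several lines are joined into one string.

```python
def parse_fasta(text):
    """Yield (title, sequence) pairs from FASTA-formatted text.

    The title is the text after '>' on the title line; the sequence is
    everything up to the next '>' or the end of the input.
    """
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


example = """>read|001
MAKSLFRALVALSFLAPLWLNAAPRVITLSPANTELAFAAGITPVGVSSY
>read|002
LERIVALKPDLVIAWRGGNAERQVDQLASLGIKVMWVDATSIEQIANALR
"""
records = list(parse_fasta(example))
print(len(records), records[0][0])
```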

    5.2.3 Acceptable Organisms: PSORTm accepts protein sequences from Gram-negative bacteria, Gram-positive bacteria, Gram-negative bacteria without an outer membrane, Gram-positive bacteria with an outer membrane and archaea.

    When entering a mixture of these organism types into PSORTm, it is possible that our predictor of Gram stain and organism domain will not be able to classify some proteins. These sequences will be added to the output file undetermined.fasta. If a user wishes to obtain subcellular localization predictions for these sequences, they may run them as a specific organism/cell wall type (e.g. Gram-negative). Another option is to run the sequences/reads associated with these predicted protein sequences on another taxonomic classification program, which may give assignments to a specific enough rank that the sequences can be sorted into the appropriate types.

    All protein sequences from eukaryotic organisms must be analyzed using a different tool. See the Resources page for possible options.

    5.2.4 Specifying organism type

    5.2.4.1 Entering sequences of a single organism type: To use this feature please select the 'Same taxonomic type' radio button in the 'Taxonomic Classification' section of the PSORTm submission form. This should be used when all the sequences in your input file are of the same organism type (e.g. all the sequences are Gram-positive or all the sequences are Gram-negative).

    Organism type: If the file of sequences contains only one type of organism, for example archaea, then please pick the 'Archaea' option from the 'Organism type' menu. The 'Bacteria' option is also available and, when selected, lets you choose the Gram stain.

    Gram stain: Please choose from the pull-down menu if your sequence file contains sequences from only Gram-positive or only from Gram-negative bacteria. There is also an 'Advanced' option which will allow you to choose further options from the 'Advanced Gram stain options' pull down menu.

    Advanced Gram stain options: If the bacterial sequences in your sequence file belong to organisms that do not follow the usual trend of Gram-positive and Gram-negative bacteria with respect to the outer membrane configuration, please choose either 'Negative without outer membrane' or 'Positive with outer membrane' from the pull down menu labelled 'Advanced Gram stain options'.

    5.2.4.2 Entering sequences of mixed organism types (environmental sample): PSORTm performs different analyses depending on the class of organism. To use this feature please select the 'Mixed taxonomic type' radio button in the 'Taxonomic Classification' section of the PSORTm submission form.

    To accompany your mixture of sequence types, you are required to upload a tab- or comma-delimited file of sequence IDs with their predicted taxonomic classifications (provided as taxonomic name or NCBI taxonomy ID) so the appropriate Gram-stain and organism domain (Bacteria or Archaea) can be chosen for each of your sequences. There are tools available which can create such a file, e.g. MEGAN6.

    Please use comma-delimited or tab-delimited format in this file and upload it using the 'Browse' button in the 'Taxonomic Classification' section of the submission form. In the 'Format used' pull down menu, please specify whether you have used taxonomic name or NCBI taxonomy ID.

    An example of using a sequence ID with taxonomic name:
    r1.1,Pseudomonas

    An example of using a sequence ID with NCBI Taxonomy ID:
    r1.1,286
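Before uploading, it can help to check that the classification file parses as expected. The sketch below reads either delimiter and returns a mapping from sequence ID to taxon. The function name `read_taxonomy` and the delimiter-guessing rule (use a tab if the first non-empty line contains one, otherwise a comma) are assumptions for this example, not PSORTm behaviour.

```python
import csv
import io


def read_taxonomy(handle):
    """Map sequence ID -> taxon (name or NCBI taxonomy ID, kept as text)."""
    text = handle.read()
    # Guess the delimiter from the first non-empty line.
    first = next((ln for ln in text.splitlines() if ln.strip()), "")
    delimiter = "\t" if "\t" in first else ","
    mapping = {}
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        mapping[row[0].strip()] = row[1].strip()
    return mapping


by_name = read_taxonomy(io.StringIO("r1.1,Pseudomonas\n"))
by_taxid = read_taxonomy(io.StringIO("r1.1\t286\n"))
print(by_name, by_taxid)
```

Taxonomy IDs are kept as strings so that names and IDs can be handled by the same code path.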

    5.3 Whole Genome Analysis: Please note that in order to reduce the load on the PSORTb servers, pre-calculated results for whole bacterial genomes are available in the PSORTdb database (cPSORTdb section of the database, versus the ePSORTdb section of experimentally determined proteins).

    5.4 Understanding the Output

    5.4.1 Output Formats: PSORTm allows the user to select one of two output formats from the sequence submission screen: Tab-delimited (terse format) and Tab-delimited (long format). The output formats are described below. If you would like to try the examples given below for yourself, input FASTA sequences are below:

    Gram-positive input sequence:

    >SAK_BPP42
    MLKRSLLFLTVLLLLFSFSSITNEVSASSSFDKGKYKKGDDASYFEPTGPYLMVNVTGVDGKRNELLSPR
    YVEFPIKPGTTLTKEKIEYYVEWALDATAYKEFRVVELDPSAKIEVTYYDKNKKKEETKSFPITEKGFVV
    PDLSEHIKNPGFNLITKVVIEKK

    Gram-negative input sequence:

    >NP_949347.1
    MQGHHFGGDMSNSEAIDNTTAKLRLAQSSSLLALALLIGSAPAQAADTDWGWLAIGAPAATAQGWTGKGV
    VIGVVDTGIDFSHPALSGRAFDYNYGSFVAGSNHPHATHVAGIIGATDINRGMEGVAPDVRFSSMKIFTG
    AGGSYLGDAAVADAYDGAIGSGVRIFNNSWGSSDSIANFTSREELLAHEPLLVGAFTRAVNADAVLVWST
    GNDGRSQPSWQAAAPYYIQELKANWIAVTSVGENGTIASYANACGVAKAWCLAAPGGDFNPGIYSTIPGK
    DYGYMSGTSMAAPYVTGATAIARQMFPKASGAQLAQIVLQTSRDIGAPGIDDVYGWGLLAVDNIVDTINP
    RGAALFASAAWGRFTTLSAIGNTVLDRISDLRNGRGDVVTAPLAFAGQNGAFSQSGSNPRNAYAADLAAA
    PQPSPLGFGSVWARGLAGRATLSGSASSPQTTADISGGLLGFDLVNNQNLLVGIAGGGSNTNLTASGISD
    KAGAQAWHVLGYAAAMYGPAFVNVAGGWNSFDQSYQRRVIPGTAGTVFASTISAAQSSSTDVAYFFQGRG
    GWTFQTEVGRIEPYVHGATRNQSFGGFSETNASIFSLSVPSASLSEAEYGAGVRWACAPIKTVDQRVAVA
    PTIDLAYVRFTNDGPIQVETNLLGTSVVGQTAALGADAIRVAAGLSLTSLAGISGSFGYTGTVRDAATAH
    TVSGGLSIKF

    Archaeal input sequence:

    >YP_001689002.1
    MFEFITDEDERGQVGIGTLIVFIAMVLVAAIAAGVLINTAGYLQSKGSATGEEASAQVSNRINIVSAYGN
    VNNEKVDYVNLTVRQAAGADNINLTKSTIQWIGPDRATTLTYSSNSPSSLGENFTTESIKGSSADVLVDQ
    SDRIKVIMYASGVSSNLGAGDEVQLTVTTQYGSKTTYWAQVPESLKDKNA

    Modules used to predict sub-cellular localization in Gram-positive samples:

    The output results report shows the results from each of PSORTm's analytical modules. The modules in the Gram-positive version are as follows:

    • CMSVM+: The Gram-positive version of the support vector machine trained to identify cytoplasmic membrane proteins. Returns cytoplasmic membrane or unknown.
    • CWSVM+: The support vector machine trained to identify cell wall proteins (Gram-positive and Archaea). Returns cell wall or unknown.
    • CytoSVM+: The Gram-positive version of the support vector machine trained to identify cytoplasmic proteins. Returns cytoplasmic or unknown.
    • ECSVM+: The Gram-positive version of the support vector machine trained to identify extracellular proteins. Returns extracellular or unknown.
    • ModHMM+: Predicts transmembrane helices within the sequence. The presence of 3 or more transmembrane helices causes the module to return a prediction of cytoplasmic membrane, otherwise unknown is returned. The Details column returns the number of predicted helices.
    • Motif+: Searches the sequence for Gram-positive motifs indicative of a specific localization site. If a match occurs, the localization site associated with that motif is reported, otherwise unknown is returned. The details column returns a link to the motif in PROSITE.
    • Profile+: Searches the sequence for Gram-positive profiles indicative of a specific localization site. If a match occurs, the localization site associated with that profile is reported, otherwise unknown is returned. The details column returns a link to the profile in PROSITE.
    • SCL-BLAST+: Performs a BLASTP search against the Gram-positive subset of the current PSORTdb dataset. If a match is found, its associated localization site is returned and a link to that protein's record at NCBI is provided in the Details column.
    • SCL-BLASTe+: Like SCL-BLAST, but only returns a match if the query and subject have 100% similarity and are within 1aa in length of each other. If a match is found, its associated localization site is returned and a link to that protein's record at NCBI is provided in the Details column.

    In the Localization Scores area, the confidence values for each of the localization sites are given. If one of the sites has a score of 7.5 or greater, this site and its score are returned in the Final Prediction section. If two sites have high scores, a flag of "This protein may have multiple localization sites" is also returned in the Final Prediction field.

    Modules used to predict sub-cellular localization in Gram-negative samples:

    The modules in the Gram-negative version, including those that differ from the Gram-positive version described above, are listed below:

    • CMSVM-: The Gram-negative version of the support vector machine trained to identify cytoplasmic membrane proteins. Returns cytoplasmic membrane or unknown.
    • CytoSVM-: The Gram-negative version of the support vector machine trained to identify cytoplasmic proteins. Returns cytoplasmic or unknown.
    • ECSVM-: The Gram-negative version of the support vector machine trained to identify extracellular proteins. Returns extracellular or unknown.
    • ModHMM-: Predicts transmembrane helices within the sequence. The presence of 3 or more transmembrane helices causes the module to return a prediction of cytoplasmic membrane, otherwise unknown is returned. The Details column returns the number of predicted helices.
    • Motif-: Searches the sequence for Gram-negative motifs indicative of a specific localization site. If a match occurs, the localization site associated with that motif is reported, otherwise unknown is returned. The details column returns a link to the motif in PROSITE.
    • OMPMotif-: Searches the sequence for Gram-negative outer membrane protein motifs. If a match occurs, outer membrane is reported, otherwise unknown is returned. The details column returns the numerical identifiers of the motifs found.
    • OMSVM-: The support vector machine trained to identify outer membrane proteins. Returns outer membrane or unknown (Gram-negative only).
    • PPSVM-: The support vector machine trained to identify periplasmic proteins. Returns periplasm or unknown (Gram-negative only).
    • Profile-: Searches the sequence for Gram-negative profiles indicative of a specific localization site. If a match occurs, the localization site associated with that profile is reported, otherwise unknown is returned. The details column returns a link to the profile in PROSITE.
    • SCL-BLAST-: Performs a BLASTP search against the Gram-negative subset of the current PSORTdb dataset. If a match is found, its associated localization site is returned and a link to that protein's record at NCBI is provided in the Details column.
    • SCL-BLASTe-: See above

    For the Gram-stain - Advanced options, the output for "Gram-positive with outer membrane" is similar to the normal Gram-negative output (with predictions for periplasmic and outer membrane localizations). The output for the "Gram-negative without outer membrane" option is similar to the normal Gram-positive output, except that the cell wall localization is not predicted in the final output, since Mycoplasma spp. and most Tenericutes are more phylogenetically similar to Gram-positive organisms but lack a peptidoglycan cell wall. If the modules predict "cell wall" as a protein's localization, the final localization will be flagged as "Unknown - predicted localization does not exist". From what we have observed, proteins with this prediction sometimes have a surface (cytoplasmic membrane) localization. Users should use their own discretion when interpreting PSORTm predictions in this case.

    In the Localization Scores area, the confidence values for each of the localization sites are given. If one of the sites has a score of 7.5 or greater, this site and its score are returned in the Final Prediction section. If two sites have high scores, a flag of "This protein may have multiple localization sites" is also returned in the Final Prediction field.

    Modules used to predict sub-cellular localization in archaeal samples:

    • CMSVM_a: The archaeal version of the support vector machine trained to identify cytoplasmic membrane proteins. Returns cytoplasmic membrane or unknown.
    • CWSVM_a: The archaeal version of the support vector machine trained to identify cell wall proteins. Returns cell wall or unknown.
    • CytoSVM_a: The archaeal version of the support vector machine trained to identify cytoplasmic proteins. Returns cytoplasmic or unknown.
    • ECSVM_a: The archaeal version of the support vector machine trained to identify extracellular proteins. Returns extracellular or unknown.
    • ModHMM_a: Predicts transmembrane helices within the sequence. The presence of 3 or more transmembrane helices causes the module to return a prediction of cytoplasmic membrane, otherwise unknown is returned. The Details column returns the number of predicted helices.
    • Motif_a: Searches the sequence for Gram-positive motifs indicative of a specific localization site, with the ones not applicable to Archaea removed. If a match occurs, the localization site associated with that motif is reported, otherwise unknown is returned. The details column returns a link to the motif in PROSITE.
    • Profile_a: Searches the sequence for Gram-positive profiles indicative of a specific localization site, with the ones not applicable to Archaea removed. If a match occurs, the localization site associated with that profile is reported, otherwise unknown is returned. The details column returns a link to the profile in PROSITE.
    • SCL-BLAST_a: Performs a BLASTP search against the Gram-positive and archaeal subset of the current PSORTdb dataset. If a match is found, its associated localization site is returned and a link to that protein's record at NCBI is provided in the Details column.
    • SCL-BLASTe_a: Like SCL-BLAST, but only returns a match if the query and subject have 100% similarity and are within 1aa in length of each other. If a match is found, its associated localization site is returned and a link to that protein's record at NCBI is provided in the Details column.

    In the Localization Scores area, the confidence values for each of the localization sites are given. If one of the sites has a score of 7.5 or greater, this site and its score are returned in the Final Prediction section. If two sites have high scores, a flag of "This protein may have multiple localization sites" is also returned in the Final Prediction field.

    5.4.3 Tab-delimited (Short Format) Output: Tab-delimited terse format output returns a list of inputted sequences, each one on a new line, with 3 columns: SeqId contains the information from the FASTA file definition line, Localization contains the final prediction of localization site (or "Unknown" if no site scored 7.5 or above), and Score contains the confidence value associated with this localization site. Tab characters occur between the columns, and, in the case of a multiple sequence submission, each sequence record is separated by newline characters. This format can be easily read into a spreadsheet, using a program such as MS Excel.

    5.4.4 Tab-delimited (Long Format) Output: Tab-delimited long format output returns a list of inputted sequences, each one on a new line, and with all of the information from the PSORTm results placed into columns. The SeqId, module results and comments from the analysis report, localizations and scores, and the final prediction and score are each placed into their own column.

    Whichever results format you choose, if you submit a file of taxonomic classifications with your sequences, the results will contain two extra columns at the end of each row for organism and run type. These extra fields will be absent if a file of taxonomic classifications is not provided (i.e. if a single organism type is specified when uploading the sequence file).

    • The organism value is taken from the file of taxonomic classifications.

    • The run type will indicate which set of options were used to calculate the result. Run type values can be 'pos', 'neg', 'adv_pos', 'adv_neg' or 'archaea'.
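As a hedged example of working with the short format, the sketch below reads the three columns described above plus the two optional trailing columns, then tallies the predicted localizations. The sample rows are invented for illustration and are not real PSORTm output. Whether the file starts with a header line is not stated here, so any row whose third field is not numeric is skipped.

```python
import io
from collections import Counter


def read_terse(handle):
    """Parse short-format rows into dicts; non-numeric score rows are skipped."""
    rows = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        try:
            score = float(fields[2])
        except ValueError:
            continue  # e.g. a header line, if the file has one
        row = {"seq_id": fields[0], "localization": fields[1], "score": score}
        if len(fields) >= 5:  # present when a taxonomy file was supplied
            row["organism"], row["run_type"] = fields[3], fields[4]
        rows.append(row)
    return rows


# Invented sample rows for illustration only.
sample = (
    "read|001\tCytoplasmic\t9.97\tPseudomonas\tneg\n"
    "read|002\tUnknown\t2.50\tPseudomonas\tneg\n"
)
rows = read_terse(io.StringIO(sample))
counts = Counter(r["localization"] for r in rows)
print(counts)
```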


    5.5 Quick Tutorial

    Input file preparation

    Step 1. Gather FASTQ files containing sequences of the metagenomics sample

    Step 2. Predict coding sequences (CDS) without sequence assembly

  • Input: FASTQ file
  • Output: protein FASTA file
  • Recommended tools:

    Step 3. Assign taxonomy to all translated reads in the metagenomic samples

    *Each sequence in the protein FASTA file should be assigned a taxonomy in this step.
    Recommended taxonomic classifiers for metagenomics:

    Step 4. Format the taxonomy file into the following two columns separated by a tab character

  • 1st column: read ID (identical to the read ID in the protein FASTA file)
  • 2nd column: NCBI taxonomy ID or taxonomy name
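Step 4 can be scripted once the classifier's assignments are available in memory. The sketch below writes the two-column tab-separated file and reports any read IDs from the protein FASTA file that have no assignment. The function name, arguments, and file name are assumptions; how the assignments are extracted depends on the classifier used.

```python
import os
import tempfile


def write_taxonomy_tsv(assignments, fasta_ids, path):
    """Write 'read ID<TAB>taxon' lines; return the IDs lacking an assignment.

    assignments maps read ID -> NCBI taxonomy ID or taxonomy name.
    fasta_ids lists the read IDs found in the protein FASTA file.
    """
    missing = [rid for rid in fasta_ids if rid not in assignments]
    with open(path, "w", newline="") as out:
        for rid in fasta_ids:
            if rid in assignments:
                out.write(f"{rid}\t{assignments[rid]}\n")
    return missing


# Write to a temporary directory so the example leaves no stray files.
out_path = os.path.join(tempfile.mkdtemp(), "taxonomy.tsv")
missing = write_taxonomy_tsv({"r1.1": "286"}, ["r1.1", "r1.2"], out_path)
print(missing)
```

Reads reported as missing would need a taxonomy assignment before a mixed-type run, since each sequence in the protein FASTA file is expected to have one.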

    Running PSORTm

    Step 5. Input (i) protein FASTA file and (ii) taxonomy file into PSORTm

    Step 6. Select output format (see Sections 5.4.3 and 5.4.4 for more details)

    Step 7. Let PSORTm run the analysis and generate SCL predictions

    6. Limitations

    PSORTm is designed to emphasize precision (or specificity) over recall (or sensitivity), and as a result, some classes of proteins are not predicted well. The following issues must be considered when performing an analysis using the current version of PSORTm:

    6.1 Organism-type prediction: In order to provide a more useful tool for predicting the localization of sequences within an environmental sample, we need to provide a way to predict the source organism of the sequence. Providing us with a file of taxonomic classifications allows us to make a good guess at which type of organism we are dealing with so we run the most appropriate modules for localization prediction for each input sequence. In cases where we cannot confidently categorize the organism type, we will send the uncategorized sequences back to you by email so that you are able to run these through PSORTm separately, having chosen the most appropriate organism type.

    6.2 Proteins resident at multiple localization sites: Many proteins can exist at multiple localization sites. Examples of such proteins include integral membrane proteins with large periplasmic domains, or autotransporters, which contain an outer membrane pore domain and a cleaved extracellular domain. The current version of PSORTm handles this situation by flagging proteins which show a distribution of localization scores favouring two sites, rather than one. It is important to examine the distribution of localization scores carefully in order to determine if your submitted protein may have multiple localization sites and if so, which two sites are involved.

    6.3 Lipoproteins: The current version of PSORTm does not detect lipoprotein motifs.

    6.4 Precision vs. Recall: PSORTb and PSORTm are designed to emphasize precision (or specificity) over recall (or sensitivity). Programs which make predictions at all costs often provide incorrect or incomplete results, which can be propagated through annotated databases, datasets and reports in the literature. We believe that a confident prediction is more valuable than any prediction, and we have designed the program to this end. Note, however, that a user may choose to use their own reduced cutoff score in generating final predictions.
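To make the trade-off concrete, the sketch below computes precision and recall for a set of predictions in which "Unknown" counts as no prediction. The definitions used (precision as correct predictions over predictions made, recall as correct predictions over all proteins of known localization) are the standard ones, and the toy data are invented for illustration.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for site predictions.

    predicted and actual map protein ID -> localization site.
    A predicted value of "Unknown" means no prediction was made.
    """
    made = {pid: site for pid, site in predicted.items() if site != "Unknown"}
    correct = sum(1 for pid, site in made.items() if actual.get(pid) == site)
    precision = correct / len(made) if made else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall


# Invented toy data: four proteins of known localization.
actual = {"p1": "Cytoplasmic", "p2": "Periplasmic", "p3": "Extracellular", "p4": "Cytoplasmic"}
cautious = {"p1": "Cytoplasmic", "p2": "Periplasmic", "p3": "Unknown", "p4": "Unknown"}
forced = {"p1": "Cytoplasmic", "p2": "Periplasmic", "p3": "OuterMembrane", "p4": "Cytoplasmic"}

print(precision_recall(cautious, actual))
print(precision_recall(forced, actual))
```

In this toy case the cautious predictor abstains on two proteins and keeps every prediction correct, while the forced predictor covers more proteins at the cost of one wrong call.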

     
