Pipeline for lncRNA annotation from RNA-seq data (PLAR)

PLAR paper is now published in Cell Reports.

To get updates about PLAR (new versions etc.), please join the mailing list on Google groups.
To ask questions post the user support group or e-mail us.

New (Updated April 2021) - Human GENCODE vs other species and Human vs. Mouse GENCODE orthologs

Due to several requests, we are releasing an assingment of orthologs, determined using the same methods used in Hezroni et al. (BLAST, Whole Genome Alignment (WGA), and synteny). One is comparing human GENCODE genes (from GENCODE v30) to lncRNAs from other species identified by PLAR. Available here.

Another is comparing the human GENCODE (v30) to the mouse GENCODE (vM21). Available here.

Feel free to contact us if you have questions, and please cite this website and Hezroni et al. Cell Reports 2015 if you use these in a publication.

Transcriptomes and lincRNA annotations - UCSC genome browser hub

All the tracks for the PLAR-reconstructed transcripts and most corresponding RNA-seq data are available as a hub that can be added to the UCSC genome browser using (just paste the link into the “Add hub” window in UCSC):
ftp://ftp-igor.weizmann.ac.il/pub/hubPLAR.txt

Transcriptomes and lincRNA annotations - Download

There are our annotations described in Hezroni et al. Click on the “Download” links to obtain gzipped BED files. The naming convention that we’re using is as follows: TYPE|3P|GENE_ID|TRANSCRIPT_ID|GENE_NAME:FPKM, where:

TYPE is the assigned type of the transcript:
- EnsCodingFull - the transcript overlaps a protein-coding gene and contains the annotated start and stop codons
- EnsCodingFragment - the transcript overlaps a protein-coding gene but does not contain either the start and the stop codon (or does not contain both)
- linc - a lincRNA transcript
- EnsASCoding - an antisense transcript
- EnsShortNoncoding - a precursor for small RNAs
- TooLow - a transcript that did not pass the minimum expression level thresholds
- TooShort - a transcript that did not pass the minimum lenght thresholds
- SingleExonStringent - a transcript that did not pass the more stringent criteria for single-exon transcripts
- Repetitive - a transcript more than 50% of which is contained within a single repeat
- IntronContained - a single exon transcript that overlaps a multi-exon transcript by more than 90% of its length
3P - fixed value, unless 3P-seq was used and the transcript is overlapping a 3P-seq tag instead of ending close to it, and then this token will be 3PO
GENE_ID - the gene identified, assigned by cuffmerge (typically XLOC_XXXX)
TRANSCRIPT_ID - the transcript/isoform identified, assigned by cuffmerge (typically TCONS_XXXX)
GENE_NAME - for protein-coding genes in some species, this will be the Ensembl Gene Symbol
FPKM - the maximum expression level of the transcript across all the samples used

Species	Assembly	Code	Transcriptome	lncRNAs	Protein-coding
Human	hg19	hg19	Download	Download	Download
Rhesus	rheMac3	rm3	Download	Download	Download
Marmoset	calJac3	cj3	Download	Download	Download
Mouse	mm9	mm9	Download	Download	Download
Rabbit	oryCun2	oc2	Download	Download	Download
Dog	canFam3	cf3	Download	Download	Download
Ferret	musFur1	oa3	Download	Download	Download
Opossum	monDom5	md5	Download	Download	Download
Chicken	galGal4	gg4	Download	Download	Download
Lizard	anoCar2	ac2	Download	Download	Download
Coelacanth	latCha1	lc1	Download	Download	Download
Zebrafish	danRer7	dr7	Download	Download	Download
Stickleback	gasAcu1	ga1	Download	Download	Download
Nile tilapia	oreNil2	ot2	Download	Download	Download
Spotted gar	lepOcu1	lo1	Download	Download	Download
Elephant shark	calMil1	cm1	Download	Download	Download
Sea urchin	strPur4	sp4	Download	Download	Download

PLAR manual

The following describes the execution of PLAR using RNA-seq data in an organism for which UCSC tables and Ensembl annotations are available. All steps are designed to run on a Linux operating system, ideally with a workload management software, such as LSF, installed, to avoid very long running times for CPC and HMMER steps.

Part 0 - Obtain annotation files for the studied species and make sure required software is installed

PLAR relies on having existing annotation files, obtained from the UCSC genome browser or from Ensembl via BioMart. The following files are needed (examples provided for stickleback), capital-letter names (e.g., 2BIT_GENOME) are later used in description of the individual PLAR steps:
From the UCSC genome browser (note that the sample files we provide are gzipped, so you will have to unpack them prior to use):

2BIT_GENOME - Unmasked genome sequence in 2bit format: http://hgdownload.soe.ucsc.edu/goldenPath/gasAcu1/bigZips/gasAcu1.2bit
2BIT_MASKED - Repeat-masked genome sequence in 2bit format - typically you will need to obtain the masked FASTA file from UCSC, e.g., from: http://hgdownload.soe.ucsc.edu/goldenPath/gasAcu1/bigZips/chromFaMasked.tar.gz Then unite it into a single file (“cat *.fa > gasAcu1.masked.fa”) and then use faToTwoBit to convert to a 2bit format
REFSEQ_GENES - Annotations of RefSeq genes (if available), e.g., ga1.refseq_genes.txt
REFSEQ_OTHER - Annotations of RefSeq genes mapped from other species (Other RefSeq), e.g., ga1.other_refseq.txt.gz
HUMAN_PROTEINS - Annotations of human proteins mapped to the genome (if available), e.g.: ga1.human_proteins.txt.gz
GAP_FILE - Annotations of gaps in the genome obtained through Table Browser, e.g., ga1.gaps.txt
REPEAT_FILE - Annotations of repetitive elements (RepeatMasker) obtained through the Table Browser, e.g., ga1.repeats.txt
CHROM_SIZES - A file with sizes of the chromosomes that can be obtained using either by the fetchChromSizes or the twoBitInfo utilities from the UCSC genome browser.

From Ensembl:
The following tables can be downloaded through the Ensembl BioMart inferface:

ENSEMBL_GENES - a table containing the following columns (available from the “Structures” option in Biomart, sample file for Stickleback ga1.ens77.txt):
- Ensembl Gene ID
- Ensembl Transcript ID,
- Chromosome Name
- Gene Start (bp)
- Gene End (bp)
- Transcript Start (bp)
- Transcript End (bp)
- Strand
- Gene name
- 5' UTR Start
- 5' UTR End
- 3' UTR Start
- 3' UTR End
- Gene type
- Exon stable ID
- Exon region Start (bp)
- Exon region End (bp)
ENSEMBL_INFO1 - a table containing the following columns (available from the “Features” option in Biomart, sample file ga1.ensData1.txt):
- Ensembl Gene ID
- Ensembl Transcript ID
- Transcript Biotype
- Gene Biotype
- Status (gene)
- Status (transcript)
- Ensembl Family Description
- Ensembl Protein Family ID(s)
ENSEMBL_INFO2 - a table containing the following columns (available from the “Structures” option in Biomart, sample file ga1.ensData2.txt):
- Ensembl Gene ID
- Ensembl Transcript ID
- Associated Gene Name
- Associated Gene Source
- CDS Length
- Description
- Gene Biotype

Make sure required software is installed:

Java needs to be installed for PLAR to run
PLAR java implementation needs to be downloaded from here.
CPC - http://cpc.cbi.pku.edu.cn/ needs to be installed in a writable directory. CPC requires a local copy of a sequence database. We used the RefSeq database in our study. As described in the manuscript we used a slightly modified version of the bin/run_predict.sh filter to run CPC. This modified version is available here.
HMMER - http://hmmer.janelia.org/ need to be installed, and Pfam databases (A and B, .hmm files) need to be downloaded from ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release and an HMM needs to be build for them using hmmpress .
(Optional) RNAcode - http://wash.github.io/rnacode/ needs to be installed
(Optional) Trinity - http://trinityrnaseq.sourceforge.net/ needs to be installed for the optional step of filtering lincRNAs using de novo assemblies
(Optional, required if using Trinity) Blat http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/blat/ needs to be installed in order to map the Trinity assemblies back to the genome.

Part I - Build transcriptome and quantify expression levels

Align RNA-seq reads: You can use the aligner of your choice, we used Tophat in two rounds, first one fed with known junctions from Ensembl, RefSeq, and other sources (all can be obtained from the UCSC genome browser and fed to TopHat via the -G option). For the second round, we compiled together the junctions from all the runs in the first round and reran using the -j option.
Assemble transcript models using CuffLinks with Ensembl annotations as seed - In principle other assmblers can work as well, but we only experimented with Cufflinks so far. The Ensembl annotations (as a GTF file that can be obtained from the UCSC Table Browser) are used as baseline annotations (-g option with GTF_FILE as path to the GTF file), e.g., ga1.ens.gtf. Sample command:
cufflinks -g GTF_FILE -q -p 32 -L Blood -o Blood.bam
Create a text file gtfs.txt that contains the paths to all the gtf files from the cufflinks runs. Combine transcript models using Cuffmerge, again using Ensembl protein-coding genes as reference:
cuffmerge -p 32 -g GTF_FILE -o CUFFMERGE_OUT gtfs.txt
Compute expression levels in FPKM using cuffdiff:
cuffdiff -p 32 -q --no-diff -o CUFFDIFF_OUT CUFFMERGE_OUT/merged.gtf BAM_FILES

Part II - Annotate transcriptome

The next steps use different steps of PLAR, implemented in java and available for download here. Throughout the steps VERSION_NAME indicates the current custom transcriptome assembly name, that you can select. All output and temporary files from different steps will be placed in the PATH, which is expected to have a subdirectory called “sequences/”. :

Process the Cuffmerge file: The first step traverses the cuffmerge output (merged.gtf):
./find_lincs.csh plar_process_cuffmerge PATH VERSION_NAME
CUFFMERGE_OUT/merged.gtf ENSEMBL_FILE CHROM_SIZES
Instead of ENSEMBL_FILE you can provide a BED file - if the filename has “.bed” in it, it is treated as a BED file, otherwise as an Ensembl file formatted as described above (the ENSEMBL_GENES file).
Annotate the transcriptome and identify potential lincRNAs:
./find_lincs.csh plar_identify_lincs PATH VERSION_NAME SAMPLE_NAMES ENSEMBL_FILE ENSEMBL_DATA1
ENSEMBL_DATA2 REFSEQ_GENES REFSEQ_OTHER HUMAN_PROTEINS CUFFDIFF_OUT/isoforms.fpkm_tracking REPEAT_FILE MIN_SPLICED_LENGTH MIN_UNSPLICED_LENGTH MIN_SPLICED_FPKM MIN_UNSPLICED_FPKM ADD_CHR_TO_ENS_CHROMS
- Instead of ENSEMBL_FILE you can provide a BED file - if the filename has “.bed” in it, it is treated as a BED file, otherwise as an Ensembl file formatted as described above. If you are providing a BED file, ADD_CHR_TO_ENS_CHROMS is ignored.
- You can use NONE instead of the file names for ENSEMBL_DATA1, ENSEMBL_DATA2, REFSEQ_GENES, REFSEQ_OTHER, HUMAN_PROTEINS if the corresponding file is not available.
- SAMPLE_NAMES are coma-separated names of the samples that were processed using cuffdiff
- CUFFDIFF_OUT is the directory where cuffdiff output was stored (needs to contain isoforms.fpkm_tracking file)
- MIN_SPLICED_LENGTH and MIN_UNSPLICED_LENGTH are the minium lengths of the mature sequences that will be considered as candidates, for spliced (multi-exon) and unspliced transcripts, respectively (we used 200 and 2000)
- MIN_SPLICED_FPKM and MIN_UNSPLICED_FPKM are the threshold expression levels for the isoforms that will be considered, for spliced and unspliced transcripts, respectively (we used 0.1 and 5)
- ADD_CHR_TO_ENS_CHROMS is set to TRUE or FALSE, depending on whether “chr” should be added to the chromosome names in the Ensembl file to make them match the names in the other files
Analyze the ORF sizes in candidate lincRNAs and prepare input files for CPC, HMMer, RNAcode:
./find_lincs.csh plar_analyze_sequences PATH VERSION_NAME TWO_BIT_GENOME TWO_BIT_MASKED SEQUENCES_PATH CPC_PATH
- SEQUENCES_PATH is the full path to the sequences/ directory (placed under PATH)
- CPC_PATH is the full path to the CPC installation
- In order to substantially speed up CPC and HMMER, this step splits up the lincRNAs into batches of 100 sequences that can be processed in parallel by a workload management platform (we use LSF), and the output can then be combined. This step also produces scripts that are used for running the tools, if LSF is installed. If you are using another system you will have to adapt the scripts accordingly.
Run CPC, HMMER, and (optionally RNAcode) on the sequences and multiple alignments.
- CPC: As described above, CPC needs to be installed locally. The previous step produces a run_VERSION_NAME.csh file and places it in the CPC_PATH. This script for running CPC on each batch of 100 sequences separately and combining the output in the end.
- Hmmer: As described above, HMMER needs to be installed and Pfam files need to be downloaded. Links to the .hmm, .hmm.h3f, .hmm.h3i, .hmm.h3m, and .hmm.h3p files for both Pfam-A and Pfam-B (or the files themselves) need to be found in the SEQUENCES_PATH directory. The previous step will place a script VERSION_NAME_hmmer_all.csh in the SEQUENCES_PATH directory. Use this script to run HMMER on the 100 sequence batches and combine the results.
- (Optional - if whole genome alignments are available) RNAcode: As described above, RNAcode needs to be installed. The previous steps generated a VERSION_NAME.exons.bed file that contains a single entry for each BED file. This file can be used to extract subsets of whole genome alignments corresponding to the exons of the lncRNA candidates. One way to do this is to upload the BED file as a custom track to the UCSC genome browser, and then to use the Table Browser interface to download only the alignments that intersect with this track. Then RNAcode needs to be executed on these alignments and the output needs to be stored in a text file. Following the exection of RNAcode, you’ll need to process the RNAcode text output file using ./find_lincs.csh plar_process_rnacode RNA_CODE_OUTPUT_FILE RNACODE_BED_FILE SPECIES_CODE PVAL_THRESOLD
  - RNACODE_BED_FILE is a BED file that will be generated and contain an entry for each RNAcode-predicted coding stretch. This file is used for filtering in the next stem
  - SPECIES_CODE is the name of the species that you’re studying in the whole genome alignment (e.g., hg19 for human in the 100way human whole genome alignment)
  - PVAL_THRESHOLD is the p-value threshold below which RNAcode predictions will be considered significant (we used 1e-4)
Combine the results of the coding predictors: The next steps combine the results of CPC, HMMER, (and RNAcode if used) and filters predicted coding transcripts: ./find_lincs.csh plar_parse_predictors PATH VERSION_NAME RNACODE_BED_FILE HMMER_OUTPUT CPC_OUTPUT
- RNA_BED_FILE is the converted RNAcode file described in the previous step - use “NONE” is not applicable
- HMMER_OUTPUT is the combined HMMER output file
- CPC_OUTPUT is the combined CPC output file
Further filter transcripts near coding genes etc.: ./find_lincs.csh plar_filter PATH VERSION_NAME ENSEMBL_GENES ENSEMBL_DATA1 ENSEMBL_DATA2 GAPS_FILE CHROM_SIZES MIN_DISTANCE_UNSPLICED MIN_DISTANCE_SPLICED ADD_CHR_TO_ENS_CHROMS
- MIN_DISTANCE_UNSPLICED is the minimum distance required from the end of a protein-coding gene and the lincRNA if the lincRNA is multi-exon (we used 500)
- MIN_DISTANCE_SPLICED is the minimum distance required from the end of a protein-coding gene and the lincRNA if the lincRNA is single-exon (we used 2000)
- This step will result in a gzipped file merge_v1.lincs.f1.pure.bed.gz that contains the resulting lncRNA models that survived filtering
- ADD_CHR_TO_ENS_CHROMS is set to TRUE or FALSE, depending on whether “chr” should be added to the chromosome names in the Ensembl file to make them match the names in the other files
(Optional) filtering transcripts using Trinity. Trinity and BLAT need to be installed for this step.
- You first need to run trinity independently on each sample. We were using the following arguments, that use the LSF support of Trinity: Trinity --grid_conf trinityConfig.csh --seqType fq --SS_lib_type FR --CPU 64 --JM 80G --output OUTPUT_FILE --left <(zcat READ1_FASTQ_GZIPPED) --right <(zcat READ1_FASTQ_GZIPPED)
- Once Trinity concludes, the resulting transcript sequences need to be mapped to the genome using BLAT (or any other aligner whose output can be converted into a PSL file. The paths of all the resulting PSL files need to be placed in a text file TRINITY_FILE_LIST
- You can now use the Trinity models to filter the lncRNA candidates using ./find_lincs.csh plar_filter_trinity PATH VERSION_NAME TRINITY_FILE_LIST FALSE
- If the trinity models are antisense to your real models, please use TRUE as the last argument.
- This step will result in a gzipped file merge_v1.lincs.f1.pristine.bed.gz that contains the lncRNA models that passed the Trinity filter