Identification of eRNA production centers (EPCs) using DHS and GRO-seq data

This page contains the accompanying software for Gil & Ulitsky, Cell Systems 2018.

Table of contents:

Prerequisites:

Java version 1.8.0 or higher is required. Download jar files:

Instructions for running the code (Unix)

Run (from the directory in which the jar files are found):

java -Xmx48000m -cp compbio.jar:log4j-1.2.15.jar:compbioLib.jar:bigWig.jar scripts.lincs.AnalyzeEnhancerRNAs analyze_dhs GROSEQ_FILE SELECT_REGIONS DHS_FILE -minSum XX CHROM_SIZES_FILE OUTPUT_BASE

Required arguments:

  • GROSEQ_FILE – basename for two normalized GRO-seq bigWig files, that should be named GROSEQ_FILE.plus.bw and GROSEQ_FILE.minus.bw, for the plus and minus strand read coverage in the GRO-seq data (note that values in both files should be positive!). The bigWig files should contain the signal from the 5’ of the read (for example, if you are using STAR for mapping the reads, use --outWigType bedGraph read1 5p)
  • SELECT_REGIONS – a bed file with genomic coordinates (for example, enhancer annotations); only DHSs that overlap these SELECT_REGIONS will be used for the analysis. If SELECT_REGIONS is set to NONE (as in the analysis in Gil & Ulitsky), all DHSs will be used.
  • DHS_FILE – a BED file with DHS coordinates.
  • -minSum XX – minimal signal on each strand in the probed region required for further consideration (in Gil & Ulitsky we used a minSum of 2 for HepG2/MCF7/K562 and 8 for mES).
  • CHROM_SIZES_FILE – a file with sizes of the chromosomes. This tab-delimited file has two columns – chromosome name and size. Can be obtained by either the fetchChromSizes or the twoBitInfo utilities from the UCSC genome browser.
  • OUTPUT_BASE – the base name for the output files.

Optional arguments (not used in the analysis presented in the paper):

  • -exclude BED_FILE – if both SELECT_REGIONS and -exclude are specified, all DHS sites except those that overlap regions in SELECT_REGIONS will be used.
  • -regionSize XX – size of the region flanking the DHS on both sides, within which -minSum should be computed. Default is 2,000 bp.
  • -minSideLen XX – the minimum length of each BED element required for the eRNAs to be reported. Default is 0.
  • -maxSignalRatio XX – the maximal permitted ratio between the signals on the plus strand and minus strand required for the eRNAs to be reported. Default is infinity.

Software overview:
The script iterates over the DHS regions (or only those overlapping SELECT_REGIONS if this parameter is specified) and computes, for each DHS, the GRO-seq reads in the region between the DHS start and a certain window downstream of its end [i.e. (DHS start) to (DHS end+regionSize) – referred to as ‘plus_signal’], and from upstream of its start to the DHS end [i.e. (DHS start-regionSize) to (DHS end) – referred to as ‘minus_signal’]. Only DHSs with min(plus_signal,minus_signal)>=minSum are considered further. In those regions a ‘separation’ score for each position in the DHS ±regionSize is computed. The separation score for position P is defined as the geometric mean of (a) the fraction of the GRO-seq signal on the minus strand to the left of P out of minus_signal, and (b) the fraction of the GRO-seq signal on the plus strand to the right of P out of plus_signal. The position achieving the highest score (Pmax) is defined as the EPC. If multiple genomic positions have the same P, the midpoint of those positions is defined as the EPC. To compute eRNA boundaries, Pfirst is defined as the leftmost position within the considered region where 90% of the minus_signal is found to its right, and Plast is defined as the rightmost position within the considered region where 90% of the plus_signal is found to its left. The regions (Pfirst,Pmax) and (Pmax,Plast) are reported as the eRNA pair.

Description of output files:
The script generates the following files:

  • OUTPUT.eRNA.bed – a BED12 file containing eRNA pair coordinates. Each two lines correspond to the two eRNA transcripts comprising the bidirectional eRNA pair. The ‘end’ coordinate of the minus-strand eRNA of each pair, or the ‘start’ coordinate of the plus-strand eRNA of each pair, is the EPC. The ‘name’ column includes the original DHS coordinates as well as the GRO-seq signal sum.
  • OUTPUT.nooverlap.eRNA.bed – a similar BED12 file, except removing all overlapping eRNA pairs (retaining the pair with the higher GRO-seq signal sum).
  • OUTPUT.enhancerStats.txt – a tab-delimited file with each row corresponding to one eRNA pair reported, and with the following columns
  1. DHS position
  2. Names of SELECT_REGIONS it overlaps (if parameter is provided)
  3. Length of the combined region in column 3 (if relevant)
  4. Sum of the GRO-seq signal on the plus strand
  5. Sum of the GRO-seq signal on the minus strand
  6. EPC coordinate
  7. Separation score
  8. Sum of minus strand signal at the position giving optimal separation score
  9. Sum of plus strand signal at the position giving optimal separation score
  10. Fraction of minus strand signal to the left of the position giving optimal separation score
  11. Fraction of plus strand signal to the right of the position giving optimal separation score
  12. Beginning coordinate of the minus strand eRNA
  13. End coordinate of the plus strand eRNA
  • OUTPUT.acc1.wig – a wiggle file of the plus signal for each position in the DHS ±regionSize region.
  • OUTPUT.acc2.wig – a wiggle file ofthe minus signal for each position in the DHS ±regionSize region.
  • OUTPUT.score.wig – a wiggle file ofthe ‘separation’ score for each position in the DHS ±regionSize region.

Sample input and output files:

The following files were used in the paper and can be used as an example

To run the analyze_dhs script:

java -Xmx48000m -cp compbio.jar:log4j-1.2.15.jar:compbioLib.jar:bigWig.jar scripts.lincs.AnalyzeEnhancerRNAs analyze_dhs K562_GroSeq NONE DHS_K562.bed -minSum 2 hg19.chrom.sizes K562

Examples of output files: