Alignments

Arriba takes the main output file of STAR (Aligned.out.bam) as input (parameter -x). If STAR was run with the parameter --chimOutType WithinBAM, then this file contains all the information needed by Arriba to find fusions. When STAR was run with the parameter --chimOutType SeparateSAMold, the main output file lacks chimeric alignments. Instead, STAR writes them to a separate output file named Chimeric.out.sam. In this case, the file needs to be passed to Arriba via the parameter -c in addition to the main output file Aligned.out.bam.

Arriba extracts three types of reads from the alignment file(s):

  1. Split-reads, i.e., reads composed of segments which map in a non-linear way. STAR stores such reads as supplementary alignments.

  2. Discordant mates, i.e., paired-end reads which originate from the same fragment but which align in a non-linear way.

  3. Alignments which cross the boundaries of annotated genes, because these alignments might arise from focal deletions. In RNA-Seq data deletions of up to several hundred kb are hard to distinguish from splicing. They are represented identically as gapped alignments, because the sizes of many introns are in fact of this order of magnitude. STAR applies a rather arbitrary measure to decide whether a gapped alignment arises from splicing or from a genomic deletion: The parameter --alignIntronMax determines what gap size is still assumed to be a splicing event and introns are used to represent these gaps. Only gaps larger than this limit are classified as potential evidence for genomic deletions and are stored as chimeric alignments. Most STAR-based fusion detection tools only consider chimeric alignments as evidence for gene fusions and are blind to focal deletions, hence. As a workaround, these tools recommend reducing the value of the parameter --alignIntronMax. But this impairs the quality of alignment, because it reduces the scope that STAR searches to find a spliced alignment. To avoid compromising the quality of alignment for the sake of fusion detection, the only solution would be to run STAR twice - once with settings optimized for regular alignment and once for fusion detection. This would double the runtime. In contrast, Arriba does not require to reduce the maximum intron size. It employs a more sensible criterion to distinguish splicing from deletions: Arriba considers all those reads as potential evidence for deletions that span the boundary of annotated genes.

The alignment files can be in SAM, BAM, and CRAM format. They need not be sorted for Arriba to accept them, but doing so comes with benefits: Often, this reduces the file size. And more importantly, the supporting reads of a fusion can be inspected visually using a genome browser like IGV, which typically requires BAM files to be sorted by coordinate.

Single-end and paired-end data and even mixtures are supported. Arriba automatically determines the data type on a read-by-read basis using the flag BAM_FPAIRED.

Assembly

Arriba takes the assembly as input (parameter -a) to find mismatches between the chimeric reads and the reference genome, as well as to find alignment artifacts and homologous genes.

The assembly must be provided in FastA format and may be gzip-compressed. An index with the file extension .fai must exist only if CRAM files are processed.

Annotation

The gene annotation (parameter -g) is used for multiple purposes:

  • annotation of breakpoints with genes

  • increased sensitivity for breakpoints at splice-sites

  • calculation of transcriptomic distances

  • determining the putative orientation of fused genes (i.e., 5' and 3' end)

GENCODE annotation is recommended over RefSeq annotation, because the former has a more comprehensive annotation of transcripts and splice-sites, which boosts the sensitivity. The file must be provided in GTF format and may be gzip-compressed. It does not need to be sorted.

Blacklist

It is strongly advised to run Arriba with a blacklist (parameter -b). Otherwise, the false positive rate increases by an order of magnitude. For this reason, using Arriba with assemblies or organisms which are not officially supported is not recommended. At the moment, the supported assemblies are: hg19, hs37d5, GRCh37, hg38, and GRCh38 (and any other assemblies that have compatible coordinates). Support for mm10 is in development. The blacklists are contained in the release tarballs of Arriba.

The blacklist removes recurrent alignment artifacts and transcripts which are present in healthy tissue. This helps eliminate frequently observed transcripts, such as read-through fusions between neighboring genes, circular RNAs and other non-canonically spliced transcripts. It was trained on RNA-Seq samples from the Human Protein Atlas, the Illumina Human BodyMap2 , the ENCODE project , the Roadmap project, and the NCT MASTER cohort, a heterogeneous cohort of cancer samples, from which highly recurrent artifacts were identified.

The blacklist is a tab-separated file with two columns and may optionally be gzip-compressed. Lines starting with a hash (#) are treated as comments. Each line represents a pair of regions between which events are ignored. A region can be

  • a 1-based coordinate in the format CONTIG:POSITION, optionally prefixed with the strand (example: +9:56743754).

  • a range in the format CONTIG:START-END, optionally prefixed with a strand (example: 9:1000000-1100000).

  • the name of a gene given in the provided annotation.

In addition, special keywords are allowed for the second column:

  • any: Discard all events if one of the breakpoints matches the given region.

  • split_read_donor: Discard fusions only supported by split reads, if all of them have their anchor in the gene given in the first column. This filter is useful for highly mutable loci, which frequently trigger clipped alignments, such as the immunoglobulin loci or the T-cell receptor loci.

  • split_read_acceptor: Discard events only supported by split reads, if all of them have their clipped segment in the given region.

  • split_read_any: Discard events only supported by split reads, regardless of where the anchor is.

  • discordant_mates: Discard fusions, if they are only supported by discordant mates (no split reads).

  • low_support: Discard events, which have few supporting reads relative to expression (as determined by the filter relative_support), even if there is other evidence that the fusion might be a true positive, nonetheless. This keyword effectively prevents recovery of speculative events by filters such as spliced or many_spliced.

  • filter_spliced: This keyword prevents the filter spliced from being applied to a given region. It is triggered under the same circumstances as the keyword low_support, but additionally requires that the breakpoints be at splice-sites for the event to be discarded. Some breakpoints produce recurrent artifacts, but the second breakpoint is always a different one, such that the pair of breakpoints is not recurrent and cannot be blacklisted. Often, such breakpoints are at splice-sites and the filter spliced tends to recover them. This keyword prevents the filter from doing so.

  • not_both_spliced: This keyword discards events, unless both breakpoints are at splice-sites. This is a strict blacklist criterion, which makes sense to apply to genes which are prone to produce artifacts, because they are highly expressed, for example hemoglobins, collagens, or ribosomal genes.

  • read_through: This keyword discards events, if they could arise from read-through transcription, i.e., the supporting reads are oriented like a deletion and are at most 400 kb apart.

Known fusions

Arriba can be instructed to be particularly sensitive towards events between certain gene pairs by supplying a list of gene pairs (parameter -k). A number of filters are not applied to these gene pairs. This is useful to improve the detection rate of expected or highly relevant events, such as recurrent fusions. Occassionally, this leads to false positive calls. But if high sensitivity is more important than specificity, this might be acceptable. Events which would be discarded by a filter and were recovered due to being listed in the known fusions list are usually assigned a low confidence.

A comprehensive list of known fusions can be obtained from CancerGeneCensus in the section titled "Complete Fusion Export". Depending on the gene annotation that is used to run Arriba, some gene names need to be adjusted.

The file has two columns separated by a tab. Each line lists a pair of genes. The order of the genes is irrelevant. Arriba searches for both genes as the 5' end and the 3' end of a fusion. Lines starting with a hash (#) are treated as comments. Optionally, the file can be gzip-compressed.

Structural variant calls from WGS

If whole-genome sequencing (WGS) data is available, the sensitivity and specificity of Arriba can be improved by passing a list of structural variants detected from WGS to Arriba:

  • Certain filters are overruled or run with extra sensitive settings, when an event is confirmed by WGS data.

  • To reduce the false positive rate, Arriba does not report low-confidence events unless they can be matched with a structural variant found in the WGS data.

Both of these behaviors can be disabled by disabling the filters genomic_support and no_genomic_support, respectively. Providing Arriba with a list of structural variant calls then does not influence the calls, but it still has the benefit of filling the columns closest_genomic_breakpoint1 and closest_genomic_breakpoint2 with the breakpoints of the structural variant which is closest to a fusion.

The file must contain four columns separated by tabs. The first two columns contain the breakpoints of the structural variants in the format CONTIG:POSITION. The last two columns contain the orientation of the breakpoints. The accepted values are:

  • downstream or +: the fusion partner is fused downstream of the breakpoint, i.e., at a coordinate higher than the breakpoint

  • upstream or -: the fusion partner is fused at a coordinate lower than the breakpoint

Arriba checks if the orientation of the structural variant matches that of a fusion detected in the RNA-Seq data. If, for example, Arriba predicts the 5' end of a gene to be retained in a fusion, then a structural variant is expected to confirm this, or else the variant is not considered to be related.