Chimeric alignments

Arriba processes two BAM files with alignments which potentially indicate structural rearrangements: the chimeric alignments file (parameter -c) and the read-through alignments file (parameter -r). The chimeric alignments file contains evidence about translocations, inversions, duplications and large deletions. It lacks information about deletions smaller than the usual intron size. The latter is provided by the read-through alignments file.

In RNA-Seq data deletions of up to several hundred kb are hard to distinguish from splicing. Both are represented identically as gapped alignments and the sizes of many introns are in fact of this order of magnitude. STAR applies a rather arbitrary measure to decide whether a gapped alignment arises from splicing or from a genomic deletion: The parameter --alignIntronMax determines what gap size is still assumed to be a splicing event. Only gaps larger than this limit are classified as potential evidence for genomic deletions and are stored in the chimeric alignments file. Effectively, this makes it impossible to detect deletions smaller than the given size from the chimeric alignments file, since it lacks all supporting reads. As a workaround, many STAR-based fusion detection pipelines recommend reducing the value of the parameter --alignIntronMax. But this impairs the quality of alignment, because it reduces the scope that STAR searches to find a spliced alignment. To avoid compromising the quality of alignment for the sake of fusion detection, the only solution would be to run STAR twice - once with settings optimized for normal alignment and once for fusion detection. This would double the runtime.

Arriba offers an alternative, non-compromise solution. It comes with the helper utility extract_read-through_fusions, which employs a more sensible criterion to distinguish splicing from deletions. The utility considers all those reads as potential evidence for deletions that span the boundary of a gene, i.e.,

Since extract_read-through_fusions runs independently from STAR, it is not necessary to use suboptimal alignment parameters in order to detect small deletions. The utility scans through the normal alignments file and extracts all alignments which violate gene boundaries as specified by the gene model passed via the parameter -g. The extracted reads are stored in the read-through alignments file, which has the same format as the chimeric alignments file.

Both files need to be in BAM format, before they are passed to arriba. extract_read-through_fusions generates BAM files by default, but STAR stores the chimeric alignments in SAM format in the file Chimeric.out.sam. This file needs to be converted to BAM first using a utility such as samtools or sambamba.

The files need not be sorted for arriba to accept them, but doing so comes with benefits: Often, this reduces the file size. And more importantly, the supporting reads of a fusion can be inspected visually using a genome browser like IGV, which typically requires BAM files to be sorted by coordinate.

Single-end and paired-end data and even mixtures are supported. Arriba automatically determines the data type on a read-by-read basis using the flag BAM_FPAIRED.

Normal alignments

arriba uses the normal alignments (parameter -x) to lookup the expression at predicted breakpoints. Moreover, the sequencing depth is calculated from the total number of reads. This influences the statistical model used to calculate the significance of the number of supporting reads of a fusion; the higher the depth, the more reads are required. This file must be sorted by coordinate and indexed, because arriba needs to lookup positions by coordinate and query statistics from the index.

Assembly

arriba takes the assembly as input (parameter -a) to find mismatches between the chimeric reads and the reference genome, as well as to find alignment artifacts and homologous genes. The assembly must be provided in FastA format and may be gzip-compressed. It is not necessary to generate an index, because the entire file is loaded into memory anyway.

Annotation

The gene annotation (parameter -g) is used for multiple purposes:

Gencode annotation is recommended over RefSeq annotation, because the former has a more comprehensive annotation of transcripts and splice-sites, which boosts the sensitivity. The file must be provided in GTF format and may be gzip-compressed. It does not need to be sorted.

Blacklist

It is strongly advised to run arriba with a blacklist (parameter -b). Otherwise, the false positive rate increases by an order of magnitude. For this reason, using Arriba with assemblies or organisms which are not officially supported is not recommended. At the moment, the supported assemblies are: hg19, hs37d5, GRCh37, hg38, and GRCh38 (and any other assemblies that have compatible coordinates). Support for mm10 is in development. The blacklists are contained in the release tarballs of Arriba.

The blacklist removes recurrent alignment artifacts and transcripts which are present in healthy tissue. This helps eliminate frequently observed transcripts, such as read-through fusions between neighboring genes, circular RNAs and other non-canonically spliced transcripts. It was trained on RNA-Seq samples from the Human Protein Atlas, the Illumina Human BodyMap2 , the ENCODE project , the Roadmap project, and the NCT MASTER cohort, a heterogeneous cohort of cancer samples, from which highly recurrent artifacts were identified.

The blacklist is a tab-separated file with two columns and may optionally be gzip-compressed. Lines starting with a hash (#) are treated as comments. Each line represents a pair of regions between which events are ignored. A region can be

In addition, special keywords are allowed for the second column:

Known fusions

arriba can be instructed to be particularly sensitive towards events between certain gene pairs by supplying a list of gene pairs (parameter -k). A number of filters are not applied to these gene pairs. This is useful to improve the detection rate of expected or highly relevant events, such as recurrent fusions. Occassionally, this leads to false positive calls. But if high sensitivity is more important than specificity, this might be acceptable. Events which would be discarded by a filter and were recovered due to being listed in the known fusions list are usually assigned a low confidence.

A comprehensive list of known fusions can be obtained from CancerGeneCensus in the section titled "Complete Fusion Export". Depending on the gene annotation that is used to run arriba, some gene names need to be adjusted.

The file has two columns separated by a tab. Each line lists a pair of genes. The order of the genes is irrelevant. arriba searches for both genes as the 5' end and the 3' end of a fusion. Lines starting with a hash (#) are treated as comments. Optionally, the file can be gzip-compressed.

Structural variant calls from WGS

If whole-genome sequencing (WGS) data is available, the sensitivity and specificity of Arriba can be improved by passing a list of structural variants detected from WGS to Arriba:

Both of these behaviors can be disabled by disabling the filters genomic_support and no_genomic_support, respectively. Providing arriba with a list of structural variant calls then does not influence the calls, but it still has the benefit of filling the columns closest_genomic_breakpoint1 and closest_genomic_breakpoint2 with the breakpoints of the structural variant which is closest to a fusion.

The file must contain four columns separated by tabs. The first two columns contain the breakpoints of the structural variants in the format CONTIG:POSITION. The last two columns contain the orientation of the breakpoints. The accepted values are:

Arriba checks if the orientation of the structural variant matches that of a fusion detected in the RNA-Seq data. If, for example, Arriba predicts the 5' end of a gene to be retained in a fusion, then a structural variant is expected to confirm this, or else the variant is not considered to be related.