Intragenic deletions

Arriba can detect intragenic inversions and duplications, but not deletions. This is because deletions within a gene are difficult to distinguish from ordinary splicing in RNA-Seq data. Arriba's statistical model to find significant events is not applicable to the identification of a significant lack of exon coverage. These questions are better answered by indel callers, whole-genome sequencing, or algorithms to identify differential exon expression. For these reasons, Arriba does not report any intragenic deletions.

RefSeq annotation

It is recommended to use annotation from GENCODE or ENSEMBL to run Arriba. RefSeq annotation has less comprehensive annotation of splice sites. Moreover, RefSeq does not annotate the immunoglobulin/T-cell receptor loci. These shortcomings reduce the sensitivity of fusion detection. Users who want to use RefSeq nonetheless are advised to copy the immunoglobulin/T-cell receptor loci annotation from GENCODE/ENSEMBL as a workaround.

Memory consumption

Arriba usually consumes less than 10 GB of RAM. Approximately 1 GB of RAM is consumed per million chimeric read pairs, plus 4 GB of static overhead to load the assembly and gene annotation. Multiple myeloma samples in particular frequently exceed the normal memory requirements due to countless rearrangements in the immunoglobulin loci. In order to reduce the memory footprint, Arriba can be instructed to subsample reads when an event has a sufficient number of supporting reads. By default, further reads are ignored once an event has reached 300 supporting reads (see parameter -U). When this threshold has been hit, Arriba issues the message "WARNING: some fusions were subsampled, because they have more than 300 supporting reads".

However, excessive memory consumption can indicate a user error. So before reducing the maximum number of supporting reads, users should carefully check their scripts/data for mistakes. For example, if paired-end FastQ files are mistakenly passed to STAR in the wrong order, STAR will align almost all reads as discordant mates. Similarly, if the reads in paired-end FastQ files are not ordered properly (i.e., collated by name), then most of them will be aligned in a discordant fashion. When Arriba consumes an unusual amount of memory, users should inspect the file Log.final.out of STAR. If the % of chimeric reads reported in the log file is high, then scripts and input files should be checked for errors. The % of chimeric reads is normally in the range of 1-10%, with the exception of very few cancer types (such as multiple myeloma), where it is often much higher.
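The rule of thumb and the log check described above can be sketched in a few lines of Python. This is only an illustration and not part of Arriba: the function names are hypothetical, the 10% cut-off is simply the upper end of the normal range quoted above, and the regular expression assumes that Log.final.out contains a line of the form "% of chimeric reads | 1.23%".

```python
import re


def estimate_memory_gb(chimeric_read_pairs, gb_per_million=1.0, static_overhead_gb=4.0):
    """Rough RAM estimate: static overhead plus ~1 GB per million chimeric read pairs."""
    return static_overhead_gb + gb_per_million * chimeric_read_pairs / 1_000_000


def chimeric_read_percentage(log_text):
    """Return the '% of chimeric reads' value from the text of Log.final.out, or None."""
    match = re.search(r"% of chimeric reads\s*\|\s*([0-9.]+)%", log_text)
    return float(match.group(1)) if match else None


# Made-up log line for demonstration; in practice, read the real Log.final.out.
example_log = "                            % of chimeric reads |\t37.50%\n"

percentage = chimeric_read_percentage(example_log)
if percentage is not None and percentage > 10.0:
    print(f"{percentage}% chimeric reads: check FastQ order and collation first")

# 6 million chimeric read pairs: 4 GB overhead + 6 x 1 GB
print(estimate_memory_gb(6_000_000))
```

A high percentage is not necessarily an error (multiple myeloma samples are a legitimate exception), so the check is meant as a prompt to review the inputs, not as a hard failure.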

Adapter trimming

In most cases, it is not necessary to trim adapters, because STAR is capable of performing clipped alignments, such that reads with adapters will simply be aligned partially. For this reason, the demo script run_arriba.sh does not perform adapter trimming. However, if many reads contain a substantial fraction of adapter sequence, then these reads will not be aligned. Reads required for fusion detection will be affected in particular, because they are difficult to align anyway. This problem usually manifests as a high value for % of reads unmapped: too short in the log file of STAR. In this situation, it can be beneficial to remove adapters for improved sensitivity, because STAR dismisses chimeric alignments when a large fraction of a read cannot be aligned (as controlled by the parameter --outFilterMatchNminOverLread). To this end, either the STAR parameter --clip3pAdapterSeq or a specialized adapter trimming tool can be used.

Viral detection

Detection of virus expression and viral integration sites is limited to the viruses that are built into the STAR index and assembly used to run Arriba. When the script download_references.sh was used to generate the reference files, all ~12,000 RefSeq virus genomes as well as ~4,500 human-infecting virus strains related to the RefSeq viruses are included in the reference files. This includes all common cancer-associated viruses, such as human papillomavirus, Merkel cell polyomavirus, EBV, HIV, Hepatitis B/C virus, HTLV-1, and others. Users who wish to detect viruses beyond this default set need to append the respective virus genome sequence to the assembly and build the STAR index themselves.

Not all viruses integrate into the human genome. Their presence can often be detected nonetheless by checking if reads align to the respective virus genome. Arriba comes with a script quantify_virus_expression.sh, which counts reads mapping to virus genomes and applies a few filters to remove false positives.

Due to sequence similarity between related virus strains, it happens occasionally that STAR maps reads to a related virus strain instead of to the strain that actually infected a given sample. Arriba then sometimes reports integrations of multiple related strains in a single sample. Although it is usually wrong of Arriba to report that a single sample harbors integrations of multiple strains, the positions of the various integration sites can very well be correct, and therefore all of them are reported - even if their alignments come from related but distinct strains. Similarly, when a cell line was genetically engineered using cloning vectors, the vector may sometimes be detected by Arriba as being integrated near the engineered gene (and possibly other places in the genome in case of off-target integration). Since cloning vectors as such are not part of the RefSeq database, Arriba usually reports integration of a virus with similar sequence to the cloning vector. For example, cloning vectors derived from adeno-associated viruses are frequently reported as the woodchuck hepatitis virus.

Targeted sequencing

Arriba can be used to analyze RNA sequencing data from a panel of genes. However, Arriba was primarily developed for untargeted/unbiased sequencing. Focal, very deep sequencing depths as seen with targeted sequencing do not match Arriba's statistical model of the expected coverage distribution. Estimation of the background noise does not work as well here, which means Arriba will likely produce more false positives. Low-confidence fusions should probably be ignored altogether. In addition, it may be worthwhile to remove fusions whose expression is lower than 1% of the coverage, i.e., to keep only fusions where the sum of split_reads1, split_reads2, and discordant_mates is at least 1% of the sum of coverage1 and coverage2.
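The 1% rule of thumb can be expressed as a small post-filter. The following Python sketch is not part of Arriba; the function name is hypothetical, and it assumes each fusion is available as a dictionary keyed by the column names of the output file, with values that can be converted to integers.

```python
def passes_relative_expression_filter(fusion, min_fraction=0.01):
    """Keep a fusion only if its supporting reads reach min_fraction of the coverage."""
    supporting = (
        int(fusion["split_reads1"])
        + int(fusion["split_reads2"])
        + int(fusion["discordant_mates"])
    )
    coverage = int(fusion["coverage1"]) + int(fusion["coverage2"])
    return supporting >= min_fraction * coverage


# Made-up rows for demonstration only.
weak = {"split_reads1": 3, "split_reads2": 2, "discordant_mates": 1,
        "coverage1": 400, "coverage2": 500}   # 6 supporting reads vs. 900 coverage
strong = {"split_reads1": 5, "split_reads2": 5, "discordant_mates": 10,
          "coverage1": 600, "coverage2": 400}  # 20 supporting reads vs. 1000 coverage

print(passes_relative_expression_filter(weak))
print(passes_relative_expression_filter(strong))
```

Because supporting reads and coverage are counted differently (duplicates and subsampling), the fraction is only approximate and the threshold should be treated as a starting point.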

Moreover, one should be aware that the panel only covers cancer-relevant genes. So every reported fusion prediction will affect a potentially interesting gene. Common sense should be applied whether the fusion actually makes sense in the given cancer type. For example, some panels include the fusion SLC45A3-ELK4. This fusion is primarily relevant to prostate cancer. It would be unusual to observe it in the context of other cancer types. Yet it can be detected in many samples, because it is a read-through fusion that is also seen in benign tissue and because targeted sequencing over-amplifies even slightly expressed transcripts. Generally speaking, the false positives induced by targeted sequencing mostly affect neighboring genes, because the close proximity of genes gives rise to random cis-/trans-spliced transcripts, which are amplified by the panel to higher levels than one would expect with untargeted RNA-Seq. So fusion predictions involving genes in proximity should be treated with caution. More advice on identifying potential false positives and evaluating the relevance of predicted fusions can be found in the chapter Interpretation of results.

Lastly, it is recommended to use the parameter -u to enable external duplicate marking instead of using Arriba's internal duplicate marking method. For targeted sequencing, it is common practice to tag fragments with unique molecular identifiers (UMIs) prior to PCR amplification and sequencing. Arriba is oblivious to UMIs. It marks duplicates solely based on the fact that reads have identical mapping coordinates. So it may discard reads with distinct UMIs as PCR duplicates. It is better to perform duplicate marking externally based on the UMIs prior to running Arriba. The parameter -u instructs Arriba to not perform its own detection of duplicates and instead rely on duplicates being marked as such in the BAM file using the flag BAM_FDUP.

Supporting read count vs. coverage

The number of reads (or fragments) supporting a fusion is given in the columns split_reads1/2 and discordant_mates. These columns only report reads which passed all filters and can be thought of as high-quality supporting reads. Reads which failed one or more filters are reported in the column filters. In contrast, the columns coverage1/2 report all reads covering the fusion breakpoints. No filters are applied to coverage calculation, such that these numbers are not afflicted with the negative bias of the supporting reads columns. Most notably, the coverage calculation includes duplicates, whereas the supporting reads lack duplicates. Moreover, Arriba by default ignores supporting reads in excess of 300 for performance reasons (see also parameter -U). Therefore, the coverage values and supporting read counts are only roughly comparable - especially when a high number of duplicates is expected, for example with targeted sequencing libraries or highly expressed genes. Nevertheless, the implications for fusion calling are negligible, because few filters make use of coverage information. But users who desire consistent counting of supporting reads and coverage should remove (not just mark) all duplicates from the BAM file prior to running Arriba. This is currently the only way to obtain comparable counts.

Supported organisms

Arriba officially supports only human (hg19/GRCh37/hs37d5 or hg38/GRCh38) and mouse (mm10/GRCm38 or mm39/GRCm39). Other organisms or genome assemblies can be used in principle, but the results will be less accurate and the annotation incomplete. This is because important reference files are not available, including:

  • protein domain annotation, which is required to populate the column retained_protein_domains

  • a list of known fusions, which (marginally) improves sensitivity for known oncogenic driver fusions

  • the blacklist, which is essential for removing benign transcripts and recurrent artifacts

These reference files are optional. Arriba can be run without them simply by omitting the respective command-line arguments (-p for the protein domain annotation, -k for the known fusions, and -b for the blacklist). When the blacklist is omitted, the blacklist filter must additionally be disabled using the parameter -f blacklist.

In order to improve the fusion calls from unsupported organisms, users can build their own blacklist. To this end, training samples are required. The more samples are used for training, the better. For example, the blacklists for the officially supported organisms were trained on several hundred samples. For a robust blacklist, it is advisable to use a wide range of tissue types and sequencing/library preparation protocols so that the whole spectrum of benign transcripts and recurrent artifacts is captured by the blacklist. Ideally, one should choose training samples which are expected to harbor no somatic fusions, i.e., normal samples. Malignant samples can theoretically be used, too, but great care must be taken not to accidentally blacklist recurrent driver alterations.

A blacklist can be built by simply running Arriba on the set of training samples. The breakpoint pairs to be blacklisted can then be extracted from the columns breakpoint1/2 from both the main output file (as specified by the parameter -o) and the discarded fusions file (as specified by the parameter -O). The extracted breakpoint pairs must be stored in a tab-separated file with two columns - one for each breakpoint. Depending on the type of training samples used (normal vs. malignant), the recurrence threshold should be adjusted accordingly. If normal samples were used for training, any breakpoint pair which is found in more than one sample can be blacklisted as a recurrent artifact; for malignant training samples, the threshold should be much higher - at least as high as the most prevalent oncogenic driver fusion in the given disease. After the recurrent breakpoint pairs have been added to the blacklist, the list can optionally be fine-tuned further by adding special keywords. For example, when a certain gene is involved in a lot of artifacts even after the newly built blacklist has been applied, the gene may be blacklisted completely by putting the gene name in the first column and the keyword any in the second column. All valid keywords are described in the section about the blacklist.
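The recurrence counting described above could look roughly like the following Python sketch. It is not shipped with Arriba: all function names are hypothetical, the breakpoints in the example are made up, and the file reader assumes that the header line of each output file names the columns breakpoint1 and breakpoint2. Pairs are compared exactly as written, so whether swapped breakpoints should count as the same pair is left open here.

```python
import csv
from collections import Counter


def read_breakpoint_pairs(path):
    """Read (breakpoint1, breakpoint2) pairs from a tab-separated Arriba output file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [(row["breakpoint1"], row["breakpoint2"]) for row in reader]


def recurrent_breakpoint_pairs(samples, min_samples=2):
    """Return pairs seen in at least min_samples samples (each sample counted once)."""
    counts = Counter()
    for pairs in samples:
        counts.update(set(pairs))  # ignore repeats within a single sample
    return sorted(pair for pair, n in counts.items() if n >= min_samples)


def format_blacklist(pairs):
    """Render pairs as tab-separated text with one breakpoint per column."""
    return "".join(f"{bp1}\t{bp2}\n" for bp1, bp2 in pairs)


# Made-up breakpoints from three hypothetical normal samples; in practice, each
# list would combine the pairs from the -o and -O files of one sample.
sample_a = [("1:1000", "1:5000"), ("2:300", "7:900")]
sample_b = [("1:1000", "1:5000"), ("1:1000", "1:5000")]
sample_c = [("3:42", "3:4200")]

recurrent = recurrent_breakpoint_pairs([sample_a, sample_b, sample_c])
print(format_blacklist(recurrent), end="")
```

For malignant training samples, min_samples would have to be raised well above the default of 2, in line with the prevalence of the most common driver fusion in the disease.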

Supported aligners

In principle, Arriba is compatible with any RNA-Seq aligner which reports split reads and discordant mates in a format that is compliant with the SAM format specification. That is, paired-end discordant mates must be marked as such by means of having the BAM_FPROPER_PAIR (0x2) flag unset, and split reads must be represented as supplementary alignments with the BAM_FSUPPLEMENTARY (0x800) flag set for the supplementary and a SA tag for the anchor read. However, Arriba currently has the limitation that it can only utilize supplementary alignments if there is exactly one supplementary alignment per read. Reads which have multiple supplementary alignments are ignored. Multi-mapping chimeric reads are recognized by Arriba provided that all SAM records pertaining to the same alignment have a HI tag and the tag has the same value. In other words, when a read maps to multiple loci, all SAM records pertaining to the first alignment must have the tag HI:i:1, and all SAM records pertaining to the second alignment must have the tag HI:i:2, and so on.
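To check whether the output of an aligner follows these conventions, the relevant fields of a SAM record can be inspected as in the Python sketch below. This is an illustration only and not how Arriba parses alignments: the function name and the returned keys are hypothetical, and the check is deliberately simplified (for example, it does not look at whether the read or its mate is mapped). The flag values are those defined by the SAM specification.

```python
BAM_FPAIRED = 0x1
BAM_FPROPER_PAIR = 0x2
BAM_FSUPPLEMENTARY = 0x800


def describe_record(flag, tags):
    """Summarize the SAM fields relevant to chimeric reads.

    flag: integer FLAG field; tags: optional fields as strings, e.g. "HI:i:1".
    """
    return {
        # paired-end read without the proper-pair flag
        "not_properly_paired": bool(flag & BAM_FPAIRED) and not flag & BAM_FPROPER_PAIR,
        # supplementary alignment of a split read
        "supplementary": bool(flag & BAM_FSUPPLEMENTARY),
        # SA tag, expected on the anchor read of a split read
        "has_sa_tag": any(tag.startswith("SA:Z:") for tag in tags),
        # HI tag value, shared by all records of the same multi-mapping alignment
        "hit_index": next((int(tag[5:]) for tag in tags if tag.startswith("HI:i:")), None),
    }


# Made-up records for demonstration only.
supplementary = describe_record(2049, ["HI:i:2"])                   # 0x801
anchor = describe_record(65, ["SA:Z:chr2,100,+,50M50S,255,0;", "HI:i:2"])  # 0x41
proper_pair = describe_record(99, [])                               # 0x63

print(supplementary)
print(anchor)
print(proper_pair)
```

Records of a multi-mapping chimeric read can then be grouped by read name and hit_index to verify that the anchor and its supplementary alignment carry the same HI value.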

Alignment tools that have been tested successfully with Arriba are the STAR aligner and Illumina's Dragen aligner. (Note: The open-source implementation of Dragen, DRAGMAP, is not compatible with Arriba, since it is not suitable for RNA-Seq data at the time of this writing.) STAR is preferred over Dragen, because it is better at aligning split reads and because multi-mapping reads are stored in a way that Arriba can handle. Users who want to use an incompatible aligner in conjunction with Arriba can run the script run_arriba_on_prealigned_bam.sh, which takes a BAM file (aligned by any aligner) as input and uses STAR to realign only those reads which are relevant to fusion detection, namely, clipped and unmapped reads. The fusion calls from this workflow should be close to those of the recommended workflow based entirely on STAR, while avoiding the need to realign the entire BAM file just for the sake of fusion detection, thus saving CPU time.