Arriba can detect intragenic inversions and duplications, but not deletions. This is because deletions within a gene are difficult to distinguish from ordinary splicing in RNA-Seq data. Arriba's statistical model to find significant events is not applicable to the identification of a significant lack of exon coverage. These questions are better answered by indel callers, whole-genome sequencing, or algorithms to identify differential exon expression. For these reasons, Arriba does not report any intragenic deletions.
It is recommended to use annotation from GENCODE or ENSEMBL to run Arriba. RefSeq annotates splice sites less comprehensively. Moreover, RefSeq does not annotate the immunoglobulin/T-cell receptor loci. These shortcomings reduce the sensitivity of fusion detection. Users who want to use RefSeq nonetheless are advised to copy the immunoglobulin/T-cell receptor loci annotation from GENCODE/ENSEMBL as a workaround.
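The workaround of copying the immunoglobulin/T-cell receptor annotation might be sketched as follows. This is only an illustration and not part of Arriba: the function name extract_ig_tcr_records is hypothetical, and the sketch assumes that the loci can be recognized by a gene_type (GENCODE) or gene_biotype (ENSEMBL) attribute starting with IG_ or TR_. It does not harmonize contig names (for example chr14 vs. 14), which would have to match the RefSeq-based annotation.

```python
import re

# Assumption: IG/TCR genes carry a biotype such as IG_V_gene or TR_C_gene
# in the attribute column (gene_type in GENCODE, gene_biotype in ENSEMBL).
_IG_TCR_BIOTYPE = re.compile(r'gene_(?:type|biotype) "((?:IG|TR)_[^"]+)"')


def extract_ig_tcr_records(gtf_lines):
    """Return the GTF records annotated with an IG_* or TR_* gene biotype."""
    records = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a well-formed GTF record
        if _IG_TCR_BIOTYPE.search(fields[8]):
            # normalize line endings so records can be appended to a file
            records.append(line if line.endswith("\n") else line + "\n")
    return records
```

The returned records could then be appended to the RefSeq GTF file before running Arriba, after checking that none of the genes are already present.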
Arriba usually consumes less than 10 GB of RAM. Approximately 1 GB of RAM is consumed per million chimeric read pairs, plus 4 GB of static overhead to load the assembly and gene annotation. Multiple myeloma samples in particular frequently exceed the normal memory requirements due to countless rearrangements in the immunoglobulin loci. To reduce the memory footprint, Arriba can be instructed to subsample reads when an event has a sufficient number of supporting reads. By default, further reads are ignored once an event has reached 300 supporting reads (see parameter -U). When this threshold has been hit, Arriba issues the warning "WARNING: some fusions were subsampled, because they have more than 300 supporting reads".
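The rule of thumb above can be turned into a rough estimate. The following is a back-of-the-envelope sketch based only on the two figures stated here (about 1 GB per million chimeric read pairs plus 4 GB of static overhead); the function name is hypothetical, and the effect of subsampling is not modeled.

```python
STATIC_OVERHEAD_GB = 4.0    # loading the assembly and gene annotation
GB_PER_MILLION_PAIRS = 1.0  # approximate cost of chimeric read pairs


def estimate_memory_gb(chimeric_read_pairs):
    """Rough RAM estimate in GB for a given number of chimeric read pairs."""
    millions = chimeric_read_pairs / 1_000_000
    return STATIC_OVERHEAD_GB + GB_PER_MILLION_PAIRS * millions
```

For example, a sample with 3 million chimeric read pairs would be expected to need roughly 7 GB under these assumptions.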
However, excessive memory consumption can indicate a user error. So before reducing the maximum number of supporting reads, users should carefully check their scripts/data for mistakes. For example, if paired-end FastQ files are mistakenly passed to STAR in the wrong order, STAR will align almost all reads as discordant mates. Similarly, if the reads in paired-end FastQ files are not ordered properly (i.e., collated by name), then most of them will be aligned in a discordant fashion. When Arriba consumes an unusual amount of memory, users should inspect the file Log.final.out of STAR. If the "% of chimeric reads" reported in the log file is high, then scripts and input files should be checked for errors. The "% of chimeric reads" is normally in the range of 1-10%, with the exception of very few cancer types (such as multiple myeloma), where it is often much higher.
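This check can be automated. The sketch below assumes that Log.final.out lists the value on a line of the form "% of chimeric reads | 1.23%" (label, vertical bar, percentage); the function names and the 10% default threshold, taken from the upper end of the range given above, are illustrative choices rather than anything Arriba or STAR provides.

```python
import re

NORMAL_MAX_PERCENT = 10.0  # upper end of the 1-10% range stated above


def chimeric_read_percent(log_text):
    """Extract the '% of chimeric reads' value from the text of Log.final.out."""
    match = re.search(r"% of chimeric reads\s*\|\s*([0-9.]+)%", log_text)
    if match is None:
        raise ValueError("no '% of chimeric reads' entry found")
    return float(match.group(1))


def looks_suspicious(log_text, threshold=NORMAL_MAX_PERCENT):
    """Return True if the chimeric read fraction exceeds the threshold."""
    return chimeric_read_percent(log_text) > threshold
```

A value above the threshold is not proof of a mistake (multiple myeloma samples being the noted exception), but it is a reasonable trigger to recheck the order and collation of the FastQ files.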
In most cases, it is not necessary to trim adapters, because STAR is capable of performing clipped alignments, such that reads with adapters will simply be aligned partially. For this reason, the demo script run_arriba.sh does not perform adapter trimming. However, if many reads contain a substantial fraction of adapter sequence, it can be beneficial to remove adapters for improved sensitivity, because STAR dismisses chimeric alignments when a large fraction of a read cannot be aligned (as controlled by the parameter --outFilterMatchNminOverLread). To this end, either the STAR parameter --clip3pAdapterSeq or a specialized adapter trimming tool can be used.
Detection of virus expression and viral integration sites is limited to the viruses that are built into the STAR index and assembly used to run Arriba. When the script download_references.sh was used to generate the reference files, all ~12,000 RefSeq virus genomes as well as ~4,500 human-infecting virus strains related to the RefSeq viruses are included in the reference files. This includes all common cancer-associated viruses, such as human papillomavirus, Merkel cell polyomavirus, EBV, HIV, Hepatitis B/C virus, HTLV-1, and others. Users who wish to detect viruses beyond this default set need to append the respective virus genome sequence to the assembly and build the STAR index themselves.
Not all viruses integrate into the human genome. Their presence can often be detected nonetheless by checking if reads align to the respective virus genome. Arriba comes with a script quantify_virus_expression.sh, which counts reads mapping to virus genomes and applies a few filters to remove false positives.
Due to sequence similarity between related virus strains, it happens occasionally that STAR maps reads to a related virus strain instead of to the strain that actually infected a given sample. Arriba then sometimes reports integrations of multiple related strains in a single sample. Although it is usually wrong of Arriba to report that a single sample harbors integrations of multiple strains, the positions of the various integration sites can very well be correct, and therefore all of them are reported - even if their alignments come from related but distinct strains. Similarly, when a cell line was genetically engineered using cloning vectors, the vector may sometimes be detected by Arriba as being integrated near the engineered gene (and possibly other places in the genome in case of off-target integration). Since cloning vectors as such are not part of the RefSeq database, Arriba usually reports integration of a virus with similar sequence to the cloning vector. For example, cloning vectors derived from adeno-associated viruses are frequently reported as the woodchuck hepatitis virus.
Arriba can be used to analyze RNA sequencing data from a panel of genes. However, Arriba was primarily developed for untargeted/unbiased sequencing. Focal, very deep sequencing depths as seen with targeted sequencing do not match Arriba's statistical model of the expected coverage distribution. Estimation of the background noise does not work as well here, which means Arriba will likely produce more false positives. Low-confidence fusions should probably be ignored altogether. Possibly, one should additionally remove fusions with a lower expression than 1% of the coverage, i.e., the sum of the supporting read columns (including discordant_mates) should be at least 1% of the sum of the coverage columns.
Moreover, one should be aware that the panel only covers cancer-relevant genes. So every reported fusion prediction will affect a potentially interesting gene. Common sense should be applied whether the fusion actually makes sense in the given cancer type. For example, some panels include the fusion SLC45A3-ELK4. This fusion is primarily relevant to prostate cancer. It would be unusual to observe it in the context of other cancer types. Yet it can be detected in many samples, because it is a read-through fusion that is also seen in benign tissue and because targeted sequencing over-amplifies even slightly expressed transcripts. Generally speaking, the false positives induced by targeted sequencing mostly affect neighboring genes, because the close proximity of genes gives rise to random cis-/trans-spliced transcripts, which are amplified by the panel to higher levels than one would expect with untargeted RNA-Seq. So fusion predictions involving genes in proximity should be treated with caution. More advice on identifying potential false positives and evaluating the relevance of predicted fusions can be found in the chapter Interpretation of results.
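The two filtering suggestions for targeted sequencing (dropping low-confidence predictions and requiring supporting reads of at least 1% of the coverage) could be applied to Arriba's tab-separated output roughly as follows. This is a sketch under assumptions: the column names split_reads1, split_reads2, coverage1, coverage2 and confidence, as well as the confidence value "low", are assumed here and should be checked against the header of the actual output file; the function name is hypothetical.

```python
import csv
import io

# Assumed column names; verify against the header of the fusions file.
SUPPORT_COLUMNS = ("split_reads1", "split_reads2", "discordant_mates")
COVERAGE_COLUMNS = ("coverage1", "coverage2")


def filter_panel_fusions(tsv_text, min_fraction=0.01):
    """Keep fusions that are not low-confidence and whose supporting reads
    amount to at least min_fraction of the summed breakpoint coverage."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        if row["confidence"] == "low":
            continue  # low-confidence predictions are ignored altogether
        supporting = sum(int(row[col]) for col in SUPPORT_COLUMNS)
        coverage = sum(int(row[col]) for col in COVERAGE_COLUMNS)
        if supporting >= min_fraction * coverage:
            kept.append(row)
    return kept
```

The 1% cutoff is exposed as a parameter because the text presents it as a possible additional filter rather than a fixed rule.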
Lastly, it is recommended to use the parameter -u to enable external duplicate marking instead of using Arriba's internal duplicate marking method. For targeted sequencing, it is common practice to tag fragments with unique molecular identifiers (UMIs) prior to PCR amplification and sequencing. Arriba is oblivious to UMIs. It marks duplicates solely based on the fact that reads have identical mapping coordinates. So it may discard reads with distinct UMIs as PCR duplicates. It is better to perform duplicate marking externally based on the UMIs prior to running Arriba. The parameter -u instructs Arriba to not perform its own detection of duplicates and instead rely on duplicates being marked as such in the BAM file using the duplicate flag.
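To illustrate why coordinate-only duplicate marking is problematic with UMIs, the following toy sketch contrasts the two strategies on simplified fragment records. It is not Arriba's implementation or that of any particular deduplication tool: the Fragment type and function name are hypothetical, and real tools additionally account for strand, mate coordinates, and sequencing errors in the UMI. The only external fact used is that the SAM format reserves flag bit 0x400 for duplicates.

```python
from collections import namedtuple

DUPLICATE_FLAG = 0x400  # SAM flag bit for PCR/optical duplicates

# Hypothetical, simplified stand-in for an aligned fragment.
Fragment = namedtuple("Fragment", "name contig start end umi flag")


def mark_duplicates(fragments, use_umi=True):
    """Set the duplicate flag on every fragment after the first one seen
    with the same coordinates (and, if use_umi is True, the same UMI)."""
    seen = set()
    marked = []
    for frag in fragments:
        key = (frag.contig, frag.start, frag.end)
        if use_umi:
            key += (frag.umi,)
        if key in seen:
            frag = frag._replace(flag=frag.flag | DUPLICATE_FLAG)
        else:
            seen.add(key)
        marked.append(frag)
    return marked
```

With three fragments at identical coordinates carrying the UMIs AAA, CCC and AAA, the UMI-aware mode marks only the third fragment, whereas the coordinate-only mode marks the second and third, thereby discarding a distinct molecule.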
Supporting read count vs. coverage
The number of reads (or fragments) supporting a fusion is given in the supporting read columns, such as discordant_mates. These columns only report reads which passed all filters and can be thought of as high-quality supporting reads. Reads which failed one or more filters are reported in the column filters. In contrast, the columns coverage1/2 report all reads covering the fusion breakpoints. No filters are applied to coverage calculation, such that these numbers are not affected by the negative bias of the supporting reads columns. Most notably, the coverage calculation includes duplicates, whereas the supporting reads lack duplicates. Moreover, Arriba by default ignores supporting reads in excess of 300 for performance reasons (see also parameter -U). Therefore, the coverage values and supporting read counts are only roughly comparable - especially when a high number of duplicates is expected, for example with targeted sequencing libraries or highly expressed genes. Nevertheless, the implications for fusion calling are negligible, because few filters make use of coverage information. But users who desire consistent counting of supporting reads and coverage should remove (not just mark) all duplicates from the BAM file prior to running Arriba. This is currently the only way to obtain comparable counts.