Arriba

Usage

arriba [-c Chimeric.out.sam] -x Aligned.out.sam \
       -g annotation.gtf -a assembly.fa \
       [-b blacklists.tsv] [-k known_fusions.tsv] [-d structural_variants_from_WGS.tsv] \
       [-t tags.tsv] [-p protein_domains.gff3] \
       -o fusions.tsv [-O fusions.discarded.tsv] \
       [OPTIONS]

Options

-c FILE : File in SAM/BAM/CRAM format with chimeric alignments as generated by STAR (Chimeric.out.sam). This parameter is only required if STAR was run with the parameter --chimOutType SeparateSAMold. When STAR was run with the parameter --chimOutType WithinBAM, it suffices to pass the parameter -x to Arriba and -c can be omitted.

-x FILE : File in SAM/BAM/CRAM format with main alignments as generated by STAR (Aligned.out.sam). Arriba extracts candidate reads from this file.
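The relationship between -c and -x can be illustrated with a small helper that assembles the argument list. This is a minimal sketch, not part of Arriba: the function name build_arriba_command and its parameters are hypothetical, and only flags described in this section are used. The command is printed rather than executed.

```python
import shlex

def build_arriba_command(main_alignments, annotation, assembly, output,
                         chimeric_alignments=None, discarded=None):
    """Assemble an Arriba argument list (hypothetical helper, for illustration)."""
    argv = ["arriba"]
    # -c is only needed when STAR wrote chimeric alignments to a separate file
    # (--chimOutType SeparateSAMold). With WithinBAM, -x alone suffices.
    if chimeric_alignments is not None:
        argv += ["-c", chimeric_alignments]
    argv += ["-x", main_alignments, "-g", annotation, "-a", assembly, "-o", output]
    if discarded is not None:
        argv += ["-O", discarded]
    return argv

within_bam = build_arriba_command("Aligned.out.bam", "annotation.gtf",
                                  "assembly.fa", "fusions.tsv")
separate = build_arriba_command("Aligned.out.sam", "annotation.gtf",
                                "assembly.fa", "fusions.tsv",
                                chimeric_alignments="Chimeric.out.sam")
print(shlex.join(within_bam))
print(shlex.join(separate))
```

The resulting list could be passed to subprocess.run once the input files exist.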

-g FILE : GTF file with gene annotation. The file may be gzip-compressed.

-G GTF_FEATURES : Comma-/space-separated list of names of GTF features. The names of features in GTF files are not standardized, and different publishers use different names for the same features. For example, GENCODE uses gene_type for the gene type feature, whereas ENSEMBL uses gene_biotype. To allow Arriba to parse GTF files from various publishers, the names of GTF features are configurable. Alternative names for one and the same feature can be specified by using the pipe symbol (|) as a separator. By default, Arriba uses a set of names which is suitable for RefSeq, GENCODE, and ENSEMBL. Default: gene_name=gene_name|gene_id gene_id=gene_id transcript_id=transcript_id feature_exon=exon feature_CDS=CDS
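To make the syntax of the GTF_FEATURES value concrete, the following sketch splits such a string into a mapping from each configurable feature to its list of alternative names. The function parse_gtf_features is a hypothetical illustration of the format described above, not Arriba's own parser.

```python
import re

DEFAULT_GTF_FEATURES = ("gene_name=gene_name|gene_id gene_id=gene_id "
                        "transcript_id=transcript_id feature_exon=exon feature_CDS=CDS")

def parse_gtf_features(spec):
    """Map each feature key to its alternative names (illustrative only)."""
    features = {}
    # Items may be separated by commas and/or whitespace.
    for item in re.split(r"[,\s]+", spec.strip()):
        if not item:
            continue
        key, _, names = item.partition("=")
        # Alternative names for the same feature are separated by a pipe symbol.
        features[key] = names.split("|")
    return features

for key, names in parse_gtf_features(DEFAULT_GTF_FEATURES).items():
    print(key, names)
```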

-a FILE : FastA file with genome sequence (assembly). The file may be gzip-compressed. An index with the file extension .fai must exist only if CRAM data is processed.

-b FILE : File containing blacklisted ranges. Refer to section Blacklist for a description of the expected file format. The file may be gzip-compressed.

-k FILE : File containing known/recurrent fusions. Some cancer entities are often characterized by fusions between the same pair of genes. In order to boost sensitivity, a list of known fusions can be supplied using this parameter. Refer to section Known fusions for a description of the expected file format. The file may be gzip-compressed.

-o FILE : Output file with fusions that have passed all filters. Refer to section fusions.tsv for a description of the columns.

-O FILE : Output file with fusions that were discarded due to filtering. The format is the same as for parameter -o.

-t FILE : Tab-separated file containing fusions to annotate with tags in the tags column. The first two columns specify the genes; the third column specifies the tag. See section Tags file for a detailed description of the format.

-p FILE : File in GFF3 format containing coordinates of the protein domains of genes. The detailed format is described in the section Protein domains. The protein domains retained in a fusion are listed in the column retained_protein_domains of Arriba's output file. The file may be gzip-compressed.

-d FILE : Tab-separated file with coordinates of structural variants found using whole-genome sequencing data. These coordinates serve to increase sensitivity towards weakly expressed fusions and to eliminate fusions with low confidence. Refer to section Structural variant calls from WGS for a description of the expected file format. The file may be gzip-compressed.

-D MAX_GENOMIC_BREAKPOINT_DISTANCE : When a file with genomic breakpoints obtained from whole-genome sequencing is supplied via the parameter -d, this parameter determines how far a genomic breakpoint may be away from a transcriptomic breakpoint to still consider it as a related event. For events inside genes, the distance is added to the end of the gene; for intergenic events, the distance threshold is applied as is. Default: 100000

-s STRANDEDNESS : Whether a strand-specific protocol was used for library preparation, and if so, the type of strandedness:

  • auto: auto-detect whether the library is stranded and the type of strandedness

  • yes: the library is stranded and the strand of the read designated as first-in-pair matches the transcribed strand

  • no: the library is not stranded

  • reverse: the library is stranded and the strand of the read designated as first-in-pair is the reverse of the transcribed strand

Even when an unstranded library is processed, Arriba can often infer the strand from splice patterns. In unclear situations, however, stranded data helps resolve ambiguities. Default: auto

-i CONTIGS : Comma-/space-separated list of interesting contigs. Fusions between genes on other contigs are ignored. Contigs can be specified with or without the prefix chr. Asterisks (*) are treated as wild-cards. Default: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y AC_* NC_*

-v CONTIGS : Comma-/space-separated list of viral contigs for reporting of viral integration sites. Contigs can be specified with or without the prefix chr. Asterisks (*) are treated as wild-cards. Default: AC_* NC_*
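The matching rules for the -i and -v contig lists (optional chr prefix, asterisk as wild-card) can be sketched as follows. This is an approximation under stated assumptions: the helper names are hypothetical, and Python's fnmatchcase is used as a stand-in for Arriba's matching. Unlike the description above, fnmatchcase also gives special meaning to ? and [...], which the sketch ignores.

```python
from fnmatch import fnmatchcase

def strip_chr(name):
    """Drop a leading 'chr' so that 'chr1' and '1' compare equal."""
    return name[3:] if name.startswith("chr") else name

def contig_matches(contig, patterns):
    """Return True if the contig matches any pattern (illustrative only)."""
    bare = strip_chr(contig)
    # fnmatchcase matches the whole name, so the pattern '1' does not match '11'.
    return any(fnmatchcase(bare, strip_chr(pattern)) for pattern in patterns)

# Default value of -i as listed above.
DEFAULT_INTERESTING = [str(i) for i in range(1, 23)] + ["X", "Y", "AC_*", "NC_*"]

for contig in ["chr7", "X", "NC_001526.4", "chrM", "GL000220.1"]:
    print(contig, contig_matches(contig, DEFAULT_INTERESTING))
```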

-f FILTERS : Comma-/space-separated list of filters to disable. By default all filters are enabled. Valid values are: top_expressed_viral_contigs, viral_contigs, low_coverage_viral_contigs, uninteresting_contigs, no_genomic_support, short_anchor, select_best, many_spliced, long_gap, merge_adjacent, hairpin, small_insert_size, same_gene, genomic_support, read_through, no_coverage, mismatches, homopolymer, low_entropy, multimappers, inconsistently_clipped, duplicates, homologs, blacklist, mismappers, spliced, relative_support, min_support, known_fusions, end_to_end, non_coding_neighbors, isoforms, intronic, in_vitro, intragenic_exonic, internal_tandem_duplication

-E MAX_E-VALUE : Arriba estimates the number of fusions with a given number of supporting reads which one would expect to see by random chance. If the expected number of fusions (e-value) is higher than this threshold, the fusion is discarded by the filter relative_support. Note: Increasing this threshold can dramatically increase the number of false positives and may increase the runtime of resource-intensive steps. Fractional values are possible. Default: 0.3

-S MIN_SUPPORTING_READS : The filter min_support discards all fusions with fewer than this many supporting reads (split reads and discordant mates combined). Default: 2

-m MAX_MISMAPPERS : When more than this fraction of supporting reads turns out to be mapped incorrectly, the filter mismappers discards the fusion. Default: 0.8

-L MAX_HOMOLOG_IDENTITY : Genes with more than the given fraction of sequence identity are considered homologs and removed by the filter homologs. Default: 0.3

-H HOMOPOLYMER_LENGTH : The filter homopolymer removes breakpoints adjacent to homopolymers of the given length or more. Default: 6
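As a rough illustration of the length threshold, the sketch below checks whether a stretch of sequence contains a homopolymer run of at least the given length. How Arriba selects the sequence adjacent to a breakpoint is not described here, so the input is simply assumed to be that flanking sequence. The function names are hypothetical.

```python
from itertools import groupby

def longest_homopolymer(sequence):
    """Length of the longest run of identical bases (0 for an empty sequence)."""
    return max((len(list(run)) for _, run in groupby(sequence.upper())), default=0)

def has_homopolymer(flanking_sequence, min_length=6):
    """True if the sequence contains a homopolymer of min_length bases or more."""
    return longest_homopolymer(flanking_sequence) >= min_length

print(has_homopolymer("ACGTAAAAAAGT"))  # run of six A
print(has_homopolymer("ACGTAAAAAGT"))   # run of five A
```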

-R READ_THROUGH_DISTANCE : The filter read_through removes read-through fusions where the breakpoints are less than the given distance away from each other. Default: 10000

-A MIN_ANCHOR_LENGTH : Alignment artifacts are often characterized by split reads coming from only one gene and no discordant mates. Moreover, the split reads only align to a short stretch in one of the genes. The filter short_anchor removes these fusions. This parameter sets the threshold in bp for what the filter considers short. Default: 23

-M MANY_SPLICED_EVENTS : The filter many_spliced recovers fusions between genes that have at least this many spliced breakpoints. Default: 4

-K MAX_KMER_CONTENT : The filter low_entropy removes reads with repetitive 3-mers. If the 3-mers make up more than the given fraction of the sequence, then the read is discarded. Default: 0.6
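The exact way Arriba measures 3-mer content is not specified in this section. One plausible reading, used here purely as an assumption, is the fraction of read bases covered by occurrences of the single most repetitive 3-mer. The function max_kmer_content is a hypothetical sketch of that reading.

```python
def max_kmer_content(read, k=3):
    """Fraction of bases covered by the most repetitive k-mer (assumed definition)."""
    read = read.upper()
    if len(read) < k:
        return 0.0
    # Record the start positions of every k-mer in the read.
    positions = {}
    for i in range(len(read) - k + 1):
        positions.setdefault(read[i:i + k], []).append(i)
    # For each k-mer, count the distinct bases covered by its occurrences.
    best = 0
    for starts in positions.values():
        covered = set()
        for start in starts:
            covered.update(range(start, start + k))
        best = max(best, len(covered))
    return best / len(read)

MAX_KMER_CONTENT = 0.6  # default of -K
for read in ["ACACACACAC", "ACGTTGCA"]:
    content = max_kmer_content(read)
    print(read, content, "discard" if content > MAX_KMER_CONTENT else "keep")
```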

-V MAX_MISMATCH_PVALUE : The filter mismatches uses a binomial model to calculate a p-value for observing a given number of mismatches in a read. If the number of mismatches is too high, the read is discarded. Default: 0.01
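A binomial tail probability of the kind described above can be computed as in the following sketch. The per-base error rate of 0.01 and the read length of 100 are assumptions chosen for illustration. The error rate Arriba uses internally is not stated in this section.

```python
from math import comb

def mismatch_p_value(mismatches, read_length, error_rate):
    """P(X >= mismatches) for X ~ Binomial(read_length, error_rate)."""
    return sum(
        comb(read_length, k) * error_rate**k * (1 - error_rate)**(read_length - k)
        for k in range(mismatches, read_length + 1)
    )

MAX_MISMATCH_PVALUE = 0.01  # default of -V
ASSUMED_ERROR_RATE = 0.01   # assumption, not taken from Arriba

for mismatches in range(0, 7):
    p = mismatch_p_value(mismatches, 100, ASSUMED_ERROR_RATE)
    verdict = "discard" if p < MAX_MISMATCH_PVALUE else "keep"
    print(mismatches, f"{p:.4f}", verdict)
```

Under these assumptions, a 100 bp read would be kept with up to four mismatches and discarded with five or more.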

-F FRAGMENT_LENGTH : When paired-end data is given, the fragment length is estimated automatically and this parameter has no effect. But when single-end data is given, the mean fragment length should be specified to effectively filter fusions that arise from hairpin structures. Default: 200

-U MAX_READS : Subsample fusions with more than the given number of supporting reads. This improves performance without compromising sensitivity, as long as the threshold is high. Note that the count of supporting reads is inaccurate beyond the threshold. When the threshold has been hit, Arriba issues a warning of the form "WARNING: some fusions were subsampled, because they have more than 300 supporting reads". Default: 300

-Q QUANTILE : Highly expressed genes are prone to produce artifacts during library preparation. Genes with an expression above the given quantile are eligible for filtering by the filter in_vitro. Default: 0.998

-e EXONIC_FRACTION : The breakpoints of false-positive predictions of intragenic events are often both in exons. True predictions are more likely to have at least one breakpoint in an intron, because introns are larger. If the fraction of exonic sequence between two breakpoints is smaller than the given fraction, the filter intragenic_exonic discards the event. Default: 0.2

-T TOP_N : If a tumor is truly infected with a virus, a substantial number of reads should map to the respective viral contig. Only report viral integration sites of the top N most highly expressed viral contigs. Default: 5

-C COVERAGE_FRACTION : Ignore virus-associated events if the virus is not fully expressed, i.e., less than the given fraction of the viral contig is transcribed. Default: 0.05

-l MAX_ITD_LENGTH : Maximum length of internal tandem duplications (ITDs) in bp. STAR often fails to align ITDs with a length of more than a few bp. However, many known oncogenic ITDs are longer than 20 bp and thus at risk of being overlooked. Arriba can search for reads that potentially arise from ITDs by attempting to align clipped reads as an ITD. This parameter defines the search space and also limits the effects of the internal_tandem_duplication filter. Note: Increasing this value can impair performance, because Arriba needs to perform an alignment for candidate reads and the alignment complexity depends on the maximum search space. Moreover, increasing the value beyond the default can lead to many false positives, because the blacklist was trained with the default value and frequent germline variants with a larger length will not be filtered effectively by the blacklist. Default: 100

-z MIN_ITD_ALLELE_FRACTION : Required fraction of supporting reads to report an internal tandem duplication. Default: 0.07

-Z MIN_ITD_SUPPORTING_READS : Required absolute number of supporting reads to report an internal tandem duplication. Default: 10
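The two ITD thresholds (-z and -Z) can be read as a pair of conditions that must both hold. The sketch below assumes that the allele fraction is the number of supporting reads divided by the coverage at the ITD site. That definition and the function name report_itd are assumptions for illustration.

```python
def report_itd(supporting_reads, coverage,
               min_allele_fraction=0.07, min_supporting_reads=10):
    """True if an ITD passes both the absolute and the relative threshold."""
    if coverage <= 0:
        return False
    allele_fraction = supporting_reads / coverage  # assumed definition
    return (supporting_reads >= min_supporting_reads
            and allele_fraction >= min_allele_fraction)

print(report_itd(12, 100))  # enough reads, fraction 0.12
print(report_itd(9, 50))    # fraction 0.18, but too few reads
print(report_itd(10, 200))  # enough reads, but fraction 0.05
```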

-u : Arriba performs marking of duplicates internally based on identical mapping coordinates. When this switch is set, internal marking of duplicates is disabled and Arriba assumes that duplicates have been marked by a preceding program. In this case, Arriba only discards alignments flagged with the BAM_FDUP flag. This makes sense when duplicates cannot be reliably identified solely based on their mapping coordinates, e.g. when unique molecular identifiers (UMIs) are used or when independently generated libraries are merged in a single BAM file and the read group must be interrogated to distinguish duplicates from reads that map to the same coordinates by chance. In addition, when this switch is set, duplicate reads are not considered for the calculation of the coverage at fusion breakpoints (columns coverage1 and coverage2 in the output file).

-X : To reduce the runtime and file size, by default, the columns fusion_transcript, peptide_sequence, and read_identifiers are left empty in the file containing discarded fusion candidates (see parameter -O). When this flag is set, this extra information is reported in the discarded fusions file.

-I : By default, the fusion transcript sequence is assembled from the supporting reads. This way, non-template bases, reference mismatches, aberrant splicing, and other deviations from the reference are correctly reflected in the sequence. However, when there are only a few supporting reads, the assembled transcript can be incomplete. Gaps in the sequence are then denoted by three dots (...). When this switch is enabled, such gaps are filled with the sequence from the assembly wherever possible. Moreover, the sequence is expanded to the start and end of the fusion partners, yielding the complete sequence of the fusion gene. If the sequence exceeds the boundaries of the fused transcripts, it is trimmed to the boundaries. Since the fusion peptide sequence builds upon the fusion transcript sequence, enabling this switch implicitly causes Arriba to compute the full peptide sequence from the start codon of the 5' fusion partner to the stop codon of the 3' fusion partner. The main disadvantage of enabling this switch is that under rare circumstances the resulting sequence may lack some deviations from the reference, such as reference mismatches or aberrant splicing. It should be noted that not all gaps can be filled and that the fusion transcript sequence may still be incomplete. A complete construction of the 5' end is marked by a caret sign (^) at the beginning of the fusion transcript sequence; a complete construction of the 3' end is marked by a dollar sign ($) at the end.

-h : Print help and exit.

draw_fusions.R

Usage

draw_fusions.R --fusions=fusions.tsv --annotation=annotation.gtf --output=output.pdf \
               [--alignments=Aligned.sortedByCoord.out.bam] \
               [--cytobands=cytobands.tsv] [--proteinDomains=protein_domains.gff3] \
               [OPTIONS]

Options

--fusions=FILE : File containing fusion predictions from Arriba (fusions.tsv) or STAR-Fusion (star-fusion.fusion_predictions.tsv or star-fusion.fusion_predictions.abridged.coding_effect.tsv).

--annotation=FILE : Gene annotation in GTF format.

--output=FILE : Output file in PDF format containing the visualizations of the gene fusions.

--cytobands=FILE : Coordinates of the Giemsa staining bands. This information is used to draw ideograms. If the argument is omitted, then no ideograms are rendered. The file must have the following columns: contig, start, end, name, giemsa. Recognized values for the Giemsa staining intensity are: gneg, gpos followed by a percentage, acen, stalk. Distributions of Arriba provide Giemsa staining annotation for all supported assemblies in the database directory.

--alignments=FILE : BAM file containing normal alignments from STAR (Aligned.sortedByCoord.out.bam). The file must be sorted by coordinates and indexed. If this argument is given, the script generates coverage plots. This argument requires the Bioconductor package GenomicAlignments.

--proteinDomains=FILE : GFF3 file containing the genomic coordinates of protein domains. Distributions of Arriba offer protein domain annotations for all supported assemblies in the database directory. When this file is given, a plot is generated, which shows the protein domains retained in the fusion transcript. This option requires the Bioconductor package GenomicRanges.

--mergeDomainsOverlappingBy=FRACTION : Occasionally, domains are annotated redundantly. For example, tyrosine kinase domains are frequently annotated as Protein tyrosine kinase and Protein kinase domain. In order to simplify the visualization, such domains can be merged into one, given that they overlap by the given fraction. The description of the larger domain is used. Default: 0.9
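The text does not say relative to which domain the overlap fraction is measured. The sketch below assumes it is measured relative to the shorter of the two domains and uses 1-based inclusive coordinates as in GFF3. Both the assumption and the function names are illustrative, not taken from draw_fusions.R.

```python
def overlap_fraction(domain_a, domain_b):
    """Overlap of two (start, end) intervals relative to the shorter one."""
    (a_start, a_end), (b_start, b_end) = domain_a, domain_b
    # Inclusive coordinates: an interval from 100 to 199 spans 100 bases.
    overlap = min(a_end, b_end) - max(a_start, b_start) + 1
    shorter = min(a_end - a_start, b_end - b_start) + 1
    return max(overlap, 0) / shorter

def should_merge(domain_a, domain_b, min_fraction=0.9):
    """True if the domains overlap by at least the given fraction."""
    return overlap_fraction(domain_a, domain_b) >= min_fraction

print(should_merge((100, 199), (110, 199)))  # shorter domain fully contained
print(should_merge((100, 199), (150, 249)))  # only half overlaps
```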

--optimizeDomainColors=TRUE|FALSE : By default, the script colorizes domains according to the colors specified in the file given in --proteinDomains. This way, coloring of domains is consistent across all proteins. But since there are more distinct domains than colors, this can lead to different domains having the same color. If this option is set to TRUE, the colors are recomputed for each fusion separately. This ensures that the colors have the maximum distance for each individual fusion, but they are no longer consistent across different fusions. Default: FALSE

--sampleName=NAME : The name of the sample is printed as the title on every page.

--minConfidenceForCircosPlot=none|low|medium|high : The fusion of interest is drawn as a solid line in the circos plot. To give an impression of the overall degree of rearrangement, all other fusions are drawn as semi-transparent lines in the background. This option determines which other fusions should be included in the circos plot. none means only the fusion of interest is drawn; all other values specify the minimum confidence a fusion must have to be included. It usually makes no sense to include low-confidence fusions in circos plots, because they are abundant and unreliable, and would clutter up the circos plot. Default: medium
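The selection of background fusions can be pictured as a filter over an ordered confidence scale. In the following sketch, fusions are represented as dictionaries with a confidence key. This representation and the function name background_fusions are assumptions made for the example.

```python
CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}

def background_fusions(other_fusions, min_confidence="medium"):
    """Select the fusions drawn as semi-transparent background lines."""
    if min_confidence == "none":
        return []  # only the fusion of interest is drawn
    threshold = CONFIDENCE_RANK[min_confidence]
    return [fusion for fusion in other_fusions
            if CONFIDENCE_RANK[fusion["confidence"]] >= threshold]

others = [
    {"gene1": "A", "gene2": "B", "confidence": "high"},
    {"gene1": "C", "gene2": "D", "confidence": "medium"},
    {"gene1": "E", "gene2": "F", "confidence": "low"},
]
print(len(background_fusions(others, "medium")))
```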

--pdfWidth=INCHES : Width of the pages of the PDF output file in inches. Default: 11.692

--pdfHeight=INCHES : Height of the pages of the PDF output file in inches. Default: 8.267
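The default page dimensions agree, up to truncation to three decimals, with an A4 page (297 mm by 210 mm) in landscape orientation. The conversion below shows the arithmetic and how other paper sizes could be turned into option values. The helper mm_to_inches is hypothetical.

```python
MM_PER_INCH = 25.4

def mm_to_inches(millimeters):
    """Convert a length in millimeters to inches."""
    return millimeters / MM_PER_INCH

# A4 in landscape orientation: the long edge becomes the width.
width = mm_to_inches(297)
height = mm_to_inches(210)
print(f"--pdfWidth={width:.3f} --pdfHeight={height:.3f}")
```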

--squishIntrons=TRUE|FALSE : Exons usually make up only a small fraction of a gene. They may be hard to see in the plot. Since introns are in most situations of no interest in the context of gene fusions, this switch can be used to shrink the size of introns to a fixed, negligible size. It makes sense to disable this feature if breakpoints in introns are of importance. Default: TRUE

--color1=COLOR : Color of the 5' end of the fusion. The color can be specified in any notation that is a valid color specification in R. Default: #e5a5a5

--color2=COLOR : Color of the 3' end of the fusion. The color can be specified in any notation that is a valid color specification in R. Default: #a7c4e5

--printExonLabels=TRUE|FALSE : By default, the number of an exon is printed inside each exon. The number is taken from the attribute exon_number of the GTF annotation. When a gene has many exons, the boxes may be too narrow to contain the labels, resulting in unreadable exon labels. In these situations, it may be better to turn off exon labels. Default: TRUE

--render3dEffect=TRUE|FALSE : Whether light and shadow should be rendered to give objects a 3D effect. Default: TRUE

--fontSize=SIZE : Decimal value to scale the size of text. Default: 1

--fontFamily=FONT : Font to use for all labels in the plots. To see a list of available fonts, pass an empty value to this parameter. Default: Helvetica

--showIntergenicVicinity=DISTANCE|LEFT_DISTANCE_1,RIGHT_DISTANCE_1,LEFT_DISTANCE_2,RIGHT_DISTANCE_2 : This option only applies to intergenic breakpoints. If it is set to a value greater than 0, then the script draws the genes which are no more than the given distance away from an intergenic breakpoint. The keywords closestGene and closestProteinCodingGene instruct the script to dynamically determine the distance to the next (protein-coding) gene for each breakpoint. Alternatively, instead of specifying a single distance that is applied upstream and downstream of both breakpoints alike, more fine-grained control over the region to be shown is possible by specifying four comma-separated values. The first two values determine the region to the left and to the right of breakpoint 1; the third and fourth values determine the region to the left and to the right of breakpoint 2. Note that this option is incompatible with --squishIntrons. Default: 0
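The accepted forms of the option value (one distance, four comma-separated distances, or a keyword) can be summarized with a small parser. This is a sketch of the value syntax only. The function parse_vicinity is hypothetical, and allowing keywords at individual positions of the four-value form is an assumption.

```python
KEYWORDS = {"closestGene", "closestProteinCodingGene"}

def parse_vicinity(value):
    """Return (left1, right1, left2, right2) for breakpoints 1 and 2."""
    tokens = value.split(",")
    if len(tokens) == 1:
        # A single value applies upstream and downstream of both breakpoints.
        tokens = tokens * 4
    elif len(tokens) != 4:
        raise ValueError("expected one value or four comma-separated values")
    parsed = []
    for token in tokens:
        if token in KEYWORDS:
            parsed.append(token)
        else:
            distance = int(token)  # raises ValueError for non-numeric input
            if distance < 0:
                raise ValueError("distances must not be negative")
            parsed.append(distance)
    return tuple(parsed)

print(parse_vicinity("0"))
print(parse_vicinity("10000,0,0,50000"))
print(parse_vicinity("closestGene"))
```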

--transcriptSelection=coverage|provided|canonical : By default, the transcript isoform with the highest coverage is drawn. Alternatively, the transcript isoform that is provided in the columns transcript_id1 and transcript_id2 in the given fusions file can be drawn. Selecting the isoform with the highest coverage usually produces nicer plots, in the sense that the coverage track is smooth and shows a visible increase in coverage after the fusion breakpoint. However, the isoform with the highest coverage may not be the one that is involved in the fusion. Often, genomic rearrangements lead to non-canonical isoforms being transcribed. For this reason, it can make sense to rely on the transcript selection provided by the columns transcript_id1/2, which reflect the actual isoforms involved in a fusion. As a third option, the transcripts that are annotated as canonical can be drawn. Transcript isoforms tagged with appris_principal, appris_candidate, or CCDS are considered canonical. Default: coverage

--fixedScale=BASES : By default, transcripts are scaled automatically to fill the entire page. This parameter enforces a fixed scale to be applied to all fusions, which is useful when a collection of fusions should be visualized and the sizes of all transcripts should be comparable. A common use case is the visualization of a gene that is found to be fused to multiple partners. By forcing all fusion plots to use the same scale, the fusions can be summarized as a collage in a single plot one above the other with matching scales. Note: The scale must be bigger than the sum of the biggest pair of transcripts to be drawn, or else dynamic scaling is applied, because display errors would occur otherwise. Default: 0 (no fixed scale; the scale is adapted dynamically for each fusion)
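The fallback rule for the fixed scale can be expressed in a few lines. The sketch assumes that each fusion is given as a pair of transcript lengths in bases. The function effective_scale is a hypothetical illustration, with 0 standing for dynamic scaling as in the option itself.

```python
def effective_scale(fixed_scale, transcript_length_pairs):
    """Return the fixed scale if it is usable, else 0 (dynamic scaling)."""
    largest_pair = max((first + second for first, second in transcript_length_pairs),
                       default=0)
    # The fixed scale must be bigger than the biggest pair of transcripts.
    if fixed_scale > largest_pair:
        return fixed_scale
    return 0

pairs = [(3000, 4000), (2500, 5000)]  # the biggest pair sums to 7500 bases
print(effective_scale(10000, pairs))
print(effective_scale(7000, pairs))
```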