arriba

Usage

arriba -c chimeric.bam [-r read_through.bam] -x rna.bam \
       -g annotation.gtf -a assembly.fa [-b blacklists.tsv] [-k known_fusions.tsv] \
       -o fusions.tsv [-O discarded_fusions.tsv] \
       [OPTIONS]
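
As a rough illustration of how the required and optional arguments in the synopsis combine, the following Python sketch assembles the command line as an argument list. The helper `build_arriba_command` and its parameter names are hypothetical and not part of Arriba; only the flags themselves are taken from the synopsis.

```python
import shlex

def build_arriba_command(chimeric_bam, rna_bam, gtf, assembly, output,
                         read_through_bam=None, blacklist=None,
                         known_fusions=None, discarded=None):
    """Assemble an arriba argument list (hypothetical helper, not part of Arriba)."""
    cmd = ["arriba", "-c", chimeric_bam]
    if read_through_bam:
        cmd += ["-r", read_through_bam]      # optional read-through alignments
    cmd += ["-x", rna_bam, "-g", gtf, "-a", assembly]
    if blacklist:
        cmd += ["-b", blacklist]             # optional blacklist
    if known_fusions:
        cmd += ["-k", known_fusions]         # optional known fusions
    cmd += ["-o", output]
    if discarded:
        cmd += ["-O", discarded]             # optional file for discarded fusions
    return cmd

cmd = build_arriba_command("chimeric.bam", "rna.bam", "annotation.gtf",
                           "assembly.fa", "fusions.tsv",
                           discarded="discarded_fusions.tsv")
print(shlex.join(cmd))  # shell-quoted form of the argument list
```

Provided that arriba is installed and on the search path, the list could be handed to `subprocess.run(cmd, check=True)`.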

Options

-c FILE : BAM file with chimeric alignments as generated by STAR. STAR writes these alignments in SAM format (Chimeric.out.sam); the file must be converted to BAM format, but it does not need to be sorted.

-r FILE : BAM file with read-through alignments as generated by extract_read-through_fusions.

-x FILE : BAM file with normal alignments as generated by STAR (Aligned.sortedByCoord.out.bam). The file must be sorted by coordinate and an index with the file extension .bai must be present.

-g FILE : GTF file with gene annotation. The file may be gzip-compressed.

-G GTF_FEATURES : Comma-/space-separated list of names of GTF features. The names of features in GTF files are not standardized; different publishers use different names for the same feature. For example, Gencode uses gene_type for the gene type feature, whereas ENSEMBL uses gene_biotype. So that Arriba can parse GTF files from various publishers, the names of GTF features are configurable. Alternative names for one and the same feature can be specified by using the pipe symbol (|) as a separator. By default, Arriba uses a set of names which is suitable for RefSeq, Gencode, and ENSEMBL. Default: gene_name=gene_name gene_id=gene_id transcript_id=transcript_id gene_status=gene_status|gene_type|gene_biotype status_KNOWN=KNOWN|protein_coding gene_type=gene_type|gene_biotype type_protein_coding=protein_coding feature_exon=exon feature_UTR=UTR feature_gene=gene
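
To make the structure of the -G value concrete, here is a minimal Python sketch that splits such a specification into keys and their alternative names. It only illustrates the syntax described above (items separated by commas or spaces, alternatives separated by the pipe symbol); the function `parse_gtf_features` is hypothetical and does not reproduce Arriba's own parser.

```python
import re

# Default value of -G as listed in the option description.
DEFAULT_GTF_FEATURES = (
    "gene_name=gene_name gene_id=gene_id transcript_id=transcript_id "
    "gene_status=gene_status|gene_type|gene_biotype "
    "status_KNOWN=KNOWN|protein_coding gene_type=gene_type|gene_biotype "
    "type_protein_coding=protein_coding feature_exon=exon "
    "feature_UTR=UTR feature_gene=gene"
)

def parse_gtf_features(spec):
    """Map each feature key to its list of accepted names (illustrative only)."""
    features = {}
    for item in re.split(r"[,\s]+", spec.strip()):
        if not item:
            continue
        key, sep, names = item.partition("=")
        if not sep or not names:
            raise ValueError(f"malformed feature definition: {item!r}")
        features[key] = names.split("|")  # alternatives for the same feature
    return features

features = parse_gtf_features(DEFAULT_GTF_FEATURES)
print(features["gene_type"])
```

Because the space and the pipe symbol are special characters in most shells, the value passed to -G generally has to be quoted on the command line.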

-a FILE : FastA file with genome sequence (assembly). The file may be gzip-compressed.

-b FILE : File containing blacklisted ranges. Refer to section Blacklist for a description of the expected file format. The file may be gzip-compressed.

-k FILE : File containing known/recurrent fusions. Some cancer entities are often characterized by fusions between the same pair of genes. In order to boost sensitivity, a list of known fusions can be supplied using this parameter. Refer to section [Known fusions](input-files.md#known-fusions) for a description of the expected file format. The file may be gzip-compressed.

-o FILE : Output file with fusions that have passed all filters. Refer to section fusions.tsv for a description of the columns.

-O FILE : Output file with fusions that were discarded due to filtering. The format is the same as for parameter -o.

-d FILE : Tab-separated file with coordinates of structural variants found using whole-genome sequencing data. These coordinates serve to increase sensitivity towards weakly expressed fusions and to eliminate fusions with low confidence. Refer to section Structural variant calls from WGS for a description of the expected file format. The file may be gzip-compressed.

-D MAX_GENOMIC_BREAKPOINT_DISTANCE : When a file with genomic breakpoints obtained from whole-genome sequencing is supplied via the parameter -d, this parameter determines how far a genomic breakpoint may be away from a transcriptomic breakpoint to still consider it as a related event. For events inside genes, the distance is added to the end of the gene; for intergenic events, the distance threshold is applied as is. Default: 100000

-s STRANDEDNESS : Whether a strand-specific protocol was used for library preparation, and if so, the type of strandedness. Even when an unstranded library is processed, Arriba can often infer the strand from splice patterns, but in unclear situations, stranded data helps resolve ambiguities. Default: auto

-i CONTIGS : Comma-/space-separated list of interesting contigs. Fusions between genes on other contigs are ignored. Contigs can be specified with or without the prefix chr. Default: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y
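
The following Python sketch illustrates what "with or without the prefix chr" can mean in practice: contig names are compared after stripping a leading chr. This is an assumption about the matching rule made for illustration; the function names are hypothetical and not taken from Arriba.

```python
import re

def normalize_contig(name):
    """Drop a leading 'chr' so that 'chr1' and '1' compare equal (assumed rule)."""
    return name[3:] if name.startswith("chr") else name

def parse_contigs(spec):
    """Split a comma-/space-separated contig list into a set of normalized names."""
    return {normalize_contig(c) for c in re.split(r"[,\s]+", spec.strip()) if c}

def is_interesting(contig, interesting):
    """True if the contig is in the set of interesting contigs."""
    return normalize_contig(contig) in interesting

interesting = parse_contigs("1 2 3 X,Y")
print(is_interesting("chrX", interesting), is_interesting("MT", interesting))
```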

-f FILTERS : Comma-/space-separated list of filters to disable. By default all filters are enabled. Valid values are: uninteresting_contigs, novel, merge_adjacent, pcr_fusions, spliced, select_best, hairpin, small_insert_size, genomic_support, read_through, mismatches, homopolymer, long_gap, many_spliced, isoforms, intronic, end_to_end, known_fusions, inconsistently_clipped, duplicates, blacklist, homologs, intragenic_exonic, relative_support, min_support, same_gene, mismappers, non_expressed, short_anchor, no_genomic_support, low_entropy
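
Since a misspelled filter name is easy to overlook in a long list, a wrapper script might check the value of -f against the valid names before launching Arriba. The sketch below is one way to do that in Python; the function `parse_disabled_filters` is hypothetical, and the set of names is copied from the list above.

```python
import re

# Valid filter names as listed in the description of -f.
VALID_FILTERS = {
    "uninteresting_contigs", "novel", "merge_adjacent", "pcr_fusions",
    "spliced", "select_best", "hairpin", "small_insert_size",
    "genomic_support", "read_through", "mismatches", "homopolymer",
    "long_gap", "many_spliced", "isoforms", "intronic", "end_to_end",
    "known_fusions", "inconsistently_clipped", "duplicates", "blacklist",
    "homologs", "intragenic_exonic", "relative_support", "min_support",
    "same_gene", "mismappers", "non_expressed", "short_anchor",
    "no_genomic_support", "low_entropy",
}

def parse_disabled_filters(spec):
    """Return the set of filters to disable, rejecting unknown names."""
    requested = {f for f in re.split(r"[,\s]+", spec.strip()) if f}
    unknown = requested - VALID_FILTERS
    if unknown:
        raise ValueError("unknown filter(s): " + ", ".join(sorted(unknown)))
    return requested

disabled = parse_disabled_filters("homologs, blacklist")
print(sorted(disabled))
```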

-E MAX_E-VALUE : Arriba estimates the number of fusions with a given number of supporting reads which one would expect to see by random chance. If the expected number of fusions (e-value) is higher than this threshold, the fusion is discarded by the filter relative_support. Note: Increasing this threshold can dramatically increase the number of false positives and may increase the runtime of resource-intensive steps. Fractional values are possible. Default: 0.3

-S MIN_SUPPORTING_READS : The filter min_support discards all fusions with fewer than this many supporting reads (split reads and discordant mates combined). Default: 2

-m MAX_MISMAPPERS : When more than this fraction of supporting reads turns out to be mapped incorrectly, the filter mismappers discards the fusion. Default: 0.8

-L MAX_HOMOLOG_IDENTITY : Genes with more than the given fraction of sequence identity are considered homologs and removed by the filter homologs. Default: 0.3

-H HOMOPOLYMER_LENGTH : The filter homopolymer removes breakpoints adjacent to homopolymers of the given length or more. Default: 6
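
A minimal Python sketch of the homopolymer test follows. It checks whether a stretch of sequence contains a run of identical bases of at least the given length. How much sequence around a breakpoint Arriba actually inspects is not stated here, so the caller supplies the flanking sequence; both function names are hypothetical.

```python
from itertools import groupby

def longest_homopolymer(sequence):
    """Length of the longest run of identical bases (0 for an empty sequence)."""
    return max((len(list(run)) for _, run in groupby(sequence.upper())), default=0)

def near_homopolymer(flanking_sequence, homopolymer_length=6):
    """True if the flanking sequence contains a homopolymer of the given length or more."""
    return longest_homopolymer(flanking_sequence) >= homopolymer_length

print(near_homopolymer("ACGTAAAAAAGT"), near_homopolymer("ACGTAAAAAGT"))
```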

-R READ_THROUGH_DISTANCE : The filter read_through removes read-through fusions where the breakpoints are less than the given distance away from each other. Default: 10000

-A MIN_ANCHOR_LENGTH : Alignment artifacts are often characterized by split reads coming from only one gene and no discordant mates. Moreover, the split reads only align to a short stretch in one of the genes. The filter short_anchor removes these fusions. This parameter sets the threshold in bp for what the filter considers short. Default: 23

-M MANY_SPLICED_EVENTS : The filter many_spliced recovers fusions between genes that have at least this many spliced breakpoints. Default: 4

-K MAX_KMER_CONTENT : The filter low_entropy removes reads with repetitive 3-mers. If the 3-mers make up more than the given fraction of the sequence, then the read is discarded. Default: 0.6
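
The description leaves open how exactly the 3-mer content is measured. One plausible reading, sketched below in Python, takes the largest fraction of the read that is covered by occurrences of any single 3-mer and compares it with the threshold. This interpretation and the function names are assumptions for illustration, not Arriba's implementation.

```python
def max_kmer_content(sequence, k=3):
    """Largest fraction of the read covered by occurrences of any single k-mer."""
    sequence = sequence.upper()
    if len(sequence) < k:
        return 0.0
    # Record the start positions of every k-mer (overlapping occurrences included).
    positions = {}
    for i in range(len(sequence) - k + 1):
        positions.setdefault(sequence[i:i + k], []).append(i)
    best = 0
    for starts in positions.values():
        covered = set()
        for start in starts:
            covered.update(range(start, start + k))
        best = max(best, len(covered))
    return best / len(sequence)

def is_low_entropy(sequence, max_kmer_content_threshold=0.6):
    """True if a single 3-mer makes up more than the given fraction of the read."""
    return max_kmer_content(sequence) > max_kmer_content_threshold

print(is_low_entropy("ACGACGACGACG"), is_low_entropy("ACGTTGCAAGCT"))
```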

-V MAX_MISMATCH_PVALUE : The filter mismatches uses a binomial model to calculate a p-value for observing a given number of mismatches in a read. If the number of mismatches is too high, the read is discarded. Default: 0.01
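
To illustrate the kind of calculation a binomial model implies, the Python sketch below computes the probability of seeing at least a given number of mismatches in a read and compares it with the threshold. The per-base error rate of 0.01 is a made-up value for the example; the rate that Arriba uses is not stated here, and the function names are hypothetical.

```python
from math import comb

def mismatch_pvalue(mismatches, read_length, error_rate=0.01):
    """P(X >= mismatches) for X ~ Binomial(read_length, error_rate)."""
    below = sum(
        comb(read_length, i) * error_rate**i * (1 - error_rate)**(read_length - i)
        for i in range(mismatches)
    )
    return max(0.0, 1.0 - below)  # guard against tiny negative rounding errors

def too_many_mismatches(mismatches, read_length, max_pvalue=0.01, error_rate=0.01):
    """True if the mismatch count is improbably high under the assumed error rate."""
    return mismatch_pvalue(mismatches, read_length, error_rate) < max_pvalue

print(too_many_mismatches(5, 100), too_many_mismatches(1, 100))
```

With these assumed numbers, a single mismatch in a 100 bp read is unremarkable, whereas five mismatches fall below the default threshold of 0.01.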

-F FRAGMENT_LENGTH : When paired-end data is given, the fragment length is estimated automatically and this parameter has no effect. But when single-end data is given, the mean fragment length should be specified to effectively filter fusions that arise from hairpin structures. Default: 200

-U MAX_READS : Subsample fusions with more than the given number of supporting reads. This improves performance without compromising sensitivity, as long as the threshold is high. Note that supporting read counts beyond the threshold are inaccurate. Default: 300

-Q QUANTILE : Highly expressed genes are prone to produce artifacts during library preparation. Genes with an expression above the given quantile are eligible for filtering by the filter pcr_fusions. Default: 0.998

-T : When set, the column fusion_transcript is populated with the sequence of the fused genes as assembled from the supporting reads. Specify the flag twice to also print the fusion transcripts to the file containing discarded fusions (-O). Refer to section fusions.tsv for a description of the format of the column. Default: off

-I : When set, the column read_identifiers is populated with identifiers of the reads which support the fusion. The identifiers are separated by commas. Specify the flag twice to also print the read identifiers to the file containing discarded fusions (-O). Default: off

-h : Print help and exit.

extract_read-through_fusions

Usage

Run on existing BAM file:

extract_read-through_fusions -g annotation.gtf -i rna.bam -o read_through.bam

Run during alignment (see section Execution for a detailed explanation of this use case):

STAR --outStd BAM_Unsorted --outSAMtype BAM Unsorted SortedByCoordinate [...] |
extract_read-through_fusions -g annotation.gtf > read_through.bam

Options

-i FILE : Input BAM file containing the normal alignments produced by STAR (Aligned.sortedByCoord.out.bam). The file does not need to be sorted. Default: STDIN

-o FILE : Output BAM file containing all the reads (fragments) that cross the boundaries of a gene. Default: STDOUT

-g FILE : GTF file with gene annotation. The file may be gzip-compressed.

-G GTF_FEATURES : Comma-/space-separated list of names of GTF features. The names of features in GTF files are not standardized; different publishers use different names for the same feature. For example, Gencode uses gene_type for the gene type feature, whereas ENSEMBL uses gene_biotype. So that Arriba can parse GTF files from various publishers, the names of GTF features are configurable. Alternative names for one and the same feature can be specified by using the pipe symbol (|) as a separator. By default, Arriba is configured to use a set of aliases which is suitable for RefSeq, Gencode, and ENSEMBL. Default: gene_name=gene_name gene_id=gene_id transcript_id=transcript_id gene_status=gene_status|gene_type|gene_biotype status_KNOWN=KNOWN|protein_coding gene_type=gene_type|gene_biotype type_protein_coding=protein_coding feature_exon=exon feature_UTR=UTR feature_gene=gene

-h : Print help and exit.