Aligning Sequence Reads, Clone Sequences and Assembly Contigs With BWA-MEM

Note: minimap2 has replaced BWA-MEM for PacBio and Nanopore read alignment. It retains all major BWA-MEM features, but is ~50 times as fast, more versatile, more accurate and produces better base-level alignment. A beta version of BWA-MEM2 has been released for short-read mapping. BWA-MEM2 is about twice as fast as BWA-MEM and outputs near-identical alignments.

Getting started

    git clone https://github.com/lh3/bwa.git
    cd bwa; make
    ./bwa index ref.fa
    ./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz
    ./bwa mem ref.fa read1.fq read2.fq | gzip -3 > aln-pe.sam.gz

Introduction

BWA is a software package for mapping DNA sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina sequence reads up to 100bp, while the other two are for longer sequences ranging from 70bp to a few megabases. BWA-MEM and BWA-SW share similar features such as the support of long reads and chimeric alignment, but BWA-MEM, which is the latest, is generally recommended as it is faster and more accurate. BWA-MEM also has better performance than BWA-backtrack for 70-100bp Illumina reads.

For all the algorithms, BWA first needs to construct the FM-index for the reference genome (the index command). Alignment algorithms are invoked with different sub-commands: aln/samse/sampe for BWA-backtrack, bwasw for BWA-SW and mem for the BWA-MEM algorithm.

Availability

BWA is released under GPLv3. The latest source code is freely available at GitHub. Released packages can be downloaded at SourceForge. After you acquire the source code, simply use make to compile and copy the single executable bwa to the destination you want. The only dependency required to build BWA is zlib.

Since 0.7.11, a precompiled binary for x86_64-linux is available in bwakit. In addition to BWA, this self-consistent package also comes with bwa-associated and third-party tools for proper BAM-to-FASTQ conversion, mapping to ALT contigs, adapter trimming, duplicate marking, HLA typing and associated data files.

Seeking help

The detailed usage is described in the man page available together with the source code. You can use man ./bwa.1 to view the man page in a terminal. The HTML version of the man page can be found at the BWA website. If you have questions about BWA, you may sign up for the mailing list and then send the questions to bio-bwa-help@sourceforge.net. You may also ask questions in forums such as BioStar and SEQanswers.

Citing BWA

Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754-1760. [PMID: 19451168]. (if you use the BWA-backtrack algorithm)
Li H. and Durbin R. (2010) Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26, 589-595. [PMID: 20080505]. (if you use the BWA-SW algorithm)
Li H. (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997v2 [q-bio.GN]. (if you use the BWA-MEM algorithm or the fastmap command, or want to cite the whole BWA package)

Please note that the last reference is a preprint hosted at arXiv.org. I do not have plans to submit it to a peer-reviewed journal in the near future.

Frequently asked questions (FAQs)

What types of data does BWA work with?
Why does a read appear multiple times in the output SAM?
Does BWA work on reference sequences longer than 4GB in total?
Why can one read in a pair have a high mapping quality but the other has zero?
How can a BWA-backtrack alignment stand out of the end of a chromosome?
Does BWA work with ALT contigs in the GRCh38 release?
Can I just run BWA-MEM against GRCh38+ALT without post-processing?

1. What types of data does BWA work with?

BWA works with a variety of types of DNA sequence data, though the optimal algorithm and setting may vary. The following list gives the recommended settings:

Illumina/454/IonTorrent single-end reads longer than ~70bp or assembly contigs up to a few megabases mapped to a closely related reference genome:

    bwa mem ref.fa reads.fq > aln.sam

Illumina single-end reads shorter than ~70bp:

    bwa aln ref.fa reads.fq > reads.sai; bwa samse ref.fa reads.sai reads.fq > aln-se.sam

Illumina/454/IonTorrent paired-end reads longer than ~70bp:

    bwa mem ref.fa read1.fq read2.fq > aln-pe.sam

Illumina paired-end reads shorter than ~70bp:

    bwa aln ref.fa read1.fq > read1.sai; bwa aln ref.fa read2.fq > read2.sai
    bwa sampe ref.fa read1.sai read2.sai read1.fq read2.fq > aln-pe.sam

PacBio subreads or Oxford Nanopore reads to a reference genome:

    bwa mem -x pacbio ref.fa reads.fq > aln.sam
    bwa mem -x ont2d ref.fa reads.fq > aln.sam

BWA-MEM is recommended for query sequences longer than ~70bp for a variety of error rates (or sequence divergence). Generally, BWA-MEM is more tolerant of errors given longer query sequences, as the chance of missing all seeds is small. As is shown above, with non-default settings, BWA-MEM works with Oxford Nanopore reads with a sequencing error rate over 20%.
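The recommendations in the list above can be condensed into a small lookup. The following is only a sketch: suggest_bwa is a hypothetical helper, not part of BWA, and it treats the approximate ~70bp boundary as a hard cutoff at 70, which is an assumption. It prints the suggested command lines rather than running them, and it covers only the illumina, pacbio and ont cases.

```shell
#!/bin/sh
# Hypothetical helper (not part of BWA): print the command line(s)
# recommended in the list above.
# Usage: suggest_bwa PLATFORM LAYOUT READ_LENGTH
#   PLATFORM: illumina | pacbio | ont
#   LAYOUT:   se | pe (ignored for pacbio and ont)
suggest_bwa() {
    platform=$1; layout=$2; len=$3
    case "$platform" in
        pacbio) echo "bwa mem -x pacbio ref.fa reads.fq > aln.sam"; return 0 ;;
        ont)    echo "bwa mem -x ont2d ref.fa reads.fq > aln.sam"; return 0 ;;
        illumina) ;;
        *) echo "unknown platform: $platform" >&2; return 1 ;;
    esac
    # Assumption: reads of exactly 70bp go to BWA-MEM.
    if [ "$len" -ge 70 ]; then
        if [ "$layout" = pe ]; then
            echo "bwa mem ref.fa read1.fq read2.fq > aln-pe.sam"
        else
            echo "bwa mem ref.fa reads.fq > aln.sam"
        fi
    elif [ "$layout" = pe ]; then
        echo "bwa aln ref.fa read1.fq > read1.sai; bwa aln ref.fa read2.fq > read2.sai"
        echo "bwa sampe ref.fa read1.sai read2.sai read1.fq read2.fq > aln-pe.sam"
    else
        echo "bwa aln ref.fa reads.fq > reads.sai; bwa samse ref.fa reads.sai reads.fq > aln-se.sam"
    fi
}
```

For example, suggest_bwa illumina pe 36 prints the two-line aln/sampe recipe, while suggest_bwa ont se 5000 prints the ont2d preset command.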

2. Why does a read appear multiple times in the output SAM?

BWA-SW and BWA-MEM perform local alignments. If there is a translocation, a gene fusion or a long deletion, a read bridging the break point may have two hits, occupying two lines in the SAM output. With the default setting of BWA-MEM, one and only one line is primary and is soft clipped; other lines are tagged with 0x800 SAM flag (supplementary alignment) and are hard clipped.

3. Does BWA work on reference sequences longer than 4GB in total?

Yes. Since 0.6.x, all BWA algorithms work with a genome with total length over 4GB. However, an individual chromosome should not be longer than 2GB.
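To see which reads are split this way, one option is to test the 0x800 bit of the FLAG column (column 2) in the SAM text. This is a minimal sketch: supplementary_reads is a hypothetical helper name, and the bit test uses integer division so that it does not depend on awk extensions.

```shell
#!/bin/sh
# Hypothetical helper: read SAM text on stdin and print the names of reads
# that have at least one supplementary (0x800 = 2048) record.
supplementary_reads() {
    # Skip header lines (they start with "@"); test bit 0x800 of the FLAG.
    awk -F '\t' '!/^@/ && int($2 / 2048) % 2 == 1 { print $1 }' | sort -u
}
```

A typical use would be bwa mem ref.fa reads.fq | supplementary_reads, or zcat aln-se.sam.gz | supplementary_reads for an existing compressed file.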
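Before indexing a very large reference, it may be worth checking the length of each sequence. The sketch below is a hypothetical helper, long_sequences, that lists FASTA records longer than a limit. The default limit of 2147483648 (2^31) bases is an assumed reading of "2GB" and is not taken from the BWA documentation, so pass an explicit limit if in doubt.

```shell
#!/bin/sh
# Hypothetical helper: print name and length (tab-separated) of every
# sequence in a FASTA file that is longer than LIMIT bases.
# Usage: long_sequences ref.fa [LIMIT]
long_sequences() {
    fasta=$1
    limit=${2:-2147483648}   # assumption: "2GB" read as 2^31 bases
    awk -v limit="$limit" '
        function report() {
            if (name != "" && len > limit + 0) printf "%s\t%.0f\n", name, len
        }
        /^>/ { report(); name = substr($1, 2); len = 0; next }
             { len += length($0) }
        END  { report() }
    ' "$fasta"
}
```

An empty output means no sequence exceeds the limit.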

4. Why can one read in a pair have a high mapping quality but the other has zero?

This is correct. Mapping quality is assigned to an individual read, not to a read pair. It is possible that one read can be mapped unambiguously, but its mate falls in a tandem repeat and thus its accurate position cannot be determined.

5. How can a BWA-backtrack alignment stand out of the end of a chromosome?

Internally BWA concatenates all reference sequences into one long sequence. A read may be mapped to the junction of two adjacent reference sequences. In this case, BWA-backtrack will flag the read as unmapped (0x4), but you will see position, CIGAR and all the tags. A similar issue may occur with BWA-SW alignments as well. BWA-MEM does not have this problem.
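Such pairs can be listed directly from the SAM text by comparing the MAPQ column (column 5) of the two mates. The following is a sketch only: mixed_mapq_pairs is a hypothetical helper, and the default threshold of 30 for "high" mapping quality is an arbitrary assumption. Unmapped, secondary and supplementary records are skipped so that only primary alignments of mapped reads are compared.

```shell
#!/bin/sh
# Hypothetical helper: read paired-end SAM text on stdin and print
# "name mapq_read1 mapq_read2" for pairs where one mate has MAPQ 0
# and the other has MAPQ >= MIN (default 30, an arbitrary choice).
mixed_mapq_pairs() {
    min=${1:-30}
    awk -F '\t' -v min="$min" '
        /^@/ { next }                               # header line
        int($2 / 4) % 2 == 1    { next }            # unmapped (0x4)
        int($2 / 256) % 2 == 1  { next }            # secondary (0x100)
        int($2 / 2048) % 2 == 1 { next }            # supplementary (0x800)
        int($2 / 64) % 2 == 1   { q1[$1] = $5 + 0 } # first in pair (0x40)
        int($2 / 128) % 2 == 1  { q2[$1] = $5 + 0 } # second in pair (0x80)
        END {
            for (r in q1) if (r in q2) {
                if ((q1[r] == 0 && q2[r] >= min + 0) ||
                    (q2[r] == 0 && q1[r] >= min + 0))
                    print r, q1[r], q2[r]
            }
        }' | sort
}
```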
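Records of this kind can be picked out by looking for the unmapped flag together with a real position and CIGAR. This is a minimal sketch with a hypothetical helper name, unmapped_with_coords. Requiring a CIGAR other than "*" is meant to exclude ordinary unmapped reads that merely inherit the coordinates of a mapped mate.

```shell
#!/bin/sh
# Hypothetical helper: read SAM text on stdin and print name, reference,
# position and CIGAR of records flagged unmapped (0x4) that nevertheless
# carry a position and a CIGAR string.
unmapped_with_coords() {
    awk -F '\t' '
        !/^@/ && int($2 / 4) % 2 == 1 && $4 + 0 > 0 && $6 != "*" {
            print $1, $3, $4, $6
        }'
}
```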

6. Does BWA work with ALT contigs in the GRCh38 release?

Yes, since 0.7.11, BWA-MEM officially supports mapping to GRCh38+ALT. BWA-backtrack and BWA-SW don't properly support ALT mapping as of now. Please see README-alt.md for details. Briefly, it is recommended to use bwakit, the binary release of BWA, for generating the reference genome and for mapping.

7. Can I just run BWA-MEM against GRCh38+ALT without post-processing?

If you are not interested in hits to ALT contigs, it is okay to run BWA-MEM without post-processing. The alignments produced this way are very close to alignments against GRCh38 without ALT contigs. However, applying post-processing helps to reduce false mappings caused by reads from the diverged part of ALT contigs and also enables HLA typing. It is recommended to run the post-processing script.