bio align: perform alignments
Note: alignments in
bio
are primarily designed for exploratory use, for aligning relatively short (up to ~30Kb long sequences), visually investigating the alignments, interacting with the sequences before and after alignment.Use software that relies on heuristics when investigating large datasets. Specialized software will operate (many) orders of magnitude faster. Depending on your needs you may want to use:
blast
,blat
,mummer
,minimap2
,lastz
,lastal
,exonerate
,vsearch
,diamond
.
Install bio
with:
pip install bio --upgrade
The full documentation for bio
is maintained at https://www.bioinfo.help/.
Quick start
Input may be given from command line:
bio align GATTACA GATCA
the first sequence is the target, all following sequences are aligned considered queries. The command prints:
# seq1 (7) vs seq2 (5)
# pident=57.1% len=7 ident=4 mis=1 del=2 ins=0
# semiglobal: score=2.0 match=1 mismatch=2 gapopen=11 gapextend=1
GATTACA
|||.|--
GATCA--
Alignment as a table
When generating alignment providing output in different formats is important. Here is the alignment formatted as a table:
bio align GATTACA GATCA --table
prints:
target query pident len match mism ins del
seq1 seq2 57.1 7 4 1 0 2
Alignment as VCF
bio align GATTACA GATCA --vcf
prints:
##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=TYPE,Number=1,Type=String,Description="Type of the variant">
##contig=<ID=seq1,length=7,assembly=seq1>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT seq2
seq1 4 4_SNP_T/C T C . PASS TYPE=SNP GT 1
seq1 5 5_DEL_2 ACA A . PASS TYPE=DEL GT 1
Alignment as differences
The output may be formatted in as differences (mutations) between the reference (second sequence) and the query (first sequence):
bio align GATTACA GATCA --diff
prints:
T4C SNP 4 T C seq1 seq2
ACA5A DEL 5 ACA A seq1 seq2
You can think of the mutations output can be thought of as a simplified VCF.
Alignment types
The default alignment is semi-global (global alignment with no end gap penalies). To perform a global alignment write:
bio align GATTACA GATCA --global
the output is:
# seq1 (7) vs seq2 (5)
# pident=71.4% len=7 ident=5 mis=0 del=2 ins=0
# global: score=-7.0 match=1 mismatch=2 gapopen=11 gapextend=1
GATTACA
|||--||
GAT--CA
Available ptions are : --global
, --local
, --semi-global
Align realistic data
Fetches two genomic files:
bio fetch MN996532 NC_045512 > genomes.gb
Align two genomes, the first is the target, all following sequences are considered queries and are aligned against the target. Here we are looking at what mutations would need to be made to the bat genome to turn it into the coronavirus genome:
cat genomes.gb | bio fasta --genome | bio align | head
the command aligns two 30KB sequences and takes about 15 seconds on my system, it will print:
# MN996532.2 (29855) vs NC_045512.2 (29903)
# pident=96.0% len=29908 ident=28725 mis=1125 del=5 ins=53
# semiglobal: score=139005.0 gap open=11 extend=1 matrix=NUC.4.4
ATTAAAGGTTTATACCTTTCCAGGTAACAAACCAACGAACTCTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAA
||||||||||||||||||.|||||||||||||||||.||||.|||||||||||||||||||||||||||||||||||||||
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAA
The bio align
method takes the first file that it sees as target and aligns all other sequences to it as queries.
Align coding sequences
Align the DNA corresponding to protein S
cat genomes.gb | bio fasta --gene S | bio align | head
prints:
# QHR63300.2 (3810) vs YP_009724390.1 (3822)
# pident=92.9% len=3822 ident=3549 mis=261 del=0 ins=12
# semiglobal: score=3005.0 match=1 mismatch=2 gapopen=11 gapextend=1
ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTTTCTAGTCAGTGTGTTAATCTAACAACTAGAACTCAGTTACCTCCTGCA
||||||||||||||||||||||||||||||||.||||||||||||||||||||.|||||.||||||||.|||||.||||||
ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACCAGAACTCAATTACCCCCTGCA
Align protein sequences
Align the protein sequences that prints:
cat genomes.gb | bio fasta --gene S --protein | head
that prints:
S32F SNP 32 S F QHR63300.2 YP_009724390.1
L50S SNP 50 L S QHR63300.2 YP_009724390.1
I76T SNP 76 I T QHR63300.2 YP_009724390.1
P218Q SNP 218 P Q QHR63300.2 YP_009724390.1
D324E SNP 324 D E QHR63300.2 YP_009724390.1
T346R SNP 346 T R QHR63300.2 YP_009724390.1
T372A SNP 372 T A QHR63300.2 YP_009724390.1
T403R SNP 403 T R QHR63300.2 YP_009724390.1
Alignment showing mutations
cat genomes.gb | bio fasta --gene S --protein | bio align --diff | tail -5
prints the variations:
N519H SNP 519 N H QHR63300.2 YP_009724390.1
A604T SNP 604 A T QHR63300.2 YP_009724390.1
S680SPRRA INS 680 S SPRRA QHR63300.2 YP_009724390.1
S1121N SNP 1121 S N QHR63300.2 YP_009724390.1
I1224V SNP 1224 I V QHR63300.2 YP_009724390.1
Alignment with tabular output
You can produce a column based table output
cat genomes.gb | bio fasta --gene S --protein | bio align --table
prints:
target query pident len match mism ins del
QHR63300.2 YP_009724390.1 97.4 1273 1240 29 4 0
Different scoring matrices
cat genomes.gb | bio fasta --gene S --translate | bio align --matrix PAM30 | head
prints:
# QHR63300.2 (1270) vs YP_009724390.1 (1274)
# pident=97.4% len=1274 ident=1241 mis=29 del=0 ins=4
# semiglobal: score=9339.0 matrix=PAM30 gapopen=11 gapextend=1
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSSTRGVYYPDKVFRSSVLHLTQDLFLPFFSNVTWFHAIHVSGTNGIKRFDN
|||||||||||||||||||||||||||||||.|||||||||||||||||.|||||||||||||||||||||||||.|||||
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDN
Scoring matrices
The scoring matrix may be a builtin name or a file with a scoring matrix format. See the scoring with:
bio align --matrix PAM30
prints:
#
# This matrix was produced by "pam" Version 1.0.6 [28-Jul-93]
#
# PAM 30 substitution matrix, scale = ln(2)/2 = 0.346574
#
# Expected score = -5.06, Entropy = 2.57 bits
#
# Lowest score = -17, Highest score = 13
#
A R N D C Q E G H I L K M F P S T W Y V B Z X *
A 6 -7 -4 -3 -6 -4 -2 -2 -7 -5 -6 -7 -5 -8 -2 0 -1 -13 -8 -2 -3 -3 -3 -17
R -7 8 -6 -10 -8 -2 -9 -9 -2 -5 -8 0 -4 -9 -4 -3 -6 -2 -10 -8 -7 -4 -6 -17
N -4 -6 8 2 -11 -3 -2 -3 0 -5 -7 -1 -9 -9 -6 0 -2 -8 -4 -8 6 -3 -3 -17
D -3 -10 2 8 -14 -2 2 -3 -4 -7 -12 -4 -11 -15 -8 -4 -5 -15 -11 -8 6 1 -5 -17
...
Exercises
Align genomic DNA to CDNA
bio fetch ENST00000288602 | head > genomic.fa
bio fetch ENST00000288602 --type cdna | head > cdna.fa
Is this a good alignment?
bio align genomic.fa cdna.fa
Try local alignment
bio align genomic.fa cdna.fa --local