Motivation
Zhou et al. presented that an unique 30-amino-acid deletion in Nsp2-coding region is a key feature to classify whether a strain is a highly pathogenic porcine reproductive and respiratory syndrome virus (PRRSV). And the Nsp2, nonstructural protein 2, has been shown to undergo remarkable genetic variation, primarily in its middle region, while exhibiting high conservation in the N-terminal putative protease domain and the C-terminal predicted transmembrane region (Han et al. 2007). This post aims to show how to find a quite large deletion in a specific coding-region with positional tolerance.
Method and Implementation
Pairwise alignment between a sequence of interest and a reference sequence (ORF1a of VR-2332 strain) is an essential step for finding insertions and/or deletions, shortly indels. The two sequences were aligned with BLAST (Altschul et al. 1999), exactly saying bl2seq with blastx program because my query are assumed as DNA sequences from clinical samples. Remind that global alignment like ClustalW (Thompson, J.D. et al. 1994) is not suitable for this case. The reason is that query sequences were quite shorter than a reference sequence. After blasting two sequences, a Perl script using BioPerl parses a report file from BLAST and evaluates whether indels' position and size is tolerable for user's constraints.
A format of BLAST (bl2seq) result consists of several sections including a summary of query, alignment reports for each HSP (high scoring pair), parameters, and statistics. An example part of the result is shown as the below; Each subject begins with a symbol, ">", accession number and description are followed. In next part, summary statistics for each HSP are written, including bit score, E-value, identities, etc. Actual alignment between the query and subject are described in the last part. These format of data are repeated for all HSPs and subjects.
Bio::SearchIO module supports to read and write search results in various file format of biological applications such as BLAST, FASTA, HMMER, Sim4, etc. Basic parsing a file from BLAST was done by Bio::SearchIO functions and I recommend to read a HOWTO document on the BioPerl. A next step is to extract a quite large indels from HSP alignment information.
At the first time I thought too pedantic to implement this problem --- this is why I write this post. An example of too much 'pedantic' idea is like this; calculating % gap for each columns, cut-off with a specific threshold, finding a series of deletions without break using graph, and evaluating each deletions. It's too much complex! I stop my work. After drinking a cup of coffee, I considered it again in mind of "Perl script should be Perlish".
The following code is to get a string containing a alignment of query sequence potentially including indels marked as "-".
my $query_alignment = $hsp->query_string();
Now I can use super strong features regular expression of Perl for find a series of indels; this codes extracts positions of start and end residues for all possible long indels, which are referred to a reference sequence. And also indels is constrained with minimal length of deletion.
my ($ref_start, $ref_end) = $hsp->hit->strand < 0 ?
( $hsp->hit->end,
$hsp->hit->start ) :
( $hsp->hit->start,
$hsp->hit->end );
while( $query_alignment =~ /(\-{$threshold_length,})/g ) {
$indel_start = $-[1] + $ref_start;
$indel_end = $-[1] + $ref_start + length($1) - 1;
$indel_length = length($1);
push @indels, [$indel_start, $indel_end];
}
Further filtering constraints were implemented and passed from arguments, threshold % indentification (-p), expected start position of indel (-s), expected end position of indel (-e), and tolerant positional error(-d). Sample result is shown as the following:
[bckang@gxs ~]$ ./gvs_find_indel.pl nsp2.blsx -s 921 -e 949 -d 15
#Mutation dectector's parameters: p=30 s=921 e=949 d=15
>AAO13191(NA_prototype)
920 948 29
>AAO13191(NA_prototype)
>AAO13191(NA_prototype)
>AAO13191(NA_prototype)
>AAO13191(NA_prototype)
>AAO13191(NA_prototype)
It is true the simplest way is the best. Also it is better way that Perl is supposed to be Perlish and Python looks like Pythonic.
References
Zhou et al. presented that an unique 30-amino-acid deletion in Nsp2-coding region is a key feature to classify whether a strain is a highly pathogenic porcine reproductive and respiratory syndrome virus (PRRSV). And the Nsp2, nonstructural protein 2, has been shown to undergo remarkable genetic variation, primarily in its middle region, while exhibiting high conservation in the N-terminal putative protease domain and the C-terminal predicted transmembrane region (Han et al. 2007). This post aims to show how to find a quite large deletion in a specific coding-region with positional tolerance.
Figure 1. The 30-Amino-Acid Deletion in the Nsp2 of Highly Pathogenic PRRSV. (Zhou et al. 2009) |
Method and Implementation
Pairwise alignment between a sequence of interest and a reference sequence (ORF1a of VR-2332 strain) is an essential step for finding insertions and/or deletions, shortly indels. The two sequences were aligned with BLAST (Altschul et al. 1999), exactly saying bl2seq with blastx program because my query are assumed as DNA sequences from clinical samples. Remind that global alignment like ClustalW (Thompson, J.D. et al. 1994) is not suitable for this case. The reason is that query sequences were quite shorter than a reference sequence. After blasting two sequences, a Perl script using BioPerl parses a report file from BLAST and evaluates whether indels' position and size is tolerable for user's constraints.
A format of BLAST (bl2seq) result consists of several sections including a summary of query, alignment reports for each HSP (high scoring pair), parameters, and statistics. An example part of the result is shown as the below; Each subject begins with a symbol, ">", accession number and description are followed. In next part, summary statistics for each HSP are written, including bit score, E-value, identities, etc. Actual alignment between the query and subject are described in the last part. These format of data are repeated for all HSPs and subjects.
Figure 2. An example BLAST's result of the deletion in Nsp2. |
Bio::SearchIO module supports to read and write search results in various file format of biological applications such as BLAST, FASTA, HMMER, Sim4, etc. Basic parsing a file from BLAST was done by Bio::SearchIO functions and I recommend to read a HOWTO document on the BioPerl. A next step is to extract a quite large indels from HSP alignment information.
At the first time I thought too pedantic to implement this problem --- this is why I write this post. An example of too much 'pedantic' idea is like this; calculating % gap for each columns, cut-off with a specific threshold, finding a series of deletions without break using graph, and evaluating each deletions. It's too much complex! I stop my work. After drinking a cup of coffee, I considered it again in mind of "Perl script should be Perlish".
The following code is to get a string containing a alignment of query sequence potentially including indels marked as "-".
my $query_alignment = $hsp->query_string();
Now I can use super strong features regular expression of Perl for find a series of indels; this codes extracts positions of start and end residues for all possible long indels, which are referred to a reference sequence. And also indels is constrained with minimal length of deletion.
my ($ref_start, $ref_end) = $hsp->hit->strand < 0 ?
( $hsp->hit->end,
$hsp->hit->start ) :
( $hsp->hit->start,
$hsp->hit->end );
while( $query_alignment =~ /(\-{$threshold_length,})/g ) {
$indel_start = $-[1] + $ref_start;
$indel_end = $-[1] + $ref_start + length($1) - 1;
$indel_length = length($1);
push @indels, [$indel_start, $indel_end];
}
Further filtering constraints were implemented and passed from arguments, threshold % indentification (-p), expected start position of indel (-s), expected end position of indel (-e), and tolerant positional error(-d). Sample result is shown as the following:
[bckang@gxs ~]$ ./gvs_find_indel.pl nsp2.blsx -s 921 -e 949 -d 15
#Mutation dectector's parameters: p=30 s=921 e=949 d=15
>AAO13191(NA_prototype)
920 948 29
>AAO13191(NA_prototype)
>AAO13191(NA_prototype)
>AAO13191(NA_prototype)
>AAO13191(NA_prototype)
>AAO13191(NA_prototype)
It is true the simplest way is the best. Also it is better way that Perl is supposed to be Perlish and Python looks like Pythonic.
- Zhou, L. et al. (2009) The 30-amino-acid deletion in the Nsp2 of highly pathogenic porcine reproductive and respiratory syndrome virus emerging in China is not related to its virulence. J. Virol. 83(10):5156-67.
- Han, J. et al. (2007) Identification of Nonessential Regions of the Nsp2 Replicase Protein of PRRSV Strain VR-2332 for Replication in Cell Culture. J. Virol. 81(18):9878-90.
- Wikipedia: PRRSV.
- Altscul, S.F. et al. (1999) Basic local alignment search tool. J. Mol. Biol. 215:403-410.
- Thompson, J.D. et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22(22):4673-4680.
- Perl, http://www.perl.org
- BioPerl, http://www.bioperl.org
Comments