Skip to main content

How to find a long indel from Nsp2 alignment

Motivation
Zhou et al. presented that an unique 30-amino-acid deletion in Nsp2-coding region is a key feature to classify whether a strain is a highly pathogenic porcine reproductive and respiratory syndrome virus (PRRSV). And the Nsp2, nonstructural protein 2, has been shown to undergo remarkable genetic variation, primarily in its middle region, while exhibiting high conservation in the N-terminal putative protease domain and the C-terminal predicted transmembrane region (Han et al. 2007). This post aims to show how to find a quite large deletion in a specific coding-region with positional tolerance.


Figure 1. The 30-Amino-Acid Deletion in the Nsp2 of Highly Pathogenic PRRSV. 
(Zhou et al. 2009)


Method and Implementation
Pairwise alignment between a sequence of interest and a reference sequence (ORF1a of VR-2332 strain) is an essential step for finding insertions and/or deletions, shortly indels. The two sequences were aligned with BLAST (Altschul et al. 1999), exactly saying bl2seq with blastx program because my query are assumed as DNA sequences from clinical samples. Remind that global alignment like ClustalW (Thompson, J.D. et al. 1994) is not suitable for this case. The reason is that query sequences were quite shorter than a reference sequence. After blasting two sequences, a Perl script using BioPerl parses a report file from BLAST and evaluates whether indels' position and size is tolerable for user's constraints.


A format of BLAST (bl2seq) result consists of several sections including a summary of query, alignment reports for each HSP (high scoring pair), parameters, and statistics. An example part of the result is shown as the below; Each subject begins with a symbol, ">", accession number and description are followed. In next part, summary statistics for each HSP are written, including bit score, E-value, identities, etc. Actual alignment between the query and subject are described in the last part. These format of data are repeated for all HSPs and subjects.


Figure 2. An example BLAST's result of the deletion in Nsp2.


Bio::SearchIO module supports to read and write search results in various file format of biological applications such as BLAST, FASTA, HMMER, Sim4, etc. Basic parsing a file from BLAST was done by Bio::SearchIO functions and I recommend to read a HOWTO document on the BioPerl. A next step is to extract a quite large indels from HSP alignment information.


At the first time I thought too pedantic to implement this problem --- this is why I write this post. An example of too much 'pedantic' idea is like this; calculating % gap for each columns, cut-off with a specific threshold, finding a series of deletions without break using graph, and evaluating each deletions. It's too much complex! I stop my work. After drinking a cup of coffee, I considered it again in mind of "Perl script should be Perlish".


The following code is to get a string containing a alignment of query sequence potentially including indels marked as "-".


my $query_alignment = $hsp->query_string();


Now I can use super strong features regular expression of Perl for find a series of indels; this codes extracts positions of start and end residues for all possible long indels, which are referred to a reference sequence. And also indels is constrained with minimal length of deletion.



my ($ref_start, $ref_end) = $hsp->hit->strand < 0 ?
                                    ( $hsp->hit->end,
                                      $hsp->hit->start ) :
                                    ( $hsp->hit->start,
                                      $hsp->hit->end );

while( $query_alignment =~ /(\-{$threshold_length,})/g ) {
      $indel_start = $-[1] + $ref_start;
      $indel_end = $-[1] + $ref_start + length($1) - 1;
      $indel_length = length($1);
      push  @indels, [$indel_start, $indel_end];
}


Further filtering constraints were implemented and passed from arguments, threshold % indentification (-p), expected start position of indel (-s), expected end position of indel (-e), and tolerant positional error(-d). Sample result is shown as the following:


[bckang@gxs ~]$ ./gvs_find_indel.pl nsp2.blsx -s 921 -e 949 -d 15
#Mutation dectector's parameters: p=30 s=921 e=949 d=15
>AAO13191(NA_prototype)
920     948     29
>AAO13191(NA_prototype)
>AAO13191(NA_prototype)
>AAO13191(NA_prototype)
>AAO13191(NA_prototype)
>AAO13191(NA_prototype)


It is true the simplest way is the best. Also it is better way that Perl is supposed to be Perlish and Python looks like Pythonic.

References

  1. Zhou, L. et al. (2009) The 30-amino-acid deletion in the Nsp2 of highly pathogenic porcine reproductive and respiratory syndrome virus emerging in China is not related to its virulence. J. Virol.  83(10):5156-67.
  2. Han, J. et al. (2007) Identification of Nonessential Regions of the Nsp2 Replicase Protein of PRRSV Strain VR-2332 for Replication in Cell Culture. J. Virol.  81(18):9878-90.
  3. Wikipedia: PRRSV.
  4. Altscul, S.F. et al. (1999) Basic local alignment search tool. J. Mol. Biol. 215:403-410.
  5. Thompson, J.D. et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22(22):4673-4680.
  6. Perl, http://www.perl.org
  7. BioPerl, http://www.bioperl.org

Comments

Popular posts from this blog

[왜#1] 똑같은 환경의 방안에 둔 금속이 종이나 나무보다 더 차가운건 왜일까?

여름이든 겨울이든 방안에  둔 금속이 종이나 나무로 만들어진 것 보다 훨씬 차갑다는 것은 다 아는 사실이다. 금방 냉장고에 꺼낸것이 아니라 2-3시간에 그냥 둔 것이니 책이나 캔의 온도는 같아야 할텐데 우리가 손을 데면 금속이나 액체를 담은 팩이 훨씬 차다. 갑자기 우리 딸아이가 이런 것을 질문하면 어떻게 답해야할까? 라는 생각이 들었다. 전공시간에 열역학을 배우긴 했지만... 결국 구글링과 여러 링크를 서핑한 결과 두 가지 단서를 찾을 수 있었고, 나름 결과를 정리해본다. "온도감각은 현재의 온도가 아니라 온도변화율에 반응한다" 피부가 온도를 느끼는 것은 주로 온각섬유 warm receptor와 냉각섬유 cold receptor가 각각 작용한다(1). 참고자료를 꼼꼼히 읽어보면 각 섬유세포의 역치(threshold)에 대해서 나오는데 단위가 [온도/시간]이다. 온각의 역치는 0.001 ℃/sec , 냉각의 역치는 0.004 ℃/sec 이며 공통으로 약 3초 후에 순응한다. 즉, 사람은 초당 온도가 얼만큼 변화는 가에 따라 온도를 느끼는 정도가 달라진다는 것이다. 겨울에 바람이 많이 부는 날은 체감온도가 낮아지는 이유도, 어느정도 따뜻해진 옷속의 공기를 계속 차갑게 하면서 변화율을 높이기 때문인 것이다. 결국 방안의 물건들이 같은 온도임에도 불구하고 금속성 물질과 종이로 만들 것을 만질때 감각의 차이는 온도변화율에 기인하는 것임을 알 수 있다. 다음으로  "접촉한 두 물질의 종류에 따라서 온도변화율이 달라지는가"에 대한 의문이 꼬리를 문다. 몇 가지 키워드로 자료를 찾다보면 "열전달 heat transfer", "열전도율 heat conductivity"이라는 용어를 접할 수 있다(2). 사실은 이공계 출신이라 전도율이라는 단어는 금방 떠올라서, '온도 전도율'로 검색했더니 올바른 표현인 '열전도율' 자료를 찾을 수 있었다. ...

Out-focusing of Lens

Photo 1. A toy car, a trial of out-focusing Yesterday I went to a photo studio, LemonTerrace Dongtan-branch , to take pictures of my son for celebrating the first birthday of him, " Dol ", a Korean word. The host of the studio, a specialist of camera, taught how to use my DSLR camera (EOS1000D) in details. Especially he demonstrated the out-focusing technique in case-by-case. A simple protocol for the beginner of DSLR user is the following; (1) Set 'Av' mode of your camera, (2) Set max number of ISO , my case is 1400, (3) adjust the distance between the camera and the subject as close as possible, (4) check shutter's speed by half pressing shutter button, (5) the speed should be lower than 1/60, (6) if too lower eg. 1/125, change ISO to be lower, (7) if higher speed, adjust the distance or set higher ISO, and (8) take a picture by full pressing. And he highly recommended for me to change my camera lens for portraits, 30 mm f/1.4 . The studio facilities and peo...

Running X-Window application without a screen

Batch processing of beautifying phylogenetic trees I prefer to use treedyn when I decorate many number of phylogenetic trees at  one time. In use on Linux desktop, my Perl script works well to convert from raw trees to beautified trees. However,  it was stopped at running of treedyn because Tcl of treedyn essentially required a DISPLAY of X-Window when  the script was integrated into an Web application. Xvfb (X virtual framebuffer) presents a virtual screen for user or program, which works only on memory not real video device. And xvfb-run.sh , a script of Xvfb, is an utility for command line program. $ xvfb-run.sh my_script.pl -infile tree.phy -outfile tree.ps Please see details of xvfb-run.sh  in Dascalita's very concise post . Additional comments Chevenet, the author of treedyn, and Christen recently published ScripTree that is more flexible and functional program for script developers. This ScripTree is also required to use Xvfb in case of backgrou...