신호와 의미 / Signal & Semantics

Showing posts with the label Bioinformatics

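The Need for a Bioinformatics Framework for NGS

As next-generation sequencing (NGS) has become a major tool across biological research, the entire hardware and software stack behind bioinformatics analysis services needs to be tailored to NGS. Compared with conventional sequencing methods, NGS produces short reads on a massive scale. Software that makes aggressive use of memory optimization and parallel computation is being developed in response, and hardware configured to get the best performance out of that software is needed as well.

According to a recent report (The Global Outlook for Next Generation Sequencing: Usage, Platform Drivers & Workflow, October 31, 2011. BioInformatics, LLC. See on YouTube), researchers who use NGS named improving the performance of analysis software as their most difficult issue. When platform management and storage problems are included, bioinformatics-related issues account for as much as 29%.

Figure 1. Most Significant Improvement to Your Next Generation Sequencing Workflow (Source: The Global Outlook for Next Generation Sequencing: Usage, Platform Drivers & Workflow, October 31, 2011. BioInformatics, LLC)

The report also says that the biggest bottleneck researchers feel in the current research environment is transparent management of the workflow. That is, respondents identified as the biggest bottleneck the lack of an informatics platform that transparently manages the results of multi-step, in-depth analyses run through a workflow, together with the history of each step's results. Building a system for strategically outsourcing the workflow also emerged as an important issue. Omics pro...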

Collecting meta data from Entrez

When writing a research proposal, it is often useful to show how the sequence data of interest have grown. For example, to argue that agricultural sequence data are growing faster than human data, you need to collect the number of sequences from agricultural organisms and compare it with the number for human. The gross GenBank statistics posted on NCBI's Web page may not be detailed enough to describe this growth. By using the show index, preview, and limit functions in Entrez, you can quickly collect meta-information such as the number of entries.

dbEST       Total records   Records for last 3 years   Growth rate for last 3 years
human           8,315,231                    177,492                           2.1%
mouse           4,853,547                      3,289                           0.1%
cattle          1,559,494                     45,232                           2.9%
pig             1,620,570                    144,207                           8.9%
chicken           600,423                      1,041                           0.2%
insects         4,493,137                  1,864,326                          41.5%
bacteria            1,266                      1,012                          79.9%
fungi           2,893,583                  1,508,814                          52.1%
plant          22,633,681                  7,290,397                          32.2%

To complete the above table, we need to count total records for each species in dbES...
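The growth-rate column in the table appears to be the number of records from the last 3 years divided by the total number of records. The short Python sketch below recomputes it from the counts shown above; the dictionary layout and the `growth_rate` helper are illustrative names, not part of the original post.

```python
# Record counts copied from the dbEST table above:
# organism -> (total records, records for last 3 years)
records = {
    "human": (8_315_231, 177_492),
    "mouse": (4_853_547, 3_289),
    "cattle": (1_559_494, 45_232),
    "pig": (1_620_570, 144_207),
    "chicken": (600_423, 1_041),
    "insects": (4_493_137, 1_864_326),
    "bacteria": (1_266, 1_012),
    "fungi": (2_893_583, 1_508_814),
    "plant": (22_633_681, 7_290_397),
}


def growth_rate(total: int, recent: int) -> float:
    """Share of all records added in the recent window, in percent.

    Assumption: this is how the table's growth rate was derived.
    """
    return 100.0 * recent / total


for organism, (total, recent) in records.items():
    print(f"{organism:<10}{growth_rate(total, recent):6.1f}%")
```

Under that assumption the recomputed values agree with the table to one decimal place, which also makes the sketch a quick consistency check on the counts collected from Entrez.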

How to find a long indel from Nsp2 alignment

Motivation

Zhou et al. showed that a unique 30-amino-acid deletion in the Nsp2-coding region is a key feature for classifying a strain as a highly pathogenic porcine reproductive and respiratory syndrome virus (PRRSV). Nsp2 (nonstructural protein 2) has been shown to undergo remarkable genetic variation, primarily in its middle region, while exhibiting high conservation in the N-terminal putative protease domain and the C-terminal predicted transmembrane region (Han et al. 2007). This post aims to show how to find a fairly large deletion in a specific coding region with positional tolerance.

Figure 1. The 30-Amino-Acid Deletion in the Nsp2 of Highly Pathogenic PRRSV. (Zhou et al. 2009)

Method and Implementation

Pairwise alignment between a sequence of interest and a reference sequence (ORF1a of VR-2332 strain) is an essential step for finding insertions and/or deletions (indels for short). The two sequences were aligned with BLAST (Altsch...
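As a rough illustration of the idea of finding a long deletion with positional tolerance, the Python sketch below scans an already-aligned reference/query pair for runs of gaps in the query and checks whether one of them starts near an expected reference position. It is not the post's own implementation: it assumes the alignment is available as two equal-length gapped strings (for example, extracted from an aligner's pairwise output), and the function names, the toy sequences, and the tolerance parameter are all hypothetical.

```python
from typing import List, NamedTuple


class Deletion(NamedTuple):
    ref_start: int  # 1-based position in the ungapped reference
    length: int     # number of reference residues missing from the query


def find_deletions(aligned_ref: str, aligned_query: str, min_len: int = 1) -> List[Deletion]:
    """Return runs of '-' in the query that are at least min_len columns long."""
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have the same length")
    deletions: List[Deletion] = []
    ref_pos = 0      # reference residues consumed so far
    run_start = 0    # reference position where the current gap run began
    run_len = 0      # length of the current gap run
    for r, q in zip(aligned_ref, aligned_query):
        if r != "-":
            ref_pos += 1
        if q == "-" and r != "-":
            if run_len == 0:
                run_start = ref_pos
            run_len += 1
        else:
            if run_len >= min_len:
                deletions.append(Deletion(run_start, run_len))
            run_len = 0
    if run_len >= min_len:  # gap run that reaches the end of the alignment
        deletions.append(Deletion(run_start, run_len))
    return deletions


def has_expected_deletion(deletions: List[Deletion], expected_start: int,
                          expected_len: int, tolerance: int) -> bool:
    """True if a deletion of expected_len starts within +/- tolerance of expected_start."""
    return any(
        d.length == expected_len and abs(d.ref_start - expected_start) <= tolerance
        for d in deletions
    )


# Toy example (made-up sequences, not real Nsp2 data)
ref = "MKTAYIAKQRQISFVK"
qry = "MKTAY-----QISFVK"
found = find_deletions(ref, qry, min_len=3)
print(found)
print(has_expected_deletion(found, expected_start=5, expected_len=5, tolerance=2))
```

For the Nsp2 case one would pass `expected_len=30` and the expected start in reference coordinates; the tolerance absorbs small shifts in where the aligner places the gap.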