Genomic Data Processing 47: ART, a Gene Sequence Data Generator (Simulation)

Published: 2021-03-10 11:06:04  Category: Big Data  Source: compiled from the web

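1. Concept:
ART is a generator of simulated gene sequence data (sequencing reads).
For details, see the paper 【1】
and the official site 【2】.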

2. Download:
ART-bin-GreatSmokyMountains-04.17.16-Linux64.tgz

http://www.niehs.nih.gov/research/resources/assets/docs/artbingreatsmokymountains041716linux64tgz.tgz

3. Configuration:
Copy the extracted files into the user's bin directory with sudo cp.
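
The configuration step can be sketched as a small shell function. This is a sketch under assumptions, not the author's exact commands: `install_art` is a hypothetical helper name, and it assumes the archive contains regular files whose names start with `art_` (such as the `art_illumina` program used below). The destination is whichever bin directory is on your PATH; copying into a system directory would need sudo, as noted above.

```shell
# Hypothetical helper: unpack the ART tarball and copy its art_* files
# into a bin directory. Usage: install_art <tarball> <dest_dir>
install_art() (
    tarball=$1
    dest=$2
    tmp=$(mktemp -d) || exit 1
    # Extract into a scratch directory so the archive's internal layout does not matter.
    tar -xzf "$tarball" -C "$tmp" || exit 1
    mkdir -p "$dest"
    # Assumption: the programs are regular files named art_* (e.g. art_illumina).
    find "$tmp" -type f -name 'art_*' -exec cp {} "$dest"/ \;
    rm -rf "$tmp"
)
```

For example, `install_art ART-bin-GreatSmokyMountains-04.17.16-Linux64.tgz "$HOME/bin"` would place the tools in a per-user bin directory without sudo, provided that directory is on the PATH.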

4. Usage:

hadoop@Master:~/cloud/adam/xubo/data/GRCH38Sub/cs-bwamem$ art_illumina

See the appendix for the full usage output.

5. Example:

hadoop@Master:~/cloud/adam/xubo/data/GRCH38Sub/cs-bwamem$ art_illumina -ss HS20 -i GRCH38chr1L3556522.fna -l 100 -f 20 -o G38L100F20Nhs20
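
As a rough sanity check on what `-l 100 -f 20` asks for: fold coverage is approximately (number of reads x read length) / reference length, so the expected read count is about fold x reference length / read length. The sketch below applies that arithmetic. `ref_bases` and `expected_reads` are hypothetical helper names, and the result is only an estimate; the exact number ART produces may differ somewhat (for example because of how regions containing 'N' are handled).

```shell
# Count bases in a FASTA file: drop header lines, then strip line breaks.
ref_bases() {
    grep -v '^>' "$1" | tr -d '\r\n' | wc -c | tr -d ' '
}

# Estimated read count for a single-end simulation:
#   reads ~= fold_coverage * reference_length / read_length
# Usage: expected_reads <reference_length> <read_length> <fold_coverage>
expected_reads() {
    echo $(( $3 * $1 / $2 ))
}
```

Usage for the command above would be `expected_reads "$(ref_bases GRCH38chr1L3556522.fna)" 100 20`.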

Result:

hadoop@Master:~/cloud/adam/xubo/data/GRCH38Sub/cs-bwamem$ art_illumina -ss HS20 -i GRCH38chr1L3556522.fna -l 100 -f 20 -o G38L100F20Nhs20

    ====================ART====================
             ART_Illumina (2008-2016)          
          Q Version 2.5.1 (Apr 17,2016)       
     Contact: Weichun Huang <whduke@gmail.com> 
    -------------------------------------------
(still running)

hadoop@Master:~/cloud/adam/xubo/data/GRCH38Sub/cs-bwamem$ ll
total 9443836
drwxrwxr-x 2 hadoop hadoop       4096  6月  2 23:10 ./
drwxrwxr-x 6 hadoop hadoop       4096  6月  2 22:59 ../
-rw-rw-r-- 1 hadoop hadoop 4635232124  6月  2 23:11 G38L100F20Nhs20.aln
-rw-rw-r-- 1 hadoop hadoop 4347022003  6月  2 23:11 G38L100F20Nhs20.fq
-rw-r--r-- 1 hadoop hadoop  252513055  6月  2 23:00 GRCH38chr1L3556522.fna
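
Once the run has finished, one quick check on the output is to count the reads in the `.fq` file. A minimal sketch, assuming the usual four-line FASTQ record layout (header, sequence, '+', qualities); `count_fastq_reads` is a hypothetical helper name.

```shell
# Count reads in a FASTQ file, assuming exactly four lines per record.
count_fastq_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}
```

For example, `count_fastq_reads G38L100F20Nhs20.fq`. With `-l 100 -f 20`, the count should be roughly 20 x reference length / 100.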

References
【1】 http://bioinformatics.oxfordjournals.org/content/28/4/593.short
【2】 http://www.niehs.nih.gov/research/resources/software/biostatistics/art/
【3】 http://www.niehs.nih.gov/research/resources/assets/docs/artbingreatsmokymountains041716linux64tgz.tgz

Appendix:

hadoop@Master:~/cloud/adam/xubo/data/GRCH38Sub/cs-bwamem$ art_illumina 

    ====================ART====================
             ART_Illumina (2008-2016)          
          Q Version 2.5.1 (Apr 17,2016)       
     Contact: Weichun Huang <whduke@gmail.com> 
    -------------------------------------------

===== USAGE =====

art_illumina [options] -ss <sequencing_system> -sam -i <seq_ref_file> -l <read_length> -f <fold_coverage> -o <outfile_prefix>
art_illumina [options] -ss <sequencing_system> -sam -i <seq_ref_file> -l <read_length> -c <num_reads_per_sequence> -o <outfile_prefix>
art_illumina [options] -ss <sequencing_system> -sam -i <seq_ref_file> -l <read_length> -f <fold_coverage> -m <mean_fragsize> -s <std_fragsize> -o <outfile_prefix>
art_illumina [options] -ss <sequencing_system> -sam -i <seq_ref_file> -l <read_length> -c <num_reads_per_sequence> -m <mean_fragsize> -s <std_fragsize> -o <outfile_prefix>

===== PARAMETERS =====

  -1   --qprof1   the first-read quality profile
  -2   --qprof2   the second-read quality profile
  -amp --amplicon amplicon sequencing simulation
  -c   --rcount   number of reads/read pairs to be generated per sequence/amplicon (not be used together with -f/--fcov)
  -d   --id       the prefix identification tag for read ID
  -ef  --errfree  indicate to generate the zero sequencing errors SAM file as well the regular one
                  NOTE: the reads in the zero-error SAM file have the same alignment positions
                  as those in the regular SAM file, but have no sequencing errors
  -f   --fcov     the fold of read coverage to be simulated or number of reads/read pairs generated for each amplicon
  -h   --help     print out usage information
  -i   --in       the filename of input DNA/RNA reference
  -ir  --insRate  the first-read insertion rate (default: 0.00009)
  -ir2 --insRate2 the second-read insertion rate (default: 0.00015)
  -dr  --delRate  the first-read deletion rate (default:  0.00011)
  -dr2 --delRate2 the second-read deletion rate (default: 0.00023)
  -l   --len      the length of reads to be simulated
  -m   --mflen    the mean size of DNA/RNA fragments for paired-end simulations
  -mp  --matepair indicate a mate-pair read simulation
  -M  --cigarM    indicate to use CIGAR 'M' instead of '=/X' for alignment match/mismatch
  -nf  --maskN    the cutoff frequency of 'N' in a window size of the read length for masking genomic regions
                  NOTE: default: '-nf 1' to mask all regions with 'N'. Use '-nf 0' to turn off masking
  -na  --noALN    do not output ALN alignment file
  -o   --out      the prefix of output filename
  -p   --paired   indicate a paired-end read simulation or to generate reads from both ends of amplicons
                  NOTE: art will automatically switch to a mate-pair simulation if the given mean fragment size >= 2000
  -q   --quiet    turn off end of run summary
  -qL  --minQ     the minimum base quality score
  -qU  --maxQ     the maxiumum base quality score
  -qs  --qShift   the amount to shift every first-read quality score by 
  -qs2 --qShift2  the amount to shift every second-read quality score by
                  NOTE: For -qs/-qs2 option, a positive number will shift up quality scores (the max is 93)
                  that reduce substitution sequencing errors and a negative number will shift down
                  quality scores that increase sequencing errors. If shifting scores by x, the error
                  rate will be 1/(10^(x/10)) of the default profile.
  -rs  --rndSeed  the seed for random number generator (default: system time in second)
                  NOTE: using a fixed seed to generate two identical datasets from different runs
  -s   --sdev     the standard deviation of DNA/RNA fragment size for paired-end simulations.
  -sam --samout   indicate to generate SAM alignment file
  -sp  --sepProf  indicate to use separate quality profiles for different bases (ATGC)
  -ss  --seqSys   The name of Illumina sequencing system of the built-in profile used for simulation
       NOTE: sequencing system ID names are:
            GA1 - GenomeAnalyzer I (36bp,44bp), GA2 - GenomeAnalyzer II (50bp,75bp)
           HS10 - HiSeq 1000 (100bp),          HS20 - HiSeq 2000 (100bp),      HS25 - HiSeq 2500 (125bp,150bp)
           HSXn - HiSeqX PCR free (150bp),     HSXt - HiSeqX TruSeq (150bp),   MinS - MiniSeq TruSeq (50bp)
           MSv1 - MiSeq v1 (250bp),            MSv3 - MiSeq v3 (250bp),        NS50 - NextSeq500 v2 (75bp)

===== NOTES =====

* ART by default selects a built-in quality score profile according to the read length specified for the run.

* For single-end simulation, ART requires input sequence file, output file prefix, read length, and read count/fold coverage.

* For paired-end simulation (except for amplicon sequencing), ART also requires the parameter values of
  the mean and standard deviation of DNA/RNA fragment lengths

===== EXAMPLES =====

 1) single-end read simulation
       art_illumina -ss HS25 -sam -i reference.fa -l 150 -f 10 -o single_dat

 2) paired-end read simulation
       art_illumina -ss HS25 -sam -i reference.fa -p -l 150 -f 20 -m 200 -s 10 -o paired_dat

 3) mate-pair read simulation
       art_illumina -ss HS10 -sam -i reference.fa -mp -l 100 -f 20 -m 2500 -s 50 -o matepair_dat

 4) amplicon sequencing simulation with 5' end single-end reads
       art_illumina -ss GA2 -amp -sam -na -i amp_reference.fa -l 50 -f 10 -o amplicon_5end_dat

 5) amplicon sequencing simulation with paired-end reads
       art_illumina -ss GA2 -amp -p -sam -na -i amp_reference.fa -l 50 -f 10 -o amplicon_pair_dat

 6) amplicon sequencing simulation with matepair reads
       art_illumina -ss MSv1 -amp -mp -sam -na -i amp_reference.fa -l 150 -f 10 -o amplicon_mate_dat

 7) generate an extra SAM file with zero-sequencing errors for a paired-end read simulation
       art_illumina -ss HSXn -ef -i reference.fa -p -l 150 -f 20 -m 200 -s 10 -o paired_twosam_dat

 8) reduce the substitution error rate to one 10th of the default profile
       art_illumina -i reference.fa -qs 10 -qs2 10 -l 50 -f 10 -p -m 500 -s 10 -sam -o reduce_error

 9) turn off the masking of genomic regions with unknown nucleotides 'N'
       art_illumina -ss HS20 -nf 0 -sam -i reference.fa -p -l 100 -f 20 -m 200 -s 10 -o paired_nomask

 10) masking genomic regions with >=5 'N's within the read length 50
       art_illumina -ss HSXt -nf 5 -sam -i reference.fa -p -l 150 -f 20 -m 200 -s 10 -o paired_maskN5
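
The -qs/-qs2 note in the parameter list says that shifting quality scores by x scales the error rate to 1/(10^(x/10)) of the default profile, which is why example 8 uses a shift of 10 to reach one tenth. A small sketch of that arithmetic follows; `qshift_factor` is a hypothetical helper name, not part of ART, and it ignores the cap on quality scores mentioned in the note.

```shell
# Error-rate multiplier for a quality shift of x: 1 / 10^(x/10)
# Usage: qshift_factor <x>
qshift_factor() {
    awk -v x="$1" 'BEGIN { printf "%.4f\n", 1 / (10 ^ (x / 10)) }'
}

qshift_factor 10    # 0.1000 (one tenth of the default error rate)
qshift_factor 20    # 0.0100 (one hundredth)
```

A negative shift gives a factor greater than 1, matching the note that shifting scores down increases sequencing errors.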
