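1. Gene Annotation Database
A gene annotation database describes the structures on a genome (genes, transcripts, exons, introns and so on) by chromosome, strand, start and end positions, standard names and similar attributes. There are three major annotation sets: RefSeq (refGene), UCSC (knownGene) and Ensembl (ensGene). In addition, the widely used GENCODE annotation merges the Havana manual gene annotation with the Ensembl automated gene annotation. (What the Ensembl browser displays is the GENCODE annotation; for human and mouse the two are equivalent.)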
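1. An annotation is a database. In text form it is distributed as GTF files, GFF files and similar formats.
2. Some annotations also come in a "basic" version, which keeps only a representative subset of each gene's transcripts (not simply the longest one). The official description is: a subset of representative transcripts for each gene. This subset prioritises full-length protein coding transcripts over partial or non-protein coding transcripts within the same gene, and intends to highlight those transcripts that will be useful to the majority of users.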
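To make the GTF text format concrete, here is a minimal Python sketch that splits one feature line into its nine tab-separated columns and parses the attribute column. The function name `parse_gtf_line` and the example record are illustrative only. The sketch assumes GENCODE-style attributes (`key "value";`, with `tag "basic"` marking transcripts in the basic set) and does not handle semicolons inside quoted values.

```python
def parse_gtf_line(line):
    """Parse one GTF feature line into a dict; return None for comments and blank lines."""
    line = line.rstrip("\n")
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    seqname, source, feature, start, end, score, strand, frame, attr_text = fields

    # Attributes look like: key "value"; key "value"; ...
    # A key such as "tag" may occur several times, so values are kept in lists.
    attributes = {}
    for item in attr_text.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes.setdefault(key, []).append(value.strip().strip('"'))

    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),   # GTF coordinates are 1-based, inclusive
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


# Illustrative record in GENCODE style (assumption: values are for demonstration only)
example = "\t".join([
    "chr1", "HAVANA", "transcript", "11869", "14409", ".", "+", ".",
    'gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; '
    'gene_name "DDX11L1"; tag "basic";',
])
record = parse_gtf_line(example)
is_basic = "basic" in record["attributes"].get("tag", [])
print(record["feature"], record["start"], record["end"], is_basic)
```

Keeping repeated keys in lists is a deliberate choice here: a transcript can carry several `tag` entries, and overwriting them would lose the "basic" marker.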
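RefSeq (the source of refGene) and GENCODE (the source of ensGene) are the two most fundamental gene-set databases. The former is mostly curated and emphasizes unique, standardized records. The latter draws on multiple sources and uses a broader definition of a gene, covering coding genes, non-coding genes and pseudogenes; some entries are validated and others predicted, and its transcripts are far more numerous.
The GENCODE annotation is trending toward becoming the "gold standard". It uses Ensembl IDs: an ID starts with "ENS", followed by a type flag ("G" for gene, "T" for transcript, "E" for exon, "P" for translation), followed by 11 digits.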
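The ID layout described above can be checked with a short regular expression. The following Python sketch is illustrative (the names `ENSEMBL_ID` and `parse_ensembl_id` are not from any library). It covers only the human-style pattern given in the text, plus the optional ".N" version suffix that GENCODE files append; IDs for other species carry an extra species code after "ENS" and are not matched here.

```python
import re

# "ENS" + one type flag + exactly 11 digits, optionally followed by ".version"
ENSEMBL_ID = re.compile(r"ENS([GTEP])(\d{11})(?:\.(\d+))?")

FEATURE_TYPES = {"G": "gene", "T": "transcript", "E": "exon", "P": "translation"}


def parse_ensembl_id(identifier):
    """Return the feature type, numeric part and version of an ID, or None if it does not match."""
    match = ENSEMBL_ID.fullmatch(identifier)
    if match is None:
        return None
    flag, digits, version = match.groups()
    return {
        "type": FEATURE_TYPES[flag],
        "number": int(digits),
        "version": int(version) if version is not None else None,
    }


for candidate in ("ENSG00000141510", "ENST00000456328.2", "ENSG123"):
    print(candidate, parse_ensembl_id(candidate))
```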
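2. Mutation Annotation
Mastering mutation annotation depends less on knowing the software than on knowing the databases thoroughly. Unless you are batch-processing data, manual annotation with the help of visualization software is more intuitive and more accurate. Several programs can annotate SNPs and InDels. Annotation requires two key files: a VCF input file and an annotation database.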
■ Annovar
1.Install application and set environment variables
alias annotate_variation='E:/path/annotate_variation.pl'
alias table_annovar='E:/path/table_annovar.pl'
vi annotate_variation.pl # then change the default setting $buildver='hg18' to 'hg38'
annotatedb='E:/path/humandb/'
2.Prepare annotation database
annotate_variation -downdb -buildver hg38 -webfrom annovar refGene $annotatedb
# refGene is the name of the annotation database to download; humandb/ ($annotatedb) is the folder it is saved to
3.Prepare vcf input file
convert2annovar -format vcf4old example.vcf > example.avinput # convert the VCF to a tab-separated file
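To show roughly what the tab-separated avinput file contains, here is a simplified Python sketch that maps one VCF record to the five leading avinput columns (chromosome, start, end, reference allele, alternative allele). It is not ANNOVAR's own code: the function name `vcf_to_avinput` is hypothetical, only a single ALT allele is handled, and the indel handling (trimming the shared leading bases and writing "-" for the empty allele) is an assumption based on the usual avinput convention.

```python
def vcf_to_avinput(chrom, pos, ref, alt):
    """Return (chrom, start, end, ref, alt) with 1-based coordinates and '-' for an empty allele."""
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    # VCF anchors indels with shared leading bases; drop them and shift the position.
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    if not ref:
        # Insertion: start and end both point at the base before the inserted sequence.
        return (chrom, pos - 1, pos - 1, "-", alt)
    if not alt:
        # Deletion: the interval covers exactly the deleted bases.
        return (chrom, pos, pos + len(ref) - 1, ref, "-")
    # SNV or block substitution.
    return (chrom, pos, pos + len(ref) - 1, ref, alt)


for record in [("chr1", 100, "A", "G"), ("chr1", 100, "AT", "A"), ("chr1", 100, "A", "AT")]:
    print("\t".join(str(x) for x in vcf_to_avinput(*record)))
```

For real data, use convert2annovar itself; the sketch is only meant to make the coordinate shift between the two formats visible.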
4.Perform genomic annotation
# gene-based annotation keyword: -geneanno
annotate_variation -buildver hg38 -geneanno example.avinput $annotatedb # by default, geneanno is ON
# region-based annotation keyword: -regionanno
annotate_variation -buildver hg38 -regionanno -dbtype cytoBand example.avinput $annotatedb
annotate_variation -buildver hg38 -regionanno -dbtype gff3 -gff3dbfile tfbs.gff3 example.avinput $annotatedb
# filter rare or unreported variants (in 1000G/dbSNP) or predicted deleterious variants
annotate_variation -buildver hg38 -filter -dbtype 1000g2015aug_all -maf 0.01 example.avinput $annotatedb
annotate_variation -buildver hg38 -filter -dbtype snp138 example.avinput $annotatedb
annotate_variation -buildver hg38 -filter -dbtype dbnsfp30a -otherinfo example.avinput $annotatedb
# perform the above three types of annotation together
table_annovar
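As a sketch of how the three annotation types are combined, the Python helper below assembles a table_annovar.pl command line in which every database in `-protocol` is paired with an operation in `-operation` (g = gene-based, r = region-based, f = filter-based). The helper name `build_table_annovar_cmd`, the output prefix `example_anno` and the chosen database list are assumptions for illustration; check the options against the documentation of your ANNOVAR version before running.

```python
def build_table_annovar_cmd(avinput, db_dir, out_prefix, protocols,
                            buildver="hg38", script="table_annovar.pl"):
    """Build the argument list for table_annovar.pl.

    protocols is a list of (database_name, operation) pairs,
    where operation is "g", "r" or "f".
    """
    names, operations = [], []
    for name, operation in protocols:
        if operation not in ("g", "r", "f"):
            raise ValueError("unknown operation %r for %s" % (operation, name))
        names.append(name)
        operations.append(operation)
    return [
        script, avinput, db_dir,
        "-buildver", buildver,
        "-out", out_prefix,
        "-remove",                         # delete intermediate files
        "-protocol", ",".join(names),
        "-operation", ",".join(operations),
        "-nastring", ".",                  # placeholder for missing values
    ]


cmd = build_table_annovar_cmd(
    "example.avinput", "humandb/", "example_anno",   # hypothetical paths and prefix
    [("refGene", "g"), ("cytoBand", "r"), ("dbnsfp30a", "f")],
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the script is on the PATH and the listed databases have been downloaded.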
5.Perform protein annotation
coding_change
6.Other tool: retrieve_seq_from_fasta
■ SnpEff
1. Install application and set environment variables
snpeff='E:/path/snpEff.jar'
2. Prepare annotation database
java -Xmx4g -jar $snpeff databases | less
java -Xmx4g -jar $snpeff download GRCh38.86
3. Prepare vcf input file
No special requirements for the input file.
4. Perform genomic annotation
java -Xmx4g -jar $snpeff -i vcf -o vcf GRCh38.86 example.vcf > example_snpeff.vcf
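SnpEff writes its results into the INFO column of the output VCF as an `ANN=` entry, with one comma-separated annotation per affected transcript and `|` between subfields. The Python sketch below reads the leading subfields of each annotation. The function name `parse_ann`, the key names in `ANN_FIELDS` and the example INFO string (gene and transcript IDs are made up) are illustrative assumptions; the authoritative field order is given in the `##INFO=<ID=ANN,...>` header line of your own output file.

```python
# Leading ANN subfields in their usual order; key names are this sketch's own.
ANN_FIELDS = [
    "allele", "annotation", "impact", "gene_name", "gene_id",
    "feature_type", "feature_id", "transcript_biotype", "rank",
    "hgvs_c", "hgvs_p",
]


def parse_ann(info):
    """Return one dict per ANN annotation found in a VCF INFO string."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            records = []
            for annotation in entry[len("ANN="):].split(","):
                values = annotation.split("|")
                # zip() stops at the shorter sequence, so trailing subfields are ignored.
                records.append(dict(zip(ANN_FIELDS, values)))
            return records
    return []


# Made-up INFO string for illustration: one variant annotated on two transcripts
example_info = (
    "DP=35;ANN="
    "T|missense_variant|MODERATE|GENE1|ENSG00000000001|transcript|ENST00000000001|"
    "protein_coding|7/11|c.743G>A|p.Arg248Gln|933/2512|743/1182|248/393||,"
    "T|upstream_gene_variant|MODIFIER|GENE1|ENSG00000000001|transcript|ENST00000000002|"
    "retained_intron||c.-120G>A|||||120|"
)
for rec in parse_ann(example_info):
    print(rec["feature_id"], rec["annotation"], rec["impact"])
```

Listing every transcript-level annotation side by side makes it easier to see why one variant can receive several different descriptions.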
■ Oncotator
Output can be written as a VCF file or as MAF (Mutation Annotation Format).
1. Install application and set environment variables
Oncotator is Python-based; once installed, the oncotator command can be run directly.
2. Prepare annotation database
oncotatordb='/path/to/oncotator_v1_ds_April052016'
3. Prepare input file
vcf OR tsv(a.k.a MAFLITE)
4. Perform genomic annotation
oncotator --db-dir $oncotatordb -i VCF example.vcf exampleOutput.tsv hg38 # -i VCF is needed because the default input format is MAFLITE
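For the TSV (MAFLITE) input option mentioned above, the following Python sketch writes a minimal tab-separated table with the five columns chr, start, end, ref_allele and alt_allele. The helper name `to_maflite` and the example variants are made up for illustration; chromosome naming and any additional columns should follow the Oncotator documentation for your datasource.

```python
import csv
import io

MAFLITE_COLUMNS = ["chr", "start", "end", "ref_allele", "alt_allele"]


def to_maflite(variants):
    """Serialize (chr, start, end, ref, alt) tuples as MAFLITE-style TSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(MAFLITE_COLUMNS)
    for variant in variants:
        if len(variant) != len(MAFLITE_COLUMNS):
            raise ValueError("each variant needs exactly 5 fields")
        writer.writerow(variant)
    return buffer.getvalue()


# Made-up variants: one SNV and one single-base deletion
text = to_maflite([("1", 100, 100, "A", "G"), ("2", 205, 205, "T", "-")])
print(text, end="")
```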
■ Visualization
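Whatever the result, it must be visualized in a genome browser together with the sequence alignment file, and the key findings checked by hand. Because so many transcripts exist, software annotation is sometimes inaccurate, and different programs describe the same mutation differently (several blog posts have raised this issue).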