Difference between revisions of "Variant Normalization"
Revision as of 05:08, 16 January 2020 (Thursday)
Introduction
The Variant Call Format (VCF) is a flexible file format specification that allows us to represent many different variant types, ranging from SNPs and indels to copy number variations. However, variant representation in VCF is non-unique for variants that have explicitly expressed reference and alternate sequences. Failing to recognize this will frequently result in inaccurate analyses.
On this wiki page, we describe a variant normalization procedure that is well defined for biallelic as well as multiallelic variants. We then provide a formal proof of the procedure's correctness.
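The non-uniqueness mentioned above can be made concrete with a small sketch. The helper `apply_variant` and the toy reference sequence below are hypothetical, invented purely for illustration; they are not part of any VCF library. The sketch applies four different (POS, REF, ALT) records to the same reference and checks that they all produce the same resulting sequence, that is, they all describe the same deletion of one `CA` unit from a `CA` repeat.

```python
def apply_variant(reference: str, pos: int, ref: str, alt: str) -> str:
    """Replace `ref` with `alt` at 1-based position `pos` of `reference`."""
    start = pos - 1  # VCF positions are 1-based
    observed = reference[start:start + len(ref)]
    if observed != ref:
        raise ValueError(f"REF {ref!r} does not match reference {observed!r} at {pos}")
    return reference[:start] + alt + reference[start + len(ref):]


# Toy reference (an assumption for this example): GGG, four CA units, GGG.
REFERENCE = "GGGCACACACAGGG"

# Four records that all delete a single CA unit from the repeat.
REPRESENTATIONS = [
    (3, "GCA", "G"),    # leftmost placement, shortest alleles
    (5, "ACA", "A"),    # same deletion placed further right in the repeat
    (9, "ACA", "A"),    # placed at the right end of the repeat
    (3, "GCAC", "GC"),  # leftmost placement, padded with an extra trailing base
]

haplotypes = {apply_variant(REFERENCE, pos, ref, alt)
              for pos, ref, alt in REPRESENTATIONS}

# All four records yield one and the same sequence.
assert len(haplotypes) == 1
print(haplotypes)
```

A tool that compares these records field by field would treat them as four distinct variants, which is the kind of error that normalization is meant to prevent.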
Definition
The normalization of a variant representation in VCF consists of two parts: parsimony and left alignment, which pertain to a variant's length and position, respectively.
Parsimony
In the context of variant representation, parsimony means representing a variant in as few nucleotides as possible without reducing the length of any allele to 0. It is a property of the lengths of a variant's alleles and is defined as follows: