A PharmGKB Accession ID is a unique identifier that is assigned to
objects (e.g. a gene, a PCR assay, a sample set, a subject, etc.)
in the PharmGKB. PharmGKB Accession IDs start with the letters
PA, followed by a number (e.g. PA128366314).
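The format described above can be checked mechanically. The following is a minimal sketch, assuming an Accession ID is exactly the letters PA followed by one or more digits; the function name `is_accession_id` is hypothetical and is not part of any PharmGKB tool.

```python
import re

# Assumed pattern: the letters "PA" followed by one or more digits.
ACCESSION_ID = re.compile(r"PA\d+")


def is_accession_id(value: str) -> bool:
    """Return True if value looks like a PharmGKB Accession ID."""
    return ACCESSION_ID.fullmatch(value) is not None
```

For example, `is_accession_id("PA128366314")` is true, while a submission identifier such as PS205469 does not match because it does not start with PA.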
When a submission is uploaded to the PharmGKB, new Accession IDs are generated for certain elements (elements that have pharmgkbId and localId attributes) in the submission. The list of new Accession IDs for the submission is sent to the submitter as well as made available on the PharmGKB website for future reference.
For a working example, the complete list of new PharmGKB Accession IDs generated for submission PS205469 (view XML) can be found here.
A localId is a unique identifier that is assigned to elements within a single XML document. It is an arbitrary value that is assigned by the submitter. The only rule a localId must follow is that no two XML elements can be assigned the same localId within a single XML document.
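The uniqueness rule can be verified before a document is submitted. This is a sketch only, using Python's standard `xml.etree.ElementTree`; the function name `duplicate_local_ids` and the `<submission>` wrapper element in the usage note are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_local_ids(xml_text: str) -> list[str]:
    """Return every localId value that appears on more than one element."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        el.get("localId") for el in root.iter() if el.get("localId") is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)
```

An empty list means that no two elements in the document share a localId. For instance, a hypothetical document `<submission><rflpAssay localId="A1"/><dhplcAssay localId="A1"/></submission>` would yield `["A1"]`.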
The primary motivation for providing a localId is to allow the element with the localId to be cross-referenced elsewhere in the XML document. For example, submitters will invariably want to assign assay elements a localId:
<rflpAssay localId="RFLP_ASSAY_1">
<interrogatedPosition>497</interrogatedPosition>
<targetGenotype>C</targetGenotype>
<method> ... </method>
</rflpAssay>
<dhplcAssay localId="DHPLC_ASSAY_1">
<analyzedRegion>
<startPosition>35</startPosition>
<stopPosition>102</stopPosition>
</analyzedRegion>
<poolSize>10</poolSize>
<baselineSampleSize>100</baselineSampleSize>
<method> ... </method>
</dhplcAssay>
so that they can be cross-referenced in the results of the assay for a particular subject:
<genotypingResult>
<assayXref resource="local">RFLP_ASSAY_1</assayXref>
<noResult />
</genotypingResult>
<dhplcResult>
<assayXref resource="local">DHPLC_ASSAY_1</assayXref>
<sameAsMajority>false</sameAsMajority>
</dhplcResult>
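Local cross-references like these only work if the referenced localId exists in the same document. The sketch below lists references that do not resolve. It assumes that any element carrying `resource="local"` is a cross-reference whose text is a localId; the function name `unresolved_local_xrefs` is hypothetical.

```python
import xml.etree.ElementTree as ET


def unresolved_local_xrefs(xml_text: str) -> list[str]:
    """Return the targets of local cross-references that match no localId."""
    root = ET.fromstring(xml_text)
    local_ids = {el.get("localId") for el in root.iter() if el.get("localId")}
    missing = []
    for el in root.iter():
        # Assumption: resource="local" marks a reference to a localId.
        if el.get("resource") == "local":
            target = (el.text or "").strip()
            if target not in local_ids:
                missing.append(target)
    return missing
```

An empty result means every local cross-reference points at an element defined in the same document.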
Another reason to provide a localId is to facilitate the mapping between a submitter's data and the PharmGKB Accession IDs assigned to it. See "What is a PharmGKB Accession ID?" for details on this.
A pharmgkbId is a unique identifier assigned by the PharmGKB (the aforementioned PharmGKB Accession ID). This attribute is used to refer to data that is already in the PharmGKB and is only supplied when editing or adding to such data.
This means that only a new element should be assigned a localId. Once it has been submitted, it should be referred to with a pharmgkbId. For example, the first time a submitter submits a subject, they may only know the subject's gender:
<subject localId="subject1">
<sex>Male</sex>
</subject>
Supposing subject1 was assigned PharmGKB Accession ID
PA126750995, if the submitter should obtain the
subject's race at a later date and wanted to update the subject's
data, they would be able to do so with:
<subject pharmgkbId="PA126750995">
<race>
<nihCategory>Asian</nihCategory>
</race>
</subject>
If you plan to use the same sample set across many submissions, it is highly recommended that you first make a separate submission that contains only your sample set. Once this has been approved, you can use the PharmGKB Accession ID assigned to your sample set in cross-references in subsequent submissions.
For example, if you had previously submitted a sample set that was
assigned PharmGKB Accession ID PA128366314, you could
refer to this sample set from within an experiment like so:
...
<experiment>
<pcrAssay>
...
</pcrAssay>
<sampleSetXref resource="PharmGKB">PA128366314</sampleSetXref>
<genotypesInSample>
...
</genotypesInSample>
</experiment>
...
Variant alleles can be reported at various locations in the
PharmGKB XML schema (e.g. within a <variant> element). In all circumstances,
the variant allele must be associated with a position along a
reference sequence and the variant allele will replace the base on
the reference sequence at this position. For example, if the
reference sequence is "ATGATGATG" and the variant is
reported at position 3 to be an A, then we expect the resulting
sequence to be "ATAATGATG".
The previous example demonstrates a simple SNP, which is
represented by a single DNA base; deletions are represented by a
dash ("-") for each base that is deleted; insertions
are represented by multiple DNA bases; and repeats are represented
with a (repeat)X notation, where the
repeat is the DNA sequence being repeated, and
X is the number of times the sequence is repeated.
To report a polymorphism where more than one base is changed, the
sequence must be enclosed in square brackets
(e.g. [atc]) to distinguish it from an insertion.
Note that the first base of the variant always replaces the base
at the specified position in the sequence.
The following examples are based on variant alleles reported
against position 3 on the reference sequence
"ATGATGATG".
| Type of Variant | XML Example | Reported Allele | Resulting Sequence |
|---|---|---|---|
| SNP | <allele>a</allele> | A | ATAATGATG |
| Multi-base Polymorphism | <allele>[atc]</allele> | [ATC] | ATATCGATG |
| Deletion | <allele>-</allele> | - | ATATGATG |
| Multi-base Deletion | <allele>[---]</allele> | [---] | ATGATG |
| Insertion (polymorphic insertion) | <allele>atc</allele> | ATC | ATATCATGATG |
| Insertion (CC inserted) | <allele>gcc</allele> | GCC | ATGCCATGATG |
| Repeat | <allele>(cc)2</allele> | (CC)2 | ATCCCCATGATG |
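The notation rules above can be summarized as a small function that applies one reported allele to a reference sequence. This is an illustrative sketch, not PharmGKB's implementation: the name `apply_allele` is hypothetical, and a repeat is assumed to be reported at a position where the reference contains no existing copy of that repeat, so it is treated like an insertion.

```python
import re


def apply_allele(reference: str, position: int, allele: str) -> str:
    """Apply one reported allele at a 1-based position of the reference."""
    i = position - 1
    allele = allele.upper()
    repeat = re.fullmatch(r"\(([ACGT]+)\)(\d+)", allele)
    if allele.startswith("[") and allele.endswith("]"):
        body = allele[1:-1]
        span = len(body)              # brackets: one reference base per symbol
        new = body.replace("-", "")   # dashes mark deleted bases
    elif repeat:
        span = 1                      # assumed: no existing repeat in the reference
        new = repeat.group(1) * int(repeat.group(2))
    else:
        span = 1                      # SNP, single-base deletion, or insertion
        new = allele.replace("-", "")
    return reference[:i] + new + reference[i + span:]
```

With the reference "ATGATGATG" and position 3, `apply_allele` gives "ATAATGATG" for `a`, "ATATCGATG" for `[atc]`, and "ATGCCATGATG" for `gcc`. In each case the first base of the variant replaces the base at the reported position.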
The next set of examples shows how these reported variants are
displayed on the web site. Again, these variants are being reported
against position 3 on the reference sequence "ATGATGATG".
| Type of Variant | XML Example | Reported Allele | Displayed on Web Site |
|---|---|---|---|
| Heterozygous SNP | <allele>a</allele><allele>c</allele> | AC | A/C |
| Heterozygous SNP [PCR] | <allele>a</allele> | A | A/G |
| Multi-base Polymorphism | <allele>a</allele><allele>[atc]</allele> | A[ATC] | A/[ATC] |
| Deletion [PCR] | <allele>c</allele><allele>-</allele> | C- | C/- |
| Deletion [PCR] | <allele>-</allele> | - | G/- |
| Multi-base Deletion | <allele>a</allele><allele>[---]</allele> | A[---] | A/[---] |
| Insertion (CC inserted) | <allele>g</allele><allele>gcc</allele> | GGCC | G/GCC |
| Repeat [PCR] | <allele>(cc)2</allele> | (CC)2 | G/(CC)2 |
When only one allele is reported and the assay was either a PCR assay or a sequencing assay in which the reference sequence is the default (see pcrAssay and sequencingAssay respectively for details), the other allele is automatically deduced and displayed on the website.
For more details on reporting a repeat (or VNTR), continue on to the next FAQ.
For an overview of reporting variants, see the FAQ above. This FAQ focuses on reporting repeats.
Scenario 1: If there is a repetitive DNA sequence in the reference sequence, and this repeat is reported to vary by allele (VNTR), the number of repeats in the reference sequence will be determined by doing an exact pattern match, and this number will be used as the default number of repeats.
Example: Reference sequence with 5 repeats beginning at position 8
Position: 1                                        40
          |                                         |
          CGGGACTGAT GATGATGATG ATGCCTATGC ACTTAGTCCA
Say 10 alleles were assayed: 6 were found to have 5
GAT repeats, as in the reference sequence,
2 alleles had 4 repeats, and 2 alleles had 6 repeats.
Case 1: <allele>(GAT)4</allele>
Case 2: <allele>(GAT)6</allele>
Case 3: <allele>(GAT)5</allele>
OR (more simply)
<noVariant />
When this file is processed, PharmGKB will observe that there is a varying repeat being reported at position 8. We will scan through the reference sequence and determine that there are 5 repeats in this sequence.
In Case 1, the 4 repeats replace the 5 repeats found in the reference sequence, yielding:
Position: 1                                     37
          |                                      |
          CGGGACTGAT GATGATGATG CCTATGCACT TAGTCCA
In Case 2, the 6 repeats replace the 5 repeats found in the reference sequence, yielding:
Position: 1                                        40
          |                                         |
          CGGGACTGAT GATGATGATG ATGATGCCTA TGCACTTAGT CCA
In Case 3, there is no change from the reference sequence, yielding:
Position: 1                                        40
          |                                         |
          CGGGACTGAT GATGATGATG ATGCCTATGC ACTTAGTCCA
On the variant page on the PharmGKB website, these results will be reported in the "Variant" column as:
(GAT)4/(GAT)5/(GAT)6
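Scenario 1 amounts to counting the exact copies of the repeat unit in the reference and then substituting the reported number of copies. The sketch below illustrates that reading. The names `count_repeats` and `apply_repeat` are hypothetical, and the code covers only the case where the reference already contains the repeat at the reported position.

```python
def count_repeats(reference: str, position: int, unit: str) -> int:
    """Count consecutive exact copies of unit starting at a 1-based position."""
    i, n = position - 1, 0
    while reference.startswith(unit, i):
        n += 1
        i += len(unit)
    return n


def apply_repeat(reference: str, position: int, unit: str, copies: int) -> str:
    """Replace the repeat run found in the reference with the reported copies."""
    start = position - 1
    end = start + count_repeats(reference, position, unit) * len(unit)
    return reference[:start] + unit * copies + reference[end:]
```

For the 40-base reference above, `count_repeats(reference, 8, "GAT")` finds 5 copies, and `apply_repeat(reference, 8, "GAT", 4)` produces the 37-base sequence shown for Case 1.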
Scenario 2: If there is NO repetitive DNA sequence in the reference sequence, but repeats are reported at a particular position, the repeats are considered insertions into the reference sequence.
Example: Reference sequence with no repeats at position 8
Position: 1                             30
          |                              |
          CGGGACTCCT ATGCCTATGC ACTTAGTCCA
Say 10 alleles were assayed: 6 were found to have no repeats, as in the
reference sequence, 2 alleles had 4 repeats of GAT
at position 8, and 2 alleles had 8 repeats at position 8.
Case 1: <allele>(GAT)4</allele>
Case 2: <allele>(GAT)8</allele>
Case 3: <noVariant />
When this file is processed, PharmGKB will observe that though there is a repeat being reported at position 8 for some alleles, there is no such repeat in the reference sequence at that position.
In Case 1, the 4 repeats replace the base found at position 8 in the reference sequence, yielding:
Position: 1                                        40
          |                                         |
          CGGGACTGAT GATGATGATC TATGCCTATG CACTTAGTCC A
In Case 2, the 8 repeats replace the base found at position 8 in the reference sequence, yielding:
Position: 1                                        40
          |                                         |
          CGGGACTGAT GATGATGATG ATGATGATGA TCTATGCCTA TGCACTTAGT CCA
In Case 3, there is no change from the reference sequence, yielding:
Position: 1                             30
          |                              |
          CGGGACTCCT ATGCCTATGC ACTTAGTCCA
On the variant page on the PharmGKB website, these results will be reported in the "Variant" column as:
C/(GAT)4/(GAT)8
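In Scenario 2 the reported repeats replace the single base at the reported position, just as an insertion would. A minimal sketch of that behaviour follows; the function name `insert_repeat` is hypothetical, and the check that rejects a reference already containing the repeat is an assumption added to keep the two scenarios apart.

```python
def insert_repeat(reference: str, position: int, unit: str, copies: int) -> str:
    """Replace the base at a 1-based position with unit repeated copies times."""
    i = position - 1
    if reference.startswith(unit, i):
        # Assumed guard: an existing repeat means Scenario 1 applies instead.
        raise ValueError("reference already contains the repeat at this position")
    return reference[:i] + unit * copies + reference[i + 1:]
```

For the 30-base reference above, `insert_repeat(reference, 8, "GAT", 4)` replaces the C at position 8 with GATGATGATGAT and gives the 41-base sequence shown for Case 1.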
The readingFrameStartPosition defines the position of the first base in the first codon in a coding region on the reference sequence. It is used to determine the amino acid a variant codes for and is therefore only applicable to coding features (i.e. exons).
In the sequence below, intronic regions are denoted in lower case
and exons are denoted in upper case. The first base (the
left-most c) is position 1.
Position:    4                     27
             |                      |
          cccGATGATGATGATGccccccccccTATGATGATGaaa
If the reading frame begins at position 4 for the first exon and
at position 27 for the second exon, then the codons for this
sequence would be GAT GAT GAT GAT GTA TGA TGA TG.
If, however, the reading frame begins at position 5 for the first
exon, then the codons for this sequence would be ..G ATG
ATG ATG ATG TAT GAT GAT G.., which would change the amino
acid translation of the sequence.
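The effect of the start position on the codons can be reproduced with a few lines of code. This sketch assumes that exonic bases are upper case, that exons are joined before codons are read (as the codon lists above imply), and that a single start position in the first exon sets the frame. The function name `codons` is hypothetical.

```python
def codons(sequence: str, reading_frame_start: int) -> list[str]:
    """Split the exonic (upper-case) bases into codons from a 1-based start."""
    # Keep only upper-case (exonic) bases at or after the start position.
    coding = "".join(b for b in sequence[reading_frame_start - 1:] if b.isupper())
    return [coding[i:i + 3] for i in range(0, len(coding), 3)]
```

With the sequence above, a start position of 4 yields GAT GAT GAT GAT GTA TGA TGA TG, and a start position of 5 yields ATG ATG ATG ATG TAT GAT GAT G. The sketch drops the base before the start position rather than showing it as a partial codon.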