An experienced reviewer periodically searched the literature and reviewed each abstract
and publication to determine if criteria for inclusion were met. The search terms
in the Table below were applied. Annual checks were also made against other existing
GWAS databases to ensure publications were not missed.
Occasional searches of Google Scholar were conducted with the search terms
above, which identified a few additional studies, including some in journals not indexed
by PubMed. If articles or portions of their supplementary material were missing
or unavailable, authors were contacted with reprint requests in order to determine
if the study met inclusion criteria. All GWAS included in GRASP Build 2.0.0.0 were
first published on or before July 8, 2013.
We define GWAS as studies that reported testing ≥25,000 human genetic markers for
one or more traits. We exclude the following: CNV-only studies,
replication/follow-up studies testing <25,000 markers, non-human-only studies, articles
not available in English, gene-environment or gene-gene GWAS in which single-SNP main
effects are not given, linkage-only studies, aCGH/LOH-only studies, studies presenting
only gene-based or pathway-based results, heterozygosity/homozygosity (genome-wide
or long-run) studies, simulation-only studies, and studies that we judge redundant
with prior studies because they neither include substantial new samples
nor report new results (e.g., many methodological papers on the WTCCC and FHS GWAS).
Data extraction was performed by experienced researchers. QUOSA (Waltham, MA)
was used to automate the download of article PDFs where possible. Supplementary
text, tables and figures were all examined to determine if they contained any genetic
marker results. All manuscript materials were manually scanned to determine if there
was indication of additional GWAS results available at an external site or database,
in which case attempts were made to download all available results. If obtaining
GWAS results required institutional approvals or other extensive applications, these
were not pursued.
Individual genetic variant results were excluded if they were gene-gene, gene-environment
or haplotype-based association results where the main effects and/or single variant
results were not clearly given. Results clearly indicated as re-reporting
or lookup of past results were also excluded. ALL remaining results with P≤0.05
from ALL manuscript materials including text, figures and supplements were extracted
(either P-value based tests or other significance tests, e.g., FDR values). When
an association statistic could be clearly linked with an effect estimate (e.g.,
beta and standard error), or an odds ratio or hazard ratio and/or 95% confidence interval,
this information was extracted. Currently this information is not included in the
public version of GRASP due to potential privacy concerns (see Johnson et al., PLoS
Genet 2011 and related articles). Modeled or effect alleles were not recorded because
they were not consistently available or clearly reported. The exact source of
each result (e.g., Table S5, Supplementary Text, Figure 3, WebData/FullScan) was
recorded so that results can be located rapidly in their original context.
If an individual SNP was reported as being associated with multiple phenotypes in
a given publication, then ALL such SNP-phenotype associations were recorded as long
as the phenotypes were distinct (e.g., LDL cholesterol in all samples versus LDL cholesterol
in men, or adult height versus height growth from age 4-14, are treated as distinct
associations). The same phenotype being associated in different ethnogeographic
populations was not considered distinct (e.g., LDL cholesterol in European ancestry
and LDL cholesterol in African ancestry are considered the same phenotype). When
multiple associations were reported for the same SNP-phenotype combination, the SNP-phenotype
association with the lowest P-value was retained.
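The retention rule above can be illustrated with a minimal Python sketch. The record layout (PubMed ID, SNP, phenotype label, P-value), the identifiers and the function name are hypothetical and are used only for illustration; grouping per publication is our reading of the rule, not a documented implementation detail.

```python
# Hypothetical record layout: (pmid, snp, phenotype, p_value).
def keep_lowest_p(records):
    """Keep one record per (pmid, snp, phenotype): the one with the smallest P-value."""
    best = {}
    for rec in records:
        pmid, snp, phenotype, p_value = rec
        key = (pmid, snp, phenotype)
        # Replace the stored record only when the new one has a smaller P-value.
        if key not in best or p_value < best[key][3]:
            best[key] = rec
    return list(best.values())

# Made-up records: the first two share a phenotype (different ancestry samples),
# the third is a distinct phenotype and is therefore kept separately.
records = [
    ("100", "rs1", "LDL cholesterol", 3e-8),
    ("100", "rs1", "LDL cholesterol", 2e-5),
    ("100", "rs1", "LDL cholesterol in men", 4e-3),
]
deduplicated = keep_lowest_p(records)
```

In this sketch the two "LDL cholesterol" rows collapse to the one with the smaller P-value, while "LDL cholesterol in men" survives as its own association.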
A set of information was collected at the level of each individual article. Basic
information about the article was collected (PubMed identifier, date of first publication,
journal, article title). On September 15, 2014, studies in GRASP were compared
by PubMed ID against the NHGRI GWAS catalog to record whether each GRASP study was
included there. One or more overall phenotype descriptions were assigned to each article based
on the GWAS conducted. Genotyping platforms (Affymetrix, Illumina, Perlegen, Sequenom
or other combinations) were recorded. The number of SNP markers included in post-QC
analyses was recorded if clearly given; otherwise it was approximated based
on the SNP array(s) or imputation.
Samples in the GWAS discovery phase and replication phase(s) were described, including
the number of samples in each ethnogeographic population and an ethnogeographic sample
description (if clearly indicated). When mixed samples were described with percentages
in each sub-group, or in prior publications, an attempt was made to calculate the
number of samples in each group. If replication samples were not reported, this is
indicated as NR. Total sample numbers in discovery and replication and across the
whole study were calculated, as well as within individual ethnogeographic groups.
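As a rough illustration of converting reported percentages into per-group sample numbers, the following Python sketch splits a total into whole-number counts. The largest-remainder rounding, the tolerance check, the function name and the example figures are all assumptions made for this sketch; the text above does not specify how rounding was handled.

```python
def counts_from_percentages(total, percentages):
    """Split a total sample size into integer per-group counts.

    Uses largest-remainder rounding (an assumption of this sketch) so that
    the per-group counts always add up to the reported total.
    """
    if abs(sum(percentages.values()) - 100.0) > 0.5:
        raise ValueError("percentages should sum to approximately 100")
    exact = {group: total * pct / 100.0 for group, pct in percentages.items()}
    counts = {group: int(value) for group, value in exact.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining samples to the groups with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda g: exact[g] - counts[g], reverse=True)
    for group in by_fraction[:leftover]:
        counts[group] += 1
    return counts

# Made-up example: 1000 samples reported as 62.5% / 25% / 12.5%.
group_counts = counts_from_percentages(
    1000, {"European": 62.5, "African": 25.0, "Hispanic": 12.5}
)
```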
Gender-specific GWAS analyses were identified on the basis of phenotypes (e.g.,
male pattern baldness, prostate cancer) or on the basis of GWAS sample description
(e.g., Women’s Genome Health Study). If a publication included exclusively gender-specific
analyses, this was recorded. If a study included either SOME or ALL gender-specific
analyses, this was recorded separately.
Phenotype category labels were assigned to each study (e.g., CVD RF, Lipids) by
several of the researchers involved (ADJ, JDE, RL). When extensive results were available
(e.g., full scan results), Perl programs were often written to filter results to
those with P≤0.05. If dbSNP identifiers (rsIDs) were not given for associations,
then other information such as array-specific IDs, chromosome/position/genome build
and alleles was recorded. All identifiers (rsID or otherwise) were used to map
to current genome and dbSNP builds via custom programs or through bioinformatic
processes (e.g., UCSC LiftOver).
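The P≤0.05 filtering step was done with Perl programs; as an illustration only, an equivalent filter can be sketched in Python. The column names ("SNP", "P"), the tab delimiter and the sample rows are hypothetical, since full-scan files differ from study to study.

```python
import csv
import io

def filter_significant(rows, p_column="P", threshold=0.05):
    """Yield rows whose P-value parses as a number and is <= threshold."""
    for row in rows:
        try:
            p_value = float(row[p_column])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric P-value (e.g. "NA")
        if p_value <= threshold:
            yield row

# Made-up full-scan excerpt; a real file would be opened with open(...) instead.
full_scan = io.StringIO("SNP\tP\nrs1\t0.2\nrs2\t0.05\nrs3\t3e-9\nrs4\tNA\n")
kept = list(filter_significant(csv.DictReader(full_scan, delimiter="\t")))
```

Because the function is a generator over rows, it can be applied to very large full-scan files without loading them into memory.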
Upon initial completion of all data extraction, a list of unique phenotype labels
was reviewed to identify phenotype synonyms (e.g., “ApoA-I”, “ApoA1”, “ApoAI”, “Apolipoprotein
AI”, “apo A-I”). A Perl program using regular expression matching was employed to
find and replace phenotype labels in order to increase harmonization while retaining
the meaning (e.g., units) of individual results. Additional Perl programs
were used to flag potentially redundant results within the database, either within
individual publications or across publications.
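The harmonization itself was done in Perl; the Python sketch below only illustrates the kind of regular-expression find-and-replace described above, using the apolipoprotein A-I synonyms from the example. The pattern, the canonical label "ApoA-I" and the function name are assumptions made for this sketch, not the rules actually used.

```python
import re

# Hypothetical rule: collapse spelling variants of apolipoprotein A-I to one label.
# The trailing \b stops the pattern from matching "ApoA-II" or "ApoA-IV".
APOA1_PATTERN = re.compile(
    r"\bapo(?:lipoprotein)?[\s-]*A[\s-]*(?:1|I)\b", re.IGNORECASE
)

def harmonize(label):
    """Replace recognized synonyms with the canonical label, leaving the rest intact."""
    return APOA1_PATTERN.sub("ApoA-I", label)

variants = ["ApoA-I", "ApoA1", "ApoAI", "Apolipoprotein AI", "apo A-I"]
harmonized = [harmonize(v) for v in variants]
```

Only the matched synonym is rewritten, so any text that follows it in a label, such as units, is retained.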
The final catalog of results and their genomic positions was annotated with
gene, SNP and other feature annotations (see below). Study-level information was
integrated with individual results via the PubMed study identifier.
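Integrating study-level information with individual results amounts to a join on the PubMed identifier. A minimal Python sketch follows; the field names ("pmid", "journal", "snp", "p") and the example values are hypothetical and do not reflect the actual GRASP schema.

```python
def attach_study_info(results, studies):
    """Merge each result with the study-level record that shares its PubMed ID."""
    by_pmid = {study["pmid"]: study for study in studies}
    merged = []
    for result in results:
        # Results without a matching study record are kept unchanged.
        study = by_pmid.get(result["pmid"], {})
        # Result fields take precedence if a field name appears in both records.
        merged.append({**study, **result})
    return merged

# Made-up example records.
studies = [{"pmid": "100", "journal": "Example Journal"}]
results = [
    {"pmid": "100", "snp": "rs1", "p": 1e-6},
    {"pmid": "200", "snp": "rs2", "p": 0.01},
]
merged = attach_study_info(results, studies)
```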
All SNP associations in the Full Download version are mapped to the genome [hg19]
build and reference SNP database [dbSNP build 141]. All SNPs and Genes in the Query
Search are based on current NCBI builds and may differ from the Download version.
Individual SNP-phenotype entries in GRASP have been annotated for assigned SNP function
[dbSNP functional classifications], location within or near protein-coding
genes [based on RefSeq genes], non-coding RNAs [based on Cabili PMID 21890647],
microRNAs [based on miRBase version 18, Kozomara PMID 21037258], microRNA target
binding sites [based on PolymiRTS, Ziebarth PMID 22080514], validated human enhancer
regions [based on Vista Enhancers, Visel PMID 17130149] and other known regulatory
elements [based on ORegAnno, Griffith PMID 18006570], amino acid changes and their
predicted consequences [PolyPhen2, Adzhubei PMID 20354512; SIFT, Kumar PMID 19561590;
LRT, Chun PMID 19602639] and post-translational modifications and other protein
functional features [based on mapping of UniProt features to amino acid positions
in the current genome build, UniProt PMID 21051339].
An NHLBIkey unique to each individual result was generated in the final database
by concatenating the PubMed ID and the row number. A “CreationDate” and “LastCurationDate”
were assigned to each result upon its creation in GRASP. In the event that entries
are edited in later Builds, the LastCurationDate will no longer match the CreationDate,
but the NHLBIkey will remain the same.
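The key and date behaviour described above can be sketched in Python as follows. The class and function names are hypothetical, the PubMed ID is a placeholder, and joining the two parts with no separator or padding is an assumption, since the exact key format is not specified here.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ResultEntry:
    nhlbi_key: str
    creation_date: date
    last_curation_date: date

def create_entry(pubmed_id, row_number, today):
    """Create an entry whose key is the PubMed ID followed by the row number."""
    key = f"{pubmed_id}{row_number}"  # assumption: plain concatenation, no separator
    return ResultEntry(key, today, today)

def curate(entry, today):
    """Record an edit: only the last-curation date changes."""
    entry.last_curation_date = today
    return entry

# Placeholder PubMed ID and dates, for illustration only.
entry = create_entry("12345678", 7, date(2014, 1, 1))
curate(entry, date(2015, 1, 1))
```

After the edit, the key and creation date are unchanged and only the last-curation date differs, which is what lets an entry be tracked across Builds.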
We have assigned our own phenotype categories to facilitate searching and categorization.
Some studies conducted GWAS but do not report any SNP-specific results in their
manuscripts or in readily accessible supplemental materials. These studies
are included in the overall GWAS study list, but no results will be found for them.
A separate list of these studies is maintained