Methods & Resources

Search Approach for GWAS

An experienced reviewer periodically searched the literature, applying the search terms in the Table below, and reviewed each abstract and publication to determine whether the inclusion criteria were met. Annual checks were also made against other existing GWAS databases to ensure that publications were not missed.

Admixture mapping                      eQTL                       Genome-wide significance  iQTL
Affymetrix Axiom                       EWAS                       Genome-wide significant   Metabochip
Affymetrix SNP                         Exome array                Genomic architecture      mQTL
Affymetrix and "World array"           Exome chip                 GWA                       meQTL
Association scan                       Expression QTL             Human 20k cSNP            Omni Chip
BeadChip                               "Gene chip" and "SNP"      IBC array                 "Omni" and "Exome"
"Candidate Gene Association Resource"  Genetic architecture       iCOGs                     pQTL
Cardio-Metabochip                      Genome scan                Illumina HumanHap         rQTL
Cardio Metabochip                      Genome scanning            Illumina NeuroX array     SNP array
Cardio Metabo Chip                     Genomewide association     Illumina Omni             sQTL
CytoSNP                                Genome-wide association    ImmunoChip                WGA
DMET                                   Genomewide meta-analysis   Immuno Chip               Whole genome association
Epi-genome wide                        Genome-wide meta-analysis  iSelect

Occasional searches of Google Scholar were conducted with the search terms above; these identified a few additional studies, including some in journals not indexed by PubMed. If articles or portions of their supplementary material were missing or unavailable, authors were contacted with reprint requests in order to determine whether the study met inclusion criteria. All GWAS included in GRASP Build 2.0.0.0 were first published on or before July 8, 2013.

Study Inclusion and Exclusion Criteria

We define GWAS as studies that reported testing ≥25,000 human genetic markers for one or more traits. We exclude the following: CNV-only studies, replication/follow-up studies testing <25,000 markers, non-human-only studies, articles not available in English, gene-environment or gene-gene GWAS where single-SNP main effects are not given, linkage-only studies, aCGH/LOH-only studies, studies presenting only gene-based or pathway-based results, heterozygosity/homozygosity (genome-wide or long run) studies, simulation-only studies, and studies that we judge to be redundant with prior studies because they do not add a significant number of new samples or report new results (e.g., many methodological papers on the WTCCC and FHS GWAS).

GWAS Data Extraction

GWAS data extraction was performed by experienced researchers. QUOSA (Waltham, MA) was used to automate the download of article PDFs where possible. Supplementary text, tables and figures were all examined to determine whether they contained any genetic marker results. All manuscript materials were manually scanned for any indication that additional GWAS results were available at an external site or database, in which case attempts were made to download all available results. If obtaining GWAS results required institutional approvals or another extensive application process, the results were not pursued.

Individual genetic variant results were excluded if they were gene-gene, gene-environment or haplotype-based association results for which the main effects and/or single-variant results were not clearly given. Results clearly indicated as re-reporting or lookup of past results were also excluded. ALL remaining results with P≤0.05 from ALL manuscript materials, including text, figures and supplements, were extracted (whether from P-value based tests or other significance tests, e.g., FDR values). When an association statistic could be clearly linked with an effect estimate (e.g., beta and SEM) or with an odds ratio or hazard ratio and/or 95% confidence interval, this information was extracted. Currently this information is not included in the public version of GRASP due to potential privacy concerns (see Johnson et al., PLoS Genet 2011 and related articles). Modeled or effect alleles were not recorded because they were not consistently available or clearly reported. The exact source of each result (e.g., Table S5, Supplementary Text, Figure 3, WebData/FullScan) was recorded so that results can be located rapidly in their original context.

If an individual SNP was reported as being associated with multiple phenotypes in a given publication, then ALL such SNP-phenotype associations were recorded as long as the phenotypes were distinct (e.g., LDL cholesterol in all samples and LDL cholesterol in men are treated as distinct associations, as are adult height and height growth from age 4-14). The same phenotype associated in different ethnogeographic populations was not considered distinct (e.g., LDL cholesterol in European ancestry and LDL cholesterol in African ancestry are considered the same phenotype). When multiple associations were reported for the same SNP-phenotype combination, the association with the lowest P-value was retained.
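The retention rule in the last sentence above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used to build GRASP: the record fields (snp, phenotype, p_value, source) and the function name keep_lowest_p are hypothetical, and the rsIDs and P-values are made up for the example.

```python
from typing import NamedTuple


class Association(NamedTuple):
    """One extracted result; the field names are illustrative assumptions."""
    snp: str
    phenotype: str
    p_value: float
    source: str


def keep_lowest_p(results):
    """Keep, for each (SNP, phenotype) pair, the record with the lowest P-value."""
    best = {}
    for rec in results:
        key = (rec.snp, rec.phenotype)
        if key not in best or rec.p_value < best[key].p_value:
            best[key] = rec
    return list(best.values())


# Made-up records, for illustration only.
example = [
    Association("rs0000001", "LDL cholesterol", 3e-8, "Table 1"),
    Association("rs0000001", "LDL cholesterol", 5e-6, "Table S5"),
    Association("rs0000001", "LDL cholesterol in men", 2e-4, "Table S5"),
]
retained = keep_lowest_p(example)
```

Ancestry is deliberately left out of the grouping key, since the same phenotype in different ethnogeographic populations is not treated as distinct, whereas "LDL cholesterol in men" is kept as its own phenotype.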

Study-level Data

A set of information was collected at the level of each individual article. Basic information about the article was collected (PubMed identifier, date of first publication, journal, article title). On September 15, 2014, studies in GRASP were compared by PubMed ID against the NHGRI GWAS catalog to record whether each GRASP study was also included there. One or more overall phenotype descriptions were assigned to each article based on the GWAS conducted. Genotyping platforms (Affymetrix, Illumina, Perlegen, Sequenom or other combinations) were recorded. The number of SNP markers included in post-QC analyses was recorded if clearly given; otherwise it was approximated based on the SNP array(s) or imputation.

The samples in the GWAS discovery phase and replication phase(s) were described, including the number of samples in each ethnogeographic population and an ethnogeographic sample description (if clearly indicated). When mixed samples were described with percentages in each sub-group, either in the article or in prior publications, an attempt was made to calculate the number of samples in each group. If replication samples were not reported, this is indicated as NR. Total sample numbers were calculated for discovery, for replication and across the whole study, as well as within individual ethnogeographic groups. Gender-specific GWAS analyses were identified on the basis of phenotypes (e.g., male pattern baldness, prostate cancer) or on the basis of GWAS sample description (e.g., Women’s Genome Health Study). If a publication included exclusively gender-specific analyses, this was recorded. If a study included either SOME or ALL gender-specific analyses, this was separately recorded.

Phenotype category labels were assigned to each study (e.g., CVD RF, Lipids) by several of the researchers involved (ADJ, JDE, RL). When extensive results were available (e.g., full scan results), Perl programs were often written to filter results to those with P≤0.05. If dbSNP identifiers (rsIDs) were not given for associations, then other information such as array-specific IDs, chromosome/position/genome build and alleles was recorded. All identifiers (rsID or otherwise) were used to map to current genome and dbSNP builds via custom programs or through bioinformatic processes (e.g., UCSC LiftOver).
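The P≤0.05 filtering of full scan results described above was done with Perl programs that are not reproduced here. As a rough Python sketch of the same idea, assuming a tab-delimited file with a header row and a P-value column named "P" (the column names, the function name filter_significant and the example rows are all hypothetical):

```python
import csv
import io

P_THRESHOLD = 0.05  # inclusion threshold stated in the text (P <= 0.05)


def filter_significant(lines, p_column="P", delimiter="\t"):
    """Yield rows whose P-value parses as a number and is <= P_THRESHOLD."""
    for row in csv.DictReader(lines, delimiter=delimiter):
        try:
            p = float(row[p_column])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric P-value (e.g. "NA")
        if p <= P_THRESHOLD:
            yield row


# Made-up full-scan excerpt, for illustration only.
full_scan = io.StringIO(
    "SNP\tP\n"
    "rs0000001\t0.0004\n"
    "rs0000002\t0.51\n"
    "rs0000003\tNA\n"
    "rs0000004\t0.05\n"
)
kept = [row["SNP"] for row in filter_significant(full_scan)]
```

Because the function streams rows rather than loading the whole file, the same loop could be pointed at an open file handle for a large full-scan download.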

Upon initial completion of all data extraction, a list of unique phenotype labels was reviewed to identify phenotype synonyms (e.g., “ApoA-I”, “ApoA1”, “ApoAI”, “Apolipoprotein AI”, “apo A-I”). A Perl program using regular expression matching was employed to find and replace phenotype labels in order to increase harmonization while retaining the meaning (e.g., retaining units) of individual results. Additional Perl programs were used to flag potentially redundant results within the database, either within individual publications or across publications.
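The find-and-replace harmonization described above was done in Perl; the sketch below shows the same kind of regular-expression substitution in Python for the apolipoprotein A-I synonyms listed. The canonical label "Apolipoprotein A-I", the rule table SYNONYM_RULES and the function harmonize are assumptions made for illustration, not the labels or code actually used in GRASP.

```python
import re

# Hypothetical synonym table: (compiled pattern, canonical label).
# The pattern covers the spellings listed in the text and ends with a word
# boundary so that labels such as "ApoA-II" are left untouched.
SYNONYM_RULES = [
    (
        re.compile(
            r"\b(?:apolipoprotein|apo)\s*-?\s*A\s*-?\s*(?:I|1)\b",
            re.IGNORECASE,
        ),
        "Apolipoprotein A-I",
    ),
]


def harmonize(label):
    """Replace known synonyms with the canonical label, keeping other text (e.g. units)."""
    for pattern, canonical in SYNONYM_RULES:
        label = pattern.sub(canonical, label)
    return label


variants = ["ApoA-I", "ApoA1", "ApoAI", "Apolipoprotein AI", "apo A-I"]
harmonized = {harmonize(v) for v in variants}
with_units = harmonize("ApoA1 (mg/dL)")
```

Substituting only the matched span, rather than replacing the whole label, is what lets qualifiers such as units survive the harmonization.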

The final catalog of results and their genomic positions was used to annotate against gene, SNP and other feature annotations (see below). Study-level information was integrated with individual results via the PubMed study identifier.

Data Annotation

All SNP associations in the Full Download version are mapped to the genome build [hg19] and reference SNP database [dbSNP build 141]. All SNPs and Genes in the Query Search are based on current NCBI builds and may differ from the Download version. Individual SNP-phenotype entries in GRASP have been annotated for assigned SNP function [dbSNP functional classifications], location within or near protein-coding genes [based on RefSeq genes], non-coding RNAs [based on Cabili PMID 21890647], microRNAs [based on miRBase version 18, Kozomara PMID 21037258], microRNA target binding sites [based on PolymiRTS, Ziebarth PMID 22080514], validated human enhancer regions [based on Vista Enhancers, Visel PMID 17130149] and other known regulatory elements [based on ORegAnno, Griffith PMID 18006570], amino acid changes and their predicted consequences [PolyPhen2, Adzhubei PMID 20354512; SIFT, Kumar PMID 19561590; LRT, Chun PMID 19602639], and post-translational modifications and other protein functional features [based on mapping of UniProt features to amino acid positions in the current genome build, UniProt PMID 21051339].

Database Structure

An NHLBIkey unique to each individual result was generated in the final database by concatenating the PubMed ID and the row number. A “CreationDate” and a “LastCurationDate” were assigned to each result upon its creation in GRASP. If an entry is edited in a later Build, its LastCurationDate will no longer match its CreationDate, but its NHLBIkey will remain the same.
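The key and date behaviour described above can be sketched as a small Python record. This is only an illustration: it assumes the concatenation is a plain string join with no separator or padding (the text does not specify this), and the class, function names, PubMed ID and dates are hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ResultRecord:
    """One result row; the attribute names mirror the fields named in the text."""
    nhlbi_key: str
    creation_date: date
    last_curation_date: date

    def mark_curated(self, when):
        """Record an edit: only the curation date changes, the key is untouched."""
        self.last_curation_date = when


def make_nhlbi_key(pubmed_id, row_number):
    """Assumed form of the key: PubMed ID followed directly by the row number."""
    return f"{pubmed_id}{row_number}"


def create_record(pubmed_id, row_number, created):
    """Both dates start out equal when the result is first created."""
    return ResultRecord(make_nhlbi_key(pubmed_id, row_number), created, created)


# Placeholder PubMed ID and dates, for illustration only.
record = create_record(pubmed_id=12345678, row_number=42, created=date(2014, 1, 1))
key_before = record.nhlbi_key
record.mark_curated(date(2014, 9, 15))
```

After the edit, the two dates differ while the key is unchanged, which is what allows a result to be tracked across Builds.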

Phenotype Terminology Assignment

We have assigned our own phenotype categories to facilitate searching and categorization.

Studies with No Available Results

Some studies conducted GWAS but do not report any SNP-specific results in their manuscripts or in readily accessible supplemental materials. These studies are included in the overall GWAS study list but have no results in the database. A separate list of these studies is maintained here.