Natural selection of immune and metabolic genes associated with health in two lowland Bolivian populations

Significance Humans have adapted to diverse environments, but few studies have linked regions of the genome with evidence of selection to phenotypes. We generated an integrative dataset of genomic, transcriptomic, and biomarker data for the Tsimane and the Moseten—two indigenous Amerindian populations in Bolivia. From analyses focused on Tsimane individuals, we found evidence for adaptation at genes and traits involved in immunity, which makes sense given that infection has strong effects on physiology and fitness in this population. Using phenotypic data from both populations, we were able to link genotypes to the immune and metabolic traits that are potentially advantageous. This study expands our knowledge of natural selection in Amerindians and uncovers previously undescribed loci of evolutionary and biomedical relevance.


Supplementary Information for Lea et al. "Natural selection of immune and metabolic genes associated with health in two lowland Bolivian populations"
Supplementary items in this document: • • Dataset S1. Unadmixed Peruvian samples from 1000 Genomes used for selection analyses • Dataset S2. Candidate regions with evidence for positive selection (not collapsed) • Dataset S3. Candidate regions with evidence for positive selection (collapsed) • Dataset S4. Parameters used for demographic modeling • Dataset S5. Genes overlapping or near candidate regions • Dataset S6. Enrichment of genes that overlap candidate regions for trait associations in published GWAS • Dataset S7. Sample sizes and demographic overview for the phenotypic dataset • Dataset S8. Significant results from linear mixed effects models testing for eQTL in candidate regions • Dataset S9. Significant results from linear mixed effects models testing for genotypephenotype associations in candidate regions • Dataset S10. Tag SNPs in candidate regions associated with phenotypic traits • Dataset S11. Results from polygenic selection analyses • Dataset S12. Results from polygenic selection analyses

Genotype data filtering and processing
We used Plink [1] to remove the following SNPs from our total dataset of n=1286: nonautosomal SNPs, SNPs that were not biallelic, SNPs that were not in Hardy-Weinberg equilibrium (p<10 -8 ), and SNPs that were not genotyped across >90% of individuals. This filtering left us with 1,651,754 SNPs. We also removed 5 samples with call rates <90%, as well as 15 individuals with genome-wide heterozygosity values that were >3 standard deviations above or below the mean value for the sample set (as in [2]). For samples that were run in duplicate, we retained the replicate with the higher call rate.
Using these SNP and sample sets, we next performed analyses to create a dataset of unrelated Tsimane individuals for evolutionary inference. First, we removed SNPs with MAF<5% and sites in linkage disequilibrium. Specifically, we used the indep-pairwise function in Plink [1] to scan windows of 50kb with a 20kb offset, and to randomly prune variants within each window so that no pair exceeded an R 2 threshold of 0.8. We then estimated pairwise relatedness for all samples using PC-Relate, which performs well in sample sets with population structure [3]. We followed the workflow provided here: https://bioconductor.org/packages/release/bioc/vignettes/GENESIS/inst/doc/pcair.html. Using the relatedness estimates from PC-Relate and functions provided by the program, we randomly pruned the dataset so that no individuals remaining in the dataset exhibited a kinship coefficient >0.125 (corresponding to third degree relatives). This left us with 203 Tsimane individuals.
Next, we used GenotypeHarmonizer [4] to combine this reduced dataset with the 1000 Genomes Phase 3 call set [5] (downloaded from ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/). This harmonization is necessary when combining genotype data generated via different platforms and pipelines, such that they are stored using potentially different or unknown strands. We excluded A/T and C/G SNPs that could not be unambiguously merged between the two datasets using the "ambiguousSnpFilter" option. This process resulted in 693,720 biallelic SNPs, which were then phased using Shape-IT v2.r904 [6] and the 1000 Genomes Phase 3 call set as a reference panel as well as the 1000 Genomes Phase 3 genetic map. We note that ideally a population specific reference panel and genetic map would be used for phasing, but given that none exist, we followed the common practice of using the 1000 Genomes dataset for analyses of genetically uncharacterized populations [7,8].

Principal components and admixture analyses
We merged our filtered, phased data with: 1) the 1000 Genomes Phase 3 call subset for Peruvians (the least admixed Native American population), Han Chinese (who are often used as the best available ancestral reference for Amerindians [8,9]), Yorubans (to represent African ancestry), and British (to represent European ancestry) and 2) all South and Central American populations from the Simons Genome Diversity Project [10]. We used Plink [1] to filter for MAF>5% in the merged dataset, remove sites in strong linkage disequilibrium (again using a window size of 50kb, a 20kb offset, and an R 2 threshold of 0.8), and to perform PCA (Figure 1).
We used two approaches to detect European and West African admixture. First, we used the program ADMIXTURE [11] to estimate the proportion of the genome originating from K ancestral populations for each individual, with K being specified a priori. We performed these analyses using the merged dataset described above for PCA, but with Han Chinese samples removed (such that the setup was similar to [8]). We ran ADMIXTURE with K=3-7. Analyses using a given value of K were run five times with different random seeds. For each value of K, we retained results providing the lowest cross-validation (CV) error and we report these result in Figure S1. For ease of visualization, reference populations with >20 samples were randomly pruned to n=20 and results for K=4 (the lowest K value close to the minimum CV error) are presented in the main text ( Figure 2).
Our second approach relied on local ancestry assignments from RFMix [12]. Here, we used 1) British individuals to represent European ancestry, 2) Yoruba individuals to represent West African ancestry, and 3) all South and Central American samples from SGDP grouped together to represent Native American ancestry. The SGDP samples contain minimal European admixture and are therefore arguably a better reference set than the South and Central American samples from 1000 Genomes. Using these reference populations, we then assigned chromosomal segments for each sample to its most likely ancestral source. We also included Peruvian samples from 1000 Genomes in this test set as a positive control, given that admixture has been analyzed in this population previously [13][14][15].
We observe essentially no evidence for African ancestry within the Tsimane and minimal evidence of European admixture ( Figure 2 and Figure S1). Our estimates of European admixture from RFMix are highly correlated with our estimates from ADMIXTURE (R 2 =0.8393, p<10 -16 ; Figure 2C). For all downstream selection analyses, we pruned our set of 203 unrelated individuals down to 196 individuals with >95% Native American ancestry inferred from RFmix.
We also pruned the 1000 Genomes Peruvian dataset down to 26 individuals with >85% Native American ancestry (using higher thresholds for Peruvians resulted in too few samples for analysis). The set of admixture-filtered Peruvian individuals used for selection analyses are presented in Table S1. For the Tsimane samples that passed our filters, we followed the methods of [8] and masked SNPs in 1) regions of low confidence ancestry assignments (<90%) and 2) regions that were inferred to be inherited from a non-Native American ancestor (i.e., regions that passed our confidence threshold, but where the most likely assigned genetic source was European or African).

Summarizing selection statistics to identify candidate regions
To identify candidate regions that have putatively been under positive selection, we used two approaches to summarize our iHS, XP-EHH, and PBS results set. First, we rank ordered the |iHS|, XP-EHH, and PBS distributions and defined outliers as those in the top 1% of each distribution. We then binned the genome into 50kb windows with 25kb offsets, and counted the number of outlier loci for each statistic that fell in a given window. Windows with outlier numbers in the top 1% of the genome-wide distribution for all three selection statistics were considered as candidates. Because each statistic has its own underlying assumptions and sensitivities, windows that are outliers for many different tests are expected to be enriched for true positives [2,8,16]. We required all three statistics to exhibit outliers in a given region in order to identify the most robust signals; in other words, we aimed to minimize the false positive rate even if it came at the cost of a higher false negative rate (as is common in the literature [2,8,16]). We chose to use 50kb windows because simulation studies have shown this window size provides good power to detect sweeps with selection intensities between 2Ns=100 and 1000 [17].
Our second approach to identify candidate windows used the same results set, but summarized in a different way. Specifically, we combined the genome-wide rank of the three statistics for each SNP that was analyzable by all methods with a Fisher's combined score (FCS) [18]. This score was equal to the sum, over the three statistics, of -log10(rank of the statistic/number of SNPs tested; Figure S2). Outlier regions were then defined as those with a median FCS score among the highest 0.1% of the genome. We used this second approach because it has been shown that combining different neutrality statistics into a single score may increase power [18]. The rationale being that neutrality statistics are expected to be more correlated for positively-selected variants relative to neutral variants [19]. The set of 50kb regions identified by our two summary approaches were highly overlapping (Table S2) and the union set with overlapping regions collapsed is summarized in Table S3.

mRNA-seq data generation and processing
Between July and November 2017, venous blood samples were collected in PAXgene tubes from Moseten (n=88) and Tsimane individuals (n=154). Samples were processed according to manufacturer's protocol (kept at ambient temperature for 2 hours, then transferred to a -20°C, and were stored at -80°C after they were exported to the US). In the US, blood samples were sent to the UCLA Social Genomics Core Laboratory where RNA was extracted, Because of the extreme population-batch confound, we did not attempt to combine Tsimane and Moseten samples and instead processed them as two separate datasets (after removing the five Tsimane samples sequenced with the Moseten samples). Below, we describe our processing procedures, which were performed for each dataset separately.

Supplementary Figure 5. Phenotypic effects of genetic variation under selection, stratified by population. A-B) Results for candidate regions with a significant eQTL, C-E)
Results for candidate regions with a significant genotype-phenotype association. X-axis shows copies of the minor allele, y-axis shows normalized gene expression levels or phenotypic measures (units for triglyceride and HDL cholesterol levels=mg/dL). In all cases, gene expression and phenotypes were modeled linearly and assume an additive model. Detailed results are provided in Tables S9-10. SNPs used for association mapping are as in Figure 4.  Figure 6. Whole blood gene expression levels for candidate genes. X-axis shows all candidate genes and y-axis shows the distribution of log 10 counts per million (CPM) from RNA-seq data. A median CPM cutoff of >3 (equivalent to a log10 CPM>0.48) was used to filter for expressed genes. For all tag SNPs discussed in the main text, we reran our analyses after permuting the genotype label 1000 times, to confirm that the empirical null distribution was uniform as expected. Histograms show the distribution of p-values for the genotype effect on a given phenotype from 1000 permutations.