2020 new data: Genetic and phenotypic landscapes of mtDNA in the Japanese population

Below is an excerpt from the Open Access 2020 study:

Yamamoto, K., Sakaue, S., Matsuda, K. et al. Genetic and phenotypic landscape of the mitochondrial genome in the Japanese population. Commun Biol 3, 104 (2020). https://doi.org/10.1038/s42003-020-0812-9

Abstract

The genetic landscape of mitochondrial DNA (mtDNA) has been elusive. By analyzing mtDNA using the whole genome sequence (WGS) of Japanese individuals (n = 1928), we identified 2023 mtDNA variants and high-resolution haplogroups. Frequency spectra of the haplogroups were population-specific and were heterogeneous among geographic regions within Japan. Application of machine learning methods could finely classify the subjects corresponding to the high-digit mtDNA sub-haplogroups. mtDNA had distinct genetic structures from that of nuclear DNA (nDNA), characterized by no distance-dependent linkage disequilibrium decay, sparse tagging of common variants, and the existence of common haplotypes spanning the entire mtDNA. We did not detect any evidence of mtDNA–nDNA (or mtDNA copy number–nDNA) genotype associations. Together with WGS-based mtDNA variant imputation, we conducted a phenome-wide association study of 147,437 Japanese individuals with 99 clinical phenotypes. We observed pleiotropy of mtDNA genetic risk on the five late-onset human complex traits including creatine kinase (P = 1.7 × 10−12).

… Here we provide a comprehensive analysis that characterizes the genetic and phenotypic landscape of mitochondria in the Japanese population. (i) We constructed a high-resolution mtDNA variant list and haplotype classifications in the Japanese population based on deep WGS of ~2000 individuals. (ii) We conducted unsupervised clustering of the mtDNA variant patterns and assessed their correspondence with defined haplogroups by using a set of machine learning (ML) methods. (iii) We quantitatively assessed positional distributions and LD structure of the mtDNA variants by parallelly comparing with those of nDNA. (iv) We performed genome-wide scans to investigate the mtDNA–nDNA and mtCN–nDNA genotype associations. (v) Finally, we conducted mitochondrial genomewide genotype imputation of genome-wide association study (GWAS) data of more than 140,000 Japanese individuals, by using the population-specific WGS reference data. We then conducted a PheWAS of 99 complex human disease and quantitative traits.

RESULTS

High-resolution mtDNA variant and haplogroup lists in the Japanese population

We re-analyzed the previously reported WGS data of the Japanese population (n = 1928)18. We realigned the WGS reads on the human reference genome GRCh37, which includes the revised Cambridge Reference Sequence (rCRS, NC_012920.1) as the human reference mitochondrial genome. The rCRS has been widely used in mitochondrial genome analyses including Japanese individuals. We only used the reads uniquely mapped on the mitochondrial region to avoid the contamination of nuclear copies of the mitochondrial genome (nuMTs)19. In this study, we focused on the analyses of homoplasmy. Then, we identified 2023 variant sites, of which 63 sites were multiallelic (the mean depth = 1488; Supplementary Data 1). Of these, 516 variants (25.5%) were newly identified in our WGS data. Minor allele frequency (MAF) spectra indicated that the majority of the identified variants were rare in Japanese; rare variants (MAF < 0.5%), low-frequency variants (0.5% ≤ MAF < 5%), and common variants (MAF ≥ 5%) accounted for 79.3%, 16.4%, and 4.3%, respectively (Supplementary Fig. 1). We observed clear concordances of the alternative allele frequencies with those in the previously reported two Japanese databases (1507 and 1025 variants with 3.5 KJPN [n = 3552] and Giib-JST mtSNP [n = 672], respectively; Supplementary Fig. 2)7,20,21. As previously reported22,23, mutational spectrum indicated a high transition to transversion (Ti/Tv) ratio of 16.44 (Supplementary Fig. 3).

Next, each individual was classified into the mtDNA haplogroup based on a variant list detected by the WGS using HaploGrep (v2.1.14)24. Haplogroups are classifications of the mtDNA haplotypes defined according to a set of the specific mtDNA variants. As mtDNA is a haploid genome, the detected variants could be directly used for haplogroup classification without phasing. Nomenclature of each haplogroup is hierarchically defined based on the number of the letters (from one to nine), which was divided into sub-haplogroups (e.g, “D4b” as three letters). The number of the haplogroups monotonically increased from the macrohaplogroup (n = 11 at one letter) to the sub-haplogroups with larger number of letters (n = 310 at nine letters; Supplementary Data 2). Increments in the number of the haplogroups became limited from seven to nine letters (Fig. 1a, b).

The frequency distribution of each haplogroup was obtained according to geographical regions in Japan (as defined by BioBank Japan Project: Hokkaido, Tohoku, Kanto-Koshinetsu, Chubu-Hokuriku, Kinki, Kyushu, and Okinawa from northeast to southwest areas of Japan; Supplementary Fig. 4)18 and all the populations from the 1000 Genomes Project Phase 3 (1KG, n = 2504; Fig. 1c)25. In the Japanese population, macrohaplogroups A, B, D, M, and N had more than 1% of frequencies across all the regions. In the regions from Hokkaido to Kyushu, macrohaplogroup D was most prevalent (>28%), followed by M and B. In contrast, the different spectrum was observed in Okinawa (the islandic region at the most southwest area in Japan), where M and B were more prevalent (37.5% and 25%, respectively) than D (18.8%). Furthermore, although the D4a and D4b haplogroups were prevalent from Hokkaido to Kyushu, the D4c haplogroup was prevalent in Okinawa (Supplementary Fig. 5). Although the haplogroup spectra in 1KG East Asians (EAS) were relatively similar to those in Japanese, R was more enriched in 1KG EAS, and D, G, and M were more frequent in Japanese. 1KG populations other than EAS showed distinct haplogroup patterns from the Japanese population. M, A, H, and L were most prevalent in 1KG South Asians, Americans, Europeans, and Africans (AFR), respectively. Especially, 1KG AFR showed the least diversity within macrohaplogroups, of which African-specific macrohaplogroup of L accounted for >90% of frequencies.

Fig. 1: High-resolution spectra of mtDNA haplogroups in the Japanese population

c Stacked bar plots of the frequencies of the macrohaplogroups within the geographical regions in Japan and subpopulations from 1KG. The geographical regions in Japan are defined as Hokkaido, Tohoku, Kanto-Koshinetsu, Chubu-Hokuriku, Kinki, Kyushu, and Okinawa from northeast to southwest areas of Japan, as described elsewhere42.

Fig. 1: High-resolution spectra of mtDNA haplogroups in the Japanese population.

From: Genetic and phenotypic landscape of the mitochondrial genome in the Japanese population

Unsupervised ML approaches deconvoluted mtDNA classification patterns. To evaluate how the defined haplogroups reflect the mtDNA diversity within a population, we conducted unsupervised clustering of the subjects based on the mtDNA variants and evaluated the concordances with haplogroup assignments. We adopted three unsupervised ML classification approaches of phylogenetic approach, principal component analysis (PCA), and uniform manifold approximation and projection for dimensionality reduction (UMAP). We first constructed the phylogenetic tree of the WGS individuals, which was illustrated as an unrooted tree type (Fig. 2a). The tree branch was mainly divided into the major two lineages at the root base, which were known as the “M” and “N” clusters. Each major lineage was further divided into sub-lineages corresponding to the sub-haplogroups. Next, we applied the linear dimensionality reduction technique of PCA and examined up to the 20 PCs. The explained variances were 12.6% for the top 20 PCs. As the two-dimensional plot of the PC1 and PC2 was difficult to fully capture the cluster classifications (Supplementary Fig. 6), we adopted the three-dimensional plot consisting of the top three PCs (Fig.2b). The major M and N clusters were also illustrated as distinct groups in the PCA plot. In contrast to the M cluster, the N cluster was further divided into sub-haplogroups, such as A5a, B4a, B4b, B4c, B5*, F1a, F1b, N9a, and N9b.

UMAP

UMAP is a recently developed nonlinear dimensionality reduction algorithm, which has a merit in preserving the topology of the local and global structure26. Although UMAP has been previously applied to high-dimensional biological data such as single-cell RNA sequencing27, here we applied it to the genomic data of the mtDNA variants. An application of UMAP could classify the subjects into >20 clusters, which was concordant with the pre-defined sub-haplogroups with three letters (Fig.2c). UMAP could differentiate the sub-haplogroups belonging to the same macrohaplogroup in more detail (e.g., D4a, D4b1, D4b2,D4c, D4e, and D4g belonging to D). On the other hand, we did not find the clear delineation between the M and N clusters, which was observed in the phylogenetic approach and PCA. Several sub-haplogroups with small sample sizes were clustered closely with other macrohaplogroups (e.g., M7c clustered with D macrohaplogroup). These observations could be the potential limitations of UMAP. As each ML classification method hadunique advantages to classify mtDNA variations, we would propose a value of applying multiple ML methods to comprehensively visualize haplogroup diversity within a population.Distinct characteristics of mtDNA variant structure from nDNA. Due to lack of recombination and higher mutation rate, LD structure and tag variant distribution in mtDNA are considered to be distinct from those in nDNA, whereas the details have been unclear especially in non-Europeans. Thus, we conducted a comprehensive assessment of these features in Japanese by using the WGS data. In addition to calculate the haplotype correlations (r2) between common mtDNA variant pairs (MAF≥5%,n=80), we estimated those obtained from the randomly selected autosomal phased haplotypes with adjustment on variant positional differences (±8.3 kbp). Fractions of the variant pairs with high correlations were relatively smaller in mtDNA than in nDNA (2.4% and 20.2% of the variant pairs with r2≥0.9 in mtDNA and nDNA, respectively; Fig.3a).When we checked distance-dependent decay of LD, we observed clear LD decay in the nDNA variant pairs according to physical distances between the variant pairs, which would reflect LD blocks (R=−0.092,P=1.0 × 10−7, as highlighted with red; Fig.3b). However, mtDNA variant showed no distance-dependent LD decay (R=0.022,P=0.21). Although there existed controversial discussions16,28, our results did not support the hypothesis of potential recombination events in mtDNA. Lack of recombination and relatively weak LD in mtDNA should propose that the common mtDNA variants are sparsely tagged bythe surrounding single-nucleotide polymorphisms (SNPs) when compared with nDNA. The numbers of the tag variants per common variant were smaller in mtDNA than in nDNA, even when the variant distances were adjusted (Fig.3c). As many as 13.8% of the common mtDNA variants did not have any tag variants with r2≥0.5, whereas only 5.0% in nDNA. Systematic visualization of pairwise LD patterns revealed that mtDNA variants did not constitute LD blocks among neighboring variants (Fig.3d). On the other hand, we observed multiple common haplotypes spanning the entire mtDNA (n=8), which might have been the consequence of lack of recombination and no distance-dependent LD decay. Interestingly, mtDNA variants without any tag variants were mostly identified in the D-loop region, one of the non-coding but functional regions in mtDNA. By using WGS data, we highlighted the characteristics of the mtDNA variants28, which were (i) no distance-dependent LD decay (i.e., absence of LD blocks) and (ii) sparse tagging of the common variants, in non-European population. In addition, we could newly detect existence of the common haplotypes spanning the entire mtDNA. No evidence of mtDNA–nDNA genotype association. Mitochondrial function is regulated not only by the genes encoded in mtDNA but also by those in nDNA. As these two typesof genes confer synergistic biological functions6, co-evolutionof the genetic variants embedded within them5, namely “mtDNA–nDNA genotype association”, has been suggested17. Thus, we tested the hypothesis that there might exist preferential transmission between nDNA and mtDNA at a genotype level. To explore footprints of the mtDNA–nDNA genotype association,we conducted a genome-wide scan to assess genotype distribution dependence between mtDNA and nDNA variants using the WGS data. We investigated 86 common mtDNA (MAF≥5%) and the genome-wide 7,124,343 nDNA variants (MAF≥1%), but there was no significant mtDNA–nDNA genotype association when multiple comparisons were considered (P=5.0 × 10−8/86=5.8 × 10−10; Fig.4a). Even when we focused on the variants located within ±10 kbp of the previously defined nDNA mito-chondrial genes (n=1105)29, we still did not detect a significant association. In addition, we investigated the associations by using the imputed GWAS data (n=141,552; see details in the next section). We analyzed the association between the 8 mtDNA(MAF≥5%) and the 7,402,102 nDNA variants (Rsq≥0.7 and MAF≥1%), but no significant signals were detected (P=5.0 ×10−8/8=6.3 × 10−9). We neither observed the enrichment of the association signals in the mitochondrial-related gene variants in nDNA (Fig.4b).

DISCUSSION

UMAP is a recently developed nonlinear dimensionality reduction algorithm, which has a merit in preserving the topology of the local and global structure26. Although UMAP has been previously applied to high-dimensional biological data such as single-cell RNA sequencing27, here we applied it to the genomic data of the mtDNA variants. An application of UMAP could classify the subjects into >20 clusters, which was concordant with the pre-defined sub-haplogroups with three letters (Fig. 2c). UMAP could differentiate the sub-haplogroups belonging to the same macrohaplogroup in more detail (e.g., D4a, D4b1, D4b2, D4c, D4e, and D4g belonging to D). On the other hand, we did not find the clear delineation between the M and N clusters, which was observed in the phylogenetic approach and PCA. Several sub-haplogroups with small sample sizes were clustered closely with other macrohaplogroups (e.g., M7c clustered with D macrohaplogroup). …

Our study demonstrated a successful imputation of the mtDNA variants by using the large-scale WGS as a reference panel. …

In conclusion, through the large-scale WGS and PheWAS of mtDNA, our study could comprehensively elucidate the genetic and phenotypic landscapes of mtDNA in the Japanese population.