IdrD domain-level metagenome surveys

Overview

This is a reproducible, narrative description of the methods used to investigate the representation of idrD in real-world environments, specifically the human microbiome, as reported in our manuscript. The motivation for mining metagenomes came from the biochemical characterization of IdrD, which represents a novel contact-dependent toxin involved in self/non-self identification between bacterial strains. IdrD is particularly interesting, structurally, it contains multiple subregions with varying nucleotide-level conservation. The 5’ end of the gene (N-terminus) seems to behave “ordinarily” with expected nucleotide-level divergence between taxa; the middle usually has a relatively conserved domain for recombination (e.g., Rhs); and the C-terminal domain (CTD) is much more variable. This CTD contains the active site for DNase activity. So, a priori we expect some lumpy coverage reflecting the different domains, as they represent different levels of sequence conservation (i.e., frequency in the pool of short reads that is a metagenome). Thus, we were interested in investigating its presence and diversity in complex, host-associated microbiomes to understand the CTD’s prevalence, diversity, and population specificity.

The process

We started with a FASTA file of full-length idrD nucleotide sequences, obtained by manual curation of idrD hits from a combination of HMM and tBLASTn searches of EMBL and NCBI. The sequences can be found here:

>Acinetobacter_baumannii_strain_XH858_idrD
ATGAACAAGAATAGTTATCGCATTATTTATAGCAAAGCACGTCAAATGTTTGTGGCCGTGGCTGAAAATGTACGAAGCCAAACCAAGACTTCTGGTCAAAGTGAAGCAAGTACCCAGAACAAGATTGATAACACTGAATCTCAGGCATTTCATCAACTCTGGCAAGTGAAAGCATTGGTTGCCAGTATTAGCCTGTGGATGCCTTTAGCACCTGTTTATGCGGGTATTGTAGCGGACAGTGCCGCAAATGCTGCGAATAGAGCAGTTATTGGTGCTGGAAAGAACTCAGCAGGTACAGTCGTTCCTGTCGTTAATATTCAAACTCCAAAAAATGGGATATCCCATAATATTTATAAACAGTTTGATGTATTGGCTGAAGGTGCAGTACTCAATAACAGCCGTCAGGGTGCTACGACAAAAACAGTAGGAAACGTTGCAGCGAACCCATTTTTAGCTACAGGTGAAGCACGAGTTATTTTAAATGAAGTAAATTCATCAGCCGCTTCACGCTTTGAGGGGAACTTGGAAGTTGCAGGGCAAATGGCTGATGTGATTATTGCCAACCCGTCTGGTATTAGTATTAAAGGTGGTGGCTTTATTAATGCCAACAAGGCAATTTTCACCACAGGTAAACCTCAGTTAAATGCTGATGGCAGTATCAAGCAGTTTACGGTTGACCAAGGAAAAATTACGGTCTCAGCGAACCCCAATTCTAAATTTGGTTTGGGTGGAAATAATAACGATGCAAATTATGTGGATCTCTATGCGCGTGCATTAGAGCTTAGTGCTGAGCTACGTGCTAAAAATGATATTCAGGTGATTGCCGGGGCAAATAATGTCAGTGCAGATTTACAGGATGTTACAGCTAAAACAGGTACTGGAACAGCACCAACATTGGCTGTGGATGTTAAAGCATTAGGCGGGATGTATGCCAATAATATTTATTTGATGGGAACTGAAAAAGGCTTAGGCGTGACGAACGCAGGGACTATTCAGGCAGTCAACAATCTGGTAATTACCAGTGCCGGAAAGATTGAACACAGTGGTACAATCAGTTCGACCAGTAAAACTCAAGGTTTAGTAAACATACAAACAACGGGTACGGGTGCAGCTTCCGATATCAACTCATCAGGTAGTATTAATAGTAATAGCATGTTAAATATCGATTCAGGTAATAACCTGAATGTAAATGCTAAAGAAATTATTATTAATAACAGCAGTCTAGCATCTTCACCTTTAATCGTTAATACCAAAGGTAATATCAATTTAGCAGCAAACACCCGCATTATGGATGATTCACAAGGTGGAGATGTCTATATTGATGCAGCAAATATCAATCTTGCCGCTGGATCGGAACTCAAAAGTAACCGTGGTACAGCAACCATACAAGTACAAAAAGACTTAGTTGCAGCCAAAGGCGCAAAACTGATTGCTGCCCAAGACTTAAATGTTTTAAGTAATGGTAAGTTATCACTTACTGAGAACCATATACAGGCATCATTAGGTACAATCAACCTACAAGCAAATTCGGCTAATACCCAAAATCTGATTGACCTTCAAGGTGGAACGATTTATGCGGGTAAAGATTTAAATTTATATAGCACTGATAATATTAATATGAAAAACTTAAGTTTTTCAATAGAAAATAATAAAAATCGAGTAAAAAATATTAATATTTATAGTGGTGGTAGTTTATATTGGGATAATACTGCTTTTTTAATACCTGACCTAACTGGAAAAATTTCACTGAGAGCGGAAAAAGATTTAACAATATTGGGTCAAGGTGGTTTAAGTGCAGATAACGGTATTAGTTTATATGGTGGAAATGTAAATGTTTCTTCACTTAATTTAAATTCAAATATTGGTAATACATTGGTTTTTGCAAATAATAATATTGATATTAAGGGGGTAAATATAAAGAGTAATAATGGAAATTTAAATTTTTCAGCAGAAAATGGCTATATCAAGATTGCTGATTCGAATTTGTCAAGCGAAAAAAATAAATTATTTTTAATTTCTAAAGATAAGCAAATTTTTAATAATACTAGTTTACTTTCAGGAGGGGATTTAATATTAAATACTATAAGTAATAAAGGTGATGTAGAAATTAATCAAGATGGAGGGTTATTTTCTTCACCAATTAAATCAAATAATGGTAATGTAATTATTGATTCTAAAGGGATGTTAAATATAACAACAAAAGACTTAAATAAATTTGTTGATATCAGTGGTGAGAAAATAGAAATTTCTTCTTCTAATAATATGAAATTAGATGGTATTAATGTATATTCAAATGGTAATTTATCTCTACATGCAGATAAAGATATATACTTGAATACAGTTCTGCCAAACTCATGGACAGGGGGTGGTTTTGGTAGTGAAATCATTAGAAGCCAAAAGCATATATCTATCAGTAGTAAAGAGGGCAGTGTTAAAATAGGATTAGAAAACCAAGCTTTAAATGCTCCTGTCGTTGGCCTAAAAGCTAATGGCGTAATTTCAATAAAAAGTAAGTTAGATCAAAGTTATCAAAATACCTCGCTAAATGCGGGTGCTATTACGCTTCACTCAGATGCCGGAAGTATTAATAATAGTAAAAATTTTAGAGCAATTGCTAATACAACATTTTTTCTCGAAAGTGATAGCGAATTAAAAAAAATAAATGGTAACTTAAGTTTATATGCGAAAAAAGATATTCAACTAGACGCGATTAACTCTCAAGGACGAGCTTTTGATCCAAATTACACGCGTATGATTAAACCCGTGCTCTCAAGCCAAGGTTTAATGGATATTCGTGCTGATGGAGTGTTTCAGGTGAATGGTAAAGCACCTCCAGCACAAACTGCGATGTGGGATCAGTTGAAAATGGATAGAGCATATTTTACTTCAAATGATGGTATTAATATTCTAGCAGGAACTGTCAAGATGTATGCTGGTGATCTAAAAAACACATCAAATACTGCACCAATCAATATAATATCAACAGGAGATATCGTTCTTGATGCCATGAACTATGATGTAAATATAGGGCAGTTAGATCTGATGTTACCAGCACAAAAGCTTCGAGATGAATTAGAAGCTAGCGGTAAAAAGGATTCTGTTACATTAAATACAATCAACCAGCTCAATGAAGAAATTGCATTTTATTTATCAAGATCATTAAATGGAACAAGAAGCCAAAGTAGCCATATTGATTCAGCACAAAATATTAATCTCCTATCTAAGAAGGGAATACTCATTAGAGCTTCAGAATTATATGCTAAAAATGAAGTAAATATTGAAGCACAAGGGTTATTAGCACGTGATCTTACAGGGGCTCAGGTAACTGATTATATCGATACTTCAATTTTAGTAGATGGGGTACATGATATTTATAAAAATGGTGAAGCTACTAATTCAAATTATGAAGAACGCTCTGATTTTCATAATTCAATTGTTAGTGGTGATAATGGAATCAATATTAAATCAACAGGTTCAGCCAAATTAAATTGGTATGATAATCCAAACCTATTATCTAATTACGCTACAGTCACTTCAGCAAAACCTATACAGTTAAAAAAGAAAATAAATGATGGAAAAAATAATATTGTTTTTAATGGTGCAGAATTAATTTCATCAGGAGGAGATGTCAATATTCAATCTAATGCAAATATTGTATTAGAATCAAGTCAAGATGAGATTTATAAGCATAGTACCAGTCAAGTTATAAAAAAAACTTGGTATGGAAAAAAGAAATCTACAGAAACCTATACACAGAGCCAATATGGGACAGCTGTTCCAACAACAATTTTAGCAAAAAATATTAGTCTGAAGGCAGCAGGAGATATAGATTTCTATTCAACGTTGATGAAAGCAAATAGTGGTAATGTTAATATTACTGGCGAAAATCTTTACTTTCTAACTTCAAATAACTATGCAAATTCAACTAGTACAACGAAAAAAAACTCATCAATTTTAGGAATTCCACTTAATAAATCTAAAACGACTAGTAGTCGTAGTCAAATTTCGCAACTTCCAGTAAAGCTGGTTGGAGATTATCTGTCAACTGAATCGAAAAATGATACAGTTTTTGTTGGGACAGAGTTTGATTATATGAAAAATGCGTCAGTTAAGGCAGGTGGGCAGATTGCTTTCCTTGGGGCATCAGAAACAATAAGTCAAGCTTTAAAAGTTGAAAAGAGCAATATTGTTTGGCAATCAATGCAAGATAAAGGCTCTATTACTGAAACTACTAAACTCCCAAATTTTAATGGACCTACTCCACCTACTTTTAAAGCATCAGGTGGCTTAATAGTACAGGTGCCTATTATATCTAATAAAAATAATGATATCCGTACAGAAGTGATTAACCTTGCTAATCAACCAGGTAATGAATATCTCAAAGGTTTTATTGCCAGAAAAGATGTAAATTGGGAGGCGGTTAAACTTGCACAAGAGAATTGGGATTATAAATCACAAGGTTTAACTGCTGCTGGTGCTGCAATTATTGTAATCATAGTTACAATTGCTACGATGGGTTCTGGTACTGCAGCTGCAGCTGGTGCTGCAGGTGGAACTGCTGCTAGTGGGACTACAGTAGGTCTTGGAGCAAGTATGATTGGAACTGCTGGTGTAACTACGACTGCAGCAGGAGCAATTGTTCCATCCACTTTAGGTGTTATGGCTAATGCAGCTGTTACAAGTTTAGCAACTCAAGCAAGTATTAGCTTAATTAATAATGGAGGTGATATTAGTAAGACACTTAAAGATTTAGGAAGTAAAGAAACTGTTAAAAATTTAGCAACTTCGGTTGTTACTGCAGGAGCCTTGAGTACAGTTGGTGGGCTTGATTGGATGGAAAAATTAAAAGATATGGGTACAGATCCGAATCTTTTAGTAAATCTAGCAGGGCGTACTGGTGGAGCTATAGTTGAAGCCGGTCTAAGTGCAGGAATTAGTAGTAGTATCAATGGTGGAAGTTTAACTGATAATTTACAAGGTATGTTACTTTCTAGTTTCGCAACAGCTTTACAGGGTAGTTTAGCGTCAAAAATTGGTAAAGAACTAGGAGGAATGAATAAAACAGATTTCGAAAATATTTTACATAAAATAGCCCATCTTGCTGCTGGGTGCGTAGCTGGTGCTATACAAAAACAATGTGAATCTGGCGCTATAGGTGCAGGAATTGGAGAGTTTGTCGCAGAGTTCATGCCAGCAGCCAGTGGCAATGGGGTGTATAGTGATGACGAAAAAGAAAGAGTTTTAGCAGTTGGTAAAATTTTGGCAGCTGGTGTAGCTGGTATTACTGGGTACGATGTAAACGTTGCATCAAGCTCTGCAAATACAGCACTGTTGAATAATGCATTATCAAAAACGGATGTAGATGTATTAATTCAAAACTTAAAAAGCTGCAAAAATGAATCACAATGTCGAAATGAGATTCAAAAAGCAGTTCTCAAGAGTAATTATAATCAAAGTAATGCAGTGAATTGTCTTTCGTTACTTTGCTCAGATAATCTTGAGCAGGCAATTAGTGGTTATAATCAGCTTAGAGGATTTATAGCAAGTGGTAAATTTTCAGGAGATAATTTAAAGCTATTAAATAATTTACTCACTCAAAGTACTAAAGATATTACTCTTCAAATTAATAGACTACCTCAAGTTCAAGCTTATCTAAATTCCCCTTGGACACCGCCAACAAAGCAACAGATGGCACAGATAGTTATTGATATTTTACCAATGACTAGTGATCCTTCTGCAATTTATTCAATTATTACAGGTAAAGCAGTTGTTACACAAGAAGAGGTAAGTCGTTTTTATGCAGCTGTTGGTTTAATTTTGCCAATTACAATTGGTCGTTCTTTAACAAAAGCCGAAATTGCTGGTTTGAAACAGTCTTCGCAAAATCCACTAAGCGGTACAGTAAGTATAAAAGATCTCCCAGCTCTACCAAAATCTGCACAAAACTTCGTTATTGAACAAACAGCACAATCTAAATATACAAAACTTTTATATGAAGCAAAGAATGATCCTAGTTTAGGAACTTGGAGTACTTCTAAAATAGGTCAAAAGGGTGAGGAAATAGCTGTACAGTTGGTTAAGGGGAGTGGTTTTACAGATGTTATAAGTATACAAAACGCTAGTGGTAATGGTATTGATATTATTGCAAAAAATCCTCAGGGAAATTATGTTTACTTAGAAGTAAAAACATCTGCTGTTGGAAAGATTGGTAGTTTATCTAAAAGGCAAGAAGATATGAACTTCTTTGTAAATGATATATTAACAAATGCCGCTGCTAAAACTGGGCGTTATGTAAACATAAGTGATACTGTAAATGCACAAGCTAAAATTATGTTAACAAAATATAAGCAACAAAAAGCTGTAGGTAATATCTATGGAGCAGCTATAGGGGTAGATTTAAAAAATGCTCAAATCTTAATCTCTTCTTGGACAAATAAATAA
>Cronobacter_turicensis_z3032_idrD
GTGTCTGAACCGTTAGCTGCCCGTTATCAGGACCCGCTCATCCACAGCTCTCTGCTCGCCGATGTCGTCAGCGGTGTGGTCGAAGGCGCCATTTGTCTTGCTGCGTTTACGGCGGGTGCGGCGATGATGACAACGGGGCTTGGCACGGTCGCAGGCGTTGTACTGATTGCCGCTGTTTTCGGCAGCGGCATCGCGGAAGAGGCAGGCGATCTGGTCGGCCAGGGTGTTGATGCCGTGCTGGATTTCTTCGGCTGGCGTGGGCCACCGGATGCCTTTATTACCAGTGGCTCGCACAACGTTCATATCATGGATCTCCCTGCCGCCCGTGCGGCGGGTACGGTCGATCATGATTATCTCAATACGCCTATCCCGGCAGAGAGCTTTGCCGATAAGGCGAAGGCGTTCGCGATAAACACGGCGGTTACCATTCTCGAAGTGGCGCAGTTTACGCTGCACCCGTTTGATAATCTGGCTGCGGGAGCCGACGCGATTGCGAGCAGCGGCTGGCAGGGCGTGAAAAATTTCGCGGGCAGCGTCTGGGATAACTTAACCCAGCCGGTCGTGGCGGGGGCCAGCCCTTTTGCTAAAGAAGCGCCGCTGGATACTGTCGAGTGCACCAAAGGCCATACGATCACCGGCGGTAACTTCCTGGCGGAAGGTTCGAAAAAAGTGCTGATCAACGGTCAGCCCGCGTGCCGTGATGGCGATCGCAGCACCTGCGAAGCAAAGATAAAAGTTAAAGAAAACACGCGTGTGCGCATCGGCGGCGAAAGCATTGTGGTGCGCGATATCCGCAGCGGTAAAAATTTCTGGGCGCGGCTTATCGGTAACGCCATCGGCAGCCTTGGACCGGGAATATTGCGCAATCTCAGTAAAGGGATGTTAAAGGCGATATTCAGCCGCCAGATCCTGAAAACCTTTTGCTGTCAGCTCGCCGCCGATCTCGGCATGGGATTAACCACGTTCGGGCTGATTCAGGCGGGGAAAGTGGGCAGCGAAGCGCGCCACACGCAGCACCCGGTGGATATCGCCAGCGGCGCGAAAATCCTCGCGGGCGGCGAAGACCGCGATTTCACGCTGGAAGACCGCATTCCGCTCATCTGGCAGCGTGTTTACAACAGCCGCAATCTGGCGACCGGTATGCTCGGTACCGGCTGGCTGCTGCCGTTTGAGACCCGCTTCTTCCGTCTTGAGAACAATCAGTTTATCTGGCGCGATATGTCGGGCCGCGATCTGGGCTTCGGCGAGCTGACCCCCGGCGACGTGGTGGATTATCTCGAAGACGGCGTCACGCTCTATTACACCGTCAGCGGCACGCTGATGCTCCAGATGACGAGCGGCGAATATCACGTCTACGAACCGGACCCGACGCGTCCGGGCGAGTGGCGGCTGTTCCGCATCTACGATCGTCACGAAAACTGCCAGTATTACAGCTGGGATGAACACGGCAGGCTGGCGCGCATTTCCGGCGACAACGAAGCGCTGGATGTCGAACTTGCCTACGAGAAATCGCATGGCCGTCTCGCCAGCGTGCACCAGGTATGCAACGGCGAGCGCCGCCTGCTGGTGACTTATGGCTATAACGAAAACGGCCAGTTGATCGAGGTCACCGACGCGGACGGCATCGTCACGCGCCGCTTCGGCTGGGATCGCGCCAGCGACATGATGGGCTGGCACAGCTACTCCACCAACCTCAGCGTGCATTATCAGTGGCAGCCCGCCGCCGATGCGCCCAACTGGCGCGTGTGCAGTTATCAGGTGCTGGACGACCAGGACAACGTGCTGGAGCGCTGGCGCATCGACGCCGACGAGGCGAAACGCTGCGCCACGGTAAGCTGCGACGCGGGTTTCTCGACGCGCCACTGCTGGGATTTCCTCTACCGCATCACGGAATACACCGACCGCCACGGCGGCGTGTGGCGCTACGAGTGGGCAGATTATGCCGAGCTGCTGAAAGCGGCCACCACGCCGGACGGCAGCCGCTGGGTGTATGGCTACGACGAGCACGGCAACCTGACGGAAGTGCGCGATCCGCTCGGCAACAGCACATTCACCACGTGGCACCCGGTGTTCGCGTTCCCGGTGAAGGAAGTGCTGCCGGACGGCGCCGCCTGGCAGTATGAATACAATGCGCGCGGCGATGTGGTGGCGCTCACCGACCCGAAAGGCGGCGTGACGCGCTTTGAGTGGAACGAGCAGGGCGACCTGGTTCAGCAGACCGACGCGCTGAACAACACGCACCGGTTCTGGTGGAACGAGCGCGGACAGCTGGTGCGCGACGAGGACTGCTCCGGCAACCAGAGCCAGCGGCTTTACGACGACGCGGGGCGGCCGCTCAGCGCCAGCGACGCCGAAGGCAACACCGACCGCTGGGCGCTGACCGCGGCGGGGCGTCTGCGCACCTGGCGACGGGCGGACGGCCGCGAGACGCACTACGAATATGACAACGCCGGGCTGCTCTGCGGGCAGGACGACGACGGGCTGCGCGAGCGCAAGGTAACGCGCAACGCGCGCGGGCAGGTGGTGAGCGCCGCCGACCCGGCAGGCCATCTGACGCGCCTGCGCTATGACCGTTTTGGCCGTCTGACGACGCTGGTGAACCCGAACCGCGAAAGCTGGCGGTTTGAGTATGACGCCGCGGGCCGCCTGACGGGCCAGCGCGACTACGCGGGCCGCCTGACGGAATACCGCCACGACGCGCTGGGCCAGGTGACGGAGGTGATCCGCCATCCGCTGCCGGGCAGCCCGGACGCGCCGCTGGTCACCGCGTTTGAATATGACGTGCTGGGCCGCCTGACTGCGCGGGAAACCGCAGAGCACCGCACCGAATACCGCCACGACACGCTGTCGCTGGAAATCCGCCGCGCCACGCGCGCCGAATGGCGGCAGGCGCTGCTGGAAGAGCGCGAACCGGCGTGGGACGCCGTGCTGGTCTTCACCCGCAATGCCGCGGGCGAACTGGTGAGTGAGGAGAACCACGGCGGGAAGTTTGAGTATGAGTACGACGCGCTGGGCAACCTCAGCAGTACAACGTTTCCGGATGGCCGGGAGCTCGCCACCCTGCGCTACGGCACCGGCCACCTGCTGGAGATGCAGCTGCGCCACGGCGGGGCCACGCACACGCTGGCGGCATACGGCCGCGACCGCCTGCACCGGGAAATATCCCGCAGCCAGGGTGTGCTCTCACAGGAGACCCGTTACGATTCCGCAGGCCGCGTCACCCAGCGCACGGTGCTGGATGCGCGACGCGAACTGGTGTTCGAGCGTCGCTACCGCTGGGACCGCACCGACCAGATAGTCCAGCAGATACACACTGACTCAACCCCTGCCACGCCGGGCGAGAAATACAGCCAGTACCTGTGGGGCTACGATGCCGCGGGCCAGGTCACAAAAGCTGTCGAACCGCAGAAAGAAGAGCGCTTCTTCTGGGACGCCGCGGGCAACCGCACTGAGGAGCACCGCAACCCGGTGTGGCACAACCTGCTGCTGCGCCTCGACGGACTGAAGCTGGATTACGACGGGTTCGGGCGACTCACGCGCCGTCAGGATAAAAGCGGCGTAGTGCAGCATTTTGCCTACGACGATGAGCAGCGGGTAAAAGAGATTCGGTTTGAGGGGAACACGAAGTTCCGCCGGGTGGAATACCGCTACGACCCGCTGGGGCGACGCACGCATAAAATCCTGTGGCGCTACGGAGAGAAAGACCCGGAAACCATCCGCTTCGACTGGCAGGGTTTACAGCTTGCAGGTGAACAGAGCGACCGGGAGCCTGACCACTACGTTCAGTACGTTTACACCGAAGGCAGCTACGAACCGCTGGCGCGGGTCGACAGTGTGTATGACGACTGTGAGATTTACTGGTACCACACCGAACTCAACGGCCTGCCGGAGCGGGTTACCGACGCGAACGGCCAGACCGTCTGGCGCGGACAGTTCAGCACCTGGGGCGAAACCGAACGCGAGCTCAGCGTGCCGCAGTGGCAGGTGCCGCAGAACTTACGCTTCCAGGGCCAGTACCTCGACCGTGAAAGCGGGCTGCATTACAACCTCTTCCGCTATTACGACCCGGTTGCCGGGCGTTACACCCAGATGGACCCGATTGGGCTTCTTGGTGGGATTAATACTTATGGATATGTGCCGGACCCATTAACTTGGGTAGATCCGCTTGGTCTGTGCTTCAGTAAGCGCTTGGGTGATTTTGGTGAATCCCATGTTAAATCTCAGCTCGAACGCTCAGGAAATTATGCAGATGTTTTTTCAGTGCAAAATAAATCTAATAACGGTATTGATTTAGTCGGCTTAAGACATGACGGGAAATATGACTTCTTTGAAGTAAAAACAAATACAACCGGATCTGTCGGCAAATTAAGTGATCGGCAGCTAAGTTCAAAAGATTTTATTGAAACTGTTCTTGGAACAGATGTCCGGAAAGGAGGCTATGACCTTAACGGGCACAGAGTCAATGATATGTTGGATAATATAGGTGATACTTATGTTGTCGATGTTTTCGTTGAAAGAGCCTCTAATGGGCGTTGGTTTGTTTCAAAGTATTTGATGAGTATTTGGGAGTAA
>Prevotella_jejuni_strain_CD3_33_chromosome_II_idrD
ATGGTGTGGGCGCGCGACGTGCTTGGGAACGTGACGGCCTACGCCTACGACGACGCCGACAACCTGATACACATTGAGGAAGCGGACGGCAACGTGCACAACCTCGTGTACGACGCCTCGGGTAACCTCGTCAAGGCGTCGGACCTGCTGCACGACGTGGAGTTCGACTACGGCCCGCTGGGCACGCTCACGGGCAGGCGGCAGTTCGGCCGCCACGTGCGTTTCGGCTACGACAGCGAACTGCAGTTGCGCGAGATTATCAACGAGGGCGGTGAATGCTATTCCTTCCGTCTTGACGGACTCGGTCGTGTCGTGGAGGAGACTGGCTTTGACGGCCTGCAGCGAAAGTACCTTCGTGATGGTGCTGGCAGGGTGATACGGGTTGTACGTCCCGAGGGAAGACAGACAGAATATGAGTATGATGGTGTGGGCAATATTCTGCAGGAAACACAGCATGATGGTCGAACAAGCCGCTTTGCTTATGACGGGGACGGACAGCTGCTGCGTGCTGAGAACAAGGAGATGAAAGTAGCCTTCAAGCATGACCTTGGGGGACGGATAGTGAAGGAGACGCAGGGCGGGCACTGCATCATGCGCAAGTATGACCGTACAGGACGCCATGTGCACACCGAGAGCAGTCTTGGTGCAAGTATTGACTATTCCCACGACAAGGATGGCAACCTCAGTGAGATGAACAGCGGAGGATGGTCAGCCAGATGGCAGCGTGACAGGGTAGGACTGGAGACGGAGCGTGAACTTACAGGCGGAGTGCATGTAAAGACACATCATGACAGCCTTGGACGTGAGACCTACAAGTCAGTGGGCGCACGCAATGTCGAGCAGTTCCGCCGCAGCTACACCTGGGGCATCGGCAACCGTCTTCATGCAACCGAGGACGGTCTTACGGGGCGCCGGGTACGCTATGACTATGATGAGTTCGACAATCTCCTTTCTGCAGAGTACAAGCAGGGGAATGACGTTGAGCGCATCTACCGTATTCCTGACCGTATCGGCAATCTCTTTGAGACACGTGAGAAGGATGACCGTAAGTACGGTGCGGGCAGCCGTCTGACGGAAGACCGTGACTACTTCTATCACTATGACTGCGAGGGCAACCTGGTGTTCAAGGAGTTCAGGACGCTGGACTGGGCCGGTGTGTCCGTTCCGCTTGGCGGACAGAAGGCGCACCTGGAGGAGGAGCTCGGCATAACCTTCCGTGCCTTTGGCGCAGGCTGGCGTTATGACTGGCAGAGTGATGGTATGCTGTCGAGGGTTGTTCGCCCCGACAGCAAGGAAGTGTCATTCGCATACGATGCGCTGGGCAGGCGGACCGAGAAGACTTATGAGGGCGTCGCTACTCATTTCGTGTGGAACGGCAATGTACCGCTGCACGAGTGGCAGGAAGTTGCAGCTGATGCCGGAAAGGCAGATGTCACCACCTGGCTTTTCGAGCAGAACACATTCATTCCTGCTGCCAAGCTCGCAGCCAACGGCGAGAGTTTCTCCATCGTGTCTGACTACCTCGGTACTCCCCTGCAGGCATTTGACAATAACGGCAACAAGGTTTGGGAGCAGGAGCTTGACATCTTCGGCAGAAAGAGAAGAAAGAACAACAATAATTCCTCCTTCATTCCGTTTAAGTATCAGGGGCAATATGAGGATGTTGAGACGGGGCTGTATTACAACCGCTTCCGCTATTATGAGCCGAATACAGGAACCTATATCAGCCAGGATCCGATAGGGTTGGTGGGCGGAAACCCGACACTATATGCGTATGTCGGAAGTCCTAACAACTGGTATGATATATTTGGCTTACGACCATTCGGACATGCTGTGGGAGATATTGGTGAGAAAGCTGTAATTAATGACCTTAAAAAGAACAATTATGAAATTATAGACGTAAAATACGGCTCTAACAATGGCATTGATGTGCTTGCCAAGAATCCAAGTACAGGAAAATATGATGCCTTTGAGGTAAAATCCTCTACGGTAGGAAAATTCAATCTTTCTGAAGCTCAGATGGATCCTGACAACTTTGTAAGGACACGATTGGAAAGAGCTGAATCACATGGTATGATTACCGAAGATGCTCGAATAGATATTATGTCTAACTTGGGAGAAAGAAAGGTTGCTTATGTGGATATAAAACGAAGAGATAATGGGGCTCTGTATGCTGAATCAATCCATTATGAGAGTTGGGATGCAGAGGCAAAAAGAATAGAAAAACTAAAAAAAGGAGGACATCATTAA
>Proteus_mirabilis_BB2000_idrD
ATGGGAAATGAATTTTTCGCAGCCAAGCAAGGCGATAAATTAATGCACTCGTCAATCTTCGCCGATATCGTGTGTGGTGTGGTTAAAGGGGCGGTTTATGCCGCGATAGGGACGGCCGCTGTGGCGTTGACTATTGGTACAGGTGGTATGGCAACGGCCGCCGGTGTTGCCATTTTTGCCGGTGCAGGGCTGGCGGCGGGCTTTACCATCGGTGGATTGATTGATGATGCCGCTGATTGGGTCGCGGATGGGATCGATAGCTTGTTCGGGCTAAGTAACCCTGATGGTGAAATCGATACTGGCTCGAAAAATGTACGTGTCAAAGGCAAGGGAGCCGCTCGCGCAGCGGGAAAATTACCGTCACCCGTACTATTAGCCTCAATTGCGGCAGATAATACCCCACCGAAAGAAGAAAAAGGCACCGAGATTGCCTTACGCGTATTGAAAAATACCGCCATGATGGCGACTGGGGCATTTTCTTTATTGGCTGGTAAGGTGCTAGGGAGCCAAACCAACAATGCGGCACCCATAGCGCTCGATCCCTCCGAGTTTCCCGAATACCCCTCACCGGAGCAAGGTTTTTTTGACAGTATTCTTTCCCCAGTAAAAGCTGCGAAACACCCGAATGCTGAACCTATGCCTTTGGATACCATCACGTGCGATAAATGGCATTGTGCTAGCGCCCCTTACTTTTTAGCCGAAGGGTCGAAAAAAGTCCAAATTAATAGCCAACCGGCGTGTCGTAACGGTGATAGAAGCACTTGTGAAGCCAAAATCAGTGATACGCAAGAAAAAGGGATTGTGCGTATTGGCGGTGGCAGTGTGGTGGTACGCGATATTATGAGTGGGCGTAACCCCTTTGCCGAAATGTTTGGTGAAATTCTCGGTGGCCTTGCTGTTGGAGCGGCGTCTACCCTGTTTCGCAGAGGGGGACTCAAGGCGATTAAACAGTGTTTGAATAGTGTAGGCTGTACTTTGGTGAGTGAAATCGCCAGTACCGCGGTGGGCAGTATGGTAGTCGCCTCGGTGACCCAAGCCTTTAATAGTGTGCGTCACCCCGTGCATGCGGCTACGGGGGCGAAAGTCCTAAATGGAGAGGATGATACCGATTTTATTATTGATGGTGTGTACCCGTTAGTCTGGTCGCGTATTTACCAAAGTCGTAATACGGGTGAAAATCGTTTAGGACGCGGTTGGTCGATGCCGTTTGATGTCTTTTTGACAATTGACGACACAGGTAAAGGGCTTGAGAATGAAAACATTTATTATCATGACATGTCGGGGAGGCGTTTAGCGTTAGGAAAAATCGCCCTTGGACAGAAAGTGTTCTATCAAGATGAAGGCTTTACCGTCTATCGCACGACAAACAATCTTTTCCTTGTTGAATCGGCGGAAGGTGATTATCAACTTTTCGAAGCCAATCCACATAAAATCAATACACTGCGCTTAATGAAAAGTGCTGATCGTCATAATAACGCTTTACACTACCGCTATGCTAATGACGGTGAATTGGTACAAATTCATGATGATGCGTATCTGACCGATATCCGGTTACATTATGATGAAATCACCCAACGCTTACAGTCGGTGACCCGCCACCAAGGACAAGAAGAAAAAACGCTGGTCACTTATACTTATGATGCACAACAGCGTTTAGTGCAAGTCACTAATGCGGATAAGCGAGTGACTCGTCGTTTTGGCTGGGATGATGAATCGGGTTTAATGGCCATGCACCAATATGCCACAGGGGTCAGTAGTCATTATCGTTGGCAACGTTTTGATGCCTTTACCATTGAAGACAATGAACCTGAGTGGCGAGTGGTTGAACACTGGCTTAAAGACGGTAAACGCTGTTTAGAGCATACCGAACTGACGTATGATTTAGCTCAACGAACCCTCACCACCGTGGAAACGGGGGGTGAAACCACCTTCCGTCGCTGGAATGAACAACAGCAAATTATCGAATACACCAATGCATTGAATGAAACGTGGTGGTTTGAATGGGATACCAGCCGCTTATTAACGAAAGCAATAGCCCCAGATGGCAGTGAATGGGGGTATACCTATGATGAGCGCGGTAATTTAACCCAATCGACCGATCCGGAACAACAATCGACTTGCTATGATTGGGACAAAGATTTTGCGTTTCCCACGGCACAAACTTTGCCAAATGGGGCGGCTTGGCATTGGGAGTACAATGAGCACGGCGATATTCGTCGTGTGATTGACCCGCTTGGCCATATCACGCGCTTGGCGTGGGATGACCAAGGCCTGTGTCTGGGGCAAGTGGATGCTAAGGGTAATGAAACCCACTATCGCTATAATGCCCGCGGTCAGTTAATCGAGCAACGGGACTGTTCGGGTTATCCCACCACCTTGACCTACGATGATTGGGGACAACTTCGCTCGTTGACCAATGCACAGAATGAAACCACGACTTACACCTTTAGTGAAGCGGGGTTGTTGTTAACAGAGCGCTTACCGGATGGGACAGAAAACCGTTATGACTATGATGCCACCGGGCAATTAGTGGGAATAACGGATGCCGGAGAGCGCCATATTCTGTTGCGCCGTAACCGTCGTGGACAAGTGATAGCCCGACGCGATCCGGCAGGGCATTGGTTGCATTTTCACTATGATACTTTCGGGCGCATGCAGGCACTGGAGAATGAACAGGGCGAGCAGTACCGCTTTGAGTATGATGCCTTGCATCGTTTAACCGATGAACATGATCTTATCGGACAACAAAAGCACTATCAGTACGATGTGATGGGGAATGTCACCCAGATAAAAACTACCCCGGGGCCTTGCGTTGATACGCCGATGCCCTTATCCCCGCTGGTGACCACCTTCGGTTACGATAAAGTGGGACGGTTGCTGTTTCGAGAAAACGCCGATTATCGCACGGAATACCTTTATCAACCTTTGAGTGTGACGTTACGCCGAGTCCCCATGGCGATATGGCATGAAGCCGAACGCACCGGCACCACCGCCCGCGTAGAGTATCAAGACGCCCTGACCTTTACCTACGATAAAGTGGGGCAGTTGGTTCGAGAAGCCAGTGCCCGCGATGATTATCAGCATCACTATGATGTGCTGGGTAATATCACTCGCACCGAGTTGCCCCATCAAAGGGCATTTGAATACCTGTATTACGGCTCTGGGCATTTACAACAAACGCAATGGCGAGATAATGCGCAACTGACGGTATTGGCGGAATATCAACGAGACCGTTTACACCGAGAAACGTTGCGCACCAGTGGCGCCTTAGACAATGAAACGGGCTATGACTGTCGGGGACGTATTACGCACCAAGTGGCGCGCCAAATGAATGCCTCACAGTTTGTGACCCCGGTGATTGACCGTCGTTATCGTTGGGATAAACGCAATCAACTTATCGAGCGCAGTGTCAGTTATGGTCAAACAGGCGAGGTTTTTACCGCCGGACATTGGTACTACCACAGTTATCAGTACGACCCGCTGGGGCAATTGACGGCGCATTTAGGGTCGGTGCAAACCGAACACTTTCTGTATGATGCGGCGGCCAATTTACTGACTCGTCCGCACAGCGAAGCACCGCATAACCAAGTACAAGGCAGTGATAAGTATGATTATCGCTACGACGGTTTTGGTCGCATGGTCTCCCGCTATGAAAAAGGCAGTAGCTCGGGACAACGTTATCACTATGACAGTGACCACCGCATTATCGCGGTGGATATCGACCAAGGGCCCTTAGGGTATCAGCGAGCGGAATACCGCTACGATATTTTAGGGCGCCGGATTGAAAAACGGTTATGGAAAGCCAGTGCGATTGCCAACACAGTCACCTATCACCAACATGAGCCGGATGAAGTGTACACTTTTGGCTGGGTAGGGATGCGACTGGTGTCTGAGCACAGCAGCGCAGCGCCCCATACCACGGTTTATCATGCATACAATGACCAAAGTTATACGCCTCTTGCTCGCATTGAATGCACGGATAACCCGCTTAATCCGCAACGGGCGATTTATTATACGCACAGCAGTTTAAGTGGCTTACCGGAAGCGTTGACCAACAGTGAGGGTGAAATAGTTTGGCAAGGGCAATACAGTGTTTGGGGACATTTACAGCGTCAAACCCGCCCCACAAGCACATTTAATCGTGAACAGAACCTGCGCTTTCAAGGGCAGTATTTCGACAAAGAGACGGGGTTACATTATAATACGTTCAGGTACTATGCCCCGGATTTAGGCCGTTTTACCCAGCAAGACCCAATAGGGTTGGCAGGGGGCATAAACTTATATGCTTATGCGCCAAATCCATTGACATGGGTGGATCCGTGGGGTTGGAGTTGCTTTAGTGCTCGGGTAGGTGCTTTTGGTGAGAAAAGAGTTATGAAATACTTATCTGGAGCGGGCTATAAAAAAGTTTTTTCTGTACAAAACAATTCTGGGCATGGTCTGGATATAGTTGCTTTAAGACCAGATGGAAAATTTGATATTTTTGAAGTTAAAAGTTCGACAATAGGACAATTTTCTTTATCTTCCCGCCAAGCTACAGGCGATGATTTTGCAAAAATAGTTCTTTTAAACGATGTGAAAAAAGGAGGTTATAATATTATCGATATAGATGGTAATGTTAAAGCAATTACAAGTAAACAAGCTAGATACATTTATAATAACATAGGAACAACCGAGTGGGTTCAGGTAAATGTTGGCCGAAATCCCAAATTTTATGATCAGAATATAACGTTTGAGGAATGGTAG
>Pseudomonas_fluorescens_F113_idrD
GTGCAGGACGGCAAGAATCGCCTGGTCACCGCCAGCCTCACCAGCCGCGACATCAAGAACAAAGCCGAGTACGACGCCAGCACCGTCAGCGTCGGCGGCGGCTATCGCGAAGTCGGCAAGGACCAGCAGGGCAACGCCACCAGCGGCGGCACGCAAACCCCTGGCACCGACCTGCCGAAGAACGACAACGACCTCAGCGCCACCATGCCGATCGCCATCAGTGCCAGCGACGATGCCAGCAGCGTCACCCGCAGCGGCATCAGTGGCGGCACCATCGTCATCAGCAACGACGCCGAGCAACAGAAGCGCACCGGCAAGAACGGCGAGCAAACTATCGCCAGCCTCAACCGCGATGTCTCCAGCGACCGCGACACCACCAACACGCTCAAGCCGATCTTCGACGAAGACGAAATCCGTGCCGGCTTCGAGATCGTCGGGGCGTTCGCCAACGAAGCCAGCACCTTGCTCGCCAACAAGGCCAAAGAAGTCGACCTCAAGCGCAAACAGGCCGAGGCGGCTCAGACAGCGGCCCTGGACCCCAACGCCAAACTCACCGACACCGAACGCCTGGCGCTGCTCGGCAGCGCTCAGCGCCTGAACGCCGAAGCCAGCGGCATCGCCGACAAGTGGGGCGCGGGCGGTACCTACCGGCAGATCACCACCGCCCTGGTGGGCGCGGCCAGCGGCAACATCAGCGGCGGTAACAGTGCCTTCGTTCAAGGGCTGGTGGTCAATTACGTGCAGCAGCAAGGTGCCGGCTACATCGGCGACCTGGTGGCGGACGGCAAACTCACCGAAGGCTCGCCGCTGCACGCCGCCCTGCACGGCATCGTCGCCTGTGCCGGCGCCGCCGCCAGCAGCCAGAGCTGCGGCAGCGGTGCTGCAGGCGCGGCGGCGTCGAGCCTGCTCACCGGCCTGTTCTCCGAAGCCAGCCCCGATGAAAGCGAATCCCAACGTGAGGCCAAGCGCAACCTGATCGTCAGCATCGTCACCGGGCTGGCCGCCACCAGTTCCTCGCTCGACCCGGCCACCGCCACCAGCGCGGCCGGGGCAGCGGTGGACAACAACTGGCTGGCGACTCAGCAGTTGGTTCAGGCGGAGAAGGAATACGACGCAGCTGACGGCTTAGGGGCCAAGGCGCAAGTGATCGCCAAGTGGGCGTACATCAATCAGAAACAAGACGTCCTGACCCGCTTCGGCGTGGGCAAAGGCTTGGTGCAGGCGGGGTGGAGCGACGTCGAAGGCTTGGCGCAATTCCTCCTTCACCCGGTTGATGGAATCAACGGCCTCAAGGCCCTGATCAACGACCCCGAGGCTCGCAAGCAGTTTGGCGATGGCGTGGTCAACGAACTCAACGCCAAACTTGACCGCGTTAAAGACGCCCTGGAAAACGGTGGTGATGATCGCGCCGTGCAACTGGGCCAAGACATTGGCGAACTCGTCTGGCAGGTCGGCAGCATCGCCACCGGTGTTGGCGGGGCGGCCAAGGGTGGCGTGGCGTTGGCGAAAGCCGGGATCAAGGTGGGGGCGGAAGGGCTTGAAAAAATGGCCGATATGGCCAGGCTTGAAAAAATAGCGAAAGTCGAAGCGCCAGCTACCAAGCCATTGCCTGACTGGAGCGGTGATGGACCCTCCGTCGGGAAGGGGCCAGATAAAACAGTCCCTCCGAGCACGACAACTGATCTGTTTGAAGCGTCTAGAGGTAAGGATCTTTCGACCCTGACAAACCAGCAAATTGGCGACTTGGGTGAGGCGATATCCAAAACCTTCTTGCGCGATAATAACCATACTGACATCTTTGCTATCCAGAACAGTAGCGGAAATGGCATCGATATTGTCAGTCGCACACCTGATGGTCGTCTCGCCTTCACTGAGGTGAAAACATCGCGAACAGGGAGTATTGGAGGGCTTTCCAGTCGTCAAGAAAATATGACGACGTTTGTCGAGGATATTCTGACTCAGGCAGCTAGTCGCCAGGGGCGCTACCGGAATATTGACGTCGCTGATCAACAGCGGGCACGTCAGATGCTCAGGGAGTTTAGGCAAACACCTGAAAGTGTGAGCGGAAATTTGATCGGTGTTGATCTAAAGGGAGAAGTATTGAGAGTCTCTCCTTGGGCGCGCAATTGA
>Rothia_C6B_idrD
atgaccgacatccacggcaatctcttatggtacagcgaatacaccgcctggggtcgtctgaaaaaggacgggcgggtttacaagaacgcgcaccagccgttcagattgcagaaccagtattacgatgaagagaccgggctgcattataacttgttgcgttattacgagcctgaggctgggcggtttgtgaatcaggatccgattgggttgttgggtggggagcatctttatcaatttgcagataacgctttggtttggttcgacccgttgggattgaaaaaaacctacgtgcagcgcttgggtactgccggcgagcgcagggtaatgaaatatctggaaggtacgggatacaaaaaagtcttctctattcaaaatgcctccggaaacggcttggatattgtcgcattgagaccggatgggaaatatgacatatttgaagttaaaagttctaaacgtggaaaatttaaactaagcgaaagacagcagaaaggcggtaaatgttttgctgagcaagtgttgacggaagatgtaacggataagaaaaaaggcggatattttatgaaaggcttagacggcaaaaagacgcctcttaataagaagaaggcgcaagagatatttaataatatagataaaaccgaaactgtctttgttgatatgaatcataaatttcaagcaacccgcatgacatttagcccatggtaa
>Xanthomonas_citri_pv_malvacearum_strain_XcmN1003_idrD
ATGGCCAGCACCACCGAGATCGCAGCACTGCTGTCTGCCACCGATGCACAGGGCCGGGTCGATCCGATGGCCTACGAAGCGCAACGGTTGATCCTGCAGCACCAAGGCCTGGGGAGCGTGGATGCTGCTGCGCTGGTGCATGCTCTGCCGGCCTCGCCGGGATTCGCCACCTTCGACCGTAACGCCTTCTTTTCCGCCATCGACCAGCGCCTGGAGACGCCGCAGGAACGGCAGCGCTTTGCTGATGCGCTGGACCAGGCCAACCTCAGCGATAGCTTGCTCGAGCGGCTGGGCGAGCGGGCCTTTGAGGCCGGTGGGCAGGCCTATGACGCTACGCGCAACGCGGCGAGCCGGGCCGGCGCCCAAGTCAGCGATGGCCTGGCCACTGCGCAGCGCACCGCCCGCCACCCGGCCGCCGATCCCGGTGCCTCGCCCGCGTTGCGCGCTGCAGCGAATGTCGGCAGCAACGTGATCGGTGAGGCTCAGCAAGGCTACGGTTTCGTCACCGGCGGTGCCACGCACGCCCTGTCCACGCTCAGCGATACCGTCGATCTGGCGCGCATGGGTGTGCGCTTGGCGTCCGATCCCAACTACCGAGACGTGGTGGTCGGCATGGTCCGCATCTATGCCGCCGAGGTCGCCGCCGACCCGCACAAGCCGATCGACCAGGTCCGCAGCGCCGTCACGCAGGCGTGGCAGCATTGGGAACAGGGGTTGGCACAGGCGCAACGCGATGGGCGCGAGCGCGAATACATCGGCAGCGCGACCGGTGCGGCCGGAGTGGAGATCCTGGCCACCCTGGTGCCGGCAAGCAAAGTGAGCAAGCTCGGCAAGGCCGCGCAGATCATGGAGGACATCGCCCCCCGGGCCGCCGCCGAGGCCAGCGAAGGCCTGGCCGATGCCGCGCGTGCCGAACGCGCTGGTGCGCGCGCCCACGACGGCACTGAGTATGCGCCCGAGCGCCATGCGGGCTTGCCCGGCCGCGCTCGCGCCTACGCCGAGGGGCTGGAGGAAATCGGCCAGGCCGCCCGGCGTGGTGGTGAGGCCGGCGAAGGGGCGCAGCAGATGCTGCGCGGGCTGGTCGGCCAGGCCCGCCAGGACGGCCACCTGGACGATCTGGTCCGCGCCGCGCGCGCCACCGACAACGTCGAAGGCCTGCTGCGCAGCGGCCAACTGGAACCGCACGAACTCACCGCCATCCTCAAGCGCACGCCGGGTGTCTTTGACGGCAGGATCGATTTCCCGACGGCCCTAGACCACAGTACGCAGGGCGTGGACCTGACCCGGCTCACCACCCGCCAGCTCGGCGACATCGGCGAAGCCATCCAGACCCATGACCTGGTAAAGCAGGGATATAGCGACATCGTCGCGATCAAGAACCGCTCCGGTCACGGCATCGACCTGGTCGGCCGCAACCCCGACGGTGATCTGGAGTTCTTCGAGATTAAGACCTCGGCCAAGGGATTGGCGCCGGCGCAACATGGGGACCCGGAACAATTCGTAGCCAAGCGGCTTGAGCGGGCGATTGACGCCCAGGGTCACTGGGATCCGAAGAACACGATACCTGGGCTGAAGTCGATCGCTCAGGATATCAGCAGGGAAATCGGCCCGGAGGCCGAGAACATCAACGCCACGTGGGTGCGGCTCAATCTGAGCCGGAGCCCGGACGCGCCGCACTTGCAGATCGAACGCAAGATAGATCCATGGGGCAAGCCCGCAGCGCCGAAACAGACGATGCTGACTCCTGGCGATGCCGACCACCCAGACCACGCCGCATTTGACACGATACTGCGCACGGTGCAGGGCGACGGCCGTTGGAACGTTGCGGAAAGCCACAATATTGCCGCGAGCCTGCTGAAGGAATACAAGGCCGATCCGCTGTGTAAGACCCTGGACTCGGTGCGCATCGGCGGCAGCGCCGATACGCCATGCATCTTCGCGGTCAGTTCGCCACACGGCGACAGAGGCCCGGATTTCCATGTCCACGTCCAAGCCGAGCAAGCTGCGCAGCGCCCGGCGGCCGAGAGCTTCGCTCAGGTCCAACAGATCAACCAGCAACAAGCGCTGACCCAGGAAACAGAGCAACAGAACCTGGCTCACCGCGGCCCGAGGATGAGTTGA

We put these sequences in a file called idrD.fna.

Identifying metagenomes likely to contain populations with idrD

Since there are too many metagenomes to practically download and query all, we screened for metagenomes likely to contain our sequences (or similar sequences), using SearchSRA. Briefly, SearchSRA employs the PARTIE algorithm (Torres et al 2017) on subsamples (100,000 reads per metagenome) from all metagenomes on SRA (which mirrors ENA and other international short read databases) and queries them with bowtie2. We uploaded the FASTA containing all 7 idrD nucleotide sequences to SearchSRA and searched against all metagenomes.

This resulted in a zipped folder with all the alignments. It was called results.zip; we renamed it to searchsra_results_20190113.zip and then unzipped it with the command unzip searchsra_results_20190113.zip and then entered the resultant directory with cd searchsra_results_20190113.

Now, we wanted to find which metagenomes (that is, the *.bam files produced from PARTIE) had non-zero, or practically zero coverage. Since PARTIE works by only mapping a [small] subset of reads from each metagenome, the actual mapped counts are not informative, other than whether any reads mapped (meaning that metagenome sampled a population with idrD) or didn’t map. For this first pass filtering of metagenomes, we worked off of file size, as metagenomes with zero mapping were 4KB, so we kept anything above 4KB. We did that with this one-liner:

for file in $(ls | grep "^[0-9]*$"); do 
du -h $file/* | grep -v '4.0K'; done | awk '{print $2}' | sed 's/.bam//; s/^[0-9]*\///' 
> sra_run_accessions_gtt4KB.txt

This resulted in 3801 metagenome accessions. We downloaded the available metadata for each of these metagenomes by looping through each entry like so:

for entry in $(cat sra_run_accessions_gtt4KB.txt); do 

esearch -db sra -query $entry | efetch -format runinfo >> sra_metadata_accessions_gtt4KB.txt

done

This comma-separated file has the needed data but also junk whitespace (empty lines) so we cleaned that and converted to a tab-separated file by running

grep -v '^$' sra_metadata_accessions_gtt4KB.txt | tr ',' '\t' > temp 
mv temp sra_metadata_accessions_gtt4KB.txt

We took this list and manually curated to remove genome assemblies, metagenomes with too-specific library preparation methods (e.g. target capture), etc. to ultimately produce a list of 1,189 metagenomes, including both single- and paired-end sequence libraries. This table can be found at this Drive link. In this table, the sample site for each metagenome is listed in the column “ScientificName”, with varying information content depending on the uploader.

Preparing the metagenomes

Downloading and sorting

We copied the curated list of 1,189 accessions, just the accessions with one accession per line, to a text file called curated_sra_list_20190114.txt and uploaded it to the cluster. The first few lines of the file look something like this:

SRR1182872
SRR948162
SRR1291146
SRR1291147
SRR948162
SRR1291146
SRR1291147
SRR948163
SRR1291148
SRR1291149

Note: Everything from here until the end of Retrieving the coverage and variability data, occured on Harvard’s Odyssey computational cluster, due to the large filesizes and computational resources. Odyssey uses the Slurm workload scheduler; batch scripts copied below retain all header information for maximum reproducibility.

We looped through this list of metagenomes to download each one in parallel batches, using the following script submitted to Odyssey:

#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --contiguous
#SBATCH --mem=1G
#SBATCH -t 0-01:00:00
#SBATCH -p shared
#SBATCH --job-name="downloadMaster"
#SBATCH -o odyssey_downloadMaster.out
#SBATCH -e odyssey_downloadMaster.err
#SBATCH --mail-type=NONE

# chunk the accessions into batches
cat curated_sra_list_20190114.txt | split -l 20 -d - sraBatch_

# get the batches
batches=$(ls sraBatch_*)

# write and submit the script that does the downloading, for each batch
for batch in $batches; do

echo "#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --contiguous
#SBATCH --mem=1500M
#SBATCH -t 0-06:00:00
#SBATCH -p shared
#SBATCH --job-name='$batch-tar'
#SBATCH -o odyssey_unstuff-$batch.out
#SBATCH -e odyssey_unstuff-$batch.err
#SBATCH --mail-type=NONE

module load sratoolkit

for sra in \$(cat $batch); do

# fastq-dump does the downloading, --split-3 dumps 'junk' or unpaired reads a 3rd file
fastq-dump --split-3 $sra

done
" > slurm-$batch.sh && sbatch slurm-$batch.sh

sleep 30

done

Some details about this script:

  • The lines starting with #SBATCH are instructions to the cluster we used (which uses Slurm)
  • This is an ad-hoc parallelization, where the first part does not actually download but splits the list up into batches of 20 accessions and then generates and submites a script for each batch to actually download the metagenomes.
    • Note: This could be written more efficiently as a job array, but our cluster was unreliable at this time so retaining individual batch’s scripts was useful for resubmitting stalled jobs.

This resulted in 1,188 metagenomes (a single metagenome failed to download; attempts to download individually failed repeatedly). The next step was to separate paired-end sequences from single-end sequences as the mapping program handles them differently. We accomplished this with a small script:

#!/bin/bash

mkdir single_end
mkdir paired_end

# move paired ends, clean off 3rd file of errors
for r1 in $(ls *_1.fastq); do
  
  # extract the ID
  prefix=$(echo "$r1" | sed 's/_1.fastq//')
  
  # move the PE files to paired_end
  mv $r1 paired_end
  mv ${prefix}_2.fastq paired_end
  
  # delete any 3rd errors
  rm ${prefix}.fastq

  # make list of which samples were pe and where to find them
  echo "paired_end/$prefix" >> sra_paired_end.txt  
done

# move single fastqs leftover
# make list of which samples were se and where to find them
ls *fastq | sed 's,.fastq,,; s,^,single_end/,' >> sra_single_end.txt
mv *fastq single_end

Mapping

Now that everything was set up, we mapped the metagenomic short reads to the reference idrD sequences. We chose to use bowtie2, an accurate and fast gap-sensitive read aligner (Langmead & Salzberg 2012). The first step is to turn the FASTA file of idrD nucleotide sequences into a bowtie2-compatible reference database, carried out with the command bowtie2-build idrD.fna idrD. Now, we can map each metagenome against the idrD reference database using the following scripts:

#!/bin/bash
#SBATCH -N 1 
#SBATCH -n 16 
#SBATCH --contiguous
#SBATCH --mem=32G
#SBATCH -t 1-12:00:00
#SBATCH -p shared
#SBATCH --array=0-32%33
#SBATCH --job-name="bt-pe"
#SBATCH -o odyssey_map_pe_%a.out
#SBATCH -e odyssey_map_pe_%a.err
#SBATCH --mail-type=END

# split -l 20 -d -a 4 sra_paired_end.txt pe_batch-
# mkdir pe_bt_mapped

style="pe"  # pe = paired end, se = single end

module load samtools bowtie2 zlib xz
taskNum=$(printf %04d $SLURM_ARRAY_TASK_ID)

# get list of sample-runs, chop into per sample batches
batch=${style}_batch-$taskNum

# iterate for each participant
for samp in $(cat $batch); do

sampOut=$(echo "$samp" | sed "s,^[a-z_]*,${style}_bt_mapped,")

R1=${samp}_1.fastq
R2=${samp}_2.fastq

# map sample set
bowtie2 -x idrD -1 $R1 -2 $R2 --no-unal -S $sampOut.sam --threads 16
samtools view -S -b $sampOut.sam > $sampOut-RAW.bam
samtools sort $sampOut-RAW.bam > $sampOut.bam
samtools index $sampOut.bam

# clean as you go
rm $sampOut.sam $sampOut-RAW.bam

done

The above script was for the paired-end metagenomes, and the script below was for the single-end metagenomes:

#!/bin/bash
#SBATCH -N 1 
#SBATCH -n 16 
#SBATCH --contiguous
#SBATCH --mem=32G 
#SBATCH -t 1-00:00:00
#SBATCH -p shared
#SBATCH --array=0-54%20
#SBATCH --job-name="bt-se"
#SBATCH -o odyssey_map_se_%a.out
#SBATCH -e odyssey_map_se_%a.err
#SBATCH --mail-type=END

# split -l 20 -d -a 4 sra_single_end.txt se_batch-
# mkdir se_bt_mapped

style="se"  # pe = paired end, se = single end

module load samtools bowtie2 zlib xz
taskNum=$(printf %04d $SLURM_ARRAY_TASK_ID)

# get specific batch
batch=${style}_batch-$taskNum

# iterate for each sample
for samp in $(cat $batch); do

sampOut=$(echo "$samp" | sed "s,^[a-z_]*,${style}_bt_mapped,")

# map sample set
bowtie2 -x idrD -U ${samp}.fastq --no-unal -S $sampOut.sam --threads 16
samtools view -S -b $sampOut.sam > $sampOut-RAW.bam
samtools sort $sampOut-RAW.bam > $sampOut.bam
samtools index $sampOut.bam

# clean as you go
rm $sampOut.sam $sampOut-RAW.bam

done

IMPORTANT: Before running these scripts, as they are job arrays, we ran the first commented-out lines split -l 20 -d -a 4 sra_single_end.txt se_batch- and mkdir se_bt_mapped for the single-end metagenomes, and the corresponding lines for the paired-end metagenomes. These commands first make the lists of files (with 20 accessions per file) that will be processed by each instance of the job array and secondly make the directory into which the mapping results will be placed.

Importing the results into Anvi’o

We managed the data with anvi’o, a platform for analysis and visualization of ’omics data (Eren et al., 2015). The first step is to make a so-called contigs database to hold the sequence information, which we accomplished with the command anvi-gen-contigs-database -f idrD.fna -o idrD-CONTIGS.db --skip-gene-calling

Now we can profile the contigs database with the bowtie2 results, which will make a so-called profile database that stores and summarizes the bowtie2 mapping results in a readily accessible manner. For the paired-end metagenomes, we used this script:

#!/bin/bash
#SBATCH -N 1 
#SBATCH -n 4
#SBATCH --contiguous
#SBATCH --mem=2G
#SBATCH -t 0-01:00:00
#SBATCH -p shared
#SBATCH --array=0-32%33
#SBATCH --job-name="prof-pe"
#SBATCH -o odyssey_prof_pe_%a.out
#SBATCH -e odyssey_prof_pe_%a.err
#SBATCH --mail-type=END

# split -l 20 -d -a 4 sra_paired_end.txt pe_batch-
# mkdir pe_profs

style="pe"  # pe = paired end, se = single end

# activate anvio
source ~/virtual-envs/anvio-dev-venv/bin/activate
module load samtools bowtie2 zlib xz

taskNum=$(printf %04d $SLURM_ARRAY_TASK_ID)

# get list of sample-runs, chop into per sample batches
batch=${style}_batch-$taskNum

# iterate for each participant
for samp in $(cat $batch); do

bam=$(echo "$samp" | sed "s,^[a-z_]*,${style}_bt_mapped,")
profOut=$(echo "$samp" | sed "s,^[a-z_]*,${style}_profs,")

# make anvio profiles, -M 0 means don't skip profiling contigs less than 2.5kb
anvi-profile -i $bam.bam -c idrD-CONTIGS.db -o $profOut -M 0 -T 4 --write-buffer-size 700

done

Note that like before, the split and mkdir commands commented out in the script were run at the command line prior to submitting this array; likewise for the single-end metagenomes but substituting single and se for paired and pe, respectively.

For the single-end metagenomes, we used the same script but changed the line style="pe" to style="se"

This resulted in two directories, pe_profs and se_profs, each with hundreds of profiles, one per metagenome, that contained all the coverage and variability information we are interested in. To facilitate future steps, we used the anvi-merge command to merge the indiviual profile database into a single merged profile database with the following script:

#!/bin/bash
#SBATCH -N 1 
#SBATCH -n 1
#SBATCH --contiguous
#SBATCH --mem=12G
#SBATCH -t 0-04:00:00
#SBATCH -p shared
#SBATCH --job-name="merge"
#SBATCH -o odyssey_merge.out
#SBATCH -e odyssey_merge.err
#SBATCH --mail-type=END

# activate anvio
source ~/virtual-envs/anvio-dev-venv/bin/activate

se=$(cat sra_single_end.txt | sed 's/^[a-z_]*/se_profs/')
pe=$(cat sra_paired_end.txt | sed 's/^[a-z_]*/pe_profs/')

# make anvio profiles, -M 0 means don't skip profiling contigs less than 2.5kb
anvi-merge $se $pe -c idrD-CONTIGS.db -o idrD-MERGED -M 0 --skip-concoct-binning

Retrieving the coverage and variability data

Now that all the data was together, we extracted the coverages and variability information per site per gene for each metagenome. To do this, we made a dummy collection to keep anvi’o happy with the command anvi-script-add-default-collection -p idrD-MERGED/PROFILE.db. Then, we exported the coverages with anvi-get-split-coverages -c idrD-CONTIGS.db -p idrD-MERGED/PROFILE.db -C DEFAULT -b EVERYTHING -o idrD_covs.tsv and the variability information with anvi-gen-variability-profile -c idrD-CONTIGS.db -p idrD-MERGED/PROFILE.db -C DEFAULT -b EVERYTHING --include-split-names -o idrD_var.tsv

Here are first 10 rows of the coverage data:

nt_position split_name  sample_name coverage
0   Acinetobacter_baumannii_strain_XH858_idrD_split_00001   ERR1121387  0
1   Acinetobacter_baumannii_strain_XH858_idrD_split_00001   ERR1121387  0
2   Acinetobacter_baumannii_strain_XH858_idrD_split_00001   ERR1121387  0
3   Acinetobacter_baumannii_strain_XH858_idrD_split_00001   ERR1121387  0
4   Acinetobacter_baumannii_strain_XH858_idrD_split_00001   ERR1121387  0
5   Acinetobacter_baumannii_strain_XH858_idrD_split_00001   ERR1121387  0
6   Acinetobacter_baumannii_strain_XH858_idrD_split_00001   ERR1121387  0
7   Acinetobacter_baumannii_strain_XH858_idrD_split_00001   ERR1121387  0
8   Acinetobacter_baumannii_strain_XH858_idrD_split_00001   ERR1121387  0

And here are the first 10 rows of the variability data:

entry_id    unique_pos_identifier   pos pos_in_contig   split_name  sample_id   corresponding_gene_call in_partial_gene_call    in_complete_gene_call   base_pos_in_codon   codon_order_in_gene codon_number    gene_length coverage    cov_outlier_in_split    cov_outlier_in_contig   A   C   G   N   T   reference   consensus   competing_nts   departure_from_reference    departure_from_consensus    n2n1ratio   entropy
0   0   155 155 Prevotella_jejuni_strain_CD3_33_chromosome_II_idrD_split_00001  DRR046092   -1  0   0   0   -1  0   -1  13  0   0   0   9   4   0   0   C   C   CG  0.3076923076923077  0.3076923076923077  0.4444444444444444  0.6172417697303416
1   0   155 155 Prevotella_jejuni_strain_CD3_33_chromosome_II_idrD_split_00001  DRR046097   -1  0   0   0   -1  0   -1  14  0   0   0   10  4   0   0   C   C   CG  0.2857142857142857  0.2857142857142857  0.4 0.5982695885852573
2   0   155 155 Prevotella_jejuni_strain_CD3_33_chromosome_II_idrD_split_00001  SRR1284495  -1  0   0   0   -1  0   -1  35  0   0   0   31  0   0   4   C   C   CT  0.11428571428571428 0.11428571428571428 0.12903225806451613 0.355382896246011
3   0   155 155 Prevotella_jejuni_strain_CD3_33_chromosome_II_idrD_split_00001  SRR3067121  -1  0   0   0   -1  0   -1  18  0   0   0   14  2   0   2   C   C   CG  0.2222222222222222  0.2222222222222222  0.14285714285714285 0.6837389058487535
4   0   155 155 Prevotella_jejuni_strain_CD3_33_chromosome_II_idrD_split_00001  SRR3067123  -1  0   0   0   -1  0   -1  27  0   0   1   21  2   0   3   C   C   CT  0.2222222222222222  0.2222222222222222  0.14285714285714285 0.7544627023259549
5   0   155 155 Prevotella_jejuni_strain_CD3_33_chromosome_II_idrD_split_00001  SRR3067131  -1  0   0   0   -1  0   -1  36  1   1   1   32  2   0   1   C   C   CG  0.1111111111111111  0.1111111111111111  0.0625  0.4643566259363562
6   0   155 155 Prevotella_jejuni_strain_CD3_33_chromosome_II_idrD_split_00001  SRR3067135  -1  0   0   0   -1  0   -1  54  0   0   0   49  4   0   1   C   C   CG  0.09259259259259259 0.09259259259259259 0.08163265306122448 0.3548290085661212
7   0   155 155 Prevotella_jejuni_strain_CD3_33_chromosome_II_idrD_split_00001  SRR3067140  -1  0   0   0   -1  0   -1  13  0   0   1   9   1   0   2   C   C   CT  0.3076923076923077  0.3076923076923077  0.2222222222222222  0.937155853065701
8   0   155 155 Prevotella_jejuni_strain_CD3_33_chromosome_II_idrD_split_00001  SRR3067141  -1  0   0   0   -1  0   -1  61  1   1   0   57  1   0   3   C   C   CT  0.06557377049180328 0.06557377049180328 0.05263157894736842 0.27891059970489007

The coverage and varibility files were copied down from the cluster and then plotted with R, described in the next section.

We also ran these analyses for Prevotella spp., described later on in this document, but the coverage data here and variability data here are here as the code below will process both at the same time.

Inspecting metagenome coverage

Preliminary data processing

First, we read in the data like this:

library(ggplot2)
library(reshape2)
library(plyr)
library(dplyr)
library(tidyr)
library(entropy)

reads_path <- "~/idrD_analysis/idrD_covs.tsv"
var_path <- "~/idrD_analysis/idrD_var.tsv"
metadata_path <- "~/idrD_analysis/sra_metadata_accessions_gtt4KB_mgenomes_abbrev.txt"
prevo_cov_path <- "~/idrD_analysis/prevo_idrD-covs.tsv"  # this is for later
prevo_var_path <- "~/idrD_analysis/prevo_idrD-var.tsv"   # this is for later
covs <- read.table(reads_path, header = T, sep = "\t")
vari <- read.table(var_path, header = T, sep = "\t")

# check out the first few lines
head(covs,3)
head(vari,3)
##   nt_position                                            split_name sample_name
## 1           0 Acinetobacter_baumannii_strain_XH858_idrD_split_00001  ERR1121387
## 2           1 Acinetobacter_baumannii_strain_XH858_idrD_split_00001  ERR1121387
## 3           2 Acinetobacter_baumannii_strain_XH858_idrD_split_00001  ERR1121387
##   coverage
## 1        0
## 2        0
## 3        0

##   entry_id unique_pos_identifier pos pos_in_contig
## 1        0                     0 155           155
## 2        1                     0 155           155
## 3        2                     0 155           155
##                                                       split_name  sample_id
## 1 Prevotella_jejuni_strain_CD3_33_chromosome_II_idrD_split_00001  DRR046092
## 2 Prevotella_jejuni_strain_CD3_33_chromosome_II_idrD_split_00001  DRR046097
## 3 Prevotella_jejuni_strain_CD3_33_chromosome_II_idrD_split_00001 SRR1284495
##   corresponding_gene_call in_partial_gene_call in_complete_gene_call
## 1                      -1                    0                     0
## 2                      -1                    0                     0
## 3                      -1                    0                     0
##   base_pos_in_codon codon_order_in_gene codon_number gene_length coverage
## 1                 0                  -1            0          -1       13
## 2                 0                  -1            0          -1       14
## 3                 0                  -1            0          -1       35
##   cov_outlier_in_split cov_outlier_in_contig A  C G N T reference consensus
## 1                    0                     0 0  9 4 0 0         C         C
## 2                    0                     0 0 10 4 0 0         C         C
## 3                    0                     0 0 31 0 0 4         C         C
##   competing_nts departure_from_reference departure_from_consensus n2n1ratio
## 1            CG                0.3076923                0.3076923 0.4444444
## 2            CG                0.2857143                0.2857143 0.4000000
## 3            CT                0.1142857                0.1142857 0.1290323
##     entropy
## 1 0.6172418
## 2 0.5982696
## 3 0.3553829

The file at metadata_path is the metadata downloaded earlier from NCBI and is the file we used earlier to get the curated metagenomes list, but we trimmed off some columns here that we did not use.

Next, we read and processed the metadata to identify which metagenome came from which category. The metadata called the HMP datasets only as “human metagenome”, not separating them by body site, but that information was stored in the “AnalyteType” column. So, we updated the “ScientificName” column, which held the desired information for all the other metagenomes, to reflect whether the metagenome was HMP oral or otherwise. We also manually looked through the “human metagenome” samples to identify other oral metagenomes that weren’t from the HMP. We added a column called “category” to the covs and vari dataframes to hold this information.

# drop unique "Run" id's
metadata <- distinct(read.table(metadata_path, header = T, sep = "\t"), Run, .keep_all=T)  
rownames(metadata) <- metadata$Run
# convert to character to be safe
metadata$ScientificName <- as.character(metadata$ScientificName)  
# rename the HMP gut samples' ScientificName
metadata$ScientificName[as.character(metadata$Analyte_Type) == 
                          "G_DNA_STOOL"] <- "HMP gut metagenome"  
# likewise for HMP oral
metadata$ScientificName[metadata$Analyte_Type %in% 
                          c("G_DNA_Tongue dorsum","G_DNA_Throat")] <- "HMP oral metagenome"  
metadata$ScientificName[metadata$Run %in% 
                          c("SRR5892193","SRR5892208","SRR5892209",
                            "SRR5892216","SRR5892235", "SRR5892239")] <- "Non-HMP oral metagenome"
metadata$ScientificName <- factor(metadata$ScientificName)  # convert back from character to factor

# add in metadata, gene type
covs$category <- metadata[as.character(covs$sample_name),"ScientificName"]
vari$category <- metadata[as.character(vari$sample_id),"ScientificName"]
covs <- unite(covs, "sample_split", c("sample_name","split_name"), remove=F)

# do the same for prevo
prevo_cov <- read.table(prevo_cov_path, header = T, sep = "\t")
prevo_var <- read.table(prevo_var_path, header = T, sep = "\t")
prevo_cov$category <- metadata[as.character(prevo_cov$sample_name),"ScientificName"]
prevo_var$category <- metadata[as.character(prevo_var$sample_id),"ScientificName"]
prevo_cov <- unite(prevo_cov, "sample_split", c("sample_name","split_name"), remove=F)

length(unique(as.character(covs$sample_name)))  # this many metagenomes
## [1] 1188

Over the course of our investigations (i.e., after some preliminary plots), we noticed some outlier samples that we removed for various reasons detailed in the manuscrtipt, and so we drop them here:

prevo_cov <- prevo_cov[!(prevo_cov$sample_name == "SRR628272"),] # prevo, low quality
prevo_var <- prevo_var[!(prevo_var$sample_id == "SRR628272"),]
covs <- covs[!(covs$sample_name == "SRR628272"),] # prevotella, low quality
vari <- vari[!(vari$sample_id == "SRR628272"),]
covs <- covs[!(covs$sample_name == "SRR1779144"),] # cronobacter, infant, skews y-axis
vari <- vari[!(vari$sample_id == "SRR1779144"),]
covs <- covs[!(covs$sample_name == "SRR2047620"),] # cronobacter, stem cell kids, skews y-axis
vari <- vari[!(vari$sample_id == "SRR2047620"),]
covs <- covs[!(covs$sample_name == "SRR1038387"),] # proteus, neonatal NEC, skews y-axis
vari <- vari[!(vari$sample_id == "SRR1038387"),]

covs <- covs[!(covs$sample_name == "SRR1781983"),] # drop the high rothia sample that looks ok but skews y axis
vari <- vari[!(vari$sample_id == "SRR1781983"),]

At this point, we have 1183 metagenomes from 45 categories:

## [1] "Number of metagenomes: 1183"

## [1] "Number of sampled categories: 45"

##  [1] "Homo sapiens"                       "metagenomes"                       
##  [3] "metagenome"                         "human gut metagenome"              
##  [5] "gut metagenome"                     "sediment metagenome"               
##  [7] "Homo sapiens neanderthalensis"      "marine metagenome"                 
##  [9] "root"                               "HMP oral metagenome"               
## [11] "human metagenome"                   "human skin metagenome"             
## [13] "hydrothermal vent metagenome"       "terrestrial metagenome"            
## [15] "oral metagenome"                    "bioreactor metagenome"             
## [17] "food metagenome"                    "human oral metagenome"             
## [19] "indoor metagenome"                  "Non-HMP oral metagenome"           
## [21] "insect metagenome"                  "wastewater metagenome"             
## [23] "soil metagenome"                    "oyster metagenome"                 
## [25] "fossil metagenome"                  "pig gut metagenome"                
## [27] "Gallus gallus"                      "bovine gut metagenome"             
## [29] "Bos indicus"                        "freshwater metagenome"             
## [31] "anaerobic digester metagenome"      "hypersaline lake metagenome"       
## [33] "groundwater metagenome"             "microbial mat metagenome"          
## [35] "milk metagenome"                    "upper respiratory tract metagenome"
## [37] "biogas fermenter metagenome"        "compost metagenome"                
## [39] "hot springs metagenome"             "bovine metagenome"                 
## [41] "marine sediment metagenome"         "soda lake metagenome"              
## [43] "surface metagenome"                 "aquatic metagenome"                
## [45] "invertebrate gut metagenome"

Filtering metagenomes based on category

The 1183 metagenomes retained still need to be filtered, as we wanted to minimize reporting spurious coverage from metagenomes not representative of that idrD sequence. In other words, we wanted to investigate only metagenomes that seem to represent the idrD sequence of interest - while one sample may contain, say, Prevotella spp. population(s) with idrD gene, that sample might not contain any Cronobacter turicensis cells, so including that sample for every sequence would just add noise.

So, we employed a filtering criterion that, for each sequence, discarded metagenomes for which more than half of the nucleotide positions in that idrD sequence were not covered at least once. Note that this is applied to each sequence separately; thus, the metagenomes retained for each sequence are not tied to each other. Here is the function we wrote to do that:

filter_mostly_zeros <- function(df, proportion_gte0=0.5) {
    df_wide <- dcast(df, sample_split ~ nt_position, value.var = 'coverage')
    rownames(df_wide) <- df_wide[,1]

    # the !is.na part  as differing gene lengths between taxa so the dcast padded lengths with na to make square
    # count positions covered at least once
    proportions <- apply(df_wide[,2:ncol(df_wide)], 1, 
                         function(x) sum(1*(x[!is.na(x)] > 0))/length(x[!is.na(x)]))  
    keepThese <- names(proportions)[proportions >= proportion_gte0]  # keep samples with enough bases covered
    df <- df[as.character(df$sample_split) %in% keepThese,]  # subset
    return(df)
}

We also wrote a small function to bin categories together. This will be relevant later. This function takes a dataframe (like covs) and a named list, where each list element’s name is the desired name of the bin to be made (e.g. 'Human oral metagenomes') and that list element is a vector of the categories to be combined into that bin (e.g., c("HMP oral metagenome", "Non-HMP oral metagenome"))).

bin_categories <- function(df, bins){
    df$category <- as.character(df$category)
    for(b in names(bins)){
        df$category[df$category %in% bins[[b]]] <- b
    }
    df$category <- factor(df$category, levels = unique(as.character(df$category)))
    return(df)
}

With these, we generated a summary heatmap of all of the data. This is the function to generate the figure:

plot_summary_heatmap <- function(df, proportion_gte0=0.5, bins=NULL) {

    if (!is.null(bins)){
        df <- bin_categories(df, bins)
    }
    
    # filter non-relevant metagenomes
    df <- filter_mostly_zeros(df=df, proportion_gte0 = proportion_gte0)
    
    df$tax <- gsub("_"," ", gsub("_idrD.*$","", as.character(df$split_name)))
    df$tax <- factor(df$tax, unique(as.character(df$tax)))

    # summarize to get log10 mean coverage by category
    df <- ddply(df, ~tax + category, summarise, LogMeanCov=log10(mean(coverage)))

    ggplot(data = df, aes(x = tax, y = category)) +
        geom_tile(size = 0.3, alpha = 1, aes(fill = LogMeanCov)) +
        scale_fill_gradient(low = 'black', high = '#00FF00') +
        xlab("idrD sequences") + ylab("Metagenome categories") + labs(fill="Log10 Mean Coverage") +
        scale_x_discrete(expand=c(0,0)) + scale_y_discrete(expand=c(0,0)) + theme_minimal() +
        theme(panel.grid = element_blank(), axis.line = element_line(colour = 'black', size = 0.5), 
              axis.text = element_text(colour='black'), axis.text.x = element_text(angle = 90, hjust = 1), 
              axis.ticks = element_line(colour='black', size=0.3),
              panel.background = element_rect(fill = 'grey50', color = 'black'))
}

and this is how we called it to make the plot:

plot_summary_heatmap(covs)

This heatmap shows, for each gene, the mean coverage of that gene for metagenomes from each category. Rows are metagenome categories, columns are idrD sequences from the different species, and each cell is shaded by the mean coverage over that entire gene from all metagenomes in that category, log10 transformed. From this plot, it becomes apparent that the various human microbiomes are interesting, and that idrD sequences from the Acinetobacter, Pseudomonas, and Xanthomonas spp. are not particularly well covered in any metagenome, except in metagenomes with evenly high coverage across all sequences (e.g. biogas fermenter metagenomes). These species’ sequences will ultimately be dropped, to focus on Proteus, Rothia, Cronobacter, and Prevotella.

Binning metagenomes by origin

Based on the heatmap, we decided to focus on human oral and human gut metagenomes. We investigated a few different binning strageties, the one ultimately used was bin, the other two represent bins that distinguish HMP oral from other oral metganomes (binSplitOralHMP) and include the FijiCOMP dataset (binWithFiji).

# metagenome bin categories
bin <- list()
bin[["aquatic"]] <- c("soda lake metagenome", "marine sediment metagenome", "marine metagenome", 
                      "hypersaline lake metagenome", "hydrothermal vent metagenome", "hot springs metagenome", 
                      "groundwater metagenome", "freshwater metagenome", "aquatic metagenome")
bin[["soil_probably"]] <- c("terrestrial metagenome", "surface metagenome", "soil metagenome", 
                            "sediment metagenome")
bin[["animal"]] <- c("pig gut metagenome", "oyster metagenome", "invertebrate gut metagenome", "Gallus gallus", 
                     "bovine metagenome", "bovine gut metagenome",
                     "Bos indicus", "gut metagenome")
bin[["other"]] <- c("wastewater metagenome", "upper respiratory tract metagenome", "milk metagenome", 
                    "microbial mat metagenome", "metagenome","metagenomes", "indoor metagenome", 
                    "Homo sapiens", "fossil metagenome", "food metagenome", "compost metagenome", 
                    "bioreactor metagenome", "biogas fermenter metagenome", "anaerobic digester metagenome")
bin[["Non-HMP gut metagenome"]] <- c("human gut metagenome", "human metagenome")
binSplitOralHMP <- bin
bin[["Human oral metagenome"]] <- c("human oral metagenome", "HMP oral metagenome", "Non-HMP oral metagenome")
binWithFiji <- bin
binWithFiji[["Human oral metagenome"]] <- c(binWithFiji[["Human oral metagenome"]], "oral metagenome")
binSplitOralHMP[["Non-HMP oral metagenome"]] <- c("Non-HMP oral metagenome", "human oral metagenome")
binSplitOralHMP[["HMP oral metagenome"]] <- c("HMP oral metagenome")

When we bin, the heatmap now looks like this, which is Supplemental Figure 3 of the paper:

plot_summary_heatmap(covs, bins = bin)

Per-nucleotide coverage to visualize region-specific coverage

Because the heatmap summarizes all nucleotide positions across many metagenomes, we wanted to understand idrD representation in metagenome populations at a finer level of detail. This also provides an opportunity to gauge idrD variability in complicated populations, which may have implications for its role in subpopulation differentiation, as suspected from its role in interstrain competition.

So, we wrote a function to plot individual metagenome coverages for each gene for each habitat we were interested in. Here is the function:

plot_mean_covs_var_combo <-function(df, v, proportion_gte0=NULL, bins=NULL, subTaxPos=NULL, 
                                    subsite=NULL, min_var_cov=0, scale_axes='free', cols = NULL,
                                    rhsBox=NULL,rhsCols="darkgreen") {
    
    # subset to certain metagenome categories if requested
    if (!is.null(subsite)){
        df <- df[df$category %in% c(subsite, unlist(bins[subsite])),]
        v <- v[v$category %in% c(subsite, unlist(bins[subsite])),]
    }

    # make taxonomy column with pretty names
    df$tax <- gsub("_"," ", gsub("_idrD.*$","", as.character(df$split_name)))
    df$tax <- factor(df$tax, levels = sort(unique(as.character(df$tax))))

    v$tax <- gsub("_"," ", gsub("_idrD.*$","", as.character(v$split_name)))
    v$tax <- factor(v$tax, levels = sort(unique(as.character(v$tax))))

    # subset to certain nuclotide windows
    print("subsetting...")
    if (!is.null(subTaxPos)){
        df <- df[grep(paste(names(subTaxPos),collapse = "|"), as.character(df$split_name)),]
        v <- v[grep(paste(names(subTaxPos),collapse = "|"), as.character(v$split_name)),]
        for (subg in names(subTaxPos)){
            if (is.numeric(subTaxPos[[subg]])) {
                df <- filter(df, !(grepl(subg, as.character(split_name)) & (!nt_position %in% subTaxPos[[subg]])))
                v <- filter(v, !(grepl(subg, as.character(split_name)) & (!pos %in% subTaxPos[[subg]])))
            }
        }
        df$tax <- factor(df$tax, levels = sapply(names(subTaxPos), # reorder by custom value
                                                 function(x) grep(x, levels(df$tax), value = T)))  
        v$tax <- factor(v$tax, levels = levels(df$tax))  # reorder to match
    }
    print("done subsetting")

    print("about to filter")
    if(!is.null(proportion_gte0)){
        df <- filter_mostly_zeros(df = df, proportion_gte0 = proportion_gte0)
    }
    v <- v[as.character(v$sample_id) %in% unique(as.character(df$sample_name)),]  # keep these the same
    print('done filtering')

    # bin metagenome categories if needed
    if (!is.null(bins)){
        df <- bin_categories(df, bins)
        v <- bin_categories(v, bins)
    }

    # count number of metagenomes to display for each sequence x metagenome combination
    numMgenomes <- ddply(df, ~tax + category, summarise, nMeta=length(unique(as.character(sample_name))))

    # order the metagenome categories as specified
    if(!is.null(subsite)){
        df$category <- factor(df$category, levels = subsite)
        v$category <- factor(v$category, levels = subsite)
    }

    # drop any variant positions if coverage is too low (default = don't drop any variant positions)
    v <- v[v$coverage >= min_var_cov,]
    
    ## re-compute the Shannon Entropy of SNPs from each position
    # requires adding in counts of non-variant metagenomes and then re-calculating entropy
    v <- merge(df, y =v, by.x = c('split_name', 'sample_name', 'nt_position'), 
               by.y = c("split_name","sample_id","pos"), all.x = T)
    v <- v[!is.na(v$coverage.y),]
    v$category <- v$category.x

    # for positions where only some metagenomes had variance, add back the difference
    v$A[v$coverage.x > v$coverage.y & v$reference == "A"] <- 
      v$A[v$coverage.x > v$coverage.y & v$reference == "A"] +
        v$coverage.x[v$coverage.x > v$coverage.y & v$reference == "A"] -
        v$coverage.y[v$coverage.x > v$coverage.y & v$reference == "A"]
    v$G[v$coverage.x > v$coverage.y & v$reference == "G"] <- 
      v$G[v$coverage.x > v$coverage.y & v$reference == "G"] +
        v$coverage.x[v$coverage.x > v$coverage.y & v$reference == "G"] -
        v$coverage.y[v$coverage.x > v$coverage.y & v$reference == "G"]
    v$T[v$coverage.x > v$coverage.y & v$reference == "T"] <-
      v$T[v$coverage.x > v$coverage.y & v$reference == "T"] +
        v$coverage.x[v$coverage.x > v$coverage.y & v$reference == "T"] -
        v$coverage.y[v$coverage.x > v$coverage.y & v$reference == "T"]
    v$C[v$coverage.x > v$coverage.y & v$reference == "C"] <- 
      v$C[v$coverage.x > v$coverage.y & v$reference == "C"] +
        v$coverage.x[v$coverage.x > v$coverage.y & v$reference == "C"] -
        v$coverage.y[v$coverage.x > v$coverage.y & v$reference == "C"]

    # calculate the entropy for each site
    variability <- ddply(v, ~nt_position + tax.x + category, summarise, origEntropy=mean(entropy), 
                         ent = entropy.empirical(y = c(sum(A), sum(C), sum(G), sum(T)), unit = 'log2'))

    # sort the taxa in the specified order
    variability$tax <- factor(variability$tax.x, levels = levels(df$tax))


    # coverage plot
    q <- ggplot(df, aes(x = nt_position, y = coverage)) +
        geom_line(size = 0.4, aes(group = sample_name, color = tax), alpha = 0.3) + 
        geom_text(data=numMgenomes, aes(label=nMeta), x=Inf, y = Inf, hjust = 1, vjust = 1) +
        facet_grid(category~tax, scales = scale_axes, switch = "y") + 
        theme_minimal() + guides(color = F) +
        xlab("Nucleotide position in idrD") + ylab("Metagenome coverage (X)") +
        scale_x_continuous(expand=c(0,0)) + scale_y_continuous(expand=c(0,0)) +
        theme(panel.grid = element_blank(), axis.line = element_line(colour = 'black', size = 0.5), 
              axis.text = element_text(colour='black'), axis.ticks = element_line(colour='black', size=0.3), 
              strip.background.y = element_rect(fill = '#dcdcdc', color = NA),
              strip.background.x = element_rect(fill = '#ffffff', color=NA), 
              strip.text = element_text(color='black'), strip.text.x = element_text(face = 'bold'))
    if (!is.null(cols)){
        q <- q + scale_color_manual(values = cols)
    }

    # var plot
    p <- ggplot(variability, aes(x = nt_position, y = ent)) +
        geom_col(data = df, aes(x = nt_position, y = coverage*0), alpha = 0) +
        geom_col(alpha = 1, size = 0.1, fill = 'grey20', color = NA) +
        facet_grid(category~tax, scales = scale_axes) +
        scale_x_continuous(expand=c(0,0)) + scale_y_continuous(expand=c(0,0), position = 'right') +
        xlab("Nucleotide position in idrD") + ylab("Shannon Entropy") + theme_minimal() +
        theme(panel.grid = element_blank(), axis.line = element_line(colour = 'black', size = 0.5), 
              axis.text = element_text(colour='black'), axis.ticks = element_line(colour='black', size=0.3), 
              strip.background.y = element_rect(fill = '#dcdcdc', color = NA),
              strip.background.x = element_rect(fill = '#ffffff', color=NA), 
              strip.text = element_text(color='black'), strip.text.x = element_text(face = 'bold'))
    if (!is.null(rhsBox)){
        rhsBox <- data.frame(tax=names(rhsBox),matrix(unlist(rhsBox), nrow = length(rhsBox), ncol = 2, byrow = T))
        rhsBox$tax_distinct <- rhsBox$tax
        rhsBox$tax <- sapply(as.character(rhsBox$tax), function(x) grep(x, levels(df$tax), value = T))
        rhsBox$tax <- factor(rhsBox$tax, levels = levels(df$tax))
        q <- q + geom_segment(data=rhsBox, 
                              aes(x = X1, xend = X2, y = Inf, yend = Inf), size = 1, color = rhsCols)
    }


    print(q)
    print(p)
}

We manually defined a few additional variables to specify the taxa, sites, and regions of interest. Here they are:

# C terminal domains from taxa we care about, see ms for how start/stop positions were identified
sub4cterm <- list()
sub4cterm[["Proteus"]] <- c(4333:4746)
sub4cterm[["Rothia"]] <- c(283:702)
sub4cterm[["Cronobacter"]] <- c(4174:4572)
sub4cterm[["Prevotella"]] <- c(1816:2250)
sub4all <- list()
sub4all[["Proteus"]] <- 'all'
sub4all[["Rothia"]] <- 'all'
sub4all[["Cronobacter"]] <- 'all'
sub4all[["Prevotella"]] <- 'all'

# C terminal domains from Prevos we care about (nucleotide positions)
prevos <- list()
prevos[["jejuni"]] <- c(1816:2250)
prevos[["C561"]] <- c(238:672)
prevos[["fusca"]] <- c(835:1257)
prevos[["denticola"]] <- c(3861:4295)

# predicted Rhs start/stop sites
rhs4 <- list()
rhs4[["Proteus"]] <- c(2329,4332)
rhs4[["Rothia"]] <- c(40,276)
rhs4[["Cronobacter"]] <- c(3937,4173)
rhs4[["Prevotella"]] <- c(1585,1815)
# predicted CTD start/stop sites
rhs4[["Proteu"]] <- c(4333,4746)
rhs4[["Rothi"]] <- c(283,702)
rhs4[["Cronobacte"]] <- c(4174,4572)
rhs4[["Prevotell"]] <- c(1816,2250)

# predicted region with active fold for the 4 taxa
active4 <- list()
active4[["Proteus"]] <- c(4339,4581)
active4[["Rothia"]] <- c(286,546)
active4[["Cronobacter"]] <- c(4180,4431)
active4[["Prevotella"]] <- c(1822,2067)

# predicted region with active fold for the Prevotella
activePrevo <- list()
activePrevo[["jejuni"]] <- c(1822,2067)
activePrevo[["C561"]] <- c(238,492)
activePrevo[["fusca"]] <- c(841,1089)

# bins/sites to focus on for plotting
subsites <- c("HMP gut metagenome", "Non-HMP gut metagenome", "Human oral metagenome")#, "oral metagenome")
subsitesSplitHMP <- c(subsites, "Non-HMP oral metagenome", "HMP oral metagenome")
subsitesWithFiji <- c(subsites, "oral metagenome", "human metagenome")
subsitesWithHuman <- c(subsites, "human metagenome")

sub4cols <- c("#ee2922","#68c0d3","#aa9a4a","#8a10aa")
prevoCols <- c("#8a10aa","#ca60aa","#4a109a")

The sub* and prevos lists define the plotting windows for each taxa, a value of all means all nucleotides will be used while a range of numbers mean that only coverages or variability information in that range will be plotted. Importantly, in the plot_mean_covs_var_combo function, the filtration step happens after defining the nucleotide range to be plotted. Thus, metagenomes are filtered based on the proportion of non-zero-coverage positions within the nucleotide range plotted.

The rhs4 list elements with full names (e.g. Proteus) define the start and stop positions for each sequence’s Rhs core domain, while the elements with truncated names (e.g., Proteu; simply to add a second element with a searchable name) define the CTD start and stop positions.

The subsites* vectors are a convenient way to command the plotter function to focus on a certain set of metagenome categories (including bin names).

With this information, we can generate the plots that become Figure 4B-D.

Figure 4B was generated as such:

plot_mean_covs_var_combo(df = covs, v = vari, proportion_gte0 = 0.5, bins = bin, subTaxPos = sub4all, 
                         subsite = subsites, cols = sub4cols, rhsBox = rhs4, 
                         rhsCols = rep(c(rep("#3a6b46",4),rep("#83ca95",4)),2))
## [1] "subsetting..."
## [1] "done subsetting"
## [1] "about to filter"
## [1] "done filtering"

The x-axis is nucleotide position, and each subpanel represents a different combination of metagenome category x idrD sequence. Columns of subpanels are different species (labelled at the top), and rows are the metagenome categories (labelled on the left). Each line traces the coverage provided by a single metagenome, and the number of metagenomes for that given idrD + metagenome category combination is shown in the upper right corner of each subpanel. The second plot generated by this command shows the variability of each position across the different metagenomes. The ggplot function does not like to overlay disparate y axes, hence the separate plot. The variability is the Shannon Entropy of all variant (and invariant) nucleotide frequencies reported by all metagenomes for any position with variability. This is possible since Anvi’o counted, for each metagenome, all nucleotide counts that occurred at in a metagenome for each position with a SNP (relative to the reference idrD sequence), which is reported in the variability file that is vari.

It is important to note that nucleotide-level coverage patterns are always inherently wavy, besides from real biological signal, from a combination of stochastic/methodological, biological features (e.g. varying GC content). For more on this, please check out this helpful article that dives deeply into the sources of wavy coverage.

The variability information gives context into what is happening with the coverage and helps us decide whether the mapping is “real” (representing a biological population) or “spurious” (mapping artifacts or else coverage from distant populations not considered to be part of the population from which the reference sequence originated). For instance, if one region has more coverage than the rest of the genome, but also more variability, then it is a strong indicator that there is a different population (or duplication of the gene) containing that region but not the rest (or the rest is present in that population/duplicate but is so divergent that bowtie2 cannot map it). Alternatively, if in that example coverage were to go up but the variability did not, then the same multiple-population or duplication scenario exists, but that region is clearly more conserved at the nucleotide level between populations/duplicates.

To make the in-text Figure 4B, we showed only the coverage data; for the supplemental figure version, we overlaid the coverage figure on top of the variability figure with Inkscape

Note that the FijiCOMP samples were dropped, otherwise it would look like this:

plot_mean_covs_var_combo(df = covs, v = vari, proportion_gte0 = 0.5, bins = binWithFiji, subTaxPos = sub4all, 
                         subsite = subsitesWithFiji, cols = sub4cols, rhsBox = rhs4, 
                         rhsCols = rep(c(rep("#3a6b46",4),rep("#83ca95ff",4)),2))
## [1] "subsetting..."
## [1] "done subsetting"
## [1] "about to filter"
## [1] "done filtering"

The FijiCOMP samples (the low-coverage “noise” added to the P. jejuni x Human oral metagenome subpanel) are all very noisy and low-coverage, so potentially erroneus reads or spurious mapping, so we felt safe in discarding them.

We then focused in to just the C-terminal domain (CTD) - this is Figure 4C - by passing the function the nucleotide ranges in the CTD sub4cterm like so:

plot_mean_covs_var_combo(df = covs, v = vari, proportion_gte0 = 0.5, bins = bin, subTaxPos = sub4cterm, 
                         subsite = subsites, cols = sub4cols, rhsBox = active4, rhsCols = rep("#003380ff", 8))
## [1] "subsetting..."
## [1] "done subsetting"
## [1] "about to filter"
## [1] "done filtering"

Based on this figure, we were very interested in diggng deeper into what was going on with the Prevotella in the oral cavity. The enzymatic work showed that the 5’ end of the CTD contained the active sites (region spanned by dark blue bars) while the 3’ region was more variable but still necessary (3’ truncations abolished activity). So, we were curious if the precipitous drop in metagenome coverage immediately after the first part represented a true truncation in the metagenome-sampled population(s), or whether there truly was a 3’ region(s) in the metagenome data but the P. jejuni idrD sequence was simply too distant, in nucleotide and presumably evolutionary space, to recruit/represent them.

So, we took sequences from three other Prevotella spp., mapped the same human oral metagenomes against them (using the same scripts and commands described earlier, just modified to specify the appropriate new reference files & output) - coverage data here and variability data here.

This produced the prevo_cov and prevo_var dataframes that have been appearing in the scripts above. So then we plotted the coverage of their CTDs like so:

plot_mean_covs_var_combo(df = prevo_cov, v = prevo_var, proportion_gte0 = 0.5, bins = bin, 
                         subTaxPos = prevos, subsite = subsites, cols = prevoCols, 
                         rhsBox = activePrevo, rhsCols = rep("#003380ff", 3))
## [1] "subsetting..."
## [1] "done subsetting"
## [1] "about to filter"
## [1] "done filtering"

And it was immediately clear that the 3’ region of CTDs from P. sp. C561, and P. fusca to a lesser extent, recruited reads, while P. jejuni kept its original trend. It was also encouraging to see that the variability went down, reflecting that having the other Prevotella as alternative mapping targets ameliorated a significant amount of the nucleotide diversity (i.e., the P. sp. C561 sequence was more representative, and the sequence diversity was apparently due to other Prevotella populations, not some extremely unrelated population). The ‘valley’ in the middle of the P. sp. C561 CTD is puzzling, and potentially holds more insight into subpopulation diversity.

Updated: