Actinomyces comparative genomics
Overview
This is a narrative methods providing a reproducible workflow of the analyses used in our paper The saccharibacterium TM7x elicits differential responses across its host range. In this paper, we describe how we found that Nanosynbacter lyticus TM7x can establish an ectosymbiotic relationship on several related Actinomyces spp., but the inital relationship differs among the susceptible strains. That is, TM7x behaves like a parasite with A. odontolyticus XH001 in the lab, documented here, where naive XH001 experience a strong crash in growth upon exposure to TM7x, and after a few passages they regain balance and achieve a stable relationship with TM7x. However, not all Actinomyces strains that could host TM7x had this crash. Specifically, out of 13 susceptible Actinomyces strains spanning ~74% AAI and two named species, three strains never crashed in response to TM7x, regardless of the initial TM7x dosage. This observation is exciting, as it suggests that across a moderate host range, the nature of the symbiosis, e.g., parasitism, might not be the same across all potential hosts. But further, since we have genomes for the majority of these strains, we might glean some some insights into what makes A. spp. able to host TM7x or why some strains are negatively vs. neutrally affected by hosting TM7x.
First we manually downloaded genomes from NCBI corresponding to the 23 strains we characterized experimentally. For convenience here are their names:
A_odontolyticus_ATCC_17929
A_odontolyticus_ATCC_17982
A_meyeri_ATCC_35568
A_odontolyticus_F0309
A_sp_HMT_180_F0310
A_sp_HMT_172_F0311
A_sp_HMT_849_F0330
A_sp_HMT_848_F0332
A_sp_HMT_171_F0337
A_sp_HMT_178_F0338
A_sp_HMT_175_F0384
A_sp_HMT_170_F0386
A_sp_HMT_448_F0400
A_massiliensis_F0489
A_johnsonii_F0510
A_graevenitzii_F0530
A_sp_HMT_877_F0543
A_sp_HMT_414_F0588
A_sp_ICM39
A_sp_ICM47
A_sp_ICM58
A_meyeri_W712
A_odontolyticus_XH001
While collecting the genomes, we changed each fasta’s defline to be like
>A_meyeri_W712_ctg1
from the more complicated default NCBI header.
There are many ways to do this but we did it like
awk '/>/ {print ">A_meyeri_W712_ctg" i++} !/>/ {print $0}' A_W712.fasta > A_W712-renamed.fasta
.
The original genomes went into a directory called original_genomes
and
the renamed files into genomes
.
Side note: We ran all of these analyses on an 8-core 2019 MacBook Pro with 32GB RAM, except for the functional annotation with Interproscan, which we ran on Harvard University’s Odyssey cluster to make that step faster. This cluster uses the Slurm job scheduler; we have included the header information for the sake of repeatability and to provide context about the resources used and the means of parallelization.
Pangenome construction
To relate the genomes based on gene content and investigate potentially shared distinctive features, we made a pangenome. We used anvi’o, a framework for analysis and visualization of ’omics data, for the majority of this project.
First we added all our genomes’ contigs (contiguous chunks) into a
single contigs database. First we combined all the renamed fastas, like
cat genomes/*renamed.fasta > Actinomyces-23.fa
. Then we converted this
to an anvi’o contigs database with
anvi-gen-contigs-database -f Actinomyces-23.fa -o Actinomyces-23-CONTIGS.db
During this step, anvi’o invoked Prodigal to call genes on our contigs.
Annotating gene calls
The predicted genes were then annotated using Interproscan that is a useful bundle of a variety of annotation sources like Pfam and Tigrfam. To annotate the genomes rather quickly, we parallelized the process by choping the job up into multiple batches which ran concurrently on our university’s cluster. To get the gene sequences (amino acid), we ran
anvi-get-sequences-for-gene-calls -c Actinomyces-23-CONTIGS.db --wrap 0 --get-aa-sequences -o Actinomyces-23-gene-calls.faa
This was chopped up into chunks of 5000 genes apiece by
split -l 10000 -d -a 4 Actinomyces-23-gene-calls.faa Actinomyces-23-gene-calls.faa-
and then we set up the column names of the output table with
echo -e "gene_callers_id\tsource\taccession\tfunction\te_value" > interpro-results-formatted-Actinomyces-23.tsv
The split command generated 11 files starting of the form
PREFIX.faa-BATCH (e.g. Actinomyces-23-gene-calls.faa-0004
), which
fed into the following Slurm job array script:
#!/bin/bash
#SBATCH -N 1 #1 nodes of ram
#SBATCH -n 12 # 12 cores from each
#SBATCH --contiguous
#SBATCH --mem=12G #per node
#SBATCH -t 0-6:00:00
#SBATCH -p shared
#SBATCH --array=0-10%11
#SBATCH --job-name="anvi-ipAct"
#SBATCH -o odyssey_anviIPact.out
#SBATCH -e odyssey_anviIPact.err
#SBATCH --mail-type=END
# set this up first
# split -l 10000 -d -a 4 Actinomyces-23-gene-calls.faa Actinomyces-23-gene-calls.faa-
# echo -e "gene_callers_id\tsource\taccession\tfunction\te_value" > interpro-results-formatted-Actinomyces-23.tsv
prefix="Actinomyces-23"
taskNum=$(printf %04d $SLURM_ARRAY_TASK_ID) # format slurm task id to be padded with 0s like split
# identify which batch this array element matches
batch=$prefix-gene-calls.faa-$taskNum
# load java for interproscan
module load jdk/1.8.0_172-fasrc01
# run interproscan on this batch
./../../interproscan-sh/interproscan-5.36-75.0/interproscan.sh -i $batch -o ${batch}.tsv -f tsv --appl TIGRFAM,Pfam,SUPERFAMILY,ProDom,Gene3D
# format it manually and append to main table
cat ${batch}.tsv | awk -F"\t" '{print $1 FS $4 FS $5 FS $6 FS $9}' >> interpro-results-formatted-$prefix.tsv
Once this finished, we imported the results into the contigs database
with
anvi-import-functions -c Actinomyces-23-CONTIGS.db -i interpro-results-formatted-Actinomyces-23.tsv
We also ran HMM profiles for a custom set of canonically single copy
core genes anvi’o knows about, with
anvi-run-hmms -c Actinomyces-23-CONTIGS.db -T 8
(the -T 8
used 8
threads to speed things up). From this information anvi’o will later be
able to estimate each genome’s completeness and redundancy, based on the
number of single copy core genes missing or duplicated, respectively.
Then we made a dummy profile collection since that is required for the next step by anvi’o, like so:
anvi-profile --blank-profile -c Actinomyces-23-CONTIGS.db -o Actinomyces-23-PROFILE -S dummy
The main purpose of this is to hold a ‘collection’ that tells anvi’o which contigs came from which genome. To do this anvi’o needs a file with two columns where the left column is the contig name and right column is the collection name (here, genome name) to which that contig belongs.
We accomplished this with a quick bash oneliner to trim off the _ctg#
suffix from the contig deflines, like
awk -F"_ctg" '/>/ {print $0 "\t" $1}' Actinomyces-23.fa | sed 's/>//g' > collection_mapper_Actinomyces-23.txt
Then we can move the profile database to an convenient place like
mv Actinomyces-23-PROFILE/PROFILE.db Actinomyces-23-PROFILE.db
, then
import the file into the PROFILE db to associate the contigs into
genomes:
anvi-import-collection -p Actinomyces-23-PROFILE.db -c Actinomyces-23-CONTIGS.db -C Genomes collection_mapper_Actinomyces-23.txt --contigs-mode
With this information in hand we can then write a file that contains all
the info for how to identify each genome and where to find its data,
which we call Actinomyces-23-internal-genomes-table.txt
whose contents
are copied here for convenience:
name bin_id collection_id profile_db_path contigs_db_path
A_odontolyticus_ATCC_17929 A_odontolyticus_ATCC_17929 Genomes Actinomyces-23-PROFILE.db Actinomyces-23-CONTIGS.db
A_odontolyticus_ATCC_17982 A_odontolyticus_ATCC_17982 Genomes Actinomyces-23-PROFILE.db Actinomyces-23-CONTIGS.db
A_meyeri_ATCC_35568 A_meyeri_ATCC_35568 Genomes Actinomyces-23-PROFILE.db Actinomyces-23-CONTIGS.db
A_odontolyticus_F0309 A_odontolyticus_F0309 Genomes Actinomyces-23-PROFILE.db Actinomyces-23-CONTIGS.db
A_sp_HMT_180_F0310 A_sp_HMT_180_F0310 Genomes Actinomyces-23-PROFILE.db Actinomyces-23-CONTIGS.db
A_sp_HMT_172_F0311 A_sp_HMT_172_F0311 Genomes Actinomyces-23-PROFILE.db Actinomyces-23-CONTIGS.db
A_sp_HMT_849_F0330 A_sp_HMT_849_F0330 Genomes Actinomyces-23-PROFILE.db Actinomyces-23-CONTIGS.db
A_sp_HMT_848_F0332 A_sp_HMT_848_F0332 Genomes Actinomyces-23-PROFILE.db Actinomyces-23-CONTIGS.db
A_sp_HMT_171_F0337 A_sp_HMT_171_F0337 Genomes Actinomyces-23-PROFILE.db Actinomyces-23-CONTIGS.db
A_sp_HMT_178_F0338 A_sp_HMT_178_F0338 Genomes Actinomyces-23-PROFILE.db Actinomyces-23-CONTIGS.db
A_sp_HMT_175_F0384 A_sp_HMT_175_F0384 Genomes Actinomyces-23-PROFILE.db Actinomyces-23-CONTIGS.db
A_sp_HMT_170_F0386 A_sp_HMT_170_F0386 Genomes Actinomyces-23-PROFILE.db Actinomyces-23-CONTIGS.db
A_sp_HMT_448_F0400 A_sp_HMT_448_F0400 Genomes Actinomyces-23-PROFILE.db Actinomyces-23-CONTIGS.db
A_massiliensis_F0489 A_massiliensis_F0489 Genomes Actinomyces-23-PROFILE.db Actinomyces-23-CONTIGS.db
A_johnsonii_F0510 A_johnsonii_F0510 Genomes Actinomyces-23-PROFILE.db Actinomyces-23-CONTIGS.db
A_graevenitzii_F0530 A_graevenitzii_F0530 Genomes Actinomyces-23-PROFILE.db Actinomyces-23-CONTIGS.db
A_sp_HMT_877_F0543 A_sp_HMT_877_F0543 Genomes Actinomyces-23-PROFILE.db Actinomyces-23-CONTIGS.db
A_sp_HMT_414_F0588 A_sp_HMT_414_F0588 Genomes Actinomyces-23-PROFILE.db Actinomyces-23-CONTIGS.db
A_sp_ICM39 A_sp_ICM39 Genomes Actinomyces-23-PROFILE.db Actinomyces-23-CONTIGS.db
A_sp_ICM47 A_sp_ICM47 Genomes Actinomyces-23-PROFILE.db Actinomyces-23-CONTIGS.db
A_sp_ICM58 A_sp_ICM58 Genomes Actinomyces-23-PROFILE.db Actinomyces-23-CONTIGS.db
A_meyeri_W712 A_meyeri_W712 Genomes Actinomyces-23-PROFILE.db Actinomyces-23-CONTIGS.db
A_odontolyticus_XH001 A_odontolyticus_XH001 Genomes Actinomyces-23-PROFILE.db Actinomyces-23-CONTIGS.db
Now we can pull the relevant information for these genomes using the following anvi’o command:
anvi-gen-genomes-storage -i Actinomyces-23-internal-genomes-table.txt -o Actinomyces-23-GENOMES.db
And then use the resultant GENOMES.db file to calculate the pangenome:
anvi-pan-genome -g Actinomyces-23-GENOMES.db -o Actinomyces-23-PAN -n Actinomyces-23-mcl6 --mcl-inflation 6 --use-ncbi-blast -T 10
This step is where we operationally define gene homology. To do this,
anvi’o first uses blastp (--use-ncbi-blast
) to compute the amino acid
similarities between each pair of genes (technically, predicted ORFs).
Then, the blastp similarities are used to generate a network linking
genes, where edges are blastp similarities and nodes are genes.
MCL, a network clustering algorithm, is then
applied to this network to identify ‘clusters’ of genes on the graph,
that represent genes clustered together in amino-acid space. These
operationally defined gene homologs are so called gene clusters. The
tendency of MCL to split or lump clusters is based on the hyperparameter
--mcl-inflation
which we set to 6, as that has produced good results
for genus-level pangenomes.
Adding some miscellaneous data
Once the pangenome was done, we imported information about the number of
contigs and the TM7x susceptibility of each genome. We found the number
of contigs for each genome by running
awk '{print $2}' collection_mapper_Actinomyces-23.txt | uniq -c
and
then manually adding the corresponding TM7x information from the wet
experiments. This resulted in the following table:
layer Hosting NumCtgs
A_odontolyticus_ATCC_17929 Crash 26
A_odontolyticus_ATCC_17982 Crash 2
A_meyeri_ATCC_35568 Crash 15
A_odontolyticus_F0309 Crash 6
A_sp_HMT_180_F0310 Crash 6
A_sp_HMT_172_F0311 No_crash 215
A_sp_HMT_849_F0330 Resistant 76
A_sp_HMT_848_F0332 Resistant 1
A_sp_HMT_171_F0337 Resistant 280
A_sp_HMT_178_F0338 Crash 43
A_sp_HMT_175_F0384 Resistant 7
A_sp_HMT_170_F0386 Resistant 76
A_sp_HMT_448_F0400 Resistant 12
A_massiliensis_F0489 Resistant 233
A_johnsonii_F0510 Resistant 324
A_graevenitzii_F0530 Resistant 29
A_sp_HMT_877_F0543 Crash 1
A_sp_HMT_414_F0588 Resistant 1
A_sp_ICM39 Crash 146
A_sp_ICM47 No_crash 143
A_sp_ICM58 No_crash 1
A_meyeri_W712 Crash 1
A_odontolyticus_XH001 Crash 5
This was imported into the pangenome for visual display with
anvi-import-misc-data -p Actinomyces-23-PAN/Actinomyces-23-mcl6-PAN.db -t layers Actinomyces-23-external-table.txt
We also converted the susceptibility information to a numeric variable (resistant=0, No_crash=0.5, Crash=1) and similarly imported it.
Altogether, this generated the pangenome displayed in Figure 5, when colored and formatted, and the genomes are oriented by gene cluster frequencies.
Enriched functions
Based on the layout in the pangenome and the gene cluster that were visibly core to each group, we wondered if we could identify functions enriched in each group. To that end, we applied a built-in function in anvi’o that ultimately compares genomes based on what predicted functions they share or don’t share, rather than specific homologs (as is displayed in the pangenome) (Shaiber et al., submitted). This method works by passing anvi’o a list of which genomes belong in which category, a functional annotation (here, Pfam), and then anvi’o counts up, for each function, the number of genomes by category. From this, a table is made that lists each function, its ‘enrichment score’ reflecting the ratio of in- vs. out-of-category genomes having that predicted function, a p-value and FDR-adjusted q-value, and the proportion of genomes in each category containing that function (so we biologists can decide whether we believe the machine’s estimate of significant enrichment). All this is done with the following command:
anvi-get-enriched-functions-per-pan-group -p Actinomyces-23-PAN/Actinomyces-23-mcl6-PAN.db -g Actinomyces-23-GENOMES.db --category Hosting --annotation-source Pfam -o Actinomyces-23-ENRICHED-PFAM.txt
This generated Table S4 of the manuscript, from which the top genes in relevant categories were taken to report in the in-text Table 1.
Investigating single-gene trees from core gene clusters
While presence/absence of gene clusters or functions may distinguish features of Actinomyces strains important determining TM7x response, this method assumes that the phenotype is based on the complete presence or absence of a gene/function. However, it is biologically plausible (and evident in many host/phage systems) that a conserved gene may be the target, in which case seemingly minor amino-acid sequence variants can effect a strong change in the host/symbiont relationship. To address this second scenario, we decided to look for amino acid sequence variants in gene clusters core to all 23 genomes, or core to the 13 genomes that could host TM7x. Specifically, we looked for variants that distinguished the various groups (e.g. non-crashers vs crashers vs resistant).
Extracting genes and generating gene trees
So first we exported all gene clusters that were present in all 23 or
the 13 TM7x-hosting genomes, but only gene clusters represented by a
single gene sequence per genome. We accomplished that by first
summarizing the pangenome so that we record in the pangenome database
which gene clusters occur in which core sets (e.g. core to all genomes,
core to resistant, etc):
anvi-summarize -g Actinomyces-23-GENOMES.db -p Actinomyces-23-PAN/Actinomyces-23-mcl6-PAN.db -C cores -o Actinomyces-23-SUMMARY
followed by
gunzip Actinomyces-23-SUMMARY/Actinomyces-23-mcl6_gene_clusters_summary.txt.gz
And the resultant *summary.txt
file contains information for all gene
clusters and genomes in the pangenome, which is Table S3 in the
manuscript.
We then pulled out the gene clusters of interest:
for gc in $(grep 'Susceptible_core' Actinomyces-23-SUMMARY/Actinomyces-23-mcl6_gene_clusters_summary.txt | awk -F"\t" '{print $2}' | uniq); do
anvi-get-sequences-for-gene-clusters -p Actinomyces-23-PAN/Actinomyces-23-mcl6-PAN.db -g Actinomyces-23-GENOMES.db \
--max-num-genes-from-each-genome 1 --gene-cluster-id $gc -o susceptible_core_gcs/$gc.faa
done
And
for gc in $(grep 'All_core' Actinomyces-23-SUMMARY/Actinomyces-23-mcl6_gene_clusters_summary.txt | awk -F"\t" '{print $2}' | uniq); do
anvi-get-sequences-for-gene-clusters -p Actinomyces-23-PAN/Actinomyces-23-mcl6-PAN.db -g Actinomyces-23-GENOMES.db \
--max-num-genes-from-each-genome 1 --gene-cluster-id $gc -o all_core_gcs/$gc.faa
done
This identified and saved alignments for 419 gene clusters for the 13 susceptible strains and 291 gene clusters for all 23 genomes.
The deflines contained more information than we wanted in the tip labels
so we simplified with sed -i "" 's/>.*genome_name:/>/; s/|.*$//' *faa
Then we generated phylogenies with
FastTree
for each gene with for gc in $(ls); do FastTree $gc > $gc.nwk; done
which produced a newick tree for each gene cluster.
Identifying gene clusters producing interesting topologies
Now that we had all these gene trees, we screened through them quickly with a custom python script:
from ete3 import Tree
import glob
from itertools import combinations
import re
import pandas as pd
# this is the dendrogram produced by gene cluster frequency in pangenome in Fig 5, also cartoonized in Fig 6a
refAll = Tree("(((A_sp_HMT_848_F0332,A_graevenitzii_F0530),(((A_sp_HMT_448_F0400,A_massiliensis_F0489),\
A_sp_HMT_414_F0588),(((A_sp_HMT_175_F0384,A_sp_HMT_171_F0337),A_sp_HMT_170_F0386),\
(A_sp_HMT_849_F0330,A_johnsonii_F0510)))),(((((A_sp_ICM58,A_sp_HMT_172_F0311),A_sp_ICM47),\
((((A_odontolyticus_F0309,A_odontolyticus_ATCC_17982),A_odontolyticus_ATCC_17929),A_sp_HMT_180_F0310),\
(A_sp_ICM39,A_odontolyticus_XH001))),(A_meyeri_W712,A_meyeri_ATCC_35568)),(A_sp_HMT_877_F0543,A_sp_HMT_178_F0338)));")
# Define tips belonging to the various clades
noCrashClade = ['A_sp_ICM58', 'A_sp_HMT_172_F0311', 'A_sp_ICM47']
crashersInternal = ['A_odontolyticus_F0309','A_odontolyticus_ATCC_17982','A_odontolyticus_ATCC_17929',
'A_sp_HMT_180_F0310','A_sp_ICM39','A_odontolyticus_XH001']
crashersPolyphyletic = ['A_meyeri_W712','A_meyeri_ATCC_35568','A_sp_HMT_877_F0543','A_sp_HMT_178_F0338']
resistant = ['A_sp_HMT_848_F0332','A_graevenitzii_F0530','A_sp_HMT_448_F0400','A_massiliensis_F0489',
'A_sp_HMT_414_F0588','A_sp_HMT_175_F0384','A_sp_HMT_171_F0337','A_sp_HMT_170_F0386','A_sp_HMT_849_F0330',
'A_johnsonii_F0510']
comparisons = {} # to save a record matching gene clusters with their type of topology
usefulTrees = {} # to save the associated trees
# For the 23 core set
for path in glob.glob("/Users/dutter/g4/bat_tm7/core_gene_clusters/*nwk"):
t = Tree(path) # read tree
# Course check for whether non-crashers and resistant are each monophyletic
if t.check_monophyly(values=noCrashClade, target_attr='name')[0] and \
t.check_monophyly(values=resistant, target_attr='name')[0]:
# extract the gene cluster ID from the filename
gene_cluster_id = re.sub(r"^.*/(.*).faa.nwk", '\\1', path)
# Now check ideal case - non-crashers are sister to resistant, together sister to the crashers - Fig 6c
if t.check_monophyly(values=noCrashClade + resistant, target_attr='name')[0] and \
t.check_monophyly(values=crashersInternal, target_attr='name')[0]:
comparisons[gene_cluster_id] = '(((noCrash) + (resistant)) + (most_crashers))'
usefulTrees[gene_cluster_id] = t
# but also interested if all three groups are monophyletic but we don't specify the relationship between them - Fig 6b
elif any([t.check_monophyly(values=crashersInternal + list(combo), target_attr='name')[0]
for length in [0, 1, 2, 3, 4] for combo in combinations(crashersPolyphyletic, length)]):
comparisons[gene_cluster_id]='no_crash, resistant, most non_crashers each monophyletic but variably related'
usefulTrees[gene_cluster_id] = t
# do likewise for the gene clusters core to 13 susceptible genomes - Fig 6d
for path in glob.glob("/Users/dutter/g4/bat_tm7/susceptible_core_gcs/*nwk"):
t = Tree(path) # read tree in
# no resistant genomes so only check if non-crashers and the odontolyticus group of 6 genomes are each monophyletic regardless of relationship
if t.check_monophyly(values=noCrashClade, target_attr='name')[0] and \
any([t.check_monophyly(values=crashersInternal + list(combo), target_attr='name')[0]
for length in [0, 1, 2, 3, 4] for combo in combinations(crashersPolyphyletic, length)]):
# extract the gene cluster ID from the filename
gene_cluster_id = re.sub(r"^.*/(.*).faa.nwk", '\\1', path) # extract
comparisons[gene_cluster_id] = '(noCrash) + (most_crashers)'
usefulTrees[gene_cluster_id] = t
# read in the gene cluster summary information to get functions
pan = pd.read_csv("~/g4/bat_tm7/Actinomyces-23-SUMMARY/Actinomyces-23-mcl6_gene_clusters_summary.txt", sep="\t", dtype='str')
pd.set_option('display.max_columns', 7)
# subset summary table to only gene clusters identified here and columns of interest
subPan = pan[pan['gene_cluster_id'].isin(list(comparisons.keys()))]
subPan = subPan[['gene_cluster_id','Pfam','TIGRFAM','Preferred_Name']].drop_duplicates()
# add a column called treetype that describes the relationship for that gene
subPan['treetype'] = subPan['gene_cluster_id'].map(comparisons)
print(subPan)
#subPan.to_csv('~/g4/bat_tm7/Actinomyces-23-gene-tree-results.tsv', sep="\t") # turned off here for example
gene_cluster_id Pfam \
3168 GC_00000129 MerR HTH family regulatory protein
3398 GC_00000139 Ribosomal L18 of archaea, bacteria, mitoch. an...
4111 GC_00000170 HIT domain
5721 GC_00000240 ABC transporter
7975 GC_00000338 SecA DEAD-like domain
8527 GC_00000362 Peptidase family M41
18005 GC_00000930 Haloacid dehalogenase-like hydrolase
18007 GC_00000930 Haloacid dehalogenase-like hydrolase
18811 GC_00000992 G5 domain
19344 GC_00001033 Pterin binding enzyme
19669 GC_00001058 Prephenate dehydrogenase
19678 GC_00001058 NaN
20202 GC_00001099 Cytidylate kinase
20423 GC_00001116 Aldo/keto reductase family
20436 GC_00001117 4-alpha-glucanotransferase
20644 GC_00001133 L,D-transpeptidase catalytic domain
21996 GC_00001237 NUDIX domain
TIGRFAM Preferred_Name \
3168 NaN merR2
3398 L18_bact: ribosomal protein uL18 rplR
4111 NaN hit
5721 NaN ugpC
7975 secA: preprotein translocase, SecA subunit secA
8527 FtsH_fam: ATP-dependent metallopeptidase HflB ftsH
18005 NaN yutF
18007 HAD-SF-IIA: HAD hydrolase, family IIA yutF
18811 NaN NaN
19344 DHPS: dihydropteroate synthase folP
19669 NaN tyrA
19678 NaN tyrA
20202 cmk: cytidylate kinase cmk
20423 NaN NaN
20436 malQ: 4-alpha-glucanotransferase malQ
20644 NaN enhA_2
21996 NaN mutT
treetype
3168 no_crash, resistant, most non_crashers each mo...
3398 (((noCrash) + (resistant)) + (most_crashers))
4111 no_crash, resistant, most non_crashers each mo...
5721 (((noCrash) + (resistant)) + (most_crashers))
7975 no_crash, resistant, most non_crashers each mo...
8527 no_crash, resistant, most non_crashers each mo...
18005 (noCrash) + (most_crashers)
18007 (noCrash) + (most_crashers)
18811 (noCrash) + (most_crashers)
19344 (noCrash) + (most_crashers)
19669 (noCrash) + (most_crashers)
19678 (noCrash) + (most_crashers)
20202 (noCrash) + (most_crashers)
20423 (noCrash) + (most_crashers)
20436 (noCrash) + (most_crashers)
20644 (noCrash) + (most_crashers)
21996 (noCrash) + (most_crashers)
And this is the data for Figure 6b-d. To put these few genes displaying
this topology in context, the diversity of tree topologies can be
visualized with a density tree overlaying the hundreds of gene trees.
First the genes are concatenated into a multiphylo file, with the tree
name in the first column and the corresponding newick tree in the second
column:
``` bash
for file in $(ls core_gene_clusters/GC_0000*nwk); do
# get the gene cluster ID from filename
gc=$(echo $file | sed 's,^.*/,,; s,\..*$,,')
# print ID, space, newick tree for this tree
sed "s/^/$gc /" $file
done > core_gene_clusters.nwk
for file in $(ls susceptible_core_gcs/GC_0000*nwk); do
# get the gene cluster ID from filename
gc=$(echo $file | sed 's,^.*/,,; s,\..*$,,')
# print ID, space, newick tree for this tree
sed "s/^/$gc /" $file
done > susceptible_core_gcs.nwk
Then we can overlay all the newick trees for the 291 genes core to all 23 genomes:
library(ape)
library(phangorn)
allCore <- read.tree("core_gene_clusters.nwk")
densiTree(allCore, consensus = consensus(allCore), type = 'cladogram', alpha = 0.05, use.edge.length=F, scaleX = T, scale.bar = F)
And the 419 gene clusters core to the 13 susceptible strains:
susceptibleCore <- read.tree("susceptible_core_gcs.nwk")
densiTree(susceptibleCore, consensus=consensus(susceptibleCore), type = 'cladogram', alpha = 0.05, use.edge.length=F, scaleX = T, scale.bar = F)
Phylogenetic analysis 1: Amino Acid Identity (AAI)
To get an idea of how different our genomes are, phylogenetically, since many represent unnamed species, we compared average amino acid identity (AAI). This metric works by calling genes with Prodigal, comparing translated AA sequences with Diamond for all gene pairs for each genome. For context, Luo et al. 2014 have a nice Figure 2 showing % AAI for genomes across various taxomonic levels. Species and genus have fairly broad AAI distributions, but all species are above 80% AAI with the majority being above 90%, while congeners are in the 50-85% range.
We used CompareM to compute
AAI using their default workflow aai_wf
, which we invoked like so:
comparem aai_wf genomes genomes_aai --cpus 8 --file_ext .fasta
The key output was a file genomes_aai/aai/aai_summary.tsv
that
contains the AAI information for each genome. We then modified Titus
Brown’s python
script
to parse the output into a nice table, our modified file copied here:
#! /usr/bin/env python
from __future__ import print_function, division
import sys
import numpy
import argparse
def main():
parser = argparse.ArgumentParser()
parser.add_argument('aai_summary_tsv')
parser.add_argument('-o', '--output')
args = parser.parse_args()
assert args.output, 'please specify --output'
indices = {}
with open(args.aai_summary_tsv, 'rt') as fp:
lines = fp.readlines()
lines = [ x.strip().split('\t') for x in lines ]
lines = lines[1:]
# assign unique indices to the thing
filenum = 0
for row in lines:
g1 = row[0]
g2 = row[2]
if g1 not in indices:
indices[g1] = filenum
filenum += 1
if g2 not in indices:
indices[g2] = filenum
filenum += 1
print('...got {} genomes.'.format(filenum))
D = numpy.zeros((filenum, filenum))
for row in lines:
g1 = row[0]
n1 = indices[g1]
g2 = row[2]
n2 = indices[g2]
sim = row[5]
D[n1,n2] = D[n2,n1] = float(sim)
for i in range(filenum):
D[i,i] = 100.0
x = []
for k, v in indices.items():
x.append((v, k))
x.sort()
labeloutname = args.output + '.labels.txt'
print('saving labels to: {}'.format(labeloutname))
with open(labeloutname, 'w') as fp:
fp.write("\n".join([ tup[1] for tup in x ]))
print('saving distance matrix to: {}'.format(args.output))
with open(args.output, 'w') as fp:
numpy.savetxt(fp, D, header='\t'.join([z[1] for z in x]))
if __name__ == '__main__':
main()
We called this file format_aai.py
and used it like
./format_aai.py genomes_aai/aai/aai_summary.tsv -o genomes_aai/aai/aai_summary_fmt.tsv
The output was then manually formatted the last bit, producing the final
table we called Actinomyces-23-aai.tsv
that we incorporated into
anvi’o, also copied here:
layers A_ICM47 A_F0309 A_ATCC17929 A_ATCC35568 A_F0588 A_ATCC17982 A_F0489 A_F0400 A_XH001 A_F0543 A_F0337 A_F0310 A_F0332 A_F0330 A_ICM58 A_F0338 A_W712 A_ICM39 A_F0386 A_F0384 A_F0530 A_F0510 A_F0311
A_ICM47 100 85.95 85.37 82.5 58.06 85.91 58.27 57.87 86.21 73.37 59.14 83.02 57.04 58.36 88.26 73.69 82.47 86.45 58.51 58.41 60.09 58.26 88.39
A_F0309 85.95 100 89.96 83.22 57.92 97.04 58.63 57.89 94.78 73.82 59.11 84.63 57.39 58.4 87.27 74.16 82.99 91.83 58.36 58.78 60.48 58.46 87.14
A_ATCC17929 85.37 89.96 100 83.07 58.01 89.98 58.43 57.94 89.42 73.71 59.29 84.53 57.24 58.57 86.58 74.41 83.01 90.63 58.48 58.68 59.92 58.7 86.37
A_ATCC35568 82.5 83.22 83.07 100 57.81 83.25 57.85 57.43 83.21 74.12 58.78 84.74 56.44 58.35 83.16 74.73 98.98 83.53 57.86 57.96 58.98 58.26 83.07
A_F0588 58.06 57.92 58.01 57.81 100 58.06 73.14 73.67 58.27 58.99 71.12 58.09 57.31 70.91 58.42 59.64 57.78 58.42 71.14 70.7 63.88 70.91 58.45
A_ATCC17982 85.91 97.04 89.98 83.25 58.06 100 58.45 58.11 94.88 73.85 59.25 84.43 57.31 58.48 87.42 74.32 83.28 91.9 58.23 58.78 60.36 58.4 87.35
A_F0489 58.27 58.63 58.43 57.85 73.14 58.45 100 77.68 58.46 58.91 71.95 58.57 57.3 71.43 58.73 59.59 57.98 58.72 71.51 71.98 64.36 71.48 58.61
A_F0400 57.87 57.89 57.94 57.43 73.67 58.11 77.68 100 58.21 58.47 71.26 57.86 56.9 71.44 58.41 59.07 57.42 58.03 71.45 71.16 63.89 71.42 58.04
A_XH001 86.21 94.78 89.42 83.21 58.27 94.88 58.46 58.21 100 73.72 59.41 84.41 57.25 58.72 87.77 74.43 83.24 93.04 58.54 58.76 60.11 58.65 87.7
A_F0543 73.37 73.82 73.71 74.12 58.99 73.85 58.91 58.47 73.72 100 59.84 74.25 57.15 59.31 74.33 97.71 74.1 74.03 58.99 59.25 59.55 59.07 74.05
A_F0337 59.14 59.11 59.29 58.78 71.12 59.25 71.95 71.26 59.41 59.84 100 59.26 58.04 87.01 59.63 60.29 58.85 59.26 87.89 91.43 65.63 87.33 59.47
A_F0310 83.02 84.63 84.53 84.74 58.09 84.43 58.57 57.86 84.41 74.25 59.26 100 56.94 58.56 83.98 74.67 84.48 84.44 58.22 58.7 59.76 58.55 83.5
A_F0332 57.04 57.39 57.24 56.44 57.31 57.31 57.3 56.9 57.25 57.15 58.04 56.94 100 57.89 57.46 57.79 56.61 57.23 57.41 57.44 57.27 57.8 57.48
A_F0330 58.36 58.4 58.57 58.35 70.91 58.48 71.43 71.44 58.72 59.31 87.01 58.56 57.89 100 58.75 59.95 58.32 58.68 88.49 86.75 64.95 97.22 58.88
A_ICM58 88.26 87.27 86.58 83.16 58.42 87.42 58.73 58.41 87.77 74.33 59.63 83.98 57.46 58.75 100 74.52 83.12 87.8 58.88 59.12 60.71 58.92 96.47
A_F0338 73.69 74.16 74.41 74.73 59.64 74.32 59.59 59.07 74.43 97.71 60.29 74.67 57.79 59.95 74.52 100 74.7 74.31 59.33 59.79 60.09 59.67 74.4
A_W712 82.47 82.99 83.01 98.98 57.78 83.28 57.98 57.42 83.24 74.1 58.85 84.48 56.61 58.32 83.12 74.7 100 83.55 58.01 58.07 58.82 58.25 82.91
A_ICM39 86.45 91.83 90.63 83.53 58.42 91.9 58.72 58.03 93.04 74.03 59.26 84.44 57.23 58.68 87.8 74.31 83.55 100 58.45 58.81 60.1 58.7 87.6
A_F0386 58.51 58.36 58.48 57.86 71.14 58.23 71.51 71.45 58.54 58.99 87.89 58.22 57.41 88.49 58.88 59.33 58.01 58.45 100 87.33 64.49 88.47 58.68
A_F0384 58.41 58.78 58.68 57.96 70.7 58.78 71.98 71.16 58.76 59.25 91.43 58.7 57.44 86.75 59.12 59.79 58.07 58.81 87.33 100 64.93 87.01 59.02
A_F0530 60.09 60.48 59.92 58.98 63.88 60.36 64.36 63.89 60.11 59.55 65.63 59.76 57.27 64.95 60.71 60.09 58.82 60.1 64.49 64.93 100 64.86 60.56
A_F0510 58.26 58.46 58.7 58.26 70.91 58.4 71.48 71.42 58.65 59.07 87.33 58.55 57.8 97.22 58.92 59.67 58.25 58.7 88.47 87.01 64.86 100 58.88
A_F0311 88.39 87.14 86.37 83.07 58.45 87.35 58.61 58.04 87.7 74.05 59.47 83.5 57.48 58.88 96.47 74.4 82.91 87.6 58.68 59.02 60.56 58.88 100
Genome names had gotten formatted differently between the analyses, so
we made a quick matching file called matcher.txt
to synchronize the
formats:
full_name abbrev
A_odontolyticus_ATCC_17929 A_ATCC17929
A_odontolyticus_ATCC_17982 A_ATCC17982
A_meyeri_ATCC_35568 A_ATCC35568
A_odontolyticus_F0309 A_F0309
A_sp_HMT_180_F0310 A_F0310
A_sp_HMT_172_F0311 A_F0311
A_sp_HMT_849_F0330 A_F0330
A_sp_HMT_848_F0332 A_F0332
A_sp_HMT_171_F0337 A_F0337
A_sp_HMT_178_F0338 A_F0338
A_sp_HMT_175_F0384 A_F0384
A_sp_HMT_170_F0386 A_F0386
A_sp_HMT_448_F0400 A_F0400
A_massiliensis_F0489 A_F0489
A_johnsonii_F0510 A_F0510
A_graevenitzii_F0530 A_F0530
A_sp_HMT_877_F0543 A_F0543
A_sp_HMT_414_F0588 A_F0588
A_sp_ICM39 A_ICM39
A_sp_ICM47 A_ICM47
A_sp_ICM58 A_ICM58
A_meyeri_W712 A_W712
A_odontolyticus_XH001 A_XH001
We then plotted the AAI information like so:
library(ggplot2)
library(reshape2)
aai <- read.csv("Actinomyces-23-aai.tsv", sep="\t") # matrix of aai values
matcher <- read.csv("matcher.txt", sep="\t") # to correct as some of the genome names don't match
rownames(matcher) <- matcher$abbrev
aai$layers <- matcher[aai$layers, "full_name"]
colnames(aai)[2:ncol(aai)] <- as.character(matcher[colnames(aai)[2:ncol(aai)], "full_name"])
# this is the order from the pangenome dendrogram
panorder <- c("A_sp_HMT_178_F0338","A_sp_HMT_877_F0543","A_meyeri_ATCC_35568","A_meyeri_W712","A_odontolyticus_XH001","A_sp_ICM39","A_sp_HMT_180_F0310","A_odontolyticus_ATCC_17929","A_odontolyticus_ATCC_17982","A_odontolyticus_F0309","A_sp_ICM47","A_sp_HMT_172_F0311","A_sp_ICM58","A_johnsonii_F0510","A_sp_HMT_849_F0330","A_sp_HMT_170_F0386","A_sp_HMT_171_F0337","A_sp_HMT_175_F0384","A_sp_HMT_414_F0588","A_massiliensis_F0489","A_sp_HMT_448_F0400","A_graevenitzii_F0530","A_sp_HMT_848_F0332")
aaiLong <- melt(aai)
# order rows and columns to match the pangenome
aaiLong$layers <- factor(gsub("_", " ", aaiLong$layers), levels = gsub("_", " ", rev(panorder)))
aaiLong$variable <- factor(gsub("_", " ", aaiLong$variable), levels = gsub("_", " ", rev(panorder)))
# plot
ggplot(aaiLong, aes(x=layers,y=variable,fill=value)) +
geom_tile(width=1.1, height=1.1) + geom_text(color='black',size=1, aes(label=round(value,1))) +
scale_y_discrete(expand = c(0,0)) + scale_x_discrete(expand = c(0,0)) + labs(x='Genome',y='Genome',fill='% AAI') + coord_equal() +
scale_fill_gradientn(colours = c('steelblue3','lightgoldenrodyellow', 'red'), limits=c(60,100), na.value = '#cffacf')+ theme_bw() +
theme(axis.text = element_text(color = 'black'), axis.text.x=element_text(angle=90, hjust=1, vjust = 0.5))
Phylogenetic analysis 2: PhyloPhlAn2
For a more phylogenetic comparison of the genomes, we applied PhyloPhlAn2 to the 23 genomes. PhyloPhlAn works on a set of conserved core genes, so first we set up the database of core genes, based on Actinomyces odontolyticus which we picked as a representative since phylophlan2_setup_database prefers to work exclusively with a single species identifier:
phylophlan2_setup_database.py -g s__Actinomyces_odontolyticus -o phylophlan_Act --verbose | tee phylophlan_Act_setup.log
Once complete we ran the actual phylophlan2 analysis, which generated
the tree based on the identified concatenated core gene sequences. Of
note, we specified --min_num_entries 23
to force it to only use core
genes found in all genomes, which we deemed important as we had set up
the core database based on A. odontolyticus but we are including other
species as well. Other parameters were chosen based on the PhyloPhlAn
wiki (--diversity medium
due to being a genus-level analysis and
default trimming/thresholding values)
phylophlan2.py -i genomes -o phylophlan_Act -d s__Actinomyces_odontolyticus \
--trim greedy --not_variant_threshold 0.99 --remove_fragmentary_entries \
--fragmentary_threshold 0.67 --min_num_entries 23 -t a -f isolates_config.cfg \
--diversity medium --force_nucleotides --nproc 8 --verbose | tee phylophlan_Act_output.log
Comparing phylophlan with the pangenome dendrogram
To directly compare the agreement between the phylogenomic tree and the pangenome’s arrangement of genomes based on gene cluster content (and indirectly, the correlation between phylogeny and genome content) we plotted them against each other in R.
library(dendextend)
library(ape)
library(phytools)
panTree <- read.tree("Actinomyces-23-frequencies.nwk") # the dendrogram from pangenome obtained with anvi-export-misc-data -t layer_orders
phlanTree <- read.tree("phylophlan_Act/RAxML_bestTree.genomes_refined.tre") # pylophlan tree
# correct the tip names
phlanTree$tip.label <- matcher[phlanTree$tip.label,1]
# root phylophlan tree roughly, to place the susceptible group sister to resistant group
phlanTree <- root(phlanTree, outgroup = panTree$tip.label[1:10], resolve.root = T)
# make a named list of colors correponding to TM7x hosting ability
colList <- setNames(c(rep('#a82828',10), rep('#77398f',3), rep('#273075',10)),panorder)
# plot it
plot.cophylo(cophylo(phlanTree, panTree, assoc = cbind(rev(panorder), rev(panorder)), rotate = T), fsize=0.4, link.col=colList, link.lty=1, link.lwd=3)
Rotating nodes to optimize matching...
Done.
And they agree exceedingly well.