Introduction

In this assignment you will do comparative genomics on a set of Pseudomonas aeruginosa isolates. It is closely based on what we did in module 7.

Your data is a subset of the data described in this paper. Have a brief look at the Introduction of this paper to understand why this species is of interest for comparative genomic studies.

Write a short report in RMarkdown, using the same section headers as in this document. As usual, include the essential code, figures and some short text discussing the results and answering the questions below.

Add links to GitHub or similar websites for the software tools you use. Think of this as a preliminary report in a larger scientific project.

Data

Several isolates of Pseudomonas aeruginosa have been sequenced using Illumina HiSeq paired-end sequencing. Each student has their own subset of 20 genomes, specified in a personal genome table. In the folder /mnt/courses/BIN310/assignment3/genome_tables/ you find several tab-separated text files; use only the file with your username in its name. It is probably a good idea to copy this file to your own assignment3-folder and store it under the name genomes_table.txt, to make it similar to what we did in module 7.

Note that this table has a column source_type (among others) indicating that the genomes come from two different sources. In the trees you make below, color the branches according to this column and see whether the genomes from the two sources separate or not.

Tree based on MLST sequence types

Inside the genome table you find 7 columns specifying the MLST alleles for each genome. These were obtained in the same way as we did, for a single genome, in assignment 2. You first make a tree based only on these data.

Compute a 20x20 distance matrix based on these alleles, where the distance between two genomes (rows) is simply the number of differing alleles, i.e. it is always an integer from 0 to 7. From these distances, compute a neighbor joining tree and plot it. Add colors according to the source_type information for each genome.

Hint: In module 7 we did not do exactly this, but we did something similar when we constructed a pan-matrix tree. Instead of the pan-matrix you now have the MLST alleles.
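
The distance defined above is just a count of mismatching positions between two allele profiles. A minimal sketch of that arithmetic follows, written in Python purely as an illustration (the report itself should do this in R). The genome names and allele numbers are made up, and the function names are hypothetical.

```python
from itertools import combinations


def allele_distance(profile_a, profile_b):
    """Number of loci at which two MLST allele profiles differ."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def distance_matrix(profiles):
    """Symmetric matrix of pairwise allele distances.

    profiles: dict mapping genome id -> tuple of allele numbers.
    Returns the ids (row/column order) and the matrix as nested lists.
    """
    ids = list(profiles)
    n = len(ids)
    D = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = allele_distance(profiles[ids[i]], profiles[ids[j]])
        D[i][j] = D[j][i] = d
    return ids, D


# Made-up toy profiles with 7 loci each (not real data).
toy = {
    "G1": (1, 5, 3, 2, 7, 4, 6),
    "G2": (1, 5, 3, 2, 7, 4, 6),
    "G3": (1, 5, 9, 2, 7, 8, 6),
    "G4": (2, 6, 9, 3, 8, 8, 1),
}
ids, D = distance_matrix(toy)
for gid, row in zip(ids, D):
    print(gid, row)
```

Genomes with identical profiles get distance 0, and genomes sharing no alleles get distance 7, so many ties should be expected in a 20x20 matrix of this kind.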

Tree based on core-gene alignment

Next, we want to make a tree based on a core-gene alignment. In the folder /mnt/courses/BIN310/assignment3/contigs/ you find the assembled contigs for the genomes. In the genome table, the column contigs_file gives you the names of the files you should use.

First, run prokka to annotate each of the 20 assembled genomes. Store only the output GFF-file, and name each file as <genome_id>.gff where <genome_id> is the text in the genome_id column for each genome. Put these GFF-files into a separate folder.

Next, use roary and the GFF-files from above as input, to compute the pan genome. Make certain you get a core-gene alignment from roary.

Finally, read the core-gene alignment into R, compute evolutionary distances and a neighbor joining tree for the genomes. Plot and color as above.
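
To make "evolutionary distance" concrete, one of the simplest choices is the p-distance: the proportion of aligned positions at which two sequences differ. The sketch below shows this in Python for illustration only. It assumes that positions with a gap or an N in either sequence are skipped. The distance model used in module 7 may well be a different one, and the function name is hypothetical.

```python
def p_distance(seq_a, seq_b, skip="-N"):
    """Proportion of differing sites between two aligned sequences.

    Positions where either sequence has a character in `skip`
    (gap or ambiguous base) are ignored. This is an assumption made
    for the sketch, not necessarily the course's choice.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in skip or b in skip:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return differing / compared


# Made-up toy alignment rows (not real data).
row_1 = "ACGTACGT-A"
row_2 = "ACGTTCGTAA"
d = p_distance(row_1, row_2)
print(round(d, 4))
```

Here 9 positions are compared (one is skipped because of the gap) and one differs, giving 1/9. Applied to all pairs of rows in the core-gene alignment, this yields a matrix that can feed a neighbor joining tree in the same way as the MLST distances.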

Tree based on mash-distances

The genome table also specifies the raw read files (R1 and R2) for each genome. These are found in /mnt/courses/BIN310/assignment3/fastq/, and again you only use the data for your genomes.

Compute mash sketches, and then mash distances between all pairs of genomes directly from the reads.

Read the distances into R, and again compute a neighbor joining tree based on them. Plot and color as before.
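
The pairwise output of mash dist is a long table with one line per pair, so it has to be reshaped into a square matrix before a tree can be built. The sketch below shows that reshaping in Python for illustration only (the report does this step in R). It assumes the tab-separated output of mash dist, whose first three columns are reference id, query id and distance. The file names and numbers in the example are made up, and the function name is hypothetical.

```python
import csv


def mash_table_to_matrix(lines):
    """Reshape long-format `mash dist` output into a square matrix.

    lines: iterable of tab-separated text lines, where the first three
    fields are reference id, query id and distance.
    Returns the ids (row/column order) and the matrix as nested lists.
    """
    dist = {}
    ids = []
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        ref, qry, d = row[0], row[1], float(row[2])
        for name in (ref, qry):
            if name not in ids:
                ids.append(name)
        dist[(ref, qry)] = d
    n = len(ids)
    D = [[0.0] * n for _ in range(n)]
    for i, a in enumerate(ids):
        for j, b in enumerate(ids):
            if i == j:
                continue
            if (a, b) in dist:
                D[i][j] = dist[(a, b)]
            elif (b, a) in dist:
                D[i][j] = dist[(b, a)]
            else:
                raise KeyError(f"no distance for pair {a}, {b}")
    return ids, D


# Made-up example lines in the assumed format (not real results).
example = [
    "G1.msh\tG2.msh\t0.002\t0\t920/1000",
    "G1.msh\tG3.msh\t0.010\t0\t660/1000",
    "G2.msh\tG3.msh\t0.011\t0\t640/1000",
]
mash_ids, M = mash_table_to_matrix(example)
for gid, row in zip(mash_ids, M):
    print(gid, row)
```

The row and column order follows the order in which ids first appear in the table, so the labels should be matched back to the genome_id column before coloring the tree by source_type.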

Discuss

Briefly discuss the results you got.

How similar/different are the trees you got based on the three different approaches?
Is there an indication that source_type is important for how the genomes are related?
If we think in terms of the resources required to get from the isolated DNA to the plotted tree, which method is the simplest and which requires the most resources? Describe the differences in a few words. Which approach would you choose for comparing Pseudomonas, based on your current results?

Rubrics

In order to pass, the RMarkdown report must contain:

The code for computing distances and plotting the MLST tree.
The shell scripts for finding the core-genes and code for making and plotting the tree.
The shell scripts for computing the mash-distances and code for making and plotting the tree.
Short text discussing the results and answering the specified questions.