Introduction

In this assignment you will make a genome assembly from Illumina data, evaluate the assembly, and search for some specific marker genes in the assembled contigs. This is mostly based on what we have done in modules 4 and 5.

Write a short report in RMarkdown, using the same section headers as in this document. You may use figures produced by quast in your report; learn how to include figure files in RMarkdown. You must also write some short text discussing the results and the questions below.

Essential R and shell code (scripts) you used must be included in the RMarkdown document. You can insert chunks for both R and bash in RMarkdown. To show the code without running it, set the chunk option eval=FALSE, like this:

```{bash, eval=FALSE}
<put your code here>
```

The same works for chunks of R-code as well, in case you want to show code without running it. Note that any code that produces plots must be run in order to produce the plots!

Add links to GitHub or similar websites for the software tools you use. Think of this as a preliminary report in a larger scientific project.

Data

Several isolates of Pseudomonas aeruginosa have been sequenced, using Illumina HiSeq paired-end sequencing. Each student has their own isolate data set. You will find your data in

$COURSES/BIN310/assignment2/fastq/<username>_R1.fq.gz
$COURSES/BIN310/assignment2/fastq/<username>_R2.fq.gz

where you replace <username> by your orion user name. Please don’t copy these files, just read them from where they are.

In your report, first include the answers to these questions:

How many read pairs are there in your data set?
What are the read lengths? (mean, minimum and maximum)

You can base your answers on reading the R1 file only, since the read lengths are usually very similar in the R2 file.
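
Both questions can be answered by streaming the compressed R1 file through awk, without unpacking or copying it. The sketch below assumes a standard FASTQ layout with exactly four lines per read; the function name fastq_read_stats is made up for illustration. Since R1 holds one read from each pair, the number of reads in R1 equals the number of read pairs.

```bash
# Print the read count plus mean, minimum and maximum read length
# for a gzipped FASTQ file.
# Assumes four lines per record, so sequences sit on lines 2, 6, 10, ...
fastq_read_stats() {
  gzip -cd "$1" | awk '
    NR % 4 == 2 {
      len = length($0)
      n++
      total += len
      if (n == 1 || len < min) min = len
      if (n == 1 || len > max) max = len
    }
    END {
      if (n == 0) { print "no reads found"; exit 1 }
      printf "reads=%d mean=%.1f min=%d max=%d\n", n, total / n, min, max
    }'
}

# Usage (replace <username> with your own user name):
# fastq_read_stats "$COURSES/BIN310/assignment2/fastq/<username>_R1.fq.gz"
```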

Assembly

Compute an assembly of your data. Make certain your shell scripts output to the $SCRATCH disk, and copy only the fasta-files with contigs back to your working directory. Remember to delete from $SCRATCH the stuff you don’t need, once you are done.
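
The housekeeping part could look something like the sketch below. It only illustrates the copy-then-delete step and says nothing about which assembler you use. The function name and the folder and file names in the usage line are hypothetical, so adapt them to whatever your assembly script actually produces.

```bash
# Copy one contigs file out of a scratch folder, then delete the scratch folder.
# $1 = assembly output folder on $SCRATCH
# $2 = name of the contigs file inside that folder
# $3 = destination file in your working directory
save_contigs_and_clean() {
  scratch_dir=$1
  contigs_name=$2
  dest=$3
  # Refuse to delete anything if the contigs file is missing or empty
  if [ ! -s "$scratch_dir/$contigs_name" ]; then
    echo "No contigs found in $scratch_dir, nothing deleted" >&2
    return 1
  fi
  # Delete the scratch folder only if the copy succeeded
  cp "$scratch_dir/$contigs_name" "$dest" && rm -rf "$scratch_dir"
}

# Usage with hypothetical names:
# save_contigs_and_clean "$SCRATCH/assembly_out" contigs.fasta ./my_contigs.fasta
```

Checking for the contigs file before deleting means a failed assembly does not silently wipe the log files you need for debugging.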

Assembly quality

In order to assess the quality of the assembly, compare it to 3 different reference genomes. You will find these in $COURSES/BIN310/assignment2/ as well. They all have the species name Pseudomonas_aeruginosa in their names, and the file extension .fna. The strain PAO1 is the NCBI Reference Genome for this species. The strain DSM 50071 is a typical lab strain you can purchase. The strain Carb01 63 is included because it has a larger genome than the other two.

Use the software quast, with these genomes as references, one by one. Output to different folders.
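
Running quast once per reference can be done with a small loop, as in the sketch below. It uses quast's -r option (reference genome) and -o option (output folder). The glob pattern is an assumption based on the file naming described above, and the function name and the QUAST_CMD override variable are made up for illustration. The executable may be named differently on your system.

```bash
# Run quast once per reference genome, each run writing to its own folder.
# $1 = contigs fasta
# $2 = folder with the reference .fna files
# $3 = parent folder for the quast output folders
# QUAST_CMD is a hypothetical override; by default quast.py is called.
quast_per_reference() {
  contigs=$1
  ref_dir=$2
  out_root=$3
  for ref in "$ref_dir"/*Pseudomonas_aeruginosa*.fna; do
    [ -e "$ref" ] || continue          # the glob matched nothing
    name=$(basename "$ref" .fna)       # reference name without the extension
    "${QUAST_CMD:-quast.py}" -r "$ref" -o "$out_root/quast_$name" "$contigs"
  done
}

# Usage with hypothetical file names:
# quast_per_reference my_contigs.fasta "$COURSES/BIN310/assignment2" "$SCRATCH/quast"
```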

Include some relevant results/figures in your report, and state something about the assembly quality based on your results. NB! quast writes its results to a folder, and in this folder you will find a subfolder named basic_stats with some useful figures (.pdf files).

Search for MLST markers

Once you have the assembled contigs, run a BLAST search in the contigs for the 7 MLST markers for this species. These 7 housekeeping genes (acsA, aroE, guaA, mutL, nuoD, ppsA, trpE) should be present in all Pseudomonas aeruginosa, but we want to know exactly which variants are in your genome.

The fasta file $COURSES/BIN310/assignment2/MLST_alleles.fasta contains many variants of each of the 7 genes. Search with these as queries against the contigs, and use the blastn version of BLAST (not tblastn). In order to see whether a marker variant is found with 100% identity over its full length, the output must contain the percent identity, the length of the marker sequence and the length of the alignment. Add the following to your command line to ensure this:

-outfmt '6 qseqid sseqid qlen length pident bitscore'

This means your BLAST output will have 6 columns:

qseqid identifier for query sequence (the marker variant)
sseqid identifier for contig (not really important this time)
qlen length of query sequence
length length of alignment
pident percent identity
bitscore the bit-score of the alignment
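
Put together, the search could look like the sketch below. It passes the contigs directly with blastn's -subject option, which avoids building a database first. Building a database with makeblastdb and using -db is an equally valid route. The function name and the BLASTN_CMD override variable are made up for illustration.

```bash
# Search the MLST alleles (queries) against the assembled contigs (subject).
# $1 = contigs fasta
# $2 = MLST allele fasta
# $3 = output table (tab-separated, 6 columns)
# BLASTN_CMD is a hypothetical override; by default blastn is called.
blast_mlst() {
  "${BLASTN_CMD:-blastn}" -query "$2" -subject "$1" \
    -outfmt '6 qseqid sseqid qlen length pident bitscore' -out "$3"
}

# Usage with hypothetical file names:
# blast_mlst my_contigs.fasta "$COURSES/BIN310/assignment2/MLST_alleles.fasta" mlst_hits.tsv
```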

Many of the variants for each marker gene will match your contigs more or less well. Find the best match for each of the marker genes, i.e. the one with the largest bit score. Check whether this match has

100% identity to the genome
An alignment that covers the entire query sequence
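
The selection and the two checks can be sketched with awk on the 6-column table described above. The sketch assumes that each query identifier consists of the gene name, an underscore and an allele number (for example acsA_12). This is an assumption about the headers in the allele file, so inspect the file and adjust the pattern if needed. The function name is made up for illustration.

```bash
# For each marker gene, keep the hit with the largest bit score and
# flag whether it is a perfect match.
# $1 = tab-separated BLAST table: qseqid sseqid qlen length pident bitscore
# Assumes query identifiers look like <gene>_<number> (an assumption).
best_mlst_hits() {
  awk -F '\t' '
    {
      gene = $1
      sub(/_[^_]*$/, "", gene)        # strip the allele number, keep the gene name
      if (!(gene in best) || $6 + 0 > best[gene]) {
        best[gene] = $6 + 0
        line[gene] = $0
        # perfect = 100% identity and alignment length equal to query length
        perfect[gene] = ($5 + 0 == 100 && $4 == $3) ? "perfect" : "imperfect"
      }
    }
    END { for (g in line) print line[g] "\t" perfect[g] }
  ' "$1" | sort
}

# Usage with a hypothetical file name:
# best_mlst_hits mlst_hits.tsv
```

When two variants of the same gene have equal bit scores, this sketch keeps the first one it encounters, so it is worth looking at the full table if a gene comes out as imperfect.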

In your report, list the sequence type of your genome, i.e. the text in the header line of each of the 7 markers that gave the perfect hits. Also, give short answers to the following questions:

Why do we need to also consider the length of the alignment, not only the identity, when using BLAST?
If the best alignment of some marker did not match your genome perfectly, you may have stumbled across a new sequence variant. What would you do then? You do not need to code this, just describe in a few words the procedure to handle this.

Rubrics

In order to pass this assignment, the RMarkdown report must contain: