1 Learning goals

You will get access to the Orion computing cluster that we will use in BIN310. However, for the next 3 weeks we will use our own local computers; from week 5 we will use Orion for all heavy computations. Since some students experience more hassle than others getting access, it is good to have some time to sort this out before we start using Orion every day.

R is a ‘workhorse’ in the courses given at KBM. We assume you are all familiar with the use of R and RStudio. This is not a course where we teach new coding concepts, but we will make use of RStudio as our ‘workbench’, and we need to write pieces of R code frequently. We take some time to refresh R coding this week, but be prepared to keep updating yourself on this throughout the course.

We also have a look at some background for the sequencing data we will meet in BIN310. In summary, these are the learning goals for this module:

  • Get an overview of BIN310 with respect to topics and practical information.
  • Learn how to access the High Performance Computing (HPC) cluster Orion at NMBU.
  • Learn how to set up a local library for R packages.
  • Get a basic overview of the sequencing technologies and how they work.
  • Understand the raw data format - fastq files.
  • Refresh, and possibly extend, our basic coding abilities in R.



2 Introduction to BIN310

I have recorded two short lectures with an introduction to this course:

I will talk about this on our opening day (September 4), but here you have a recorded version as well.



3 Accessing Orion

In this course we will spend time working on a High Performance Computing (HPC) cluster called Orion here at NMBU. This is a standard working environment for most computational sciences, including computational biology. Our very first step is therefore to get access to this.

3.1 The Orion HPC

A High Performance Computing cluster can be seen as a collection of (more or less) powerful computers linked together. We refer to each computer in the cluster as a node. Each node may in itself not be particularly powerful, at least compared to a powerful gaming PC you may have at home, but together they make a computing facility with resources exceeding what we typically have locally. We immediately divide the nodes into two distinct categories:

  1. The login node. This is the special node where you log into the HPC. You are not meant to do any computations or other heavy work on this node! It is purely for logging in and small operations.
  2. Computing nodes. These are the nodes doing all the heavy work. You rarely log into these nodes directly, with one exception, and that is with RStudio as we will see below.

Compared to a local PC, the advantages of a computing cluster are typically:

  • Memory. This limits how much data you can process in one go, e.g. how many numbers or how large texts you can read into R or Python (this is not the same as disk space). A standard laptop may have 16 gigabytes of memory. The largest nodes in an HPC may have 1000 gigabytes (1 terabyte), allowing us to work with larger data sets.
  • Parallel computing. We can spread computations out on many threads and/or many jobs, and speed up computations compared to doing them sequentially, one task at a time. We will come back to what we mean by threads and jobs.
  • Effective use of resources. The HPC has a queuing system that ensures our tasks are started once there are resources available. This means no resources are wasted by (expensive) computers not being used.
  • Backup. Data you store on an HPC are backed up every day (with some exceptions), i.e. nothing is lost in case your local PC breaks down or is stolen.
  • User community. Since there are many users, almost all problems you stumble across have already been seen, and solved, by someone else.
  • Expert help. There are some expert people at the IT department you can turn to in case of special problems.

In Norway we have a national HPC facility called Sigma2. It is quite likely that several of you will be using this in the future, and you may read about Sigma2 here.

In addition to this, we have at NMBU a local computing cluster called Orion. This is the HPC we will be using in BIN310. It is small compared to the national facilities, but is a nice platform for learning the necessary skills. It has the same operating system and queuing system as Sigma2, and is organized in much the same way. Thus, what you learn by using Orion is very much transferable to Sigma2 and other similar HPC facilities.

We will now take some necessary steps to access the Orion HPC.

3.1.1 Visit the Orion website

The documentation website for Orion is https://orion.nmbu.no/. NB! You may need a VPN connection (see below) to actually reach this site.

You may start looking into this site now, but it will probably be more useful once you have started to make use of the facilities. Bookmark this site.

3.1.2 Establish a VPN connection to NMBU

In order to access Orion, you need to be inside the NMBU firewall, for security reasons. This means you need a Virtual Private Network (VPN) connection to NMBU. This may be required even if you are on campus.

How to establish a VPN connection depends a little on your PC. Please consult the IT service resources at NMBU for help on this; it is not something we (BIN310 teachers) can help you with, since our NMBU-employee computers are set up not to require VPN.

Setting this up is something you do once, but each time you restart your computer, you need to establish a VPN connection before you can access Orion. Previous experience is that some computers are more problematic than others with respect to establishing a VPN connection to NMBU.

Please make certain you have a VPN connection! Contact the proper IT help services if you run into problems. We (BIN310 teachers) are probably of little help, but let me know about any problems you have, so I can give feedback to the IT department to improve things for later.

3.1.3 Get an account

To be able to log into Orion, you must first get an account, i.e. you must be registered with a username and a password.

For this course we have created some student accounts on Orion. These are based on the students registered for the course in Fagpersonweb on Monday 1 September. You should get an email about this from the IT department to your NMBU email some time after 1 September. You use the same username on Orion as you use anywhere else at NMBU, but you need a separate password.

In the email from the IT department you get a dummy password. The dummy password is for first-time login only, and you must immediately change it; see below how to do this.

In case you did not get any email from the IT department about this before Friday 5 September, email me using the subject orion student, from your NMBU email. Since we will not use Orion right away, there is no need to panic, but we want this settled as quickly as possible.

3.1.4 Connecting to Orion with a Terminal

Let us connect to Orion using some software that we will refer to as a Terminal. This gives us a single terminal-window access to Orion, and is the basic way of connecting to Orion, or any other HPC. A Terminal is simply a window on your local computer that typically looks something like this:

A terminal window. This is the software PuTTY in Windows, where I (larssn) have logged into Orion.

Mac users: The Mac operating system is very similar to the operating system of Orion (both are variants of UNIX). Thus, a Mac already has a Terminal app that you can make use of right away! Start the Terminal app on your Mac, and you have a window for logging into Orion. In this window, always type

ssh <username>@login.orion.nmbu.no

where you replace <username> with your proper Orion username. After this you press the Return key, and assuming your VPN connection is fine, you will be prompted for the password. Type it in, and hit Return.

NOTE: Nothing is displayed while typing the password (for security reasons), i.e. it may look like it is not being typed. But it is! Thus, you need to be able to type your password blindfolded! If this was successful you should see some text scrolling over the screen; see my short video for Windows users below.

Windows users: You first need to install software that gives you a Terminal window; unlike on a Mac, this is not standard on Windows computers. The Orion website above suggests using either MobaXterm or PuTTY. Both are free software tools, and may be installed from the NMBU Software Center.

This video is from 2023, and the usernames were a little different then, but the procedure is the same.

3.1.5 Changing password

When you log onto Orion for the first time, using your dummy password, you will immediately be asked to change it (see the MobaXterm video above).

Just do as the text in the Terminal window tells you. Again, when typing the old and new passwords, nothing will appear on the screen, but it is working!

After completion, the new password is the one you use from now on. Please do not forget it! You should now end the session, start a new one, and log in to check that the new password works. It may take some time for the new password to be registered.

3.1.6 Command lines in the BIN310 documents

In the small black text box above you can see the first example of a UNIX command line in this course (ssh <username>@login.orion.nmbu.no). All commands/code written on such a black background are typed directly in the Terminal window. This is what we refer to as the command line. This is the basic access to a UNIX system; there is no graphical interface (Desktop). We will type a lot of commands on command lines like this in BIN310.

On our local computers we are used to a graphical interface, i.e. some Desktop on which we can see files as small icons, and where we can click and drag stuff. UNIX systems also have this, but on an HPC facility it is not used. Here our only ‘window’ into the system is a Terminal window, and in order to communicate with the system we need to type in commands on the command line. You may recognize this from the Console window in RStudio, where you can type R commands directly and get text output. There are several reasons this command line facility has survived and is still very much in use. It requires little computing resources, and it also gives us more direct access to the operating system. It also reflects that this is actually a computer in the original meaning of the word, unlike our Mac or Windows ‘computers’, which are mostly used for communication, administration and gaming, and hardly ever for actual computing.



3.2 We will come back to Orion later

We want you all to get an account and everything fixed for accessing Orion now. However, we will not start using Orion until week/module 5. The reason is that modules 2, 3 and 4 are all about metabarcoding data processing and analysis, and such data are usually small enough to be handled well by our local computers. We could have used Orion for this as well, but since it is not really necessary we have chosen not to this year. Only when we start working with whole genome and metagenome data do we need the power of Orion. Thus, if you experience problems with VPN or accessing Orion, you now have some time to fix this before we start using it in week 5.





4 R and RStudio

You should have some familiarity with R coding, as this is a prerequisite for this course. We will not do very much statistics in this course, but use R for data wrangling. In most cases this means reading text files into R, doing some filtering, selection, sorting and simple calculations, and then plotting the results.

You will also need to make some reports in RMarkdown and hand in the HTML output from this.

We will later in BIN310 use the Orion computing cluster at NMBU, but we start out using our local computers for the first weeks. We assume you have R and RStudio on your local computer. Update R and RStudio to the newest versions now at the start of the semester.

4.1 Configuring RStudio

You can configure RStudio more or less as you like, but some settings are actually important to be aware of; see the following short video:

Here is a short video with some practical hints on configuring RStudio

This was made using RStudio on Orion (we come back to that later), but the same applies to RStudio on your local computer as well.

4.2 Installing R-packages

We will, as always, need some extra R packages during BIN310. Before you start installing R-packages, it is a good idea to create a specified R library, which is simply a folder in which you install all your R-packages.

RStudio will always suggest a folder the first time you try to install a package. This is often OK to accept on a local computer. However, even on a local computer RStudio may choose a sub-optimal folder, e.g. many problems could have been avoided by never storing R packages in the cloud! The main reason we focus on this here now is that when we later start using RStudio on Orion life becomes simpler if we decide where the R packages should be stored (not RStudio). The procedure will then be the same as we sketch here now.

You may omit this if you have done a proper job along these lines already.

4.2.1 Make your own R-library

First, create a folder somewhere on your local disk (not in some cloud). As an example, I have on my laptop a folder named Rlib directly under the C: disk (Windows), i.e. the folder is C:\Rlib in Windows language. You may call the folder whatever you like, and put it wherever you like as long as it is local on your laptop. This is where you want RStudio to install all R packages.
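If you like, you can also create this folder from the R Console instead of in the file explorer (the path below is just my example; use your own):

dir.create("C:/Rlib")   # creates the folder (warns if it already exists)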

Next, we need to tell RStudio to use this folder. There are several ways to do this. We now focus on using the .Rprofile file since this is what we will do on Orion later.

Whether you have a Windows, Mac or Linux computer, you always have a HOME folder somewhere. To locate it, run the following code in the Console window of RStudio:

path.expand("~")
## [1] "C:/Users/larssn/OneDrive - Norwegian University of Life Sciences/Documents"

The folder listed is what RStudio considers to be your HOME folder.

In this folder you must put a file named .Rprofile (note it starts with a dot). In RStudio, just open a new R script (File - New File - R Script). Fill it with the line

.libPaths("C:/Rlib")

Replace this text with the full path to the R library folder you created above. Note:

  • In R the separator between folders is slash /, even in Windows, where we otherwise use backslash \.
  • There must be an empty line at the end of this file.

Now you save this file in the HOME folder you revealed above, and call the file exactly .Rprofile.

Every time R starts, it will look through your HOME folder for a file named .Rprofile. If such a file exists, the R commands in it will be executed. The command .libPaths is used to set the path to the folder in which to install R packages. Thus, if you did this correctly you should now:

  • Shut down RStudio.
  • Start it again.
  • Run .libPaths() with no arguments (empty parenthesis) in the Console.

It will typically list more than one path, but the first one should be the one you created and specified in your .Rprofile file.
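For example, on my Windows laptop this check could look something like the following (the exact paths are just an illustration; yours will differ):

.libPaths()
## [1] "C:/Rlib"
## [2] "C:/Program Files/R/R-4.4.1/library"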

Here is a short video illustrating the procedure

Again, this video was made running RStudio on Orion, but the procedure is the same on your local computer; you just choose/create the folder you want to use there. We will later do the same on Orion, and then you may want to take a look at this video again.



4.3 Install some R packages

We install R-packages from various sources. Here are some exercises that you need to do; we will need these packages (and more) later:

4.3.1 Exercise - Install from CRAN

Install the microseq package from CRAN. The Comprehensive R Archive Network (CRAN) is the main repository for R packages. Use the Tools - Install Packages... menu in RStudio and search for microseq. This is a small package we have made, and we use it mostly for reading/writing sequence files to/from tables. There are other packages for doing this, but they tend not to store data in tables. R loves tables!

To be able to read compressed data files, you also need to install the package R.utils from CRAN. Do this right away!
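If you prefer the Console over the menus, the same installations can be done with the install.packages() function, e.g.:

install.packages(c("microseq", "R.utils"))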

Below you will need the package devtools. Install this unless you already have it.

We will also need the tidyverse packages from CRAN in BIN310. You may have these already, if not install tidyverse now. This takes quite some time!

It is also a good idea to update all packages at the start of the semester. Use the Update button at the top of your Packages tab in RStudio!

4.3.2 Exercise - Install from Bioconductor

Install the packages phyloseq and dada2 from Bioconductor. Try to google and follow the instructions.

The Bioconductor is an alternative to the CRAN repository, and has a lot of useful stuff for bioinformatics. All Bioconductor packages are installed in a similar way, using the BiocManager package.
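A minimal sketch of how this typically looks (this is the standard BiocManager procedure; see the Bioconductor website for up-to-date instructions):

# Install the BiocManager package from CRAN first, if you do not have it
if(!requireNamespace("BiocManager", quietly = TRUE)){
  install.packages("BiocManager")
}
# Then use it to install packages from the Bioconductor repository
BiocManager::install(c("phyloseq", "dada2"))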

4.3.3 Exercise - Install from GitHub

Next week we will need the R package Rsearch. This also relies on you installing the software tool vsearch independently of R. Go to the GitHub site https://github.com/CassandraHjo/Rsearch, read the instructions and install Rsearch and also the extra software vsearch.
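Installing an R package directly from GitHub is typically done with the devtools package mentioned above; a sketch (the instructions on the GitHub site take precedence if they differ):

devtools::install_github("CassandraHjo/Rsearch")   # vsearch must still be installed separately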

4.4 RStudio on Orion

Later in BIN310 we will use RStudio running directly on Orion (not your local PC). It looks and behaves quite similar to what you are familiar with on your local PC. We will then use the same procedure for telling R where to install the R packages as we did above. We will come back to this in week 5.





5 Sequencing

By sequencing we refer to the technology that allows us to actually read the sequence of DNA (or RNA) molecules. In this course we will focus on DNA sequencing data, but the technologies are the same for RNAseq.

Considering microbial genomics in general we can divide sequencing efforts into the following two categories:

  • Amplicon sequencing (also known as targeted sequencing or metabarcoding)
  • Whole genome sequencing (also known as WGS or shotgun sequencing)

They differ mainly in what is done to the DNA before the actual sequencing takes place. With WGS we want to sequence entire genomes, and no particular region is targeted. With metabarcoding or amplicon sequencing, we first copy (amplify) some region from the genomes, usually by PCR (Polymerase Chain Reaction), and then sequence only these amplified pieces.

Anyway, the data we get from sequencing are DNA sequences that we call reads. A read is a copy of the actual DNA sequence. Reads may be short or long, depending on the technology, and we typically get a lot of them. Our raw data are these reads.

This year we will focus on metabarcoding data in the first part of the course, and then turn the attention to WGS data in the remaining weeks.

5.1 Why sequencing?

In this course we do not devote much time to this question. But it seems rather obvious that in order to understand, and be able to predict, what happens in biology, knowledge of the DNA sequence is a basic piece of information. It is not, however, the case that you need to do sequencing in order to understand any biology. One could sometimes get that impression, since ‘everyone’ is sequencing these days.

Sequencing has become rather cheap and easily available, and for this reason a lot of such data are being produced. In this course we focus on how to process such data, and how to make some rather basic analyses from them. The more sophisticated analyses will, however, require a more focused attention to one particular application (why you sequenced). In order to make this course useful for as many as possible, we try to avoid too specific applications, and rather spend time learning the skills and theory that everyone may benefit from. The downside of this is of course that the more exciting findings are outside our scope. When you start on your own project (Master/PhD) you will have the time to dig into your data looking for answers to your questions!

5.2 Online course

Instead of me talking about sequencing, it is better to make use of the endless sources of teaching material on the internet.

Below are listed some links to an online course from San Diego State University, US. The topics overlap quite a lot with our BIN310, focusing on genome sequencing and bacteria (even the lecturer looks somewhat similar!). I suggest we listen to what Rob Edwards has to say about sequencing. You should look through these videos, and learn:

In the last video above Rob Edwards lists a number of steps on our left hand side of the board. We will also have a look at this, except for the metabolic modelling, which is outside the scope of BIN310.

Also, all of this is about genome sequencing. In the first part of BIN310 we will instead focus on metabarcoding data, which means we sequence targeted amplicons instead of random pieces of the genomes. But in most of the BIN310 modules we will also work with whole (meta)genome sequencing data.

5.3 Sequencing technologies

There are a multitude of different technologies available for us to do DNA sequencing. We will mainly focus on Illumina sequencing in this course, since this is the most widely used today. Illumina data is what you meet everywhere. The Illumina technology, and some others, are often referred to as Next Generation Sequencing (NGS) or second generation sequencing (or HTS=High Throughput Sequencing by Rob Edwards).

In addition, we will also briefly see some data from newer technologies for producing long reads, often referred to as third generation technologies. These are usually produced by machines from Pacific Biosciences or Oxford Nanopore, and we refer to these as PacBio or Nanopore data.

Again, let us listen to Rob Edwards:

There are a couple of other sequencing technologies mentioned in some other Rob Edwards videos. You may have a look at them as well, but we will not dig into this here.

There are also a number of nice videos that animate the Illumina sequencing process, have a look at this: https://www.youtube.com/watch?v=fCd6B5HRaZ8

5.4 Brief summary

Here is a brief summary of the sequencing types and technologies, seen from a BIN310 perspective.

             Used with WGS  Used with amplicon  Paired-end  Read lengths (bases)  Precision   Technology
Short-reads  yes            yes                 usually     150-300               high        Illumina
Long-reads   yes            yes                 no          1000-100 000          medium-low  Nanopore, PacBio

We notice both short and long read technologies are used both for WGS and metabarcoding type of sequencing, but we could say that for metabarcoding the more precise short-reads are preferred, while for WGS the long reads are beneficial if you want to assemble genomes. We will see more of this as we go along.

How many reads do we get from a sequencing run? This varies a lot. For metabarcoding sequencing, we need only a few thousand reads, since we have copied a small piece of the genome and only sequence this piece. For full genome sequencing we need more, and for metagenomes even more. From deep sequencing of a highly diverse metagenome we may have 100 million reads or more. Remember also that with e.g. Nanopore sequencing the reads are long (many thousand bases), and the number of reads may not be that large. It may still be a large data set; it is the number of bases we sequenced that counts.





6 Sequencing data

Regardless of sequencing technology, the actual output from sequencing comes as text files following the fastq format. Well, in the case of Nanopore technology there is actually a base-calling step first, but if we consider this as part of the sequencing itself (it is in the other technologies), we can say that our job starts with some fastq files of sequencing data.

6.1 The fastq files

For a bioinformatician, the fastq files are usually the raw data we start out with. These fastq files include the reads (sequences) as well as a header (identifier) for each read. But they also contain a quality sequence for each read, indicating its quality.
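For illustration, each read occupies four lines in a fastq file: the header (starting with @), the sequence, a separator line (+), and the quality sequence with one symbol per base. A made-up mini example:

@read_1 some description
ACGTACGTTTGA
+
IIIIIFFFFF::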

The quality scores are also sometimes referred to as phred scores. This is due to the software phred that was developed during the human genome project, and from which these scores originate. Some of the analyses we will do make use of this quality information, but in many cases it is also ignored. We should be aware that sequencing technologies may be poorly calibrated, such that quality scores do not always reflect the absolute error probability as mentioned in the video above. A higher score means a better sequencing result, but exactly how good or how much better varies a lot.

The fastq files we get from the sequencing machines are sometimes big, many gigabytes per file. This is why we should always try to keep them compressed when storing them, to save disk space. You will typically see the file extension .fastq.gz on such files, where the .gz indicates a compressed (gzipped) file. Most software (but not all) working with fastq files will be able to read the compressed file directly, and we need not decompress it. If you want to inspect the actual text file you have to decompress it first, but this is rarely something we do.

6.2 Quality scores and binomial variables

In the short video above we saw how the quality symbols in fastq files can be converted to quality scores, and then again how these can be converted to probability of error.
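As a reminder, a quality score \(Q\) corresponds to the error probability \[p = 10^{-Q/10}\] so that e.g. \(Q=30\) means \(p=0.001\), i.e. on average 1 error per 1000 bases with that score.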

How many errors can we expect to see in a read? Let us now answer this question by recapitulating binomial variables from introductory statistics.

Assume the quality scores reflect the probabilities of error exactly. If we have a read with \(L\) bases then each base has an error probability. Let us denote them \(p_1, p_2,...,p_L\) for each of the bases along the read. Consider now a single base and its error probability \(p_i\). What is the expected number of errors here? Well, the actual number of errors can be either 0 or 1 of course. The expected value is the average number of errors we would get if we had \(n\) such bases, all having the same error probability. This is the same as coin flipping! You have a coin where \(p_i\) is the probability of Heads. You flip the coin \(n\) times and count how many Heads you ended with. This number of Heads is a binomial variable. The expected value of this binomial variable is defined as \(n\cdot p_i\). Thus, the expected number of errors at the single base is \(1\cdot p_i=p_i\) since there is only this single base.

But the read has several bases, in fact \(L\) of them. The expected number of errors at each position is then \(p_1, p_2,...,p_L\), and the total for the whole read is \[EE = p_1 + p_2 + \cdots + p_L = \sum_{i=1}^{L}p_i\] We use \(EE\) as short for Expected number of Errors in a read. It is simply the sum of error probabilities! We will see this again next week when we start processing metabarcoding data.
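As a tiny illustration, here is this computation in R for a toy read of three bases (the quality scores are just made-up values):

q <- c(30, 20, 10)   # quality scores along the read
p <- 10^(-q/10)      # error probabilities: 0.001, 0.01, 0.1
EE <- sum(p)         # expected number of errors = 0.111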

6.2.1 Exercise - plotting expected errors

In this exercise we will make a visual display of the expected number of errors for some reads. We want you to do this exercise to

  • Refresh some R coding
  • Get a feeling of how much R coding we will meet in BIN310.
  • Understand the relations between quality scores and the expected number of errors. You will read the text above, but learning by doing is much more powerful.

Download (right-click, save as) these two compressed fastq files:

These are examples of Illumina paired-end sequencing data. Here we have amplified a specific piece of DNA from several bacteria in some environment, resulting in many smallish pieces or fragments of double-stranded DNA, typically a few hundred bases long. Then, these double-stranded fragments have been sequenced from both sides, resulting in a read pair from each fragment. Illumina is often (but not always) paired-end sequencing, i.e. both ends of the genome fragments have been sequenced. This typically results in a pair of fastq files as above. Note the only difference between the file names is the R1 and R2. The R1 file is often referred to as the R1, the left or the forward reads, and the R2 as the R2, the right or the reverse reads.

Personally, I dislike the use of ‘forward’ and ‘reverse’, because it gives you the impression all reads in the R1 file are from the forward strand of the genome. They are not! Roughly half the reads are from the forward and half from the reverse strand, in both files. After cutting/amplifying the fragments, we lose all control of forward and reverse strand, and I prefer to simply use the terms R1 and R2 reads. The R1 reads are all sequenced in round 1 of the Illumina sequencing machine, while the R2 reads are all sequenced in round 2. This has nothing to do with strands on the genomes!

Here is what you should do:

  • Read the files with readFastq() from the microseq package. You should get 2 tables.
  • Rename all columns by adding _1 and _2 for the two files, respectively. This means the Header column is named Header_1 in the R1 table and Header_2 in the R2 table.
  • Bind the two tables into a single table. Note that reads are in the same order in both files, i.e. read number 1 in R1 is in pair with read number 1 in R2 etc.
  • Sample at random 1000 read pairs.
  • Split the quality texts into vectors of single characters using str_split(). Store these as new columns Cvec_1 (from Quality_1) and Cvec_2 (from Quality_2). Since str_split() returns a list, these columns become list columns.
  • Convert the single-character vectors to vectors of quality scores. Below is a function you should make use of for doing this:
# Converts ASCII characters to phred scores
c2s <- function(Cvec){  # Cvec is a vector of characters
  Svec <- numeric(length(Cvec))
  for(i in 1:length(Cvec)){
    Svec[i] <- strtoi(charToRaw(Cvec[i]), 16L) - 33
  }
  return(Svec)   # Svec is a vector of integers
}
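# Quick sanity check (illustrative call): the symbols I, 5 and # have ASCII
# codes 73, 53 and 35, so c2s(c("I", "5", "#")) should return 40, 20 and 2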
  • Include this function in your code, and use it to create two new columns Svec_1 and Svec_2 containing vectors of quality score values.
  • Then, from the vectors of quality score values compute the vectors of error probabilities. Store in columns Pvec_1 and Pvec_2.
  • Compute from the probabilities the expected number of errors in each R1 and R2 read. Store in columns EE_1 and EE_2.
  • Make a visual display of the EE values using the ggplot2 package. You choose the proper way to display them, but we should see how the R1 and R2 differ. Also, remember these are pairwise data, we should perhaps be able to compare R1 and R2 within each pair?

This exercise will require R coding a little bit beyond the basics from STIN100 or similar courses. It typically makes use of

  • List columns in tables in R
  • The use of custom made functions
  • The use of lapply() and sapply() for looping, even if a standard for-loop may also be used here
  • The use of inline functions is useful in the apply-functions, but also in other cases

All of these are elements we find useful in bioinformatics. You may consult an AI to learn more about these topics, but make certain you actually learn something. It is of no value to let the AI solve it all without you picking up anything.

6.2.2 Exercise solution

library(tidyverse)
library(microseq)

# Reading data and renaming
R1.tbl <- readFastq("data/metabarcoding_R1.fastq.gz") |>
  rename(Header_1 = Header,
         Sequence_1 = Sequence,
         Quality_1 = Quality)
R2.tbl <- readFastq("data/metabarcoding_R2.fastq.gz") |>
  rename(Header_2 = Header,
         Sequence_2 = Sequence,
         Quality_2 = Quality)

# Binding data into one table
# and sampling 1000 read pairs at random
pairs.tbl <- bind_cols(R1.tbl, R2.tbl) |>
  slice(sample(1:n(), 1000))

# Splitting Quality texts into single characters
pairs.tbl <- pairs.tbl |>
  mutate(Cvec_1 = str_split(Quality_1, "")) |>
  mutate(Cvec_2 = str_split(Quality_2, ""))

# Including custom function here now
c2s <- function(Cvec){  # Cvec is a vector of characters
  Svec <- numeric(length(Cvec))
  for(i in 1:length(Cvec)){
    Svec[i] <- strtoi(charToRaw(Cvec[i]), 16L) - 33
  }
  return(Svec)   # Svec is a vector of integers
}

# Converting single characters to phred scores
pairs.tbl <- pairs.tbl |>
  mutate(Svec_1 = lapply(Cvec_1, c2s)) |>
  mutate(Svec_2 = lapply(Cvec_2, c2s))

# Converting phred scores to error probabilities
# using an inline function in lapply
pairs.tbl <- pairs.tbl |>
  mutate(Pvec_1 = lapply(Svec_1, function(q){return(10^(-q/10))})) |>
  mutate(Pvec_2 = lapply(Svec_2, function(q){return(10^(-q/10))}))

# Compute expected number of errors
pairs.tbl <- pairs.tbl |>
  mutate(EE_1 = sapply(Pvec_1, sum)) |>
  mutate(EE_2 = sapply(Pvec_2, sum))

# Plotting as points to preserve the paired information
# Sort by EE_1 value to get less noisy output
pairs.tbl |>
  arrange(EE_1) |>
  mutate(pair = 1:n()) |>
  pivot_longer(cols = c(EE_1, EE_2), names_to = "type", values_to = "EE") |>
  ggplot(aes(x = pair, y = EE, color = type)) +
    geom_point(size = 2, alpha = 0.5)

We note that R2 reads in general have higher EE values than R1 reads, as expected from what we know about Illumina data. How big are these values? The largest values seem to be around 10-20, i.e. we must expect 10-20 errors in these reads. If you compute the read lengths, you will find they are in most cases around 300 bases. This means up to around 5% of the bases in a read may be errors, in the worst cases.
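If you want to check the read lengths yourself, one way is to use str_length() from the stringr package (part of tidyverse) on the table from the solution above:

summary(str_length(pairs.tbl$Sequence_1))   # distribution of R1 read lengths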