This is a continuation from Part 2.

Introduction

In this post we will go through some simple analysis of LC-MS/MS data as that is one of the more common methods used in our field and the concepts are largely transferable to other types of MS data.

The point of this post is less about teaching R or certain packages (we may do deep-dives using other languages and packages in the future) and more about exposing readers to concepts and how to think about the underlying data.

The Data

We will use some data I acquired previously to make things easier… on me.

It’s from an LC-MS/MS run of a single Micromonospora extract, taken from a previously published study.

Experimental info

The sample is a Micromonospora extract. The extraction was performed on a bacterial culture grown on solid A1 agar medium, following the protocol of Bligh, E. G. and Dyer, W. J. (9). Agar cultures were divided into 1 cm³ pieces and 3 mm glass beads were added. Extraction solvent was added in three steps, with vigorous vortexing between steps: 1) 1:2 (v/v) CHCl3:MeOH, 2) CHCl3 in 1/3 the added volume of step one, 3) H2O in 1/3 the added volume of step one. From the resulting two-layer liquid partition, the organic layer was retained for further analysis.

The extract was analyzed via LC-MS/MS with a method adapted from that described by Goering et al. Experiments were performed on an Agilent 1200 workstation connected to a Thermo Fisher Scientific Q-Exactive mass spectrometer with an electrospray ionization source. Reversed-phase chromatography was performed by injection of 20 μL of 0.1 mg/mL of extract at a 0.3 mL/min flow rate across a Phenomenex Kinetex C18 RPLC column (150 mm x 2.1 mm i.d., 2 μm particle size). Mobile phase A was water with 0.1% formic acid and mobile phase B was acetonitrile with 0.1% formic acid. Mobile phase B was held at 15% for 1 minute, then adjusted to 95% over 12 minutes, where it was held for 2 minutes, and the system was re-equilibrated for 5 minutes. The mass spectrometry parameters were as follows: scan range 200-2000 m/z, resolution 35,000, scan rate ~3.7 per second. Data were gathered in profile mode, and the top 5 most intense peaks in each full spectrum were targeted for fragmentation, which employed a collision energy setting of 25 eV for higher-energy collisional dissociation (HCD) and an isolation window of 2.0 m/z.
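
The gradient described above can be written down as a small timetable, which makes it easy to check the total run time. This is only a sketch: the variable names are mine, linear interpolation between the stated points is an assumption, and the re-equilibration step is counted toward the run time only, because its exact shape is not given.

```r
# Gradient breakpoints taken from the method description (minutes, % mobile phase B)
gradient <- data.frame(
  time_min  = c(0, 1, 13, 15),
  percent_b = c(15, 15, 95, 95)
)

# Assumption: %B changes linearly between breakpoints
percent_b_at <- approxfun(gradient$time_min, gradient$percent_b)
ramp_midpoint_b <- percent_b_at(7)  # halfway through the 12-minute ramp

# Total run time, including the 5-minute re-equilibration
segment_minutes <- c(hold_low = 1, ramp = 12, hold_high = 2, re_equilibrate = 5)
total_run_seconds <- sum(segment_minutes) * 60
```

The segments add up to 20 minutes (1200 s), which is a handy number to compare against the retention times stored in the data file.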

The mzXML file was created with ProteoWizard’s msconvert, using default settings.

Code (Basics)

Set Up an 🇷 session

The rest of this tutorial will take place using R.

Here we will install and then load mzR, a Bioconductor package for parsing mass spectrometry data (see the package vignette on Bioconductor). For plotting we’ll be using ggplot2 and plotly.

# Install mzR from Bioconductor if it isn't already available
if (!requireNamespace("mzR", quietly = TRUE)) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install("mzR")
}

library(mzR)

Download LC-MS/MS example data

Next let’s download the LC-MS/MS data we will be working with to a temporary directory (i.e. the directory will be deleted upon closing the R session).

There are two files:

an mzXML file, which contains centroid data (peaks only)
an mzML file, which contains profile data (raw data, not peak-picked)

GNPS used to require mzXML, which is why the data are available in both mzXML and mzML formats.

Warning: These downloads are 22 MB and 306 MB.

# I have slow internet so I'll increase the amount of time the download is allowed to take
options(timeout = 240)

temporary_directory <- tempdir()

# 22.3 MB
peaks_file_path <- file.path(temporary_directory, "B022.mzXML")
# mode = "wb" writes the file byte-for-byte, which avoids line-ending changes on Windows
download.file(url = "ftp://massive.ucsd.edu/v01/MSV000081555/peak/B022.mzXML",
              destfile = peaks_file_path,
              mode = "wb")

# 306.1 MB
raw_mzml_path <- file.path(temporary_directory, "B022.mzML")
download.file(url = "ftp://massive.ucsd.edu/v01/MSV000081555/raw/FullSpectra-mzML/B022_GenbankAccession-KY858245.mzML",
              destfile = raw_mzml_path,
              mode = "wb")
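
Large downloads over a slow connection can end up truncated, so it is worth checking that a file exists and is roughly the expected size before parsing it. The helper below is hypothetical (it is not part of mzR or base R) and is demonstrated on a small stand-in file so the example runs on its own.

```r
# Hypothetical helper: TRUE if `path` exists and is at least `min_bytes` in size
download_looks_complete <- function(path, min_bytes) {
  file.exists(path) && file.size(path) >= min_bytes
}

# Self-contained demonstration with a small stand-in file
demo_path <- tempfile(fileext = ".mzXML")
writeLines('<?xml version="1.0" encoding="ISO-8859-1"?>', demo_path)

demo_ok      <- download_looks_complete(demo_path, min_bytes = 10)
demo_missing <- download_looks_complete(tempfile(), min_bytes = 10)
```

For the real files, a call such as `download_looks_complete(peaks_file_path, 20e6)` would be a reasonable check given the sizes quoted above; the threshold is a judgment call.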

How to open mzXML/mzML data in R

If we look at the first ten lines of:

peaks_file_path

[1] "/tmp/RtmpHtTeyr/B022.mzXML"

We can see it is indeed an mzXML file:

cat(readLines(peaks_file_path, n=10), sep = "\n")

<?xml version="1.0" encoding="ISO-8859-1"?>
<mzXML xmlns="http://sashimi.sourceforge.net/schema_revision/mzXML_3.2"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://sashimi.sourceforge.net/schema_revision/mzXML_3.2 http://sashimi.sourceforge.net/schema_revision/mzXML_3.2/mzXML_idx_3.2.xsd">
  <msRun scanCount="4399" startTime="PT0.0673789S" endTime="PT1200.11S">
    <parentFile fileName="file:///C:\Users\chase\Downloads\LCMSNORTHWESTERN\Example\Input_Folder/20170719_mwm1013_metabologenomics_actinolunaC182x100_B022.raw"
                fileType="RAWData"
                fileSha1="b739a75b1c680e889940f7b35fe9ef07ee5bcd62"/>
    <msInstrument msInstrumentID="1">
      <msManufacturer category="msManufacturer" value="Thermo Scientific"/>
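
The `msRun` line already carries useful run-level metadata as plain text. As a rough illustration (a regex sketch, not a real XML parser, and `get_attr` is a helper I made up), the scan count and end time can be pulled out of that line directly. The duration handling below only covers the seconds-only `PT…S` form seen in this file.

```r
# The msRun line exactly as printed above
msrun_line <- '  <msRun scanCount="4399" startTime="PT0.0673789S" endTime="PT1200.11S">'

# Hypothetical helper: extract the value of attribute `name` from one line of XML
get_attr <- function(line, name) {
  sub(paste0('.*', name, '="([^"]*)".*'), "\\1", line)
}

scan_count <- as.integer(get_attr(msrun_line, "scanCount"))

# Strip the "PT" prefix and "S" suffix of the ISO 8601 duration to get seconds
end_time_iso <- get_attr(msrun_line, "endTime")
end_seconds  <- as.numeric(sub("^PT([0-9.]+)S$", "\\1", end_time_iso))
```

In practice mzR does this parsing for us, but it is useful to know the information is sitting in the file in human-readable form.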

I’m a fan of mzR due to its speed (under the hood is a lot of fast C++ code), and that it lazily loads the data from mzXML/mzML files (it doesn’t read everything into memory unless you request it).

Here we will tell mzR to lazily open the mass spec file we just downloaded. We can see it returns a handle to the file, which contains 4399 “scans”, where each scan is a mass spectrum.

msfile_handle <- mzR::openMSfile(peaks_file_path)
msfile_handle

Mass Spectrometry file handle.
Filename:  B022.mzXML 
Number of scans:  4399

mzR uses S4 object-oriented programming, which can be difficult if you are only used to R’s usual functional programming style. You don’t have to worry much about it because most of what I’ll show is functional, but if you do care there are a number of object-based methods you can use.

We can see how mzR “opened/parsed” the file, here using C++ code from ProteoWizard.

msfile_handle@backend

C++ object <0x5624ada3aa70> of class 'Pwiz' <0x5624aecbea90>
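
The `@` in `msfile_handle@backend` is slot access on an S4 object. If that syntax is unfamiliar, a toy class shows the mechanics. Everything below (`ToyHandle`, `nScans`, `toyScanCount`) is invented for illustration and is not mzR’s actual class definition.

```r
library(methods)

# Toy S4 class loosely imitating a file handle (not mzR's real class)
setClass("ToyHandle", representation(fileName = "character", nScans = "integer"))

# A generic plus a method: the S4 way of attaching behaviour to a class
setGeneric("toyScanCount", function(object) standardGeneric("toyScanCount"))
setMethod("toyScanCount", "ToyHandle", function(object) object@nScans)

toy_handle <- new("ToyHandle", fileName = "B022.mzXML", nScans = 4399L)

toy_handle@fileName        # slot access with @, as in msfile_handle@backend
toyScanCount(toy_handle)   # method dispatch based on the object's class
isS4(toy_handle)           # confirms this is an S4 object
```

Functions such as `header()` work the same way: they are generics that dispatch on the class of the handle you pass in.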

One of the most useful {mzR} functions is header() which provides summarizing information about each scan in the dataset. Each scan is numbered sequentially (seqNum/acquisitionNum).

summary_data <- header(msfile_handle)
head(summary_data, 5)

This allows us to do things like filter for only positive mode MS² scans.

# Combine both conditions in one logical index so it stays aligned with the rows
filtered_df <- summary_data[summary_data$polarity == 1 & summary_data$msLevel == 2, ]
head(filtered_df, 5)
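
When filtering on more than one column, it is safest to build a single logical vector with `&` so that the index always has one entry per row of the table being subset. A toy header table (made-up values, but using the same column names that `header()` returns) shows the pattern:

```r
# Made-up stand-in for the table returned by header()
toy_header <- data.frame(
  seqNum   = 1:6,
  msLevel  = c(1, 2, 2, 1, 2, 2),
  polarity = c(1, 1, 0, 1, 0, 1)
)

# One logical vector, one entry per row
keep <- toy_header$polarity == 1 & toy_header$msLevel == 2
positive_ms2 <- toy_header[keep, ]

# subset() expresses the same filter a little more compactly
positive_ms2_alt <- subset(toy_header, polarity == 1 & msLevel == 2)

positive_ms2$seqNum
```

Only scans 2 and 6 are both positive mode and MS², so those are the two rows kept.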

The other useful function I’ll bring up in this post, peaks(), retrieves the actual mass spectra. If provided with only the file handle, it will load every scan in the file as a separate two-column matrix. For each matrix the first column represents m/z and the second column is intensity. If a scan number is provided, it will return the two-column matrix for that specific scan.

Let’s look at the first five lines of the fourth scan/mass spectrum.

# note: mzR::peaks() and mzR::spectra() are interchangeable
single_spectrum <- mzR::peaks(msfile_handle, scans=4)
head(single_spectrum, 5)

           mz intensity
[1,] 150.0265 32913.336
[2,] 151.0238  2110.815
[3,] 151.0272  3636.793
[4,] 152.0564  4872.385
[5,] 153.0907  2387.040
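
Because a spectrum is just a two-column numeric matrix, ordinary matrix indexing is all you need to work with it. As a small sketch, the five rows printed above are re-entered by hand below (so this runs without the file) and used to find the base peak and express intensities relative to it. The 10% cutoff is an arbitrary choice for illustration.

```r
# The five rows shown above, typed in as a matrix
spectrum_head <- cbind(
  mz        = c(150.0265, 151.0238, 151.0272, 152.0564, 153.0907),
  intensity = c(32913.336, 2110.815, 3636.793, 4872.385, 2387.040)
)

# Base peak: the row with the highest intensity
base_peak <- spectrum_head[which.max(spectrum_head[, "intensity"]), ]

# Relative intensity as a percentage of the base peak
relative_intensity <- 100 * spectrum_head[, "intensity"] / base_peak[["intensity"]]

# Keep peaks at or above 10% of the base peak (arbitrary threshold)
strong_peaks <- spectrum_head[relative_intensity >= 10, , drop = FALSE]
```

The same code works on `single_spectrum` itself; only the first five rows are used here to keep the example short.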

Check our assumptions

Whenever you get new data, the first thing you should do is get a feel for the data and confirm any assumptions that could influence your analysis. Some, but not all, of the ways you can do so are shown here.

We can use mzR::openMSfile() to open the profile-mode mzML file in R.

full_spectra_handle <- mzR::openMSfile(raw_mzml_path)

Using the header() function we can peek at the first 10 rows of the summary information about each spectrum in the file. This does not read any spectral data yet; it pulls metadata about each spectrum that is stored within the mzML file, as described in Part 2. We can see there are both MS¹ and MS² spectra in this file.

Is the number of data points for the MS¹ and MS² spectra within the range I expected? (`peaksCount` column)

header_table <- header(full_spectra_handle)
head(header_table, 10)
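
Eyeballing the first ten rows is a start, but summarising `peaksCount` per MS level answers the question more directly. The sketch below uses a toy table with made-up numbers so it runs on its own; the column names match those returned by `header()`.

```r
# Made-up stand-in for header_table
toy_header <- data.frame(
  msLevel    = c(1, 1, 2, 2, 2),
  peaksCount = c(52000, 48000, 900, 1100, 1000)
)

# Median number of data points per scan, split by MS level
median_points <- tapply(toy_header$peaksCount, toy_header$msLevel, median)
median_points
```

On the real data, `tapply(header_table$peaksCount, header_table$msLevel, summary)` would give the full five-number summary for each level.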

Is this centroid or profile data?

All the values in header_table$centroided are FALSE, so all the scans in the file are profile mode. This aligns with the large numbers seen in the peaksCount column above.

table(header_table$centroided)


FALSE 
 4399

What max intensities can I expect?

summary(header_table$totIonCurrent)

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
    12168   1841128  10199948  14827967  19327535 341675010
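
One thing to keep in mind when reading these numbers: `totIonCurrent` is the summed intensity of every data point in a scan, not the height of the tallest peak. The table returned by `header()` also has a `basePeakIntensity` column, which is the per-scan maximum. A toy example with two made-up spectra shows the difference:

```r
# Two made-up spectra (m/z, intensity) standing in for scans from a file
toy_spectra <- list(
  cbind(mz = c(200.1, 300.2, 400.3), intensity = c(100, 900, 50)),
  cbind(mz = c(250.5, 350.6),        intensity = c(4000, 1000))
)

# Total ion current: sum of all intensities in a scan
tic <- vapply(toy_spectra, function(s) sum(s[, "intensity"]), numeric(1))

# Base peak intensity: the single largest intensity in a scan
base_peak_intensity <- vapply(toy_spectra, function(s) max(s[, "intensity"]), numeric(1))
```

If the tallest peak per scan is what you care about, `summary(header_table$basePeakIntensity)` is the analogous call on the real data.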

What are the min/max retention times?

paste0("Minimum retention time: ", min(header_table$retentionTime), " s")

[1] "Minimum retention time: 0.06737892 s"

paste0("Maximum retention time: ", max(header_table$retentionTime) , " s")

[1] "Maximum retention time: 1200.10806 s"
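
These two numbers are also a chance to cross-check the acquisition settings. The values below are copied from the output in this post (4399 scans, and the minimum and maximum retention times in seconds). Dividing the scan count by the elapsed time is only a rough average rate, since MS¹ and MS² scans need not take the same time.

```r
n_scans    <- 4399
rt_min_sec <- 0.06737892
rt_max_sec <- 1200.10806

elapsed_sec        <- rt_max_sec - rt_min_sec
run_minutes        <- elapsed_sec / 60
mean_scans_per_sec <- n_scans / elapsed_sec
```

The run comes out at about 20 minutes and roughly 3.7 scans per second, in line with the method described in the experimental section.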

Does the data contain positive or negative mode spectra? Both?

Here it’s positive, because the only value within the polarity column is 1.

table(header_table$polarity)


   1 
4399
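
Numeric codes are easy to misread later, so it can help to translate them into labels. This post establishes that 1 means positive mode; the other two codes below (0 for negative, -1 for unknown) reflect my understanding of mzR’s encoding and should be treated as an assumption to verify against the package documentation.

```r
# Assumed encoding: 1 = positive (shown in this post), 0 = negative, -1 = unknown
polarity_labels <- c("-1" = "unknown", "0" = "negative", "1" = "positive")

# Made-up codes for illustration
toy_polarity <- c(1, 1, 0, -1)

# Look up each code by name
toy_polarity_label <- unname(polarity_labels[as.character(toy_polarity)])
toy_polarity_label
```

On the real data, `table(polarity_labels[as.character(header_table$polarity)])` would print the same counts with readable names.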

How many MS¹ and MS² scans are there?

table(header_table$msLevel)


   1    2 
3283 1116
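
The counts are more informative as proportions. The numbers below are copied from the table above, and the variable names are mine.

```r
# Scan counts per MS level, copied from the output above
ms_level_counts <- c("1" = 3283, "2" = 1116)

total_scans     <- sum(ms_level_counts)
level_fraction  <- ms_level_counts / total_scans
ms2_per_ms1     <- ms_level_counts[["2"]] / ms_level_counts[["1"]]

round(level_fraction, 3)
round(ms2_per_ms1, 2)
```

The two levels sum to the 4399 scans reported earlier, and on average there is roughly one MS² scan for every three MS¹ scans, well below the "top 5" ceiling set in the acquisition method.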

Note

It should be noted that the peaksCount column has the same name whether you have loaded centroid or profile data and is best thought of as the number of data points within a single scan/spectrum.

In the next post we will do some introduction to analysis and plotting using the same data as in this post.

Reuse

CC BY 4.0
