if (!require("mzR", quietly = TRUE)){
if (!require("BiocManager", quietly = TRUE)){
install.packages("BiocManager")
}::install("mzR")
BiocManager
}
library(mzR)
This is a continuation from Part 2.
Introduction
In this post we will go through some simple analysis of LC-MS/MS data as that is one of the more common methods used in our field and the concepts are largely transferable to other types of MS data.
The point of this post is less about teaching R or certain packages (we may do deep-dives using other languages and packages in the future) and more about exposing readers to concepts and how to think about the underlying data.
The Data
We will use some data I acquired previously to make things easier… on me.
It’s from an LC-MS/MS run of a single Micromonospora extract, in a previously published study.
The sample is a Micromonospora extract. The extraction was performed from a bacterial culture growing on solid A1 agar media following the protocol of Bligh,E. G. and Dyer, W. J. (9). Agar cultures were divided into 1 cm3 pieces and 3 mm glass beads were added. Extraction solvent was added in three steps with vigorous vortexing between steps 1) 1:2 (v/v) CHCl3:MeOH, 2) CHCl3 in 1/3 the added volume of step one, 3) H2O in 1/3 the added volume of step one. From the resulting two-layer liquid partition, the organic layer was retained for further analysis.
The extract was analyzed via LC-MS/MS with a method adapted from that described by Goering et al. Experiments were performed on an Agilent 1200 workstation connected to a Thermo Fisher Scientific Q-Exactive mass spectrometer with an electrospray ionization source. Reversed-phase chromatography was performed by injection of 20 ÎĽL of 0.1 mg/mL of extract at a 0.3 mL/min flow rate across a Phenomenex Kinetex C18 RPLC column (150 mm x 2.1 mm i.d., 2 ÎĽm particle size). Mobile phase A was water with 0.1% formic acid and mobile phase B was acetonitrile with 0.1% formic acid. Mobile phase B was held at 15% for 1 minute, then adjusted to 95% over 12 minutes, where it was held for 2 minutes, and the system re-equilibrated for 5 minutes. The mass spectrometry parameters were as follows: scan range 200-2000 m/z, resolution 35,000, scan rate ~3.7 per second. Data were gathered in profile and the top 5 most intense peaks in each full spectrum were targeted for fragmentation that employed a collision energy setting of 25 eV for Higher-energy Collisional Dissociation (HCD) and isolation window of 2.0 m/z.
The mzXML file was created with ProteoWizard’s msconvert, using default settings.
Code (Basics)
Set Up an 🇷 session
The rest of this tutorial will take place using R.
Here we will install and then load mzR, a Bioconductor package for parsing mass spectrometry data. Vignette here. For plotting we’ll be using ggplot2 and plotly.
Download LC-MS/MS example data
Next let’s download the LC-MS/MS data we will be working with to a temporary directory (i.e. the directory will be deleted upon closing the R session).
There are two files:
- an mzXML file which contains centroid data (peaks only)
- an mzML file which contains profile data (raw data,not peak-picked)
GNPS used to require mzXML so that’s the reason for both mzXML and mzML formats.
Warning: This is a 22 MB and 306 MB download.
# I have slow internet so I'll increase the amount of time the download is allowed to take
options(timeout=240)
<- tempdir()
temporary_directory
# 22.3 MB
<- file.path(temporary_directory, "B022.mzXML" )
peaks_file_path download.file(url = "ftp://massive.ucsd.edu/v01/MSV000081555/peak/B022.mzXML",
destfile = peaks_file_path)
# 306.1 MB
<- file.path(temporary_directory, "B022.mzML" )
raw_mzml_path download.file(url = "ftp://massive.ucsd.edu/v01/MSV000081555/raw/FullSpectra-mzML/B022_GenbankAccession-KY858245.mzML",
destfile = raw_mzml_path)
How to open mzXML/mzML data in R
If we look at the first ten lines of:
peaks_file_path
[1] "/tmp/RtmpHtTeyr/B022.mzXML"
We can see it is indeed an mzXML file:
cat(readLines(peaks_file_path, n=10), sep = "\n")
<?xml version="1.0" encoding="ISO-8859-1"?>
<mzXML xmlns="http://sashimi.sourceforge.net/schema_revision/mzXML_3.2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://sashimi.sourceforge.net/schema_revision/mzXML_3.2 http://sashimi.sourceforge.net/schema_revision/mzXML_3.2/mzXML_idx_3.2.xsd">
<msRun scanCount="4399" startTime="PT0.0673789S" endTime="PT1200.11S">
<parentFile fileName="file:///C:\Users\chase\Downloads\LCMSNORTHWESTERN\Example\Input_Folder/20170719_mwm1013_metabologenomics_actinolunaC182x100_B022.raw"
fileType="RAWData"
fileSha1="b739a75b1c680e889940f7b35fe9ef07ee5bcd62"/>
<msInstrument msInstrumentID="1">
<msManufacturer category="msManufacturer" value="Thermo Scientific"/>
I’m a fan of mzR due to its speed (under the hood is a lot of fast C++ code), and that it lazily loads the data from mzXML/mzML files (it doesn’t read everything into memory unless you request it).
Here we will tell mzR to lazily open the mass spec file we just downloaded. We can see it returns a handle to the file, which contains 4399 “scans”. A scan being a mass spectrum.
<- mzR::openMSfile(peaks_file_path)
msfile_handle msfile_handle
Mass Spectrometry file handle.
Filename: B022.mzXML
Number of scans: 4399
mzR uses S3 object oriented programming which is difficult if you are only used to R’s usual functional programming style. You don’t have to worry much about it because most of what I’ll show is functional, but if you do care there are a number of object based methods you can use.
We can see how mzR “opened/parsed” the file, here using C++ code from ProteoWizard.
@backend msfile_handle
C++ object <0x5624ada3aa70> of class 'Pwiz' <0x5624aecbea90>
One of the most useful {mzR} functions is header()
which provides summarizing information about each scan in the dataset. Each scan is numbered sequentially (seqNum/acquisitionNum).
<- header(msfile_handle)
summary_data head(summary_data, 5)
THis allows us to do things like filtering for only positive mode MS2 scans.
<- summary_data[summary_data$polarity == 1, ][summary_data$msLevel == 2, ]
filtered_df head(filtered_df, 5)
The other useful function I’ll bring up in this post retrieves the actual mass spectra. If just provided the file handle it will load every scan in the file as a separate two-column matrix. For each matrix the first column represents m/z and the second column is intensity. If a scan number is provided it will return the two-column matrix for that specific scan.
Let’s look at the first five lines of the fourth scan/mass spectrum.
# note: mzR::peaks() and mzR::spectra() are interchangeable
<- mzR::peaks(msfile_handle, scans=4)
single_spectrum head(single_spectrum, 5)
mz intensity
[1,] 150.0265 32913.336
[2,] 151.0238 2110.815
[3,] 151.0272 3636.793
[4,] 152.0564 4872.385
[5,] 153.0907 2387.040
Check our assumptions
Whenever you get new data the first thing you should do is get a feel for the data and confirm any assumptions that could influence your analysis. Some, but not all of the ways you can do so are shown here.
We can use mzR::openMSfile()
to open the file in R.
<- mzR::openMSfile(raw_mzml_path) full_spectra_handle
Using the header()
function we can peak at the first 10 rows of the summary information about each spectrum in the file. This is not reading any spectra data yet but pulling metadata about each spectrum that is stored within the mzML file, as described in post 2. We can see there is both MS1 and MS2 spectra i this file.
Are the number of data points for the MS1 and MS2 spectra within the range I expected? (peaksCount
column)
<- header(full_spectra_handle)
header_table head(header_table, 10)
Is this centroid or profile data?
All the values in header_table$centroided
are false so all the scans in the file are profile. This aligns with the large numbers seen in the peaksCount
column above.
table(header_table$centroided)
FALSE
4399
What max intensities can I expect?
summary(header_table$totIonCurrent)
Min. 1st Qu. Median Mean 3rd Qu. Max.
12168 1841128 10199948 14827967 19327535 341675010
What are the min/max retention times?
paste0("Minimum retention time: ", min(header_table$retentionTime), " s")
[1] "Minimum retention time: 0.06737892 s"
paste0("Maximum retention time: ", max(header_table$retentionTime) , " s")
[1] "Maximum retention time: 1200.10806 s"
Does the data contain positive or negative mode spectra? Both?
Here it’s positive because the only value within the polarity
column is 1
table(header_table$polarity)
1
4399
How many MS1 MS2 scans are there?
table(header_table$msLevel)
1 2
3283 1116
It should be noted that the peaksCount
column has the same name whether you have loaded centroid or profile data and is best thought of as the number of data points within a single scan/spectrum.
Next
In the next post we will do some introduction to analysis and plotting using the same data as in this post.