This is a continuation from Part 3.

Introduction

In this post we will go through some simple analysis of LC-MS/MS data as that is one of the more common methods used in our field and the concepts are largely transferable to other types of MS data.

The point of this post is less about teaching R or certain packages (we may do deep-dives using other languages and packages in the future) and more about exposing readers to concepts and how to think about the underlying data.

The data

We will use some data I acquired previously to make things easier… on me. It’s from an LC-MS/MS run of a single Micromonospora extract that was usedin a previously published study.

Experimental info

The sample is a Micromonospora extract. The extraction was performed from a bacterial culture growing on solid A1 agar media following the protocol of Bligh,E. G. and Dyer, W. J. (9). Agar cultures were divided into 1 cm³ pieces and 3 mm glass beads were added. Extraction solvent was added in three steps with vigorous vortexing between steps 1) 1:2 (v/v) CHCl3:MeOH, 2) CHCl3 in 1/3 the added volume of step one, 3) H2O in 1/3 the added volume of step one. From the resulting two-layer liquid partition, the organic layer was retained for further analysis.

The extract was analyzed via LC-MS/MS with a method adapted from that described by Goering et al. Experiments were performed on an Agilent 1200 workstation connected to a Thermo Fisher Scientific Q-Exactive mass spectrometer with an electrospray ionization source. Reversed-phase chromatography was performed by injection of 20 μL of 0.1 mg/mL of extract at a 0.3 mL/min flow rate across a Phenomenex Kinetex C18 RPLC column (150 mm x 2.1 mm i.d., 2 μm particle size). Mobile phase A was water with 0.1% formic acid and mobile phase B was acetonitrile with 0.1% formic acid. Mobile phase B was held at 15% for 1 minute, then adjusted to 95% over 12 minutes, where it was held for 2 minutes, and the system re-equilibrated for 5 minutes. The mass spectrometry parameters were as follows: scan range 200-2000 m/z, resolution 35,000, scan rate ~3.7 per second. Data were gathered in profile and the top 5 most intense peaks in each full spectrum were targeted for fragmentation that employed a collision energy setting of 25 eV for Higher-energy Collisional Dissociation (HCD) and isolation window of 2.0 m/z.

The mzXML file was created with ProteoWizard’s msconvert, using default settings.

Set up an 🇷 session

The rest of this tutorial will take place using R.

Here we will install and then load mzR, a Bioconductor package for parsing mass spectrometry data. Vignette here. For plotting we’ll be using ggplot2 and plotly.

if (!require("mzR", quietly = TRUE)){
  if (!require("BiocManager", quietly = TRUE)){
      install.packages("BiocManager")
  }
  BiocManager::install("mzR")
}
if (!require("ggplot2", quietly = TRUE)){
    install.packages("ggplot2")
}
if (!require("plotly", quietly = TRUE)){
    install.packages("plotly")
}


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

if (!require("data.table", quietly = TRUE)){
    install.packages("data.table")
}
  
library(mzR)
library(ggplot2)
library(plotly)
library(data.table)

Download the data

Next let’s download the LC-MS/MS data we will be working with to a temporary directory (i.e. the directory will be deleted upon closing the R session).

There are two files:

an mzXML file which contains centroid data (peaks only)
an mzML file which contains profile data (raw data,not peak-picked)

GNPS used to require mzXML so that’s the reason for both mzXML and mzML formats.

Warning: This is a 22 MB and 306 MB download.

# I have slow internet so I'll increase the amount of time the download is allowed to take
options(timeout=240)

temporary_directory <- tempdir()

# 22.3 MB
peaks_file_path <- file.path(temporary_directory, "B022.mzXML" )
download.file(url = "ftp://massive.ucsd.edu/v01/MSV000081555/peak/B022.mzXML",
             destfile = peaks_file_path)

# 306.1 MB
raw_mzml_path <- file.path(temporary_directory, "B022.mzML" )
download.file(url = "ftp://massive.ucsd.edu/v01/MSV000081555/raw/FullSpectra-mzML/B022_GenbankAccession-KY858245.mzML",
              destfile = raw_mzml_path)

centroid_msfile_handle <- mzR::openMSfile(peaks_file_path)
profile_spectra_handle <- mzR::openMSfile(raw_mzml_path)
centroid_header_table <- header(centroid_msfile_handle)
profile_header_table <- header(profile_spectra_handle)

Analysis of C14 acyl-desferrioxamine

To start off we’ll inspect scan 2242 which is the MS¹ scan that contains the precursor for the MS² scan “2243” which appeared in Figure S6F and corresponds to a C14 acyl-desferrioxamine.

profile_header_table[profile_header_table$acquisitionNum > 2241 & profile_header_table$acquisitionNum < 2244, ]

Read scan/spectrum 2242 into R And convert it into a data frame with the columns “mass” and “intensity”

single_spectrum <- mzR::peaks(profile_spectra_handle, scans=2242)
single_spectrum <- as.data.frame(single_spectrum)
colnames(single_spectrum) <- c("mass", "intensity")

Based on the header table this should now be a two-column matrix containing 7076 rows.

head(single_spectrum, 5)

dim(single_spectrum)

[1] 7076    2

Plot a profile spectrum

The two-column matrix can be passed directly to R’s base plotting function, with the additional argument type="l" which stands for “line”.

plot(
  single_spectrum,
  type = "l",
)

Plot an isotopic envelope

Let’s zoom in on the ¹³C isotopic envelope of the “743.5622 m/z” precursor we are interested in. Note that R/code doesn’t have a concept of “zoom”, we just filter the data for the the area of interest and only plot that filtered data. Here we filter the data to only include rows where the mass column is greater than 743 and less than 750.

ggplot(
  data = subset(single_spectrum, mass > 743 & mass < 750),
  aes(
    x = mass, 
    y = intensity
  )
) + 
  geom_line(color="gray48") + 
  geom_point(size = 0.75, color="gray0")

And now zoom in further, to the monoisotopic peak.

ggplot(
  data = subset(single_spectrum, mass > 743 & mass < 744),
  aes(
    x = mass, 
    y = intensity
  )
) + 
  geom_line(color="gray48") + 
  geom_point(size = 0.75, color="gray0")

An important thing to take note of when doing most types of spectroscopy/spectrometry is the number of measurements across a peak. Here we are getting ~10 data points per ion/peak which is pretty good. The smaller the number of points, the worse your peak shape will be, the worse your accuracy and precision will be. Alternately, too many points can bloat your data size and sometimes make analyses more difficult. Additionally, more data points collected per scan can decrease the number of scans you can collect per unit of time. This is largely controlled by dwell time in the MS acquisition settings.

ggplot(
  data = subset(single_spectrum, mass > 743 & mass < 744),
  aes(
    x = mass, 
    y = intensity
  )
) + 
  geom_line(color="gray48") + 
  geom_point(size = 0.75, color="gray0") +
  geom_point(data = subset(subset(single_spectrum, mass > 743 & mass < 744), intensity > 100),aes(
    x = mass, 
    y = intensity
  ),  size = 3, color="red")

Plot an extracted ion chromatogram (EIC)

Let’s create a extracted ion chromatogram (EIC) for the “743.5622 m/z” precursor.

To do that we need to loop through all the MS¹ spectra and extract the intensity (in this case the max) of data points within an m/z range.

full_spectra_header <- header(profile_spectra_handle)
ms1_indices <- full_spectra_header[full_spectra_header$msLevel == 1, ]$seqNum

target_mass <- 743.5646
delta <- 0.01

left_window <- target_mass - delta
right_window <- target_mass + delta

# The lapply below creates a list of two-column data frames for each scan, like:
#   ret_time intensity
#   1163.068 0
list_of_data_frames  <- lapply(ms1_indices, 
  function(x){
    ret_time <- full_spectra_header[x, ]$retentionTime
    x <- mzR::spectra(profile_spectra_handle, x)
    x <- as.data.frame(x)
    colnames(x) <- c("mass", "intensity")
    x <- x[x$mass > left_window & x$mass < right_window,  ]
    if (nrow(x) > 0){
      return(data.frame(list(ret_time=ret_time, intensity=max(x$intensity))))
    } else {
      return(data.frame(list(ret_time=ret_time, intensity=0)))
    }
  }
)
# concatenate all the data frames together into a single data frame
eic_df <- do.call("rbind", list_of_data_frames)
remove(list_of_data_frames)

Corresponds to Figure S5E:

title = paste(
  "Extracted Ion Chromatogram: ",
  target_mass,
  " ",
  expression(italic("m/z")),
  " +/- ",
  delta,
   " Da")

ggplot(
  data = eic_df,
  aes(
    x = ret_time / 60,
    y = intensity
  )
) + 
  geom_line(color="gray48") +
  xlab("Retention Time (min)") +
      ggtitle(bquote("Extracted Ion Chromatogram:"~.(target_mass) ~italic("m/z")~"+/-"~.(delta) ~"Da"))

Let’s plot red circles where the instrument fragmented parent ions between 743 m/z & 745 m/z

ggplot(
  data = eic_df,
  aes(
    x = ret_time / 60,
    y = intensity
  )
) + 
  geom_line(color="gray48") + 
  geom_point(
    data = subset(full_spectra_header, precursorMZ > 743 & precursorMZ < 745),
    aes(x=retentionTime / 60, y= 5e5),
    color="red"
    ) +
  xlab("Retention Time (min)")

And we can look at that same information in table form to make sure points weren’t masked by plotting over each other.

subset(full_spectra_header, precursorMZ > 743 & precursorMZ < 745)

Another consideration for the experimentalist (and a good exam question 😛 ) is how you could obtain more scans of this 743.5622 m/z target ion. There’s multiple ways, with the most obvious being to run in targeted mode where you only fragment parent molecules within a tight range around 743.5622 m/z. But if you need untargeted MS you can mess with duty cycles, or adjust your chromatography to increase the elution peak width of the target compound; sometimes 5 minute chromatography isn’t the best chromatography for the job.

Plot a centroid spectrum

A centroid spectrum is a bit harder to plot than profile because a simple line won’t work. For example, look at this ugly mess (circles represent the data points):

centroid_spectrum <- as.data.table(mzR::peaks(centroid_msfile_handle, 2243))
colnames(centroid_spectrum) <- c("mass", "intensity")

plot(centroid_spectrum, type="b", pch = 16)

And with ggplot/plotly…

p <- ggplot(
            data=centroid_spectrum,
            aes(
              x = mass,
              y = intensity
              )
            ) + 
          geom_line() +
          geom_point()  + 
          xlab("m/z") + 
          ylab("Intensity")

ggplotly(p)

Let’s write a function that will use ggplot and plot each data point as a vertical rectangles.

centroid_plot <- function(df){
      df$x1 = df$mass -.2
      df$x2 = df$mass +.2
      df$y1 = 0
      df$y2 = df$intensity
      df2=df
      df2$mass <- NULL
      df2$intensity <- NULL

      p=ggplot() + 
          scale_x_continuous(name="m/z") + 
          scale_y_continuous(name="Intensity") +
          geom_rect(data=df2, mapping=aes(xmin=x1, xmax=x2, ymin=y1, ymax=y2), color="black", alpha=0.5) +
          geom_point(data=df2,aes(x = x1,
                         y = y2,
                         text = paste(round(x1 +0.2, 2), y2)),
                     color='transparent')

      return(p)
  }

df <- as.data.table(mzR::peaks(centroid_msfile_handle, 2243 ))
colnames(df) <- c("mass", "intensity")

centroid_plot(df)

Warning in geom_point(data = df2, aes(x = x1, y = y2, text = paste(round(x1 + :
Ignoring unknown aesthetics: text

And we can pass the ggplot output directly to ggplotly to make the plot interactive:

df <- as.data.table(mzR::peaks(centroid_msfile_handle, 2243))
colnames(df) <- c("mass", "intensity")

ggplotly(centroid_plot(df))

Warning in geom_point(data = df2, aes(x = x1, y = y2, text = paste(round(x1 + :
Ignoring unknown aesthetics: text

Plot mirror spectra

Rectangles were widened here so the would be easier to see on this blog.

transform_df <- function(df){
    df$x1 = df$mass - 0.75
    df$x2 = df$mass + 0.75
    df$y1 = 0
    df$y2 = df$intensity
    df2=df
    df2$mass <- NULL
    df2$intensity <- NULL
    return(df2)
}
    
centroid_plot <- function(df1, df2, top_color="red", bottom_color="blue"){
    df2$intensity <- -df2$intensity
    df1 <- transform_df(df1)
    df2 <- transform_df(df2)    
    
    p <- ggplot() + 
        scale_x_continuous(name="m/z") + 
        scale_y_continuous(name="Intensity") +
        geom_rect(data=df1, mapping=aes(xmin=x1, xmax=x2, ymin=y1, ymax=y2), color="grey30", fill=top_color, alpha=0.5) +
        geom_rect(data=df2, mapping=aes(xmin=x1, xmax=x2, ymin=y1, ymax=y2), fill=bottom_color, color="grey30", alpha=0.5) 

    return(p)
    
    }

df <- as.data.table(mzR::peaks(centroid_msfile_handle, 2243))
colnames(df) <- c("mass", "intensity")

ggplotly(centroid_plot(df, df))

Compare with a spectrum in the GNPS library

Download GNPS library spectrum CCMSLIB00000072054 (Acyl desferrioxamine C14).

temporary_directory <- tempdir()

# 22.3 MB
gnps_spectrum_df <- read.delim("https://metabolomics-usi.gnps2.org/csv/?usi1=mzspec%3AGNPS%3AGNPS-LIBRARY%3Aaccession%3ACCMSLIB00000072054", sep=",")
gnps_spectrum_df <- as.data.table(gnps_spectrum_df)
colnames(gnps_spectrum_df) <- c("mass", "intensity")

One important step that I haven’t discussed yet is normalizing intensity values. This is especially important when comparing data from different sources, where intensity levels may be drastically different. For example, if we just plot our spectrum (positive) against the GNPS library spectrum (negative), we get this:

ggplotly(centroid_plot(df, gnps_spectrum_df))

But if we normalize the intensity values of both spectra to a scale of 0 to 100 then we get a useful representation.

df$intensity <- df$intensity / max(df$intensity) * 100
gnps_spectrum_df$intensity <- gnps_spectrum_df$intensity / max(gnps_spectrum_df$intensity) * 100

ggplotly(centroid_plot(df, gnps_spectrum_df))

Reuse

CC BY 4.0