Part 1: Mass Spectrometry Data

First in a series of introductory posts about working with mass spectrometry data in R
Categories: beginner, mass spectrometry, r

Author: Chase M Clark
Roles: original draft, review & editing
Published: February 1, 2024

The Experiment

Mass spectrometry has become an integral analytical technique in natural product discovery, both for measuring accurate mass for chemical formula determination and for analyzing molecule fragmentation for structure elucidation and library searches.

I won’t go into too much detail here, but there are some important experimental considerations to keep in mind when approaching the analysis of mass spectrometry data.

For more information, I’d recommend looking at the learning material the big vendors usually have, such as this.

The Instrument

Ionizer

There are a number of ionizers in use in modern mass spectrometers, with electrospray ionization (ESI) being the most common in our field. The method of ionization is important to consider because it will affect which types of molecules are ionized, how they are ionized, whether the molecules will remain largely intact or fragment, and what types of adducts you can expect. ESI instruments are often run in positive mode (most often) or negative mode, and sometimes both (polarity switching). You should be aware of this going into your analysis (or be prepared to extract the relevant metadata from the raw data).
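
To make the link between polarity and adducts concrete, here is a minimal Python sketch that computes the m/z you would expect for a few common singly charged ESI adducts of a neutral molecule. The adduct table is a small illustrative subset, the atomic masses are rounded, and the function names (`adduct_mz`, `expected_ions`) are hypothetical helpers rather than part of any library.

```python
# Monoisotopic masses in Da, rounded to six decimals.
H_ATOM = 1.007825
NA_ATOM = 22.989770
ELECTRON = 0.000549

# Adduct name -> (mass shift added to the neutral mass, charge).
# Illustrative subset only; real data can contain many more adducts.
ADDUCTS = {
    "[M+H]+": (H_ATOM - ELECTRON, +1),      # gain of a proton
    "[M+Na]+": (NA_ATOM - ELECTRON, +1),    # gain of a sodium cation
    "[M-H]-": (-(H_ATOM - ELECTRON), -1),   # loss of a proton
}


def adduct_mz(neutral_mass, adduct):
    """Return the m/z of `adduct` for a molecule with the given neutral mass."""
    shift, charge = ADDUCTS[adduct]
    return (neutral_mass + shift) / abs(charge)


def expected_ions(neutral_mass, polarity):
    """Return {adduct: m/z} for the adducts that match the acquisition polarity."""
    if polarity not in ("positive", "negative"):
        raise ValueError("polarity must be 'positive' or 'negative'")
    sign = 1 if polarity == "positive" else -1
    return {
        name: round(adduct_mz(neutral_mass, name), 4)
        for name, (_, charge) in ADDUCTS.items()
        if charge * sign > 0
    }


if __name__ == "__main__":
    caffeine = 194.0804  # approximate monoisotopic mass of caffeine, in Da
    print(expected_ions(caffeine, "positive"))
    print(expected_ions(caffeine, "negative"))
```

If a file was acquired with polarity switching, you would apply the positive table to the positive scans and the negative table to the negative scans.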

Analyzer

“Mass spectrometry” is ultimately performed by the mass analyzer, the part that separates, filters, or differentiates molecules based on mass-to-charge ratio (m/z), and sometimes size/shape via ion mobility. There are a number of analyzers on the market, with the most popular being quadrupoles, ion traps, orbitraps, time-of-flight, and combinations thereof. The type of analyzer is important to consider in the analysis as well, and the following questions are worth asking when approaching a new analysis.

  • What is the resolving power of the analyzer(s)? Plural because, for example, the m/z filter window of a quadrupole that filters into a time-of-flight (TOF) analyzer may or may not impact your analysis.
  • What mode was the instrument run in (e.g. for triple quads, was it run in precursor ion scan, neutral loss scan, product ion scan, or MRM/SRM mode)?
  • Analyzers will often have their performance rated using FWHM (full width at half maximum), the peak width from which resolving power is calculated.
  • If you confuse resolving power with mass resolution you aren’t alone; there’s been much controversy over the years as they are somewhat related. See this whitepaper by Agilent. Simply stated, resolving power measures how well you can separate two mass peaks in a spectrum and resolution is a measure of how “wide” your peaks are.
  • What is the scan speed of your analyzer(s)?
  • Some analyzers are very fast (i.e. they can analyze/separate many m/z per second), while others are slower. Some sacrifice sensitivity, accuracy, resolution, etc. for higher scan speed. These are all things to be aware of.
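
The relationship between peak width and resolving power can be sketched in a few lines of Python. This assumes the common FWHM convention, where resolving power is m/z divided by the peak's full width at half maximum. The "two peaks are separated if they are at least one FWHM apart" rule in `can_separate` is a simplification for illustration, and all function names are hypothetical.

```python
def resolving_power(mz, fwhm):
    """Resolving power under the FWHM convention: R = (m/z) / FWHM."""
    if fwhm <= 0:
        raise ValueError("fwhm must be positive")
    return mz / fwhm


def fwhm_at(mz, power):
    """Peak width (FWHM) expected at `mz` for a given resolving power."""
    if power <= 0:
        raise ValueError("resolving power must be positive")
    return mz / power


def can_separate(mz_a, mz_b, power):
    """Rough check: are two peaks at least one FWHM apart?

    Simplification: assumes the same resolving power applies to both peaks
    and evaluates the peak width at their mean m/z.
    """
    mean_mz = (mz_a + mz_b) / 2
    return abs(mz_a - mz_b) >= fwhm_at(mean_mz, power)


if __name__ == "__main__":
    # A peak at m/z 500 that is 0.05 wide at half maximum: R is about 10,000.
    print(resolving_power(500.0, 0.05))
    # Two ions 0.03 m/z apart near m/z 500, at two different resolving powers.
    print(can_separate(500.00, 500.03, 10_000))
    print(can_separate(500.00, 500.03, 30_000))
```

Note that for some analyzers resolving power varies with m/z, so a single number from a spec sheet only applies at the m/z where it was measured.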

Detector

While there are different detectors, this isn’t usually a concern during modern analyses. However, if you notice things like sensitivity being too high or too low, it could be good feedback to give to the instrument operator, as it could be due to detector settings (though it is more likely sample concentration or ionization efficiency).

Another thing to note is that some instruments may have more than one detector and they may serve different purposes. For example, some time-of-flight (TOF) instruments have both a “linear” detector and a “reflectron” detector; reflectron mode elongates the flight path, allowing higher resolving power but a lower m/z ceiling than the “linear” detector.

The Data

Raw data formats

Unfortunately, different instrument vendors, and even different instruments from the same vendor, have their own unique data storage formats. This is for a variety of good and bad reasons; the most convincing to me is that instruments with ever-increasing acquisition speeds and data sizes need faster/better software/hardware strategies to store data, which can provide a competitive advantage.

Raw data is likely to come with vendor-specific file extensions (.wiff, .d, .raw/.RAW, .lcd, etc.), and some formats are locked down so that only the instrument vendor’s software can read the data.

Open-source data formats

Fortunately, there are widely used, open, standard formats available. You will probably encounter mzXML and/or its newer successor, mzML; go with mzML if you have the option. mzXML files have the file extension .mzXML and mzML files have the file extension .mzML.
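
When you inherit a folder of mixed files, a quick first pass is to sort them by extension. The Python sketch below does this using only the extensions mentioned in this post. The vendor list is not exhaustive, the function name `classify` is hypothetical, and an extension alone does not guarantee the contents are valid.

```python
from pathlib import PurePath

# Open formats, keyed by lowercase extension.
OPEN_FORMATS = {".mzml": "mzML", ".mzxml": "mzXML"}

# Vendor extensions mentioned in the text (not an exhaustive list).
# Note that ".d" is typically a directory rather than a single file.
VENDOR_EXTENSIONS = {".wiff", ".d", ".raw", ".lcd"}


def classify(path):
    """Return 'mzML', 'mzXML', 'vendor', or 'unknown' based on the extension."""
    suffix = PurePath(path).suffix.lower()  # case-insensitive: .RAW == .raw
    if suffix in OPEN_FORMATS:
        return OPEN_FORMATS[suffix]
    if suffix in VENDOR_EXTENSIONS:
        return "vendor"
    return "unknown"


if __name__ == "__main__":
    for name in ["run01.mzML", "run02.mzXML", "run03.RAW", "run04.d", "notes.txt"]:
        print(name, "->", classify(name))
```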

Some vendor software allows converting a file from its proprietary data format to mzML; otherwise, your best bet is likely the program msconvert, available as part of the ProteoWizard software library. Unfortunately, some vendor formats can only be converted on a Windows computer, a limitation of vendors only providing Windows-based DLLs (i.e. don’t complain to the ProteoWizard team about this).

msconvert can be used either from its GUI or at the command line.

For an example of how to use the command line, you can take a look at this zip of a directory that contains a batch file that converts a large number of files at once: https://ccms-ucsd.github.io/GNPSDocumentation/fileconversion/#data-conversion-easy
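
If you would rather script the conversion yourself, a batch run can be sketched in Python as below. This is a minimal sketch that assumes msconvert is installed and on your PATH, and it only uses the `--mzML`/`--mzXML` output flags and the `-o` output directory option. The helper names (`msconvert_command`, `convert_all`) and the example file names are hypothetical. By default it only builds the commands (a dry run) so you can inspect them before converting anything.

```python
import shutil
import subprocess


def msconvert_command(raw_file, out_dir, fmt="mzML"):
    """Build the msconvert argument list for one raw file."""
    if fmt not in ("mzML", "mzXML"):
        raise ValueError("fmt must be 'mzML' or 'mzXML'")
    return ["msconvert", str(raw_file), f"--{fmt}", "-o", str(out_dir)]


def convert_all(raw_files, out_dir, fmt="mzML", dry_run=True):
    """Build one command per raw file and, unless dry_run is set, run each.

    Returns the list of commands so they can be logged or inspected.
    Nothing is executed if msconvert cannot be found on the PATH.
    """
    commands = [msconvert_command(f, out_dir, fmt) for f in raw_files]
    if dry_run or shutil.which("msconvert") is None:
        return commands
    for command in commands:
        # check=True raises CalledProcessError if a conversion fails.
        subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for command in convert_all(["sample_01.raw", "sample_02.raw"], "converted"):
        print(" ".join(command))
```

Passing an argument list to `subprocess.run` (rather than one shell string) avoids quoting problems when file names contain spaces.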

I haven’t had the chance to try them, but supposedly there are some relatively new Docker containers that can successfully run msconvert. If you know how badly this was needed, then you know how exciting this would be/is.

Going forward, I will only cover mzML/mzXML as they are by far the most commonly encountered open formats in our field. Other formats can be seen at https://www.psidev.info/specifications, and MGF is described at http://www.matrixscience.com/help/data_file_help.html

Next

In the next post we will dive into what mzML actually looks like, what spectra look like, etc.