About OpenMS

What is OpenMS

OpenMS is a free, open-source framework based on a C++ library with Python bindings. It is commonly used for liquid chromatography-mass spectrometry (LC-MS) data management and analyses. OpenMS provides an infrastructure for the rapid development of mass spectrometry-related software as well as a rich toolset built on top of it. OpenMS is available under the 3-clause BSD license and runs under Windows, macOS, and Linux operating systems.

OpenMS overview

OpenMS developers can create new C++ algorithms and tools, while users can execute tools or implement new algorithms or scripts in Python. Workflows integrate pyOpenMS scripts and OpenMS tools with third-party tools and external Python libraries to create scalable data-processing pipelines. For deployment, users can use pyOpenMS with web frameworks or deploy workflows on desktop, high-performance computing (HPC) or cloud infrastructure using one of the community-supported workflow systems.
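
To give a flavor of the Python side, here is a minimal pyOpenMS sketch that loads an mzML file and reports how many spectra it contains (the file name example.mzML is a placeholder, not part of OpenMS):

     import pyopenms as oms

     # Load an mzML file into an in-memory experiment (a peak map)
     exp = oms.MSExperiment()
     oms.MzMLFile().load("example.mzML", exp)

     print("MS run contains", exp.getNrSpectra(), "spectra")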

OpenMS supports the Proteomics Standard Initiative (PSI) formats for MS data. The main contributors of OpenMS are currently the Eberhard-Karls-Universität in Tübingen, the Freie Universität Berlin, and the University of Toronto.

Get involved

OpenMS is developed by a group of core developers and the community. You can help spread the idea of open-source mass spectrometry analysis in several ways:

  • Contribute to the development: give us your feedback about the OpenMS project on Discord, or become active by developing new tools yourself.

  • Donate to the OpenMS project using our opencollective account. All donations will be used strictly to fund the development of OpenMS's open-source software, documentation, and community.

  • Promote OpenMS either online (e.g. on X) or in your work group.

Installation

GNU/Linux

Install via Conda

Warning

At this time, we do not provide a conda package for our GUI tools. If you want to install GUI tools such as TOPPView or SwathWizard, for example to follow one of our tutorials, please refer to a different installation method below.

You can use conda or mamba to install the OpenMS library and tools without a user interface. Depending on the conda channel, you can obtain release versions (bioconda channel) or nightly versions (openms channel).

  1. Follow the instructions to install conda or mamba. In the following, every mention of conda may be substituted by mamba for faster environment solving.

  2. We recommend creating a new environment with one of the supported Python versions:

     conda create -n openms python=3.10
    
  3. Add some channels to find dependencies:

     conda config --add channels defaults
     conda config --add channels bioconda
     conda config --add channels conda-forge
    

    Note

    You can also add the channels for your current environment only with the --env option.

    Warning

    The order of the channels is important!

    Note

    conda-forge might already be added if you are using Mambaforge.

  4. Install any of the following packages related to OpenMS:

    openms contains all OpenMS C++ command-line tools. GUI applications like TOPPView currently cannot be installed via conda.

    libopenms is the C++ library required for the OpenMS C++ tools to work. It is also an auto-installed dependency of openms.

    pyopenms is the Python package that allows you to use algorithms from libopenms in Python (a quick import check is shown after the install commands below).

    openms-thirdparty contains external tools that are wrapped in OpenMS with adapters. This package is required to use the adapters in the openms package.

    Warning

    Because a large part of the thirdparty tools is unavailable for macOS via conda, we do not provide an openms-thirdparty package on macOS either.

    via bioconda for release versions

    conda install openms
    
    conda install libopenms
    
    conda install pyopenms
    
    conda install openms-thirdparty
    

    or via our own openms channel for nightly snapshots (which are built on the same bioconda dependencies)

    conda install -c openms openms
    
    conda install -c openms libopenms
    
    conda install -c openms pyopenms
    
    conda install -c openms openms-thirdparty
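
    After installation, a quick sanity check from within the activated environment is to import the bindings in Python and print their version (this assumes the pyopenms package from above was installed):

     import pyopenms
     print(pyopenms.__version__)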
    
Install via package managers

Packaged versions of OpenMS are provided for Fedora, OpenSUSE, Debian, and Ubuntu and can be downloaded here. For other GNU/Linux distributions, or to obtain the most recent version of the library, build OpenMS from the source code.

Important

These packages are not directly maintained by the OpenMS team, and they cannot be guaranteed to behave the same as a build from source code. Also, their availability and versions are subject to change, and support might be limited (due to unforeseen or untested behaviour). It is suggested not to install them in parallel with our Debian package.

Note

Some thirdparty software used via adapter tools in OpenMS might also require an installed Java VM.

Install via the provided Debian package

For Debian-based Linux users, it is suggested to use the provided .deb package. It is most easily installed with gdebi, which automatically resolves the dependencies available in the PPA repositories.

sudo apt-get install gdebi
sudo gdebi /PATH/TO/OpenMS.deb

If you encounter errors with unavailable packages, troubleshoot using the following steps.

  1. Qt5 (or one of its packages, e.g. qt5xbase) is missing.

    It might be because your Debian is too old to have a recent enough version in its official repositories. It is suggested to use the same packages that are used while building (make sure to adapt the Qt version and your Debian/Ubuntu version, here Xenial):

    sudo add-apt-repository ppa:beineri/opt-qt59-xenial
    sudo apt-get update
    

    Run the installation again.

  2. ICU with its libicu is missing.

    You can find the missing version on pkgs.org and install it with gdebi, too. You can have multiple versions of ICU installed.

  3. Error while executing a tool

    To ensure tool functionality, add the OPENMS_DATA_PATH variable to your environment as follows: export OPENMS_DATA_PATH=/usr/share/OpenMS

  4. Thirdparty installation of Qt5 in step 1

    Make sure you source the provided environment file using: source /opt/qt59/bin/qt59-env.sh

  5. Adapters are not finding thirdparty applications

    Executables for thirdparty applications can be found in /usr/share/OpenMS/THIRDPARTY. Add these folders to your PATH for convenient use of the adapters.

Run OpenMS inside a (Bio)Container

  1. Install containerization software (e.g., Docker or Singularity).

  2. Pull an image from one of the following registries:

  • OpenMS GitHub Container Registry for nightly binaries AND releases:

    On our registry, we provide one image for the library (with contrib) and one for the executables (with thirdparty).

    1. openms-library

    2. openms-executables

    They can be pulled/run via the following commands:

    docker pull ghcr.io/openms/openms-library
    docker pull ghcr.io/openms/openms-executables
    
    singularity run ghcr.io/openms/openms-library-sif
    singularity run ghcr.io/openms/openms-executables-sif
    

    Note

    By default, this downloads the latest nightly snapshot. Specify a release version (e.g., docker pull ghcr.io/openms/openms-library:3.1.0) to receive a stable version.

  • Otherwise, the BioContainers Registries and the associated Galaxy project provide native containers based on our bioconda packages for both Docker and Singularity.

    1. BioContainers libopenms

    2. BioContainers openms

    3. BioContainers openms-thirdparty

    4. BioContainers pyOpenMS

    Images of the containers can be pulled via one of the following commands:

    docker pull quay.io/biocontainers/libopenms
    docker pull quay.io/biocontainers/openms
    docker pull quay.io/biocontainers/pyopenms
    docker pull quay.io/biocontainers/openms-thirdparty
    
    singularity run https://depot.galaxyproject.org/singularity/libopenms
    singularity run https://depot.galaxyproject.org/singularity/openms
    singularity run https://depot.galaxyproject.org/singularity/pyopenms
    singularity run https://depot.galaxyproject.org/singularity/openms-thirdparty
    

Note

If Singularity images fail to download or run, try using the Docker images instead, as Singularity will automatically convert them.

Dockerfiles to build different kinds of images (e.g., for ArchLinux) yourself can be found on GitHub in our OpenMS/dockerfiles repository. They usually follow our build instructions closely, so you can have a look at how this is done in a clean environment.

Build OpenMS from source

To build OpenMS from source, follow the build instructions for Linux.

macOS

Install via Conda

Warning

At this time, we do not provide a conda package for our GUI tools. If you want to install GUI tools such as TOPPView or SwathWizard, for example to follow one of our tutorials, please refer to a different installation method below.

You can use conda or mamba to install the OpenMS library and tools without a user interface. Depending on the conda channel, you can obtain release versions (bioconda channel) or nightly versions (openms channel).

  1. Follow the instructions to install conda or mamba. In the following, every mention of conda may be substituted by mamba for faster environment solving.

  2. We recommend creating a new environment with one of the supported Python versions:

     conda create -n openms python=3.10
    
  3. Add some channels to find dependencies:

     conda config --add channels defaults
     conda config --add channels bioconda
     conda config --add channels conda-forge
    

    Note

    You can also add the channels for your current environment only with the --env option.

    Warning

    The order of the channels is important!

    Note

    conda-forge might already be added if you are using Mambaforge.

  4. Install any of the following packages related to OpenMS:

    openms contains all OpenMS C++ command-line tools. GUI applications like TOPPView currently cannot be installed via conda.

    libopenms is the C++ library required for the OpenMS C++ tools to work. It is also an auto-installed dependency of openms.

    pyopenms is the Python package that allows you to use algorithms from libopenms in Python.

    openms-thirdparty contains external tools that are wrapped in OpenMS with adapters. This package is required to use the adapters in the openms package.

    Warning

    Because a large part of the thirdparty tools is unavailable for macOS via conda, we do not provide an openms-thirdparty package on macOS either.

    via bioconda for release versions

    conda install openms
    
    conda install libopenms
    
    conda install pyopenms
    
    conda install openms-thirdparty
    

    or via our own openms channel for nightly snapshots (which are built on the same bioconda dependencies)

    conda install -c openms openms
    
    conda install -c openms libopenms
    
    conda install -c openms pyopenms
    
    conda install -c openms openms-thirdparty
    
Install via macOS installer

To install OpenMS on macOS, run the following steps:

  1. Download the macOS drag-and-drop installer from the archive.

  2. Double-click the downloaded file to open the OpenMS-<version>-macOS.dmg disk image.

  3. Verify the download.

  4. Agree to the license agreements.

  5. Drag OpenMS to the Applications folder.

  6. Wait while OpenMS is copied to the Applications folder.

To use the TOPP tools from a shell, add the following lines to your ~/.profile file, making sure <OpenMS-PATH> points to the folder where OpenMS is installed (e.g., /Applications/OpenMS-<version>):

export OPENMS_TOPP_PATH=<OpenMS-PATH>
source ${OPENMS_TOPP_PATH}/.TOPP_bash_profile

Warning

Known Installer Issues

  1. Nothing happens when you click OpenMS apps, or the validity of the developer could not be confirmed.

    This usually means the OpenMS software lands in quarantine after installation of the .dmg. Since macOS Catalina (and possibly Mojave), all apps and executables have to be officially notarized by Apple, but we currently do not have the resources for a streamlined notarization workflow.

    To have a streamlined experience without blocking pop-ups, it is recommended to remove the quarantine flag manually:

    Open Terminal.app and type the following (replace the first line with the actual installation directory):

    cd /Applications/OpenMS-<version>
    sudo xattr -r -d com.apple.quarantine *

  2. Bug with running Java-based thirdparty tools like MSGFPlusAdapter and LuciphorAdapter from within TOPPAS.app

    If you face issues while running Java-based thirdparty tools from within TOPPAS.app, run TOPPAS.app from within Terminal.app (e.g., with the open command) to get access to the path where Java is located. Java is usually present in the PATH of the terminal. Advanced users can set this path in the Info.plist of/inside the TOPPAS.app.

Run OpenMS inside a (Bio)Container

  1. Install containerization software (e.g., Docker or Singularity).

  2. Pull an image from one of the following registries:

  • OpenMS GitHub Container Registry for nightly binaries AND releases:

    On our registry, we provide one image for the library (with contrib) and one for the executables (with thirdparty).

    1. openms-library

    2. openms-executables

    They can be pulled/run via the following commands:

    docker pull ghcr.io/openms/openms-library
    docker pull ghcr.io/openms/openms-executables
    
    singularity run ghcr.io/openms/openms-library-sif
    singularity run ghcr.io/openms/openms-executables-sif
    

    Note

    By default, this downloads the latest nightly snapshot. Specify a release version (e.g., docker pull ghcr.io/openms/openms-library:3.1.0) to receive a stable version.

  • Otherwise, the BioContainers Registries and the associated Galaxy project provide native containers based on our bioconda packages for both Docker and Singularity.

    1. BioContainers libopenms

    2. BioContainers openms

    3. BioContainers openms-thirdparty

    4. BioContainers pyOpenMS

    Images of the containers can be pulled via one of the following commands:

    docker pull quay.io/biocontainers/libopenms
    docker pull quay.io/biocontainers/openms
    docker pull quay.io/biocontainers/pyopenms
    docker pull quay.io/biocontainers/openms-thirdparty
    
    singularity run https://depot.galaxyproject.org/singularity/libopenms
    singularity run https://depot.galaxyproject.org/singularity/openms
    singularity run https://depot.galaxyproject.org/singularity/pyopenms
    singularity run https://depot.galaxyproject.org/singularity/openms-thirdparty
    

Note

If Singularity images fail to download or run, try using the Docker images instead, as Singularity will automatically convert them.

Dockerfiles to build different kinds of images (e.g., for ArchLinux) yourself can be found on GitHub in our OpenMS/dockerfiles repository. They usually follow our build instructions closely, so you can have a look at how this is done in a clean environment.

Build OpenMS from source

To build OpenMS from source, follow the build instructions for macOS.

Windows

Install via Windows installer

To install the binary package of OpenMS and TOPP:

  1. Download the installer OpenMS-<version>-Win64.exe from the archive

  2. Execute the installer under the user account that later runs OpenMS and follow its instructions.

    You may see a Windows Defender Warning, since our installer is not digitally signed.

    Click on “More info”, and then “Run anyway”.

    When asked for an admin authentication, please enter the credentials (it is not advised to directly invoke the installer using an admin account).

Tip

The Windows installer works with Windows 10 and 11 (older versions might still work but are untested).

Known issues
  1. During installation, an error message pops up, saying:

    “The installation of the ‘Microsoft .NET 3.5 SP1’ package failed!”

    You must download and install it manually in order for ProteoWizard to work. This should only happen if the installation is done by selecting the “Third Party - Proteowizard” components. The reason is usually that .NET 3.5 SP1 is already installed (see the Windows Control Panel). If it is not installed, follow the instructions of the error message.

  2. During installation, an error message pops up, saying:

    “The installation of the Visual Studio redistributable package … failed. …”

    This is a known issue with a Microsoft package; we cannot do anything about it. The error message will give the location where the redistributable package was extracted to. Go to this folder and run the executable (usually named vcredistXXXX.exe) as an administrator (right-click and then select Run as administrator). You will likely receive an error message (this is also the reason why the OpenMS setup complained about it). You might have to find the solution to the problem on your local machine. If you are lucky, the error message is instructive and the problem is easy to fix.

  3. During installation, an error message pops up saying:

    “Error opening installation log file”

    To fix this, check the system environment variables. There should be a TMP and a TEMP variable, and each should contain exactly one directory, which exists and is writable. Fix accordingly (search the internet for how to change environment variables on Windows).

  4. On Windows 8 or later, Windows will report an error while installing .NET 4, as it is already included. However, it might happen that .NET 3.5 does not get properly installed during the process.

    The fix is to enable the .NET Framework 3.5 yourself through the Control Panel. See this Microsoft help page for detailed information. Even if this step fails, it does not affect the functionality of OpenMS, except for the execution of the included third-party tools (ProteoWizard).

Workflow Editor

You can run OpenMS TOPP tools from the command line using your custom scripts, or use powerful workflow systems designed to make workflow creation and maintenance more fun. Find out more in Workflow Editor.


Graphical and Command-Line Tools

For instructions on how to install the OpenMS graphical and command-line tools, choose your operating system from the items below.

GNU/Linux

Installation on Linux

macOS

Installation on macOS

Windows

Installation on Windows

Community

Welcome to OpenMS!

OpenMS is a community-driven open source project developed by a diverse group of contributors. The OpenMS leadership has made a strong commitment to creating an open, inclusive, and positive community. Please read the OpenMS Code of Conduct for guidance on how to interact with others in a way that makes the community thrive.

Here’s how to get started:

We offer several communication channels to learn, share your knowledge and connect with others within the OpenMS community.


Discord

Discord allows users to communicate in different channels, both publicly and privately.


GitHub

The following repositories can be used:

If you are unsure which repository to use, please ask your question here.


GitHub issue tracker

The issue tracker can be used for:

OpenMS

Documentation


Gitter

A real-time chat room to ask questions about OpenMS.


OpenMS mailing list

These lists are the main form of receiving OpenMS-related updates, like new features, changes to the roadmap, and all kinds of project-wide decision making.

There are currently two mailing lists:

  • open-ms-announcements - Announcements about OpenMS, such as for releases, developer meetings, sprints or conference talks are made on this list.

  • open-ms-general - For general questions or suggestions from users.


Twitter

Contact us or just follow the latest OpenMS news on Twitter.


Join the OpenMS Community

To thrive, the OpenMS project needs your expertise and enthusiasm. Not a coder? Not a problem! There are many ways to contribute to OpenMS.

If you are interested in becoming an OpenMS contributor (yay!), we recommend checking out our Contribute page.

Learning

Proteomics and metabolomics focus on complex interactions within biological systems; the former is centered on proteins while the latter is based on metabolites. To understand these interactions, we need to accurately identify the different biological components involved.

Liquid chromatography (LC) and mass spectrometry (MS) are the analytical techniques used to isolate and identify biological components in proteomics and metabolomics. LC-MS data can be difficult to analyze manually given its amount and complexity. Therefore, we need specialized software that can analyze high-throughput LC-MS data quickly and accurately.

Why use OpenMS

OpenMS is an open-source C++ framework for analyzing large volumes of mass spectrometry data. It was specially designed for analyzing high-performance LC-MS data but has more recently been extended to analyze data generated by other techniques.

Note

OpenMS in recent times has been expanded to support a wide variety of mass spectrometry experiments. To design your analysis solution, contact the OpenMS team today.

To use OpenMS effectively, an understanding of chromatography and mass spectrometry is required as many of the algorithms are based on these techniques. This section provides a detailed explanation on LC and MS, and how they are combined to identify and quantify substances.

Liquid chromatography (LC)

Chromatography is a technique used by life scientists to separate molecules based on a specific physical or chemical property.

Video

For more information on chromatography, view this video.

There are many types of chromatography, but this section focuses on LC as it is widely used in proteomics and metabolomics.

LC separates molecules based on a specific physical or chemical property by mixing a sample containing the molecules of interest (otherwise known as analytes) in a liquid solution.

Key components of LC

An LC setup is made up of the following components:

  • A liquid solution, known as the mobile phase, containing the analytes.

  • A pump which transports the liquid solution.

  • A stationary phase which is a solid, homogeneous substance.

  • A column that contains the stationary phase.

  • A detector that plots the time it takes for the analyte to escape the column (retention time) against the analyte’s concentration. This plot is called a chromatogram.

Refer to the image below for a diagrammatic representation of an LC setup.

schematic illustration of an LC setup

How does LC work?

The liquid solution containing the analytes is pumped through a column that is attached to the stationary phase. Analytes are separated based on how strongly they interact with each phase. Some analytes will interact strongly with the mobile phase while others will be strongly attracted to the stationary phase, depending on their physical or chemical properties. The stronger an analyte’s attraction is to the mobile phase, the faster it will leave the column. The time it takes for an analyte to escape from the column is called the analyte’s retention time. As a result of their differing attractions to the mobile and stationary phases, different analytes will have different retention times, which is how separation occurs.

The retention times for each analyte are recorded by a detector. The most common detector used is the mass spectrometer, which we discuss later. However, other detection methods exist, such as:

  • Light absorption (photometric detector)

  • Fluorescence

  • Change in diffraction index

High performance liquid chromatography (HPLC)

HPLC is the most commonly used technique for separating proteins and metabolites. In HPLC, a high-pressured pump is used to transport a liquid (solvent) containing the molecules of interest through a thin capillary column. The stationary phase is ‘packed’ into the column.

Video

For more information on HPLC, view this video.

Several variations of HPLC exist such as:

  • Reversed-phase (RP) chromatography

  • Strong cation/anion exchange (SCX/SAX) chromatography

  • Affinity chromatography

  • Size exclusion chromatography

Special case of HPLC: Reversed-phase (RP) chromatography

RP chromatography is the most common type of HPLC used with biological samples. In reversed-phase liquid chromatography, the solid phase is modified to become hydrophobic, whereas it is originally hydrophilic, hence the term ‘reversed-phase’. The liquid phase is a mixture of water and an organic solvent. The separation of molecules happens based on the following behavior: hydrophilic analytes have a high affinity to the mobile phase and escape the column quickly, while hydrophobic analytes have a high affinity towards the organic solvent and therefore take a longer time to escape the column.

Video

For more information on RP chromatography, view this video.

Mass spectrometry (MS)

Mass spectrometry is an analytical technique used to determine the abundance of molecules in a sample.

Key components of MS

There are three key components in a mass spectrometer:

  • An ion source, which generates ions from the incoming sample. All mass spectrometry techniques rely on ionized molecules to control their movement in an electric field.

  • A mass analyzer, which separates the ions according to their mass-to-charge (m/z) ratio. There are several types such as time of flight (TOF), orbitrap and quadrupole mass analyzers. Depending on the mass analyzer, OpenMS offers calibration tools, so that highly accurate results can be achieved.

  • A detector, which scans ions at a given time point producing a mass spectrum, where the intensity is plotted against the m/z.

Refer to the image below for a diagrammatic representation of the key components in MS.

schematic illustration of a mass spectrometer

Ion source

We want the analytes to move through the electrostatic and electromagnetic fields in the mass analyzer. To achieve this objective, we need to convert them to ions by charging them. There are a number of ways to charge our analytes including:

  • Electrospray Ionization (ESI)

  • Matrix Assisted Laser Desorption/Ionization (MALDI)

  • Electron Impact Ionization (EI)

In proteomics and metabolomics, ESI and MALDI are used because they are soft ionization techniques. A soft ionization technique is one which charges analytes while keeping the molecules of interest largely intact, so that they can be characterized easily at a later stage. Hard ionization techniques such as EI shatter analytes into smaller fragments, making it difficult to characterize large molecules.

Given that OpenMS focuses on proteomic and metabolomic applications, we will describe ESI and MALDI in further detail.

Electrospray Ionization (ESI)

ESI can be broken down into the following steps.

  1. The sample is dissolved in a polar, volatile buffer.

  2. The sample - dissolved in the buffer - is pumped through a thin, stainless steel capillary.

  3. The sample is converted to small, charged, stable droplets (aerosolized) by applying high voltage.

  4. The aerosol is directed through regions of high vacuum, where the droplets evaporate until only the charged molecules are left.

  5. The particles are fed to the mass analyzer.

Refer to the image below for a diagrammatic representation of the steps in ESI.

a simplified, schematic representation of ESI

Video

For more information on ESI, view this video.

Matrix Assisted Laser Desorption/Ionization (MALDI)

MALDI can be broken down into the following steps:

  1. The analytes are mixed with a small organic molecule known as a matrix.

  2. The mixture is exposed to radiation with short pulses of laser light, charging the matrix.

  3. The matrix transfers its charge to the analytes because the wavelength of the laser light is the same as the absorbance maximum of the matrix.

  4. The analytes become charged and are fed to the mass analyzer.

Refer to the image below for a diagrammatic representation of the steps in MALDI.

a simplified, schematic representation of MALDI

Video

For more information on MALDI, view this video.

Mass analyzer

Once the analytes have been charged by the ion source, we now want to sort them by their mass-to-charge ratio for easy identification.

A number of mass analyzers exist. These include:

  • Quadrupole analyzer

  • Time-of-Flight analyzer

  • Orbitrap analyzer

The next sections describe each analyzer type in detail.

Quadrupole

In a quadrupole analyzer, you can set the quadrupole voltage so that only ions with a specific m/z ratio travel through. The oscillating electrostatic fields stabilize the flight path of these ions so that they can pass through the quadrupole. Other ions will be accelerated out of the quadrupole and will not make it to the end.

Refer to the image below for a diagrammatic representation of the quadrupole analyzer.

a simplified, schematic representation of the quadrupole analyzer

Video

For more information on quadrupole analyzers, view this video.

Time-of-Flight (TOF)

In a time-of-flight analyzer, ions are extracted from the ion source in pulses by an electrostatic field and travel through a field-free drift zone. An electrostatic mirror called a reflectron reflects the ions back onto the next component of the mass spectrometer, the detector. The detector counts the particles and records the time of flight from extraction to the moment the particle hits the detector.

Refer to the image below for a diagrammatic representation of the TOF analyzer.

a simplified, schematic representation of TOF

Lighter ions fly faster than heavier ions of the same charge and will arrive earlier at the detector. Therefore, an ion’s time of flight depends on the ion’s mass. The ion’s time of flight is also dependent on the ion’s charge. This can be demonstrated using the following equations:

  1. Potential energy is transferred to an ion with charge q accelerated by an electrostatic field with voltage U_a.

\[ \begin{equation} E_p = qU_a \end{equation}\]

  2. The potential energy is converted to kinetic energy as the ion accelerates.

\[ \begin{equation} E_p = E_k = \frac{1}{2}mv^2 \end{equation}\]

  3. We know that for a given path, s, from extraction to the detector, the time of flight, t, is equal to:

\[ \begin{equation} t = \frac{s}{v} \end{equation}\]

Therefore, t, for a given instrument’s path length, s, depends on an ion’s charge and mass.

\[ \begin{equation} t = \frac{s}{v} = \frac{s}{\sqrt{\frac{2qU_a}{m}}} \end{equation}\]
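
Solving this relation for the mass-to-charge ratio shows how the measured flight time is converted into m/q for a known acceleration voltage U_a and path length s:

\[ \begin{equation} \frac{m}{q} = \frac{2U_a t^2}{s^2} \end{equation}\]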

Video

For more information on TOF analyzers, view this video.

Orbitrap

The orbitrap analyzer is the most frequently used analyzer in mass spectrometry for proteomic and metabolomic applications. It consists of two outer electrodes and a central electrode. Ions are captured inside the analyzer because of an applied electrostatic field. The ions in the orbitrap analyzer oscillate around the central electrode along the axis of the electrostatic field at a set frequency, ω. This frequency is used to determine the mass-to-charge ratio using the following formula:

\[ \begin{equation} \omega = \sqrt{\frac{kz}{m}} \end{equation}\]

where k is a constant.
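
Rearranging this expression gives the mass-to-charge ratio directly from the measured oscillation frequency:

\[ \begin{equation} \frac{m}{z} = \frac{k}{\omega^2} \end{equation}\]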

The following diagram is a conceptual representation of the orbitrap analyzer.

schematic illustration of a mass spectrometer

Video

For more information on orbitrap analyzers, view this video.

Identifying molecules with Tandem Mass Spectrometry (MS2)

To get better results, we can use two mass analyzers sequentially to generate and analyze ions. This technique is called tandem mass spectrometry or MS/MS (MS2). Tandem mass spectrometry is especially useful for linear polymers like proteins, RNA and DNA.

With MS2, ions called precursor ions are isolated and fragmented into ion fragments or product ions. A mass spectrum is recorded for both the precursor and the product ions.
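
In pyOpenMS, each MS2 spectrum carries the m/z and charge of the precursor ion it was generated from. A small sketch that lists this precursor information (the file name msms.mzML is a placeholder) could look like this:

     import pyopenms as oms

     exp = oms.MSExperiment()
     oms.MzMLFile().load("msms.mzML", exp)

     # Report the isolated precursor ion of every MS2 spectrum
     for spectrum in exp:
         if spectrum.getMSLevel() == 2 and spectrum.getPrecursors():
             precursor = spectrum.getPrecursors()[0]
             print("MS2 at RT", round(spectrum.getRT(), 1),
                   "precursor m/z", round(precursor.getMZ(), 4),
                   "charge", precursor.getCharge())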

Video

For more information on MS2, view this video.

Different techniques exist to fragment peptides:

  • Collision-Induced Dissociation (CID)

  • Pulsed Q Dissociation (PQD)

  • Electron transfer dissociation (ETD)

  • Electron capture dissociation (ECD)

  • Higher energy collision dissociation (HCD)

CID is the most frequently used fragmentation technique and will therefore be discussed in more detail in the following section.

Collision-induced dissociation

Collision-induced dissociation is a method to fragment peptides using an inert gas such as argon or helium. Selected primary or precursor ions enter a collision cell filled with the inert gas. Collisions with the inert gas cause precursor ions that reach the energy threshold to fragment into smaller product ions and/or neutral losses. A mass spectrum is recorded for both the precursor ions and the product ions. The mass spectrum of the precursor ions gives you the mass of the entire peptide, while the product ions inform you about its amino acid composition.

Video

For more information on CID, view this video.

LC-MS

Liquid chromatography is often coupled with mass spectrometry to reduce complexity in the mass spectra. If complex samples were directly fed to a mass spectrometer, you would not be able to detect the less abundant analyte ions. The separated analytes from the liquid chromatography setup are directly injected into the ion source of the mass spectrometer. Multiple analytes that escape the column at the same time are separated by their mass-to-charge ratio using the mass spectrometer.

Refer to the image below for a diagrammatic representation of the LC-MS setup.

lc-ms setup

From the LC-MS setup, a series of spectra are ‘stacked’ together to form what is known as a peak map. Each spectrum in a peak map represents the ions detected at a particular retention time and is a collection of data points called peaks, which indicate the retention time, mass-to-charge ratio, and intensity of each detected ion. Analyzing peak maps is difficult because different compounds can elute at the same time, which means that peaks can overlap. Therefore, sophisticated techniques are required for the accurate identification and quantification of molecules.
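
In pyOpenMS, such a peak map is represented by an MSExperiment object: each contained spectrum carries a retention time, and its peaks are pairs of m/z and intensity values. A minimal sketch that walks over the MS1 spectra of a peak map (the file name peakmap.mzML is a placeholder) could look like this:

     import pyopenms as oms

     peak_map = oms.MSExperiment()
     oms.MzMLFile().load("peakmap.mzML", peak_map)

     # Each spectrum corresponds to one retention time; its peaks are (m/z, intensity) pairs
     for spectrum in peak_map:
         if spectrum.getMSLevel() == 1:
             mz_values, intensities = spectrum.get_peaks()
             print("RT", round(spectrum.getRT(), 1), "s:", len(mz_values), "peaks")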

The image below includes a spectrum at a given retention time (left) and a peak map (right).

peak map

Video

For more information on a specific application of LC-MS, view this video.

Identification and Quantification

While the combination of liquid chromatography and mass spectrometry can ease the process of characterising molecules of interest, further techniques are required to easily identify and quantify these molecules. This section discusses both labeled and label-free quantification techniques.

Labeling

Relative quantification is one strategy where one sample is chemically treated and compared to another sample without treatment. This section discusses a particular relative quantification technique called labeling, or stable isotope labeling, which involves the addition of isotopes to one sample. An isotope of an element behaves the same chemically but has a different mass. Stable isotope labeling is used in mass spectrometry so that scientists can easily identify proteins and metabolites.

Two types of stable isotope labeling exist: chemical labeling and metabolic labeling.

Chemical labeling

During chemical labeling, the label is attached at specific functional groups in a molecule like the N-terminus of a peptide or specific side chains.

Chemical labeling occurs late in the process; therefore, experiments that incorporate this technique are not highly reproducible.

Isobaric labeling

Isobaric labeling is a technique where peptides and proteins are labeled with chemical groups that have an identical mass but vary in the distribution of heavy isotopes in their structure.

Video

For more information on isobaric labeling, view the following links:

OpenMS contains tools that analyze data from isobaric labeling experiments.

Metabolic labeling

During metabolic labeling, the organism is ‘fed’ with labeled metabolites. Metabolites include but are not limited to amino acids, nitrogen sources and glucose. Unlike chemical labeling, metabolic labeling occurs early in the study. Therefore, experiments that incorporate metabolic labeling are highly reproducible.

Stable Isotope Labeling with Amino Acids in Cell Culture (SILAC)

In SILAC, the labeled amino acids are fed to the cell culture. The labels are integrated into the proteins after a period. The labeled sample is then compared with the unlabeled sample.

OpenMS contains tools that analyze data from SILAC experiments.

Video

For more information on SILAC, view the following links:

Label-free quantification (LFQ)

LFQ is a cheap and natural method of quantifying molecules of interest. As the name suggests, no labeling of molecules is involved.

LFQ includes the following steps:

  1. Conduct replicate experiments.

  2. Generate LC-MS maps for each experiment.

  3. Find features in all LC-MS maps. A feature is a collection of peaks that belong to a chemical compound.

  4. Align maps to address shifts in retention times.

  5. Match corresponding features in different maps. We refer to this as grouping or linking.

  6. Identify feature groups, called consensus features.

  7. Quantify consensus features.

Video

For more information on LFQ, view this video. For more information on the steps involved in LFQ, view this video

Feature finding

Feature finding is a method for identifying all the peaks belonging to a chemical compound. Feature finding involves the following steps:

  1. Extension, where we collect all data points we think belong to the peptide.

  2. Refinement, where we remove peaks that we think do not belong to the peptide.

  3. Fitting an optimal model to the isolated peaks.

The above steps are iterative; we repeat these steps until no improvement can be made to the model.

OpenMS contains a number of feature finding algorithms.
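
As an illustration, the following pyOpenMS sketch runs the classic 'centroided' feature finder on a centroided peak map. The input file name input.mzML is a placeholder, and class and algorithm names may differ between OpenMS versions, so treat this as a sketch rather than a reference:

     import pyopenms as oms

     # Load a centroided peak map
     exp = oms.MSExperiment()
     oms.MzMLFile().load("input.mzML", exp)
     exp.updateRanges()

     # Run the 'centroided' feature finding algorithm with its default parameters
     ff = oms.FeatureFinder()
     features = oms.FeatureMap()
     seeds = oms.FeatureMap()  # no seed features
     params = oms.FeatureFinder().getParameters("centroided")
     ff.run("centroided", exp, features, params, seeds)

     features.setUniqueIds()
     print(features.size(), "features found")
     oms.FeatureXMLFile().store("output.featureXML", features)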

Video

For more information on feature finding, view this video.

Introduction

OpenMS offers a wide range of tools you can use for your mass spectrometry analysis. Users without programming experience or those who want to quickly run common pipelines and analyses on their data can check out our webapps.

Those who want to have more control over the settings in their pipeline can pick a pre-defined workflow or create a new one themselves with one of our supported workflow systems.

If you want to develop high-performance pipelines that process large amounts of data, you can use our TOPP tools on the command line or create a new Nextflow workflow.

Should you be missing something you can also add new functions or classes to the OpenMS C++ core library.

Users

Developers

WebApps

Run pre-built analysis applications in your browser.

Run FLASHTaggerViewer, NuXL or UmetaFlow directly in your browser or host them locally.

pyOpenMS

Use the pyOpenMS python library to rapidly prototype methods and scripts.

Quickly prototype new methods and scripts or interface with other prominent data science, machine learning or visualization libraries in Python.

Workflow Editor

Use a supported workflow editor to create or run predefined workflows.

Use applications such as KNIME, Nextflow, Galaxy or our tool TOPPAS, to apply predefined workflows or custom workflows you have designed on your data.

TOPP Tools

Use over 100 command-line tools to automate pre-defined tasks efficiently.

Automate tasks and create workflows that can be saved, stored and used on multiple datasets.

TOPPView

Use the OpenMS graphical user interface to inspect your results.

Visualize your mass spectrometry data in 1D, 2D and 3D with TOPPView.

OpenMS C++ core library

Develop your own efficient tools and methods with the OpenMS C++ core library.

Using the OpenMS C++ core library directly provides faster access to tools and shorter run-times.

WebApps

FLASHTaggerViewer

Visualizes outputs from FLASH* tools

NuXL

A specialized protein nucleic-acid crosslink search engine

UmetaFlow

A universal metabolomics tool

Workflow Editor

Which workflow environment to choose for running OpenMS tools?

You can run OpenMS TOPP tools from the command line using your custom scripts, or use powerful workflow systems designed to make workflow creation and maintenance more fun:

KNIME

Free, open source, desktop app. An analytics platform with workflow editor and a nice drag-and-drop user interface. Very interactive and has built-in nodes for related tasks like working with chemical structures, databases, machine learning, scripting. Distributed computing is best achieved with a KNIME server (License required) which also allows user management and a web interface to interact with workflows. In KNIME you can easily construct your own workflows or just download our ready-made creations for the most common analysis tasks.

Nextflow

Script/DSL-based workflow language, executor and utilities (such as the browser-based launcher and supervisor nf-tower). Automatically runs on various cloud (AWS, Google, …) and HPC environments (SLURM, LSF, Kubernetes, …). It is recommended to use our ready-made nf-core compatible workflows for ease of use via the browser-based configuration and launcher.

Galaxy

Server and browser-based interactive workflow editor and runner. A public server instance can be used for testing and smaller experiments. Provides nice guided tutorials.

TOPPAS

OpenMS’ built-in workflow system, with limited capabilities but easy to use and tailored to TOPP tools.

KNIME

Installation

Click here to install KNIME, the OpenMS plugin and its required packages.

Ready-made KNIME workflows

You can get ready-made KNIME workflows and workflow components with OpenMS nodes from our community hub. You can easily drag and drop workflows into your open KNIME Analytics Platform. For more (e.g., thirdparty OpenMS) workflows, use the search bar on the hub and search for “openms”.

Creating workflows with KNIME

Download the Introduction to OpenMS in KNIME user tutorial, which contains hands-on training material that also covers basic usage of KNIME. See the official KNIME Getting Started Guide for a more in-depth view of KNIME functionality beyond OpenMS.

If you face any issues, please contact us; specifically for the usage of OpenMS in KNIME, use the KNIME community contribution forum.

Click here to create your own minimal workflow.

KNIME - Installation

Installation of OpenMS in KNIME is platform-independent across Windows, macOS, and Linux.

  1. Download the latest KNIME release from the KNIME website.

  2. If you have the full install of KNIME, you can skip the following installation routine, since all required plugins should be installed by default. For the standard (core) installation, follow the instructions here or in the extended user tutorial.

  3. In KNIME click on Help > Install new Software.

  4. Install the required KNIME File Handling nodes from the official KNIME Update Site (a standard entry in the update sites). Choose the update site from the Work with: dropdown menu.

    Name: KNIME Analytics Platform 5.2.0 Update Site.

    Location: http://update.knime.org/analytics-platform/5.2.0

  5. Filter the results for File handling and select the KNIME File Handling Nodes. Click Next and install.

  6. Now, install the actual OpenMS plugin. Next to the Work with: dropdown menu, click on Add…. In the opening dialog fill in at least one of the following additional Update Sites (if not already present):

  7. Use the search or navigate to KNIME Community Contributions – Bioinformatics & NGS and select OpenMS. Then click Next and follow the installation instructions. A restart of KNIME might be necessary afterward. On Windows, you may be prompted to install additional requirements, such as the Microsoft Visual Studio Redistributable for the conversion software ProteoWizard that is packaged with our plugin.

  8. After a restart of KNIME the OpenMS nodes will be available in your Node Repository (panel on the lower left) under Community Nodes.

Minimal Workflow

Let us start with the creation of a simple workflow. As a first step, we will gather some basic information about the data set before starting the actual development of a data analysis workflow. This minimal workflow can also be used to check if all requirements are met and that your system is compatible.

  • Create a new workflow.

  • Add a File Importer node and an Output Folder node (found in Community Nodes > GenericKnimeNodes > IO) and a FileInfo node (found in the category Community Nodes > OpenMS > File Handling) to the workflow.

  • Connect the File Importer node to the FileInfo node, and the first output port of the FileInfo node to the Output Folder node.

Tip

In case you are unsure about which node port to use, hovering the cursor over the port in question will display the port name and what kind of input it expects.

The complete workflow is shown in below image. FileInfo can produce two different kinds of output files.

A minimal workflow calling FileInfo on a single file.

Figure 8: A minimal workflow calling FileInfo on a single file.

  • All nodes are still marked red, since we are missing an actual input file. Double-click the File Importer node and select Browse. In the file system browser, select Example_Data > Introduction > datasets > tiny > velos005614.mzML and click Open. Afterwards, close the dialog by clicking Ok.

  • The File Importer node and the FileInfo node should now have switched to yellow, but the Output Folder node is still red. Double-click on the Output Folder node and click on Browse to select an output directory for the generated data.

  • Great! Your first workflow is now ready to be run. Press Shift + F7 (or the button with multiple green triangles in the KNIME toolbar) to execute the complete workflow. You can also right-click on any node of your workflow and select Execute from the context menu.

  • The traffic lights tell you about the current status of all nodes in your workflow. Currently running tools show either a progress in percent or a moving blue bar, nodes waiting for data show the small word “queued”, and successfully executed ones become green. If something goes wrong (e.g., a tool crashes), the light will become red.

  • In order to inspect the results, you can just right-click the Output Folder node and select View: Open the output folder. You can then open the text file and inspect its contents. You will find some basic information about the data contained in the mzML file, e.g., the total number of spectra and peaks, the RT and m/z range, and how many MS1 and MS2 spectra the file contains.

Workflows are typically constructed to process a large number of files automatically. As a simple example, consider that you would like to filter multiple mzML files to only include MS1 spectra. We will now modify the workflow to process three different files and write the output files to a folder.

  • We start from the previous workflow.

  • First we need to replace our single input file with multiple files. Therefore we add the Input Files node from the category Community Nodes > GenericKnimeNodes > IO.

  • To select the files we double-click on the Input Files node and click on Add. In the filesystem browser we select all three files from the directory Example_Data > Introduction > datasets > tiny. And close the dialog with Ok.

  • We now add two more nodes: the ZipLoopStart and the ZipLoopEnd node from the category Community Nodes > GenericKnimeNodes > Flow, and replace the FileInfo node with FileFilter from Community Nodes > OpenMS > File Handling.

  • Afterwards we connect the Input Files node to the first port of the ZipLoopStart node, the first port of the ZipLoopStart node to the FileFilter node, the first output port of the FileFilter node to the first input port of the ZipLoopEnd node, and the first output port of the ZipLoopEnd node to the Output Folder node.

The complete workflow is shown in the top right of the figure below.

A minimal workflow calling the FileFilter on multiple mzML files in a loop

Figure 9: The FileFilter workflow. Showing the configure dialog for FileFilter, and the level selector pane.

Now we need to configure the FileFilter to only store MS1 data. To do this we double click on the FileFilter node to open the configuration dialog (see left pane above), double click “level”, select 2 from the sub-pane (see bottom right panel above), and click delete. Repeat the process for 3. Select OK to exit the sub-pane, and then OK again in the configuration dialog.

Execute the workflow and inspect the output as before.

Now, if you open the resulting files in TOPPView, you can see that only the MS1 spectra remain.

In case you had trouble understanding what ZipLoopStart and ZipLoopEnd do, here is a brief explanation:

  • The Input Files node passes a list of files to the ZipLoopStart node.

  • The ZipLoopStart node takes the files as input, but passes the single files sequentially (that is: one after the other) to the next node.

  • The ZipLoopEnd collects the single files that arrive at its input port. After all files have been processed, the collected files are passed again as file list to the next node that follows.

Advanced topic: Metanodes

Workflows can get rather complex and may contain dozens or even hundreds of nodes. KNIME provides a simple way to improve handling and clarity of large workflows:

Metanodes allow you to bundle several nodes into a single Metanode.

Task

Select multiple nodes (e.g. all nodes of the ZipLoop including the start and end node). To select a set of nodes, draw a rectangle around them with the left mouse button or hold Ctrl to add/remove single nodes from the selection.

Tip

There is a Select Scope option when you right-click a node in a loop, that does exactly that for you. Then, open the context menu (right-click on a node in the selection) and select Create Metanode. Enter a caption for the Metanode. The previously selected nodes are now contained in the Metanode. Double-clicking on the Metanode will display the contained nodes in a new tab window.

Task

Convert the Metanode so that it behaves like an encapsulated single node. First select the Metanode, open the context menu (right-click) and select Metanode > Convert to Component. The differences between Metanodes and Components are marginal (Components allow exposing user inputs, workflow variables and contained nodes). Therefore, we suggest using standard Metanodes to clean up your workflow and cluster common subparts until you actually notice their limits.

Task

Undo the packaging. First select the Metanode/Component, open the context menu (right-click) and select Metanode/Component > Expand.

NextFlow

Nextflow is a workflow system for creating scalable, portable, and reproducible workflows. It is based on the dataflow programming model, which greatly simplifies the writing of parallel and distributed pipelines, allowing you to focus on the flow of data and computation. Nextflow can deploy workflows on a variety of execution platforms, including your local machine, HPC schedulers, AWS Batch, Azure Batch, Google Cloud Batch, and Kubernetes. Additionally, it supports many ways to manage your software dependencies, including Conda, Spack, Docker, Podman, Singularity, and more.[1]

Installation

Click here to install Nextflow only. Alternatively click here to follow the instructions for using nf-core curated pipelines in Nextflow.

Ready-made OpenMS nextflow workflows

SCALABLE NF-CORE COMPATIBLE NEXTFLOW PIPELINES

Click on “Launch” to configure the pipeline for your data online and launch it via nextflow’s tower app (by registering a compute environment there) or by copying a configuration token for your local computer or HPC head node.


  • quantms: https://nf-co.re/launch?pipeline=quantms

  • mhcquant: https://nf-co.re/launch?pipeline=mhcquant

  • diaproteomics: https://nf-co.re/launch?pipeline=diaproteomics

References

Galaxy

Galaxy is an open-source web platform designed for processing and analyzing large quantities of biomedical data.

TOPP tools have been integrated into Galaxy to facilitate the creation and execution of workflows.

To use TOPP tools on Galaxy:

  1. Go to the website.

  2. Create an account.

  3. Go to Tools on the far left and scroll down.

  4. Search for “OpenMS”.

  5. You will see a list of TOPP tools.

topp tool list

Choose one of the TOPP tools from the list. You will be able to run it in isolation or use it to create a workflow.

TOPPAS

TOPPAS is the built-in workflow editor of OpenMS. All TOPP tools can be chained, configured and executed in it.

TOPPAS workflows run on the local machine where TOPPAS is executed and thus only scale according to the hardware at hand. No automatic distribution across a cluster is supported. Also, external tools (anything other than OpenMS TOPP tools) can only be called using a special ‘GenericWrapper’ node. We generally recommend running only OpenMS-specific tools in TOPPAS and handing the resulting data to other tools, or using other workflow systems, such as KNIME, which can fully integrate other tools.

The strong point of TOPPAS is that it ships natively with the OpenMS graphical and command-line tools. It also has a very shallow learning curve, making it very intuitive to create workflows.

See the TOPPAS tutorial for more details.

TOPPView

Introduction

TOPPView is a graphical application for inspecting and visualizing MS and HPLC-MS data. It can be used to inspect files in mzML, mzData, mzXML and several other text-based file formats.

In each view, several datasets can be displayed simultaneously using the layer concept. This allows visual comparison of several datasets as well as displaying input data and output data of an algorithm together.

TOPPView is intended for visual inspection of the data by experimentalists as well as by developers of analysis software.

User interface

The following image illustrates different components of TOPPView’s user interface.

toppview user interface

Components include:

  • Display modes and options

  • A Viewer that displays visual data.

  • Layers

    You can visualize several datasets by creating layers. In the Layers window (dock window in the upper right corner), layers can be hidden and shown using the check box in front of each layer name.

  • Data filtering window

  • Log window to view textual output of applying TOPP tools to data.

  • Views windows that show tabulated information about the dataset.

Import a file

To import data into TOPPView:

  1. Go to File > Open file.

  2. Choose a file from the file importer and click Open.

  3. Select options from the following panel and click Ok:

file import options

You can choose to open the new dataset as a new window or new layer. Choosing new window will open a new tab. If you are planning on comparing multiple datasets and want to view them all at once, choose new layer.

TOPPView automatically selects the Map view depending on the data you have imported.

You should be able to see your data in the Viewer.

Process data in TOPPView

The following section details how to apply TOPP tools and filter data in TOPPView.

Apply TOPP tool

OpenMS provides a number of TOPP tools that can be applied to your data.

To apply a TOPP tool to your dataset:

  1. Select a layer in the Layers window. The selected layer will be highlighted blue.

display selected layer

  2. Go to Tools > Apply TOPP tool to whole layer. This will open a panel to select and configure your TOPP tool.

  3. Select a TOPP tool from the dropdown menu. A description of the TOPP tool will be displayed on the right. You may also have to specify the input argument, though TOPPView might select this option for you automatically. To save the output to a file, specify the output argument.

display selected layer

  4. Specify the TOPP tool parameters by either:

    1. Loading an INI file by clicking Load and selecting an INI File from the file importer.
    2. Editing the parameters shown in the table and then saving the INI file. To edit a parameter, double click a row in the table and enter a value or choose from the options available. The modified value will be highlighted yellow. To save the parameters, click Store and enter a file name for the INI file.

    topp tool parameters

  5. Click Ok. You will be prompted to load the new dataset as a new window or a new layer. Choose an option and click Ok.

data import options

  6. If you chose to load the data in a new window, a new tab will appear. To view that data, select the tab. If you chose to load the data as a new layer, the data will be visualized in the Viewer. You can also see the new layer without a name in the Layers window.

layer loaded in Viewer

  7. (Optional) If you chose to import the data as a new layer, give the new layer a name. To do this, right-click the layer in the Layers window and select Rename. Enter a name and click OK.

Filter data

You may only want to see some data from your dataset and hide the rest. OpenMS allows you to filter data based on the following fields: Intensity, Quality, Charge, Size and Meta data.

To filter your data:

  1. Select a layer from the Layers window.

display selected layer

  2. Open the Data filters window by clicking the tab at the bottom of the screen.

select data filters window

  3. Add a filter to the Data filters window by right-clicking the window and then selecting Add filter from the context menu.

  4. Select a field, select an operation and enter a value. For example, to show only peaks with an intensity of at least 7000, set field to Intensity, operation to >= and the value to 7000. Click Ok on the panel to apply the changes.

filtering options

  5. You should now only see data that satisfies the specified criteria.

Additional topics

You might want to check out the following topics:

Views in TOPPView

TOPPView offers three types of views – a 1D view for spectra, a 2D view for peak maps and feature maps, and a 3D view for peak maps. All three views can be freely configured to suit the individual needs of the user.

Action Modes and Their Uses

All three views share a similar interface. Three action modes are supported – one for translation, one for zooming and one for measuring:

  • Translate mode

    • It is activated by default

    • Move the mouse while holding the mouse button down to translate the current view

    • Arrow keys can be used to translate the view without entering translate mode (in 1D-View you can additionally use Shift to jump to the next peak)

  • Zoom mode

    • All previous zoom levels are stored in a zoom history. The zoom history can be traversed using CTRL + +/CTRL + - or the mouse wheel (scroll up and down)

    • Zooming into the data:

      • Mark an area in the current view with your mouse, while holding the left mouse button plus the CTRL key to zoom to this area.

      • You can also use your mouse wheel to traverse the zoom history.

      • If you have reached the end of the zoom history and keep pressing CTRL + + or scrolling up, the current area will be enlarged by a factor of 1.25.

    • Pressing Backspace resets the zoom and zoom history.

  • Measure mode

    • It is activated using SHIFT.

    • Press the left mouse button down while a peak is selected and drag the mouse to another peak to measure the distance between peaks.

    • This mode is implemented in the 1D and 2D views.

1D View

The 1D view is used to display raw spectra or peak spectra. Raw data is displayed using a continuous line. Peak data is displayed using one stick per peak. The color used for drawing the lines can be set for each layer individually. The 1D view offers a mirror mode, where the window is vertically divided in halves and individual layers can be displayed either above or below the “mirror” axis in order to facilitate quick visual comparison of spectra. When a mirror view is active, it is possible to perform a spectrum alignment of a spectrum in the upper and one in the lower half, respectively. Moreover, spectra can be annotated manually. Currently, distance annotations between peaks, peak annotations and simple text labels are provided.

The following example image shows a 1D view in mirror mode. A theoretical spectrum (lower half) has been generated using the theoretical spectrum generator (Tools > Generate theoretical spectrum). The mirror mode has been activated by right-clicking the layer containing the theoretical spectrum and selecting Flip downward from the layer context menu. A spectrum alignment between the two spectra has been performed (Tools > Align spectra). It is visualized by the red lines connecting aligned peaks and can be reset through the context menu. Moreover, in the example, several distances between abundant peaks have been measured and subsequently replaced by their corresponding amino acid residue code. This is done by right-clicking a distance annotation and selecting Edit from the context menu. Additionally, peak annotations and text labels have been added by right-clicking peaks and selecting Add peak annotation or by right clicking anywhere and selecting Add Label, respectively. Multiple annotations can be selected by holding down the CTRL key while clicking them. They can be moved around by dragging the mouse and deleted by pressing DEL.

TOPPView 1D

Through the context menu of the 1D view you can:

  1. View/edit meta data.

  2. Save the current layer data.

  3. Change display settings.

  4. Add peak annotations or arbitrary text labels.

  5. Reset a performed alignment.

2D View

The 2D view is used to display peak maps and feature maps in a top-down view with color-coded intensities. Peaks and feature centroids are displayed as dots. For features, also the overall convex hull and the convex hulls of individual mass traces can be displayed. The color gradient used to encode for peak and feature intensities can be set for each layer individually.

The following example image shows a small section of a peak map and the detected features in a second layer.

Plot 2D Widget

In addition to the normal top-down view, the 2D view can display the projections of the data to the m/z and RT axis. This feature is mainly used to assess the quality of a feature without opening the data region in 3D view.

Through the context menu of the 2D view you can:

  1. View/edit meta data

  2. View survey/fragment scans in 1D view

  3. View survey/fragment scans meta data

  4. View the currently selected area in 3D view

  5. Save the current layer data

  6. Change display settings

3D View

The 3D view can only display peak maps. Its primary use is the closer inspection of a small region of the map, e.g. a single feature. In the 3D view slight intensity differences are easier to recognize than in the 2D view. The color gradient used to encode peak intensities, the width of the lines and the coloring mode of the peaks can be set for each layer individually.

The following example image shows a small region of a peak map:

Plot 3D Widget

Through the context menu of the 3D view you can:

  1. View/edit meta data.

  2. Save the current layer data.

  3. Change display settings.

Display Modes and View Options in TOPPView

All of the views support several display modes and view options. Display modes determine how intensities are displayed. View options configure the view.

TOPPView Icons

Display Modes

Intensity display modes determine the way peak intensities are displayed.

Linear

Normal display mode.

Percentage

In this display mode the intensities of each dataset are normalized with the maximum intensity of the dataset. This is especially useful to visualize several datasets that have large intensity differences. If only one dataset is open, this mode corresponds to the linear mode.

Snap to Maximum Intensity

In this mode the maximum currently displayed intensity is treated as if it was the maximum overall intensity.

View Options

View options configure the view of the current layer.

1D

Switching between raw data and peak mode.

2D (Peaks)

MS/MS precursor peaks can be highlighted. Projections to m/z and RT axis can be shown.

2D (Features)

Overall convex hull of the features can be displayed. Convex hulls of mass traces can be displayed (if available). A feature identifier, consisting of the feature index and an optional label can be displayed.

2D (Consensus)

The elements of a consensus feature can be displayed.

3D

Currently there are no options for 3D view.

Data Analysis in TOPPView

TOPPView also offers limited data analysis capabilities for single layers, which will be illustrated in the following sections. The functionality presented here can be found in the Tools menu:

TOPPView Tools Menu

TOPP Tools

Single TOPP tools can be applied to the data of the currently selected layer or to the visible data of the current layer. The following example image shows the TOPP tools dialog:

TOPPView Tools

To apply a TOPP tool, follow the instructions below:

  1. Select a TOPP tool and if necessary a type.

  2. Specify the command line option of the tool that takes the input file name.

  3. Specify the command line option of the tool that takes the output file name.

  4. Set the algorithm parameters manually or load them from an INI file.

Metadata

One can access the metadata the layer is annotated with. This data comprises e.g. contact person, instrument description and sample description.

Meta Data Browser

Tip

Identification data, e.g. from a Mascot run, can be annotated to the spectra or features, too. After annotation, this data is listed in the metadata.

Statistics

Statistics about peak/feature intensities and peak meta information can be displayed. For intensities, it is possible to display an additional histogram view.

TOPPView Statistics

Data Editing in TOPPView

TOPPView offers editing functionality for feature layers.

After enabling the feature editing mode in the context menu of the feature layer, the following actions can be performed:

  • Features can be dragged with the mouse in order to change the m/z and RT position.

  • The position, intensity and charge of a feature can be edited by double-clicking a feature.

  • Features can be created by double-clicking the layer background.

  • Features can be removed by selecting them and pressing the DEL key.

TOPPView Hotkeys
File handling

  • CTRL + O: Open file

  • CTRL + W: Close current window

  • CTRL + S: Save current layer

  • CTRL + SHIFT + S: Save visible data of current layer

Visualization options

  • CTRL + R: Show/hide grid lines

  • CTRL + L: Show/hide axis legends

  • N: Intensity mode: Normal

  • P: Intensity mode: Percentage

  • S: Intensity mode: Snap-to-maximum

  • I: 1D draw mode: peaks

  • R: 1D draw mode: raw data

  • CTRL + ALT + Home: 2D draw mode: increase minimum canvas coverage threshold (for raw peak scaling)

  • CTRL + ALT + End: 2D draw mode: decrease minimum canvas coverage threshold (for raw peak scaling)

  • CTRL + ALT + +: 2D draw mode: increase maximum point size (for raw peak scaling)

  • CTRL + ALT + -: 2D draw mode: decrease maximum point size (for raw peak scaling)

Tip

Home on macOS keyboards is also Fn + ArrowLeft. End on macOS keyboards is also Fn + ArrowRight.

Annotations in 1D view

  • CTRL + B: Select all annotations of the current layer

  • DEL: Delete all currently selected annotations

Advanced

  • CTRL + T: Apply TOPP tool to the current layer

  • CTRL + SHIFT + T: Apply TOPP tool to the visible data of the current layer

  • F4: Re-run TOPP tool

  • CTRL + M: Show layer meta information

  • CTRL + I: Annotate with identification results

  • 1: Show precursor peaks (2D peak layer)

  • 2: Show projections (2D peak layer)

  • 5: Show overall convex hull (2D feature layer)

  • 6: Show all convex hulls (2D feature layer)

  • 7: Show numbers and labels (2D feature layer)

  • 9: Show consensus elements (2D consensus layer)

Help

  • F1: Show TOPPView online tutorial

  • SHIFT + F1: Activate What’s this? mode

TOPP Tools

TOPP - The OpenMS Pipeline is a set of tools for the analysis of HPLC-MS data. These tools can be:

  • Executed from the command line,

  • Applied individually using OpenMS graphical applications, or

  • Applied in sequence as a workflow using a workflow editor such as KNIME, Nextflow or Galaxy.

Before you choose one of the above options, there are a few concepts that need to be understood.

File formats

OpenMS only accepts files in certain formats, including but not limited to:

  • mzML: The HUPO-PSI standard format for mass spectrometry data.

  • featureXML: The OpenMS format for quantitation results.

  • consensusXML: The OpenMS format for grouping features in one map or across several maps.

  • idXML: The OpenMS format for protein and peptide identification.

Documented schemas of the OpenMS formats can be found here.

If your data is not in the above formats, you may need to use a file conversion TOPP tool.

Command Line Interface

Command line calls will depend on the TOPP tools used, as each TOPP tool has its own set of parameters. However, the following arguments are typically used:

  • -in

    Specify an input file in the command line using the -in argument. The input file should be in a supported format. If not, use the file converter to convert the file to one of the supported formats. For more information, view the file handling documentation.

  • -out

    Specify an output file in the command line using the -out argument.

  • -ini

    Specify an INI file in the command line using the -ini argument. TOPP uses INI files to set parameters specific to the command line tool being called.

  • -write_ini

    Create an INI file using the -write_ini argument: <insert TOPP tool> -write_ini <insert output INI File> (see the example call further below). If you want a visual tool to assist with setting parameters, use the INIFileEditor, an application provided when you download OpenMS. Otherwise, you can set the parameters from the command line.

  • -help

    Get information about basic options related to the tool using the -help parameter. For more advanced options (algorithmic parameters), use --help.

  • --help

    Get detailed information about algorithmic parameters using the --help parameter.

Many (but not all) command line calls will have the following structure:

<insert TOPP tool> -in <insert input mzML file> -out <insert output mzML file> -ini <insert INI file>

The following command line call uses the FileFilter tool to extract data from an mzML file. Note that this call directly specifies the tool-specific parameters and doesn’t rely on an INI file:

break down of example command line call
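For illustration, such calls might look as follows (a minimal sketch; the retention time and m/z ranges are arbitrary example values):

FileFilter -in infile.mzML -rt 0:1200 -mz 700:1000 -out outfile.mzML

# alternatively, generate an INI file, edit it (e.g. with the INIFileEditor) and pass it via -ini
FileFilter -write_ini filefilter.ini
FileFilter -in infile.mzML -out outfile.mzML -ini filefilter.ini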

TOPP INI files

TOPP INI files are XML-based files with an .ini extension. OpenMS uses TOPP INI files to set parameters for one or more TOPP tools. Here is an example of a TOPP INI file:

<PARAMETERS>

<NODE name="FileFilter">
  <NODE name="1">
    <ITEM name="rt" value="0:1200" type="string"/>
  </NODE>
  <NODE name="2">
    <ITEM name="mz" value="700:1000" type="string"/>
  </NODE>
</NODE>

<NODE name="common">
  <NODE name="FileFilter">
    <ITEM name="rt" value=":" type="string"/>
    <ITEM name="mz" value=":" type="string"/>
  </NODE>
  <ITEM name="debug" value="2" type="int"/>
</NODE>

</PARAMETERS>

Features, feature maps and featureXML files

An LC-MS feature is a construct in OpenMS that is used to describe a 2D peak caused by an analyte interacting with the stationary phase. Each feature contains the following metadata: an id, retention time, mass-to-charge ratio, intensity, overall quality and one or more convex hulls.

A feature map is a container for features. One feature map can contain many features.

A featureXML file is an XML based file which contains one feature map.

FeatureXML files can be created from mzML files using OpenMS’s feature detection algorithms.
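For example, a minimal call of one of the feature detection tools described later in this section (the file names are placeholders):

FeatureFinderCentroided -in sample.mzML -out sample.featureXML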

Consensus feature, consensus maps, consensusXML files

A consensus feature is a special type of LC-MS feature that is quantified across multiple experiments. A consensus feature is formed by linking or grouping features with similar mass-to-charge ratios and retention times from various experiment runs. Each consensus feature references the features used to form it.

Similar to a feature map, a consensus map is a container for consensus features. One consensus map can contain many consensus features.

ConsensusXML files can be created from featureXML files using OpenMS’s feature grouping algorithms.
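For example, a minimal call of one of the feature grouping tools (a sketch; FeatureLinkerUnlabeledQT is one of several FeatureLinker variants, and the file names are placeholders):

FeatureLinkerUnlabeledQT -in runA.featureXML runB.featureXML -out linked.consensusXML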

Types of TOPP Tools

The following tools are offered:

  • File conversion

    TOPP file conversion tools can be used to convert files into a supported format.

  • File handling

    TOPP file handling tools are largely used to extract or merge data. For more information, view the File Handling section.

  • Centroiding

    The conversion of the “raw” ion count data acquired by the machine into peak lists for further processing is usually called peak picking or centroiding. OpenMS provides different centroiding algorithms; the choice of algorithm should mainly depend on the resolution of the data. For more information, view the Picking Peaks section.

  • Spectrum processing

    A number of spectrum processing tools are available. These include peak filtering and peak normalization tools, as well as other miscellaneous tools.

  • Mass correction and calibration

    To ensure that your data is sound, OpenMS provides a number of mass correction and calibration tools. The tools used will depend on the type of equipment you have employed. For more information, view the Calibration section.

  • Spectrum clustering

    Spectrum clustering is the grouping of spectra that have many peaks in common. OpenMS provides tools for spectrum clustering to identify molecules in large datasets more efficiently.

  • Map alignment

    When looking to identify molecules, it is common to run multiple experiments, where each experiment produces a set of data. In OpenMS, every set of data is represented by a feature map. Before combining feature maps to create a consensus map, it is advised to use OpenMS’s map alignment tools so that all your datasets are comparable and based on a common retention time axis. For more information, view the Map alignment section.

  • Feature linking

    OpenMS provides a number of algorithms for feature grouping or linking. For more information, view the Feature grouping section.

  • Quantitation

    A number of tools are available that allow for the identification and quantification of features. The tools you use will depend on the type of mass spectrometry experiment you have set up, and the type of molecules you wish to identify. For more information, view the Feature detection section.

  • Protein/Peptide identification

  • Protein/Peptide processing

  • Targeted experiments and OpenSWATH

  • Cross-linking

    Cross-linking is a technique where substances are chemically treated to create covalent bonds between different molecules. The strength of the covalent bonds can be quantified to indicate the proximity of certain molecules within a 3D structure.

  • Quality control

    OpenMS provides tools to measure the quality of LC-MS data. For more information, view the Quality control section.

For the full list of TOPP tools, visit the API reference website.

File Handling
General information about peak and feature maps

For general information about a peak or feature map, use the FileInfo tool.

  • It can print RT, m/z and intensity ranges, the overall number of peaks, and the distribution of MS levels.

  • It can print a statistical summary of intensities.

  • It can print some meta information.

  • It can validate XML files against their schema.

  • It can check for corrupt data in peak files. See FileInfo --help for details.

Problems with input files

If you are experiencing problems while processing an XML file, check if the file does validate against the XML schema:

FileInfo -v -in infile.mzML

Validation is available for several file formats including mzML, mzData, mzXML, featureXML and idXML.

Another frequently-occurring problem is corrupt data. You can check for corrupt data in peak files with FileInfo as well:

FileInfo -c -in infile.mzML

Converting your files to mzML

The TOPP tools work only on the HUPO-PSI mzML format. If you need to convert mzData, mzXML or ANDI/MS data to mzML, use the FileConverter, e.g.

FileConverter -in infile.mzXML -out outfile.mzML

If the file extension corresponds to a format name, the tool derives the format from the extension. For other extensions, the file formats of the input and output files can be given explicitly.
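For example, if the files carry non-standard extensions, the formats can be stated explicitly (a sketch assuming the -in_type and -out_type options of FileConverter):

FileConverter -in spectra.dat -in_type mzXML -out converted.dat -out_type mzML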

Converting between DTA and mzML

Sequest DTA files can be extracted from a mzML file using the DTAExtractor:

DTAExtractor -in infile.mzML -out outfile

The retention time of a scan, the precursor mass-to-charge ratio (for MS/MS scans) and the file extension are appended to the output file name.

To combine several files (e.g. DTA files) to an mzML file use the FileMerger:

FileMerger -in infile_list.txt -out outfile.mzML

The retention times of the scans can be generated, taken from the infile_list.txt or can be extracted from the DTA file names. See the FileMerger documentation for details.

Extracting part of the data from a file

To extract part of the data from an mzML file, use the FileFilter tool. It allows filtering for RT, m/z and intensity range or for MS level. To extract the MS/MS scans between retention time 100 and 1500, use the following command:

FileFilter -in infile.mzML -levels 2 -rt 100:1500 -out outfile.mzML

Conversion Between OpenMS XML Formats and Text Formats
Export of OpenMS XML formats

As TOPP offers no functionality for statistical analysis, this step is normally done using external statistics packages. In order to export the OpenMS XML formats into an appropriate format for these packages the TOPP TextExporter can be used.

It converts the following OpenMS XML formats to text files:

  • featureXML

  • idXML

  • consensusXML

Using the TextExporter is very simple:

TextExporter -in infile.idXML -out outfile.txt

Import of feature data to OpenMS

OpenMS offers a lot of visualization and analysis functionality for feature data. Feature data in text format, e.g. from other analysis tools, can be imported using the TextImporter. The default mode accepts comma separated values containing the following columns: RT, m/z, intensity. Additionally meta data columns may follow. If meta data is used, meta data column names have to be specified in a header line. Without headers:

1201	503.123	1435000
1201	1006.246	1235200

Or with headers:

RT	m/z	Int	isHeavy	myMeta
1201	503.123	1435000	true	2
1201	1006.246	1235200	maybe	1

Example invocation:

TextImporter -in infile.txt -out outfile.featureXML

The tool also supports data from msInspect, SpecArray and Kroenik (Hardkloer sibling); just specify the -mode option accordingly.

Import of protein/peptide identification data to OpenMS

Peptide/protein identification data from several identification engines can be converted to idXML format using the IDFileConverter tool.

It can currently read the following formats:

  • Sequest output folder

  • pepXML file

  • idXML file

It can currently write the following formats:

  • pepXML

  • idXML

This example shows how to convert pepXML to idXML:

IDFileConverter -in infile.pepXML -out outfile.idXML

Picking Peaks

For low resolution data, consider smoothing the data first (Smoothing raw data) and subtracting the baseline (Subtracting a baseline from a spectrum) before peak picking.

There are two types of PeakPickers: the PeakPickerWavelet and one especially suited for high resolution data (PeakPickerHiRes). This tutorial explains the PeakPickerWavelet. Use the file peakpicker_tutorial_2.mzML from the examples data (select File > Open example data).

The main parameters are the peak width and the minimal signal-to-noise ratio for a peak to be picked. If you don’t know the approximate fwhm of your peaks, use the estimation included in the PeakPickerWavelet by setting the flag estimate_peak_width to true. After applying the PeakPickerWavelet, check the log window to see which peak width was estimated and used for peak picking.
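For reference, a command-line sketch of such a call might look as follows (this assumes the flag is exposed under the tool’s algorithm: parameter prefix; check the tool’s --help output or INI file for the exact parameter path):

PeakPickerWavelet -in peakpicker_tutorial_2.mzML -out picked.mzML -algorithm:estimate_peak_width true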

To estimate the peak width yourself, use the measuring tool to determine the fwhm of one or several representative peaks.

If the peak picker delivers only a few peaks even though the peak_width and signal_to_noise parameters are set to good values, consider changing the advanced parameter fwhm_lower_bound_factor to a lower value. All peaks with a fwhm lower than fwhm_lower_bound_factor * peak_width are discarded.

The following image shows a part of the spectrum with the picked peaks shown in green, the estimated peak width in the log window and the measured peak width.

TOPPView tools pppicked

Calibration

OpenMS offers two calibration methods: an internal and an external calibration. Both can handle peak data as well as profile data. To calibrate profile data, a peak picking step is necessary; the important parameters can be set via the INI file. If you have already picked data, don’t forget the -peak_data flag.

The external calibration (TOFCalibration) is used to convert flight times into m/z values with the help of external calibrant spectra containing e.g. a polymer like polylysine. For the calibrant spectra, the calibration constants the machine uses need to be known as well as the expected masses. Then a quadratic function is fitted to convert the flight times into m/z-values.

The internal calibration (InternalCalibration) uses reference masses in the spectra to correct the m/z-values using a linear function.

In a typical setting one would first pick the TOF-data, then perform the TOFCalibration and then the InternalCalibration:

PeakPickerWavelet -in raw_tof.mzML -out picked_tof.mzML -ini pp.ini
TOFCalibration -in picked_tof.mzML -out picked.mzML -ext_calibrants ext_cal.mzML
               -ref_masses ext_cal_masses
               -tof_const tof_conv_consts -peak_data
InternalCalibration -in picked.mzML -out picked_calibrated.mzML
                    -ref_masses internal_calibrant_masses -peak_data
Map Alignment
Map alignment

The goal of map alignment is to transform different HPLC-MS maps (or derived maps) to a common retention time axis. It corrects for shifted and scaled retention times, which may result from changes of the chromatography.

The different MapAligner tools take n input maps, de-warp them and store the n de-warped maps. The following image shows the general procedure:

TOPP Alignment

There are different map alignment tools available. The following list gives a rough overview of them:

  • MapAlignerPoseClustering (applicable to: feature maps, peak maps)

    This algorithm does a star-wise alignment of the input data. The center of the star is the map with the most data points. All other maps are then aligned to the center map by estimating a linear transformation (shift and scaling) of retention times. The transformation is estimated using a pose clustering approach as described in doi:10.1093/bioinformatics/btm209.

  • MapAlignerIdentification (applicable to: feature maps, consensus maps, identifications)

    This algorithm utilizes peptide identifications, and is thus applicable to files containing peptide IDs (idXML, annotated featureXML/consensusXML). It finds peptide sequences that different input files have in common and uses them as points of correspondence. From the retention times of these peptides, transformations are computed that convert each file to a consensus time scale.

  • MapAlignerSpectrum (applicable to: peak maps)

    This experimental algorithm uses a dynamic-programming approach based on spectrum similarity for the alignment. The resulting retention time mapping is then smoothed by fitting a spline to the retention time pairs.

  • MapRTTransformer (applicable to: peak maps, feature maps, consensus maps, identifications)

    This algorithm merely applies a set of transformations that are read from files (in TransformationXML format). These transformations might have been generated by a previous invocation of a MapAligner tool. For example, compute a transformation based on identifications and then apply it to the features or raw data. The transformation file format is not very complicated, so it is relatively easy to write (or generate) the transformation files.
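For example, to compute retention time transformations from identification data and then apply them to the corresponding raw files, a sketch along the following lines could be used (a hedged example; it assumes the -trafo_out and -trafo_in options of the two tools, and the file names are placeholders):

MapAlignerIdentification -in runA.idXML runB.idXML -out runA_aligned.idXML runB_aligned.idXML -trafo_out runA.trafoXML runB.trafoXML
MapRTTransformer -in runA.mzML -out runA_aligned.mzML -trafo_in runA.trafoXML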

Feature Grouping

In order to quantify differences across maps (label-free) or within a map (isotope-labeled), groups of corresponding features have to be found. The FeatureLinker TOPP tools support both approaches. These groups are represented by consensus features, which contain information about the constituting features in the maps as well as average position, intensity, and charge.

Isotope-labeled quantitation

To differentially quantify the features of an isotope-labeled HPLC-MS map, follow the listed steps:

  1. The first step in this pipeline is to find the features of the HPLC-MS map. The FeatureFinder applications calculate the features from profile data or centroided data.

  2. In the second step, the labeled pairs (e.g. light/heavy labels of ICAT) are determined by the FeatureLinkerLabeled application. FeatureLinkerLabeled first determines all possible pairs according to a given optimal shift and deviations in RT and m/z. Then it resolves ambiguous pairs using a greedy-algorithm that prefers pairs with a higher score. The score of a pair is the product of:

    • feature quality of feature 1

    • feature quality of feature 2

    • quality measure for the shift (how near is it to the optimal shift)

    TOPP labeled quant
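A minimal command-line sketch of these two steps might look as follows (the file names are placeholders; FeatureLinkerLabeled determines the labeled pairs within a single feature map, assuming its default -in/-out interface):

FeatureFinderCentroided -in labeled_run.mzML -out labeled_run.featureXML
FeatureLinkerLabeled -in labeled_run.featureXML -out labeled_run.consensusXML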

Label-free quantitation

To differentially quantify the features of two or more label-free HPLC-MS maps, features are first detected in each map and corresponding features are then linked across maps (a command-line sketch is given after the tip below).

TOPP labelfree quant

Tip

This algorithm assumes that the retention time axes of all input maps are very similar. To correct for retention time distortions, please have a look at Map alignment.
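The following is a minimal sketch of such a label-free pipeline, combining feature detection, map alignment and feature linking (the tool choices and file names are illustrative placeholders, not the only possible combination):

FeatureFinderCentroided -in run1.mzML -out run1.featureXML
FeatureFinderCentroided -in run2.mzML -out run2.featureXML
MapAlignerPoseClustering -in run1.featureXML run2.featureXML -out run1_aligned.featureXML run2_aligned.featureXML
FeatureLinkerUnlabeledQT -in run1_aligned.featureXML run2_aligned.featureXML -out linked.consensusXML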

Feature Detection

For quantitation, the FeatureFinder tools are used. They extract the features from profile data or centroided data. TOPP offers different types of FeatureFinders:

FeatureFinderIsotopeWavelet
Description

The algorithm has been designed to detect features in raw MS data sets. The current implementation can only handle MS1 data; an extension that also handles tandem MS spectra is under development. The method is based on the isotope wavelet, which has been tailored to the detection of isotopic patterns following the averagine model. For more information about the theory behind this technique, please refer to Hussong et al.: “Efficient Analysis of Mass Spectrometry Data Using the Isotope Wavelet” (2007).

Attention

This algorithm features no “modelling stage”, since the structure of the isotopic pattern is explicitly coded by the wavelet itself. The algorithm also works for 2D maps (in combination with the so-called sweep-line technique (Schulz-Trieglaff et al.: “A Fast and Accurate Algorithm for the Quantification of Peptides from Mass Spectrometry Data” (2007))). The algorithm could originally be executed on (several) high-speed CUDA graphics cards. Tests on real-world data sets revealed potential speedups beyond factors of 200 (using 2 NVIDIA Tesla cards in parallel). Support for CUDA was removed in OpenMS due to maintenance overhead. Please refer to Hussong et al.: “Highly accelerated feature detection in proteomics data sets using modern graphics processing units” (2009) for more details on the implementation.

Seeding

Identification of regions of interest by convolving the signal with the wavelet function. A score, measuring the closeness of the transform to a theoretically determined output function, finally distinguishes potential features from noise.

Extension

The extension is based on the sweep-line paradigm and is done on the fly after the wavelet transform.

Modelling

None (explicitly done by the wavelet).

FeatureFinderCentroided
Description

This is an algorithm for feature detection based on peak data. In contrast to the other algorithms, it is based on peak/stick data, which makes it applicable even if no profile data is available. Another advantage is its speed due to the reduced amount of data after peak picking.

Seeding

It identifies interesting regions by calculating a score for each peak based on

  • the significance of the intensity in the local environment.

  • RT dimension: the quality of the mass trace in a local RT window.

  • m/z dimension: the quality of fit to an averagine isotope model.

Extension

The extension is based on heuristics: the average slope of the mass trace in the RT dimension, and the best fit to the averagine model in the m/z dimension.

Modelling

In model fitting, the retention time profile (Gaussian) of all mass traces is fitted to the data at the same time. After fitting, the data is truncated in RT and m/z dimension. The reported feature intensity is based on the fitted model, rather than on the (noisy) data.

Example

For this example the file LCMS-centroided.mzML from the examples data is used (File > Open example data). In order to adapt the algorithm to the data, some parameters have to be set.

Intensity

The algorithm estimates the significance of peak intensities in a local environment. Therefore, the HPLC-MS map is divided into n times n regions. Set the intensity:bins parameter to 10 for the whole map. For a small region, set it to 1.

Mass trace

For the mass traces, define the number of adjacent spectra in which a mass has to occur (mass_trace:min_spectra). In order to compensate for peak picking errors, missing peaks can be allowed (mass_trace:max_missing) and a tolerated mass deviation must be set (mass_trace:mz_tolerance).

Isotope pattern

The expected isotopic intensity pattern is estimated from an averagine amino acid composition. The algorithm searches all charge states in a defined range (isotopic_pattern:charge_min to isotopic_pattern:charge_max). Just as for mass traces, a tolerated mass deviation between isotopic peaks has to be set (isotopic_pattern:mz_tolerance).

The image shows the centroided peak data and the found peptide features. The used parameters can be found in the TOPP tools dialog.

TOPPView Tools FFCentrioided

Quality Control

To check the quality of the data (supports label-free workflows and IsobaricAnalyzer output):

The QualityControl TOPP tool computes and collects data that allows QC metrics to be computed for LC-MS data. Depending on the given input data, this tool collects data for various metrics (see section Metrics). New metavalues will be added to existing data and the information will be written out in mzTab format. This mzTab file can then be processed using custom scripts or via the R package PTXQC.

Workflow

Find an example workflow in OpenMS/share/OpenMS/examples/TOPPAS/QualityControl.toppas.

For data from IsobaricAnalyzer, just provide the consensusXML as input to QualityControl. No FeatureXMLs or TrafoXMLs are required. The mzML raw file can be added as input though.

Metrics

The following headings describe what each of the included metrics does, which input data it requires, and what it adds to the data or returns.

Input Data
  • PostFDR FeatureXML: A FeatureXML after FDR filtering.

  • Contaminants Fasta file: A Fasta file containing contaminant proteins.

  • Raw mzML file: An unchanged mzML file.

  • InternalCalibration MzML file: An MzML file after internal calibration.

  • TrafoXML file: The RT alignment function as obtained from a MapAligner.

Contaminants

The Contaminants metric takes the contaminants database and digests the protein sequences with the digestion enzyme that is given in the featureXML. Afterwards it checks whether each peptide sequence of the featureXML (including the unassigned PeptideIdentifications) is registered in the contaminants database.

Required input data

Contaminants Fasta file, PostFDR FeatureXML

Output

Changes in files:

  • Metavalue: is_contaminant set to 1 if the peptide is found in the contaminant database, and to 0 otherwise.

Other outputs:

  • Returns:

    • Contaminant ratio of all peptides.

    • Contaminant ratio of all assigned peptides.

    • Contaminant ratio of all unassigned peptides.

    • Intensity ratio of all contaminants in the assigned peptides.

    • Number of empty features, Number of all found features.

FragmentMassError

The FragmentMassError metric computes, for each annotated MS2 spectrum, the fragment mass errors in ppm and Da, i.e. the mass deltas between observed and theoretical fragment peaks.

Required input data

PostFDR FeatureXML, raw mzML file

Output

Changes in files:

  • Metavalue:

    • fragment_mass_error_ppm set to the fragment mass error in parts per million.

    • fragment_mass_error_da set to the fragment mass error in Dalton

Other Output:

  • Returns:

    • Average and variance of fragment mass errors in ppm

MissedCleavages

The MissedCleavages metric counts the number of MissedCleavages per PeptideIdentification given a FeatureMap and returns an agglomeration statistic (observed counts). Additionally the first PeptideHit of each PeptideIdentification in the FeatureMap is augmented with metavalues.

Required input data

PostFDR FeatureXML

Output

Changes in files:

  • Metavalue:

    • missed_cleavages

Other Output:

  • Returns:

    • Frequency map of missed cleavages as key/value pairs.

MS2IdentificationRate

The MS2IdentificationRate metric calculates the MS2 identification rate as follows: the number of all PeptideIdentifications is counted and divided by the total number of MS2 spectra.

Required input data

PostFDR FeatureXML, raw mzML file.

Output

Changes in files: This metric does not change anything in the data.

Other Output:

  • Returns:

    • Number of PeptideIdentifications

    • Number of MS2 spectra

    • Ratio of #pepID/#MS2

MzCalibration

The MzCalibration metric adds new metavalues to the first (best) hit of each PeptideIdentification. This metric can also be used without an mzML file, but then only the uncalibrated m/z error (ppm) will be reported. For full functionality, a PeakMap/MSExperiment with the original m/z values before m/z calibration (as generated by InternalCalibration) has to be given.

Required input data

PostFDR FeatureXML

Output

Changes in files:

  • Metavalues:

    • mz_raw set to m/z value of original experiment.

    • mz_ref set to m/z value of calculated reference.

    • uncalibrated_mz_error_ppm set to uncalibrated m/z error in parts per million.

    • calibrated_mz_error_ppm set to calibrated m/z error in parts per million.

Other Output: No additional output.

RTAlignment

The RTAlignment metric records the retention time before and after the alignment. These two values are added as metavalues to the PeptideIdentification.

Required input data

PostFDR FeatureXML, trafoXML file

Output

Changes in files:

  • Metavalues:

    • rt_align set to retention time after alignment.

    • rt_raw set to retention time before alignment.

Other Output: No additional output.

TIC

The TIC metric calculates the total ion count of an MSExperiment. If a bin size (in RT seconds) greater than 0 is given, all MS1 abundances within a bin are summed up.

Required input data

raw mzML file

Output

Changes in files: This metric does not change anything in the data.

Other Output:

  • Returns:

    • TIC chromatograms

TopNoverRT

The TopNoverRT metric calculates the ScanEventNumber (the number of the MS2 scan after its MS1 scan) and adds it as the new metavalue ScanEventNumber to the PeptideIdentifications. It finds all unidentified MS2 spectra and adds corresponding empty PeptideIdentifications without sequence as placeholders to the unassigned PeptideIdentification list. Furthermore, it adds the metavalue identified to the PeptideIdentifications.

Required input data

PostFDR FeatureXML, raw mzML file

Output

Changes in files:

  • Metavalues:

    • ScanEventNumber set to the calculated value

    • identified set to + or -

  • If provided:

    • FWHM set to RT peak width for all assigned PIs

    • ion_injection_time set to injection time from MS2 spectrum

    • activation_method set to activation method from MS2 spectrum

    • total_ion_count set to summed intensity from MS2 spectrum

    • base_peak_intensity set to highest intensity from MS2 spectrum

  • Additionally:

    • Adds empty PeptideIdentifications

Other Output: No additional output.

Contribute

Reporting Bugs and Issues

A list of known issues in the current OpenMS release can be found here. Please check if your OpenMS version matches the current version and if the bug has already been reported.

In order to report a new bug, please create a GitHub issue (see Write and Label GitHub Issues below) or contact us.

Include the following information in your bug report:

  1. The command line (i.e. call) including the TOPP tool and the arguments you used, or the steps you followed in a GUI tool (e.g. TOPPView) - e.g. FeatureFinderCentroided -in myfile.mzML -out myfile.featureXML.

  2. The output of OpenMS/TOPP (or a screenshot in case of a GUI problem).

  3. Operating system (e.g. “Windows XP 32 bit”, “Win 7 64 bit”, “Fedora 8 32 bit”, “macOS 10.6 64 bit”).

  4. OpenMS version (e.g. “OpenMS 1.11.1”, “Revision 63082 from the SVN repository”).

  5. OpenMS architecture (“32 bit” or “64 bit”)

Please provide files that we need to reproduce the bug (e.g. TOPP INI files, data files — usually mzML) via a download link, via the mailing list or by directly contacting one of the developers.

Write and Label GitHub Issues

Create an Issue

To create an issue:

  1. Go to the OpenMS codebase.

  2. Submit an issue.

The issue will be listed under Issues.

Label an Issue

To label an issue:

  1. On the right of the screen, select the cog icon under Labels.

  2. Choose a label from the list. Normally, an issue can have one or more of the following labels:

    • defect: A defect refers to a bug in OpenMS. This is a high priority issue.

    • enhancement: An enhancement refers to a feature idea to enhance the current OpenMS code. This is a medium priority issue.

    • task: A task refers to a single piece of work that a developer can undertake. This is a medium priority issue.

    • refactoring: A refactoring issue refers to a suggestion to streamline the code without changing how the code functions.

    • question: A question could trigger a discussion about tools, parameters and scientific tasks.

Pull Request Checklist

Before opening a pull request, check the following:

  1. Does the code build? Execute make (or your build system’s equivalent, e.g., cmake --build . --target ALL_BUILD --config Release on Windows).

  2. Do all tests pass? To check if all tests have passed, execute ctest (see the example after this checklist). If a test that is unrelated to your changes fails, check the nightly builds to see if the error is also in develop. If the error is in develop, create a GitHub issue (see Write and Label GitHub Issues).

  3. Is the code documented? Document all new classes, including their methods and parameters. It is also recommended to document non-public members and methods.

  4. Does the code introduce changes to the API? If the code introduces changes to the API, make sure that the documentation is up-to-date and that the Python bindings (pyOpenMS) still work. For each change in the C++ API, make a change in the Python API wrapper via the pyOpenMS/pxds/ files.

  5. Have you completed regression testing? Make sure that you include a test in the test suite for:

    • Public methods of a class

    • TOPP tools

    • Bug fixes
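For example, the whole test suite or a subset of tests can be run like this (a sketch; the name pattern passed to ctest’s -R option is only an assumed illustration):

ctest
ctest -R TOPP_FileFilter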

Make sure to:

  • Rebase before you open a pull request. To include all recent changes, rebase your branch on develop before opening a pull request. If you pushed your branch to origin before rebasing, git will most likely tell you after the rebase that your local branch and the remote branch have diverged. If you are sure that the remote branch does not contain any local commits in the rebased version, you can safely push using git push -f origin <branch-name> to enforce overwrite. If not, contact your local git expert on how to get the changes into your local branch.

  • Capture similar changes in a single commit. Each commit should represent one logical unit. Consolidate multiple commits if they belong together, or split single commits if they are unrelated. For example, committing code formatting together with a one-line fix makes it very hard to figure out what the fix was and which changes were inconsequential.

  • Create a pull request for a single feature or bug. If you have multiple features or fixes in a pull request, you might get asked to split your request and open multiple pull requests instead.

  • Describe what you have changed in your pull request. When opening the pull request, give a detailed overview of what has changed and why. Include a clear rationale for the changes and add benchmark data if available. See this request for an example.

OpenMS Git Workflow

Before getting started, install the latest version of git to avoid problems like GitHub HTTPS authentication errors (see Troubleshooting cloning errors and a solution using SSH).

OpenMS follows the git flow workflow. The difference is that merge commits are managed via pull requests instead of creating merge commits locally.

Naming conventions

Naming conventions for the following apply:

  • A local repository is the repository that lies on your hard drive after cloning.

  • A remote repository is a repository on a git server such as GitHub.

  • A fork is a copy of a repository. Forking a repository allows you to freely experiment with changes without affecting the original project.

  • Origin refers to a remote repository that you have forked. Call this repository https://github.com/_YOURUSERNAME_/OpenMS.

  • Upstream refers to the original remote OpenMS repository. Call this repository https://github.com/OpenMS/OpenMS.

Create fork

Start by forking the OpenMS repository.

To create a fork, click Fork under the main menu as shown below.

image info

Clone your fork

To obtain a local repository copy, clone your fork using:

$ git clone https://github.com/_YOURUSERNAME_/OpenMS.git

This will clone your fork (correctly labelled origin by default) into a local copy on your computer.

Note

To use git clone git@github.com:_YOURUSERNAME_/OpenMS.git, make sure you have SSH key added to your GitHub account.

Keep your fork in sync

Keep your fork (origin) in sync with the OpenMS repository (upstream) by following the GitHub instructions. In summary, to keep your fork in sync:

  1. Fetch changes from upstream and update your local branch.

  2. Push your updated local branch to your fork (origin).

Tip

To keep track of other remote repositories as well, use git fetch --all --prune to update them. The option --prune tells git to automatically remove tracking branches if they got removed in the remote repository.

$ git fetch --all --prune
$ git checkout develop
$ git merge --ff-only upstream/develop
$ git push origin develop

Feel free to experiment within your fork. However, your code needs to meet OpenMS quality standards to be merged into the OpenMS repository.

Follow these rules:

  • Never commit directly to the develop or master branches as it will complicate the merge.

  • Try to start every feature from develop and not base features on other features.

  • Name the OpenMS remote upstream and always push directly to origin (git push origin <branch-name>).

  • When updating your fork, consider using git fetch upstream followed by git merge --ff-only upstream/develop to avoid creating merge commits in develop.

  • If you never commit to develop this should always succeed and (if a commit accidentally went to develop) warn you instead of creating a merge commit.

Create new feature

All features start from develop.

$ git checkout develop
$ git checkout -b feature/your-cool-new-feature

All commits related to this feature will then go into the branch feature/your-cool-new-feature.

Keeping your feature branch in sync with develop branch

While working on your feature branch, it is usual that development continues and new features get integrated into the main development branch. This means your feature branch lags behind develop. To get your feature branch up-to-date, rebase your feature branch on develop using:

$ git checkout feature/myfeaturebranch
$ git rebase develop

The above commands:

  1. Performs a rewind of your commits until the branching point.

  2. Applies all commits that have been integrated into develop.

  3. Reapplies your commits on top of the commits integrated into develop.

For more information, refer to a visual explanation of rebasing.

Tip

Do not rebase published branches (e.g. branches that are part of a pull request). If you created a pull request, you should only add commits in your feature branch to fix things that have been discussed. After your pull request contains all fixes, you are ready to merge the pull request into develop without rebasing (see e.g. rebase-vs-merge).

Adding a feature to OpenMS

Features that should go into the main development line of OpenMS should be integrated via a pull request. This allows the development community of OpenMS to discuss the changes and suggest possible improvements.

After opening the pull request via the GitHub web site, GitHub will try to create the pull request against the branch that you branched off from. Please check the branch that you are opening the pull request against before submitting the pull request. If any changes are made, a new pull request is required. Select Allow others to make changes to this pull request so that maintainers can directly help to solve problems.

Open pull requests only after checking code-style, documentation and passing tests. Pull requests that do not pass CI or code review will not be merged until the problems are solved. It is recommended that you read the pull request guidelines before you submit a pull request.

Update git submodules

Start in your local OpenMS/OpenMS repository (on your feature/pull request branch).

The following example uses a submodule called THIRDPARTY.

$ git submodule update --init THIRDPARTY
$ cd THIRDPARTY
# yes, in the submodules the default remote is origin
# usually you want to pull the changes from master (e.g. after your pull request to OpenMS/THIRDPARTY has been merged)
$ git pull origin master
$ cd ..
$ git status
# Make sure that you see "modified:   THIRDPARTY (new commits)"
$ git commit -am "updated submodule"

Developers

To contribute to OpenMS, see the technical documentation and contribution guidelines below.

For any questions, please contact us.

Technical documentation

Note

Untested installers and containers, known as the nightly snapshot, are released every night. They generally pass automated continuous integration tests but no manual tests.

View the documentation for the nightly snapshot of OpenMS develop branch at the build archive.

See the documentation for the latest release.

Contribution guidelines

Before contributing to OpenMS, read information on the development model and conventions followed to maintain a coherent code base.

Development model

OpenMS follows the Gitflow development workflow.

Every contributor is encouraged to create their own fork (even if they are eligible to push directly to OpenMS). To create a fork:

  1. Follow the documentation on forking.

  2. Keep your fork up-to-date.

  3. Create a pull request. Before opening the pull request, please view the pull request guidelines.

Coding conventions

See the manual for coding style recommended by OpenMS: Coding conventions.

See also

C++ Guide.

OpenMS automatically tests for common coding convention violations using a modified version of cpplint. Style testing can be enabled using cmake options. clang-format is used for formatting the cpp code.

Commit messages

View the guidelines for commit messages: How to write commit messages.

Automated unit tests

Nightly tests run on different platforms. It is recommended to test on different platforms.

Tip

This saves time and increases productivity during continuous integration tests.

Nightly tests: CDASH.

Further contributor resources

Consider the following resources for further information:

  • Guidelines for adding new dependency libraries: View the guidelines for adding new dependency libraries.

  • Experimental installers: We automatically build installers for different platforms. These usually contain unstable or partially untested code. The nightly (unstable) installers are available at the build archive.

  • Developer FAQ: Visit the Developer FAQ to get answers to frequently asked questions.

Adding New Tool to The TOPP suite
The OpenMS pipeline (TOPP)

Any tool that is written with the OpenMS library can easily be made into a TOPP tool by simply using the OpenMS command line parser which is able to parse ParamXML, a powerful XML based description of the tool. Hence most analysis algorithms in OpenMS are available as a stand-alone tool which can be called on the command line or integrated into workflow engines via the CTD mechanism. A current list of TOPP tools can be found in the documentation.

What do I have to do to add a new TOPP tool?

The recommended way is to inherit from the class TOPPBase as in existing TOPP tools (sources available in /src/topp/). This will add command line parsing functionality to your tool as described in the TOPP section of this page.

  • Add the code to src/topp/ and register it in src/topp/executables.cmake

  • Add your tool (with the correct category) to getTOPPToolList() in src/openms/source/APPLICATIONS/ToolHandler.cpp. This creates a doxygen page with the --help output of the tool (using TOPPDocumenter). This page must be included at the end of the doxygen documentation of your tool (see other tools for an example).

  • Add it to the TOPP docu page (in doc/doxygen/public/TOPP.doxygen)

  • Add the name to src/topp/executables.cmake

  • Write a TOPP test (add it to src/tests/topp/CMakeLists.txt)

Warning

Handle any kind of input files to your TOPP tool via command line flags and use the ${DATA_DIR_TOPP} prefix. Use ini-files to specify output-files, but not input-files. Doing otherwise will break out-of-source builds.

Hint

Add -test to the call of your TOPP tool and also create the expected output that you put in src/tests/topp with that flag active. The flag ensures that UniqueIds, dates etc. are equal no matter where and when the tool is run.

I want to implement a new file adapter. What is to be done?

First, add a file adapter class to the include/OpenMS/FORMAT/ and source/FORMAT/ folders. The file adapter should implement a default constructor, a load method and a store method. Make sure your code conforms to the OpenMS Coding conventions. For automatic file type recognition, you need to

  • register your new file type at the Type enum in /include/OpenMS/FORMAT/FileTypes.h,

  • flag the file type as supported in the isSupported method of /source/FORMAT/FileHandler.C

  • register the file extension in the getTypeByFileName method of /source/FORMAT/FileHandler.C

If the new file is a peak or feature file format you should also add it to loadExperiment or loadFeatures, respectively, of the FileHandler class. To add the file format to the TOPPView open dialog, you have to modify the file /source/APPLICATIONS/TOPPViewBase.C.

  • Add the file extensions to the filter_all and filter_single variables of the getFileList_ method.

To add your format to TOPP applications:

  • add the file extension to the extensions list of the respective parameter:

    e.g. setValidStrings_("in_type", StringList::create("mzData,mzXML,mzML")); in FileInfo
    
How to create an icon file for a TOPP tool under Windows?
  • Create an .ico file: first, you need a graphics program (GIMP is recommended), a motif, and the awareness that space is limited. Create at least a 16x16, 32x32, 48x48 and 64x64 pixel version and save each of them in a separate layer of the respective size. Do not add any larger layers, since Windows XP will otherwise not display the icon at all. When saving the image as .ico, GIMP will ask you for the color depth of each layer. As it is recommended to have multiple color depths for each icon size, go back to the layers and duplicate each layer twice; that should give you 12 layers. Now save the image as an .ico file (e.g. TOPPView.ico), giving each group of equally sized layers a 32 bit (8 bit transparency), 8 bit (1 bit transparency), and 4 bit (1 bit transparency) color depth.

Attention

Make sure to assign the higher color depths to the upper layers, as Windows will otherwise not pick the highest possible color depth.

  • Create a resource file: create a text file with the extension .rc (e.g. TOPPView.rc) and insert the following line: 101 ICON "TOPPView.ico", replacing TOPPView with your binary name. Put both files in OpenMS/source/APPLICATIONS/TOPP/ (similar files for other TOPP tools are already present). Re-run cmake and re-link your TOPP tool.

Voila. You should have an iconized TOPP tool.

Develop your Tool in an external project using OpenMS

To include the OpenMS library in one of your projects, we recommend having a look at a small emulated external project in our repository. We strongly suggest using CMake to build your project together with OpenMS, to make use of the macros and environment information generated during the build of the OpenMS library.

The Common Tool Description (CTD)

The CTD is a format developed by the OpenMS team to allow TOPP tools to be used in other workflow engines as well. Each tool can output a CTD description of itself (the XML schema for the CTD can be found here), which can then be used by a node generator program to generate nodes for different workflow engines. The CTD mechanism is shared by OpenMS with other mature libraries like SeqAn and BALL. An example of a node generator program is the Generic KNIME Nodes project. The most complete description of how to generate your own Generic KNIME Nodes based on a CTD (e.g. from your freshly developed command line tool) can be found in the SeqAn documentation. We are working on a tutorial specifically tailored to OpenMS.

Custom Compilation of OpenMS

To compile OpenMS with self-built compilers and non-default standard libraries, follow the steps listed below.

To choose a specific compiler instead of the system default, pass its full path via the following options of the cmake call:

cmake -DCMAKE_C_COMPILER=/path/to/c-compiler/binary/gcc -DCMAKE_CXX_COMPILER=/path/to/c++-compiler/binary/g++

cmake -DCMAKE_C_COMPILER=/path/to/c-compiler/binary/clang -DCMAKE_CXX_COMPILER=/path/to/c++-compiler/binary/clang++

To compile OpenMS with clang and a specific GCC stdlib, instead of the system default one:

Use this cmake option to specify an additional compiling option for clang:

cmake -DMY_CXX_FLAGS="--gcc-toolchain=/path/to/gcc"

where /path/to/gcc is the top-level GCC directory (containing the lib64 directory).

Warning

This combination does not work for all versions of clang and gcc.

  • Clang 9.0.0 and GCC 4.8.5 stdlib does not work!

  • Clang 9.0.0 and GCC 9.2.0 stdlib does not work!

  • Clang 9.0.0 and GCC 8.3.0 stdlib compiles, but some tests fail.

  • Clang 6.0.0 and GCC 7.4.0 stdlib (Ubuntu 18.04 default versions) works

Developer Guidelines For Adding New Dependent Libraries
Our dependency library philosophy

In short, requirements for adding a new library are:

  • indispensable functionality

  • license compatibility

  • availability for all platforms

Indispensable functionality

In general, adding a new dependency library (of which we currently have more than a handful, e.g. Xerces-C or ZLib) imposes a significant integration and maintenance effort. Thus, the new library should add indispensable functionality. If the added value does not compensate for the overhead, alternative solutions encompass:

  • write the functionality yourself and add it to the OpenMS library (i.e. its repository) directly

  • write a TOPPAdapter which calls an external executable (placing the burden on the user to supply the executable)

License compatibility

OpenMS has a BSD 3-clause license and we try hard to avoid dependencies on GSL-like libraries. Therefore, a new library under e.g. LGPL-2 would be prohibitive.

C++ standard compatibility

New dependency libraries need to be compatible with, and therefore compilable under, the same C++ standard as OpenMS.

Availability for all platforms

OpenMS has been designed for Windows, macOS, and Linux. Therefore, the new dependency library needs to be available on these platforms as well.

  • on Windows this usually means adding the new library to the contrib in debug and release variants. In short, all recent versions of Visual Studio (VS2008 and onwards) must be supported (or support must be added). This encompasses:

    • a solution file (which can be either statically present or generated by a meta-system like CMake) is available

    • the library actually compiles and is linked against the dynamic VC++ runtime library (since this is what the OpenMS library links against as well; mixing static and dynamic runtimes will lead to linker errors or segfaults).

  • on macOS it should be ensured that the library can be built on recent macOS versions (> 10.10) using the Mac-specific libc++. Ideally the package should be available via Homebrew or MacPorts, so we can use those libraries directly instead of shipping them via the contrib. Additionally, the MacPorts and Homebrew formulas for building the libraries can serve as blueprints on how to compile the library in a generic setting inside the contrib, which should also be provided.

  • on Linux, since we (among other distributions) provide an OpenMS Debian package, which requires that all dependencies of OpenMS are available as Debian packages as well, the new library must be available (or made available) as a Debian package or be linked statically during the OpenMS packaging build.

How to add it to the contrib build

Add a CMake file describing how to build the library to the libraries.cmake folder in OpenMS/contrib. Preferably, of course, the library supports building with CMake (see Xerces), which makes the script really easy. It should support static and dynamic builds on every platform. Add the compile flag for position-independent code (e.g. -fpic) in the static version. Add patches in the patches folder and apply them with the macros in the macros.cmake file. Create patches with diff -Naur original_file my_file > patch.txt. If there are problems when applying a patch, double-check the file paths in the head of the patch and the call of the patching macro in CMake.

  • All libraries need to go (e.g. be copied/installed/moved) into $buildfolder/lib

  • All headers go under $buildfolder/include/$libraryname (the only exception for leaving out the library name subfolder is when the Find$libraryname.cmake does not support this subfolder, e.g. because system libraries are not structured like this; see Boost).

  • All needed (share) files go into $buildfolder/share/$libraryname

Then test the build on your platform including a subsequent build of OpenMS using that library. Submit a pull request to OpenMS/contrib. Submit a pull request to OpenMS/OpenMS that updates the contrib submodule. Make sure the libraries are correctly shipped in pyOpenMS and the packages (especially dynamic libraries and especially on Windows).

External Code using OpenMS

If OpenMS’ TOPP tools are not enough in a certain scenario, you can either request a change to OpenMS, if you feel this functionality is useful for others as well, or modify/extend OpenMS privately. For the latter, there are multiple ways to do this:

  • Modify the developer version of OpenMS by changing existing tools or adding new ones.

  • Use an External Project to write a new tool, while not touching OpenMS itself (see below on how to do that).

Once you’ve finished your new tool, it can readily be run on the development machine. To ship it to a new client machine, read further in this document.

Compiling external code

It is very easy to set up an environment to write your own programs using OpenMS. Make sure you have downloaded and installed the source package of OpenMS/TOPP properly.

Note

You cannot use the install target when working with the development version of OpenMS; it must be built and used within the build tree.

All important compiler settings and preprocessor definitions, along with the OpenMS library target, are available. The most important variables are:

  • OpenMS_INCLUDE_DIRECTORIES: all include directories containing OpenMS headers

  • OPENMS_ADDCXX_FLAGS: preprocessor macros we require written as (-DMACRO1 -DMACRO2)

and the OpenMS target itself (which you can link against).

The example that follows is explained in detail below:

### example CMakeLists.txt to develop C++ programs using OpenMS
cmake_minimum_required(VERSION 3.0)
project("Example_Project_using_OpenMS")

## list all your executables here (a corresponding .cpp file should exist, e.g. Main.cpp)
set(my_executables
  Main
)

## list all classes here, which are required by your executables
## (all these classes will be linked into a library)
set(my_sources
  ExampleLibraryFile.cpp
)

## find OpenMS configuration and register target "OpenMS" (our library)
find_package(OpenMS)
## if the above fails you can try calling cmake with -D OpenMS_DIR=/path/to/OpenMS/
## or modify the find_package() call accordingly
## find_package(OpenMS PATHS "/path/to/OpenMS/")

# check whether the OpenMS package was found
if (OpenMS_FOUND)
  message(STATUS "\nFound OpenMS at ${OpenMS_DIR}\n")

  ## library with additional classes from above
  add_library(my_custom_lib STATIC ${my_sources})

  ## add targets for the executables
  foreach(i ${my_executables})
    add_executable(${i} ${i}.cpp)
    ## link executables against OpenMS
    target_link_libraries(${i} OpenMS my_custom_lib)
  endforeach(i)


else(OpenMS_FOUND)
  message(FATAL_ERROR "OpenMSConfig.cmake file not found!")
endif(OpenMS_FOUND)

The project command defines the name of the project; the name is only of interest if you're working in an IDE or want to export this project's targets. To compile a program, append it to the my_executables list. If you use object files (classes which do not contain a main program), append them to the my_sources list. In the next step, CMake creates a statically linked library from the object files listed in my_sources. This simple CMakeLists.txt example can be extended to also build shared libraries, include other external libraries, and so on.

An example external project can be found in OpenMS/share/OpenMS/examples/external_code. Copy these files to a separate directory and use CMake to configure it (here as an in-source build).

cd <path_to_external_project>
cmake -G "<generator>" .

For more information visit the website of cmake at cmake.org and consult the documentation.

Important

Have fun coding with OpenMS!

Shipping external code to a new machine

If you’ve modified OpenMS itself and have not used an external project, use our installer scripts to build your own OpenMS installer for your platform (see our internal FAQ, which is built using "make doc_internal") and ship that to a client machine.

If you’ve used an external project and have a new executable (plus an optional new library), use the installer approach as well, and manually copy the new executable to the TOPP binary directory (e.g. on Windows this could be c:/program files/OpenMS/bin, on Linux it could be /bin).

If you do not use the installer, you need to copy all required files manually and perform a few extra steps; see below. What needs to be done is somewhat platform dependent and thus cumbersome to explain. Look at the CMake installer scripts to see what is required (for macOS and Linux see OpenMS/cmake/package*.cmake).

In short:

  • copy the OpenMS/share/OpenMS directory to the client machine (e.g <client/my_dir>/share) and set the environment variable OPENMS_DATA_PATH to this directory

  • copy the OpenMS library (OpenMS.dll for Windows or OpenMS.so/.dylib for Linux/macOS) to <client/my_dir>/bin.

  • copy all Qt4 libraries to the client <client/my_dir>/bin or on Linux/macOS make sure you have installed the Qt4 package.

  • [Windows only] copy Xerces dll (see contrib/lib) to <client/my_dir>/bin

  • [Windows only] install the VS redistributable package (see Microsoft Homepage) on the client machine which corresponds to the VS version that was used to compile your code (use the correct redistributable package, i.e., matching architecture (32/64 bit), VS version and VS Service Pack version). If you choose the wrong redistributable package, you will get "Application failed to initialize properly…" error messages.

Developer FAQ

The following contains answers to typical questions from developers about OpenMS.

General

The following section provides general information to new contributors.

I am new to OpenMS. What should I do first?
I have written a class for OpenMS. What should I do?

Follow the OpenMS coding conventions.

Coding style (brackets, variable names, etc.) must conform to the conventions.

  • The class and all the members should be properly documented.

  • Check your code with the tool tools/checker.php. Call php tools/checker.php for detailed instructions.

Please open a pull request and follow the pull request guidelines.

Can I use QT designer to create GUI widgets?

Yes. To create a class called Widget: create a .ui file with Qt Designer and store it as Widget.ui, then add the class to sources.cmake. From the .ui file, the file include/OpenMS/VISUAL/UIC/ClassTemplate.h is generated by the build system.

Note

Do not check in this file, as it is generated automatically when needed.

Derive the class Widget from WidgetTemplate. For further details, see the Widget.h and Widget.cpp files.

Can the START_SECTION-macro not handle template methods that have two or more arguments?

Insert round brackets around the method declaration.
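
For example (the method shown here is purely illustrative), the preprocessor would otherwise treat the comma inside the template argument list as a macro argument separator, so the whole declaration is wrapped in an extra pair of parentheses:

// without the extra parentheses, the comma in 'std::map<String, double>' would
// split the declaration into two macro arguments
START_SECTION((std::map<String, double> getIntensityMap(const MSSpectrum& spec)))
  // ... test code for this method ...
END_SECTION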

Where can I find the binary installers created?

View the binary installers at the build archive. Please verify the creation date of the individual installers, as there may have been an error while creating the installer.

Troubleshooting

The following section provides information about how to troubleshoot common OpenMS issues.

OpenMS complains about boost not being found but I’m sure its there

CMake got confused. Set up a new build directory and try again. If you build inside the source tree (not recommended), deleting CMakeCache.txt and the CMake cache directory might help.

Build System

The following questions are related to the build system.

What is CMake?

CMake generates build systems for different platforms, e.g. Visual Studio solutions on Windows, Makefiles on Linux, etc. This allows defining in one central location (namely CMakeLists.txt) how OpenMS is built, while the platform-specific details are handled by CMake.

View the cmake website for more information.

How do I use CMake?

See Installation instructions for your platform. In general, call CMake(.exe) with some parameters to create the native build-system.

Tip

Whenever ccmake is mentioned in this document, substitute it with cmake-gui if your OS is Windows, or edit the CMakeCache.txt file directly.

How do I generate a build-system for Eclipse, KDevelop, CodeBlocks etc?

Type cmake into a console. This will list the code generators available on your platform; use them with CMake via the -G option.

What are user definable CMake cache variables?

They allow the user to pass options to CMake which will influence the build system. The most important option which should be given when calling CMake.exe is:

CMAKE_FIND_ROOT_PATH, which is where CMake will search for additional libraries if they are not found in the default system paths. By default we add OpenMS/contrib.

If you have installed all libraries on your system already, there is no need to change CMAKE_FIND_ROOT_PATH. If you use the contrib libraries, set CMAKE_FIND_ROOT_PATH to the contrib directory.

On Windows, the contrib folder is required, as there are no system developer packages. To pass this variable to CMake, use the -D switch, e.g. cmake -D CMAKE_FIND_ROOT_PATH:PATH="D:\\somepath\\contrib".

Everything else can be edited using ccmake afterwards.

The following options are of interest:

  • CMAKE_BUILD_TYPE To build Debug or Release version of OpenMS. Release is the default.

  • CMAKE_FIND_ROOT_PATH The path to the contrib libraries.

    Tip

    You can provide more than one value here (e.g., -D CMAKE_FIND_ROOT_PATH="/path/to/contrib;/usr/" will search in your contrib path and in /usr for the required libraries).

  • STL_DEBUG Enables STL debug mode.

  • DB_TEST (deprecated) Enables database testing.

  • QT_DB_PLUGIN (deprecated) Defines the db plugin used by Qt.

View the description for each option by calling ccmake.

Can I use another solver other than GLPK?

Other solvers can be used, but by default the build system only links against GLPK (this is how OpenMS binary packages must be built). To use another solver, call cmake ... -D USE_COINOR=1 ... and refer to the documentation of the LPWrapper class.

How do I switch to debug or release configuration?

For Makefile generators (typically on Linux), set the CMAKE_BUILD_TYPE variable to either Debug or Release by calling ccmake. For Visual Studio this is not necessary, as all configurations are generated and you can choose the one you like within the IDE itself. The 'Debug' configuration enables debug information; the 'Release' configuration disables debug information and enables optimization.

I changed the contrib path, but re-running CMake won’t change the library paths?

Once a library is found and its location is stored in a cache variable, it will only be searched again if the corresponding entry in the cache file is set to false.

Warning

If you delete the CMakeCache.txt, all other custom settings will be lost.

The most useful targets will be shown to you by calling the targets target, i.e. make targets.

CMake can’t seem to find a Qt library (usually QtCore). What now?

CMake finds Qt by looking for qmake in your PATH or via the environment variable QTDIR. Set these accordingly.

Make sure there is no second installation of Qt (especially the MinGW version) in your local environment.

Warning

This might lead CMake to the wrong path (it’s searching for the Qt*.lib files). You should only move or delete the offending Qt version if you know what you are doing!

A safe workaround is to edit the CMakeCache file (e.g. via ccmake) and set all paths relating to Qt (e.g. QT_LIBRARY_DIR) manually.

(Windows) What version of Visual Studio should I use?

It is recommended to use the latest version. Get the latest CMake, as its generator needs to support your Visual Studio version. If your Visual Studio is too new and no CMake release supports it yet, you will face a lot of conversion issues. This happens whenever the build system calls CMake (which can be quite often, e.g., after changes to a CMakeLists.txt).

How do I add a new class to the build system?
  1. Create the new class in the corresponding sub-folder of the sub-project. The header has to be created in src/<sub-project>/include/OpenMS and the .cpp file in src/<sub-project>/source, e.g., src/openms/include/OpenMS/FORMAT/NewFileFormat.h and src/openms/source/FORMAT/NewFileFormat.cpp.

  2. Add both to the respective sources.cmake file in the same directory (e.g., src/openms/source/FORMAT/ and src/openms/include/OpenMS/FORMAT/).

  3. Add the corresponding class test to src/tests/class_tests/<sub-project>/ (e.g., src/tests/class_tests/openms/source/NewFileFormat_test.cpp).

  4. Add the test to the executables.cmake file in the test folder (e.g., src/tests/class_tests/openms/executables.cmake).

  5. Add them to git by using the command git add.

How do I add a new directory to the build system?
  1. Create two new sources.cmake files (one for src/<sub-project>/include/OpenMS/MYDIR, one for src/<sub-project>/source/MYDIR), using existing sources.cmake files as template.

  2. Add the new sources.cmake files to src/<sub-project>/includes.cmake

  3. If you created a new directory directly under src/openms/source, then have a look at src/tests/class_tests/openms/executables.cmake.

  4. Add a new section that makes the unit testing system aware of the new (upcoming) tests.

  5. Look at the very bottom and augment TEST_executables.

  6. Add a new group target to src/tests/class_tests/openms/CMakeLists.txt.

How can I speed up the compile process of OpenMS?

To speed up the compilation of OpenMS, use several threads. If you have several processors/cores, build OpenMS classes/tests and TOPP tools with several parallel jobs. On Linux, use the make option -j, e.g. make -j8 OpenMS TOPP test_build.

On Windows, Visual Studio solution files are automatically generated with the /MP flag, such that Visual Studio uses all available cores of the machine.

Release

View preparation of a new OpenMS release to learn more about contributing to releases.

Working in Integrated Development Environments (IDEs)
Why are there no source/TEST and source/APPLICATIONS/TOPP folder?

All source files added to an IDE are associated with their targets. Find the source files for each test within its own subproject. The same is true for the TOPP classes.

I’m getting the error “Error C2471: cannot update program database”

This is a bug in Visual Studio, for which a bug fix exists. Only apply it if you encounter the error, as the bug fix might have unwanted side effects!

Visual Studio can’t read the clang-format file.

Depending on the Visual Studio version, you might get an error like "Error while formatting with ClangFormat". This is because Visual Studio is using an outdated version of clang-format. Unfortunately there is no easy way to update this using Visual Studio itself. There is a plugin provided by LLVM designed to fix this problem, but the plugin doesn't work with every Visual Studio version. In that case, update clang-format manually using the pre-built clang-format binary. Both the binary and a link to the plugin can be found here. To update clang-format, download the binary and exchange it with the clang-format binary in your Visual Studio folder. For Visual Studio 2017 and 2019 it should be located at: C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\Llvm\bin.

The indexer gets stuck at some file which #includes seqan

It seems that SeqAn code is just too confusing for older Eclipse C++ indexers. You should upgrade to Eclipse Galileo (CDT 6.0.x). Also, increase the available memory limit in eclipse.ini, e.g. -Xmx1024m for one gigabyte.

The parser is confused after OPENMS_DLLAPI and does not recognize standard C++ headers

Go to Project -> Properties -> C/C++ Include Paths and Preprocessor Symbols -> Add Preprocessor symbol -> "OPENMS_DLLAPI=". This tells eclipse that the macro is defined empty. In the same dialog add an external include path to e.g. /usr/include/c++/4.3.3/, etc. The issue with C++ headers was fixed in the latest galileo release.

Hints to resolve the OPENMS_DLLAPI issue using the cmake generator are welcome!

Debugging

The following section provides information about how to debug your code.

How do I run a single test?

Execute an OpenMS class test using the CTest regular expressions:

$ ctest -V -R "^<class>_test"

# To build a class test, call the respective make target in ./source/TEST:

$ make <class>_test

To run a TOPP test, use:

$ ctest -V -R "TOPP_<tool>"

To build the tool, use:

$ make <tool>
How do I debug uncaught exceptions?

You can make OpenMS dump a core when an uncaught exception occurs by setting the environment variable OPENMS_DUMP_CORE.

Each time an uncaught exception occurs, the OPENMS_DUMP_CORE variable is checked; if it is set, a segmentation fault is forced, which produces the core dump.

(Linux) Why is no core dumped, although a fatal error occurred?

Run the command ulimit -c unlimited. It sets the maximum size of a core file to unlimited.

Warning

We observed that, on some systems, no core is dumped even if the size of the core file is set to unlimited. We are not sure what causes this problem.

(Linux) How can I set breakpoints in gdb to debug OpenMS?

Imagine you want to debug the TOPPView application and you want it to stop at line 341 of SpectrumMDIWindow.C.

  1. Enter the following in your terminal to run gdb:

shell> gdb TOPPView

  2. Start the application (and close it again):

gdb> run [arguments]

  3. Set the breakpoint:

gdb> break SpectrumMDIWindow.C:341

  4. Start the application again (with the same arguments):

gdb> run
How can I find out which shared libraries are used by an application?

Linux: Use ldd.

Windows (Visual studio console): See Dependency Walker (use x86 for 32 bit builds and the x64 version for 64bit builds. Using the wrong version of depends.exe will give the wrong results) or dumpbin /DEPENDENTS OpenMS.dll.

How can I get a list of the symbols defined in a (shared) library or object file?

Linux: Use nm <library>.

Use nm -C to switch on demangling of low-level symbols into their C++-equivalent names. nm also accepts .a and .o files.

Windows (Visual studio console): Use dumpbin /ALL <library>.

Use dumpbin on object files (.o) or (shared) library files (.lib) or the DLL itself e.g. dumpbin /EXPORTS OpenMS.dll.

Cross-platform thoughts

OpenMS runs on three major platforms… Here are the most prominent causes of “it runs on Platform A, but not on B. What now?”

Reading or writing binary files

Reading or writing binary files behaves differently across platforms. Linux usually does not distinguish between text mode and binary mode when reading files. This is quite different on Windows, where in text mode some bytes are interpreted as EOF, which might lead to a premature end of the reading process.

When reading binary files, make sure that you explicitly state that the file is binary when opening it.
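
For example, with a plain C++ stream (a generic sketch with a placeholder file name, not OpenMS-specific code):

#include <fstream>

int main()
{
  // open the file explicitly in binary mode so that Windows neither translates
  // line endings nor stops reading at bytes that look like EOF in text mode
  std::ifstream in("data.raw", std::ios::in | std::ios::binary);
  // ... read from 'in' ...
  return 0;
}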

When writing in text mode on Windows, a line break (\n) is expanded to (\r\n). Keep this in mind, or use the eol-style property of Subversion to ensure that line endings are correctly checked out on non-Windows systems.

Paths and system functions

Avoid hardcoding paths, e.g. String tmp_dir = "/tmp";. This will fail on Windows. Use Qt's QDir to get a path to the system's temporary directory if required.
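
A minimal sketch of the Qt call (a generic example, not taken from OpenMS code):

#include <QtCore/QDir>
#include <QtCore/QString>

int main()
{
  // ask Qt for the platform's temporary directory instead of hardcoding "/tmp"
  QString tmp_dir = QDir::tempPath();
  return 0;
}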

Avoid names like uname which are only available on Linux.

When working with files or directories, it is usually safe to use “/” on all platforms. Take care of spaces in directory names though. Quote paths if they are used in a system call to ensure that the subsequent interpreter takes the spaced path as a single entity.

Doxygen Documentation
Where can I find the definition of the main page?

Find a definition of the main page here.

Where can I add a new module?

Add a new module here.

How is the parameter documentation for classes derived from DefaultParamHandler created?

Add your class to the program OpenMS/doc/doxygen/parameters/DefaultParamHandlerDocumenter.cpp. This program generates an HTML table with the parameters. This table can then be included in the class documentation using the following doxygen command:

@htmlinclude OpenMS_<class name>.parameters
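
For instance, the command is placed inside the class's doxygen comment (MyAlgorithm is a placeholder class name used for illustration):

#include <OpenMS/DATASTRUCTURES/DefaultParamHandler.h>

namespace OpenMS
{
  /**
    @brief One-line description of the algorithm.

    @htmlinclude OpenMS_MyAlgorithm.parameters
  */
  class OPENMS_DLLAPI MyAlgorithm :
    public DefaultParamHandler
  {
    // ...
  };
}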

Note

Parameter documentation is automatically generated for TOPP tools included in the static ToolHandler.cpp tools list.

To include TOPP parameter documentation use following doxygen command:

@htmlinclude TOPP_<tool name>.parameters

Test if everything worked by calling make doc_param_internal. The parameters documentation is written to OpenMS/doc/doxygen/parameters/output/.

How is the command line documentation for TOPP tools created?

The program OpenMS/doc/doxygen/parameters/TOPPDocumenter.cpp creates the command line documentation for all classes that are included in the static ToolHandler.cpp tools list. It can be included in the documentation using the following doxygen command:

@verbinclude TOPP_<tool name>.cli

Test if everything worked by calling make doc_param_internal. The command line documentation is written to OpenMS/doc/doxygen/parameters/output/.

Bug Fixes
How to contribute a bug fix?

Read contributor quickstart guide.

How can I profile my code?

IBM’s profiler, available for all platforms (and free for academic use): Purify(Plus) and/or Quantify.

Windows: this is directly supported by Visual Studio (Depending on the edition: Team and above). Follow their documentation.

Linux:

  1. Build OpenMS in debug mode (set CMAKE_BUILD_TYPE to Debug).

  2. Call the executable with valgrind: valgrind --tool=callgrind.

    Warning

    Other processes running on the same machine can influence the profiling. Make sure your application gets enough resources (memory, CPU time).

  3. Start and stop the profiling while the executable is running e.g. to skip initialization steps:

  4. Start valgrind with the option --instr-atstart=no.

  5. Call callgrind_control -i on (or -i off) to start or stop the profiling.

  6. The output can be viewed with kcachegrind callgrind.out.

(Linux) How do I check my code for memory leaks?
  • Build OpenMS in debug mode (set CMAKE_BUILD_TYPE to Debug).

  • Call the executable with valgrind: valgrind --suppressions=OpenMS/tools/valgrind/openms_external.supp --leak-check=full <executable> <parameters>.

Common errors are:

  • 'Invalid write/read ...' - Violation of container boundaries.

  • '... depends on uninitialized variable' - Uninitialized variables:

  • '... definitely lost' - Memory leak that has to be fixed

  • '... possibly lost' - Possible memory leak, so have a look at the code

For more information, see the valgrind documentation.

Additional

Graphical User Interfaces

OpenMS provides additional graphical user interfaces besides TOPPAS and TOPPView, designed for users who want easy access to TOPP tools. These interfaces include:

  • INIFileEditor

    A GUI application used to edit TOPP INI files. TOPP INI files are used to configure TOPP tool parameters. TOPP INI files are files with the extension .ini. For more information, read our INIFileEditor section.

  • SwathWizard An application for SWATH analysis. SwathWizard is used to analyze DIA swath data. For more information, read our SwathWizard section.

A possible workflow would consist of the following steps:

  1. Generate a TOPP INI file from the command line.

  2. Edit the TOPP INI file in the INIFileEditor.

  3. Import data into TOPPView.

  4. Apply a TOPP tool to the data in TOPPView. You will need to load the TOPP INI file edited in step 2.

SwathWizard

SwathWizard is an assistant for Swath analysis.

The Wizard takes the user through the whole analysis pipeline for SWATH proteomics data, i.e. the OpenSwathWorkflow tool (see the TOPP documentation), including downstream tools such as PyProphet and the TRIC alignment tool (msproteomicstools).

Since the downstream tools require Python and the respective modules, the Wizard will check their proper installation status and warn the user if a component is missing.

Users can enter the required input data (mzML MS/MS data, configuration files) in dedicated fields, usually by dragging and dropping files from the operating system's file explorer (Explorer, Nautilus, Finder, ...). The output of the Wizard is both the intermediate files from OpenSWATH (e.g. the XIC data in .sqMass format) and the tab-separated table format (.tsv) from pyProphet and TRIC.

This is what the Wizard looks like:

SwathWizard

A schematic of the internal data flow (all tools are called by SwathWizard in the background) can be found in the TOPP Documentation: SwathWizard.

Recommended test data for the Wizard is the PASS00779 dataset.

INIFileEditor

The INIFileEditor can be used to visually edit INI files of TOPP tools.

The values can be edited by double-clicking or pressing F2.

The documentation of each value is shown in the text area on the bottom of the widget.

INIFileEditor

Glossary

A glossary of common terms used throughout OpenMS documentation.

aerosol

An aerosol is a suspension of fine solid particles or liquid droplets in air or another gas.

atom

An atom is the smallest unit of ordinary matter that forms a chemical element.

chromatogram

A two-dimensional plot that describes the amount of analyte eluted from a chromatography column versus the analyte's retention time. OpenMS represents a chromatogram using the class MSChromatogram.

collision-induced dissociation (CID)

A mass spectrometry technique to induce fragmentation of selected ions in the gas phase. Also known as Collision induced dissociation.

consensus feature

Features from replicate experiments with similar retention times and m/z values are linked and considered a consensus feature. A consensus feature contains information on the common retention time and m/z values as well as intensities for each sample. OpenMS represents a consensus feature using the class ConsensusFeature.

consensus map

A consensus map is a collection of consensus features identified from mass spectra across replicate experiments. One consensus map can contain many consensus features. OpenMS represents a consensus map using the class ConsensusMap.

de novo peptide sequencing

A peptide’s amino acid sequence is inferred directly from the precursor peptide mass and tandem mass spectrum (MS/MS or MS^3) fragment ions, without comparison to a reference proteome.

electrospray ionization

A technique used in mass spectrometry to produce ions using an electrospray in which a high voltage is applied to a liquid to create an aerosol.

electrospray ionization (ESI)

A technique used in mass spectrometry to produce ions.

FASTA format

A text-based format for representing nucleotide or amino acid sequences.

feature

An LC-MS feature represents the combined isotopic mass traces of a detected chemical compound. The chromatographic peak shape of a feature is defined by the interaction of the analyte with the LC column. Each feature contains information on retention time, mass-to-charge ratio, intensity and overall quality. OpenMS represents a feature using the class Feature.

feature map

A feature map is a collection of features identified in a mass spectrum from a single experiment. One feature map can contain many features. OpenMS represents a feature map using the class FeatureMap.

Galaxy

Server and browser-based interactive workflow editor and runner; see Workflow Editor.

HPLC-MS

Data produced by coupling high performance liquid chromatography (HPLC), which separates the components of a mixture, with mass spectrometry (MS), which offers the detection tools to identify them.

ion

Any atom or group of atoms that bears one or more positive or negative electrical charges. Positively charged ions are cations, negatively charged ions are anions.

iTRAQ

Stands for ‘Isobaric tags for relative and absolute quantitation’.

KNIME

An advanced workflow editor which OpenMS provides a plugin for; see Workflow Editor.

LC-MS

Liquid Chromatography-Mass Spectrometry.

Liquid chromatography

An analytical technique used to separate molecules.

LuciphorAdapter

Adapter for the LuciPHOr2: a site localisation tool of generic post-translational modifications from tandem mass spectrometry data. More information is available in the OpenMS API reference documentation.

m/z

mass to charge ratio.

Mascot

A so-called search engine: It identifies peptide sequences from MS/MS spectra. Please find more information in the TOPP Documentation.

Mass

Mass is a measure of the amount of matter that an object contains, in contrast to the often-used term weight, which is a measure of the force of gravity on that object.

mass spectrometry

An analytical technique used to identify and quantify molecules of interest.

mass spectrum

A mass spectrum is a plot of the ion signal as a function of the mass-to-charge ratio. A mass spectrum is produced by a single mass spectrometry run. These spectra are used to determine the elemental or isotopic signature of a sample, the masses of particles and of molecules, and to elucidate the chemical identity or structure of molecules and other chemical compounds. OpenMS represents a one dimensional mass spectrum using the class MSSpectrum.

MS

Mass Spectrometry

MS(1)

The first stage of acquiring spectra: a sample is injected into the mass spectrometer, ionized, accelerated and analyzed by mass spectrometry.

MS(2)

Ions from MS1 spectra are then selectively fragmented and analyzed by a second stage of mass spectrometry (MS2) to generate the spectra for the ion fragments.

MS/MS

Tandem mass spectrometry (MS^2), a technique where two or more mass analyzers are coupled together using an additional reaction step to increase their abilities to analyse chemical samples.

MS^2

See MS/MS.

MS^3

Multi-stage Mass Spectrometry

MSExperiment

An OpenMS class used to represent a single mass spectrometry run. Read the documentation for further information.

MSGFPlusAdapter

Adapter for the MS-GF+ protein identification (database search) engine. More information is available in the OpenMS API reference documentation.

mzData

mzData was the first attempt by the Proteomics Standards Initiative (PSI) from the Human Proteome Organization (HUPO) to create a standardized format for Mass Spectrometry data.[7] This format is now deprecated, and replaced by mzML.

mzML

The mzML format is an open, XML-based format for mass spectrometer output files, developed with the full participation of vendors and researchers in order to create a single open format that would be supported by all software.

mzXML

mzXML is an open data format for storage and exchange of mass spectroscopy data, developed at the SPC/Institute for Systems Biology.

Nextflow

Script/DSL-based workflow language, executor and utilities; see Workflow Editor.

Nightly Snapshot

Untested installers and containers are known as the nightly snapshot.

octadecyl (C18)

An alkyl radical C(18)H(37) derived from an octadecane by removal of one hydrogen atom.

OpenMS API

An interface that allows developers to use OpenMS core library classes and methods.

orbitrap analyzers

In mass spectrometry, an ion trap mass analyzer consisting of an outer barrel-like electrode and a coaxial inner spindle-like electrode that traps ions in an orbital motion around the spindle. A high-resolution mass spectrometry analyzer.

peak

A single raw data point in a chromatogram or a mass spectrum. OpenMS represents a peak in a chromatogram using the class ChromatogramPeak. OpenMS represents a single, one-dimensional peak in a mass spectrum using the class Peak1D.

PepNovo

PepNovo is a de novo sequencing algorithm for MS/MS spectra.

peptides

A short chain of amino acids.

proteins

Proteins are vital parts of living organisms, with many functions, for example composing the structural fibers of muscle to the enzymes that catalyze the digestion of food to synthesizing and replicating DNA.

proteomics

Proteomics is the large-scale study of proteins.

ProteoWizard

ProteoWizard is a set of open-source, cross-platform tools and libraries for proteomics data analyses. It provides a framework for unified mass spectrometry data file access and performs standard chemistry and LCMS dataset computations.

pyOpenMS

pyOpenMS is an open-source Python library for mass spectrometry, specifically for the analysis of proteomics and metabolomics data in Python. For the pyOpenMS documentation visit this link.

quadrupole mass filters

A mass filter allowing one mass channel at a time to reach the detector as the mass range is scanned.

retention time

retention time (RT) in liquid chromatography, is the time it takes for a separated analyte to move through the stationary phase.

RT

Retention time.

SILAC

Stands for ‘Stable isotope labeling using amino acids in cell culture’.

spectra

Plural of spectrum.

SRM

Selected reaction monitoring is a mass spectrometry technique for small molecule analysis.

SWATH

Stands for ‘Sequential acquisition of all theoretical fragment ion spectra’.

time-of-flight (TOF)

A measurement of the time taken by an object, particle or wave (be it acoustic, electromagnetic, etc.) to travel a distance through a medium.

TMT

Tandem Mass Tag (TMT) is a mass spectrometry based system designed to identify and quantify proteins in different samples.

TOPP

The OpenMS Pipeline, see TOPP Tools.

TOPP tool

see TOPP Tools

TOPP Tools

OpenMS provides a number of programs, called TOPP tools, that process mass spectrometry data. More information on TOPP tools can be found in the OpenMS API reference documentation.

TOPPAS

A graphical user interface (GUI), which is shipped with OpenMS, to create and execute workflows using TOPP tools; see Workflow Editor.

TOPPView

TOPPView is a viewer for MS and HPLC-MS data which ships with OpenMS. More information is available in TOPPView documentation.

KNIME

Using OpenMS in combination with KNIME, you can create, edit, open, save, and run workflows that combine TOPP tools with the powerful data analysis capabilities of KNIME. Workflows can be created conveniently in a graphical user interface. The parameters of all involved tools can be edited within the application and are also saved as part of the workflow. Furthermore, KNIME interactively performs validity checks during the workflow editing process, to make it more difficult to create an invalid workflow. Throughout most parts of this tutorial, you will use KNIME to create and execute workflows. The first step is to become familiar with KNIME. Additional information on the basic usage of KNIME can be found on the KNIME Getting Started page. However, the most important concepts will also be reviewed in this tutorial.

General Remarks

  • This handout will guide you through an introductory tutorial for the OpenMS/TOPP software package[1].

  • OpenMS[2],[3] is a versatile open-source library for mass spectrometry data analysis. Based on this library, we offer a collection of command-line tools ready to be used by end users. These so-called TOPP tools (short for ”The OpenMS Pipeline”)[4] can be understood as small building blocks of arbitrarily complex data analysis workflows.

  • In order to facilitate workflow construction, OpenMS was integrated into KNIME[5], the Konstanz Information Miner, an open-source integration platform providing a powerful and flexible workflow system combined with advanced data analytics, visualization, and report capabilities. Raw MS data as well as the results of data processing using TOPP can be visualized using TOPPView[6].

  • This tutorial was designed for use in a hands-on tutorial session but can also be worked through at home using the online resources. You will become familiar with some of the basic functionalities of OpenMS/TOPP, TOPPView, as well as KNIME and learn how to use a selection of TOPP tools used in the tutorial workflows.

  • If you are attending the tutorial and received a USB stick, all sample data referenced in this tutorial can be found in the C:\Example_Data folder, on the USB stick, or released online on our Archive.

Getting Started

Installation

Before we get started, we will install OpenMS with its viewer TOPPView, KNIME and the OpenMS KNIME plugin. If you take part in a live training session, you will likely have received a USB stick from us that contains the required data and software. If we provide laptops with the software, you may of course skip the installation process and continue reading the next section. If you are doing this tutorial online, choose online in the following tab(s).

If you are working through this tutorial at home/online, proceed with the following steps:

  • Download and install OpenMS using the installation instructions for the OpenMS tools.

    Note

    To install the graphical application, please use the downloadable installer for your platform, not conda, nor docker.

  • Download and install KNIME

Please choose the directory that matches your operating system and execute the installer.

For Windows, you run:

Note

The OpenMS installer for Windows now supports installing only for a single user. If you choose this option, the location of the tools will be different than the C:\Program Files\OpenMS-3.1.0 specified in this document. In most cases they will install to C:\Users\$YOUR_USER\AppData\Local\OpenMS-3.1.0, where $YOUR_USER is replaced with your username.

  • The OpenMS installer: Windows\OpenMS-3.1.0-Win64.exe

  • The KNIME installer: Windows\KNIME-5.2.0-Installer-64bit.exe

On macOS(x86), you run:

  • The OpenMS installer: Mac/OpenMS-3.1.0-macOS.dmg

  • The KNIME installer: Mac/knime_5.2.0.app.macosx.cocoa.x86_64.dmg

On macOS(arm), you run:

  • The OpenMS installer: Mac/OpenMS-3.1.0-macOS.dmg

  • The KNIME installer: Mac/knime_5.2.0.app.macosx.cocoa.aarch64.dmg

On Linux:

  • The OpenMS package: Linux/OpenMS-3.1.0-Debian-Linux-x86_64.deb can be installed with your package manager

  • The KNIME package can be extracted to a folder of your choice from knime_5.2.0.linux.gtk.x86_64.tar

Note

You can also install OpenMS via your package manager (version availability not guaranteed) or build it on your own with our build instructions.

KNIME Modern and Classic UI

Since version 5.0 KNIME has a new updated user interface. For the purposes of this tutorial we will continue to use the “classic user interface”. Depending on your OS KNIME may have started automatically in the Modern UI, which looks like the following:


Figure 5.5: The modern KNIME UI. To switch back to the classic UI, select “Menu” and click “Switch to classic user interface”

Plugin and dependency

Before we can start with the tutorial, we need to install all the required extensions for KNIME. Since KNIME 3.2.1, the program automatically detects missing plugins when you open a workflow, but to make sure that the right source for the OpenMS plugin is chosen, please follow the instructions here.

Required KNIME plugins

First, we install some additional extensions that are required by our OpenMS nodes or used in the Tutorials for downstream processing, visualization or reporting.

  1. In KNIME, click on Help > Install New Software.

  2. From the ‘Work with:’ drop-down list, select the update site ‘KNIME 5.2 - https://update.knime.com/analytics-platform/5.2’.

  3. Now select the following KNIME core plugins from the KNIME & Extensions category

  • KNIME Base Chemistry Types & Nodes

  • KNIME Chemistry Add-Ons

  • KNIME Interactive R Statistics Integration

  • KNIME Report Designer

  4. Click on Next and follow the instructions (it’s not necessary to restart KNIME now).

  5. Click again on Help > Install New Software.

  6. From the ‘Work with:’ drop-down list, select the update site ‘KNIME Community Extensions (Trusted) - https://update.knime.com/community-contributions/trusted/5.2’.

  7. From the “KNIME Community Contributions - Cheminformatics” category select

  • RDKit Nodes Feature

  8. From the “KNIME Community Extensions - Other” category select

  • Generic Workflow Nodes for KNIME

  9. Click on Next, follow the instructions, and after a restart of KNIME the dependencies will be installed.

R programming language and its KNIME integration

In addition, we need to install R for the statistical downstream analysis. Choose the directory that matches your operating system, double-click the R installer and follow the instructions. We recommend using the default settings whenever possible. On macOS you also need to install XQuartz from the same directory.

Afterwards open your R installation. If you use Windows, you will find an ”R x64 4.3.2” icon on your desktop. If you use macOS, you will find R in your Applications folder. In R, type the following lines (you might also copy them from the file R\install_R_packages.R on the USB stick):

install.packages('Rserve',,"http://rforge.net/",type="source")
install.packages("Cairo")

install.packages("devtools")
install.packages("ggplot2")
install.packages("ggfortify")

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install()
BiocManager::install(c("MSstats"))

In KNIME, click on File > Preferences, select the category KNIME > R and set the ”Path to R Home” to your installation path. You can use the following settings, if you installed R as described above:

  • Windows: C:\Program Files\R\R-4.3.2

  • macOS: /Library/Frameworks/R.framework/Versions/4.3/Resources

KNIME OpenMS plugin

You are now ready to install the OpenMS nodes.

  • In KNIME, click on Help > Install New Software

You now have to choose an update site to install the OpenMS plugin from. Which update site to choose depends on whether you received a USB stick in a hands-on tutorial or are doing this tutorial online.

To install the OpenMS KNIME plugin from the internet, do the following:

  1. From the ‘Work with:’ drop-down list, select the update site ‘KNIME Community Extensions (Trusted) - https://update.knime.com/community-contributions/trusted/5.2’.

  2. Now select the following plugins from the “KNIME Community Contributions - Bioinformatics & NGS” category

  • OpenMS

  • OpenMSThirdParty

  3. Click on Next, follow the instructions, and after a restart of KNIME the OpenMS nodes will be available in the Node Repository under “Community Nodes”.

Note

If this does not work for you, report it and, while waiting for a reply/fix, try to use an update site of an older KNIME version by editing the KNIME version number in the URL, or use our unofficial update site at https://abibuilder.cs.uni-tuebingen.de/archive/openms/knime-plugin/updateSite/release/latest

We included a custom KNIME update site to install the OpenMS KNIME plugins from the USB stick. If you do not have a stick available, please see below.

  • In the now open dialog choose Add (in the upper right corner of the dialog) to define a new update site. In the opening dialog enter the following details.

    Name: OpenMS 3.1.0 UpdateSite

    Location: file:/KNIMEUpdateSite/3.1.0/

  • After pressing OK KNIME will show you all the contents of the added Update Site.

Note

From now on, you can use this repository for plugins in the Work with: drop-down list.

  • Select the OpenMS nodes in the ”Uncategorized” category and click Next.

  • Follow the instructions and after a restart of KNIME the OpenMS nodes will be available in the Node repository under “Community Nodes”.

To install the nightly/experimental version of the OpenMS KNIME plugin from the internet, do the following:

Note

From now on, you can use this repository for plugins in the Work with: drop-down list.

  • Select the OpenMS nodes in the “Uncategorized” category and click Next.

  • Follow the instructions and after a restart of KNIME the OpenMS nodes will be available in the Node repository under “Community Nodes”.

Data conversion

Each MS instrument vendor has one or more formats for storing the acquired data. Converting these data into an open format (preferably mzML) is the very first step when you want to work with open-source mass spectrometry software. A freely available conversion tool is MSConvert, which is part of a ProteoWizard installation. All files used in this tutorial have already been converted to mzML by us, so you do not need to perform the data conversion yourself. However, we provide a small raw file so you can try the important step of raw data conversion for yourself.

Note

The OpenMS installation package for Windows automatically installs ProteoWizard, so you do not need to download and install it separately. Due to restrictions from the instrument vendors, file format conversion for most formats is only possible on Windows systems. In practice, performing the conversion to mzML on the acquisition PC connected to the instrument is usually the most convenient option.

To convert raw data to mzML using ProteoWizard you can either use MSConvertGUI (a graphical user interface) or msconvert (a simple command line tool).

msconvertgui

Figure 1: MSConvertGUI (part of ProteoWizard), allows converting raw files to mzML. Select the raw files you want to convert by clicking on the browse button and then on Add. Default parameters can usually be kept as-is. To reduce the initial data size, make sure that the peakPicking filter (converts profile data to centroided data (see Fig. 2)) is listed, enabled (true) and applied to all MS levels (parameter ”1-”). Start the conversion process by clicking on the Start button.

Both tools are available in: C:\Program Files\OpenMS-3.1.0\share\OpenMS\THIRDPARTY\pwiz-bin.

You can find a small RAW file on the USB stick in Example_Data\Introduction\datasets\raw.

MSConvertGUI

MSConvertGUI (see Fig. 1) exposes the main parameters for data conversion in a convenient graphical user interface.

msconvert

The msconvert command line tool has no graphical user interface but offers more options than the application MSConvertGUI. Additionally, since it can be used within a batch script, it allows converting large numbers of files and can be automated much more easily. To convert and pick the file raw_data_file.RAW you may write:

msconvert raw_data_file.RAW --filter "peakPicking true 1-"

in your command line.

profile centroided

Figure 2: The amount of data in a spectrum is reduced by peak picking. Here a profile spectrum (blue) is converted to centroided data (green). Most algorithms from this point on will work with centroided data.

To convert all RAW files in a folder, you may write:

msconvert *.RAW -o my_output_dir

Note

To display all options, you may type msconvert --help. Additional information is available on the ProteoWizard web page.

ThermoRawFileParser

Recently, the open-source, platform-independent ThermoRawFileParser tool has been developed. While ProteoWizard and MSConvert are only available for Windows systems, this new tool also allows converting raw data on macOS or Linux.

Note

To learn more about the ThermoRawFileParser and how to use it in KNIME see Minimal Workflow.

Overview of the graphical user interface


Figure 7: The KNIME workbench

The graphical user interface (GUI) of KNIME consists of different components, or so-called panels, that are shown in the image above. We will briefly introduce the individual panels and their purposes below.

Workflow Editor

The workflow editor is the central part of the KNIME GUI. Here you assemble the workflow by adding nodes from the Node Repository via ”drag & drop”. For quick creation of a workflow, note that double-clicking on a node in the repository automatically connects it to the selected node in the workbench (connecting all the inputs with as many fitting outputs of the last node). Manually, nodes can be connected by clicking on the output port of one node and dragging the edge until releasing the mouse at the desired input port of the next node. Deletions are possible by selecting nodes and/or edges and pressing DEL or Fn + Backspace depending on your OS and settings. Multiselection happens via dragging rectangles with the mouse or adding elements to the selection by clicking them while holding down Ctrl.

KNIME Explorer

Shows a list of available workflows (also called workflow projects). You can open a workflow by double-clicking it. A new workflow can be created with a right-click in the Workflow Explorer followed by choosing New KNIME Workflow from the appearing context menu. Remember to save your workflow often with the Ctrl + S shortcut.

Workflow Coach

Shows a list of suggested following nodes, based on the last added/clicked nodes. When you are not sure which node to choose next, this gives you a reasonable suggestion based on other users' behavior. Connect a suggestion to the last node with a double-click.

Node Repository

Shows all nodes that are available in your KNIME installation. Every plugin you install will provide new nodes that can be found here. The OpenMS nodes can be found in Community Nodes > OpenMS. Nodes to hook up to external search engines and the RawFileConverter are found under Community Nodes > OpenMSThirdParty. Nodes for managing files (e.g., Input Files or Output Folders) can be found in Community Nodes > GenericKnimeNodes. You can search the node repository by typing the node name into the small text box in the upper part of the node repository.

Outline

The Outline panel contains a small overview of the complete workflow. While of limited use when working on a small workflow, this feature is very helpful as soon as the workflows get bigger. You can adjust the zoom level of the explorer by adjusting the percentage in the toolbar at the top of KNIME.

Console

In the console panel, warning and error messages are shown. This panel will provide helpful information if one of the nodes failed or shows a warning sign.

Node Description

As soon as a node is selected, the Node Description window will show the documentation of the node, including all its parameters and especially its inputs and outputs, so that you know what types of data a node expects and produces. For OpenMS nodes you will also find a link to the tool page of the online documentation.

Creating workflows

Workflows can easily be created by a right click in the Workflow Explorer followed by clicking on New KNIME workflow.

Sharing workflows

To be able to share a workflow with others, KNIME supports the import and export of complete workflows. To export a workflow, select it in the Workflow Explorer and select File > Export KNIME Workflow. KNIME will export workflows as a knwf file containing all the information on nodes, their connections, and their parameter configuration.

Those knwf files can again be imported by selecting: File > Import KNIME Workflow

Note

For your convenience we added all workflows discussed in this tutorial to the Workflows folder on the USB Stick. Additionally, the workflow files can be found on workflow downloads. If you want to compare your own workflow to the solution, or if you got stuck, simply import the full workflow from the corresponding knwf file and then double-click it in your KNIME workflow repository to open it.

Duplicating workflows

In this tutorial, a lot of the workflows will be created based on the workflow from a previous task. To keep the intermediate workflows, we suggest you create copies of your workflows so you can see the progress. To create a copy of your workflow, save it, close it and follow the next steps.

  • Right click on the workflow you want to create a copy of in the Workflow Explorer and select Copy.

  • Right click again somewhere on the workflow explorer and select Paste.

  • This will create a workflow with the same name as the one you copied, with a (2) appended.

  • To distinguish them later on you can easily rename the workflows in the Workflow Explorer by right clicking on the workflow and selecting Rename.

Note

To rename a workflow it has to be closed.

Minimal Workflow

Let us start with the creation of a simple workflow. As a first step, we will gather some basic information about the data set before starting the actual development of a data analysis workflow. This minimal workflow can also be used to check if all requirements are met and that your system is compatible.

  • Create a new workflow.

  • Add a File Importer node and an Output Folder node (found in Community Nodes > GenericKnimeNodes > IO) and a FileInfo node (to be found in the category Community Nodes > OpenMS > File Handling) to the workflow.

  • Connect the File Importer node to the FileInfo node, and the first output port of the FileInfo node to the Output Folder node.

Tip

In case you are unsure about which node port to use, hovering the cursor over the port in question will display the port name and what kind of input it expects.

The complete workflow is shown in below image. FileInfo can produce two different kinds of output files.

A minimal workflow calling FileInfo on a single file.

Figure 8: A minimal workflow calling FileInfo on a single file.

  • All nodes are still marked red, since we are missing an actual input file. Double-click the File Importer node and select Browse. In the file system browser select Example_Data > Introduction > datasets > tiny > velos005614.mzML and click Open. Afterwards close the dialog by clicking Ok.

  • The File Importer node and the FileInfo node should now have switched to yellow, but the Output Folder node is still red. Double-click on the Output Folder node and click on Browse to select an output directory for the generated data.

  • Great! Your first workflow is now ready to be run. Press Shift + F7 (or the button with multiple green triangles in the KNIME toolbar) to execute the complete workflow. You can also right-click on any node of your workflow and select Execute from the context menu.

  • The traffic lights tell you about the current status of all nodes in your workflow. Currently running tools show either a progress in percent or a moving blue bar, nodes waiting for data show the small word “queued”, and successfully executed ones become green. If something goes wrong (e.g., a tool crashes), the light will become red.

  • In order to inspect the results, you can just right-click the Output Folder node and select View: Open the output folder. You can then open the text file and inspect its contents. You will find some basic information about the data contained in the mzML file, e.g., the total number of spectra and peaks, the RT and m/z range, and how many MS1 and MS2 spectra the file contains.

Workflows are typically constructed to process a large number of files automatically. As a simple example, consider that you would like to filter multiple mzML files to only include MS1 spectra. We will now modify the workflow so that it filters three different files and writes the output files to a folder.

  • We start from the previous workflow.

  • First we need to replace our single input file with multiple files. Therefore we add the Input Files node from the category Community Nodes > GenericKnimeNodes > IO.

  • To select the files we double-click on the Input Files node and click on Add. In the filesystem browser we select all three files from the directory Example_Data > Introduction > datasets > tiny, and close the dialog with Ok.

  • We now add two more nodes: the ZipLoopStart and the ZipLoopEnd node from the category Community Nodes > GenericKnimeNodes > Flow and replace the FileInfo node with FileFilter from Community Nodes > OpenMS > File Handling.

  • Afterwards we connect the Input Files node to the first port of the ZipLoopStart node, the first port of the ZipLoopStart node to the FileFilter node, the first output port of the FileFilter node to the first input port of the ZipLoopEnd node, and the first output port of the ZipLoopEnd node to the Output Folder node.

The complete workflow is shown in the top right of the figure below.

A minimal workflow calling the FileFilter on multiple mzML files in a loop

Figure 9: The FileFilter workflow. Showing the configure dialog for FileFilter, and the level selector pane.

Now we need to configure the FileFilter to only store MS1 data. To do this we double click on the FileFilter node to open the configuration dialog (see left pane above), double click “level”, select 2 from the sub-pane (see bottom right panel above), and click delete. Repeat the process for 3. Select OK to exit the sub-pane, and then OK again in the configuration dialog.

Execute the workflow and inspect the output as before.

Now, if you open the resulting files in TOPPView, you can see that only the MS1 spectra remain.

In case you had trouble understanding what ZipLoopStart and ZipLoopEnd do, here is a brief explanation:

  • The Input Files node passes a list of files to the ZipLoopStart node.

  • The ZipLoopStart node takes the files as input, but passes the single files sequentially (that is: one after the other) to the next node.

  • The ZipLoopEnd collects the single files that arrive at its input port. After all files have been processed, the collected files are passed again as file list to the next node that follows.

Advanced topic: Metanodes

Workflows can get rather complex and may contain dozens or even hundreds of nodes. KNIME provides a simple way to improve the handling and clarity of large workflows: Metanodes, which bundle several nodes into a single node.

Task

Select multiple nodes (e.g. all nodes of the ZipLoop including the start and end node). To select a set of nodes, draw a rectangle around them with the left mouse button or hold Ctrl to add/remove single nodes from the selection.

Tip

There is a Select Scope option when you right-click a node in a loop, that does exactly that for you. Then, open the context menu (right-click on a node in the selection) and select Create Metanode. Enter a caption for the Metanode. The previously selected nodes are now contained in the Metanode. Double-clicking on the Metanode will display the contained nodes in a new tab window.

Task

Convert the Metanode into a Component to let it behave like an encapsulated single node. First select the Metanode, open the context menu (right-click) and select Metanode > Convert to Component. The differences between Metanodes and Components are marginal (Components additionally allow exposing user inputs and encapsulating the contained nodes and workflow variables). Therefore, we suggest using standard Metanodes to clean up your workflow and cluster common subparts until you actually notice their limits.

Task

Undo the packaging. First select the Metanode/Component, open the context menu (right-click) and select Metanode/Component > Expand.

Label-free quantification of peptides and proteins
Introduction

In the following chapter, we will build a workflow with OpenMS / KNIME to quantify a label-free experiment. Label-free quantification is a method aiming to compare the relative amounts of proteins or peptides in two or more samples. We will start from the minimal workflow of the last chapter and, step-by-step, build a label-free quantification workflow.

The complete workflow can be downloaded here as well.

Peptide identification

As a start, we will extend the minimal workflow so that it performs a peptide identification using the Comet search engine. Comet is included in the OpenMS installation, so you do not need to download and install it yourself.

Let’s start by replacing the input files in our Input Files node with the three mzML files in Example_Data > Labelfree > datasets > lfq_spikein_dilution_1-3.mzML. This is a reduced toy dataset where each of the three runs contains a constant background of S. pyogenes peptides as well as human spike-in peptides in different concentrations. [1]

  • Instead of FileFilter, we want to perform Comet identification, so we simply replace the FileFilter node with the CometAdapter node (Community Nodes > OpenMSThirdParty > Identification), and we are almost done. Just make sure you have connected the ZipLoopStart node with the in (top) port of the CometAdapter node.

  • Comet, like most mass spectrometry identification engines, relies on searching the input spectra against sequence databases. Thus, we need to introduce a search database input. As we want to use the same search database for all of our input files, we can just add a single File Importer node to the workflow and connect it directly with the CometAdapter database (middle) port. KNIME will automatically reuse this input node each time a new ZipLoop iteration is started. In order to specify the database, select Example_Data > Labelfree > databases > s_pyo_sf370_potato_human_target_decoy_with_contaminants.fasta.

  • Connect the out port of the CometAdapter to ZipLoopEnd and we have a very basic peptide identification workflow.

    Note

    You might also want to save your new identification workflow under a different name. Have a look at duplicating workflows for information on how to create copies of workflows.

  • The result of a single Comet run is basically a number of peptide-spectrum matches (PSMs) with a score each, and these will be stored in an idXML file. Now we can run the pipeline and, after execution is finished, have a first look at the results: just open the output folder with a file browser and from there open one of the three input mzML files in TOPPView.

  • Here, annotate this spectrum data file with the peptide identification results. Choose Tools > Annotate with identification from the menu and select the idXML file that CometAdapter generated (it is located within the output directory that you specified when starting the pipeline).

  • On the right, select the tab Identification view. All identified peptides can be seen using this view. You can also browse the corresponding MS2 spectra.

    Note

    Opening the output file of CometAdapter (the idXML file) directly is also possible, but unless you REALLY like XML, reading idXML files is not very convenient.

  • The search results stored in the idXML file can also be read back into a KNIME table for inspection and subsequent analyses: add a TextExporter node from Community Nodes > OpenMS > File Handling to your workflow and connect the output port of your CometAdapter (the same port ZipLoopEnd is connected to) to its input port. This tool will convert the idXML file to a more human-readable text file, which can then be read into a KNIME table using the IDTextReader node. Add an IDTextReader node (Community Nodes > OpenMS > Conversion) after TextExporter and execute it. Now you can right-click IDTextReader and select ID Table to browse your peptide identifications.

  • From here, you can use all the tools KNIME offers for analyzing the data in this table. As a simple example, add a Histogram node (from the category Views) after IDTextReader, double-click it, select peptide_charge as Dimension, then click Save and Execute to generate a plot showing the charge state distribution of your identifications.
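
Alternatively, the same distribution can be plotted with an R View (Table) node (R integration is described in more detail later in this tutorial). This is a minimal sketch, assuming the identification table exposes the charge column under the name peptide_charge used in the Histogram dialog above:

# Histogram of peptide charge states from the identification table.
# The column name peptide_charge is an assumption taken from the Histogram
# dialog above and may differ in your IDTextReader output.
hist(knime.in$peptide_charge,
     main = "Charge state distribution",
     xlab = "peptide charge")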

In the next step, we will tweak the parameters of Comet to better reflect the instrument’s accuracy. Also, we will extend our pipeline with a false discovery rate (FDR) filter to retain only those identifications that will yield an FDR of < 1 %.

  • Open the configuration dialog of CometAdapter. The dataset was recorded using an LTQ Orbitrap XL mass spectrometer; set the precursor_mass_tolerance to 5 and precursor_error_units to ppm.

    Note

    Whenever you change the configuration of a node, the node as well as all its successors will be reset to the Configured state (all node results are discarded and need to be recalculated by executing the nodes again).

  • Make sure that Carbamidomethyl (C) is set as fixed modification and Oxidation(M) as variable modification.

    Note

    To add a modification click on the empty value field in the configuration dialog to open the list editor dialog. In the new dialog click Add. Then select the newly added modification to open the drop-down list where you can select the correct modification.

  • A common step in analysis is to search not only against a regular protein database, but also against a decoy database for FDR estimation. The fasta file we used before already contains such a decoy database. For OpenMS to know which Comet PSM came from which part of the file (i.e. target versus decoy), we have to index the results. To this end, extend the workflow with a PeptideIndexer node (Community Nodes > OpenMS > ID Processing). This node needs the idXML as input as well as the database file (see below figure).

    Tip

    You can direct the files of a File Importer node to more than just one destination port.

  • The decoys in the database are prefixed with “DECOY_”, so we have to set decoy_string to DECOY_ and decoy_string_position to prefix in the configuration dialog of PeptideIndexer.

  • Now we can go for the FDR estimation, which the FalseDiscoveryRate node will calculate for us (you will find it in Community Nodes > OpenMS > Identification Processing); a brief note on how this estimate works follows this list. FalseDiscoveryRate is meant to be run on data with protein inference (more on that later); in order to use it for peptides only, open the configuration dialog, select “show advanced parameter” and toggle “force” to true.

  • In order to set the FDR level to 1%, we need an IDFilter node from Community Nodes > OpenMS > Identification Processing. Configuring its parameter score→pep to 0.01 will do the trick. The FDR calculations (embedded in the idXML) from the FalseDiscoveryRate node will go into the in port of the IDFilter node.

  • Execute your workflow and inspect the results using IDTextReader like you did before. How many peptides did you identify at this FDR threshold?
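
As a brief note on the estimate used here: the target-decoy approach assumes that false target hits and decoy hits occur at roughly the same rate, so at a given score threshold the FDR among the accepted target hits is commonly estimated as

\[ \mathrm{FDR} \approx \frac{\#\text{decoy hits above threshold}}{\#\text{target hits above threshold}} \]

The IDFilter threshold of 0.01 therefore keeps only identifications for which this estimate, as written into the idXML by the FalseDiscoveryRate node, is at most 1 %.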

The image below shows the Comet ID pipeline including FDR filtering.

Comet ID pipeline including FDR filtering

Figure 12: Comet ID pipeline including FDR filtering

Bonus task: Identification using several search engines

Note

If you are ahead of the tutorial schedule, or later on at home, you can further improve your FDR-filtered identification workflow by a so-called consensus identification using several search engines. Otherwise, just continue with quantification.

It has become widely accepted that the parallel usage of different search engines can increase peptide identification rates in shotgun proteomics experiments. The ConsensusID algorithm is based on the calculation of posterior error probabilities (PEP) and a combination of the normalized scores by considering missing peptide sequences.

  • Next to the CometAdapter, add an XTandemAdapter node (Community Nodes > OpenMSThirdParty > Identification of Proteins/Peptides (SearchEngines)) and set its parameters and ports analogously to the CometAdapter. For X!Tandem, to get more evenly distributed scores, we decrease the number of candidates a bit by setting the precursor mass tolerance to 5 ppm and the fragment mass tolerance to 0.1 Da.

  • To calculate the PEP, introduce an IDPosteriorErrorProbability node (Community Nodes > OpenMS > Identification Processing) at the output of each ID engine adapter node. This will calculate the PEP for each hit and output an updated idXML.

  • To create a consensus, we must first merge these two files with a FileMerger node (Community Nodes > GenericKnimeNodes > Flow) so we can then merge the corresponding IDs with an IDMerger node (Community Nodes > OpenMS > File Handling).

  • Now we can create a consensus identification with the ConsensusID node (Community Nodes > OpenMS > Identification Processing). We can connect this to the PeptideIndexer and continue with our existing FDR filtering.

    Note

    By default, X!Tandem takes additional enzyme cutting rules into consideration (besides the specified tryptic digest). Thus for the tutorial files, you have to set PeptideIndexer’s enzyme→specificity parameter to none to accept X!Tandem’s non-tryptic identifications as well.

In the end, the ID processing part of the workflow can be collapsed into a Metanode to keep the structure clean (see below figure which shows complete consensus identification workflow).

Complete consensus identification workflow

Figure 13: Complete consensus identification workflow

Feature Mapping

Now that we have successfully constructed a peptide identification pipeline, we can assign this information to the corresponding feature signals.

  • Add a FeatureFinderCentroided node from Community Nodes > OpenMS > Quantitation which gets input from the first output port of the ZipLoopStart node. Also, add an IDMapper node (from Community Nodes > OpenMS > Identification Processing ) which receives input from the FeatureFinderCentroided node (Port 1) and the IDFilter node (Port 0). The output of the IDMapper node is then connected to an in port of the ZipLoopEnd node.

  • FeatureFinderCentroided finds and quantifies peptide ion signals contained in the MS1 data. It reduces the entire signal, i.e., all peaks explained by one and the same peptide ion signal, to a single peak at the maximum of the chromatographic elution profile of the monoisotopic mass trace of this peptide ion and assigns an overall intensity.

  • FeatureFinderCentroided produces a featureXML file as output, containing only quantitative information of so-far unidentified peptide signals. In order to annotate these with the corresponding ID information, we need the IDMapper node.

  • Run your pipeline and inspect the results of the IDMapper node in TOPPView. Open the mzML file of your data to display the raw peak intensities.

  • To assess how well the feature finding worked, you can project the features contained in the featureXML file onto the raw data contained in the mzML file. To this end, open the featureXML file in TOPPView by clicking on File > Open file and add it to a new layer (Open in New layer). The features are now visualized on top of your raw data. If you zoom in on a small region, you should be able to see the individual boxes around features that have been detected (see Fig. 14). If you hover over a feature centroid (the small circle indicating the chromatographic apex of the monoisotopic trace), additional information about the feature is displayed.

    Figure 14: Visualization of detected features (boxes) in TOPPView

    Note

    The chromatographic RT range of a feature is about 30-60 s and its m/z range around 2.5 m/z in this dataset. If you have trouble zooming in on a feature, select the full RT range and zoom only into the m/z dimension by holding down Ctrl (cmd ⌘ on macOS) and repeatedly dragging a narrow box from the very left to the very right.

  • You can see which features were annotated with a peptide identification by first selecting the featureXML file in the Layers window on the upper right side and then clicking on the icon with the letters A, B and C on the upper icon bar. Now, click on the small triangle next to that icon and select Peptide identification.

The following image shows the final constructed workflow:

Extended workflow featuring peptide identification and quantification

Figure 15: Extended workflow featuring peptide identification and feature mapping.

Combining features across several label-free experiments

So far, we successfully performed peptide identification as well as feature mapping on individual LC-MS runs. For differential label-free analyses, however, we need to identify and map corresponding signals in different experiments and link them together to compare their intensities. Thus, we will now run our pipeline on all three available input files and extend it a bit further, so that it is able to find and link features across several runs.

Complete identification and label-free quantification workflow

Figure 16: Complete identification and label-free feature mapping workflow. The identification nodes are grouped together as ID metanode.

  • To link features across several maps, we first have to align them to correct for retention time shifts between the different label-free measurements. With the MapAlignerPoseClustering node in Community Nodes > OpenMS > Map Alignment, we can align corresponding peptide signals to each other as closely as possible by applying a transformation in the RT dimension.

    Note

    MapAlignerPoseClustering consumes several featureXML files and its output should still be several featureXML files containing the same features, but with the transformed RT values. In its configuration dialog, make sure that OutputTypes is set to featureXML.

  • With the FeatureLinkerUnlabeledQT node in Community Nodes > OpenMS > Map Alignment, we can then perform the actual linking of corresponding features. Its output is a consensusXML file containing linked groups of corresponding features across the different experiments.

  • Since the overall intensities can vary a lot between different measurements (for example, because the amount of injected analytes was different), we apply the ConsensusMapNormalizer node in Community Nodes > OpenMS > Map Alignment as a last processing step. Configure its parameters by setting algorithm_type to median. It will then normalize the maps in such a way that the median intensity of all input maps is equal.

  • Export the resulting normalized consensusXML file to a csv format using the TextExporter node.

  • Use the ConsensusTextReader node in Community Nodes > OpenMS > Conversion to convert the output into a KNIME table. After running the node you can view the KNIME table by right-clicking on the ConsensusTextReader node and selecting Consensus Table. Every row in this table corresponds to a so-called consensus feature, i.e., a peptide signal quantified across several runs. The first couple of columns describe the consensus feature as a whole (average RT and m/z across the maps, charge, etc.). The remaining columns describe the exact positions and intensities of the quantified features separately for all input samples (e.g., intensity_0 is the intensity of the feature in the first input file). The last 11 columns contain information on peptide identification.

    Note

    You can specify the desired column separation character in the parameter settings (by default, it is set to “ ” (a space)). The output file of TextExporter can also be opened with external tools, e.g., Microsoft Excel, for downstream statistical analyses.

Basic data analysis in KNIME

In this section we are going to use the output of the ConsensusTextReader for downstream analysis of the quantification results:

  • Let’s say we want to plot the log intensity distributions of the human spike-in peptides for all input files. In addition, we will plot the intensity distributions of the background peptides.

  • As shown in Fig. 17, add a Row Splitter node (Data Manipulation > Row > Filter) after the ConsensusTextReader node. Double-click it to configure. The human spike-in peptides have accessions starting with “hum”. Thus, set the column to test to accessions, select pattern matching as the matching criterion, enter hum* into the corresponding text field, and check the contains wild cards box. Press OK and execute the node.

  • Row Splitter produces two output tables: the first one contains all rows from the input table matching the filter criterion, and the second table contains all other rows. You can inspect the tables by right-clicking and selecting Filtered and Filtered Out. The former table should now only contain peptides with a human accession, whereas the latter should contain all remaining peptides (including unidentified ones).

  • Now, since we only want to plot intensities, we can add a Column Filter node (Data Manipulation > Column > Filter). Connect its input port to the Filtered output port of the Row Splitter node, and open its configuration dialog. We could either manually select the columns we want to keep, or, more elegantly, select Wildcard/Regex Selection and enter intensity_? as the pattern. KNIME will interactively show you which columns your pattern applies to while you’re typing.

  • Since we want to plot log intensities, we will now compute the log of all intensity values in our table. The easiest way to do this in KNIME is with a small piece of R code. Add an R Snippet node (category R) after the Column Filter node and double-click it to configure. In the R Script text editor, enter the following code:

    x <- knime.in       # store copy of input table in x
    
    x[x == 0] <- NA     # replace all zeros by NA (= missing value)
    
    x <- log10(x)       # compute log of all values
    knime.out <- x      # write result to output table
    
  • Now we are ready to plot! Add a Box Plot (JavaScript) node (Views > JavaScript) after the R Snippet node, execute it, and open its view. If everything went well, you should see a significant fold change of your human peptide intensities across the three runs.

  • To verify that the concentration of background peptides is constant in all three runs, copy and paste the three nodes after Row Splitter and connect the duplicated Column Filter to the second output port (Filtered Out) of Row Splitter, as shown in Fig. 17. Execute and open the view of your second Box Plot.

You have now constructed an entire identification and label-free feature mapping workflow including a simple data analysis using KNIME. The final workflow should look like the workflow shown in the following image:

Simple KNIME data analysis example for LFQ

Figure 17: Simple KNIME data analysis example for LFQ

Extending the LFQ workflow by protein inference and quantification

We have made the following changes compared to the original label-free quantification workflow from the last chapter:

  • First, we have added a ProteinQuantifier node and connected its input port to the output port of the ConsensusMapNormalizer node.

  • This already enables protein quantification. ProteinQuantifier quantifies peptides by summarizing over all observed charge states and proteins by summarizing over their quantified peptides. It stores two output files, one for the quantified peptides and one for the proteins.

  • In this example, we consider only the protein quantification output file, which is written to the first output port of ProteinQuantifier.

  • Because there is no dedicated node in KNIME to read back the ProteinQuantifier output file format into a KNIME table, we have to use a workaround. Here, we have added an additional URI Port to Variable node which converts the name of the output file to a so-called “flow variable” in KNIME. This variable is passed on to the next node CSV Reader, where it is used to specify the name of the input file to be read. If you double-click on CSV Reader, you will see that the text field, where you usually enter the location of the CSV file to be read, is greyed out. Instead, the flow variable is used to specify the location, as indicated by the small green button with the “v=?” label on the right.

  • The table containing the ProteinQuantifier results is filtered one more time in order to remove decoy proteins. You can have a look at the final list of quantified protein groups by right-clicking the Row Filter and selecting Filtered.

  • By default, i.e., when the second input port protein_groups is not used, ProteinQuantifier quantifies proteins using only the unique peptides, which usually results in rather low numbers of quantified proteins.

  • In this example, however, we have performed protein inference using Fido and used the resulting protein grouping information to also quantify indistinguishable proteins. In fact, we also used a greedy method in FidoAdapter (parameter greedy_group_resolution) to uniquely assign the peptides of a group to the most probable protein(s) in the respective group. This boosts the number of quantifications but slightly raises the chances to yield distorted protein quantities.

  • As a prerequisite for using FidoAdapter, we have added an IDPosteriorErrorProbability node within the ID meta node, between the XTandemAdapter (note the replacement of OMSSA because of ill-calibrated scores) and PeptideIndexer. We have set its parameter prob_correct to true, so it computes posterior probabilities instead of posterior error probabilities (1 - PEP). These are stored in the resulting idXML file and later on used by the Fido algorithm. Also note that we excluded FDR filtering from the standard meta node. Harsh filtering before inference impacts the calibration of the results. Since we filter peptides before quantification though, no potentially random peptides will be included in the results anyway.

  • Next, we have added a third outgoing connection to our ID meta node and connected it to the second input port of ZipLoopEnd. Thus, KNIME will wait until all input files have been processed by the loop and then pass on the resulting list of idXML files to the subsequent IDMerger node, which merges all identifications from all idXML files into a single idXML file. This is done to get a unique assignment of peptides to proteins over all samples.

  • Instead of the meta node Protein inference with FidoAdapter, we could have just used a FidoAdapter node ( Community Nodes > OpenMS > Identification Processing). However, the meta node contains an additional subworkflow which, besides calling FidoAdapter, performs a statistical validation (e.g. (pseudo) receiver operating curves; ROCs) of the protein inference results using some of the more advanced KNIME and R nodes. The meta node also shows how to use MzTabExporter and MzTabReader.

Statistical validation of protein inference results

In the following section, we will explain the subworkflow contained in the Protein inference with FidoAdapter meta node.

Data preparation

For downstream analysis on the protein ID level in KNIME, it is again necessary to convert the idXML-file-format result generated from FidoAdapter into a KNIME table.

  • We use the MzTabExporter to convert the inference results from FidoAdapter to a human-readable, tab-separated mzTab file. mzTab contains multiple sections, which are all exported by default, if applicable. This file, with its different sections, can again be read by the MzTabReader, which produces one output in KNIME table format (triangle ports) for each section. Some ports might be empty if a section did not exist. Of course, we continue by connecting the downstream nodes with the protein section output (second port).

  • Since the protein section contains single proteins as well as protein groups, we filter them for single proteins with the standard Row Filter.

ROC curve of protein ID

ROC curves (Receiver Operating Characteristic curves) are graphical plots that visualize sensitivity (true-positive rate) against fall-out (false-positive rate). They are often used to judge the quality of a discrimination method, e.g., peptide or protein identification engines. The ROC Curve node already provides the functionality of drawing ROC curves for binary classification problems. When configuring this node, select the opt_global_target_decoy column as the class (i.e. target outcome) column. We want to find out how well our inferred protein probability discriminates between targets and decoys, therefore add best_search_engine_score[1] (the inference engine score is treated like a peptide search engine score) to the list of ”Columns containing positive class probabilities”. View the plot by right-clicking and selecting View: ROC Curves. A perfect classifier has an area under the curve (AUC) of 1.0 and its curve touches the upper left of the plot. However, in protein or peptide identification, the ground truth (i.e., which target identifications are true and which are false) is usually not known. Instead, so-called pseudo-ROC curves are regularly used to plot the number of target proteins against the false discovery rate (FDR) or its protein-centric counterpart, the q-value. The FDR is approximated by using the target-decoy estimate, in which decoy IDs serve as a proxy for false target IDs.

Posterior probability and FDR of protein IDs

ROC curves illustrate the discriminative capability of the scores of IDs. In the case of protein identifications, Fido produces the posterior probability of each protein as the output score. However, a perfect score should not only be highly discriminative (distinguishing true from false IDs), it should also be “calibrated”, meaning that all IDs with a reported posterior probability of 95% should have roughly a 5% chance of being false. This implies that the estimated number of false positives in a set can be computed as the sum of the posterior error probabilities (= 1 - posterior probability); dividing this by the number of proteins in the set yields a posterior-probability-estimated FDR, which can be compared to the actual target-decoy FDR. We can plot calibration curves to help us visualize the quality of the score (when the score is interpreted as a probability, as Fido does), by comparing how similar the target-decoy estimated FDR and the posterior probability estimated FDR are. Good results should show a close correspondence between these two measurements, although a non-correspondence does not necessarily indicate wrong results.

The calculation is done by using a simple R script in an R Snippet node. First, the target-decoy protein FDR is computed as the proportion of decoy proteins among all significant protein IDs. Then the posterior-probability-driven FDR is estimated as the average of the posterior error probabilities of all significant protein IDs. Since the FDR is a property of a group of protein IDs, we can also calculate a local property for each protein: the q-value of a certain protein ID is the minimum FDR of any group of protein IDs that contains it. We plot the protein ID results versus the two different kinds of FDR estimates in R View (Table) (see Fig. 22).
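
The following is a minimal R sketch of these two estimates, not the exact script embedded in the meta node. It assumes a data frame ids with a numeric column posterior (the Fido posterior probability) and a logical column is_decoy derived from the target-decoy annotation; both column names are placeholders for illustration.

# Sort protein IDs by decreasing posterior probability (best first)
ids <- ids[order(-ids$posterior), ]
n   <- seq_len(nrow(ids))

# Target-decoy FDR: proportion of decoys among the top n protein IDs
td_fdr  <- cumsum(ids$is_decoy) / n

# Posterior-probability-estimated FDR: mean posterior error probability
# (1 - posterior probability) among the top n protein IDs
pep_fdr <- cumsum(1 - ids$posterior) / n

# q-value: minimum target-decoy FDR of any ID set containing the protein
q_value <- rev(cummin(rev(td_fdr)))

# Pseudo-ROC: number of accepted protein IDs versus the two FDR estimates
plot(td_fdr, n, type = "l", xlab = "estimated FDR", ylab = "number of protein IDs")
lines(pep_fdr, n, lty = 2)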

The workflow of statistical analysis of protein inference results

Figure 21: The workflow of statistical analysis of protein inference results

The pseudo-ROC Curve of protein IDs

Figure 22: The pseudo-ROC curve of protein IDs. The accumulated number of protein IDs is plotted on two kinds of scales: target-decoy protein FDR and Fido posterior probability estimated FDR. The largest value of the posterior probability estimated FDR is already smaller than 0.04; this is because the posterior probabilities output by Fido are generally very high.

References
MSStats
R integration

KNIME provides a large number of nodes for a wide range of statistical analysis, machine learning, data processing, and visualization. Still, more recent statistical analysis methods, specialized visualizations or cutting edge algorithms may not be covered in KNIME. In order to expand its capabilities beyond the readily available nodes, external scripting languages can be integrated. In this tutorial, we primarily use scripts of the powerful statistical computing language R. Note that this part is considered advanced and might be difficult to follow if you are not familiar with R. In this case you might skip this part.

The R View (Table) node allows you to seamlessly include R scripts in KNIME. We will demonstrate with a minimal example how such a script is integrated.

Task

First we need some example data in KNIME, which we will generate using the Data Generator node (IO > Other > Data Generator). You can keep the default settings and execute the node. The table contains four columns, each containing random coordinates, and one column containing a cluster number (Cluster_0 to Cluster_3). Now place an R View (Table) node into the workflow and connect the upper output port of the Data Generator node to the input of the R View (Table) node. Right-click and configure the node. If you get an error message like Execute failed: R_HOME does not contain a folder with name ’bin’. or Execution failed: R Home is invalid.: please change the R settings in the preferences. To do so open File > Preferences > KNIME > R and enter the path to your R installation (the folder that contains the bin directory, e.g., C:\Program Files\R\R-3.4.3).

If you get an error message like Execute failed: Could not find Rserve package. Please install it in your R installation by running ”install.packages(’Rserve’)”., you may need to run your R binary as administrator (in Windows Explorer: right-click > Run as administrator) and enter install.packages(’Rserve’) to install the package.

If R is correctly recognized we can start writing an R script. Consider that we are interested in plotting the first and second coordinates and color them according to their cluster number. In R this can be done in a single line. In the R view (Table) text editor, enter the following code:

plot(x=knime.in$Universe_0_0, y=knime.in$Universe_0_1, main="Plotting column Universe_0_0 vs. Universe_0_1", col=knime.in$"Cluster Membership")

Explanation: The table provided as input to the R View (Table) node is available as R data.frame with name knime.in. Columns (also listed on the left side of the R View window) can be accessed in the usual R way by first specifying the data.frame name and then the column name (e.g., knime.in$Universe_0_0). plot is the plotting function we use to generate the image. We tell it to use the data in column Universe_0_0 of the dataframe object knime.in (denoted as knime.in$Universe_0_0) as x-coordinate and the other column knime.in$Universe_0_1 as y-coordinate in the plot. main is simply the main title of the plot and col the column that is used to determine the color (in this case it is the Cluster Membership column).

Now press the Eval script and Show plot buttons.

Note

Note that we needed to put some extra quotes around Cluster Membership. If we omit those, R would interpret the column name only up to the first space (knime.in$Cluster) which is not present in the table and leads to an error. Quotes are regularly needed if column names contain spaces, tabs or other special characters like $ itself.

Using MSstats in a KNIME workflow

The R package MSstats can be used for statistical relative quantification of proteins and peptides in mass spectrometry-based proteomics. Supported are label-free as well as labeled experiments in combination with data-dependent, targeted and data-independent acquisition. Inputs can be identified and quantified entities (peptides or proteins) and the output is a list of differentially abundant entities, or summaries of their relative abundance. It depends on accurate feature detection, identification and quantification, which can be performed e.g. by an OpenMS workflow. MSstats can be used for data processing & visualization, as well as statistical modeling & inference. Please see [1] and the MSstats website for further information.

Identification and quantification of the iPRG2015 data with subsequent MSstats analysis

Here, we describe how to use OpenMS and MSstats for the analysis of the ABRF iPRG2015 dataset[2].

Note

Reanalysing the full dataset from scratch would take too long. In the following tutorial, we will focus on just the conversion process and the downstream analysis.

Dataset

The iPRG (Proteome Informatics Research Group) dataset from the study in 2015, as described in [2], aims at evaluating the effect of statistical analysis software on the accuracy of results on a proteomics label-free quantification experiment. The data is based on four artificial samples with known composition (background: 200 ng S. cerevisiae). These were spiked with different quantities of individual digested proteins, whose identifiers were masked for the competition as yeast proteins in the provided database (see Table 1).

Table 1: Samples (background: 200 ng S. cerevisiae) with spiked-in proteins in different quantities [fmol]

Name Origin Molecular Weight Sample 1 Sample 2 Sample 3 Sample 4
A Ovalbumin Egg White 45 kDa 65 55 15 2
B Myoglobin Equine Heart 17 kDa 55 15 2 65
C Phosphorylase b Rabbit Muscle 97 kDa 15 2 65 55
D Beta-Galactosidase Escherichia Coli 116 kDa 2 65 55 15
E Bovine Serum Albumin Bovine Serum 66 kDa 11 0.6 10 500
F Carbonic Anhydrase Bovine Erythrocytes 29 kDa 10 500 11 0.6
Identification and quantification

KNIME data analysis of iPRG LFQ data

Figure 18: KNIME data analysis of iPRG LFQ data.

The iPRG LFQ workflow (Fig. 18) consists of an identification and a quantification part. The identification is achieved by searching the computationally calculated MS2 spectra from a sequence database (File Importer node, here with the given database from iPRG: Example_Data > iPRG2015 > database > iPRG2015_target_decoy_nocontaminants.fasta) against the MS2 spectra from the original data (Input Files node with all mzML files matching Example_Data > iPRG2015 > datasets > JD_06232014_sample*.mzML) using the CometAdapter.

Note

If you want to reproduce the results at home, you have to download the iPRG data in mzML format and perform peak picking on it or convert and pick the raw data with msconvert.

Afterwards, the results are scored using the FalseDiscoveryRate node and filtered to obtain only unique peptides (IDFilter), since MSstats does not support shared peptides yet. The quantification is achieved by using the FeatureFinderCentroided node, which performs the feature detection on the samples (maps). In the end the quantification results are combined with the filtered identification results (IDMapper). In addition, a linear retention time alignment is performed (MapAlignerPoseClustering), followed by the feature linking process (FeatureLinkerUnlabeledQT). The ConsensusMapNormalizer is used to normalize the intensities via robust regression over a set of maps, and the IDConflictResolver assures that only one identification (best score) is associated with a feature. The output of this workflow is a consensusXML file, which can now be converted using the MSStatsConverter (see the Conversion and downstream analysis section).

Experimental design

As mentioned before, the downstream analysis can be performed using MSstats. In this case, an experimental design has to be specified for the OpenMS workflow. The structure of the experimental design used in OpenMS in case of the iPRG dataset is specified in Table 2.

Table 2: OpenMS Experimental design for the iPRG2015 dataset.
Fraction_Group Fraction Spectra_Filepath Label Sample
1 1 Sample1-A 1 1
2 1 Sample1-B 1 2
3 1 Sample1-C 1 3
4 1 Sample2-A 1 4
5 1 Sample2-B 1 5
6 1 Sample2-C 1 6
7 1 Sample3-A 1 7
8 1 Sample3-B 1 8
9 1 Sample3-C 1 9
10 1 Sample4-A 1 10
11 1 Sample4-B 1 11
12 1 Sample4-C 1 12
Sample MSstats_Condition MSstats_BioReplicate
1 1 1
2 1 2
3 1 3
4 2 4
5 2 5
6 2 6
7 3 7
8 3 8
9 3 9
10 4 10
11 4 11
12 4 12

An explanation of the variables can be found in Table 3.

Table 3: Explanation of the columns of the experimental design table

Fraction_Group: Index used to group fractions and source files.
Fraction: 1st, 2nd, ..., fraction. Note: All runs must have the same number of fractions.
Spectra_Filepath: Path to the mzML files.
Label: label-free: always 1; TMT6Plex: 1...6; SILAC with light and heavy: 1..2.
Sample: Index of the sample measured in the specified label X, in fraction Y of fraction group Z.
Conditions: Further specification of different conditions (e.g. MSstats_Condition, MSstats_BioReplicate).

The conditions are highly dependent on the type of experiment and on which kind of analysis you want to perform. For the MSstats analysis, the information about which sample belongs to which condition, and whether there are biological replicates, is mandatory. This can be specified in further condition columns as explained in Table 3. For a detailed description of the MSstats-specific terminology, see its documentation, e.g. in the R vignette.

Conversion and downstream analysis

Conversion of the OpenMS-internal consensusXML format (which is an aggregation of quantified and possibly identified features across several MS-maps) to a table (in MSstats-conformant CSV format) is very easy. First, create a new KNIME workflow. Then, run the MSStatsConverter node with a consensusXML and the manually created (e.g. in Excel) experimental design as inputs (loaded via File Importer nodes). The first input can be found in:

Example_Data > iPRG2015 > openmsLFQResults > iPRGlfq.consensusXML

This file was generated by using the Workflows > openmsLFQiPRG2015.knwf workflow (seen in Fig. 18). The second input is specified in:

Example_Data > iPRG2015 > experimental_design.tsv

Adjust the parameters in the config dialog of the converter to match the given experimental design file and to use a simple summing for peptides that elute in multiple features (with the same charge state, i.e. m/z value).

parameter value
msstats_bioreplicate MSstats_Bioreplicate
msstats_condition MSstats_Condition
labeled_reference_peptides false
retention_time_summarization_method (advanced) sum

The downstream analysis of the peptide ions with MSstats is performed in several steps. These steps are reflected by several KNIME R nodes, which consume the output of MSStatsConverter. The outline of the workflow is shown in Figure 19.

MSstats analysis using KNIME

Figure 19: MSstats analysis using KNIME. The individual steps (Preprocessing, Group Comparisons, Result Data Renaming, and Export) are split among several consecutive nodes.

We load the file resulting from MSStatsConverter by saving it with an Output File node and reloading it with the File Reader node. Alternatively, for advanced users, you can use a URI Port to Variable node and use the variable in the File Reader config dialog (V button, located on the right of the Browse button) to read from the temporary file.

Preprocessing

The first node (Table to R) loads MSstats as well as the data from the previous KNIME node and performs a preprocessing step on the input data. The following inline R script needs to be pasted into the config dialog of the node:

library(MSstats)
data <- knime.in
quant <- OpenMStoMSstatsFormat(data, removeProtein_with1Feature = FALSE)

The inline R script allows further preparation of the data produced by MSStatsConverter before the actual analysis is performed. In this example, the lines with proteins, which were identified with only one feature, were retained. Alternatively they could be removed. In the same node, most importantly, the following line transforms the data into a format that is understood by MSstats:

processed.quant <- dataProcess(quant, censoredInt = 'NA')

Here, dataProcess is one of the most important functions that the R package provides. The function performs the following steps:

  1. Logarithm transformation of the intensities

  2. Normalization

  3. Feature selection

  4. Missing value imputation

  5. Run-level summarization

In this example, we just state that missing intensity values are represented by the NA string.
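
If you want to verify what dataProcess() produced before continuing, you can inspect the returned object directly in the same node. This step is optional, and the exact component names depend on the installed MSstats version:

# Show the top-level structure of the dataProcess() result
# (component names vary between MSstats versions)
str(processed.quant, max.level = 1)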

Group Comparison

The goal of the analysis is the determination of differentially-expressed proteins among the different conditions C1-C4. We can specify the comparisons that we want to make in a comparison matrix. For this, let’s consider the following example:

comparison matrix

This matrix has the following properties:

  • The number of rows equals the number of comparisons that we want to perform, the number of columns equals the number of conditions (here, column 1 refers to C1, column 2 to C2 and so forth).

  • The entries of each row consist of exactly one 1 and one -1, the others must be 0.

  • The condition with the entry 1 constitutes the numerator of the log2 fold-change; the one with entry -1 denotes the denominator. Hence, the first row states that we want to calculate:

\[ \log_2 \frac{C_{2}}{C_{1}} \]

We can generate such a matrix in R using the following code snippet in (for example) a new R to R node that takes over the R workspace from the previous node with all its variables:

comparison1<-matrix(c(-1,1,0,0),nrow=1)   
comparison2<-matrix(c(-1,0,1,0),nrow=1)

comparison3<-matrix(c(-1,0,0,1),nrow=1)  
comparison4<-matrix(c(0,-1,1,0),nrow=1)

comparison5<-matrix(c(0,-1,0,1),nrow=1)  
comparison6<-matrix(c(0,0,-1,1),nrow=1)

comparison <- rbind(comparison1, comparison2, comparison3, comparison4, comparison5, comparison6)
row.names(comparison)<-c("C2-C1","C3-C1","C4-C1","C3-C2","C4-C2","C4-C3")

Here, we assemble each row in turn, concatenate them at the end, and provide row names for labeling the rows with the respective condition. In MSstats, the group comparison is then performed with the following line:

test.MSstats <- groupComparison(contrast.matrix=comparison, data=processed.quant)

No more parameters need to be set for performing the comparison.

Result processing

In the next R to R node, the results are processed. The following code snippet renames the spiked-in proteins to A, B, C, D, E, and F and removes the names of the other proteins, which will be beneficial for the subsequent visualization, as for example performed in Figure 20:

library(dplyr)  # provides bind_rows()

test.MSstats.cr <- test.MSstats$ComparisonResult

# Rename spiked-ins to A, B, C, ...
pnames <- c("A", "B", "C", "D", "E", "F")
names(pnames) <- c(
  "sp|P44015|VAC2_YEAST",
  "sp|P55752|ISCB_YEAST",
  "sp|P44374|SFG2_YEAST",
  "sp|P44983|UTR6_YEAST",
  "sp|P44683|PGA4_YEAST",
  "sp|P55249|ZRT4_YEAST"
)

# Collect the comparison results of the six spike-in proteins
test.MSstats.cr.spikedins <- bind_rows(
  test.MSstats.cr[grep("P44015", test.MSstats.cr$Protein), ],
  test.MSstats.cr[grep("P55752", test.MSstats.cr$Protein), ],
  test.MSstats.cr[grep("P44374", test.MSstats.cr$Protein), ],
  test.MSstats.cr[grep("P44683", test.MSstats.cr$Protein), ],
  test.MSstats.cr[grep("P44983", test.MSstats.cr$Protein), ],
  test.MSstats.cr[grep("P55249", test.MSstats.cr$Protein), ]
)

# Rename proteins: spike-ins get their letter code
test.MSstats.cr.spikedins$Protein <- sapply(test.MSstats.cr.spikedins$Protein,
                                            function(x) pnames[as.character(x)])

# All other proteins lose their name (empty string)
test.MSstats.cr$Protein <- sapply(test.MSstats.cr$Protein, function(x) {
  x <- as.character(x)
  if (x %in% names(pnames)) {
    return(pnames[x])
  } else {
    return("")
  }
})

Export

The last four nodes, each connected and making use of the same workspace from the last node, will export the results to a textual representation and volcano plots for further inspection. Firstly, quality control can be performed with the following snippet:

qcplot <- dataProcessPlots(processed.quant, type="QCplot",
                           ylimDown=0,
                           which.Protein='allonly',
                           width=7, height=7, address=F)

The code for this snippet is embedded in the first output node of the workflow. The resulting boxplots show the log2 intensity distribution across the MS runs. The second node is an R View (Workspace) node that returns a Volcano plot which displays differentially expressed proteins between conditions C2 vs. C1. The plot is described in more detail in the following Result section. This is how you generate it:

groupComparisonPlots(data=test.MSstats.cr, type="VolcanoPlot",
                     width=12, height=12, dot.size=2, ylimUp=7,
                     which.Comparison="C2-C1",
                     address=F)

The last two nodes export the MSstats results as a KNIME table for potential further analysis or for writing it to a (e.g. csv) file. Note that you could also write output inside the Rscript if you are familiar with it. Use the following for an R to Table node exporting all results:

knime.out <- test.MSstats.cr

And this for an R to Table node exporting only results for the spike-ins:

knime.out <- test.MSstats.cr.spikedins
Result

An excerpt of the main result of the group comparison can be seen in Figure 20.

Volcano plots c2_c1 Volcano plots c3_c2

Figure 20: Volcano plots produced by the group comparison in MSstats. The dotted line indicates an adjusted p-value threshold.

The Volcano plots show differentially expressed spiked-in proteins. In the left plot, which shows the fold-change C2-C1, we can see that the proteins D and F (sp|P44983|UTR6_YEAST and sp|P55249|ZRT4_YEAST) are significantly over-expressed in C2, while the proteins B, C, and E (sp|P55752|ISCB_YEAST, sp|P44374|SFG2_YEAST, and sp|P44683|PGA4_YEAST) are under-expressed. In the right plot, which shows the fold-change ratio of C3 vs. C2, we can see the proteins E and C (sp|P44683|PGA4_YEAST and sp|P44374|SFG2_YEAST) over-expressed and the proteins A and F (sp|P44015|VAC2_YEAST and sp|P55249|ZRT4_YEAST) under-expressed. The plots also show further differentially-expressed proteins, which do not belong to the spiked-in proteins.

The full analysis workflow can be found under: Workflows > MSstats_statPostProcessing_iPRG2015.knwf

Isobaric analysis workflow

In the last chapters, we identified and quantified peptides in a label-free experiment.

In this section, we would like to introduce a possible workflow for the analysis of isobaric data. Let’s have a look at the workflow (see Fig 23).

Workflow for the analysis of isobaric data

Figure 23: Workflow for the analysis of isobaric data

The full analysis workflow can be found here: Workflows > Identification_quantification_isobaric_inference_epifany_MSstatsTMT

The workflow has four input nodes. The first is for the experimental design, to allow for MSstatsTMT-compatible export (MSStatsConverter). The second is for the .mzML files with the centroided spectra from the isobaric labeling experiment, and the third one is for the .fasta database used for identification. The last one allows specifying an output path for the plots generated by the R View, which runs MSstatsTMT (I). The quantification (A) is performed using the IsobaricAnalyzer. The tool is able to extract and normalize quantitative information from TMT and iTRAQ data. The values can be extracted from centroided MS2 or MS3 spectra (if available). Isotope correction is performed based on the specified correction matrix (as provided by the manufacturer). The identification (C) is applied as known from the previous chapters, using database search and a target-decoy database.

To reduce the complexity of the data for later inference, the q-value estimation and FDR filtering is performed on PSM level for each file individually (B). Afterwards the identification (PSM) and quantitative information is combined using the IDMapper. After the processing of all available files, the intermediate results are aggregated (FileMerger - D). All PSM results are used for score estimation and protein inference (Epifany - E). For detailed information about protein inference, please see Chapter 4. Then, decoys are removed and the inference results are filtered via a protein group FDR. Peptide-level results can be exported via MzTabExporter (F), protein-level results can be obtained via the ProteinQuantifier (G), or the results can be exported (MSStatsConverter - H) and further processed with the following R pipeline to allow for downstream processing using MSstatsTMT.

Please import the workflow from WorkflowsIdentificationquantificationisobaricinferenceepifanyMSstatsTMT into KNIME via the menu entry File > Import KNIME workflow > Select file and double-click the imported workflow in order to open it. Before you can execute the workflow, you have to correct the locations of the files in the Input Files nodes (don’t forget the one for the FASTA database inside the “ID” meta node). Try and run your workflow by executing all nodes at once.

Excursion MSstatsTMT

The R package MSstatsTMT can be used for protein significance analysis in shotgun mass spectrometry-based proteomic experiments with tandem mass tag (TMT) labeling. MSstatsTMT provides functionality for two types of analysis & their visualization: Protein summarization based on peptide quantification and Model-based group comparison to detect significant changes in abundance. It depends on accurate feature detection, identification and quantification which can be performed e.g. by an OpenMS workflow.

In general, MSstatsTMT can be used for data processing & visualization, as well as statistical modeling. Please see [3] and the MSstats website for further information.

There is also an online lecture and tutorial for MSstatsTMT from the May Institute Workshop 2020.

Dataset and experimental design

We are using the MSV000084264 ground truth dataset, which consists of TMT10plex controlled mixes of differently concentrated UPS1 peptides spiked into SILAC HeLa peptides, measured in a dilution series (https://www.omicsdi.org/dataset/massive/MSV000084264). Figure 24 shows the experimental design. In this experiment, 5 different TMT10plex mixtures – different labeling strategies – were analysed. These were measured in triplicates, represented by the 15 MS runs (3 runs each). The example data, database and experimental design to run the workflow can be found here.

Experimental Design

Figure 24: Experimental Design

The experimental design in table format allows for MSstatsTMT compatible export. The design is represented by two tables. The first one (Table 4) represents the overall structure of the experiment in terms of samples, fractions, labels and fraction groups. The second one (Table 5) adds to the first by specifying conditions, biological replicates as well as mixtures and labels for each channel. For additional information about the experimental design please see Table 3.

Table 4: Experimental Design 1
Spectra_Filepath Fraction Label Fraction_Group Sample
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_01.mzML 1 1 1 1
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_01.mzML 1 2 1 2
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_01.mzML 1 3 1 3
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_01.mzML 1 4 1 4
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_01.mzML 1 5 1 5
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_01.mzML 1 6 1 6
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_01.mzML 1 7 1 7
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_01.mzML 1 8 1 8
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_01.mzML 1 9 1 9
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_01.mzML 1 10 1 10
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_02.mzML 1 1 2 11
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_02.mzML 1 2 2 12
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_02.mzML 1 3 2 13
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_02.mzML 1 4 2 14
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_02.mzML 1 5 2 15
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_02.mzML 1 6 2 16
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_02.mzML 1 7 2 17
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_02.mzML 1 8 2 18
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_02.mzML 1 9 2 19
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_02.mzML 1 10 2 20
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_03.mzML 1 1 3 21
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_03.mzML 1 2 3 22
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_03.mzML 1 3 3 23
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_03.mzML 1 4 3 24
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_03.mzML 1 5 3 25
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_03.mzML 1 6 3 26
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_03.mzML 1 7 3 27
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_03.mzML 1 8 3 28
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_03.mzML 1 9 3 29
161117_SILAC_HeLa_UPS1_TMT10_SPS_MS3_Mixture1_03.mzML 1 10 3 30

Table 5: Experimental Design 2
Sample MSstats_Condition MSstats_BioReplicate MSstats_Mixture LabelName
1 Norm Norm 1 126
2 0.667 0.667 1 127N
3 0.125 0.125 1 127C
4 0.5 0.5 1 128N
5 1 1 1 128C
6 0.125 0.125 1 129N
7 0.5 0.5 1 129C
8 1 1 1 130N
9 0.667 0.667 1 130C
10 Norm Norm 1 131
11 Norm Norm 1 126
12 0.667 0.667 1 127N
13 0.125 0.125 1 127C
14 0.5 0.5 1 128N
15 1 1 1 128C
16 0.125 0.125 1 129N
17 0.5 0.5 1 129C
18 1 1 1 130N
19 0.667 0.667 1 130C
20 Norm Norm 1 131
21 Norm Norm 1 126
22 0.667 0.667 1 127N
23 0.125 0.125 1 127C
24 0.5 0.5 1 128N
25 1 1 1 128C
26 0.125 0.125 1 129N
27 0.5 0.5 1 129C
28 1 1 1 130N
29 0.667 0.667 1 130C
30 Norm Norm 1 131

After running the workflow, the MSStatsConverter will convert the OpenMS output together with the experimental design into a file (.csv), which can then be processed using MSstatsTMT.

MSstatsTMT analysis

Here, we depict the analysis by MSstatsTMT using a segment of the isobaric analysis workflow (Fig. 25). The segment is available as WorkflowsMSstatsTMT.knwf.

MSstatsTMT workflow segment

Figure 25: MSstatsTMT workflow segment

There are two input nodes, the first one takes the result (.csv) from the MSStatsConverter and the second a path to the directory where the plots generated by MSstatsTMT should be saved. The R source node loads the required packages, such as dplyr for data wrangling, MSstatsTMT for analysis and MSstats for plotting. The inputs are further processed in the R View node.

Here, the data of the File Importer is loaded into R using the flow variable ["URI-0"]:

# strip the "file:" prefix of the KNIME URI to obtain a local file path
file <- substr(knime.flow.in[["URI-0"]], 6, nchar(knime.flow.in[["URI-0"]]))

# read the MSStatsConverter output
MSstatsConverter_OpenMS_out <- read.csv(file)
data <- MSstatsConverter_OpenMS_out

The OpenMStoMSstatsTMTFormat function preprocesses the OpenMS report and converts it into the required input format for MSstatsTMT, by filtering based on unique peptides and measurements in each MS run.

processed.data <- OpenMStoMSstatsTMTFormat(data)

Afterwards, different normalization steps (global, protein, runs) as well as data imputation are performed using the msstats method. In addition, peptide-level data is summarized to protein-level data.

quant.data <- proteinSummarization(processed.data,
                                   method = "msstats",
                                   global_norm = TRUE,
                                   reference_norm = TRUE,
                                   MBimpute = TRUE,
                                   maxQuantileforCensored = NULL,
                                   remove_norm_channel = TRUE,
                                   remove_empty_channel = TRUE)

There are a lot of different possibilities to configure this method; please have a look at the MSstatsTMT package documentation for additional detailed information.

The next step is the comparison of the different conditions; here, either a pairwise comparison can be performed or a contrast matrix can be created. The goal is to detect and compare the UPS peptides spiked in at different concentrations.

# prepare contrast matrix
unique(quant.data$Condition)

comparison <- matrix(c(-1, 0, 0, 1,
                        0,-1, 0, 1,
                        0, 0,-1, 1,
                        0, 1,-1, 0,
                        1,-1, 0, 0), nrow = 5, byrow = TRUE)

# Set the names of each row
row.names(comparison) <- contrasts <- c("1-0125",
                                        "1-05",
                                        "1-0667",
                                        "05-0667",
                                        "0125-05")

# Set the column names
colnames(comparison) <- c("0.125", "0.5", "0.667", "1")

The constructed contrast matrix is used in the groupComparisonTMT function to test for significant changes in protein abundance across conditions based on a family of linear mixed-effects models in TMT experiments.

data.res <- groupComparisonTMT(data = quant.data,
                               contrast.matrix = comparison,
                               moderated = TRUE,    # do moderated t test
                               adj.method = "BH")   # multiple comparison adjustment
data.res <- data.res %>% filter(!is.na(Protein))

In the next step the comparison can be plotted using the groupComparisonPlots function by MSstats.

library(MSstats)
groupComparisonPlots(data = data.res, type = "VolcanoPlot", address = F, which.Comparison = "0125-05", sig = 0.05)

Here, we have an example output of the R View, which depicts the significantly regulated UPS proteins in the comparison of 0.125 to 0.5 (Fig. 26).

Volcano plot of the group comparison 0125-05

Figure 26: Volcano plot of the group comparison 0125-05 (0.125 vs. 0.5)

In addition, all plots are saved to the output directory specified at the beginning.
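The code that writes these plots to disk could look roughly like the sketch below. It is only an illustration: "URI-1" is an assumed name for the flow variable of the second input node (check your own workflow for the actual name), and it reuses the data.res table from above. When address is given a character prefix, MSstats writes the plots as files into the working directory instead of rendering them in the R View.

# sketch only: save the volcano plots into the user-supplied output directory
# "URI-1" is an assumed flow variable name for the plot output path
out.dir <- substr(knime.flow.in[["URI-1"]], 6, nchar(knime.flow.in[["URI-1"]]))
setwd(out.dir)

library(MSstats)
groupComparisonPlots(data = data.res, type = "VolcanoPlot",
                     address = "MSstatsTMT_",  # file name prefix; plots are written to files
                     sig = 0.05)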

Note

The isobaric analysis does not always have to be performed on protein level; for example, in phosphoproteomics studies one is usually interested in the peptide level. In addition, inference on peptides with post-translational modifications is not straightforward. Here, we present an additional workflow on peptide level, which can potentially be adapted and used for such cases. Please see WorkflowsIdentificationquantificationisobaricMSstatsTMT

References
Label-free quantification of metabolites
Introduction

Quantification and identification of chemical compounds are basic tasks in metabolomic studies. In this tutorial session we construct a UPLC-MS based, label-free quantification and identification workflow. Following quantification and identification we then perform statistical downstream analysis to detect quantification values that differ significantly between two conditions. This approach can, for example, be used to detect biomarkers. Here, we use two spike-in conditions of a dilution series (0.5 mg/l and 10.0 mg/l, male blood background, measured in triplicates) comprising seven isotopically labeled compounds. The goal of this tutorial is to detect and quantify these differential spike-in compounds against the complex background.

Basics of non-targeted metabolomics data analysis

For the metabolite quantification we choose an approach similar to the one used for peptides, but this time based on the OpenMS FeatureFinderMetabo method. This feature finder again collects peak picked data into individual mass traces. The reason why we need a different feature finder for metabolites lies in the step after trace detection: the aggregation of isotopic traces belonging to the same compound ion into the same feature. Compared to peptides with their averagine model, small molecules have very different isotopic distributions. To group small molecule mass traces correctly, an aggregation model tailored to small molecules is thus needed.

  • Create a new workflow called for instance ”Metabolomics”.

  • Add a File Importer node and configure it with one mzML file from Example_DataMetabolomicsdatasets.

  • Add a FeatureFinderMetabo node (from Community Nodes > OpenMS > Quantitation) and connect the first output port of the File Importer to the FeatureFinderMetabo.

  • For an optimal result adjust the following settings. Please note that some of these are advanced parameters.

  • Connect an Output Folder node to the output of the FeatureFinderMetabo (see Fig. 27).

FeatureFinderMetabo workflow

Figure 27: FeatureFinderMetabo workflow

In the following, the advanced parameters will be highlighted. These parameters can be altered if the Show advanced parameter field of the specific tool is activated.

| parameter | value |
|:---|:---|
| algorithm → common → chrom_fwhm | 8.0 |
| algorithm → mtd → trace_termination_criterion | sample_rate |
| algorithm → mtd → min_trace_length | 3.0 |
| algorithm → mtd → max_trace_length | 600.0 |
| algorithm → epd → width_filtering | off |
| algorithm → ffm → report_convex_hulls | true |

The parameters change the behavior of FeatureFinderMetabo as follows:

  • chrom_fwhm: The expected chromatographic peak width in seconds.

  • trace_termination_criterion: In the first stage FeatureFinderMetabo assembles mass traces with a pre-defined mass accuracy. If this parameter is set to ’outlier’, the extension of a mass trace is stopped after a predefined number of consecutive outliers is found. If this parameter is set to ’sample_rate’, the extension of a mass trace is stopped once the ratio of collected peaks versus visited spectra falls below the ratio given by min_sample_rate.

  • min_trace_length: Minimal length of a mass trace in seconds. Choose a small value, if you want to identify low-intensity compounds.

  • max_trace_length: Maximal length of a mass trace in seconds. Set this parameter to -1 to disable the filtering by maximal length.

  • width_filtering: FeatureFinderMetabo can remove features with unlikely peak widths from the results. If activated it will use the interval provided by the parameters min_fwhm and max_fwhm.

  • report_convex_hulls: If set to true, convex hulls including mass traces will be reported for all identified features. This increases the output size considerably.

The output .featureXML file can be visualized with TOPPView on top of the used .mzML file - in a so-called layer - to look at the identified features.

First start TOPPView and open the example .mzML file (see Fig. 28). Afterwards open the .featureXML output as new layer (see Fig. 29). The overlay is depicted in Figure 30. The zoom of the .mzML - .featureXML overlay shows the individual mass traces and the assembly of those in a feature (see Fig. 31).

Opened .mzML in TOPPView

Figure 28: Opened .mzML in TOPPView

Add new layer in TOPPView

Figure 29: Add new layer in TOPPView

Overlay of the .mzML layer with the .featureXML layer

Figure 30: Overlay of the .mzML layer with the .featureXML layer

Zoom of the overlay of the .mzML with the .featureXML layer

Figure 31: Zoom of the overlay of the .mzML with the .featureXML layer. Here, the individual isotope traces (blue lines) are assembled into a feature, shown as a convex hull (rectangular box).

The workflow can be extended for multi-file analysis: an Input Files node is used instead of the File Importer node. In addition, a ZipLoopStart node has to be placed in front of the FeatureFinderMetabo and a ZipLoopEnd node behind it, since FeatureFinderMetabo analyses the files one by one.

To facilitate the collection of features corresponding to the same compound ion across different samples, an alignment of the samples’ feature maps along retention time is often helpful. In addition to local, small-scale elution differences, one can often see constant retention time shifts across large sections between samples. We can use linear transformations to correct for these large scale retention differences. This brings the majority of corresponding compound ions close to each other. Finding the correct corresponding ions is then faster and easier, as we don’t have to search as far around individual features.

map alignment example

Figure 32: The first feature map is used as a reference to which the other maps are aligned. The calculated transformation brings corresponding features into close retention time proximity. Linking these features forms a so-called consensus feature of a consensus map.

  • After the ZipLoopEnd node, add a MapAlignerPoseClustering node (Community Nodes>OpenMS>Map Alignment), set its Output Type to featureXML, and adjust the following settings:

| parameter | value |
|:---|:---|
| algorithm → max_num_peaks_considered | −1 |
| algorithm → superimposer → mz_pair_max_distance | 0.005 |
| algorithm → superimposer → num_used_points | 10000 |
| algorithm → pairfinder → distance_RT → max_difference | 20.0 |
| algorithm → pairfinder → distance_MZ → max_difference | 20.0 |
| algorithm → pairfinder → distance_MZ → unit | ppm |

MapAlignerPoseClustering provides an algorithm to align the retention time scales of multiple input files, correcting shifts and distortions between them. Retention time adjustment may be necessary to correct for chromatography differences e.g. before data from multiple LC-MS runs can be combined (feature linking). The alignment algorithm implemented here is the pose clustering algorithm.

The parameters change the behavior of MapAlignerPoseClustering as follows:

  • max_num_peaks_considered: The maximal number of peaks/features to be considered per map. To use all, set this parameter to -1.

  • mz_pair_max_distance: Maximum of m/z deviation of corresponding elements in different maps. This condition applies to the pairs considered in hashing.

  • num_used_points: Maximum number of elements considered in each map (selected by intensity). Use a smaller number to reduce the running time and to disregard weak signals during alignment.

  • distance_RT → max_difference: Features that have a larger RT difference will never be paired.

  • distance_MZ → max_difference: Features that have a larger m/z difference will never be paired.

  • distance_MZ → unit: Unit used for the parameter distance_MZ → max_difference, either Da or ppm.

The next step after retention time correction is the grouping of corresponding features in multiple samples. In contrast to the previous alignment, we assume no linear relations of features across samples. The used method is tolerant against local swaps in elution order.

feature linking example

Figure 33: Features A and B correspond to the same analyte. The linking of features between runs (indicated by an arrow) allows comparing feature intensities.

  • After the MapAlignerPoseClustering node, add a FeatureLinkerUnlabeledQT node (Community Nodes > OpenMS>Map Alignment) and adjust the following settings:

    | parameter | value |
    |:---|:---|
    | algorithm → distance_RT → max_difference | 40 |
    | algorithm → distance_MZ → max_difference | 20 |
    | algorithm → distance_MZ → unit | ppm |

    The parameters change the behavior of FeatureLinkerUnlabeledQT as follows (similar to the parameters we adjusted for MapAlignerPoseClustering):

    • distance_RT → max_difference: Features that have a larger RT difference will never be paired.

    • distance_MZ → max_difference: Features that have a larger m/z difference will never be paired.

    • distance_MZ → unit: Unit used for the parameter distance_MZ → max_difference, either Da or ppm.

  • After the FeatureLinkerUnlabeledQT node, add a TextExporter node (Community Nodes > OpenMS > File Handling).

  • Add an Output Folder node and configure it with an output directory where you want to store the resulting files.

  • Run the pipeline and inspect the output.

Label-free quantification workflow for metabolites

Figure 34: Label-free quantification workflow for metabolites.

You should find a single, tab-separated file containing the information on where metabolites were found and with which intensities. You can also add Output Folder nodes at different stages of the workflow and inspect the intermediate results (e.g., identified metabolite features for each input map). The complete workflow can be seen in Figure 34. In the following section we will try to identify those metabolites.

The FeatureLinkerUnlabeledQT output can be visualized in TOPPView on top of the input and output of the FeatureFinderMetabo (see Fig. 35).

Visualization of .consensusXML output over the .mzML and .featureXML layers

Figure 35: Visualization of .consensusXML output over the .mzML and .featureXML ’layer’.

Basic metabolite identification

At this point we have found several metabolites in the individual maps, but so far we don’t know what they are. To identify metabolites, OpenMS provides multiple tools, including search by mass: the AccurateMassSearch node searches observed masses against the Human Metabolome Database (HMDB)[1], [2], [3]. We start with the workflow from the previous section (see Figure 34).

  • Add a FileConverter node (Community Nodes > OpenMS > File Handling) and connect the output of the FeatureLinkerUnlabeledQT to the incoming port.

  • Open the Configure dialog of the FileConverter node and select the tab OutputTypes. In the drop down list for FileConverter.1.out select featureXML.

  • Add an AccurateMassSearch node (Community Nodes > OpenMS > Utilities) and connect the output of the FileConverter node to the first port of the AccurateMassSearch node.

  • Add four File Importer nodes and configure them with the following files:

    • Example_DataMetabolomicsdatabasesPositiveAdducts.tsv This file specifies the list of adducts that are considered in the positive mode. Each line contains the formula and charge of an adduct separated by a semicolon (e.g. M+H;1+). The mass of the adduct is calculated automatically.

    • Example_DataMetabolomicsdatabasesNegativeAdducts.tsv This file specifies the list of adducts that are considered in the negative mode analogous to the positive mode.

    • Example_DataMetabolomicsdatabasesHMDBMappingFile.tsv This file contains information from a metabolite database in this case from HMDB. It has three (or more) tab-separated columns: mass, formula, and identifier(s). This allows for an efficient search by mass.

    • Example_DataMetabolomicsdatabasesHMDB2StructMapping.tsv This file contains additional information about the identifiers in the mapping file. It has four tab-separated columns that contain the identifier, name, SMILES, and INCHI. These will be included in the result file. The identifiers in this file must match the identifiers in the HMDBMappingFile.tsv.

  • In the same order as given above, connect them to the remaining input ports of the AccurateMassSearch node.

  • Add an Output Folder node and connect the first output port of the AccurateMassSearch node to the Output Folder node.

The result of the AccurateMassSearch node is in the mzTab format[4] so you can easily open it in a text editor or import it into Excel or KNIME, which we will do in the next section. The complete workflow from this section is shown in Figure 36.

Label-free quantification and identification workflow for metabolites

Figure 36: Label-free quantification and identification workflow for metabolites.

Convert your data into a KNIME table

The result from the TextExporter node as well as the result from the AccurateMassSearch node are files, while standard KNIME nodes display and process only KNIME tables. To convert these files into KNIME tables we need two different nodes. For the AccurateMassSearch results, we use the MzTabReader node (Community Nodes > OpenMS > Conversion > mzTab) and its Small Molecule Section port. For the result of the TextExporter, we use the ConsensusTextReader (Community Nodes > OpenMS > Conversion). When executed, both nodes will import the OpenMS files and provide access to the data as KNIME tables. Based on the current PSI standard, the MzTabReader exports the retention time values as a list. This list has to be parsed using the SplitCollectionColumn node, which outputs a "Split Value 1" column based on the first entry in the retention time list; that column then has to be renamed to retention time using the ColumnRename node. You can now combine both tables using the Joiner node (Manipulation > Column > Split & Combine) and configure it to match the m/z and retention time values of the respective tables. The full workflow is shown in Figure 37.

Label-free quantification and identification workflow for metabolites that loads the results into KNIME and joins the tables

Figure 37: Label-free quantification and identification workflow for metabolites that loads the results into KNIME and joins the tables.

Adduct grouping

Metabolites commonly co-elute as ions with different adducts (e.g., glutathione+H, glutathione+Na) or with charge-neutral modifications (e.g., water loss). Grouping such related ions allows leveraging information across features. For example, a low-intensity, single-trace feature could still be assigned a charge and adduct due to a matching high-quality feature. Several OpenMS tools, such as AccurateMassSearch, can use this information to, for example, narrow down candidates for identification.

For this grouping task, we provide the MetaboliteAdductDecharger node. Its method explores the combinatorial space of all adduct combinations in a charge range for optimal explanations. Using defined adduct probabilities, it assigns to co-eluting features with suitable mass shifts and charges those adduct combinations which maximize the overall ion probabilities.

The tool works natively with featureXML data, allowing the use of reported convex hulls. On such a single-sample level, co-elution settings can be chosen more stringently, as ionization-based adducts should not influence the elution time: Instead, elution differences of related ions should be due to slightly differently estimated times for their feature centroids.

Alternatively, consensusXML data from feature linking can be converted for use, though with less chromatographic information. Here, the elution time averaging for features linked across samples motivates wider co-elution tolerances.

The two main tool outputs are a consensusXML file with compound groups of related input ions, and a featureXML containing the input file but annotated with inferred adduct information and charges.

Options to respect or replace ion charges or adducts allow for example:

  • Heuristic but faster, iterative adduct grouping (MetaboliteAdductDecharger → MetaboliteFeatureDeconvolution → q_try set to “feature”) by chaining multiple MetaboliteAdductDecharger nodes with growing adduct sets, charge ranges or otherwise relaxed tolerances.

  • More specific feature linking (FeatureLinkerUnlabeledQT → algorithm → ignore_adduct set to “false”)

Metabolite Adduct Decharger adduct grouping workflow

Figure 38: Metabolite Adduct Decharger adduct grouping workflow.

Task

A modified metabolomics workflow with exemplary MetaboliteAdductDecharger use and parameters is provided in WorkflowsMetaboliteAdductGrouping.knwf. Run the workflow, inspect tool outputs and compare AccurateMassSearch results with and without adduct grouping.

Visualizing data

Now that you have your data in KNIME you should try to get a feeling for the capabilities of KNIME.

Task

Check out the Molecule Type Cast node (Chemistry > Translators) together with subsequent cheminformatics nodes (e.g. RDKit From Molecule(Community Nodes > RDKit > Converters)) to render the structural formula contained in the result table.

Task

Have a look at the Column Filter node to reduce the table to the interesting columns, e.g., only the Ids, chemical formula, and intensities.

Task

Try to compute and visualize the m/z and retention time error of the different feature elements (from the input maps) of each consensus feature. Hint: A nicely configured Math Formula (Multi Column) node should suffice.

Manual validation

In metabolomics, matches between tandem spectra and spectral libraries are manually validated. Several commercial and free online resources exist which help in that task. Some examples are:

  • mzCloud contains only spectra from Thermo Orbitrap instruments. The webpage requires Microsoft Silverlight, which currently does not work in modern browsers (see the following link).

  • MassBank North America (MoNA) has spectra from different instruments but falls short in number of spectra (compared to Metlin and mzCloud). See the following link.

  • METLIN includes 961,829 molecules ranging from lipids, steroids, metabolites, small peptides, carbohydrates, exogenous drugs and toxicants. In total over 14,000 metabolites.

Here, we will use METLIN to manually validate metabolites.

Task

Check in the .xlsx output from the Excel writer (XLS) whether you can find glutathione. Use the retention time column to find the spectrum in the mzML file. Open the file Example_DataMetabolomicsdatasetsMetaboliteIDSpectraDBpositive.mzML in TOPPView. The MSMS spectrum with the retention time of 67.6 s is used as an example. The spectrum can be selected based on the retention time in the scan view window. Therefore, the MS1 spectrum with the retention time of 66.9 s has to be double-clicked and the MSMS spectra recorded in this time frame will show up. Select the tandem spectrum of glutathione, but do not close TOPPView yet.

Tandem spectrum of glutathione. Visualized in TOPPView.

Figure 40: Tandem spectrum of glutathione. Visualized in TOPPView.

Task

On the METLIN homepage search for Name Glutathione using the Advanced Search (see the link; note that free registration is required). Which collision energy (and polarity) gives the best (visual) match to your experimental spectrum in TOPPView? Here you can compare the fragmentation patterns in both spectra based on the intensity or relative intensity, the m/z of a peak and the distance between peaks. Each distance between two peaks corresponds to a fragment of a certain elemental composition (e.g., a distance of 16.023 Th between two singly charged peaks would correspond to an NH2 fragment).

Tandem spectrum of glutathione. Visualized in Metlin. Note that several fragment spectra from varying collision energies are available.

Figure 41: Tandem spectrum of glutathione. Visualized in Metlin. Note that several fragment spectra from varying collision energies are available.

De novo identification

Another method for MS2 spectra-based metabolite identification is de novo identification. This approach can be used in addition to the other methods (accurate mass search, spectral library search) or individually if no spectral library is available. In this part of the tutorial, we discuss how metabolite spectra can be identified using de novo tools. To this end, the tools SIRIUS and CSI:FingerID ([5], [6], [7]) were integrated in the OpenMS Framework as SiriusAdapter. SIRIUS uses isotope pattern analysis to detect the molecular formula and further analyses the fragmentation pattern of a compound using fragmentation trees. CSI:FingerID is a method for searching a fingerprint of a small molecule (metabolite) in a molecular structure database. The node SiriusAdapter is able to work in different modes depending on the provided input.

  • Input: mzML - SiriusAdapter will search all MS2 spectra in a map.

  • Input: mzML, featureXML (FeatureFinderMetabo) - SiriusAdapter can use the provided feature information to reduce the search space to valid features with MS2 spectra. Additionally it can use the isotopic trace information.

  • Input: mzML, featureXML (FeatureFinderMetabo / MetaboliteAdductDecharger / AccurateMassSearch) - SiriusAdapter can use the feature information as mentioned above together with feature adduct information from adduct grouping or previous identification.

By providing an mzML and a featureXML, SIRIUS gains a lot of additional information from the OpenMS preprocessing tools.

Task

Construct the workflow as shown in Fig. 42. Use the file MetaboliteDeNovoID.mzML from Example_DataMetabolomicsdatasets as input for your workflow.

Below we show an example workflow for de novo identification (Fig. 42). Here, the node FeatureFinderMetabo is used for feature detection to annotate analytes in m/z, RT, intensity and charge. This is followed by adduct grouping, trying to assess possible adducts based on the feature space using the MetaboliteAdductDecharger. In addition, the HighResPrecursorMassCorrector can use the newly generated feature information to map MS2 spectra, which were measured on one of the isotope traces, to the monoisotopic precursor. This helps with feature mapping and analyte identification in the SiriusAdapter due to the usage of additional MS2 spectra that belong to a specific feature.

De novo identification workflow

Figure 42: De novo identification workflow

Run the workflow and inspect the output.

The output consists of two mzTab files and an internal .ms file: one mzTab for SIRIUS and the other for CSI:FingerID. These provide information about the chemical formula, adduct and the possible compound structure. The information is referenced to the spectrum used in the analysis. Additional information can be extracted from the SiriusAdapter by setting an "out_workspace_directory". Here, the SIRIUS workspace will be provided after the calculation has finished. This workspace contains information about annotated fragments for each successfully explained compound.

Downstream data analysis and reporting

In this part of the metabolomics session we take a look at more advanced downstream analysis and the use of the statistical programming language R. As laid out in the introduction we try to detect a set of spike-in compounds against a complex blood background. As there are many ways to perform this type of analysis we provide a complete workflow.

Task

Import the workflow from WorkflowsmetaboliteID.knwf in KNIME: File > Import KNIME Workflow…

The section below will guide you in your understanding of the different parts of the workflow. Once you have understood the workflow, you should play around and be creative. Maybe create a novel visualization in KNIME or R? Do some more elaborate statistical analysis? Note that some basic R knowledge is required to fully understand the processing in R Snippet nodes.

Signal processing and data preparation for identification

The following part is analogous to what you did for the simple metabolomics pipeline.

Data preparation for quantification

The first part is identical to what you did for the simple metabolomics pipeline. Additionally, we convert zero intensities into NA values and remove all rows that contain at least one NA value from the analysis. We do this using a very simple R Snippet and a subsequent Missing Value filter node (a minimal sketch of the snippet is shown below).
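The sketch below illustrates what such an R Snippet could look like; the regular expression used to find the intensity columns is an assumption and has to be adapted to the actual column names of your table.

# sketch of the R Snippet: turn zero intensities into missing values (NA)
df <- knime.in
# assumption: the intensity columns can be recognized by their name
int.cols <- grep("intensity", colnames(df), ignore.case = TRUE)
for (i in int.cols) {
  df[[i]][df[[i]] == 0] <- NA     # zero intensity becomes a missing value
}
knime.out <- df                    # rows with NAs are then removed by the Missing Value filter node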

Task

Inspect the R Snippet by double-clicking on it. The KNIME table that is passed to an R Snippet node is available in R as a data.frame named knime.in. The result of this node will be read from the data.frame knime.out after the script finishes. Try to understand and evaluate parts of the script (Eval Selection). In this dialog you can also print intermediary results using for example the R command head(knime.in) or cat(knime.in) to the Console pane.

Statistical analysis

After we linked features across all maps, we want to identify features that are significantly deregulated between the two conditions. We will first scale and normalize the data, then perform a t-test, and finally correct the obtained p-values for multiple testing using Benjamini-Hochberg. All of these steps will be carried out in individual R Snippet nodes.

  • Double-click on the first R Snippet node labeled ”log scaling” to open the R Snippet dialog. In the middle you will see a short R script that performs the log scaling. To perform the log scaling we use a so-called regular expression (grepl) to select all columns containing the intensities in the six maps and take the log2 logarithm.

  • The output of the log scaling node is also used to draw a boxplot that can be used to examine the structure of the data. Since we only want to plot the intensities in the different maps (and not m/z or rt) we first use a Column Filter node to keep only the columns that contain the intensities. We connect the resulting table to a Box Plot node which draws one box for every column in the input table. Right-click and select View: Box Plot

  • The median normalization is performed in a similar way to the log scaling. First we calculate the median intensity for each intensity column, then we subtract the median from every intensity.

  • Open the Box Plot connected to the normalization node and compare it to the box plot connected to the log scaling node to examine the effect of the median normalization.

  • To perform the t-test, we define the two groups we want to compare. Finally, we save the p-values and fold-changes in two new columns named p-value and FC.

  • The Numeric Row Splitter is used to filter less interesting parts of the data. In this case we only keep rows where the fold-change is ≥ 2.

  • We adjust the p-values for multiple testing using Benjamini-Hochberg and keep all consensus features with a q-value ≤ 0.01 (i.e. we target a false-discovery rate of 1%). A minimal R sketch of these analysis steps is shown after this list.
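The following sketch condenses these steps into a single R Snippet. It is an approximation of the logic, not the exact code of the provided workflow: it assumes that the six intensity columns can be selected with a regular expression and that the first three belong to one condition and the last three to the other.

df <- knime.in
# assumption: intensity columns are recognizable by their name
int.cols <- grep("intensity", colnames(df), ignore.case = TRUE)

# log scaling
df[, int.cols] <- log2(df[, int.cols])

# median normalization: subtract the per-column median
meds <- apply(df[, int.cols], 2, median, na.rm = TRUE)
df[, int.cols] <- sweep(df[, int.cols], 2, meds, "-")

# t-test and fold change (assumption: columns 1-3 = condition 1, columns 4-6 = condition 2)
mat <- as.matrix(df[, int.cols])
df$p.value <- apply(mat, 1, function(x) t.test(x[1:3], x[4:6])$p.value)
df$FC <- rowMeans(mat[, 1:3]) - rowMeans(mat[, 4:6])   # difference of log2 means

# Benjamini-Hochberg adjustment; keep consensus features with q-value <= 0.01
df$q.value <- p.adjust(df$p.value, method = "BH")
knime.out <- df[df$q.value <= 0.01, ]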

Interactive visualization

KNIME supports multiple nodes for interactive visualization with interrelated output. The nodes used in this part of the workflow exemplify this concept. They further demonstrate how figures with data dependent customization can be easily realized using basic KNIME nodes. Several simple operations are concatenated in order to enable an interactive volcano plot.

  • We first log-transform fold changes and p-values in the R Snippet node. We then append columns noting interesting features (concerning fold change and p-value); a small sketch of this step is shown after this list.

  • With this information, we can use various Manager nodes (Views > Property) to emphasize interesting data points. The configuration dialogs allow us to select columns to change color, shape or size of data points dependent on the column values.

  • The Scatter Plot node (from the Views repository) enables interactive visualization of the logarithmized values as a volcano plot: the log-transformed values can be chosen in the ‘Column Selection’ tab of the plot view. Data points can be selected in the plot and highlighted via the menu option. The highlighting transfers to all other interactive nodes connected to the same data table. In our case, selection and the highlighting will also occur in the Interactive Table node (from the Views repository).

  • Output of the interactive table can then be filtered via the "HiLite" menu tab. For example, we could restrict shown rows to points highlighted in the volcano plot.
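As a rough illustration, the column preparation in the first R Snippet of this section could look like the sketch below; the cut-offs and the assumption that the FC column already holds log2 fold changes are hypothetical and not taken from the provided workflow.

# sketch: derive the columns used for the interactive volcano plot
df <- knime.in
df$neg.log10.p <- -log10(df$p.value)                    # y-axis of the volcano plot
# flag features passing (hypothetical) significance and fold-change cut-offs
df$interesting <- df$p.value <= 0.05 & abs(df$FC) >= 1
knime.out <- df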

Task

Inspect the nodes of this section. Customize your visualization and possibly try to visualize other aspects of your data.

Advanced visualization

R Dependencies: This section requires that the R packages ggplot2 and ggfortify are both installed. ggplot2 is part of the KNIME R Statistics Integration (Windows Binaries), which should already be installed via the full KNIME installer; ggfortify however is not. In case you use an R installation where one or both of them are not yet installed, add an R Snippet node and double-click it to configure. In the R Script text editor, enter the following code:

#Include the next line if you also have to install ggplot2:   
install.packages("ggplot2")
 
#Include the following lines to install ggfortify:  
install.packages("ggfortify")
 
library(ggplot2) 
library(ggfortify)

You can remove the install.packages() commands once the packages have been installed successfully.

Even though the basic capabilities for (interactive) plots in KNIME are valuable for initial data exploration, professional looking depiction of analysis results often relies on dedicated plotting libraries. The statistics language R supports the addition of a large variety of packages, including packages providing extensive plotting capabilities. This part of the workflow shows how to use R nodes in KNIME to visualize more advanced figures. Specifically, we make use of different plotting packages to realize heatmaps.

  • The R View (Table) nodes used here combine the possibility to write R snippet code with visualization capabilities inside KNIME. Resulting images can be inspected in the output R View or saved via the Image Writer (Port) node.

  • The heatmap nodes make use of the gplots library, which is by default part of the R Windows binaries (for full KNIME version 3.1.1 or higher). We again use regular expressions to extract all measured intensity columns for plotting. For clarity, feature names are only shown in the heatmap after filtering by fold changes. A minimal sketch of such a node is shown after this list.
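A minimal sketch of such an R View (Table) node is given below; the regular expression for the intensity columns is again an assumption that has to be adapted to your table.

# sketch: draw a heatmap of the measured intensities with gplots
library(gplots)

int.cols <- grep("intensity", colnames(knime.in), ignore.case = TRUE)
mat <- as.matrix(knime.in[, int.cols])

heatmap.2(mat,
          trace = "none",    # no trace lines inside the heatmap
          scale = "row",     # scale each consensus feature across the maps
          margins = c(10, 8))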

Data preparation for reporting

Following the identification, quantification and statistical analysis, our data is merged and formatted for reporting. First we want to discard our normalized and logarithmized intensity values in favor of the original ones. To this end, we first remove the intensity columns (Column Filter) and add the original intensities back (Joiner). For that, we use an Inner Join with the Joiner node. In the dialog of the node, we add two entries for the Joining Columns: for the first column we pick retention_time from the top input (i.e. the AccurateMassSearch output) and rt_cf (the retention time of the consensus features) for the bottom input (the result from the quantification). For the second column you should choose exp_mass_to_charge and mz_cf respectively to make the joining unique. Note that the workflow needs to be executed up to the previous nodes for the possible selections of columns to appear.

Data preparation for reporting

Figure 43: Data preparation for reporting

Question

What happens if we use a Left Outer Join, Right Outer Join or Full Outer Join instead of the Inner Join?

Task

Inspect the output of the join operation after the Molecule Type Cast and RDKit molecular structure generation.

While all relevant information is now contained in our table the presentation could be improved. Currently, we have several rows corresponding to a single consensus feature (=linked feature) but with different, alternative identifications. It would be more convenient to have only one row for each consensus feature with all accurate mass identifications added as additional columns. To this end, we use the Column to Grid node that flattens several rows with the same consensus number into a single one. Note that we have to specify the maximum number of columns in the grid so we set this to a large value (e.g. 100). We finally export the data to an Excel file (XLS Writer).

References
OpenSWATH
Introduction

OpenSWATH [3] allows the analysis of LC-MS/MS DIA (data independent acquisition) data using the approach described by Gillet et al. [4]. The DIA approach described there uses 32 cycles to iterate through precursor ion windows from 400-426 Da to 1175-1201 Da and at each step acquires a complete, multiplexed fragment ion spectrum of all precursors present in that window. After 32 fragmentations (or 3.2 seconds), the cycle is restarted and the first window (400-426 Da) is fragmented again, thus delivering complete “snapshots” of all fragments of a specific window every 3.2 seconds. The analysis approach described by Gillet et al. extracts ion traces of specific fragment ions from all MS2 spectra that have the same precursor isolation window, thus generating data that is very similar to SRM traces.

Installation of OpenSWATH

OpenSWATH has been fully integrated since OpenMS 1.10 [2], [1], [5], [6], [7].

Installation of mProphet

mProphet[8] is available as a standalone script in External_ToolsmProphet. R and the package MASS are further required to execute mProphet. Please obtain a version for either Windows, Mac or Linux directly from CRAN. PyProphet, a much faster reimplementation of the mProphet algorithm, is available from PyPI. The usage of pyprophet instead of mProphet is suggested for large-scale applications.

mProphet will be used in this tutorial.

Generating the Assay Library
Generating TraML from transition lists

OpenSWATH requires an assay library to be supplied in the TraML format[9]. To enable manual editing of transition lists, the TOPP tool TargetedFileConverter is available, which uses tab-separated files as input. Example datasets are provided in ExampleDataOpenSWATHassay. Please note that the transition lists need to have the file extension .tsv.

The header of the transition list contains the following variables (with example values in brackets):

Required Columns:

PrecursorMz

The mass-to-charge (m/z) of the precursor ion. (924.539)

ProductMz

The mass-to-charge (m/z) of the product or fragment ion. (728.99)

LibraryIntensity

The relative intensity of the transition. (0.74)

NormalizedRetentionTime

The normalized retention time (or iRT)[10] of the peptide. (26.5)

Targeted Proteomics Columns:

ProteinId

A unique identifier for the protein. (AQUA4SWATH_HMLangeA)

PeptideSequence

The unmodified peptide sequence. (ADSTGTLVITDPTR)

ModifiedPeptideSequence

The peptide sequence with UniMod modifications. (ADSTGTLVITDPTR(UniMod:267))

PrecursorCharge

The precursor ion charge. (2)

ProductCharge

The product ion charge. (2)

Grouping Columns:

TransitionGroupId

A unique identifier for the transition group. (AQUA4SWATH_HMLangeA_ADSTGTLVITDPTR(UniMod:267)/2)

TransitionId

A unique identifier for the transition. (AQUA4SWATH_HMLangeA_ADSTGTLVITDPTR(UniMod:267)/2_y8)

Decoy

A binary value whether the transition is target or decoy. (target: 0, decoy: 1)

PeptideGroupLabel

Which label group the peptide belongs to.

DetectingTransition

Use transition for peak group detection. (1)

IdentifyingTransition

Use transition for peptidoform inference using IPF. (0)

QuantifyingTransition

Use transition to quantify peak group. (1)

For further instructions about generic transition list and assay library generation please see the following link. To convert transition lists to TraML, use the TargetedFileConverter. Please use the absolute path to your OpenMS installation.

Linux or Mac

On the Terminal:

 TargetedFileConverter -in OpenSWATH_SGS_AssayLibrary_woDecoy.tsv -out OpenSWATH_SGS_AssayLibrary_woDecoy.TraML

Windows

On the TOPP command:

 TargetedFileConverter.exe -in OpenSWATH_SGS_AssayLibrary_woDecoy.tsv -out OpenSWATH_SGS_AssayLibrary_woDecoy.TraML
Appending decoys to a TraML file

In addition to the target assays, OpenSWATH requires decoy assays in the library which are later used for classification and error rate estimation. For the decoy generation it is crucial that the decoys represent the targets in a realistic but unnatural manner without interfering with the targets. The methods for decoy generation implemented in OpenSWATH include ’shuffle’, ’pseudo-reverse’, ’reverse’ and ’shift’. To append decoys to a TraML, the TOPP tool OpenSwathDecoyGenerator can be used: Please use the absolute path to your OpenMS installation.

Linux or Mac

On the Terminal:

OpenSwathDecoyGenerator -in OpenSWATH_SGS_AssayLibrary_woDecoy.TraML -out OpenSWATH_SGS_AssayLibrary.TraML -method shuffle -switchKR false

Windows

On the TOPP command:

OpenSwathDecoyGenerator.exe -in OpenSWATH_SGS_AssayLibrary_woDecoy.TraML -out OpenSWATH_SGS_AssayLibrary.TraML -method shuffle -switchKR false
OpenSWATH KNIME

An example KNIME workflow for OpenSWATH is supplied in Workflows (Fig. 44). The example dataset can be used for this workflow (filenames in brackets):

  1. Open WorkflowsOpenSWATH.knwf in KNIME: File > Import KNIME Workflow…

  2. Select the normalized retention time (iRT) assay library in TraML format by double-clicking on node File Importer > iRT Assay Library. (ExampleDataOpenSWATHassayOpenSWATHiRTAssayLibrary.TraML).

  3. Select the SWATH MS data in mzML format as input by double-clicking on node Input File > SWATH-MS files. (ExampleDataOpenSWATHdatasplitnapedroL120420x010SW-*.nf.pp.mzML).

  4. Select the target peptide assay library in TraML format as input by double-clicking on node Input Files > Assay Library. (ExampleDataOpenSWATHassayOpenSWATHSGSAssayLibrary.TraML).

  5. Set the output destination by double-clicking on node Output File.

  6. Run the workflow.

The resulting output can be found at your selected path, which will be used as input for mProphet. Execute the script on the Terminal (Linux or Mac) or cmd.exe (Windows) in ExampleDataOpenSWATHresult. Please use the absolute path to your R installation and the result file:

R --slave --args bin_dir=../../../External_Tools/mProphet/ mquest=OpenSWATH_quant.tsv workflow=LABEL_FREE num_xval=5 run_log=FALSE write_classifier=1 write_all_pg=1 < ../../../External_Tools/mProphet/mProphet.R

or for Windows:

"C:\Program Files\R\R-3.5.1\bin\x86\R.exe" --slave --args bin_dir=../../../External_Tools/mProphet/ mquest=OpenSWATH_quant.tsv workflow=LABEL_FREE num_xval=5 run_log=FALSE write_classifier=1 write_all_pg=1 < ../../../External_Tools/mProphet/mProphet.R

The main output will be called: OpenSWATHresultmProphetxallxpeakgroups.xls with statistical information available in OpenSWATHresultmProphet.pdf.

Please note that due to the semi-supervised machine learning approach of mProphet the results differ slightly when mProphet is executed several times.

OpenSWATH KNIME Workflow.

Figure 44: OpenSWATH KNIME Workflow.

Additionally, the chromatogram output (.mzML) can be visualized for inspection with TOPPView. For additional instructions on how to use pyProphet instead of mProphet please have a look at the PyProphet Legacy Workflow. If you want to use the SQLite-based workflow in your lab in the future, please have a look here. The SQLite-based workflow will not be part of the tutorial.

From the example dataset to real-life applications

The sample dataset used in this tutorial is part of the larger SWATH MS Gold Standard (SGS) dataset which is described in the publication of Roest et al.[3]. It contains one of 90 SWATH-MS runs with significant data reduction (peak picking of the raw, profile data) to make file transfer and working with it easier. Usually SWATH-MS datasets are huge, with several gigabytes per run. Especially when complex samples in combination with large assay libraries are analyzed, the TOPP tool based workflow requires a lot of computational resources. Additional information and instructions can be found at the following link.

References
OpenSWATH for Metabolomics
Introduction

We would like to present an automated DIA/SWATH analysis workflow for metabolomics, which takes advantage of experiment specific target-decoy assay library generation. This allows for targeted extraction, scoring and statistical validation of metabolomics DIA data[1], [2].

Workflow

The workflow follows multiple steps (see Fig. 45).

DIAMetAlyzer - pipeline for assay library generation and targeted analysis with statistical validation

Figure 45: DIAMetAlyzer - pipeline for assay library generation and targeted analysis with statistical validation. DDA data is used for candidate identification containing feature detection, adduct grouping and accurate mass search. Library construction uses fragment annotation via compositional fragmentation trees and decoy generation using a fragmentation tree re-rooting method to create a target-decoy assay library. This library is used in a second step to analyse metabolomics DIA data performing targeted extraction, scoring and statistical validation (FDR estimation).

Assay library generation

Figure 46: Assay library generation. The results of the compound identification (feature, molecular formula, adduct), with the corresponding fragment spectra for the feature, are used to perform fragment annotation via SIRIUS, using the compositional fragmentation trees. Then, the n highest intensity transitions are extracted and stored in the assay library.

Decoy generation

Figure 47: Decoy generation. The compositional fragmentation trees from the step above are used to run the fragmentation tree re-rooting method from Passatutto, generating a compound-specific decoy MS2 spectrum. Here, the n highest intensity decoy transitions are extracted and stored in the target-decoy assay library.

  • Candidate identification Feature detection, adduct grouping and accurate mass search are applied on DDA data.

  • Library construction The knowledge determined from the DDA data about compound identification, its potential adduct and the corresponding fragment spectra is used to perform fragment annotation via compositional fragmentation trees using SIRIUS 4[3]. Afterwards, transitions, which reference a precursor and its fragment ions, are stored in a so-called assay library (Fig. 46). Assay libraries usually contain additional metadata (i.e. retention time, peak intensities). FDR estimation is based on the target-decoy approach[4]. For the generation of the MS2 decoys, the fragmentation tree-based re-rooting method by Passatutto ensures the consistency of decoy spectra (Fig. 47)[5]. The target-decoy assay library is then used to analyse the SWATH data.

  • Targeted extraction Chromatogram extraction and peak-group scoring. This step is performed using an algorithm based on OpenSWATH[1] for metabolomics data.

  • Statistical validation FDR estimation uses the PyProphet algorithm[2]. To prevent overfitting we chose the simpler linear model (LDA) for target-decoy discrimination in PyProphet, using MS1 and MS2 scoring with low correlated scores.

Prerequisites

Apart from the usual KNIME nodes, the workflow uses Python scripting nodes. One basic requirement for the installation of Python packages, in particular pyOpenMS, is a package manager for Python. Using conda as an environment manager allows you to specify a specific environment in the KNIME settings (File > Preferences > KNIME > Python).

Windows

We suggest to use a virtual environment for the Python 3 installation on Windows. Here you can install miniconda and follow the further instructions.

  1. Create new conda python environment.

    conda create -n py39 python=3.9
    
  2. Activate py39 environment.

    conda activate py39
    
  3. Install pip (see above).

  4. On the command line:

    python -m pip install -U pip   
    python -m pip install -U numpy  
    python -m pip install -U pandas
    
    python -m pip install -U pyprophet 
    python -m pip install -U pyopenms
    
macOS

We suggest to use a virtual environment for the Python 3 installation on macOS. Here you can install miniconda and follow the further instructions.

  1. Create new conda python environment.

    conda create -n py39 python=3.9
    
  2. Activate py39 environment.

    conda activate py39
    
  3. On the Terminal:

    python -m pip install -U pip   
    python -m pip install -U numpy  
    python -m pip install -U pandas
    
    python -m pip install -U pyprophet 
    python -m pip install -U pyopenms
    
Linux

Use your package manager apt-get or yum, where possible.

  1. Install Python 3.9 (Debian: python-dev, RedHat: python-devel).

  2. Install NumPy (Debian/RedHat: python-numpy).

  3. Install setuptools (Debian/RedHat: python-setuptools).

  4. On the Terminal:

    python -m pip install -U pip   
    python -m pip install -U numpy  
    python -m pip install -U pandas
    
    python -m pip install -U pyprophet 
    python -m pip install -U pyopenms 
    
Benchmark data

For the assay library construction pesticide mixes (Agilent Technologies, Waldbronn, Germany) were measured individually in solvent (DDA). Benchmark DIA samples were prepared by spiking different commercially available pesticide mixes into human plasma metabolite extracts in a 1:4 dilution series, which covers 5 orders of magnitude.

The example data can be found here.

Example workflow

Example workflow for the usage of the DIAMetAlyzer Pipeline in KNIME (see Fig. 48). Inputs are the SWATH-MS data in profile mode (.mzML), a path for saving the new target-decoy assay library, the SIRIUS 4.9.0 executable, the DDA data (.mzML), custom libraries and adducts for AccurateMassSearch, the min/max fragment mass-to-charge to be able to restrict the mass of the transitions, and the path to the PyProphet executable. The DDA data is used for feature detection, adduct grouping and accurate mass search and forwarded to the AssayGeneratorMetabo. Here, feature mapping is performed to collect MS2 spectra that belong to a feature. All information collected before (feature, adduct, putative identification, MS2 spectra) is then internally forwarded to SIRIUS. SIRIUS is used for fragment annotation and decoy generation based on the fragmentation tree re-rooting approach. This information is then used to filter spectra/decoys based on their explained intensity (min. 85%). Afterwards, internal feature linking is performed, which is most important for untargeted experiments using a lot of DDA data to construct the library. The constructed target-decoy assay library is processed together with the SWATH-MS data in OpenSWATH. The results are used by PyProphet for scoring and output as a list of metabolites with their respective q-values and quantitative information.

Example workflow for the usage of the DIAMetAlyzer Pipeline in KNIME

Figure 48: Example workflow for the usage of the DIAMetAlyzer Pipeline in KNIME.

Run the workflow

These steps need to be followed to run the workflow successfully:

  • Add DDA Input Files (.mzML).

  • Specify SIRIUS 4.9.0 executable.

  • Specify library files (mapping, struct) for AccurateMassSearch.

  • Add positive/negative adducts lists for AccurateMassSearch.

  • Supply an output path for the SIRIUS workspace in the AssayGeneratorMetabo.

  • Specify additional paths and variables, such as an output path for the target-decoy assay library and a path to the pyprophet installation as well as decoy fragment mz filter (min/max).

  • Input DIA/SWATH files (.mzML).

  • Specify output path in the output folders.

You can now run the workflow.

Important parameters

Please have a look at the most important parameters, which should be tweaked to fit your data. In general, OpenMS has a lot of room for parameter optimization to best fit your chromatography and instrumental settings.

FeatureFinderMetabo

parameter           | explanation
--------------------|-------------------------------------------------------------
noise_threshold_int | Intensity threshold below which peaks are regarded as noise.
chrom_fwhm          | Expected chromatographic peak width (in seconds).
mass_error_ppm      | Allowed mass deviation (in ppm).
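
These parameters can also be set when running the FeatureFinderMetabo TOPP tool directly on the command line instead of inside KNIME. A rough sketch with purely illustrative values (file names, values, and the exact parameter paths are assumptions; check FeatureFinderMetabo --help for your OpenMS version):

    FeatureFinderMetabo -in dda_run.mzML -out dda_run.featureXML \
      -algorithm:common:noise_threshold_int 1.0e4 \
      -algorithm:common:chrom_fwhm 5.0 \
      -algorithm:mtd:mass_error_ppm 10.0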

MetaboliteAdductDecharger

parameter         | explanation
------------------|-------------------------------------------------------------
mass_max_diff     | Maximum allowed mass tolerance per feature…
potential_adducts | Adducts used to explain mass differences - these should fit the adduct list specified for AccurateMassSearch.

AccurateMassSearch

parameter        | explanation
-----------------|---------------------------------------------
mass_error_value | Tolerance allowed for accurate mass search.
ionization_mode  | Positive or negative ionization mode.

AssayGeneratorMetabo

parameter                           | explanation
------------------------------------|-------------------------------------------------------------
min_transitions                     | Minimal number of transitions (3).
max_transitions                     | Maximal number of transitions (3).
min_fragment_mz                     | Minimal m/z of a fragment ion chosen as a transition.
max_fragment_mz                     | Maximal m/z of a fragment ion chosen as a transition.
transitions_threshold               | Further transitions need at least x% of the maximum intensity.
fragment_annotation_score_threshold | Filters annotations based on the explained intensity of the peaks in a spectrum (0.8).

SIRIUS (internal):

parameter                | explanation
-------------------------|-------------------------------------------------------------
out_workspace_directory  | Output directory for the SIRIUS workspace (fragmentation trees).
filter_by_num_masstraces | Features need to have at least x mass traces. To use this parameter, feature_only is necessary.
precursor_mass_tolerance | Tolerance window for precursor selection (feature selection with regard to the precursor).
precursor_rt_tolerance   | Tolerance allowed for matching MS2 spectra depending on the feature size (should be around the FWHM of the chromatograms).
profile                  | Specify the used analysis profile (e.g. qtof).
elements                 | Allowed elements for assessing the putative sum formula (e.g. CHNOP[5]S[8]Cl[1]). Elements found in the isotopic pattern are added automatically, but can be specified nonetheless.

Feature linking (internal):

parameter                         | explanation
----------------------------------|-------------------------------------------------------------
ambiguity_resolution mz_tolerance | M/z tolerance for the resolution of identification ambiguity over multiple files (feature linking m/z tolerance).
ambiguity_resolution rt_tolerance | RT tolerance in seconds for the resolution of identification ambiguity over multiple files (feature linking RT tolerance).
total_occurrence_filter           | Filter compounds based on their total occurrence in the analysed samples.

For the total_occurrence_filter, the value to choose depends on the analysis strategy used. If you are using only identified compounds (use_known_unknowns = false), the filter is applied to identified features. This means that even if a feature was detected in, e.g., 50% of all samples, it might only be identified correctly by accurate mass search in 20% of them. With such a total_occurrence_filter, this specific feature would still be filtered out because of the lower number of identifications.

OpenSWATH

parameter               | explanation
------------------------|------------------------------------------------------
rt_extraction_window    | Extract x seconds around this value.
rt_normalization_factor | Please use the range of your gradient, e.g. 950 seconds.

If you are analysing a lot of big DIA mzML files (≈ 3-20 GB per file), it makes sense to change how OpenSWATH processes the spectra.

parameter     | explanation
--------------|------------------------------------------------------------------
readOptions   | Set cacheWorkingInMemory - will cache the files to disk and read SWATH-by-SWATH into memory.
tempDirectory | Set a directory where cached mzMLs are stored (be aware that this directory can be quite large depending on the data).
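
For orientation, a corresponding OpenSwathWorkflow call with caching enabled might look roughly like the following (file names and the temporary directory are placeholders; check OpenSwathWorkflow --help for the exact parameter names in your OpenMS version):

    OpenSwathWorkflow -in sample_dia.mzML -tr target_decoy_assay_library.pqp \
      -out_osw sample_dia.osw \
      -readOptions cacheWorkingInMemory -tempDirectory /tmp/osw_cache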

In the workflow, pyprophet is called after OpenSWATH. It first merges the result files, which provides enough data for the model training.

    pyprophet merge --template path_to_target-decoy_assay_library.pqp --out merged.osw ./*.osw

Afterwards, the results are scored using the MS1 and MS2 levels, applying the metabolomics score filter to exclude scores with low correlation.

    pyprophet score --in merged.osw --out scored.osw --level ms1ms2 --ss_main_score "var_isotope_correlation_score" --ss_score_filter metabolomics

Export the unfiltered results:

    pyprophet export-compound --in scored.osw --out scored_pyprophet_nofilter_ms1ms2.tsv --max_rs_peakgroup_qvalue 1000.0

Please see the workflow for actual parameter values used for the benchmarking dataset.

The workflow can also be used without any identification (remove AccurateMassSearch). In that case, all features (known_unknowns) are processed, and the assay library is constructed based on the chemical composition elucidated via the fragment annotation (SIRIUS 4). It is also possible to use identified features and, in addition, unknown (non-identified) features by using AccurateMassSearch in combination with the use_known_unknowns option in the AssayGeneratorMetabo.

References
Quality control
Introduction

In this chapter, we will build on an existing workflow with OpenMS / KNIME to add some quality control (QC). We will utilize the qcML tools in OpenMS to create a file in which we can collect different quality measures of the mass spectrometry runs themselves and of the applied analysis. The file also serves as a means of visually reporting on the collected quality measures and can later be stored along with the other analysis result files. We will, step by step, extend the label-free quantitation workflow from section 3 with QC functions and thereby enrich the report given by the qcML file at each step. But first, to make sure you get the most out of this tutorial section, a little primer on how we handle QC on the technical level.

QC metrics and qcML

To assert the quality of a measurement or analysis we use quality metrics. Metrics describe a certain aspect of the measurement or analysis and can be anything from a single value, over a range of values, to an image plot or other summary. Thus, qcML metric representation is divided into QC parameters (QP) and QC attachments (QA) to be able to represent all sorts of metrics on a technical level. A QP may (or may not) have a value, which would equal a metric describable with a single value. If the metric is more complex and needs more than just a single value, the QP does not require the single value but rather depends on an attachment of values (QA) for full meaning. Such a QA holds the plot or the range of values in a table-like form. Like this, we can describe any metric by a QP and an optional QA. To assure a consensual meaning of the quality parameters and attachments, we created a controlled vocabulary (CV). Each entry in the CV describes a metric or part/extension thereof. We embed each parameter or attachment with one of these and by doing so, connect a meaning to the QP/QA. Like this, we later know exactly what we collected, and the programs can find and connect the right dots for rendering the report or calculating new metrics automatically. You can find the constantly growing controlled vocabulary here. Finally, in a qcML file, we split the metrics on a per-mass-spectrometry-run basis or per set of mass-spectrometry-runs, respectively. Each run or set will contain the QPs/QAs we calculate for it, describing its quality.

Building a qcML file per run

As a start, we will build a basic qcML file for each mzML file in the label-free analysis. We are already creating the two analysis files needed to build a basic qcML file upon each mzML file: a feature file and an identification file. We use the QCCalculator node from Community > OpenMS > Utilities, where all other QC* nodes can be found as well. The QCCalculator will create a very basic qcML file in which it will store collected and calculated quality data (a command-line equivalent is sketched after the following list).

  • Copy your label-free quantitation workflow into a new lfq-qc workflow and open it.

  • Place the QCCalculator node after the IDMapper node. Being inside the ZipLoop, it will be executed once for each of the three mzML files provided by the Input node.

  • Connect the first QCCalculator port to the first ZipLoopStart outlet port, which will carry the individual mzML files.

  • Connect the ID outlet port of the last ID node (IDFilter or the ID metanode) to the second QCCalculator port for the identification file.

  • Finally, connect the IDMapper outlet to the third QCCalculator port for the feature file.
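
Outside of KNIME, the same information can be collected with the QCCalculator tool directly on the command line. A minimal sketch, assuming per-run mzML, featureXML and idXML files (file names are placeholders; check QCCalculator --help for the exact parameter names in your OpenMS version):

    QCCalculator -in run.mzML -feature run.featureXML -id run.idXML -out run.qcML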

The created qcML files will not have much to show, basic as they are. So we will extend them with some basic plots.

  • First, we will add a 2D overview image of the given mass spectrometry run as you may know it from TOPPView. Add the ImageCreator node from Community Nodes > OpenMS > Utilities. Change the width and height parameters to 640x640, as we don’t want it to be too big. Connect it to the first ZipLoopStart outlet port, so it will create an image file of the run contained in the mzML.

  • Now we have to embed this file into the qcML file, and attach it to the right QualityParameter. For this, place a QCEmbedder node behind the ImageCreator and connect that to its third inlet port. Connect its first inlet port to the outlet of the QCCalculator node to pass on the qcML file. Now change the parameter cv_acc to QC:0000055 which designates the attached image to be of type QC:0000055 - MS experiment heatmap. Finally, change the parameter qp_att_acc to QC:0000004, to attach the image to the QualityParameter QC:0000004 - MS acquisition result details.

  • For a reference of which CVs are already defined for qcML, have a look at the following link.

There are two other basic plots which we almost always want to look at before judging the quality of a mass spectrometry run and its identifications: the total ion current (TIC) and the PSM mass error (Mass accuracy), which we have available as pre-packaged QC metanodes.

Task

Import the workflow from Workflows > Quality Control > QC Metanodes.zip by navigating to File > Import KNIME Workflow…

  • Copy the Mass accuracy metanode into the workflow behind the QCEmbedder node and connect it. The qcML will be passed on and the Mass accuracy plots added. The information needed was already collected by the QCCalculator.

  • Do the same with the TIC metanode so that your qcML file will get passed on and enriched on each step.

R Dependencies: This section requires that the R packages ggplot2 and scales are both installed. This is the same procedure as in this section. In case you use an R installation where one or both of them are not yet installed, open the R Snippet nodes inside the metanodes you just used (double-click). Edit the script in the R Script text editor from:

#install.packages("ggplot2")  
#install.packages("scales")

to

install.packages("ggplot2")  
install.packages("scales")

Press Eval script to execute the script.


Figure 51: Basic QC setup within a LFQ workflow.

Note

To have a peek into what our qcML now looks like for one of the ZipLoop iterations, we can add an Output Folder node from Community Nodes > GenericKnimeNodes > IO and set its destination parameter to somewhere we want to find our intermediate qcML files in, for example tmp > qcxlfq. If we now connect the last metanode with the Output Folder and restart the workflow, we can start inspecting the qcML files.

Task

Find your first created qcML file and open it in a browser (not Internet Explorer); the contained QC parameters will be rendered for you.

Adding brand new QC metrics

We can also add brand new QC metrics to our qcML files. Remember the Histogram you added inside the ZipLoop during the label-free quantitation section? Let’s imagine for a moment this was a brand new and utterly important metric and plot for the assessment of your analyses’ quality. There is an easy way to integrate such new metrics into your qcMLs. Though the Histogram node cannot pass its plot on as an image, we can do so with an R View (table).

  • Add an R View (table) next to the IDTextReader node and connect them.

  • Edit the R View (table) by adding the R Script according to this:

    #install.packages("ggplot2")
    library("ggplot2")
    ggplot(knime.in, aes(x=peptide_charge)) +
      geom_histogram(binwidth=1, origin=-0.5) +
      scale_x_discrete() +
      ggtitle("Identified peptides charge histogram") +
      ylab("Count")
  • This will create a plot like the Histogram node on peptide_charge and pass it on as an image.

  • Now add and connect a Image2FilePort node from Community Nodes > GenericKnimeNodes > Flow to the R View (table).

  • We can now use a QCEmbedder node like before to add our new metric plot into the qcML.

  • After looking for an appropriate target from the following link, we found that we can attach our plot to the MS identification result details by setting the parameter qp_att_acc to QC:0000025, as we are plotting the charge histogram of our identified peptides.

  • To have the plot later displayed properly, we assign it the parameter cv_acc of QC:0000051, a generic plot. Also, we made sure in the R script that our plot carries a caption so that we know which is which if we have more than one new plot.

  • Now we redirect the QCEmbedder’s output to the Output Folder from before and can have a look at how our qcML is coming along after restarting the workflow.


Figure 52: QC with new metric.

Set QC metrics

Besides monitoring the quality of each individual mass spectrometry run analysis, another capability of QC with OpenMS and qcML is to monitor the complete set. The easiest control is to compare mass spectrometry runs which should be similar, e.g. technical replicates, to spot any aberrations in the set. For this, we will first collect all created qcML files, merge them together and use the qcML onboard set QC properties to detect any outliers.

  • Connect the QCEmbedder’s output from the last section to the ZipLoopEnd’s second input port.

  • The corresponding output port will collect all qcML files from each ZipLoop iteration and pass them on as a list of files.

  • Now we add a QCMerger node after the ZipLoopEnd and feed it that list of qcML files. In addition, we set its parameter setname to give our newly created set a name - say spikein_replicates (see the command-line sketch after this list).

  • To inspect all the QCs next to each other in that created qcML file, we have to add a new Output Folder to which we can connect the QCMerger output.
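
The merge step can also be reproduced with the QCMerger tool on the command line. A minimal sketch with placeholder file names (check QCMerger --help for the exact parameter names in your OpenMS version):

    QCMerger -in run1.qcML run2.qcML run3.qcML -out spikein_replicates.qcML -setname spikein_replicates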

When inspecting the set-qcML file in a browser, we will be presented with another overview. After the set content listing, the basic QC parameters (like the number of identifications) are each displayed in a graph. Each set member (or run) has its own section on the x-axis, and each run is connected with that graph via a link in the mouseover on one of the QC parameter values.


Figure 53: QC set creation from ZipLoop.

Task

If you have ideas for new QC metrics and parameters that you add to your qcML files as generic parameters, feel free to contact us so we can include them in the CV.

References

TOPPView

Visualizing the data is the first step in quality control, an essential tool for understanding the data, and of course a key step in pipeline development. OpenMS provides a convenient viewer for some of the data: TOPPView. We will guide you through some of the basic features of TOPPView. Please familiarize yourself with the key controls and visualization methods. We will make use of these later throughout the tutorial. Let’s start with a first look at one of the files of our tutorial data set. Note that conceptually, there are no differences in visualizing metabolomic or proteomic data. Here, we inspect a simple proteomic measurement:


Figure 3: TOPPView, the graphical application for viewing mass spectra and analysis results. Top window shows a small region of a peak map. In this 2D representation of the measured spectra, signals of eluting peptides are colored according to the raw peak intensities. The lower window displays an extracted spectrum (=scan) from the peak map. On the right side, the list of spectra can be browsed.


Figure 4: 3D representation of the measured spectra, signals of eluting peptides are colored according to the raw peak intensities.

  • Start TOPPView (see Windows’ Start-Menu or Applications > OpenMS-3.1.0 on macOS)

  • Go to File > Open File, navigate to the directory where you copied the contents of the USB stick to, and select Example_Data > Introduction > datasets > small > velos005614.mzML. This file contains only a reduced LC-MS map of a label-free proteomic platelet measurement recorded on an Orbitrap Velos. The other two mzML files contain technical replicates of this experiment. First, we want to obtain a global view on the whole LC-MS map - the default option Map view 2D is the correct one, and we can click the Ok button.

  • Play around.

  • Three basic modes allow you to interact with the displayed data: scrolling, zooming and measuring:

    • Scroll mode

      • Is activated by default (though each loaded spectra file is displayed zoomed out first, so you do not need to scroll).

      • Allows you to browse your data by moving around in RT and m/z range.

      • When zoomed in, you can scroll through the spectra. Click-drag on the current view.

      • Arrow keys can be used to scroll the view as well.

    • Zoom mode

      • Zooming into the data; either mark an area in the current view with your mouse while holding the left mouse button plus the Ctrl key to zoom to this area or use your mouse wheel to zoom in and out.

      • All previous zoom levels are stored in a zoom history. The zoom history can be traversed using Ctrl + + or Ctrl + - or the mouse wheel (scroll up and down).

      • Pressing backspace zooms out to show the full LC-MS map (and also resets the zoom history).

    • Measure mode

      • It is activated using the ⇧ Shift key.

      • Press the left mouse button down while a peak is selected and drag the mouse to another peak to measure the distance between peaks.

      • This mode is implemented in the 1D and 2D mode only.

  • Right click on your 2D map and select Switch to 3D mode and examine your data in 3D mode (see Fig. 4).

  • Visualize your data in different intensity normalization modes: linear, percentage (set intensity axis scale to percentage), snap and log-view (icons on the upper left tool bar). You can hover over the icons for additional information.

    Note

    On macOS, due to a bug in one of the external libraries used by OpenMS, you will see a small window of the 3D mode when switching to 2D. Close the 3D tab in order to get rid of it.

  • In TOPPView you can also execute TOPP tools. Go to Tools > Apply tool (whole layer) and choose a TOPP tool (e.g., FileInfo) and inspect the results.

Depending on your data, MS/MS spectra can be visualized as well (see Fig. 5). You can do so by double-clicking on the MS/MS spectrum shown in the scan view.


Figure 5: MS/MS spectrum

Smoothing Raw Data

To smooth raw data, call one of the available NoiseFilters via the Tools menu (select Tools > Apply TOPP tool), then select NoiseFilterSGolay or NoiseFilterGaussian as the TOPP tool (green rectangle). The parameters for the chosen filter type can be adapted (blue rectangle). For the Savitzky-Golay filter, set the frame_length and the polynomial_order of the polynomial to be fitted. For the Gaussian filter, the Gaussian width and the ppm tolerance for a flexible Gaussian width depending on the m/z value can be adapted. Press Ok to run the selected NoiseFilter.
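
The same smoothing can be run outside TOPPView with the corresponding TOPP tools. A minimal sketch with placeholder file names and illustrative parameter values (check the tools' --help for the exact options in your OpenMS version):

    NoiseFilterGaussian -in raw.mzML -out smoothed.mzML -algorithm:gaussian_width 0.2
    # alternatively, Savitzky-Golay smoothing:
    NoiseFilterSGolay -in raw.mzML -out smoothed.mzML -algorithm:frame_length 11 -algorithm:polynomial_order 4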

The following image shows a part of the spectrum after smoothing as a red line, with the unsmoothed data in green.

Subtracting a Baseline from a Spectrum

First, load the spectrum to be analyzed in TOPPView. To use the described tools, open the tutorial data via the File menu (File > Open example file, then select peakpicker_tutorial_1.mzML). The BaselineFilter can be called via the Tools menu (Tools > Apply TOPP tool); then select BaselineFilter as the TOPP tool (red rectangle). You can choose between different types of filters (green rectangle); the one mainly used is TopHat. The other important parameter is the length of the structuring element (blue rectangle); the default value is 3 Thomson. Press Ok to start the baseline subtraction.

TOPPView Tools Baseline
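
On the command line, the equivalent baseline subtraction might look roughly like this (file names are placeholders and the structuring-element parameter name is an assumption; check BaselineFilter --help for your OpenMS version):

    BaselineFilter -in peakpicker_tutorial_1.mzML -out baseline_filtered.mzML -struc_elem_length 3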

The following image shows:

  • A part of the spectrum after baseline filtering as a green line.

  • The original raw data as a blue line.

TOPPView Tools Baseline Filtered

Profile data processing

Introduction

To find all peaks in the profile data:

  1. Eliminate noise using a NoiseFilter.

  2. The now smoothed profile data can be further processed by subtracting the baseline with the BaselineFilter.

  3. Then use one of the PeakPickers to find all peaks in the baseline-reduced profile data.

TOPP raw data

There are two different smoothing filters: NoiseFilterGaussian and NoiseFilterSGolay. To use the Savitzky-Golay filter or the BaselineFilter with non-equally-spaced profile data, e.g. TOF data, you first have to generate equally spaced data using the Resampler tool.
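
A rough command-line sketch of this resampling step (the file names and the sampling rate are placeholders; check Resampler --help for the exact parameter name in your OpenMS version):

    Resampler -in tof_profile.mzML -out tof_resampled.mzML -sampling_rate 0.05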

Picking peaks with a PeakPicker

The PeakPicker tools allow for picking peaks in profile data. Currently, there are two different TOPP tools available, PeakPickerWavelet and PeakPickerHiRes.

PeakPickerWavelet

This peak picking algorithm uses the continuous wavelet transform of a raw data signal to detect mass peaks. Afterwards a given asymmetric peak function is fitted to the raw data and important peak parameters (e.g. fwhm) are extracted. In an optional step these parameters can be optimized using a non-linear optimization method.

The algorithm is described in detail in Lange et al. (2006) Proc. PSB-06.

  • Input Data: profile data (low/medium resolution)

  • Application: This algorithm was designed for low and medium resolution data. It can also be applied to high-resolution data, but can be slow on large datasets.

See the PeakPickerCWT class documentation for a parameter list.

PeakPickerHiRes

This peak-picking algorithm detects ion signals in raw data and reconstructs the corresponding peak shape by cubic spline interpolation. Signal detection depends on the signal-to-noise ratio which is adjustable by the user (see parameter signal_to_noise). A picked peak’s m/z and intensity value is given by the maximum of the underlying peak spline.

Please note that this method is still considered experimental, since it has not been tested thoroughly yet.

  • Input Data: profile data (high resolution)

  • Application: The algorithm is best suited for high-resolution MS data (FT-ICR-MS, Orbitrap). In high-resolution data, the signals of ions with similar mass-to-charge ratios (m/z) exhibit little or no overlapping and therefore allow for a clear separation. Furthermore, ion signals tend to show well-defined peak shapes with narrow peak width. These properties facilitate a fast computation of picked peaks so that even large data sets can be processed very quickly.

    See the PeakPickerHiRes class documentation for a parameter list.
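
A minimal command-line sketch for picking a smoothed high-resolution file (file names and the signal-to-noise value are placeholders; see PeakPickerHiRes --help for the exact parameter names in your OpenMS version):

    PeakPickerHiRes -in smoothed.mzML -out picked.mzML -algorithm:signal_to_noise 1.0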

Finding the right parameters for the NoiseFilters, the BaselineFilter and the PeakPickers

Finding the right parameters is not trivial. The default parameters will not work on most datasets. In order to find good parameters, follow this procedure:

  1. Load the data in TOPPView.

  2. Extract a single scan from the middle of the HPLC gradient (Right click on scan).

  3. Experiment with the parameters until you have found the proper settings.

Find the NoiseFilters, the BaselineFilter, and the PeakPickers in TOPPView in the menu Layer > Apply TOPP tool.