matchms.Pipeline module

class matchms.Pipeline.Pipeline(workflow: OrderedDict, progress_bar=True, logging_level: str = 'WARNING', logging_file: str | None = None)[source]

Bases: object

Central pipeline class.

The matchms Pipeline class is meant to make running extensive analysis pipelines fast and easy. It can be used in two different ways. First, a pipeline can be defined using a config file (a yaml file; it is best to start from the provided template when defining your own pipeline).

Once a config file is defined, the pipeline can be executed with the following code:
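(A minimal sketch; the loader function and its import path are assumptions not documented in this section, and the file names are placeholders.)

from matchms.Pipeline import Pipeline
# Assumed helper that reads the yaml config into a workflow OrderedDict:
from matchms.yaml_file_functions import load_workflow_from_yaml_file

workflow = load_workflow_from_yaml_file("my_pipeline_config.yaml")
pipeline = Pipeline(workflow)
pipeline.run("query_spectra.msp")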

The second way to define a pipeline is via a Python script. The following code is an example of how this works:
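(A minimal sketch; the "normalize_intensities" filter name is assumed to be available in matchms, and the file names are placeholders.)

from matchms.Pipeline import Pipeline, create_workflow

workflow = create_workflow(query_filters=["normalize_intensities"],  # assumed filter name
                           score_computations=[["precursormzmatch", {"tolerance": 120.0}]])
pipeline = Pipeline(workflow)
pipeline.run("query_spectra.msp")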

To combine this with custom-made scores or available matchms-compatible scores such as Spec2Vec or MS2DeepScore, it is also possible to pass objects instead of names to create_workflow:

from matchms.Pipeline import create_workflow
from spec2vec import Spec2Vec

workflow = create_workflow(score_computations=[["precursormzmatch", {"tolerance": 120.0}],
                                               [Spec2Vec, {"model": "my_spec2vec_model.model"}],
                                               ["filter_by_range", {"name": "Spec2Vec", "low": 0.3}]])
__init__(workflow: OrderedDict, progress_bar=True, logging_level: str = 'WARNING', logging_file: str | None = None)[source]
Parameters:
  • workflow – An OrderedDict containing the workflow settings. Can be created using create_workflow.

  • progress_bar – Default is True. Set to False if no progress bar should be displayed.

  • logging_level – Logging level for the matchms logger. Default is 'WARNING'.

  • logging_file – Path of a file to write log messages to. If None (default), log messages are not written to a file.

check_workflow()[source]

Check whether the defined Pipeline workflow is valid before running.

import_spectrums(query_files: List[str] | str, reference_files: List[str] | str | None = None)[source]

Import spectra from file(s).

Parameters:
  • query_files – List of files, or single filename, containing the query spectra.

  • reference_files – List of files, or single filename, containing the reference spectra. If set to None (default) then all query spectra will be compared to each other.
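For illustration, a hedged usage sketch (assuming a Pipeline instance pipeline and placeholder msp files):

pipeline.import_spectrums("query_spectra.msp", reference_files="reference_spectra.msp")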

run(query_files, reference_files=None, cleaned_query_file=None, cleaned_reference_file=None)[source]

Execute the defined Pipeline workflow.

This method will execute all steps of the workflow:
  1) Initialize the log file and import the spectra.
  2) Process the spectra (using matchms filters).
  3) Compute the scores.
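A hedged usage sketch (file names are placeholders; the cleaned_* arguments are assumed to be output paths for the processed spectra):

pipeline.run("query_spectra.msp",
             reference_files="reference_spectra.msp",
             cleaned_query_file="query_spectra_cleaned.msp",
             cleaned_reference_file="reference_spectra_cleaned.msp")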

set_logging()[source]

Set the matchms logger to write messages to file (if defined).

write_to_logfile(line)[source]

Write message to log file.

matchms.Pipeline.check_score_computation(score_computations: Iterable[str | List[dict]])[source]

Check whether the score computations look valid before running. The aim is to avoid the pipeline crashing after a long computation.

matchms.Pipeline.create_workflow(yaml_file_name: str | None = None, query_filters: Iterable[str | Tuple[str, Dict[str, Any]]] = (), reference_filters: Iterable[str | Tuple[str, Dict[str, Any]]] = (), score_computations: Iterable[str | List[dict]] = ()) OrderedDict[source]

Creates a workflow that specifies the filters and score computations to be run by Pipeline.

Example code can be found in the docstring of Pipeline.

Parameters:
  • yaml_file_name – If a file name is specified, a yaml file containing the workflow settings will be saved. If None, no yaml file will be saved.

  • query_filters – Additional filters that should be applied to the query spectra.

  • reference_filters – Additional filters that should be applied to the reference spectra.

  • score_computations – Score computations that should be performed.
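A minimal sketch using only the documented parameters (the "normalize_intensities" filter name is assumed; file names are placeholders):

from matchms.Pipeline import create_workflow

workflow = create_workflow(yaml_file_name="my_workflow.yaml",
                           query_filters=["normalize_intensities"],  # assumed filter name
                           score_computations=[["precursormzmatch", {"tolerance": 120.0}]])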

matchms.Pipeline.get_unused_filters(yaml_file)[source]

Prints all filter names that are in ALL_FILTERS but not in the yaml file.