matchms.filtering package

Processing (or: filtering) mass spectra

Provided functions will usually only perform a single action to a spectrum. This can be changes or corrections of metadata, or peak filtering. More complicated processing pipelines can be build by stacking several of the provided filters.

Because there are numerous filter functions in matchms and because they often need to be applied in a specific order, the most feasible workflow for users is to use the SpectrumProcessor class to define a spetrum processing pipeline. Here is an example:

import numpy as np
from matchms import Spectrum
from matchms import SpectrumProcessor

spectrum = Spectrum(mz=np.array([100, 120, 150, 200.]),
                    intensities=np.array([200.0, 300.0, 50.0, 1.0]),
                    metadata={'id': 'spectrum1'})

# Users can pick a predefined pipeline ("basic", "default", "fully_annotated")
# Or set to None if no predefined settings are desired.
processing = SpectrumProcessor("basic")

# Additional filters can be added as desired
processing.add_matchms_filter("normalize_intensities")

# Run the processing pipeline:
spectrum_filtered = processing.process_spectrum(spectrum)
max_intensity = spectrum_filtered.peaks.intensities.max()
print(f"Maximum intensity is {max_intensity:.2f}")

Should output

Maximum intensity is 1.00

It is also possible to run each filter function individually. This for instance makes sense if users want to develop a highly customized spectrum processing routine. Example of how to use a single filter function:

import numpy as np
from matchms import Spectrum
from matchms.filtering import normalize_intensities

spectrum = Spectrum(mz=np.array([100, 120, 150, 200.]),
                    intensities=np.array([200.0, 300.0, 50.0, 1.0]),
                    metadata={'id': 'spectrum1'})
spectrum_filtered = normalize_intensities(spectrum)

max_intensity = spectrum_filtered.peaks.intensities.max()
print(f"Maximum intensity is {max_intensity:.2f}")

Should output

Maximum intensity is 1.00
matchms filtering sketch

Sketch of matchms spectrum processing.

class matchms.filtering.SpeciesString(dirty: str)[source]

Bases: object

A class to process and clean different types of chemical structure strings including InChI, InChIKey, and SMILES.

The class takes a raw input string, determines the intended structure type, and then cleans the string based on its type.

dirty

Raw input string representing a chemical structure.

Type:

str

target

The intended structure type determined from the input string. Could be ‘inchi’, ‘inchikey’, ‘smiles’, or None if no valid type was identified.

Type:

str

cleaned

The cleaned structure string.

Type:

str

__init__(dirty: str)[source]

Constructs a new instance of the SpeciesString class.

Parameters:

dirty (str) – The raw input string representing a chemical structure.

clean()[source]

Clean the input string based on its determined structure type.

clean_as_inchi()[source]

Search for valid inchi and harmonize it.

clean_as_inchikey()[source]

Search for valid inchikey and harmonize it.

clean_as_smiles()[source]

Search for valid smiles and harmonize it.

guess_target()[source]

Determine the intended structure type of the input string.

looks_like_a_smiles()[source]

Return True if string is made of allowed charcters for smiles.

looks_like_an_inchi()[source]

Search for first piece of InChI.

looks_like_an_inchikey()[source]

Return True if string has format of inchikey.

matchms.filtering.add_compound_name(spectrum_in: Spectrum) Spectrum[source]

Add compound_name to correct field: “compound_name” in metadata.

matchms.filtering.add_fingerprint(spectrum_in: Spectrum | None, fingerprint_type: str = 'daylight', nbits: int = 2048) Spectrum | None[source]

Add molecular finterprint to spectrum.

If smiles or inchi present in metadata, derive a molecular finterprint and add it to the spectrum.

Parameters:
  • spectrum_in – Input spectrum.

  • fingerprint_type – Determine method for deriving molecular fingerprints. Supported choices are “daylight”, “morgan1”, “morgan2”, “morgan3”. Default is “daylight”.

  • nbits – Dimension or number of bits of generated fingerprint. Default is 2048.

matchms.filtering.add_losses(spectrum_in: Spectrum, loss_mz_from=0.0, loss_mz_to=1000.0) Spectrum[source]

Derive losses based on precursor mass.

Parameters:
  • spectrum_in – Input spectrum.

  • loss_mz_from – Minimum allowed m/z value for losses. Default is 0.0.

  • loss_mz_to – Maximum allowed m/z value for losses. Default is 1000.0.

matchms.filtering.add_parent_mass(spectrum_in: Spectrum, estimate_from_adduct: bool = True, overwrite_existing_entry: bool = False) Spectrum[source]

Add estimated parent mass to metadata (if not present yet).

Method to calculate the parent mass from given precursor m/z together with charge and/or adduct. Will take precursor m/z from “precursor_mz” as provided by running add_precursor_mz. For estimate_from_adduct=True this function will estimate the parent mass based on the mass and charge of known adducts. The table of known adduct properties can be found under matchms/data/known_adducts_table.csv.

Parameters:
  • spectrum_in – Input spectrum.

  • estimate_from_adduct – When set to True, use adduct to estimate actual molecular mass (“parent mass”). Default is True. Switches back to charge-based estimate if adduct does not match a known adduct.

  • overwrite_existing_entry – Default is False. If set to True, a newly computed value will replace existing ones.

matchms.filtering.add_precursor_mz(spectrum_in)[source]

Add precursor_mz to correct field and make it a float.

For missing precursor_mz field: check if there is “pepmass”” entry instead. For string parsed as precursor_mz: convert to float.

matchms.filtering.add_retention_index(spectrum_in: Spectrum) Spectrum[source]

Add retention index into ‘retention_index’ key if present.

Parameters:

spectrum – Spectrum with RI information.

Return type:

Spectrum with RI info stored under ‘retention_index’.

matchms.filtering.add_retention_time(spectrum_in: Spectrum) Spectrum[source]

Add retention time information to the ‘retention_time’ key as float. Negative values and those not convertible to a float result in ‘retention_time’ being ‘None’.

Parameters:

spectrum – Spectrum with retention time information.

Return type:

Spectrum with harmonized retention time information.

matchms.filtering.clean_adduct(spectrum_in)[source]

Clean adduct and make it consistent in style. Will transform adduct strings of type ‘M+H+’ to ‘[M+H]+’.

Parameters:

spectrum_in – Matchms Spectrum object.

matchms.filtering.clean_compound_name(spectrum_in: Spectrum) Spectrum[source]

Clean compound name.

A list of frequently seen name additions that do not belong to the compound name will be removed.

matchms.filtering.correct_charge(spectrum_in: Spectrum) Spectrum[source]

Correct charge values based on given ionmode.

For some spectrums, the charge value is either undefined or inconsistent with its ionmode, which is corrected by this filter.

Parameters:

spectrum_in – Input spectrum.

matchms.filtering.default_filters(spectrum: Spectrum) Spectrum[source]

Collection of filters that are considered default and that do no require any (factory) arguments.

Collection is

  1. make_charge_int()

  2. add_compound_name()

  3. derive_adduct_from_name()

  4. derive_formula_from_name()

  5. clean_compound_name()

  6. interpret_pepmass()

  7. add_precursor_mz()

  8. derive_ionmode()

  9. correct_charge()

matchms.filtering.derive_adduct_from_name(spectrum_in: Spectrum, remove_adduct_from_name: bool = True) Spectrum | None[source]

Find adduct in compound name and add to metadata (if not present yet).

Method to interpret the given compound name to find the adduct.

Parameters:
  • spectrum_in – Input spectrum.

  • remove_adduct_from_name – Remove found adducts from compound name if set to True. Default is True.

matchms.filtering.derive_annotation_from_compound_name(spectrum_in: <module 'matchms.Spectrum' from '/home/docs/checkouts/readthedocs.org/user_builds/matchms/checkouts/latest/readthedocs/../matchms/Spectrum.py'>, annotated_compound_names_file: str | None = None, mass_tolerance: float = 0.1)[source]

Adds smiles, inchi, inchikey based on compound name by searching pubchem

This filter is only run, if there is not yet a valid smiles or inchi in the metadata. The smiles, inchi and inchikey are only added if the found annotation is close enough to the parent mass.

Parameters:
  • spectrum_in – The input spectrum.

  • annotated_compound_names_file (Optional[str]) – Any compound name that was searched for on pubchem will be added to this file. If a compound name is already in this file it will be used instead of looking up at pubchem. This file can be reused for future runs, speeding up the process. If None. The compound names found will still be cached for this run, but won’t be reusable for future runs. The csv file should contain the columns [“compound_name”, “smiles”, “inchi”, “inchikey”, “monoisotopic_mass”]

  • mass_tolerance – Acceptable mass difference between query compound and pubchem result.

matchms.filtering.derive_formula_from_name(spectrum_in: Spectrum, remove_formula_from_name: bool = True) Spectrum[source]

Detect and remove misplaced formula in compound name and add to metadata.

Method to find misplaced formulas in compound name based on regular expression. This will not chemically test the detected formula, so the search is limited to frequently occuring types of shape ‘C47H83N1O8P1’.

Parameters:
  • spectrum_in – Input spectrum.

  • remove_formula_from_name – Remove found formula from compound name if set to True. Default is True.

matchms.filtering.derive_inchi_from_smiles(spectrum_in: Spectrum) Spectrum[source]

Find missing Inchi and derive from smiles where possible.

matchms.filtering.derive_inchikey_from_inchi(spectrum_in: Spectrum) Spectrum[source]

Find missing InchiKey and derive from Inchi where possible.

matchms.filtering.derive_ionmode(spectrum_in: Spectrum) Spectrum[source]

Derive missing ionmode based on adduct.

Some input formates (e.g. MGF files) do not always provide a correct ionmode. This function reads the adduct from the metadata and uses this to fill in the correct ionmode where missing.

Parameters:

spectrum – Input spectrum.

Return type:

Spectrum object with ionmode attribute set.

matchms.filtering.derive_smiles_from_inchi(spectrum_in: Spectrum) Spectrum[source]

Find missing smiles and derive from Inchi where possible.

matchms.filtering.harmonize_undefined_inchi(spectrum_in: Spectrum, undefined: str = '', aliases: List[str] | None = None) Spectrum[source]

Replace all aliases for empty/undefined inchi entries by value of undefined argument.

Parameters:
  • undefined – Give desired entry for undefined inchi fields. Default is “”.

  • aliases – Enter list of strings that are expected to represent undefined entries. Default is [“”, “N/A”, “NA”, “n/a”].

matchms.filtering.harmonize_undefined_inchikey(spectrum_in: Spectrum, undefined: str = '', aliases: List[str] | None = None) Spectrum[source]

Replace all aliases for empty/undefined inchikey entries by undefined.

Parameters:
  • undefined – Give desired entry for undefined inchikey fields. Default is “”.

  • aliases – Enter list of strings that are expected to represent undefined entries. Default is [“”, “N/A”, “NA”, “n/a”, “no data”].

matchms.filtering.harmonize_undefined_smiles(spectrum_in: Spectrum, undefined: str = '', aliases: List[str] | None = None) Spectrum[source]

Replace all aliases for empty/undefined smiles entries by undefined.

Parameters:
  • undefined – Give desired entry for undefined smiles fields. Default is “”.

  • aliases – Enter list of strings that are expected to represent undefined entries. Default is [“”, “N/A”, “NA”, “n/a”, “no data”].

matchms.filtering.interpret_pepmass(spectrum_in)[source]

Reads pepmass field (if present) and adds values to correct field(s).

The field “pepmass” or “PEPMASS” is often used to describe the precursor ion. This function will interpret the values as (mz, intensity, charge) tuple. Those will be splitted (if present) added to the fields “precursor_mz”, “precursor_intensity”, and “charge”.

matchms.filtering.make_charge_int(spectrum_in: Spectrum) Spectrum[source]

Convert charge field to integer (if possible).

matchms.filtering.normalize_intensities(spectrum_in: Spectrum) Spectrum[source]

Normalize intensities of peaks (and losses) to unit height.

matchms.filtering.reduce_to_number_of_peaks(spectrum_in: Spectrum, n_required: int = 0, n_max: int = inf, ratio_desired: float | None = None) Spectrum[source]

Lowest intensity peaks will be removed when it has more peaks than desired.

Parameters:
  • spectrum_in – Input spectrum.

  • n_required – Number of minimum required peaks. Spectra with fewer peaks will be set to ‘None’. Default is 1.

  • n_max – Maximum number of peaks. Remove peaks if more peaks are found. Default is inf.

  • ratio_desired – Set desired ratio between maximum number of peaks and parent mass. For spectra without parent mass (e.g. GCMS spectra) this will raise an error when ratio_desired is used. Default is None.

matchms.filtering.remove_peaks_around_precursor_mz(spectrum_in: Spectrum, mz_tolerance: float = 17) Spectrum[source]
Remove peaks that are within mz_tolerance (in Da) of

the precursor mz, exlcuding the precursor peak.

Parameters:
  • spectrum_in – Input spectrum.

  • mz_tolerance – Tolerance of mz values that are not allowed to lie within the precursor mz. Default is 17 Da.

matchms.filtering.remove_peaks_outside_top_k(spectrum_in: Spectrum, k: int = 6, mz_window: float = 50) Spectrum[source]
Remove all peaks which are not within mz_window of at least one

of the k highest intensity peaks of the spectrum.

Parameters:
  • spectrum_in – Input spectrum.

  • k – The number of most intense peaks to compare to. Default is 6.

  • mz_window – Window of mz values (in Da) that are allowed to lie within the top k peaks. Default is 50 Da.

matchms.filtering.repair_adduct_based_on_smiles(spectrum_in: <module 'matchms.Spectrum' from '/home/docs/checkouts/readthedocs.org/user_builds/matchms/checkouts/latest/readthedocs/../matchms/Spectrum.py'>, mass_tolerance: float, accept_parent_mass_is_mol_wt: bool = True)[source]

Corrects the adduct of a spectrum based on its SMILES representation and the precursor m/z.

Given a spectrum, this function tries to match the spectrum’s parent mass, derived from its precursor m/z and known adducts, to the neutral monoisotopic mass of the molecule derived from its SMILES representation. If a match is found within a given mass tolerance, the adduct and parent mass of the spectrum are updated.

Parameters:

spectrum_inSpectrum

The input spectrum whose adduct needs to be repaired.

mass_tolerancefloat

Maximum allowed mass difference between the calculated parent mass and the neutral monoisotopic mass derived from the SMILES.

accept_parent_mass_is_mol_wtbool, optional (default=True)

Allows the function to attempt repairing the spectrum’s parent mass by assuming it represents the molecule’s weight. If True, further checks and corrections are made using repair_parent_mass_is_mol_wt.

matchms.filtering.repair_inchi_inchikey_smiles(spectrum_in: Spectrum) Spectrum[source]

Check if inchi, inchikey, and smiles entries seem correct. Detect and correct if any of those entries clearly belongs into one of the other two fields (e.g. inchikey found in inchi field).

matchms.filtering.repair_not_matching_annotation(spectrum_in: <module 'matchms.Spectrum' from '/home/docs/checkouts/readthedocs.org/user_builds/matchms/checkouts/latest/readthedocs/../matchms/Spectrum.py'>)[source]

Repairs mismatches in a spectrum’s annotations related to SMILES, InChI, and InChIKey.

Given a spectrum, this function ensures that the provided SMILES, InChI, and InChIKey annotations are consistent with one another. If there are discrepancies, they are resolved as follows:

  1. If the SMILES and InChI do not match:
    • Both SMILES and InChI are checked against the parent mass.

    • The annotation that matches the parent mass is retained, and the other is regenerated.

  2. If the InChIKey does not match the InChI:
    • A new InChIKey is generated from the InChI and replaces the old one.

Warnings and information logs are generated to track changes and potential issues. For correctness of InChIKey entries, only the first 14 characters are considered.

Parameters:

spectrum_inSpectrum

The input spectrum containing annotations to be checked and repaired.

Returns:

Spectrum

A cloned version of the input spectrum with corrected annotations. If the input spectrum is None, it returns None.

matchms.filtering.repair_parent_mass_is_mol_wt(spectrum_in: <module 'matchms.Spectrum' from '/home/docs/checkouts/readthedocs.org/user_builds/matchms/checkouts/latest/readthedocs/../matchms/Spectrum.py'>, mass_tolerance: float)[source]

Changes the parent mass from molecular mass into monoistopic mass

Manual entered precursor mz is sometimes wrongly added as Molar weight instead of monoisotopic mass

matchms.filtering.repair_parent_mass_match_smiles_wrapper(spectrum_in: Spectrum, mass_tolerance: float = 0.2) Spectrum | None[source]

Wrapper function for repairing a mismatch between parent mass and smiles mass

matchms.filtering.repair_precursor_is_parent_mass(spectrum_in: <module 'matchms.Spectrum' from '/home/docs/checkouts/readthedocs.org/user_builds/matchms/checkouts/latest/readthedocs/../matchms/Spectrum.py'>, mass_tolerance)[source]

Repairs parent mass and precursor mz if the parent mass is entered instead of the precursor_mz

matchms.filtering.repair_smiles_of_salts(spectrum_in, mass_tolerance)[source]

Repairs the smiles of a salt to match the parent mass. E.g. C1=NC2=NC=NC(=C2N1)N.Cl is converted to 1=NC2=NC=NC(=C2N1)N if this matches the parent mass Checks if parent mass matches one of the ions

matchms.filtering.require_correct_ionmode(spectrum_in: Spectrum, ion_mode_to_keep) Spectrum | None[source]

Validates the ion mode of a given spectrum. If the spectrum’s ion mode doesn’t match the ion_mode_to_keep, it will be removed and a log message will be generated.

Parameters:
  • spectrum_in (Spectrum) – The input spectrum to be validated. If None, the function will return None.

  • ion_mode_to_keep (str) – Desired ion mode (‘positive’, ‘negative’, or ‘both’). If not one of these, a ValueError is raised.

Returns:

The validated spectrum if its ion mode matches the desired one, or None otherwise.

Return type:

Spectrum or None

matchms.filtering.require_minimum_number_of_high_peaks(spectrum_in: Spectrum, no_peaks: int = 5, intensity_percent: float = 2.0) Spectrum[source]
Returns None if the number of peaks with relative intensity

above or equal to intensity_percent is less than no_peaks.

Parameters:
  • spectrum_in – Input spectrum.

  • no_peaks – Minimum number of peaks allowed to have relative intensity above intensity_percent. Less peaks will return none. Default is 5.

  • intensity_percent – Minimum relative intensity (as a percentage between 0-100) for peaks that are searched. Default is 2

matchms.filtering.require_minimum_number_of_peaks(spectrum_in: Spectrum, n_required: int = 10, ratio_required: float | None = None) Spectrum[source]

Spectrum will be set to None when it has fewer peaks than required.

Parameters:
  • spectrum_in – Input spectrum.

  • n_required – Number of minimum required peaks. Spectra with fewer peaks will be set to ‘None’.

  • ratio_required – Set desired ratio between minimum number of peaks and parent mass. Default is None.

matchms.filtering.require_parent_mass_match_smiles(spectrum_in: Spectrum, mass_tolerance) Spectrum | None[source]

Validates if the parent mass of the given spectrum matches the mass calculated from its associated SMILES string within a specified tolerance.

Parameters:
  • spectrum_in (Spectrum) – The input spectrum to be validated. If None, the function will return None.

  • mass_tolerance (float) – The tolerance for the mass difference between the spectrum’s parent mass and the mass calculated from its SMILES string.

Returns:

The validated spectrum if its parent mass matches the SMILES mass within the specified tolerance, or None otherwise.

Return type:

Spectrum or None

matchms.filtering.require_precursor_below_mz(spectrum_in: Spectrum, max_mz: float = 1000) Spectrum[source]
Returns None if the precursor_mz of a spectrum is above

max_mz.

Parameters:
  • spectrum_in – Input spectrum.

  • max_mz – Maximum mz value for the precursor mz of a spectrum. All precursor mz values greater or equal to this will return none. Default is 1000.

matchms.filtering.require_precursor_mz(spectrum_in: Spectrum, minimum_accepted_mz: float = 10.0) Spectrum | None[source]

Returns None if there is no precursor_mz or if <= minimum_accepted_mz

Parameters:
  • spectrum_in – Input spectrum.

  • minimum_accepted_mz – Set to minimum acceptable value for precursor m/z. Default is set to 10.0.

matchms.filtering.require_valid_annotation(spectrum: <module 'matchms.Spectrum' from '/home/docs/checkouts/readthedocs.org/user_builds/matchms/checkouts/latest/readthedocs/../matchms/Spectrum.py'>)[source]

Removes spectra that are not fully annotated (correct and matching, smiles, inchi and inchikey)

matchms.filtering.select_by_intensity(spectrum_in: Spectrum, intensity_from: float = 10.0, intensity_to: float = 200.0) Spectrum[source]

Keep only peaks within set intensity range (keep if intensity_from >= intensity >= intensity_to). In most cases it is adviced to use select_by_relative_intensity() function instead.

Parameters:
  • intensity_from – Set lower threshold for peak intensity. Default is 10.0.

  • intensity_to – Set upper threshold for peak intensity. Default is 200.0.

matchms.filtering.select_by_mz(spectrum_in: Spectrum, mz_from: float = 0.0, mz_to: float = 1000.0) Spectrum[source]

Keep only peaks between mz_from and mz_to (keep if mz_from >= m/z >= mz_to).

Parameters:
  • mz_from – Set lower threshold for m/z peak positions. Default is 0.0.

  • mz_to – Set upper threshold for m/z peak positions. Default is 1000.0.

matchms.filtering.select_by_relative_intensity(spectrum_in: Spectrum, intensity_from: float = 0.0, intensity_to: float = 1.0) Spectrum[source]

Keep only peaks within set relative intensity range (keep if intensity_from >= intensity >= intensity_to).

Parameters:
  • intensity_from – Set lower threshold for relative peak intensity. Default is 0.0.

  • intensity_to – Set upper threshold for relative peak intensity. Default is 1.0.

Subpackages

Submodules