matchms.similarity package

Functions for computing spectra similarities

Matchms provides a number of frequently used similarity scores to compare mass spectra. This includes

  • scores based on comparing peak positions and intensities (CosineGreedy or ModifiedCosine)

  • simple scores that only assess precursor m/z or parent mass matches (PrecursorMzMatch or: ParentMassMatch)

  • scores assessing molecular similarity if structures (SMILES, InchiKey) are given as metadata (FingerprintSimilarity)

  • score for assessing matches in user-defined metadata fields which can be used to find equal entries (e.g. instrument_type) or numerical values within a specified tolerance (for instance: retention_time, collision energy…) (MetadataMatch)

It is also easily possible to add own custom similarity measures or import external ones (such as Spec2Vec).

class matchms.similarity.CosineGreedy(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]

Bases: BaseSimilarity

Calculate ‘cosine similarity score’ between two spectra.

The cosine score aims at quantifying the similarity between two mass spectra. The score is calculated by finding best possible matches between peaks of two spectra. Two peaks are considered a potential match if their m/z ratios lie within the given ‘tolerance’. The underlying peak assignment problem is here solved in a ‘greedy’ way. This can perform notably faster, but does occasionally deviate slightly from a fully correct solution (as with the Hungarian algorithm, see CosineHungarian). In practice this will rarely affect similarity scores notably, in particular for smaller tolerances.

For example

import numpy as np
from matchms import Spectrum
from matchms.similarity import CosineGreedy

reference = Spectrum(mz=np.array([100, 150, 200.]),
                     intensities=np.array([0.7, 0.2, 0.1]))
query = Spectrum(mz=np.array([100, 140, 190.]),
                 intensities=np.array([0.4, 0.2, 0.1]))

# Use factory to construct a similarity function
cosine_greedy = CosineGreedy(tolerance=0.2)

score = cosine_greedy.pair(reference, query)

print(f"Cosine score is {score['score']:.2f} with {score['matches']} matched peaks")

Should output

Cosine score is 0.83 with 1 matched peaks
__init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]
Parameters
  • tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.

  • mz_power – The power to raise m/z to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.

  • intensity_power – The power to raise intensity to in the cosine function. The default is 1.

matrix(references: List[Spectrum], queries: List[Spectrum], is_symmetric: bool = False) ndarray

Optional: Provide optimized method to calculate an numpy.array of similarity scores for given reference and query spectrums. If no method is added here, the following naive implementation (i.e. a double for-loop) is used.

Parameters
  • references – List of reference objects

  • queries – List of query objects

  • is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.

pair(reference: Spectrum, query: Spectrum) Tuple[float, int][source]

Calculate cosine score between two spectra.

Parameters
  • reference – Single reference spectrum.

  • query – Single query spectrum.

Returns

Tuple with cosine score and number of matched peaks.

Return type

Score

sort(scores: ndarray)

Return array of indexes for sorted list of scores. This method can be adapted for different styles of scores.

Parameters

scores – 1D Array of scores.

Returns

Indexes of sorted scores.

Return type

idx_sorted

class matchms.similarity.CosineHungarian(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]

Bases: BaseSimilarity

Calculate ‘cosine similarity score’ between two spectra (using Hungarian algorithm).

The cosine score aims at quantifying the similarity between two mass spectra. The score is calculated by finding best possible matches between peaks of two spectra. Two peaks are considered a potential match if their m/z ratios lie within the given ‘tolerance’. The underlying peak assignment problem is here solved using the Hungarian algorithm. This can perform notably slower than the ‘greedy’ implementation in CosineGreedy, but does represent a mathematically proper solution to the problem.

__init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]
Parameters
  • tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.

  • mz_power – The power to raise m/z to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.

  • intensity_power – The power to raise intensity to in the cosine function. The default is 1.

matrix(references: List[Spectrum], queries: List[Spectrum], is_symmetric: bool = False) ndarray

Optional: Provide optimized method to calculate an numpy.array of similarity scores for given reference and query spectrums. If no method is added here, the following naive implementation (i.e. a double for-loop) is used.

Parameters
  • references – List of reference objects

  • queries – List of query objects

  • is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.

pair(reference: Spectrum, query: Spectrum) Tuple[float, int][source]

Calculate cosine score between two spectra.

Parameters
  • reference – Single reference spectrum.

  • query – Single query spectrum.

  • Returns

  • --------

  • peaks. (Tuple with cosine score and number of matched) –

sort(scores: ndarray)

Return array of indexes for sorted list of scores. This method can be adapted for different styles of scores.

Parameters

scores – 1D Array of scores.

Returns

Indexes of sorted scores.

Return type

idx_sorted

class matchms.similarity.FingerprintSimilarity(similarity_measure: str = 'jaccard', set_empty_scores: Union[float, int, str] = 'nan')[source]

Bases: BaseSimilarity

Calculate similarity between molecules based on their fingerprints.

For this similarity measure to work, fingerprints are expected to be derived by running add_fingerprint().

Code example:

import numpy as np
from matchms import calculate_scores
from matchms import Spectrum
from matchms.filtering import add_fingerprint
from matchms.similarity import FingerprintSimilarity

spectrum_1 = Spectrum(mz=np.array([], dtype="float"),
                      intensities=np.array([], dtype="float"),
                      metadata={"smiles": "CCC(C)C(C(=O)O)NC(=O)CCl"})

spectrum_2 = Spectrum(mz=np.array([], dtype="float"),
                      intensities=np.array([], dtype="float"),
                      metadata={"smiles": "CC(C)C(C(=O)O)NC(=O)CCl"})

spectrum_3 = Spectrum(mz=np.array([], dtype="float"),
                      intensities=np.array([], dtype="float"),
                      metadata={"smiles": "C(C(=O)O)(NC(=O)O)S"})

spectrums = [spectrum_1, spectrum_2, spectrum_3]
# Add fingerprints
spectrums = [add_fingerprint(x, nbits=256) for x in spectrums]

# Specify type and calculate similarities
similarity_measure = FingerprintSimilarity("jaccard")
scores = calculate_scores(spectrums, spectrums, similarity_measure)
print(np.round(scores.scores, 3))

Should output

[[1.    0.878 0.415]
 [0.878 1.    0.444]
 [0.415 0.444 1.   ]]
__init__(similarity_measure: str = 'jaccard', set_empty_scores: Union[float, int, str] = 'nan')[source]
Parameters
  • similarity_measure – Chose similarity measure form “cosine”, “dice”, “jaccard”. The default is “jaccard”.

  • set_empty_scores – Define what should be given instead of a similarity score in cases where fingprints are missing. The default is “nan”, which will return numpy.nan’s in such cases.

matrix(references: List[Spectrum], queries: List[Spectrum], is_symmetric: bool = False) array[source]

Calculate matrix of fingerprint based similarity scores.

Parameters
  • references – List of reference spectrums.

  • queries – List of query spectrums.

pair(reference: Spectrum, query: Spectrum) float[source]

Calculate fingerprint based similarity score between two spectra.

Parameters
  • reference – Single reference spectrum.

  • query – Single query spectrum.

score_datatype

alias of float64

sort(scores: ndarray)

Return array of indexes for sorted list of scores. This method can be adapted for different styles of scores.

Parameters

scores – 1D Array of scores.

Returns

Indexes of sorted scores.

Return type

idx_sorted

class matchms.similarity.IntersectMz(scaling: float = 1.0)[source]

Bases: BaseSimilarity

Example score for illustrating how to build custom spectra similarity score.

IntersectMz will count all exact matches of peaks and divide it by all unique peaks found in both spectrums.

Example of how matchms similarity functions can be used:

import numpy as np
from matchms import Spectrum
from matchms.similarity import IntersectMz

spectrum_1 = Spectrum(mz=np.array([100, 150, 200.]),
                      intensities=np.array([0.7, 0.2, 0.1]))
spectrum_2 = Spectrum(mz=np.array([100, 140, 190.]),
                      intensities=np.array([0.4, 0.2, 0.1]))

# Construct a similarity function
similarity_measure = IntersectMz(scaling=1.0)

score = similarity_measure.pair(spectrum_1, spectrum_2)

print(f"IntersectMz score is {score:.2f}")

Should output

IntersectMz score is 0.20
__init__(scaling: float = 1.0)[source]

Constructor. Here, function parameters are defined.

Parameters

scaling – Scale scores to maximum possible score being ‘scaling’.

matrix(references: List[Spectrum], queries: List[Spectrum], is_symmetric: bool = False) ndarray

Optional: Provide optimized method to calculate an numpy.array of similarity scores for given reference and query spectrums. If no method is added here, the following naive implementation (i.e. a double for-loop) is used.

Parameters
  • references – List of reference objects

  • queries – List of query objects

  • is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.

pair(reference: Spectrum, query: Spectrum) float[source]

This will calculate the similarity score between two spectra.

score_datatype

alias of float64

sort(scores: ndarray)

Return array of indexes for sorted list of scores. This method can be adapted for different styles of scores.

Parameters

scores – 1D Array of scores.

Returns

Indexes of sorted scores.

Return type

idx_sorted

class matchms.similarity.MetadataMatch(field: str, matching_type: str = 'equal_match', tolerance: float = 0.1)[source]

Bases: BaseSimilarity

Return True if metadata entries of a specified field match between two spectra.

This is supposed to be used to compare a wide range of possible metadata entries and use this to later select related or similar spectra.

Example to calculate scores between 2 pairs of spectrums and iterate over the scores

import numpy as np
from matchms import calculate_scores
from matchms import Spectrum
from matchms.similarity import MetadataMatch

spectrum_1 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"instrument_type": "orbitrap",
                                "id": 1})
spectrum_2 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"instrument_type": "qtof",
                                "id": 2})
spectrum_3 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"instrument_type": "qtof",
                                "id": 3})
spectrum_4 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"instrument_type": "orbitrap",
                                "id": 4})
references = [spectrum_1, spectrum_2]
queries = [spectrum_3, spectrum_4]

similarity_score = MetadataMatch(field="instrument_type")
scores = calculate_scores(references, queries, similarity_score)

for (reference, query, score) in scores:
    print(f"Metadata match between {reference.get('id')} and {query.get('id')}" +
          f" is {score:.2f}")

Should output

Metadata match between 1 and 3 is 0.00
Metadata match between 1 and 4 is 1.00
Metadata match between 2 and 3 is 1.00
Metadata match between 2 and 4 is 0.00
__init__(field: str, matching_type: str = 'equal_match', tolerance: float = 0.1)[source]
Parameters
  • field – Specify field name for metadata that should be compared.

  • matching_type – Specify how field entries should be matched. Can be one of [“equal_match”, “difference”].

  • tolerance – Specify tolerance below which two values are counted as match. This only applied to numerical values.

matrix(references: List[Spectrum], queries: List[Spectrum], is_symmetric: bool = False) ndarray[source]

Compare parent masses between all references and queries.

Parameters
  • references – List/array of reference spectrums.

  • queries – List/array of Single query spectrums.

  • is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.

pair(reference: Spectrum, query: Spectrum) float[source]

Compare precursor m/z between reference and query spectrum.

Parameters
  • reference – Single reference spectrum.

  • query – Single query spectrum.

score_datatype

alias of bool

sort(scores: ndarray)

Return array of indexes for sorted list of scores. This method can be adapted for different styles of scores.

Parameters

scores – 1D Array of scores.

Returns

Indexes of sorted scores.

Return type

idx_sorted

class matchms.similarity.ModifiedCosine(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]

Bases: BaseSimilarity

Calculate ‘modified cosine score’ between mass spectra.

The modified cosine score aims at quantifying the similarity between two mass spectra. The score is calculated by finding best possible matches between peaks of two spectra. Two peaks are considered a potential match if their m/z ratios lie within the given ‘tolerance’, or if their m/z ratios lie within the tolerance once a mass-shift is applied. The mass shift is simply the difference in precursor-m/z between the two spectra. See Watrous et al. [PNAS, 2012, https://www.pnas.org/content/109/26/E1743] for further details.

For example

import numpy as np
from matchms import Spectrum
from matchms.similarity import ModifiedCosine

spectrum_1 = Spectrum(mz=np.array([100, 150, 200.]),
                      intensities=np.array([0.7, 0.2, 0.1]),
                      metadata={"precursor_mz": 100.0})
spectrum_2 = Spectrum(mz=np.array([104.9, 140, 190.]),
                      intensities=np.array([0.4, 0.2, 0.1]),
                      metadata={"precursor_mz": 105.0})

# Use factory to construct a similarity function
modified_cosine = ModifiedCosine(tolerance=0.2)

score = modified_cosine.pair(spectrum_1, spectrum_2)

print(f"Modified cosine score is {score['score']:.2f} with {score['matches']} matched peaks")

Should output

Modified cosine score is 0.83 with 1 matched peaks
__init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]
Parameters
  • tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.

  • mz_power – The power to raise mz to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.

  • intensity_power – The power to raise intensity to in the cosine function. The default is 1.

matrix(references: List[Spectrum], queries: List[Spectrum], is_symmetric: bool = False) ndarray

Optional: Provide optimized method to calculate an numpy.array of similarity scores for given reference and query spectrums. If no method is added here, the following naive implementation (i.e. a double for-loop) is used.

Parameters
  • references – List of reference objects

  • queries – List of query objects

  • is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.

pair(reference: Spectrum, query: Spectrum) Tuple[float, int][source]

Calculate modified cosine score between two spectra.

Parameters
  • reference – Single reference spectrum.

  • query – Single query spectrum.

Return type

Tuple with cosine score and number of matched peaks.

sort(scores: ndarray)

Return array of indexes for sorted list of scores. This method can be adapted for different styles of scores.

Parameters

scores – 1D Array of scores.

Returns

Indexes of sorted scores.

Return type

idx_sorted

class matchms.similarity.NeutralLossesCosine(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0, ignore_peaks_above_precursor: bool = True)[source]

Bases: BaseSimilarity

Calculate ‘neutral losses cosine score’ between mass spectra.

The neutral losses cosine score aims at quantifying the similarity between two mass spectra. The score is calculated by finding best possible matches between peaks of two spectra. Two peaks are considered a potential match if their m/z ratios lie within the given ‘tolerance’ once a mass-shift is applied. The mass shift is the difference in precursor-m/z between the two spectra. In general, ModifiedCosine is recommended over NeutralLossesCosine because it will on average deliver more reliable results.

__init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0, ignore_peaks_above_precursor: bool = True)[source]
Parameters
  • tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.

  • mz_power – The power to raise mz to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.

  • intensity_power – The power to raise intensity to in the cosine function. The default is 1.

  • ignore_peaks_above_precursor – By default this is set to True, meaning that peaks with m/z values larger than the precursor-m/z will be ignored (since those would correspond to negative “neutral losses”).

matrix(references: List[Spectrum], queries: List[Spectrum], is_symmetric: bool = False) ndarray

Optional: Provide optimized method to calculate an numpy.array of similarity scores for given reference and query spectrums. If no method is added here, the following naive implementation (i.e. a double for-loop) is used.

Parameters
  • references – List of reference objects

  • queries – List of query objects

  • is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.

pair(reference: Spectrum, query: Spectrum) Tuple[float, int][source]

Calculate neutral losses cosine score between two spectra.

Parameters
  • reference – Single reference spectrum.

  • query – Single query spectrum.

Return type

Tuple with cosine score and number of matched peaks.

sort(scores: ndarray)

Return array of indexes for sorted list of scores. This method can be adapted for different styles of scores.

Parameters

scores – 1D Array of scores.

Returns

Indexes of sorted scores.

Return type

idx_sorted

class matchms.similarity.ParentMassMatch(tolerance: float = 0.1)[source]

Bases: BaseSimilarity

Return True if spectrums match in parent mass (within tolerance), and False otherwise.

Example to calculate scores between 2 spectrums and iterate over the scores

import numpy as np
from matchms import calculate_scores
from matchms import Spectrum
from matchms.similarity import ParentMassMatch

spectrum_1 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"id": "1", "parent_mass": 100})
spectrum_2 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"id": "2", "parent_mass": 110})
spectrum_3 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"id": "3", "parent_mass": 103})
spectrum_4 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"id": "4", "parent_mass": 111})
references = [spectrum_1, spectrum_2]
queries = [spectrum_3, spectrum_4]

similarity_score = ParentMassMatch(tolerance=5.0)
scores = calculate_scores(references, queries, similarity_score)

for (reference, query, score) in scores:
    print(f"Parentmass match between {reference.get('id')} and {query.get('id')}" +
          f" is {score:.2f}")

Should output

Parentmass match between 1 and 3 is 1.00
Parentmass match between 1 and 4 is 0.00
Parentmass match between 2 and 3 is 0.00
Parentmass match between 2 and 4 is 1.00
__init__(tolerance: float = 0.1)[source]
Parameters

tolerance – Specify tolerance below which two masses are counted as match.

matrix(references: List[Spectrum], queries: List[Spectrum], is_symmetric: bool = False) ndarray[source]

Compare parent masses between all references and queries.

Parameters
  • references – List/array of reference spectrums.

  • queries – List/array of Single query spectrums.

  • is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.

pair(reference: Spectrum, query: Spectrum) float[source]

Compare parent masses between reference and query spectrum.

Parameters
  • reference – Single reference spectrum.

  • query – Single query spectrum.

score_datatype

alias of bool

sort(scores: ndarray)

Return array of indexes for sorted list of scores. This method can be adapted for different styles of scores.

Parameters

scores – 1D Array of scores.

Returns

Indexes of sorted scores.

Return type

idx_sorted

class matchms.similarity.PrecursorMzMatch(tolerance: float = 0.1, tolerance_type: str = 'Dalton')[source]

Bases: BaseSimilarity

Return True if spectrums match in precursor m/z (within tolerance), and False otherwise. The match within tolerance can be calculated based on an absolute m/z difference (tolerance_type=”Dalton”) or based on a relative difference in ppm (tolerance_type=”ppm”).

Example to calculate scores between 2 pairs of spectrums and iterate over the scores

import numpy as np
from matchms import calculate_scores
from matchms import Spectrum
from matchms.similarity import PrecursorMzMatch

spectrum_1 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"id": "1", "precursor_mz": 100})
spectrum_2 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"id": "2", "precursor_mz": 110})
spectrum_3 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"id": "3", "precursor_mz": 103})
spectrum_4 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"id": "4", "precursor_mz": 111})
references = [spectrum_1, spectrum_2]
queries = [spectrum_3, spectrum_4]

similarity_score = PrecursorMzMatch(tolerance=5.0, tolerance_type="Dalton")
scores = calculate_scores(references, queries, similarity_score)

for (reference, query, score) in scores:
    print(f"Precursor m/z match between {reference.get('id')} and {query.get('id')}" +
          f" is {score:.2f}")

Should output

Precursor m/z match between 1 and 3 is 1.00
Precursor m/z match between 1 and 4 is 0.00
Precursor m/z match between 2 and 3 is 0.00
Precursor m/z match between 2 and 4 is 1.00
__init__(tolerance: float = 0.1, tolerance_type: str = 'Dalton')[source]
Parameters
  • tolerance – Specify tolerance below which two m/z are counted as match.

  • tolerance_type – Chose between fixed tolerance in Dalton (=”Dalton”) or a relative difference in ppm (=”ppm”).

matrix(references: List[Spectrum], queries: List[Spectrum], is_symmetric: bool = False) ndarray[source]

Compare parent masses between all references and queries.

Parameters
  • references – List/array of reference spectrums.

  • queries – List/array of Single query spectrums.

  • is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.

pair(reference: Spectrum, query: Spectrum) float[source]

Compare precursor m/z between reference and query spectrum.

Parameters
  • reference – Single reference spectrum.

  • query – Single query spectrum.

score_datatype

alias of bool

sort(scores: ndarray)

Return array of indexes for sorted list of scores. This method can be adapted for different styles of scores.

Parameters

scores – 1D Array of scores.

Returns

Indexes of sorted scores.

Return type

idx_sorted

Submodules