matchms.similarity.MetadataMatch module

class matchms.similarity.MetadataMatch.MetadataMatch(field: str, matching_type: str = 'equal_match', tolerance: float = 0.1)[source]

Bases: BaseSimilarity

Return True if metadata entries of a specified field match between two spectra.

This is intended for comparing a wide range of possible metadata entries, so that the results can later be used to select related or similar spectra.

Example: calculate scores between two reference and two query spectrums, then iterate over the scores

import numpy as np
from matchms import calculate_scores
from matchms import Spectrum
from matchms.similarity import MetadataMatch

spectrum_1 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"instrument_type": "orbitrap",
                                "id": 1})
spectrum_2 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"instrument_type": "qtof",
                                "id": 2})
spectrum_3 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"instrument_type": "qtof",
                                "id": 3})
spectrum_4 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"instrument_type": "orbitrap",
                                "id": 4})
references = [spectrum_1, spectrum_2]
queries = [spectrum_3, spectrum_4]

similarity_score = MetadataMatch(field="instrument_type")
scores = calculate_scores(references, queries, similarity_score)

for (reference, query, score) in scores:
    print(f"Metadata match between {reference.get('id')} and {query.get('id')}" +
          f" is {score}")

Should output

Metadata match between 1 and 4 is [True]
Metadata match between 2 and 3 is [True]
__init__(field: str, matching_type: str = 'equal_match', tolerance: float = 0.1)[source]
Parameters:
  • field – Specify field name for metadata that should be compared.

  • matching_type – Specify how field entries should be matched. Can be one of [“equal_match”, “difference”].

  • tolerance – Specify the tolerance below which the difference between two values is counted as a match. This applies only to numerical values.
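The two matching types can be illustrated with a short standalone sketch. This is not the matchms implementation: `metadata_match` is a hypothetical helper that works on plain values. Treating a missing entry as a non-match, and treating a difference exactly equal to the tolerance as a match, are both assumptions.

```python
from numbers import Number


def metadata_match(entry_ref, entry_query, matching_type="equal_match", tolerance=0.1):
    """Hypothetical helper mirroring the documented matching types."""
    # A missing entry on either side is treated as "no match" (assumption).
    if entry_ref is None or entry_query is None:
        return False
    if matching_type == "equal_match":
        return entry_ref == entry_query
    if matching_type == "difference":
        # The tolerance is only meaningful for numerical entries.
        if isinstance(entry_ref, Number) and isinstance(entry_query, Number):
            return abs(entry_ref - entry_query) <= tolerance
        return False
    raise ValueError(f"Unknown matching_type: {matching_type}")


ref = {"instrument_type": "qtof", "retention_time": 100.0}
query = {"instrument_type": "qtof", "retention_time": 104.0}

same_instrument = metadata_match(ref.get("instrument_type"), query.get("instrument_type"))
close_rt = metadata_match(ref.get("retention_time"), query.get("retention_time"),
                          matching_type="difference", tolerance=5.0)
far_rt = metadata_match(ref.get("retention_time"), query.get("retention_time"),
                        matching_type="difference", tolerance=1.0)
print(same_instrument, close_rt, far_rt)
```

In this sketch, string entries are only compared for equality, while numerical entries can additionally be compared within a tolerance.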

keep_score(score)

In the .matrix method, scores are collected in a sparse way. Override this method if values other than False or 0 should not be stored in the final collection.
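As a rough illustration of sparse collection, the sketch below keeps only scores that are not False or 0 and stores them as (row, column, value) triplets. The `keep_score` function here is a hypothetical stand-in written for this example, not the library method itself.

```python
def keep_score(score):
    """Hypothetical filter: keep a score only if it is truthy (not False or 0)."""
    return bool(score)


# Dense boolean scores for 2 references (rows) x 3 queries (columns).
dense_scores = [
    [True, False, False],
    [False, True, True],
]

# Collect only the kept entries as (row, column, value) triplets.
sparse_entries = [
    (i, j, score)
    for i, row in enumerate(dense_scores)
    for j, score in enumerate(row)
    if keep_score(score)
]
print(sparse_entries)
```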

matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False) → ndarray[source]

Compare the metadata entries of the selected field between all references and queries.

Parameters:
  • references – List/array of reference spectrums.

  • queries – List/array of query spectrums.

  • array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy” and will return a numpy array. “sparse” will return a COO-sparse array.

  • is_symmetric – Set to True when references and queries are identical (for instance, in an all-vs-all comparison). Because score[i,j] = score[j,i], only about half of the pairs need to be computed, which makes the calculation about 2x faster.
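The symmetry shortcut can be sketched in plain Python as follows. `score_matrix` is a hypothetical function that compares plain metadata values for equality; it is meant to show the idea of computing one triangle and mirroring it, not how matchms implements it. Raising an error for unequal input sizes is a choice made for this sketch.

```python
def score_matrix(entries_ref, entries_query, is_symmetric=False):
    """Hypothetical all-vs-all equality comparison on plain metadata values."""
    n_rows, n_cols = len(entries_ref), len(entries_query)
    if is_symmetric and n_rows != n_cols:
        raise ValueError("is_symmetric requires equally sized inputs")
    scores = [[False] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        # In the symmetric case, start at the diagonal and mirror each result.
        start = i if is_symmetric else 0
        for j in range(start, n_cols):
            scores[i][j] = entries_ref[i] == entries_query[j]
            if is_symmetric:
                scores[j][i] = scores[i][j]
    return scores


instruments = ["orbitrap", "qtof", "qtof"]
full = score_matrix(instruments, instruments)
mirrored = score_matrix(instruments, instruments, is_symmetric=True)
print(mirrored)
```

Both calls give the same matrix; the symmetric variant evaluates only the pairs on or above the diagonal.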

pair(reference: Spectrum, query: Spectrum) → float[source]

Compare the metadata entries of the selected field between the reference and query spectrum.

Parameters:
  • reference – Single reference spectrum.

  • query – Single query spectrum.

score_datatype

alias of bool

sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False)

Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.

Compute similarity scores for pairs of reference and query spectrums as given by the indices idx_row (references) and idx_col (queries). If this method is not overridden, a naive implementation (i.e. a for-loop) is used.

Parameters:
  • references – List of reference objects

  • queries – List of query objects

  • idx_row – List/array of row indices

  • idx_col – List/array of column indices

  • is_symmetric – Set to True when references and queries are identical (for instance, in an all-vs-all comparison). Because score[i,j] = score[j,i], only about half of the pairs need to be computed, which makes the calculation about 2x faster.
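A naive index-driven loop of the kind described above might look like the following sketch. `sparse_scores` is a hypothetical function operating on plain metadata values, assumed here to compare entries for equality; it only illustrates how idx_row and idx_col select the pairs to score.

```python
def sparse_scores(entries_ref, entries_query, idx_row, idx_col):
    """Hypothetical naive loop: score only the requested (row, column) pairs."""
    if len(idx_row) != len(idx_col):
        raise ValueError("idx_row and idx_col must have the same length")
    return [entries_ref[row] == entries_query[col]
            for row, col in zip(idx_row, idx_col)]


references = ["orbitrap", "qtof"]
queries = ["qtof", "orbitrap"]

# Score three selected pairs: (0, 1), (1, 0) and (1, 1).
scores = sparse_scores(references, queries, idx_row=[0, 1, 1], idx_col=[1, 0, 1])
print(scores)
```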

to_dict() → dict

Return a dictionary representation of a similarity function.