matchms.similarity.MetadataMatch module

class matchms.similarity.MetadataMatch.MetadataMatch(field: str, matching_type: str = 'equal_match', tolerance: float = 0.1, tolerance_type: str = 'Dalton')

Bases: BaseSimilarityWithSparse

Return True if metadata entries of a specified field match between two spectra.

This class is intended for comparing a wide range of metadata entries, for example to select related or similar spectra in a later step.

Matching can be done by:

exact equality (matching_type="equal_match")
numerical difference within a tolerance (matching_type="difference")

For numerical differences, the tolerance can be interpreted as:

absolute difference in Dalton / raw units (tolerance_type="Dalton")
relative difference in ppm (tolerance_type="ppm")
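The matching rules listed above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name values_match is hypothetical, treating non-numeric entries as non-matching is an assumption, and so is normalizing the ppm difference by the mean of the two values.

```python
def values_match(a, b, matching_type="equal_match",
                 tolerance=0.1, tolerance_type="Dalton"):
    """Hypothetical helper that mirrors the documented matching rules."""
    if matching_type == "equal_match":
        return a == b
    if matching_type != "difference":
        raise ValueError(f"Unknown matching_type: {matching_type!r}")

    try:
        a, b = float(a), float(b)
    except (TypeError, ValueError):
        # Assumption: entries that are not numeric never match.
        return False

    diff = abs(a - b)
    if tolerance_type == "Dalton":
        # Absolute difference in Dalton / raw units.
        return diff <= tolerance
    if tolerance_type == "ppm":
        # Assumption: relative difference is taken against the mean value.
        mean = abs(a + b) / 2
        if mean == 0:
            return diff == 0
        return diff / mean * 1e6 <= tolerance
    raise ValueError(f"Unknown tolerance_type: {tolerance_type!r}")


print(values_match("qtof", "qtof"))                               # True
print(values_match(100.0, 100.05, "difference", 0.1, "Dalton"))   # True
print(values_match(500.0, 500.001, "difference", 10, "ppm"))      # True
print(values_match(500.0, 500.01, "difference", 10, "ppm"))       # False
```

With a ppm tolerance the accepted absolute difference grows with the magnitude of the values, which is why the same tolerance number behaves very differently under "Dalton" and "ppm".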

Example: calculate scores between two lists of spectra and inspect the score matrix.

import numpy as np
from matchms import Spectrum
from matchms.similarity import MetadataMatch

spectrum_1 = Spectrum(
    mz=np.array([]),
    intensities=np.array([]),
    metadata={"instrument_type": "orbitrap", "id": 1},
)
spectrum_2 = Spectrum(
    mz=np.array([]),
    intensities=np.array([]),
    metadata={"instrument_type": "qtof", "id": 2},
)
spectrum_3 = Spectrum(
    mz=np.array([]),
    intensities=np.array([]),
    metadata={"instrument_type": "qtof", "id": 3},
)
spectrum_4 = Spectrum(
    mz=np.array([]),
    intensities=np.array([]),
    metadata={"instrument_type": "orbitrap", "id": 4},
)

spectra_1 = [spectrum_1, spectrum_2]
spectra_2 = [spectrum_3, spectrum_4]

similarity = MetadataMatch(field="instrument_type")
scores = similarity.matrix(spectra_1, spectra_2)

score_array = scores.to_array()

# Use distinct loop names so spectrum_1 and spectrum_2 are not overwritten.
for i, spectrum_a in enumerate(spectra_1):
    for j, spectrum_b in enumerate(spectra_2):
        print(
            f"Metadata match between {spectrum_a.get('id')} and "
            f"{spectrum_b.get('id')} is {bool(score_array[i, j])}"
        )

Should output

Metadata match between 1 and 3 is False
Metadata match between 1 and 4 is True
Metadata match between 2 and 3 is True
Metadata match between 2 and 4 is False

__init__(field: str, matching_type: str = 'equal_match', tolerance: float = 0.1, tolerance_type: str = 'Dalton')

Parameters:

field – Specify field name for metadata that should be compared.
matching_type – Specify how field entries should be matched. Can be one of ["equal_match", "difference"]. "equal_match": entries must be exactly equal (default). "difference": entries are considered a match if their numerical difference is less than or equal to tolerance.
tolerance – Specify the tolerance up to which two values are counted as a match (a difference less than or equal to tolerance matches). This only applies to numerical values.
tolerance_type – Choose between fixed tolerance in Dalton / raw units ("Dalton") or a relative difference in ppm ("ppm"). This only applies when matching_type="difference".

property is_structured_score: bool

Return True if this similarity uses a structured score dtype.

keep_score(score) → bool

Return whether a score should be retained in sparse outputs.

This defines the default sparse retention behavior. Users can override it per call via score_filter=....

Default behavior:

scalar score: keep if score != 0
structured score: keep if all fields are non-zero
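The default retention rule can be illustrated with a small stand-alone sketch. The function keep_score_sketch is hypothetical, and a plain dict is used here as a stand-in for a structured score with named fields; the library's own structured scores may be represented differently.

```python
def keep_score_sketch(score):
    """Hypothetical stand-in for the documented default retention rule."""
    if isinstance(score, dict):
        # Structured score (dict stand-in): keep only if every field is non-zero.
        return all(value != 0 for value in score.values())
    # Scalar score: keep if it is non-zero.
    return bool(score != 0)


print(keep_score_sketch(True))                            # True
print(keep_score_sketch(0.0))                             # False
print(keep_score_sketch({"score": 0.8, "matches": 3}))    # True
print(keep_score_sketch({"score": 0.8, "matches": 0}))    # False
```

For a boolean score such as the one used by MetadataMatch, this rule means that only matching pairs would be stored in sparse output.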

matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True) → Scores

Compare metadata entries between all spectra in spectra_1 and spectra_2.

Parameters:

spectra_1 – First collection of input spectra.
spectra_2 – Second collection of input spectra. If None, compare spectra_1 against itself.
score_fields – Requested score fields. Only ("score",) is supported.
progress_bar – Included for API compatibility. Not used here because this optimized implementation does not iterate pairwise in Python.
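One way to avoid a pairwise Python loop for exact matching is to index one collection by value and look up each entry of the other. The sketch below illustrates that idea only; it is not necessarily how matchms implements it. The function name equal_match_pairs is hypothetical, and skipping missing (None) entries is an assumption.

```python
from collections import defaultdict


def equal_match_pairs(values_1, values_2):
    """Return (row, col) index pairs whose metadata values are equal.

    Hypothetical helper: values must be hashable, and None (missing
    metadata) is assumed never to match anything.
    """
    # Index the second collection: value -> list of column indices.
    columns_by_value = defaultdict(list)
    for j, value in enumerate(values_2):
        if value is not None:
            columns_by_value[value].append(j)

    # One dictionary lookup per row instead of one comparison per pair.
    pairs = []
    for i, value in enumerate(values_1):
        for j in columns_by_value.get(value, ()):
            pairs.append((i, j))
    return pairs


values_1 = ["orbitrap", "qtof"]
values_2 = ["qtof", "orbitrap"]
print(equal_match_pairs(values_1, values_2))  # [(0, 1), (1, 0)]
```

The work is proportional to the number of entries plus the number of matches, rather than to the full number of pairs.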

pair(spectrum_1: Spectrum, spectrum_2: Spectrum)

Compare metadata entries between two spectra.

Parameters:

spectrum_1 – First spectrum.
spectrum_2 – Second spectrum.

score_datatype: alias of bool

sparse_matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, idx_row=None, idx_col=None, score_fields: Sequence[str] | None = None, score_filter: Callable[[ndarray], bool] | None = None, progress_bar: bool = True) → Scores

Compare metadata entries and return sparse scores.

This method uses optimized metadata matching when no explicit indices are provided. If explicit idx_row and idx_col are given, it falls back to the generic sparse implementation from BaseSimilarityWithSparse.
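When explicit indices are supplied, only the listed pairs need to be scored. The following sketch shows that idea for a numerical field with an absolute tolerance. It assumes that idx_row and idx_col are parallel sequences in which position k names the pair (idx_row[k], idx_col[k]); the function name matches_for_index_pairs is hypothetical and does not reproduce the library's generic sparse implementation.

```python
def matches_for_index_pairs(values_1, values_2, idx_row, idx_col, tolerance=0.1):
    """Score only the requested (row, col) pairs and keep the matches.

    Hypothetical helper; assumes idx_row and idx_col are parallel
    sequences of pair coordinates.
    """
    kept = []
    for i, j in zip(idx_row, idx_col):
        # Absolute difference check, as for tolerance_type="Dalton".
        if abs(values_1[i] - values_2[j]) <= tolerance:
            kept.append((i, j))
    return kept


masses_1 = [100.0, 250.2]
masses_2 = [100.05, 250.0, 300.0]
# Only three of the six possible pairs are evaluated.
print(matches_for_index_pairs(masses_1, masses_2, [0, 0, 1], [0, 2, 1]))  # [(0, 0)]
```

Restricting the computation to candidate pairs is useful when an earlier, cheaper filter has already ruled out most combinations.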

to_dict() → dict: Return a dictionary representation of the similarity function.