matchms.similarity.CosineGreedy module

class matchms.similarity.CosineGreedy.CosineGreedy(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0, noise_cutoff: float = 0.01)[source]

Bases: BaseSimilarityWithSparse

Calculate ‘cosine similarity score’ between two spectra.

The cosine score aims at quantifying the similarity between two mass spectra. The score is calculated by finding best possible matches between peaks of two spectra. Two peaks are considered a potential match if their m/z ratios lie within the given ‘tolerance’. The underlying peak assignment problem is here solved in a ‘greedy’ way. This can perform notably faster, but does occasionally deviate slightly from a fully correct solution (as with the Hungarian algorithm, see CosineHungarian). In practice this will rarely affect similarity scores notably, in particular for smaller tolerances.

For example

import numpy as np
from matchms import Spectrum
from matchms.similarity import CosineGreedy

spectrum_1 = Spectrum(mz=np.array([100, 150, 200.]),
                     intensities=np.array([0.7, 0.2, 0.1]),
                     metadata={"precursor_mz": 200.0})
spectrum_2 = Spectrum(mz=np.array([100, 140, 190.]),
                 intensities=np.array([0.4, 0.2, 0.1]),
                 metadata={"precursor_mz": 190.0})

# Use factory to construct a similarity function
cosine_greedy = CosineGreedy(tolerance=0.2)

score = cosine_greedy.pair(spectrum_1, spectrum_2)

print(f"Cosine score is {score['score']:.2f} with {score['matches']} matched peaks")

Should output

Cosine score is 0.83 with 1 matched peaks

Unlike in matchms < 1.0, this method also applies a noise filter by default, which removes peaks with intensity below a certain cutoff. This is typically highly beneficial for the performance of the greedy algorithm, and for most applications the results are very similar to the exact assignment variant. If you want to disable this noise filtering, you can set noise_cutoff to 0 or None.

__init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0, noise_cutoff: float = 0.01)[source]

Parameters:

tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.
mz_power – The power to raise m/z to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.
noise_cutoff – Minimum relative intensity for a peak to be considered. Default is 0.01.

property is_structured_score: bool: Return True if this similarity uses a structured score dtype.

keep_score(score) → bool

Return whether a score should be retained in sparse outputs.

This defines the default sparse retention behavior. Users can override it per call via score_filter=....

Default behavior: - scalar score: keep if score != 0 - structured score: keep if all fields are non-zero

matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True)

Calculate a dense similarity matrix.

Parameters:

spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself. For commutative similarities this automatically uses a symmetric optimization.
score_fields – Score fields to return. - None means return all available fields. - For scalar scores, only ("score",) is valid. - For structured scores, this can be a subset such as ("score",).
progress_bar – When True, show a progress bar. Default is True.

Returns:

Dense score result wrapped in a Scores container.

Return type:

Scores

pair(spectrum_1: Spectrum, spectrum_2: Spectrum) → tuple[float, int][source]

Calculate cosine score between two spectra.

Parameters:

spectrum_1 – First spectrum.
spectrum_2 – Second spectrum.

Returns:

Tuple with cosine score and number of matched peaks. The score can be access as score[“score”] and the number of matched peaks as score[“matches”].

Return type:

Score

sparse_matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, idx_row: ArrayLike | None = None, idx_col: ArrayLike | None = None, score_fields: Sequence[str] | None = None, score_filter: Callable[[ndarray], bool] | None = None, progress_bar: bool = True)

Calculate sparse similarity results.

Filtering is applied to the full score before score field projection.

Parameters:

spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself.
idx_row – Row indices of pairs to compute. If None and idx_col is also None, all pairwise comparisons are considered and only retained scores are stored.
idx_col – Column indices of pairs to compute. Must have the same shape as idx_row.
score_fields – Score fields to return. - None means return all available fields. - For scalar scores, only ("score",) is valid. - For structured scores, this can be a subset such as ("score",).
score_filter – Optional callable receiving the full score and returning whether it should be retained. If None, keep_score() is used.
progress_bar – When True, show a progress bar.

Returns:

Sparse score result wrapped in a Scores container.

Return type:

Scores

to_dict() → dict: Return a dictionary representation of the similarity function.