matchms.similarity package

Functions for computing spectra similarities

Matchms provides similarity measures for comparing mass spectra and their metadata. The recommended high-level entry points for peak-based cosine scoring are Cosine and ModifiedCosine.

These classes choose an appropriate implementation internally and are intended as the default choice for most workflows. Users who need a specific algorithmic variant can select one of the explicit implementations directly, for example CosineLinear, CosineFlash, CosineGreedy, or CosineHungarian.

Available similarity functions include:

cosine-based peak similarity (Cosine, CosineLinear, CosineFlash, CosineGreedy, CosineHungarian)
modified cosine similarity for spectra with shifted fragment peaks (ModifiedCosine, CosineFlash with matching_mode="hybrid", ModifiedCosineGreedy, ModifiedCosineHungarian)
neutral-loss-based peak similarity (NeutralLossesCosine)
fast embedding-based or approximate similarity methods (BinnedEmbeddingSimilarity, CosineBlink, FlashEntropy)
simple precursor or parent-mass matching (PrecursorMzMatch, ParentMassMatch)
molecular-structure similarity based on metadata such as SMILES or InChIKey (FingerprintSimilarity)
metadata-based matching for user-defined fields, for example exact matches in instrument_type or numerical matches within a tolerance for fields such as retention_time or collision_energy (MetadataMatch)

Custom similarity measures can be added by subclassing BaseSimilarity. Similarities that also provide sparse score computation should subclass BaseSimilarityWithSparse.

External similarity measures, such as Spec2Vec, can also be used together with matchms workflows.

class matchms.similarity.BinnedEmbeddingSimilarity(similarity: str = 'cosine', max_mz: float = 1005, bin_width: float = 1, intensity_power: float = 1)[source]

Bases: BaseEmbeddingSimilarity

Compare spectra by cosine/euclidean similarity of binned intensities.

Spectra are converted to fixed-length vectors by summing intensities in equally spaced m/z bins. Each vector is normalized to its maximum bin intensity when that maximum is positive. Empty spectra, spectra without peaks in the configured m/z range, and spectra with only zero intensities produce a zero vector instead of NaNs.

Parameters:

similarity – Similarity measure used for comparing embeddings. Supported values are "cosine" and "euclidean".
max_mz – Maximum m/z value to include. Values outside [0, max_mz] are ignored.
bin_width – Width of each m/z bin.
intensity_power – Power applied to peak intensities before binning.
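The binning scheme described above can be sketched in pure Python. This is a minimal illustration, not the matchms implementation: the helper names binned_embedding and cosine_similarity, the bin count ceil(max_mz / bin_width), the placement of a peak exactly at max_mz, and the score of 0.0 for zero vectors are all assumptions made for this example.

```python
import math


def binned_embedding(peaks, max_mz=1005.0, bin_width=1.0, intensity_power=1.0):
    """Sum (m/z, intensity) peaks into equally spaced bins, then max-normalize."""
    n_bins = math.ceil(max_mz / bin_width)  # assumed bin count
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        if not 0 <= mz <= max_mz:
            continue  # values outside [0, max_mz] are ignored
        index = min(int(mz // bin_width), n_bins - 1)  # assumed edge handling
        vector[index] += intensity ** intensity_power
    top = max(vector)
    if top > 0:
        vector = [value / top for value in vector]
    return vector  # remains all-zero for empty or zero-intensity spectra


def cosine_similarity(vector_1, vector_2):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(vector_1, vector_2))
    norm_1 = math.sqrt(sum(a * a for a in vector_1))
    norm_2 = math.sqrt(sum(b * b for b in vector_2))
    if norm_1 == 0 or norm_2 == 0:
        return 0.0  # assumed convention, avoids NaN
    return dot / (norm_1 * norm_2)


embedding = binned_embedding([(100.2, 2.0), (100.7, 2.0), (250.0, 1.0)])
print(embedding[100], embedding[250])
```

Both peaks near m/z 100 fall into the same bin, so their intensities are summed before normalization.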

__init__(similarity: str = 'cosine', max_mz: float = 1005, bin_width: float = 1, intensity_power: float = 1)[source]

build_ann_index(reference_spectra: Iterable[Spectrum] | None = None, embeddings_path: str | Path | None = None, k: int = 100, index_backend: str = 'pynndescent', **index_kwargs) → Any

Build an ANN index for the input spectra.

Parameters:

reference_spectra (Optional[Iterable[SpectrumType]]) – List of reference spectra to build the ANN index for.
embeddings_path (Optional[Union[str, Path]]) – If embeddings are already computed, provide the path to the numpy file.
k (int, optional) – Number of nearest neighbors to use for the ANN index.
index_backend (str, optional) – Backend to use for ANN index. Currently only "pynndescent" is supported.
**index_kwargs – Additional keyword arguments passed to the index constructor.

Returns:

The constructed ANN index.

Return type:

Any

Raises:

ImportError – If pynndescent is not installed.
ValueError – If an unsupported index_backend is specified.

compute_embeddings(spectra: Iterable[Spectrum]) → ndarray[source]

Convert spectra into binned embeddings.

Parameters:: spectra (Iterable[SpectrumType]) – The spectra to convert into embeddings.
Returns:: Array of shape (n_spectra, n_bins) containing the binned embeddings.
Return type:: np.ndarray

compute_similarity_matrix_from_embeddings(embeddings_1: ndarray, embeddings_2: ndarray | None = None) → ndarray

Compute a raw NumPy similarity matrix from precomputed embeddings.

This helper keeps the old raw-array use case available without changing the public matrix() contract inherited from BaseSimilarity.

get_anns(query_spectra: Iterable[Spectrum] | ndarray, k: int = 100) → tuple[ndarray, ndarray]

Get approximate nearest neighbors for input spectra.

Parameters:

query_spectra (Union[Iterable[SpectrumType], np.ndarray]) – Query spectra or their embeddings.
k (int, optional) – Number of nearest neighbors to return.

Returns:

Neighbor indices and similarity scores.

Return type:

Tuple[np.ndarray, np.ndarray]

Raises:

ValueError – If no index has been built, or if k is larger than the k used to build the index.

get_embeddings(spectra: Iterable[Spectrum] | None = None, npy_path: str | Path | None = None) → ndarray

Get embeddings either by computing them or loading from disk.

Parameters:

spectra – List of spectra to compute embeddings for.
npy_path – Path to load embeddings from or save them to. If provided and the file exists, embeddings are loaded from it; otherwise they are computed and saved to this path.

Returns:

Embeddings array.

Return type:

np.ndarray

Raises:

ValueError – If neither spectra nor npy_path is provided.

get_index_anns() → tuple[ndarray, ndarray]

Get nearest neighbors for all points in the index.

Returns:: Neighbor indices and similarity scores.
Return type:: Tuple[np.ndarray, np.ndarray]
Raises:: ValueError – If unsupported index backend is used.

property is_structured_score: bool: Return True if this similarity uses a structured score dtype.

load_ann_index(path: str | Path) → Any

Load an ANN index from disk.

Parameters:: path (Union[str, Path]) – Path to load the index from.
Returns:: The loaded ANN index.
Return type:: Any
Raises:: ValueError – If loaded index similarity metric doesn’t match current metric.

static load_embeddings(npy_path: str | Path) → ndarray

Load embeddings from a numpy file.

Parameters:: npy_path (Union[str, Path]) – Path to the numpy file.
Returns:: Embeddings array.
Return type:: np.ndarray
Raises:: ValueError – If loaded array is not 2D.

matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True) → Scores

Compute similarity matrix between spectra_1 and spectra_2.

Parameters:

spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself.
score_fields – Requested score fields. Embedding similarities expose only ("score",).
progress_bar – Included for API compatibility. Embeddings are computed in batch and this implementation currently does not display a progress bar.

Returns:

Similarity matrix.

Return type:

Scores

Raises:

ValueError – If array_type is not "numpy" or is_symmetric is False.

property n_bins: int: Number of bins used for each embedding vector.

pair(spectrum_1: Spectrum, spectrum_2: Spectrum) → float

Compute similarity between a pair of spectra.

Parameters:

spectrum_1 (SpectrumType) – Reference spectrum.
spectrum_2 (SpectrumType) – Query spectrum.

save_ann_index(path: str | Path) → None

Save the ANN index to disk.

Parameters:: path (Union[str, Path]) – Path to save the index to.
Raises:: ValueError – If no index exists to save.

score_datatype: alias of float64

sparse_matrix(spectra_1, spectra_2=None, idx_row=None, idx_col=None, score_fields=None, score_filter=None, progress_bar: bool = True): Sparse score computation is not available for this similarity.

static store_embeddings(npy_path: str | Path, embeddings: ndarray) → None

Store embeddings in a numpy file.

Parameters:

npy_path (Union[str, Path]) – Path to save the embeddings to.
embeddings (np.ndarray) – Embeddings array to store.

to_dict() → dict: Return a dictionary representation of the similarity function.

class matchms.similarity.Cosine(tolerance: float = 0.1, intensity_power: float = 1.0, use_hungarian: bool = False, noise_cutoff: float = 0.01)[source]

Bases: BaseSimilarity

Calculate Cosine scores between mass spectra.

This is the central Cosine class of matchms. The Cosine score aims at quantifying the similarity between two mass spectra. Two peaks are considered a potential match if their m/z ratios lie within the given tolerance.

Matchms provides various implementations of the Cosine score which are combined here in what we believe to be the typical best choice for most users.

By default, the parameter use_hungarian is set to False, which means that the greedy algorithm is used to find the best matches. This is typically faster than the Hungarian algorithm, and for most applications the results are very similar. If you need the exact optimal solution, you can set use_hungarian to True, which will use the Hungarian algorithm to find the best matches.

__init__(tolerance: float = 0.1, intensity_power: float = 1.0, use_hungarian: bool = False, noise_cutoff: float = 0.01)[source]

Initialize cosine score class.

Parameters:

tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.
use_hungarian – Whether to use the Hungarian algorithm to find the best matches. The default is False, which means that the greedy algorithm is used to find the best matches. The greedy algorithm is typically faster than the Hungarian algorithm, and for most applications the results are very similar.
noise_cutoff – Minimum relative intensity for a peak to be considered. Default is 0.01. Will only be used if use_hungarian is False.

property is_structured_score: bool: Return True if this similarity uses a structured score dtype.

matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True, n_jobs: int = -1)[source]

Calculate matrix of Cosine scores.

Parameters:

spectra_1 – First collection of input spectra.
spectra_2 – Second collection of input spectra. If None, compare spectra_1 against itself.
score_fields – Requested score fields. Only ("score",) is supported.
progress_bar – When True, show a progress bar.
n_jobs – Number of parallel jobs to run. Default is -1, which means that all available CPUs minus one will be used.

Returns:

Dense score matrix as a Scores object.

Return type:

Scores

pair(spectrum_1: Spectrum, spectrum_2: Spectrum) → tuple[float, int][source]: Calculate cosine score between two spectra.

sparse_matrix(spectra_1, spectra_2=None, idx_row=None, idx_col=None, score_fields=None, score_filter=None, progress_bar: bool = True): Sparse score computation is not available for this similarity.

to_dict() → dict: Return a dictionary representation of the similarity function.

class matchms.similarity.CosineBlink(tolerance: float = 0.01, bin_width: float = 0.001, mz_power: float = 0.0, intensity_power: float = 1.0, clip_to_one: bool = True, use_numba: bool = True, prefilter: bool = True, min_relative_intensity: float = 0.01, crop_above_precursor: bool = True, remove_zero_intensities: bool = True, top_k: int | None = None, batch_size: int = 1024, sparse_score_min: float = 0.0)[source]

Bases: BaseSimilarity

BLINK-style approximate cosine similarity for mass spectra with fast .pair() and .matrix(). This score is implemented based on the method BLINK, proposed by Harwood et al. (2023, https://www.nature.com/articles/s41598-023-40496-9).

Integer binning with bin_width (Da); tolerance window is ± floor(tolerance/bin_width) bins.
Per-spectrum L2 normalization (after optional mz/intensity weighting).
Blur only one side (spectra_2 in .matrix(), smaller spectrum in .pair()).

Parameters:

tolerance – True m/z tolerance (Da). Peaks within +/- tolerance are considered matches. Default 0.01.
bin_width – Discretization width (Da). Default 0.001 (1 mDa). Effective radius R=floor(tolerance/bin_width).
mz_power – Power for mz weighting (intensity *= mz**mz_power). Default 0.0.
intensity_power – Power for intensity weighting before normalization. Default 1.0 (set 0.5 for sqrt scaling).
clip_to_one – Clip score to [0,1]. Default True.
use_numba (bool) – Use numba-accelerated pairwise kernel when available. Default True.
prefilter (bool) – Apply BLINK-like pre-filtering (remove <1% base peak, > precursor m/z, zeros). Default True.
min_relative_intensity (float) – Relative base-peak threshold for prefilter. Default 0.01 (1%).
crop_above_precursor (bool) – Drop fragments > precursor m/z if available in metadata. Default True.
remove_zero_intensities (bool) – Remove peaks with intensity <= 0. Default True.
top_k (Optional[int]) – Keep only top-K most intense fragments after other filters (per spectrum). Default None.
batch_size (int) – Number of query spectra per batch in .matrix(). Default 1024.
sparse_score_min (float) – When array_type='sparse', drop scores < sparse_score_min. Default 0.0.
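The three ingredients listed above (integer binning, per-spectrum L2 normalization, and blurring only one side) can be illustrated with a small pure-Python sketch. It is not the matchms or BLINK implementation: the function names, the use of round() to assign bins, and the dictionary-based sparse vectors are assumptions chosen to keep the example short, and prefiltering and m/z weighting are left out.

```python
import math
from collections import defaultdict


def normalized_bins(peaks, bin_width):
    """Map (m/z, intensity) peaks to integer bins and L2-normalize."""
    bins = defaultdict(float)
    for mz, intensity in peaks:
        bins[round(mz / bin_width)] += intensity  # rounding is an assumption
    norm = math.sqrt(sum(value * value for value in bins.values()))
    if norm == 0:
        return {}
    return {b: value / norm for b, value in bins.items()}


def blink_like_score(peaks_1, peaks_2, tolerance=0.01, bin_width=0.001):
    """Approximate cosine: blur spectrum 2 by +/- R bins, then take a dot product."""
    radius = math.floor(tolerance / bin_width)  # R = floor(tolerance / bin_width)
    bins_1 = normalized_bins(peaks_1, bin_width)
    bins_2 = normalized_bins(peaks_2, bin_width)
    blurred = defaultdict(float)
    for b, value in bins_2.items():
        for offset in range(-radius, radius + 1):
            blurred[b + offset] += value
    score = sum(value * blurred.get(b, 0.0) for b, value in bins_1.items())
    return min(score, 1.0)  # corresponds to clip_to_one=True


spectrum_a = [(100.0, 1.0), (200.0, 0.5)]
spectrum_b = [(100.003, 1.0), (200.003, 0.5)]
print(blink_like_score(spectrum_a, spectrum_b))
```

Because a blurred peak can overlap several peaks of the other spectrum, the raw dot product may exceed 1, which is why the score is approximate and clipped.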

__init__(tolerance: float = 0.01, bin_width: float = 0.001, mz_power: float = 0.0, intensity_power: float = 1.0, clip_to_one: bool = True, use_numba: bool = True, prefilter: bool = True, min_relative_intensity: float = 0.01, crop_above_precursor: bool = True, remove_zero_intensities: bool = True, top_k: int | None = None, batch_size: int = 1024, sparse_score_min: float = 0.0)[source]

property is_structured_score: bool: Return True if this similarity uses a structured score dtype.

matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True) → Scores[source]

All-vs-all BLINK-style cosine scores.

Implementation:

Build a global dense bin axis in integer bins from min to max across refs+queries (rows ~ (max_bin - min_bin + 1)), which keeps matrices sparse.
Build a CSR intensity matrix for refs (rows=bins, cols=ref spectra) after per-spectrum L2 normalization.
For spectra_2, build per-batch blurred CSR by expanding each nonzero to its ±R neighbors.
Multiply: scores_batch = (I_ref.T @ I_qry_blur), accumulate into the final output.

Parameters:

spectra_1 – List of input spectra.
spectra_2 – List of input spectra.
score_fields – Requested score fields.

Returns:

Dense Scores object.

Return type:

Scores

pair(spectrum_1: Spectrum, spectrum_2: Spectrum) → tuple[float, int][source]

Calculate BLINK-style cosine between two spectra.

Parameters:

spectrum_1 – Single reference spectrum.
spectrum_2 – Single query spectrum.

score_datatype: alias of float32

sparse_matrix(spectra_1, spectra_2=None, idx_row=None, idx_col=None, score_fields=None, score_filter=None, progress_bar: bool = True): Sparse score computation is not available for this similarity.

to_dict() → dict: Return a dictionary representation of the similarity function.

class matchms.similarity.CosineFlash(*args, dtype: dtype = <class 'numpy.float64'>, **kwargs)[source]

Bases: _BaseFlashSimilarity

Flash Cosine similarity following the original Flash Entropy (Li & Fiehn, 2023) with a fast .matrix() that builds a library-wide index over ‘queries’ and streams all ‘references’ through it. This corresponds to the “CosineGreedy” scoring logic but with the same fast Flash path as Flash Entropy.

Key options:

matching_mode: "fragment", "neutral_loss", or "hybrid" (fragment-priority).
tolerance in Da or symmetric ppm (use_ppm=True).
cleanup: remove precursor & > (precursor_mz - 1.6), 1% noise removal, entropy weighting, normalize ∑I’ = 0.5, optional within-peak merge.
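Part of the cleanup listed above can be sketched in pure Python as follows. This is an illustration under stated assumptions rather than the CosineFlash code path: the function name flash_style_cleanup is hypothetical, the order of the steps is assumed, and entropy weighting and within-peak merging are deliberately omitted.

```python
def flash_style_cleanup(peaks, precursor_mz, precursor_window=1.6,
                        noise_cutoff=0.01, normalize_to_half=True):
    """Apply a simplified Flash-style cleanup to (m/z, intensity) peaks."""
    # Drop the precursor and everything above (precursor_mz - precursor_window).
    kept = [(mz, i) for mz, i in peaks if mz <= precursor_mz - precursor_window]
    # Drop peaks below noise_cutoff times the most intense remaining peak.
    if kept and noise_cutoff > 0:
        threshold = noise_cutoff * max(i for _, i in kept)
        kept = [(mz, i) for mz, i in kept if i >= threshold]
    # Rescale so that the intensities sum to 0.5.
    total = sum(i for _, i in kept)
    if normalize_to_half and total > 0:
        kept = [(mz, 0.5 * i / total) for mz, i in kept]
    return kept


peaks = [(50.0, 100.0), (80.0, 100.0), (120.0, 0.5), (199.5, 40.0)]
print(flash_style_cleanup(peaks, precursor_mz=200.0))
```

In this example the peak at m/z 199.5 is removed by the precursor rule and the peak at m/z 120 by the 1% noise rule, leaving two peaks with intensity 0.25 each.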

Notes:

.pair() works but is not the fast path. Use .matrix().
For identity-search behavior, pass identity_precursor_tolerance (Da or ppm).

Parameters:

matching_mode – Matching mode: "fragment", "neutral_loss", or "hybrid" (default is "fragment").
tolerance – Matching tolerance in Da or ppm (use_ppm=True). Default is 0.02.
use_ppm – If True, interpret tolerance as parts-per-million. Default is False.
intensity_power – The power to raise intensity to in the cosine function. The default is 1 (no weighting).
remove_precursor – If True, remove precursor peak and peaks within precursor_window. Default is False.
precursor_window – If remove_precursor is True, remove peaks within this window around the precursor m/z. Default is 1.6 Da (as suggested by Li & Fiehn (2023)).
noise_cutoff – If > 0, remove peaks with intensities below this fraction of the maximum intensity. Default is 0.01 (1%).
normalize_to_half – If True, normalize intensities such that the sum of intensities is 0.5. Default is False.
merge_within – If > 0, merge peaks within this distance (in Da) to a single peak. Default is 0.
identity_precursor_tolerance – If not None, enforce identity search behavior by requiring the precursor m/z of the query to be within this tolerance of the reference precursor m/z.
identity_use_ppm – If True, interpret identity_precursor_tolerance as ppm. Default is False.
dtype – Data type for the output scores. Default is np.float64, which properly accounts for the highest-resolution MS/MS data (even far beyond current MS/MS possibilities). To save memory, np.float32 can be used instead, which is sufficient for peak resolutions up to about 8,000,000.

__init__(*args, dtype: dtype = <class 'numpy.float64'>, **kwargs)[source]

property is_structured_score: bool: Return True if this similarity uses a structured score dtype.

matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True, n_jobs: int = -1)[source]

Calculate matrix of Flash Cosine scores.

Parameters:

spectra_1 – First collection of input spectra.
spectra_2 – Second collection of input spectra. If None, compare spectra_1 against itself.
score_fields – Requested score fields. Only ("score",) is supported.
progress_bar – When True, show a progress bar.
n_jobs – Number of parallel jobs to run. Default is -1, which means that all available CPUs minus one will be used.

Returns:

Dense score matrix as a Scores object.

Return type:

Scores

pair(spectrum_1: Spectrum, spectrum_2: Spectrum) → ndarray[source]

Calculate the similarity for one pair of spectra.

Parameters:

spectrum_1 – First spectrum.
spectrum_2 – Second spectrum.

Returns:

Similarity result for one pair. The returned value should be compatible with self.score_datatype.

Return type:

score

Examples

Scalar score:: return np.asarray(score, dtype=self.score_datatype)
Structured score:: return np.asarray((score, matches), dtype=self.score_datatype)

score_datatype: alias of float64

sparse_matrix(spectra_1, spectra_2=None, idx_row=None, idx_col=None, score_fields=None, score_filter=None, progress_bar: bool = True): Sparse score computation is not available for this similarity.

to_dict() → dict: Return a dictionary representation of the similarity function.

class matchms.similarity.CosineGreedy(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0, noise_cutoff: float = 0.01)[source]

Bases: BaseSimilarityWithSparse

Calculate ‘cosine similarity score’ between two spectra.

The cosine score aims at quantifying the similarity between two mass spectra. The score is calculated by finding the best possible matches between peaks of two spectra. Two peaks are considered a potential match if their m/z ratios lie within the given tolerance. The underlying peak assignment problem is solved here in a greedy way. This can perform notably faster, but occasionally deviates slightly from the optimal assignment (as found by the Hungarian algorithm, see CosineHungarian). In practice this rarely changes similarity scores notably, in particular for smaller tolerances.

For example

import numpy as np
from matchms import Spectrum
from matchms.similarity import CosineGreedy

spectrum_1 = Spectrum(mz=np.array([100, 150, 200.]),
                     intensities=np.array([0.7, 0.2, 0.1]),
                     metadata={"precursor_mz": 200.0})
spectrum_2 = Spectrum(mz=np.array([100, 140, 190.]),
                 intensities=np.array([0.4, 0.2, 0.1]),
                 metadata={"precursor_mz": 190.0})

# Construct the similarity function
cosine_greedy = CosineGreedy(tolerance=0.2)

score = cosine_greedy.pair(spectrum_1, spectrum_2)

print(f"Cosine score is {score['score']:.2f} with {score['matches']} matched peaks")

Should output

Cosine score is 0.83 with 1 matched peaks

Unlike in matchms < 1.0, this method also applies a noise filter by default, which removes peaks with intensity below a certain cutoff. This is typically highly beneficial for the performance of the greedy algorithm, and for most applications the results are very similar to the exact assignment variant. If you want to disable this noise filtering, you can set noise_cutoff to 0 or None.
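To make the greedy assignment concrete, the following pure-Python sketch scores two peak lists by repeatedly taking the largest remaining intensity product among peaks within tolerance. It is a simplified illustration, not the CosineGreedy implementation: the name greedy_cosine is hypothetical, and m/z weighting and the noise filter are left out.

```python
import math


def greedy_cosine(peaks_1, peaks_2, tolerance=0.1):
    """Greedy cosine score for two lists of (m/z, intensity) peaks."""
    # Collect every candidate pair whose m/z values lie within tolerance.
    candidates = []
    for i, (mz_1, int_1) in enumerate(peaks_1):
        for j, (mz_2, int_2) in enumerate(peaks_2):
            if abs(mz_1 - mz_2) <= tolerance:
                candidates.append((int_1 * int_2, i, j))
    # Take the largest products first; each peak may be used only once.
    candidates.sort(key=lambda candidate: candidate[0], reverse=True)
    used_1, used_2 = set(), set()
    total, matches = 0.0, 0
    for product, i, j in candidates:
        if i in used_1 or j in used_2:
            continue
        used_1.add(i)
        used_2.add(j)
        total += product
        matches += 1
    norm_1 = math.sqrt(sum(intensity ** 2 for _, intensity in peaks_1))
    norm_2 = math.sqrt(sum(intensity ** 2 for _, intensity in peaks_2))
    if norm_1 == 0 or norm_2 == 0:
        return 0.0, 0
    return total / (norm_1 * norm_2), matches


peaks_a = [(100.0, 0.7), (150.0, 0.2), (200.0, 0.1)]
peaks_b = [(100.0, 0.4), (140.0, 0.2), (190.0, 0.1)]
score, matches = greedy_cosine(peaks_a, peaks_b, tolerance=0.2)
print(f"{score:.2f} with {matches} matched peaks")  # 0.83 with 1 matched peaks
```

The peak lists mirror the spectra of the example above, so the sketch reproduces the same score and match count.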

__init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0, noise_cutoff: float = 0.01)[source]

Parameters:

tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.
mz_power – The power to raise m/z to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.
noise_cutoff – Minimum relative intensity for a peak to be considered. Default is 0.01.

property is_structured_score: bool: Return True if this similarity uses a structured score dtype.

keep_score(score) → bool

Return whether a score should be retained in sparse outputs.

This defines the default sparse retention behavior. Users can override it per call via score_filter=....

Default behavior: for a scalar score, keep if score != 0; for a structured score, keep if all fields are non-zero.
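The difference between the default retention rule and a user-supplied score_filter can be shown with a small stand-in. In matchms the callable receives a NumPy structured score; plain dictionaries are used here only so the sketch runs without dependencies, and the names default_keep and strict_filter as well as the thresholds are hypothetical.

```python
def default_keep(score):
    """Mimic the documented default for structured scores: all fields non-zero."""
    return all(value != 0 for value in score.values())


def strict_filter(score):
    """Example of a custom rule: a high score backed by several matched peaks."""
    return score["score"] >= 0.7 and score["matches"] >= 3


scores = [
    {"score": 0.83, "matches": 1},
    {"score": 0.0, "matches": 0},
    {"score": 0.91, "matches": 5},
]
kept_default = [s for s in scores if default_keep(s)]
kept_strict = [s for s in scores if strict_filter(s)]
print(len(kept_default), len(kept_strict))
```

A function like strict_filter, written against field access such as score["score"], is the kind of callable that would be passed as score_filter=... to sparse_matrix().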

matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True)

Calculate a dense similarity matrix.

Parameters:

spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself. For commutative similarities this automatically uses a symmetric optimization.
score_fields – Score fields to return. None means return all available fields. For scalar scores, only ("score",) is valid. For structured scores, this can be a subset such as ("score",).
progress_bar – When True, show a progress bar. Default is True.

Returns:

Dense score result wrapped in a Scores container.

Return type:

Scores

pair(spectrum_1: Spectrum, spectrum_2: Spectrum) → tuple[float, int][source]

Calculate cosine score between two spectra.

Parameters:

spectrum_1 – First spectrum.
spectrum_2 – Second spectrum.

Returns:

Tuple with cosine score and number of matched peaks. The score can be accessed as score["score"] and the number of matched peaks as score["matches"].

Return type:

Score

sparse_matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, idx_row: ArrayLike | None = None, idx_col: ArrayLike | None = None, score_fields: Sequence[str] | None = None, score_filter: Callable[[ndarray], bool] | None = None, progress_bar: bool = True)

Calculate sparse similarity results.

Filtering is applied to the full score before score field projection.

Parameters:

spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself.
idx_row – Row indices of pairs to compute. If None and idx_col is also None, all pairwise comparisons are considered and only retained scores are stored.
idx_col – Column indices of pairs to compute. Must have the same shape as idx_row.
score_fields – Score fields to return. None means return all available fields. For scalar scores, only ("score",) is valid. For structured scores, this can be a subset such as ("score",).
score_filter – Optional callable receiving the full score and returning whether it should be retained. If None, keep_score() is used.
progress_bar – When True, show a progress bar.

Returns:

Sparse score result wrapped in a Scores container.

Return type:

Scores

to_dict() → dict: Return a dictionary representation of the similarity function.

class matchms.similarity.CosineHungarian(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]

Bases: BaseSimilarityWithSparse

Calculate ‘cosine similarity score’ between two spectra using the Hungarian algorithm.

The cosine score quantifies the similarity between two mass spectra by finding the optimal one-to-one matching between their peaks. Two peaks are considered a potential match if their m/z ratios lie within the given tolerance.

The peak assignment is solved using the Hungarian algorithm (scipy.optimize.linear_sum_assignment), which finds the assignment that maximises the sum of intensity products. This is mathematically optimal but can be notably slower than the greedy heuristic in CosineGreedy.

__init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]

Parameters:

tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.
mz_power – The power to raise m/z to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.

property is_structured_score: bool: Return True if this similarity uses a structured score dtype.

keep_score(score) → bool

Return whether a score should be retained in sparse outputs.

This defines the default sparse retention behavior. Users can override it per call via score_filter=....

Default behavior: for a scalar score, keep if score != 0; for a structured score, keep if all fields are non-zero.

matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True)

Calculate a dense similarity matrix.

Parameters:

spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself. For commutative similarities this automatically uses a symmetric optimization.
score_fields – Score fields to return. None means return all available fields. For scalar scores, only ("score",) is valid. For structured scores, this can be a subset such as ("score",).
progress_bar – When True, show a progress bar. Default is True.

Returns:

Dense score result wrapped in a Scores container.

Return type:

Scores

pair(spectrum_1: Spectrum, spectrum_2: Spectrum) → tuple[float, int][source]

Calculate cosine score between two spectra.

Parameters:

spectrum_1 – Single spectrum.
spectrum_2 – Single spectrum.

Returns:

Tuple with cosine score and number of matched peaks.

sparse_matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, idx_row: ArrayLike | None = None, idx_col: ArrayLike | None = None, score_fields: Sequence[str] | None = None, score_filter: Callable[[ndarray], bool] | None = None, progress_bar: bool = True)

Calculate sparse similarity results.

Filtering is applied to the full score before score field projection.

Parameters:

spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself.
idx_row – Row indices of pairs to compute. If None and idx_col is also None, all pairwise comparisons are considered and only retained scores are stored.
idx_col – Column indices of pairs to compute. Must have the same shape as idx_row.
score_fields – Score fields to return. None means return all available fields. For scalar scores, only ("score",) is valid. For structured scores, this can be a subset such as ("score",).
score_filter – Optional callable receiving the full score and returning whether it should be retained. If None, keep_score() is used.
progress_bar – When True, show a progress bar.

Returns:

Sparse score result wrapped in a Scores container.

Return type:

Scores

to_dict() → dict: Return a dictionary representation of the similarity function.

class matchms.similarity.CosineLinear(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]

Bases: BaseSimilarity

Calculate ‘linear cosine similarity score’ between two spectra.

This implements the CosineLinear similarity from SIRIUS (BOECKER lab), which achieves O(n+m) time complexity by requiring spectra to be “well-separated” (consecutive peaks more than 2x tolerance apart). A preprocessing step (sirius_merge_close_peaks) enforces this invariant by greedily merging close peaks in descending intensity order.

For example

import numpy as np
from matchms import Spectrum
from matchms.similarity import CosineLinear

reference = Spectrum(mz=np.array([100, 150, 200.]),
                     intensities=np.array([0.7, 0.2, 0.1]))
query = Spectrum(mz=np.array([100, 140, 190.]),
                 intensities=np.array([0.4, 0.2, 0.1]))

cosine_linear = CosineLinear(tolerance=0.2)
score = cosine_linear.pair(reference, query)

print(f"CosineLinear score is {score['score']:.2f} with {score['matches']} matched peaks")

Should output

CosineLinear score is 0.83 with 1 matched peaks
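Why well-separated spectra allow a linear-time scan can be seen in a short sketch: if consecutive peaks in each spectrum are more than 2 * tolerance apart, every peak has at most one partner within tolerance, so a single two-pointer pass over the m/z-sorted peak lists finds all matches. This is an illustration only, not the matchms or SIRIUS code. The names are hypothetical, and the merging step is replaced by a check that raises an error.

```python
import math


def is_well_separated(peaks, tolerance):
    """True if consecutive m/z values are more than 2 * tolerance apart."""
    mzs = sorted(mz for mz, _ in peaks)
    return all(right - left > 2 * tolerance for left, right in zip(mzs, mzs[1:]))


def linear_cosine(peaks_1, peaks_2, tolerance=0.1):
    """Two-pointer cosine score for well-separated (m/z, intensity) peak lists."""
    if not (is_well_separated(peaks_1, tolerance)
            and is_well_separated(peaks_2, tolerance)):
        raise ValueError("spectra must be well-separated; merge close peaks first")
    # Sorting is only a safeguard; on m/z-sorted input the scan itself is O(n + m).
    peaks_1, peaks_2 = sorted(peaks_1), sorted(peaks_2)
    i = j = matches = 0
    total = 0.0
    while i < len(peaks_1) and j < len(peaks_2):
        (mz_1, int_1), (mz_2, int_2) = peaks_1[i], peaks_2[j]
        if abs(mz_1 - mz_2) <= tolerance:
            total += int_1 * int_2
            matches += 1
            i += 1
            j += 1
        elif mz_1 < mz_2:
            i += 1
        else:
            j += 1
    norm_1 = math.sqrt(sum(intensity ** 2 for _, intensity in peaks_1))
    norm_2 = math.sqrt(sum(intensity ** 2 for _, intensity in peaks_2))
    if norm_1 == 0 or norm_2 == 0:
        return 0.0, 0
    return total / (norm_1 * norm_2), matches


reference = [(100.0, 0.7), (150.0, 0.2), (200.0, 0.1)]
query = [(100.0, 0.4), (140.0, 0.2), (190.0, 0.1)]
score, matches = linear_cosine(reference, query, tolerance=0.2)
print(f"{score:.2f} with {matches} matched peaks")  # 0.83 with 1 matched peaks
```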

__init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]

Parameters:

tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1. Peaks closer than 2 * tolerance are merged before scoring.
mz_power – The power to raise m/z to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.

property is_structured_score: bool: Return True if this similarity uses a structured score dtype.

matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True)[source]

Optimized matrix computation that precomputes merged spectra.

Each spectrum is merged once (N+M calls to sirius_merge_close_peaks) instead of 2*N*M times in the naive double-loop approach.

pair(reference: Spectrum, query: Spectrum) → tuple[float, int][source]

Calculate linear cosine score between two spectra.

Parameters:

reference – Single reference spectrum.
query – Single query spectrum.

Returns:

Tuple with cosine score and number of matched peaks.

Return type:

Score

sparse_matrix(spectra_1, spectra_2=None, idx_row=None, idx_col=None, score_fields=None, score_filter=None, progress_bar: bool = True): Sparse score computation is not available for this similarity.

to_dict() → dict: Return a dictionary representation of the similarity function.

class matchms.similarity.FingerprintSimilarity(fingerprint_generator, similarity_measure: str = 'tanimoto', set_empty_scores: float | int | str = 'nan', ignore_stereochemistry: bool = False, count: bool = False, folded: bool = True, return_csr: bool = False, invalid_policy: str = 'raise', **fingerprint_config_kwargs)[source]

Bases: BaseSimilarity

Calculate similarity between molecules based on molecular fingerprints.

Fingerprints can either be provided explicitly as Fingerprints objects or computed internally from input spectra.

This class no longer expects fingerprints to be stored directly in spectrum metadata. Instead, it uses a Fingerprints container.

Currently supported similarity measures are:

"cosine"
"tanimoto"

Notes

Tanimoto is used in its generalized form and therefore also works for count/weighted fingerprints.
Fingerprints may be stored densely (NumPy) or sparsely (CSR).
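As an illustration of how a Tanimoto score extends from binary to count fingerprints, the sketch below uses one common generalization, a·b / (|a|² + |b|² − a·b), which reduces to the usual intersection-over-union for 0/1 vectors. Whether matchms uses this form or another generalization (for example the min/max form) is not stated here, so treat the function as an assumption; returning NaN for empty fingerprints mirrors the documented default of set_empty_scores.

```python
import math


def generalized_tanimoto(vector_1, vector_2):
    """Tanimoto similarity that also accepts count or weighted fingerprints."""
    dot = sum(a * b for a, b in zip(vector_1, vector_2))
    denominator = (sum(a * a for a in vector_1)
                   + sum(b * b for b in vector_2)
                   - dot)
    if denominator == 0:
        return math.nan  # both fingerprints are empty
    return dot / denominator


binary_score = generalized_tanimoto([1, 1, 0, 1], [1, 0, 0, 1])
count_score = generalized_tanimoto([2, 0, 1], [1, 0, 1])
print(f"{binary_score:.3f} {count_score:.3f}")
```

For the binary pair, two bits are shared out of three set in total, giving 2/3; for the count pair the same formula yields 0.75.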

__init__(fingerprint_generator, similarity_measure: str = 'tanimoto', set_empty_scores: float | int | str = 'nan', ignore_stereochemistry: bool = False, count: bool = False, folded: bool = True, return_csr: bool = False, invalid_policy: str = 'raise', **fingerprint_config_kwargs)[source]

Parameters:

fingerprint_generator – A chemap-compatible fingerprint generator.
similarity_measure – Choose similarity measure from "cosine" or "tanimoto". The default is "tanimoto".
set_empty_scores – Define what should be returned instead of a similarity score in cases where fingerprints are missing. The default is "nan", which will return np.nan in such cases.
ignore_stereochemistry – Passed to internally created Fingerprints objects.
count – Passed to internally created Fingerprints objects.
folded – Passed to internally created Fingerprints objects.
return_csr – Passed to internally created Fingerprints objects.
invalid_policy – Passed to internally created Fingerprints objects.
**fingerprint_config_kwargs – Additional keyword arguments passed to internally created Fingerprints objects.

property is_structured_score: bool: Return True if this similarity uses a structured score dtype.

matrix(spectra_1: Sequence[Spectrum] | None = None, spectra_2: Sequence[Spectrum] | None = None, fingerprints_1: Fingerprints | None = None, fingerprints_2: Fingerprints | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True) → Scores[source]

Calculate matrix of fingerprint-based similarity scores.

Parameters:

spectra_1 – First collection of spectra. Used only if fingerprints_1 is not given.
spectra_2 – Second collection of spectra. Used only if fingerprints_2 is not given. If None and fingerprints_2 is None, compare the first input against itself.
fingerprints_1 – Optional precomputed Fingerprints object for the first input.
fingerprints_2 – Optional precomputed Fingerprints object for the second input. If None, compare the first input against itself.
score_fields – Requested score fields. Only ("score",) is supported.
progress_bar – Included for API compatibility. Not used here.

Returns:

Dense score matrix as a Scores object.

Return type:

Scores

pair(spectrum_1: Spectrum, spectrum_2: Spectrum)[source]

Pairwise fingerprint similarity is not supported in this API.

FingerprintSimilarity works on precomputed Fingerprints containers or computes fingerprints internally for collections of spectra in matrix().

Use matrix(…) instead.

score_datatype: alias of float64

sparse_matrix(spectra_1, spectra_2=None, idx_row=None, idx_col=None, score_fields=None, score_filter=None, progress_bar: bool = True): Sparse score computation is not available for this similarity.

to_dict() → dict: Return a dictionary representation of the similarity function.

class matchms.similarity.FlashEntropy(*args, normalize_to_half: bool = True, **kwargs)[source]

Bases: _BaseFlashSimilarity

Flash entropy similarity (Li & Fiehn, 2023) with a fast .matrix() that builds a library-wide index over ‘queries’ and streams all ‘references’ through it.

Key options:

matching_mode: ‘fragment’, ‘neutral_loss’, or ‘hybrid’ (fragment-priority).
tolerance in Da or symmetric ppm (use_ppm=True).
cleanup: removal of the precursor peak and of peaks above (precursor_mz - 1.6), 1% noise removal, entropy weighting, normalization to ∑I’ = 0.5, optional within-peak merge.
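A minimal sketch of the cleanup steps listed above, written in plain Python. The function name, the order of the steps, and the inclusive/exclusive handling of the thresholds are assumptions made for illustration; entropy weighting and peak merging are deliberately left out.

```python
def clean_spectrum(mz, intensities, precursor_mz,
                   precursor_window=1.6, noise_cutoff=0.01, target_sum=0.5):
    """Hypothetical cleanup: precursor removal, noise removal, normalization."""
    # 1) Drop the precursor peak and everything above precursor_mz - window.
    peaks = [(m, i) for m, i in zip(mz, intensities)
             if m <= precursor_mz - precursor_window]
    if not peaks:
        return []
    # 2) Drop peaks below noise_cutoff times the highest remaining intensity.
    max_intensity = max(i for _, i in peaks)
    peaks = [(m, i) for m, i in peaks if i >= noise_cutoff * max_intensity]
    # 3) Rescale so that the intensities sum to target_sum (0.5 by default).
    total = sum(i for _, i in peaks)
    if total <= 0:
        return []
    return [(m, target_sum * i / total) for m, i in peaks]

cleaned = clean_spectrum(
    mz=[100.0, 150.0, 200.0, 299.5],
    intensities=[50.0, 0.2, 100.0, 400.0],
    precursor_mz=300.0,
)
print([m for m, _ in cleaned])  # [100.0, 200.0]
```

In this example the peak at 299.5 falls inside the precursor window, and the peak at 150.0 is below 1% of the remaining base peak, so two peaks survive with a summed intensity of 0.5.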

Notes:

.pair() works but is not the fast path. Use .matrix().
For identity-search behavior, pass identity_precursor_tolerance (Da or ppm).

Parameters:

matching_mode – Matching mode: ‘fragment’, ‘neutral_loss’, or ‘hybrid’ (default is ‘fragment’).
tolerance – Matching tolerance in Da or ppm (use_ppm=True). Default is 0.02.
use_ppm – If True, interpret tolerance as parts-per-million. Default is False.
remove_precursor – If True, remove precursor peak and peaks within precursor_window. Default is False.
precursor_window – If remove_precursor is True, remove peaks within this window around the precursor m/z. Default is 1.6 Da (as suggested by Li & Fiehn, 2023).
noise_cutoff – If > 0, remove peaks with intensities below this fraction of the maximum intensity. Default is 0.01 (1%).
normalize_to_half – If True, normalize intensities such that the sum of intensities is 0.5. Default is True.
merge_within – If > 0, merge peaks within this distance (in Da) to a single peak. Default is 0.
identity_precursor_tolerance – If not None, enforce identity search behavior by requiring the precursor m/z of the query to be within this tolerance of the reference precursor m/z.
identity_use_ppm – If True, interpret identity_precursor_tolerance as ppm. Default is False.
dtype – Data type for the output scores. Default is np.float64, which properly accounts for the highest-resolution MS/MS data (even far beyond current MS/MS possibilities). To save memory, np.float32 can be used instead, which is sufficient for peak resolutions up to about 8,000,000.

__init__(*args, normalize_to_half: bool = True, **kwargs)[source]

property is_structured_score: bool: Return True if this similarity uses a structured score dtype.

matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True, n_jobs: int = -1)[source]

Calculate matrix of Flash Entropy scores.

Parameters:

spectra_1 – First collection of input spectra.
spectra_2 – Second collection of input spectra. If None, compare spectra_1 against itself.
score_fields – Requested score fields. Only ("score",) is supported.
progress_bar – When True, show a progress bar.
n_jobs – Number of parallel jobs to run. Default is -1, which means that all available CPUs minus one will be used.

Returns:

Dense score matrix as a Scores object.

Return type:

Scores

pair(spectrum_1: Spectrum, spectrum_2: Spectrum) → ndarray[source]

Compute Flash Entropy for a single (reference, query) pair. Uses the same preprocessing and scoring logic as the matrix path, but builds a tiny 1-spectrum library from the query.

Caution: this is not the intended fast path; prefer .matrix() instead.

score_datatype: alias of float32

sparse_matrix(spectra_1, spectra_2=None, idx_row=None, idx_col=None, score_fields=None, score_filter=None, progress_bar: bool = True): Sparse score computation is not available for this similarity.

to_dict() → dict: Return a dictionary representation of the similarity function.

class matchms.similarity.MetadataMatch(field: str, matching_type: str = 'equal_match', tolerance: float = 0.1, tolerance_type: str = 'Dalton')[source]

Bases: BaseSimilarityWithSparse

Return True if metadata entries of a specified field match between two spectra.

This class is intended for comparing a wide range of metadata entries, so that the results can later be used to select related or similar spectra.

Matching can be done by:

exact equality (matching_type="equal_match")
numerical difference within a tolerance (matching_type="difference")

For numerical differences, the tolerance can be interpreted as:

absolute difference in Dalton / raw units (tolerance_type="Dalton")
relative difference in ppm (tolerance_type="ppm")
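The two tolerance types can be sketched as follows. The function is hypothetical and not part of the library; in particular, computing the ppm difference relative to the mean of the two values (a symmetric definition) is an assumption.

```python
def values_match(a, b, tolerance=0.1, tolerance_type="Dalton"):
    """Hypothetical helper: decide whether two numerical entries match."""
    difference = abs(a - b)
    if tolerance_type == "Dalton":
        # Absolute difference in the raw units of the field.
        return difference <= tolerance
    if tolerance_type == "ppm":
        # Relative difference, here taken against the mean of both values.
        mean_value = (a + b) / 2
        return difference / mean_value * 1e6 <= tolerance
    raise ValueError(f"Unknown tolerance_type: {tolerance_type!r}")

# 500.000 vs 500.004 differ by 0.004 in absolute terms, or roughly 8 ppm.
print(values_match(500.000, 500.004, tolerance=0.1))                         # True
print(values_match(500.000, 500.004, tolerance=5, tolerance_type="ppm"))     # False
print(values_match(500.000, 500.004, tolerance=10, tolerance_type="ppm"))    # True
```

A ppm tolerance scales with the magnitude of the values, which is why it is usually preferred for m/z-like fields, while an absolute tolerance is the natural choice for fields such as retention_time.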

Example: calculate scores between two lists of two spectra each and inspect the score matrix:

import numpy as np
from matchms import Spectrum
from matchms.similarity import MetadataMatch

spectrum_1 = Spectrum(
    mz=np.array([]),
    intensities=np.array([]),
    metadata={"instrument_type": "orbitrap", "id": 1},
)
spectrum_2 = Spectrum(
    mz=np.array([]),
    intensities=np.array([]),
    metadata={"instrument_type": "qtof", "id": 2},
)
spectrum_3 = Spectrum(
    mz=np.array([]),
    intensities=np.array([]),
    metadata={"instrument_type": "qtof", "id": 3},
)
spectrum_4 = Spectrum(
    mz=np.array([]),
    intensities=np.array([]),
    metadata={"instrument_type": "orbitrap", "id": 4},
)

spectra_1 = [spectrum_1, spectrum_2]
spectra_2 = [spectrum_3, spectrum_4]

similarity = MetadataMatch(field="instrument_type")
scores = similarity.matrix(spectra_1, spectra_2)

score_array = scores.to_array()

# Use separate loop variable names to avoid shadowing spectrum_1 and spectrum_2.
for i, spec_a in enumerate(spectra_1):
    for j, spec_b in enumerate(spectra_2):
        print(
            f"Metadata match between {spec_a.get('id')} and "
            f"{spec_b.get('id')} is {bool(score_array[i, j])}"
        )

Should output

Metadata match between 1 and 3 is False
Metadata match between 1 and 4 is True
Metadata match between 2 and 3 is True
Metadata match between 2 and 4 is False

__init__(field: str, matching_type: str = 'equal_match', tolerance: float = 0.1, tolerance_type: str = 'Dalton')[source]

Parameters:

field – Specify field name for metadata that should be compared.
matching_type – Specify how field entries should be matched. Can be one of ["equal_match", "difference"]. "equal_match": entries must be exactly equal (default). "difference": entries are considered a match if their numerical difference is less than or equal to tolerance.
tolerance – Specify the maximum difference (inclusive) up to which two values are counted as a match. This only applies to numerical values.
tolerance_type – Choose between fixed tolerance in Dalton / raw units ("Dalton") or a relative difference in ppm ("ppm"). This only applies when matching_type="difference".

property is_structured_score: bool: Return True if this similarity uses a structured score dtype.

keep_score(score) → bool

Return whether a score should be retained in sparse outputs.

This defines the default sparse retention behavior. Users can override it per call via score_filter=....

Default behavior:

scalar score: keep if score != 0
structured score: keep if all fields are non-zero
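The default retention rule and a per-call override can be sketched in plain Python. Structured scores are modelled here as simple (score, matches) tuples and sparse results as a dictionary keyed by (row, column); both are simplifying assumptions, and the helper names are hypothetical.

```python
def default_keep_score(score):
    """Default rule: non-zero scalar, or all fields non-zero for structured scores."""
    if isinstance(score, tuple):  # structured score, e.g. (score, matches)
        return all(field != 0 for field in score)
    return score != 0

def to_sparse(dense, score_filter=None):
    """Keep only the entries accepted by score_filter (or by the default rule)."""
    keep = score_filter if score_filter is not None else default_keep_score
    return {
        (i, j): score
        for i, row in enumerate(dense)
        for j, score in enumerate(row)
        if keep(score)
    }

dense = [[(0.9, 4), (0.0, 0)],
         [(0.3, 1), (0.6, 3)]]

default_kept = to_sparse(dense)
strict_kept = to_sparse(dense, score_filter=lambda s: s[0] >= 0.5)

print(sorted(default_kept))  # [(0, 0), (1, 0), (1, 1)]
print(sorted(strict_kept))   # [(0, 0), (1, 1)]
```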

matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True) → Scores[source]

Compare metadata entries between all spectra in spectra_1 and spectra_2.

Parameters:

spectra_1 – First collection of input spectra.
spectra_2 – Second collection of input spectra. If None, compare spectra_1 against itself.
score_fields – Requested score fields. Only ("score",) is supported.
progress_bar – Included for API compatibility. Not used here because this optimized implementation does not iterate pairwise in Python.

pair(spectrum_1: Spectrum, spectrum_2: Spectrum)[source]

Compare metadata entries between two spectra.

Parameters:

spectrum_1 – First spectrum.
spectrum_2 – Second spectrum.

score_datatype: alias of bool

sparse_matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, idx_row=None, idx_col=None, score_fields: Sequence[str] | None = None, score_filter: Callable[[ndarray], bool] | None = None, progress_bar: bool = True) → Scores[source]

Compare metadata entries and return sparse scores.

This method uses optimized metadata matching when no explicit indices are provided. If explicit idx_row and idx_col are given, it falls back to the generic sparse implementation from BaseSimilarityWithSparse.

to_dict() → dict: Return a dictionary representation of the similarity function.

class matchms.similarity.ModifiedCosine(tolerance: float = 0.1, intensity_power: float = 1.0, use_hungarian: bool = False, noise_cutoff: float = 0.01)[source]

Bases: BaseSimilarity

Calculate an approximate modified cosine score between mass spectra.

This is the central Modified Cosine class of matchms. The Modified Cosine score aims at quantifying the similarity between two mass spectra. Two peaks are considered a potential match if their m/z values lie within the given tolerance, or if their m/z values lie within the tolerance once a mass shift is applied. The mass shift is the difference in precursor m/z between the two spectra.

Matchms provides various implementations of the Modified Cosine score which are combined here in what we believe to be the typical best choice for most users.

By default, the parameter use_hungarian is set to False, which means that the greedy algorithm is used to find the best matches. This is typically faster than the Hungarian algorithm, and for most applications the results are very similar. If you need the exact optimal solution, you can set use_hungarian to True, which will use the Hungarian algorithm to find the best matches.
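The difference between greedy and exact assignment can be seen on a tiny example. The sketch below is not the library's implementation: the greedy routine is a simplified version of the idea, and the exact optimum is found by brute force over all permutations rather than by the Hungarian algorithm (both yield the same optimum, but brute force is only feasible for very small inputs). The weight matrix is invented for illustration.

```python
from itertools import permutations

# weights[i][j]: product of peak weights if peak i (spectrum 1) is paired
# with peak j (spectrum 2); 0.0 means the peaks are not within tolerance.
weights = [[10.0, 9.0],
           [9.0, 0.0]]

def greedy_assignment(weights):
    """Repeatedly take the highest remaining weight whose peaks are still free."""
    candidates = sorted(
        ((w, i, j) for i, row in enumerate(weights) for j, w in enumerate(row) if w > 0),
        reverse=True,
    )
    used_rows, used_cols, total = set(), set(), 0.0
    for w, i, j in candidates:
        if i not in used_rows and j not in used_cols:
            used_rows.add(i)
            used_cols.add(j)
            total += w
    return total

def optimal_assignment(weights):
    """Exact one-to-one optimum by brute force (square matrices only)."""
    n = len(weights)
    return max(
        sum(weights[i][perm[i]] for i in range(n))
        for perm in permutations(range(n))
    )

print(greedy_assignment(weights))   # 10.0
print(optimal_assignment(weights))  # 18.0
```

Greedy commits to the single best pair first and thereby blocks the two second-best pairs, which together would have scored higher. Such constructed cases are the reason an exact variant exists; in real spectra they tend to matter less.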

For more conceptual context, see Watrous et al. [PNAS, 2012, https://www.pnas.org/content/109/26/E1743].

__init__(tolerance: float = 0.1, intensity_power: float = 1.0, use_hungarian: bool = False, noise_cutoff: float = 0.01)[source]

Initialize the modified cosine score class.

Parameters:

tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.
use_hungarian – Whether to use the Hungarian algorithm to find the best matches. The default is False, which means that the greedy algorithm is used to find the best matches. The greedy algorithm is typically faster than the Hungarian algorithm, and for most applications the results are very similar.
noise_cutoff – Minimum relative intensity for a peak to be considered. Default is 0.01. Will only be used if use_hungarian is False.

property is_structured_score: bool: Return True if this similarity uses a structured score dtype.

matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True, n_jobs: int = -1)[source]

Calculate matrix of Modified Cosine scores.

Parameters:

spectra_1 – First collection of input spectra.
spectra_2 – Second collection of input spectra. If None, compare spectra_1 against itself.
score_fields – Requested score fields. Only ("score",) is supported.
progress_bar – When True, show a progress bar.
n_jobs – Number of parallel jobs to run. Default is -1, which means that all available CPUs minus one will be used.

Returns:

Dense score matrix as a Scores object.

Return type:

Scores

pair(spectrum_1: Spectrum, spectrum_2: Spectrum) → tuple[float, int][source]: Calculate approximate modified cosine score between two spectra.

sparse_matrix(spectra_1, spectra_2=None, idx_row=None, idx_col=None, score_fields=None, score_filter=None, progress_bar: bool = True): Sparse score computation is not available for this similarity.

to_dict() → dict: Return a dictionary representation of the similarity function.

class matchms.similarity.ModifiedCosineGreedy(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0, noise_cutoff: float = 0.01)[source]

Bases: BaseSimilarityWithSparse

Calculate an approximate modified cosine score between mass spectra.

This implementation solves the peak assignment in a greedy way and is therefore an approximation. See ModifiedCosineHungarian for the exact assignment variant.

The modified cosine score aims at quantifying the similarity between two mass spectra. Two peaks are considered a potential match if their m/z ratios lie within the given tolerance, or if their m/z ratios lie within the tolerance once a mass-shift is applied. The mass shift is the difference in precursor m/z between the two spectra.

See Watrous et al. [PNAS, 2012, https://www.pnas.org/content/109/26/E1743] for further details.

Unlike in matchms < 1.0, this method also applies a noise filter by default, which removes peaks with intensity below a certain cutoff. This is typically highly beneficial for the performance of the greedy algorithm, and for most applications the results are very similar to the exact assignment variant. If you want to disable this noise filtering, you can set noise_cutoff to 0 or None.

__init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0, noise_cutoff: float = 0.01)[source]

Initialize approximate modified cosine.

Parameters:

tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.
mz_power – The power to raise mz to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.
noise_cutoff – Minimum relative intensity for a peak to be considered. Default is 0.01.
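How mz_power and intensity_power enter the score can be sketched as follows. The helper names are hypothetical, and the normalization by the product of both weight-vector norms is the usual cosine convention, assumed here rather than taken from the library source.

```python
import math

def peak_weights(mz, intensities, mz_power=0.0, intensity_power=1.0):
    """Weight of each peak: mz**mz_power * intensity**intensity_power."""
    return [(m ** mz_power) * (i ** intensity_power) for m, i in zip(mz, intensities)]

def cosine_from_pairs(weights_1, weights_2, matched_pairs):
    """Cosine-style score from already assigned peak pairs (i, j)."""
    numerator = sum(weights_1[i] * weights_2[j] for i, j in matched_pairs)
    norm = (math.sqrt(sum(w * w for w in weights_1))
            * math.sqrt(sum(w * w for w in weights_2)))
    return numerator / norm if norm > 0 else 0.0

mz = [100.0, 200.0]
intensities = [1.0, 0.5]
matched = [(0, 0)]  # assume only the peaks at m/z 100 were matched

# mz_power = 0: weights depend on intensities only.
w_plain = peak_weights(mz, intensities)
score_plain = cosine_from_pairs(w_plain, w_plain, matched)

# mz_power = 1: the unmatched high-m/z peak gains weight, lowering the score.
w_mz = peak_weights(mz, intensities, mz_power=1.0)
score_mz = cosine_from_pairs(w_mz, w_mz, matched)

print(round(score_plain, 3), round(score_mz, 3))  # 0.8 0.5
```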

property is_structured_score: bool: Return True if this similarity uses a structured score dtype.

keep_score(score) → bool

Return whether a score should be retained in sparse outputs.

This defines the default sparse retention behavior. Users can override it per call via score_filter=....

Default behavior:

scalar score: keep if score != 0
structured score: keep if all fields are non-zero

matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True)

Calculate a dense similarity matrix.

Parameters:

spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself. For commutative similarities this automatically uses a symmetric optimization.
score_fields – Score fields to return. None returns all available fields. For scalar scores, only ("score",) is valid. For structured scores, this can be a subset such as ("score",).
progress_bar – When True, show a progress bar. Default is True.

Returns:

Dense score result wrapped in a Scores container.

Return type:

Scores

pair(spectrum_1: Spectrum, spectrum_2: Spectrum) → tuple[float, int][source]: Calculate approximate modified cosine score between two spectra.

sparse_matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, idx_row: ArrayLike | None = None, idx_col: ArrayLike | None = None, score_fields: Sequence[str] | None = None, score_filter: Callable[[ndarray], bool] | None = None, progress_bar: bool = True)

Calculate sparse similarity results.

Filtering is applied to the full score before score field projection.

Parameters:

spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself.
idx_row – Row indices of pairs to compute. If None and idx_col is also None, all pairwise comparisons are considered and only retained scores are stored.
idx_col – Column indices of pairs to compute. Must have the same shape as idx_row.
score_fields – Score fields to return. None returns all available fields. For scalar scores, only ("score",) is valid. For structured scores, this can be a subset such as ("score",).
score_filter – Optional callable receiving the full score and returning whether it should be retained. If None, keep_score() is used.
progress_bar – When True, show a progress bar.

Returns:

Sparse score result wrapped in a Scores container.

Return type:

Scores
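The role of idx_row and idx_col can be illustrated with a library-independent sketch. The function below is hypothetical: it represents spectra by their precursor m/z only, returns a plain dictionary instead of a Scores container, and always drops zero scores.

```python
def sparse_scores(items_1, items_2, pair_score, idx_row=None, idx_col=None):
    """Score either all pairs or only the pairs given by (idx_row, idx_col)."""
    if idx_row is None and idx_col is None:
        pairs = [(i, j) for i in range(len(items_1)) for j in range(len(items_2))]
    elif idx_row is None or idx_col is None or len(idx_row) != len(idx_col):
        raise ValueError("idx_row and idx_col must be given together, with equal length")
    else:
        pairs = list(zip(idx_row, idx_col))
    result = {}
    for i, j in pairs:
        score = pair_score(items_1[i], items_2[j])
        if score != 0:  # default retention: store non-zero scores only
            result[(i, j)] = score
    return result

def toy_match(a, b):
    """Toy score: 1.0 if two values agree within 0.1, else 0.0."""
    return 1.0 if abs(a - b) <= 0.1 else 0.0

precursors_1 = [100.0, 200.0, 300.0]
precursors_2 = [100.05, 250.0, 300.0]

all_pairs = sparse_scores(precursors_1, precursors_2, toy_match)
selected = sparse_scores(precursors_1, precursors_2, toy_match,
                         idx_row=[0, 1], idx_col=[0, 1])

print(all_pairs)  # {(0, 0): 1.0, (2, 2): 1.0}
print(selected)   # {(0, 0): 1.0}
```

Passing explicit indices is useful when a cheap prefilter, for example a precursor m/z match, has already narrowed down the candidate pairs, so the more expensive score is computed only where needed.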

to_dict() → dict: Return a dictionary representation of the similarity function.

class matchms.similarity.ModifiedCosineHungarian(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]

Bases: BaseSimilarityWithSparse

Calculate exact modified cosine score between mass spectra.

The modified cosine score quantifies similarity between two mass spectra with optional precursor-based mass shift. Potential matches are all peak pairs that are within tolerance either unshifted or shifted by precursor_mz(reference) - precursor_mz(query).
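Collecting the potential matches described above can be sketched in a few lines. The function names are hypothetical, and applying the shift to the query peaks is an assumption about the sign convention.

```python
def candidate_pairs(mz_reference, mz_query, tolerance, shift=0.0):
    """All (i, j) with |mz_reference[i] - (mz_query[j] + shift)| <= tolerance."""
    return [
        (i, j)
        for i, ref in enumerate(mz_reference)
        for j, qry in enumerate(mz_query)
        if abs(ref - (qry + shift)) <= tolerance
    ]

def modified_cosine_candidates(mz_reference, precursor_reference,
                               mz_query, precursor_query, tolerance=0.1):
    """Union of unshifted matches and matches after the precursor mass shift."""
    shift = precursor_reference - precursor_query
    unshifted = candidate_pairs(mz_reference, mz_query, tolerance)
    shifted = candidate_pairs(mz_reference, mz_query, tolerance, shift=shift)
    return sorted(set(unshifted) | set(shifted))

# The reference carries a modification of +14 relative to the query.
pairs = modified_cosine_candidates(
    mz_reference=[100.0, 214.0], precursor_reference=314.0,
    mz_query=[100.0, 200.0], precursor_query=300.0,
)
print(pairs)  # [(0, 0), (1, 1)]
```

The fragment at m/z 100 matches directly, while the fragment that carries the modification (200 in the query, 214 in the reference) only matches once the shift is applied. The assignment step then selects a one-to-one subset of these candidates.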

Peak assignment is solved globally via Hungarian assignment (linear sum assignment), which yields an exact one-to-one maximum-weight matching.

See Watrous et al. [PNAS, 2012, https://www.pnas.org/content/109/26/E1743] for the modified cosine concept.

__init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]

Initialize exact modified cosine.

Parameters:

tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.
mz_power – The power to raise mz to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.

property is_structured_score: bool: Return True if this similarity uses a structured score dtype.

keep_score(score) → bool

Return whether a score should be retained in sparse outputs.

This defines the default sparse retention behavior. Users can override it per call via score_filter=....

Default behavior:

scalar score: keep if score != 0
structured score: keep if all fields are non-zero

matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True)

Calculate a dense similarity matrix.

Parameters:

spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself. For commutative similarities this automatically uses a symmetric optimization.
score_fields – Score fields to return. None returns all available fields. For scalar scores, only ("score",) is valid. For structured scores, this can be a subset such as ("score",).
progress_bar – When True, show a progress bar. Default is True.

Returns:

Dense score result wrapped in a Scores container.

Return type:

Scores

pair(spectrum_1: Spectrum, spectrum_2: Spectrum) → tuple[float, int][source]: Calculate exact modified cosine score between two spectra.

sparse_matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, idx_row: ArrayLike | None = None, idx_col: ArrayLike | None = None, score_fields: Sequence[str] | None = None, score_filter: Callable[[ndarray], bool] | None = None, progress_bar: bool = True)

Calculate sparse similarity results.

Filtering is applied to the full score before score field projection.

Parameters:

spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself.
idx_row – Row indices of pairs to compute. If None and idx_col is also None, all pairwise comparisons are considered and only retained scores are stored.
idx_col – Column indices of pairs to compute. Must have the same shape as idx_row.
score_fields – Score fields to return. None returns all available fields. For scalar scores, only ("score",) is valid. For structured scores, this can be a subset such as ("score",).
score_filter – Optional callable receiving the full score and returning whether it should be retained. If None, keep_score() is used.
progress_bar – When True, show a progress bar.

Returns:

Sparse score result wrapped in a Scores container.

Return type:

Scores

to_dict() → dict: Return a dictionary representation of the similarity function.

class matchms.similarity.NeutralLossesCosine(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0, ignore_peaks_above_precursor: bool = True)[source]

Bases: BaseSimilarityWithSparse

Calculate ‘neutral losses cosine score’ between mass spectra.

The neutral losses cosine score aims at quantifying the similarity between two mass spectra. The score is calculated by finding best possible matches between peaks of two spectra. Two peaks are considered a potential match if their m/z ratios lie within the given ‘tolerance’ once a mass-shift is applied. The mass shift is the difference in precursor-m/z between the two spectra. In general, ModifiedCosineGreedy is recommended over NeutralLossesCosine because it will on average deliver more reliable results.
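Matching after the mass shift is equivalent to comparing neutral losses, that is, precursor m/z minus fragment m/z. The following sketch uses hypothetical helper names and invented peak lists to show this view, including the effect of ignoring peaks above the precursor.

```python
def neutral_losses(mz, precursor_mz, ignore_peaks_above_precursor=True):
    """Return (peak index, neutral loss) for every peak that is kept."""
    losses = []
    for idx, m in enumerate(mz):
        if ignore_peaks_above_precursor and m > precursor_mz:
            continue  # would correspond to a negative "neutral loss"
        losses.append((idx, precursor_mz - m))
    return losses

def matching_losses(losses_1, losses_2, tolerance=0.1):
    """Pairs of peak indices whose neutral losses agree within tolerance."""
    return [
        (i, j)
        for i, loss_a in losses_1
        for j, loss_b in losses_2
        if abs(loss_a - loss_b) <= tolerance
    ]

losses_a = neutral_losses([100.0, 296.0, 320.0], precursor_mz=314.0)
losses_b = neutral_losses([150.0, 282.0], precursor_mz=300.0)
pairs = matching_losses(losses_a, losses_b)

print(losses_a)  # [(0, 214.0), (1, 18.0)]
print(pairs)     # [(1, 1)]
```

Both spectra show a loss of 18 from their respective precursors, so those two peaks form a candidate pair even though their m/z values differ by 14. The peak at m/z 320 lies above its precursor and is skipped.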

__init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0, ignore_peaks_above_precursor: bool = True)[source]

Parameters:

tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.
mz_power – The power to raise mz to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.
ignore_peaks_above_precursor – By default this is set to True, meaning that peaks with m/z values larger than the precursor-m/z will be ignored (since those would correspond to negative “neutral losses”).

property is_structured_score: bool: Return True if this similarity uses a structured score dtype.

keep_score(score) → bool

Return whether a score should be retained in sparse outputs.

This defines the default sparse retention behavior. Users can override it per call via score_filter=....

Default behavior:

scalar score: keep if score != 0
structured score: keep if all fields are non-zero

matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True)

Calculate a dense similarity matrix.

Parameters:

spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself. For commutative similarities this automatically uses a symmetric optimization.
score_fields – Score fields to return. None returns all available fields. For scalar scores, only ("score",) is valid. For structured scores, this can be a subset such as ("score",).
progress_bar – When True, show a progress bar. Default is True.

Returns:

Dense score result wrapped in a Scores container.

Return type:

Scores

pair(spectrum_1: Spectrum, spectrum_2: Spectrum) → tuple[float, int][source]

Calculate neutral losses cosine score between two spectra.

Parameters:

spectrum_1 – Single reference spectrum.
spectrum_2 – Single query spectrum.

Returns:

Tuple with cosine score and number of matched peaks.

sparse_matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, idx_row: ArrayLike | None = None, idx_col: ArrayLike | None = None, score_fields: Sequence[str] | None = None, score_filter: Callable[[ndarray], bool] | None = None, progress_bar: bool = True)

Calculate sparse similarity results.

Filtering is applied to the full score before score field projection.

Parameters:

spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself.
idx_row – Row indices of pairs to compute. If None and idx_col is also None, all pairwise comparisons are considered and only retained scores are stored.
idx_col – Column indices of pairs to compute. Must have the same shape as idx_row.
score_fields – Score fields to return. None returns all available fields. For scalar scores, only ("score",) is valid. For structured scores, this can be a subset such as ("score",).
score_filter – Optional callable receiving the full score and returning whether it should be retained. If None, keep_score() is used.
progress_bar – When True, show a progress bar.

Returns:

Sparse score result wrapped in a Scores container.

Return type:

Scores

to_dict() → dict: Return a dictionary representation of the similarity function.

class matchms.similarity.ParentMassMatch(tolerance: float = 0.1)[source]

Bases: MetadataMatch

Return True if spectra match in parent mass, and False otherwise.

__init__(tolerance: float = 0.1)[source]

Parameters:

tolerance – Specify tolerance below which two parent masses are counted as a match.

property is_structured_score: bool: Return True if this similarity uses a structured score dtype.

keep_score(score) → bool

Return whether a score should be retained in sparse outputs.

This defines the default sparse retention behavior. Users can override it per call via score_filter=....

Default behavior:

scalar score: keep if score != 0
structured score: keep if all fields are non-zero

matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True) → Scores

Compare metadata entries between all spectra in spectra_1 and spectra_2.

Parameters:

spectra_1 – First collection of input spectra.
spectra_2 – Second collection of input spectra. If None, compare spectra_1 against itself.
score_fields – Requested score fields. Only ("score",) is supported.
progress_bar – Included for API compatibility. Not used here because this optimized implementation does not iterate pairwise in Python.

pair(spectrum_1: Spectrum, spectrum_2: Spectrum)

Compare metadata entries between two spectra.

Parameters:

spectrum_1 – First spectrum.
spectrum_2 – Second spectrum.

score_datatype: alias of bool

sparse_matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, idx_row=None, idx_col=None, score_fields: Sequence[str] | None = None, score_filter: Callable[[ndarray], bool] | None = None, progress_bar: bool = True) → Scores

Compare metadata entries and return sparse scores.

This method uses optimized metadata matching when no explicit indices are provided. If explicit idx_row and idx_col are given, it falls back to the generic sparse implementation from BaseSimilarityWithSparse.

to_dict() → dict: Return a dictionary representation of the similarity function.

class matchms.similarity.PrecursorMzMatch(tolerance: float = 0.1, tolerance_type: str = 'Dalton')[source]

Bases: MetadataMatch

Return True if spectra match in precursor m/z, and False otherwise.

The match within tolerance can be calculated based on an absolute m/z difference (tolerance_type="Dalton") or based on a relative difference in ppm (tolerance_type="ppm").

__init__(tolerance: float = 0.1, tolerance_type: str = 'Dalton')[source]

Parameters:

tolerance – Specify tolerance below which two precursor m/z values are counted as a match.
tolerance_type – Choose between fixed tolerance in Dalton ("Dalton") or a relative difference in ppm ("ppm").

property is_structured_score: bool: Return True if this similarity uses a structured score dtype.

keep_score(score) → bool

Return whether a score should be retained in sparse outputs.

This defines the default sparse retention behavior. Users can override it per call via score_filter=....

Default behavior:

scalar score: keep if score != 0
structured score: keep if all fields are non-zero

matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True) → Scores

Compare metadata entries between all spectra in spectra_1 and spectra_2.

Parameters:

spectra_1 – First collection of input spectra.
spectra_2 – Second collection of input spectra. If None, compare spectra_1 against itself.
score_fields – Requested score fields. Only ("score",) is supported.
progress_bar – Included for API compatibility. Not used here because this optimized implementation does not iterate pairwise in Python.

pair(spectrum_1: Spectrum, spectrum_2: Spectrum)

Compare metadata entries between two spectra.

Parameters:

spectrum_1 – First spectrum.
spectrum_2 – Second spectrum.

score_datatype: alias of bool

sparse_matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, idx_row=None, idx_col=None, score_fields: Sequence[str] | None = None, score_filter: Callable[[ndarray], bool] | None = None, progress_bar: bool = True) → Scores

Compare metadata entries and return sparse scores.

This method uses optimized metadata matching when no explicit indices are provided. If explicit idx_row and idx_col are given, it falls back to the generic sparse implementation from BaseSimilarityWithSparse.

to_dict() → dict: Return a dictionary representation of the similarity function.

Submodules