matchms.similarity package

Functions for computing spectra similarities

Matchms provides a number of frequently used similarity scores to compare mass spectra. This includes

scores based on comparing peak positions and intensities (CosineGreedy, ModifiedCosineGreedy, ModifiedCosineHungarian)
simple scores that only assess precursor m/z or parent mass matches (PrecursorMzMatch or ParentMassMatch)
scores assessing molecular similarity if structures (SMILES, InchiKey) are given as metadata (FingerprintSimilarity)
a score for assessing matches in user-defined metadata fields, which can be used to find equal entries (e.g. instrument_type) or numerical values within a specified tolerance (e.g. retention_time, collision energy) (MetadataMatch)

Custom similarity measures can also be added, and external ones (such as Spec2Vec) can be imported.

class matchms.similarity.BinnedEmbeddingSimilarity(similarity: str = 'cosine', max_mz: float = 1005, bin_width: float = 1, intensity_power: float = 1)[source]

Bases: BaseEmbeddingSimilarity

A similarity measure that bins spectra into a fixed number of bins and uses the binned intensities as embedding features. By default, the similarity between spectra is computed as the cosine similarity between their binned representations.

Parameters:

similarity (str, optional) – The similarity measure to use for comparing embeddings. Default is “cosine”. Options are “cosine” or “euclidean”.
max_mz (float, optional) – The maximum m/z value to consider when binning. Default is 1005.
bin_width (float, optional) – The width of each bin in m/z units. Default is 1.
intensity_power (float, optional) – The power to which peak intensities are raised. Default is 1.
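
As a rough illustration of what binning means here, the sketch below turns a peak list into a fixed-length vector and compares two such vectors by cosine similarity. It is plain Python and not the matchms implementation: the helper names bin_spectrum and cosine are hypothetical, and summing intensities that share a bin and ignoring peaks outside the grid are assumptions made for the example.

```python
import math

def bin_spectrum(mz, intensities, max_mz=1005.0, bin_width=1.0, intensity_power=1.0):
    """Hypothetical helper: map peaks onto a fixed grid of m/z bins."""
    n_bins = int(math.ceil(max_mz / bin_width))
    vector = [0.0] * n_bins
    for m, i in zip(mz, intensities):
        idx = int(m // bin_width)
        if 0 <= idx < n_bins:  # assumption: peaks outside the grid are ignored
            vector[idx] += i ** intensity_power
    return vector

def cosine(a, b):
    """Cosine similarity between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

emb_1 = bin_spectrum([100.0, 150.0, 200.0], [0.7, 0.2, 0.1])
emb_2 = bin_spectrum([100.2, 140.0, 190.0], [0.4, 0.2, 0.1])
print(len(emb_1), round(cosine(emb_1, emb_2), 3))
```

With bin_width=1, the peaks at 100.0 and 100.2 land in the same bin, so they contribute to the score even though their m/z values differ slightly.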

__init__(similarity: str = 'cosine', max_mz: float = 1005, bin_width: float = 1, intensity_power: float = 1)[source]

build_ann_index(reference_spectra: Iterable[Spectrum] | None = None, embeddings_path: str | Path | None = None, k: int = 100, index_backend: str = 'pynndescent', **index_kwargs) → Any

Build an ANN index for the reference spectra.

Parameters:

reference_spectra (Optional[Iterable[SpectrumType]]) – List of reference spectra to build the ANN index for.
embeddings_path (Optional[Union[str, Path]]) – If embeddings are already computed, provide the path to the numpy file.
k (int, optional) – Number of nearest neighbors to use for the ANN index.
index_backend (str, optional) – Backend to use for ANN index. Currently only “pynndescent” is supported.
**index_kwargs – Additional keyword arguments passed to the index constructor.

Returns:

The constructed ANN index.

Return type:

Any

Raises:

ImportError – If pynndescent is not installed.
ValueError – If an unsupported index_backend is specified.

compute_embeddings(spectra: Iterable[Spectrum]) → ndarray[source]

Convert spectra into binned embeddings.

Parameters:

spectra (Iterable[SpectrumType]) – The spectra to convert into embeddings.

Returns:

Array of shape (n_spectra, n_bins) containing the binned embeddings.

Return type:

np.ndarray

get_anns(query_spectra: Iterable[Spectrum] | ndarray, k: int = 100) → Tuple[ndarray, ndarray]

Get approximate nearest neighbors for query spectra.

Parameters:

query_spectra (Union[Iterable[SpectrumType], np.ndarray]) – Query spectra or their embeddings.
k (int, optional) – Number of nearest neighbors to return.

Returns:

Neighbor indices and similarity scores.

Return type:

Tuple[np.ndarray, np.ndarray]

Raises:

ValueError – If no index is built or k is larger than index k.

get_embeddings(spectra: Iterable[Spectrum] | None = None, npy_path: str | Path | None = None) → ndarray

Get embeddings either by computing them or loading from disk.

Parameters:

spectra – List of spectra to compute embeddings for.
npy_path – Path to load embeddings from or save them to. If provided and the file exists, embeddings are loaded from disk; otherwise they are computed and saved to this path.

Returns:

Embeddings array.

Return type:

np.ndarray

Raises:

ValueError – If neither spectra nor npy_path is provided.

get_index_anns() → Tuple[ndarray, ndarray]

Get nearest neighbors for all points in the index.

Returns:

Neighbor indices and similarity scores.

Return type:

Tuple[np.ndarray, np.ndarray]

Raises:

ValueError – If an unsupported index backend is used.

keep_score(score): In the .matrix method scores will be collected in a sparse way. Overwrite this method here if values other than False or 0 should not be stored in the final collection.

load_ann_index(path: str | Path) → Any

Load an ANN index from disk.

Parameters:

path (Union[str, Path]) – Path to load the index from.

Returns:

The loaded ANN index.

Return type:

Any

Raises:

ValueError – If the similarity metric of the loaded index doesn’t match the current metric.

static load_embeddings(npy_path: str | Path) → ndarray

Load embeddings from a numpy file.

Parameters:

npy_path (Union[str, Path]) – Path to the numpy file.

Returns:

Embeddings array.

Return type:

np.ndarray

Raises:

ValueError – If the loaded array is not 2D.

matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = True) → ndarray

Compute similarity matrix between reference and query spectra.

Parameters:

references – List of reference spectra.
queries – List of query spectra.
array_type – Type of array to return. Must be “numpy”.
is_symmetric – Whether the matrix is symmetric. Must be True.

Returns:

Similarity matrix.

Return type:

np.ndarray

Raises:

ValueError – If array_type is not “numpy” or is_symmetric is False.

pair(reference: Spectrum, query: Spectrum) → float

Compute similarity between a pair of spectra.

Parameters:

reference (SpectrumType) – Reference spectrum.
query (SpectrumType) – Query spectrum.

Returns:

Similarity score between the spectra.

Return type:

float

save_ann_index(path: str | Path) → None

Save the ANN index to disk.

Parameters:

path (Union[str, Path]) – Path to save the index to.

Raises:

ValueError – If no index exists to save.

score_datatype: alias of float64

sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)

Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.

Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.

Parameters:

references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.

static store_embeddings(npy_path: str | Path, embs: ndarray) → None

Store embeddings in a numpy file.

Parameters:

npy_path (Union[str, Path]) – Path to save the embeddings to.
embs (np.ndarray) – Embeddings array to store.

to_dict() → dict: Return a dictionary representation of a similarity function.

class matchms.similarity.BlinkCosine(tolerance: float = 0.01, bin_width: float = 0.001, mz_power: float = 0.0, intensity_power: float = 1.0, clip_to_one: bool = True, use_numba: bool = True, prefilter: bool = True, min_relative_intensity: float = 0.01, crop_above_precursor: bool = True, remove_zero_intensities: bool = True, top_k: int | None = None, batch_size: int = 1024, sparse_score_min: float = 0.0)[source]

Bases: BaseSimilarity

BLINK-style approximate cosine similarity for mass spectra with fast .pair() and .matrix(). This score is implemented based on the method BLINK, proposed by Harwood et al. (2023, https://www.nature.com/articles/s41598-023-40496-9).

Integer binning with bin_width (Da); tolerance window is ± floor(tolerance/bin_width) bins.
Per-spectrum L2 normalization (after optional mz/intensity weighting).
Blur only one side (queries in .matrix(), smaller spectrum in .pair()).
Pairwise returns (score, ~matches). Matrix returns only scores.

Parameters:

tolerance – True m/z tolerance (Da). Peaks within +/- tolerance are considered matches. Default 0.01.
bin_width – Discretization width (Da). Default 0.001 (1 mDa). Effective radius R=floor(tolerance/bin_width).
mz_power – Power for mz weighting (intensity *= mz**mz_power). Default 0.0.
intensity_power – Power for intensity weighting before normalization. Default 1.0 (set 0.5 for sqrt scaling).
clip_to_one – Clip score to [0,1]. Default True.
use_numba (bool) – Use numba-accelerated pairwise kernel when available. Default True.
prefilter (bool) – Apply BLINK-like pre-filtering (remove <1% base peak, > precursor m/z, zeros). Default True.
min_relative_intensity (float) – Relative base-peak threshold for prefilter. Default 0.01 (1%).
crop_above_precursor (bool) – Drop fragments > precursor m/z if available in metadata. Default True.
remove_zero_intensities (bool) – Remove peaks with intensity <= 0. Default True.
top_k (Optional[int]) – Keep only top-K most intense fragments after other filters (per spectrum). Default None.
Batching (matrix path):
batch_size (int) – Number of query spectra per batch in .matrix(). Default 1024.
sparse_score_min (float) – When array_type=’sparse’, drop scores < sparse_score_min. Default 0.0.
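
The interplay of tolerance, bin_width and the one-sided blur can be sketched in a few lines of plain Python. This is only an illustration of the idea described above, not the library's code (which, per the description of .matrix(), works on sparse matrices). The helper names to_bins, blur and blink_like_score are hypothetical, and prefiltering and m/z weighting are left out.

```python
import math

def to_bins(mz, intensities, bin_width=0.001):
    """Hypothetical helper: integer-bin the peaks and L2-normalise the intensities."""
    norm = math.sqrt(sum(i * i for i in intensities))
    binned = {}
    for m, i in zip(mz, intensities):
        key = round(m / bin_width)  # integer bin index
        binned[key] = binned.get(key, 0.0) + i / norm
    return binned

def blur(binned, radius):
    """Spread every occupied bin over its neighbours within +/- radius bins."""
    blurred = {}
    for key, value in binned.items():
        for offset in range(-radius, radius + 1):
            blurred[key + offset] = blurred.get(key + offset, 0.0) + value
    return blurred

def blink_like_score(ref, qry, tolerance=0.01, bin_width=0.001):
    radius = math.floor(tolerance / bin_width)  # effective radius R in bins
    ref_bins = to_bins(*ref, bin_width=bin_width)
    # Blur one side only (here: the query), as described above.
    qry_blurred = blur(to_bins(*qry, bin_width=bin_width), radius)
    score = sum(value * qry_blurred.get(key, 0.0) for key, value in ref_bins.items())
    return min(score, 1.0)  # mirrors clip_to_one=True

reference = ([100.000, 150.000], [1.0, 0.5])
query = ([100.004, 150.020], [1.0, 0.5])
print(round(blink_like_score(reference, query), 3))
```

In this example the first peak pair is 4 mDa apart and falls inside the blurred window, while the second pair is 20 mDa apart and does not contribute.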

__init__(tolerance: float = 0.01, bin_width: float = 0.001, mz_power: float = 0.0, intensity_power: float = 1.0, clip_to_one: bool = True, use_numba: bool = True, prefilter: bool = True, min_relative_intensity: float = 0.01, crop_above_precursor: bool = True, remove_zero_intensities: bool = True, top_k: int | None = None, batch_size: int = 1024, sparse_score_min: float = 0.0)[source]

keep_score(score): In the .matrix method scores will be collected in a sparse way. Overwrite this method here if values other than False or 0 should not be stored in the final collection.

matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False)[source]

All-vs-all BLINK-style cosine scores.

Implementation:

Build a global dense bin axis in integer bins from min to max across refs+queries (rows ~ (max_bin - min_bin + 1)), which keeps matrices sparse.

Build a CSR intensity matrix for refs (rows=bins, cols=ref spectra) after per-spectrum L2 normalization.
For queries, build per-batch blurred CSR by expanding each nonzero to its ±R neighbors.
Multiply: scores_batch = (I_ref.T @ I_qry_blur), accumulate into the final output.

Parameters:

references – List of reference spectra.
queries – List of query spectra.
array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy” and will return a numpy array. “sparse” will return a COO-sparse array

Returns:

If array_type == ‘numpy’: dense array of shape (n_ref, n_query).
If array_type == ‘sparse’: COO sparse array of shape (n_ref, n_query), dropping scores < sparse_score_min.

Return type:

numpy.ndarray or scipy.sparse.coo_array

pair(reference: Spectrum, query: Spectrum) → Tuple[float, int][source]

Calculate BLINK-style cosine between two spectra.

Parameters:

reference – Single reference spectrum.
query – Single query spectrum.

sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)

Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.

Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.

Parameters:

references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.

to_dict() → dict: Return a dictionary representation of a similarity function.

class matchms.similarity.CosineGreedy(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]

Bases: BaseSimilarity

Calculate ‘cosine similarity score’ between two spectra.

The cosine score aims at quantifying the similarity between two mass spectra. The score is calculated by finding the best possible matches between the peaks of two spectra. Two peaks are considered a potential match if their m/z values lie within the given ‘tolerance’. The underlying peak assignment problem is solved here in a ‘greedy’ way. This is notably faster, but occasionally deviates slightly from the optimal solution (as obtained with the Hungarian algorithm, see CosineHungarian). In practice this rarely changes similarity scores noticeably, in particular for smaller tolerances.

For example

import numpy as np
from matchms import Spectrum
from matchms.similarity import CosineGreedy

reference = Spectrum(mz=np.array([100, 150, 200.]),
                     intensities=np.array([0.7, 0.2, 0.1]),
                     metadata={"precursor_mz": 200.0})
query = Spectrum(mz=np.array([100, 140, 190.]),
                 intensities=np.array([0.4, 0.2, 0.1]),
                 metadata={"precursor_mz": 190.0})

# Construct a similarity function
cosine_greedy = CosineGreedy(tolerance=0.2)

score = cosine_greedy.pair(reference, query)

print(f"Cosine score is {score['score']:.2f} with {score['matches']} matched peaks")

Should output

Cosine score is 0.83 with 1 matched peaks
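
To make the greedy assignment concrete, here is a minimal plain-Python sketch that reproduces the numbers of the example above. It is not the matchms implementation: the function name greedy_cosine is hypothetical, and the sketch assumes mz_power=0 and intensity_power=1.

```python
import math

def greedy_cosine(ref, qry, tolerance=0.1):
    """Hypothetical sketch: ref and qry are lists of (mz, intensity) pairs."""
    # Collect every candidate pair within tolerance, with its intensity product.
    candidates = [
        (r_int * q_int, i, j)
        for i, (r_mz, r_int) in enumerate(ref)
        for j, (q_mz, q_int) in enumerate(qry)
        if abs(r_mz - q_mz) <= tolerance
    ]
    # Greedy step: take the largest products first, using each peak at most once.
    candidates.sort(reverse=True)
    used_ref, used_qry = set(), set()
    total, matches = 0.0, 0
    for product, i, j in candidates:
        if i not in used_ref and j not in used_qry:
            used_ref.add(i)
            used_qry.add(j)
            total += product
            matches += 1
    norm = math.sqrt(sum(x * x for _, x in ref)) * math.sqrt(sum(x * x for _, x in qry))
    return (total / norm if norm > 0 else 0.0), matches

reference = [(100.0, 0.7), (150.0, 0.2), (200.0, 0.1)]
query = [(100.0, 0.4), (140.0, 0.2), (190.0, 0.1)]
score, matches = greedy_cosine(reference, query, tolerance=0.2)
print(f"Cosine score is {score:.2f} with {matches} matched peaks")
```

Only the peaks at m/z 100 lie within the tolerance, so the score is 0.7 * 0.4 divided by the product of the two intensity norms, which gives the 0.83 shown above.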

__init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]

Parameters:

tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.
mz_power – The power to raise m/z to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.

keep_score(score): In the .matrix method scores will be collected in a sparse way. Overwrite this method here if values other than False or 0 should not be stored in the final collection.

matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False, progress_bar: bool = True) → ndarray

Optional: Provide optimized method to calculate an np.array of similarity scores for given reference and query spectra. If no method is added here, the following naive implementation (i.e. a double for-loop) is used.

Parameters:

references – List of reference objects
queries – List of query objects
array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy” and will return a numpy array. “sparse” will return a COO-sparse array.
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.

pair(reference: Spectrum, query: Spectrum) → Tuple[float, int][source]

Calculate cosine score between two spectra.

Parameters:

reference – Single reference spectrum.
query – Single query spectrum.

Returns:

Tuple with cosine score and number of matched peaks.

Return type:

Score

sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)

Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.

Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.

Parameters:

references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.

to_dict() → dict: Return a dictionary representation of a similarity function.

class matchms.similarity.CosineHungarian(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]

Bases: BaseSimilarity

Calculate ‘cosine similarity score’ between two spectra using the Hungarian algorithm.

The cosine score quantifies the similarity between two mass spectra by finding the optimal one-to-one matching between their peaks. Two peaks are considered a potential match if their m/z ratios lie within the given tolerance.

The peak assignment is solved using the Hungarian algorithm (scipy.optimize.linear_sum_assignment), which finds the assignment that maximises the sum of intensity products. This is mathematically optimal but can be notably slower than the greedy heuristic in CosineGreedy.
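
The difference between the greedy and the optimal assignment is easiest to see on a tiny constructed case. The sketch below is an illustration only: it uses brute force over all one-to-one assignments as a stand-in for the Hungarian algorithm (feasible only for a handful of peaks), and all function names are hypothetical.

```python
from itertools import permutations

def allowed_products(ref, qry, tolerance):
    """Intensity products; 0.0 where two peaks are further apart than tolerance."""
    return [
        [r_int * q_int if abs(r_mz - q_mz) <= tolerance else 0.0 for q_mz, q_int in qry]
        for r_mz, r_int in ref
    ]

def greedy_total(products):
    """Pick the largest remaining product first, using each peak at most once."""
    pairs = sorted(
        ((p, i, j) for i, row in enumerate(products) for j, p in enumerate(row) if p > 0),
        reverse=True,
    )
    used_i, used_j, total = set(), set(), 0.0
    for p, i, j in pairs:
        if i not in used_i and j not in used_j:
            used_i.add(i)
            used_j.add(j)
            total += p
    return total

def optimal_total(products):
    """Brute force over all one-to-one assignments; assumes len(ref) <= len(qry)."""
    n_ref, n_qry = len(products), len(products[0])
    return max(
        sum(products[i][j] for i, j in enumerate(assignment))
        for assignment in permutations(range(n_qry), n_ref)
    )

ref = [(100.00, 1.0), (100.15, 0.9)]
qry = [(100.08, 1.0), (99.95, 0.8)]
products = allowed_products(ref, qry, tolerance=0.1)
print(greedy_total(products), optimal_total(products))
```

Here the greedy strategy grabs the single largest product (1.0) and thereby blocks both remaining candidates, whereas the optimal assignment pairs the peaks crosswise for a larger sum (0.8 + 0.9).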

__init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]

Parameters:

tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.
mz_power – The power to raise m/z to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.

keep_score(score): In the .matrix method scores will be collected in a sparse way. Overwrite this method here if values other than False or 0 should not be stored in the final collection.

matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False, progress_bar: bool = True) → ndarray

Optional: Provide optimized method to calculate an np.array of similarity scores for given reference and query spectra. If no method is added here, the following naive implementation (i.e. a double for-loop) is used.

Parameters:

references – List of reference objects
queries – List of query objects
array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy” and will return a numpy array. “sparse” will return a COO-sparse array.
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.

pair(reference: Spectrum, query: Spectrum) → Tuple[float, int][source]

Calculate cosine score between two spectra.

Parameters:

reference – Single reference spectrum.
query – Single query spectrum.

Returns:

Tuple with cosine score and number of matched peaks.

sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)

Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.

Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.

Parameters:

references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.

to_dict() → dict: Return a dictionary representation of a similarity function.

class matchms.similarity.CosineLinear(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]

Bases: BaseSimilarity

Calculate ‘linear cosine similarity score’ between two spectra.

This implements the CosineLinear similarity from SIRIUS (Böcker lab), which achieves O(n+m) time complexity by requiring spectra to be “well-separated” (consecutive peaks more than 2 * tolerance apart). A preprocessing step (sirius_merge_close_peaks) enforces this invariant by greedily merging close peaks in descending intensity order.

For example

import numpy as np
from matchms import Spectrum
from matchms.similarity import CosineLinear

reference = Spectrum(mz=np.array([100, 150, 200.]),
                     intensities=np.array([0.7, 0.2, 0.1]))
query = Spectrum(mz=np.array([100, 140, 190.]),
                 intensities=np.array([0.4, 0.2, 0.1]))

cosine_linear = CosineLinear(tolerance=0.2)
score = cosine_linear.pair(reference, query)

print(f"CosineLinear score is {score['score']:.2f} with {score['matches']} matched peaks")

Should output

CosineLinear score is 0.83 with 1 matched peaks
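
Why well-separated spectra allow linear-time matching can be sketched with a two-pointer walk: if consecutive peaks are more than 2 * tolerance apart, each peak has at most one possible partner, so a single pass over both sorted peak lists is enough. The sketch below is an illustration under that assumption. It does not reproduce sirius_merge_close_peaks, and the function name linear_cosine is hypothetical.

```python
import math

def linear_cosine(ref, qry, tolerance=0.1):
    """Hypothetical sketch: ref and qry are m/z-sorted lists of (mz, intensity)
    pairs whose consecutive peaks are more than 2 * tolerance apart."""
    i = j = matches = 0
    total = 0.0
    while i < len(ref) and j < len(qry):
        diff = ref[i][0] - qry[j][0]
        if abs(diff) <= tolerance:
            # Unique partner found: count it and advance both pointers.
            total += ref[i][1] * qry[j][1]
            matches += 1
            i += 1
            j += 1
        elif diff < 0:
            i += 1  # reference peak is too low to match any later query peak
        else:
            j += 1  # query peak is too low to match any later reference peak
    norm = math.sqrt(sum(x * x for _, x in ref)) * math.sqrt(sum(x * x for _, x in qry))
    return (total / norm if norm > 0 else 0.0), matches

reference = [(100.0, 0.7), (150.0, 0.2), (200.0, 0.1)]
query = [(100.0, 0.4), (140.0, 0.2), (190.0, 0.1)]
score, matches = linear_cosine(reference, query, tolerance=0.2)
print(f"{score:.2f} {matches}")
```

Each loop iteration advances at least one pointer, which bounds the work by n + m steps.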

__init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]

Parameters:

tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1. Peaks closer than 2 * tolerance are merged before scoring.
mz_power – The power to raise m/z to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.

keep_score(score): In the .matrix method scores will be collected in a sparse way. Overwrite this method here if values other than False or 0 should not be stored in the final collection.

matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False, progress_bar: bool = True) → ndarray[source]

Optimized matrix computation that precomputes merged spectra.

Each spectrum is merged once (N+M calls to sirius_merge_close_peaks) instead of 2*N*M times in the naive double-loop approach.

pair(reference: Spectrum, query: Spectrum) → ndarray[source]

Calculate linear cosine score between two spectra.

Parameters:

reference – Single reference spectrum.
query – Single query spectrum.

Returns:

Tuple with cosine score and number of matched peaks.

Return type:

Score

sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)

Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.

Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.

Parameters:

references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.

to_dict() → dict: Return a dictionary representation of a similarity function.

class matchms.similarity.FingerprintSimilarity(similarity_measure: str = 'jaccard', set_empty_scores: float | int | str = 'nan')[source]

Bases: BaseSimilarity

Calculate similarity between molecules based on their fingerprints.

For this similarity measure to work, fingerprints are expected to be derived by running add_fingerprint().

Code example:

import numpy as np
from matchms import calculate_scores
from matchms import Spectrum
from matchms.filtering import add_fingerprint
from matchms.similarity import FingerprintSimilarity

spectrum_1 = Spectrum(
    mz=np.array([], dtype="float"),
    intensities=np.array([], dtype="float"),
    metadata={"smiles": "CCC(C)C(C(=O)O)NC(=O)CCl", "precursor_mz": 200.2}
    )

spectrum_2 = Spectrum(
    mz=np.array([], dtype="float"),
    intensities=np.array([], dtype="float"),
    metadata={"smiles": "CC(C)C(C(=O)O)NC(=O)CCl", "precursor_mz": 200.2}
)

spectrum_3 = Spectrum(
    mz=np.array([], dtype="float"),
    intensities=np.array([], dtype="float"),
    metadata={"smiles": "C(C(=O)O)(NC(=O)O)S", "precursor_mz": 200.2}
)

spectra = [spectrum_1, spectrum_2, spectrum_3]
# Add fingerprints
spectra = [add_fingerprint(x, nbits=256) for x in spectra]

# Specify type and calculate similarities
similarity_measure = FingerprintSimilarity("jaccard")
scores = calculate_scores(spectra, spectra, similarity_measure)
print(np.round(scores.scores.to_array(), 3).tolist())

Should output

[[1.0, 0.878, 0.415], [0.878, 1.0, 0.444], [0.415, 0.444, 1.0]]
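
For orientation, the three selectable measures can be written out for binary fingerprints as follows. This is a plain-Python sketch of the textbook definitions, not the matchms code. The function names are hypothetical, and returning 0.0 for all-zero fingerprints is an assumption made here to avoid a division by zero.

```python
import math

def jaccard(a, b):
    """Shared on-bits divided by bits set in either fingerprint."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

def dice(a, b):
    """Twice the shared on-bits divided by the total number of on-bits."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(a) + sum(b)
    return 2 * both / total if total else 0.0

def cosine(a, b):
    """Shared on-bits divided by the geometric mean of the on-bit counts."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    denom = math.sqrt(sum(a) * sum(b))
    return both / denom if denom else 0.0

fp_1 = [1, 1, 0, 1, 0, 0, 1, 0]
fp_2 = [1, 0, 0, 1, 1, 0, 1, 0]
print(jaccard(fp_1, fp_2), dice(fp_1, fp_2), cosine(fp_1, fp_2))
```

The two toy fingerprints share 3 on-bits out of 5 set in either, so Jaccard is the strictest of the three measures here.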

__init__(similarity_measure: str = 'jaccard', set_empty_scores: float | int | str = 'nan')[source]

Parameters:

similarity_measure – Choose the similarity measure from “cosine”, “dice”, “jaccard”. The default is “jaccard”.
set_empty_scores – Define what should be given instead of a similarity score in cases where fingerprints are missing. The default is “nan”, which will return np.nan’s in such cases.

keep_score(score): In the .matrix method scores will be collected in a sparse way. Overwrite this method here if values other than False or 0 should not be stored in the final collection.

matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False) → array[source]

Calculate matrix of fingerprint based similarity scores.

Parameters:

references – List of reference spectra.
queries – List of query spectra.
array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy” and will return a numpy array. “sparse” will return a COO-sparse array

pair(reference: Spectrum, query: Spectrum) → float[source]

Calculate fingerprint based similarity score between two spectra.

Parameters:

reference – Single reference spectrum.
query – Single query spectrum.

score_datatype: alias of float64

sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)

Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.

Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.

Parameters:

references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.

to_dict() → dict: Return a dictionary representation of a similarity function.

class matchms.similarity.FlashSimilarity(score_type: str = 'spectral_entropy', matching_mode: str = 'fragment', tolerance: float = 0.02, use_ppm: bool = False, remove_precursor: bool = False, precursor_window: float = 1.6, noise_cutoff: float = 0.01, normalize_to_half: bool = True, merge_within: float = 0, identity_precursor_tolerance: float | None = None, identity_use_ppm: bool = False, dtype: dtype = <class 'numpy.float64'>)[source]

Bases: BaseSimilarity

Flash entropy similarity (Li & Fiehn, 2023) with a fast .matrix() that builds a library-wide index over ‘queries’ and streams all ‘references’ through it.

Key options:

matching_mode: ‘fragment’, ‘neutral_loss’, or ‘hybrid’ (fragment-priority).
tolerance in Da or symmetric ppm (use_ppm=True).
cleanup: remove the precursor and peaks > (precursor_mz - 1.6), 1% noise removal, entropy weighting, normalize ∑I’ = 0.5, optional within-peak merge.

Notes:

.pair() works but is not the fast path. Use .matrix().
For identity-search behavior, pass identity_precursor_tolerance (Da or ppm).

Parameters:

score_type – Score type: ‘spectral_entropy’ (default) or ‘cosine’.
matching_mode – Matching mode: ‘fragment’, ‘neutral_loss’, or ‘hybrid’ (default is ‘fragment’). Choose “hybrid” in combination with score_type=”cosine” to compute the modified cosine score.
tolerance – Matching tolerance in Da or ppm (use_ppm=True). Default is 0.02.
use_ppm – If True, interpret tolerance as parts-per-million. Default is False.
remove_precursor – If True, remove precursor peak and peaks within precursor_window. Default is False.
precursor_window – If remove_precursor is True, remove peaks within this window around the precursor m/z. Default is 1.6 Da (as suggested by Li & Fiehn(2023)).
noise_cutoff – If > 0, remove peaks with intensities below this fraction of the maximum intensity. Default is 0.01 (1%).
normalize_to_half – If True, normalize intensities such that the sum of intensities is 0.5. Default is True.
merge_within – If > 0, merge peaks within this distance (in Da) to a single peak. Default is 0.
identity_precursor_tolerance – If not None, enforce identity search behavior by requiring the precursor m/z of the query to be within this tolerance of the reference precursor m/z.
identity_use_ppm – If True, interpret identity_precursor_tolerance as ppm. Default is False.
dtype – Data type for the output scores. Default is np.float64, which properly accounts for the highest-resolution MS/MS data (even far beyond current MS/MS possibilities). To save memory, np.float32 can be used instead, which is sufficient for peak resolutions up to about 8,000,000.
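
The cleanup options described above can be pictured with a small plain-Python sketch. It is a simplified illustration, not the FlashSimilarity code: the helper name clean_peaks is hypothetical, the order of the steps is an assumption, and entropy weighting and peak merging are left out.

```python
def clean_peaks(peaks, precursor_mz=None, remove_precursor=False,
                precursor_window=1.6, noise_cutoff=0.01, normalize_to_half=True):
    """Hypothetical helper: peaks is a list of (mz, intensity) pairs."""
    if remove_precursor and precursor_mz is not None:
        # Assumption: drop everything above precursor_mz - precursor_window.
        peaks = [(m, i) for m, i in peaks if m <= precursor_mz - precursor_window]
    if noise_cutoff > 0 and peaks:
        # Drop peaks below the given fraction of the most intense peak.
        threshold = noise_cutoff * max(i for _, i in peaks)
        peaks = [(m, i) for m, i in peaks if i >= threshold]
    if normalize_to_half and peaks:
        # Rescale so that the intensities sum to 0.5.
        total = sum(i for _, i in peaks)
        peaks = [(m, 0.5 * i / total) for m, i in peaks]
    return peaks

raw = [(50.0, 0.004), (100.0, 1.0), (150.0, 0.5), (199.5, 0.3)]
cleaned = clean_peaks(raw, precursor_mz=200.0, remove_precursor=True)
print(cleaned)
```

In this example the peak at 199.5 is removed as part of the precursor region, the peak at 50.0 falls below 1% of the base peak, and the two remaining intensities are rescaled to sum to 0.5.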

__init__(score_type: str = 'spectral_entropy', matching_mode: str = 'fragment', tolerance: float = 0.02, use_ppm: bool = False, remove_precursor: bool = False, precursor_window: float = 1.6, noise_cutoff: float = 0.01, normalize_to_half: bool = True, merge_within: float = 0, identity_precursor_tolerance: float | None = None, identity_use_ppm: bool = False, dtype: dtype = <class 'numpy.float64'>)[source]

keep_score(score): In the .matrix method scores will be collected in a sparse way. Overwrite this method here if values other than False or 0 should not be stored in the final collection.

matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False, n_jobs: int = -1) → ndarray[source]

Calculate matrix of Flash entropy similarity scores.

Parameters:

references – List of reference spectra.
queries – List of query spectra.
array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy” and will return a numpy array. “sparse” will return a SparseStacked COO-style array.
is_symmetric – If True, the matrix will be symmetric (i.e., references and queries must have the same length). This has no effect on runtime here.
n_jobs – Number of parallel jobs to run. Default is -1, which means that all available CPUs minus one will be used.

pair(reference: Spectrum, query: Spectrum) → ndarray[source]

Compute Flash similarity for a single (reference, query) pair. Uses the same preprocessing and scoring logic as the matrix path, but builds a tiny 1-spectrum library from the query.

Note: this is not the intended fast path; prefer .matrix() instead.

score_datatype: alias of float32

sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)

Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.

Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.

Parameters:

references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.

to_dict() → dict: Return a dictionary representation of a similarity function.

class matchms.similarity.IntersectMz(scaling: float = 1.0)[source]

Bases: BaseSimilarity

Example score illustrating how to build a custom spectrum similarity score.

IntersectMz counts the peaks whose m/z values match exactly between two spectra and divides that count by the number of unique m/z values found across both spectra.

Example of how matchms similarity functions can be used:

import numpy as np
from matchms import Spectrum
from matchms.similarity import IntersectMz

spectrum_1 = Spectrum(mz=np.array([100, 150, 200.]),
                      intensities=np.array([0.7, 0.2, 0.1]))
spectrum_2 = Spectrum(mz=np.array([100, 140, 190.]),
                      intensities=np.array([0.4, 0.2, 0.1]))

# Construct a similarity function
similarity_measure = IntersectMz(scaling=1.0)

score = similarity_measure.pair(spectrum_1, spectrum_2)

print(f"IntersectMz score is {score:.2f}")

Should output

IntersectMz score is 0.20
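The counting rule can also be sketched without matchms. This is only an illustration, assuming that peaks match on exact m/z equality and that scaling simply multiplies the ratio; the helper name intersect_mz_score is hypothetical and not part of the library.

```python
def intersect_mz_score(mz_ref, mz_query, scaling=1.0):
    """Shared m/z values divided by all unique m/z values, times scaling."""
    ref, query = set(mz_ref), set(mz_query)
    union = ref | query
    if not union:
        # Two empty spectra: return 0 instead of dividing by zero.
        return 0.0
    return scaling * len(ref & query) / len(union)


# Same peaks as in the example above: 1 shared m/z out of 5 unique m/z values.
score = intersect_mz_score([100.0, 150.0, 200.0], [100.0, 140.0, 190.0])
print(f"{score:.2f}")
```

With the peaks from the example above, one shared m/z value out of five unique values gives the same 0.20.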

__init__(scaling: float = 1.0)[source]

Constructor. Here, function parameters are defined.

Parameters:

scaling – Scale scores so that the maximum possible score equals ‘scaling’.

keep_score(score): In the .matrix method scores will be collected in a sparse way. Overwrite this method here if values other than False or 0 should not be stored in the final collection.

matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False, progress_bar: bool = True) → ndarray

Optional: Provide optimized method to calculate an np.array of similarity scores for given reference and query spectra. If no method is added here, the following naive implementation (i.e. a double for-loop) is used.

Parameters:

references – List of reference objects
queries – List of query objects
array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy” and will return a numpy array. “sparse” will return a COO-sparse array.
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
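The double for-loop and the is_symmetric shortcut described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: naive_matrix and jaccard are hypothetical names, spectra are reduced to plain lists of m/z values, and the result is a nested list rather than a numpy array.

```python
def naive_matrix(references, queries, pair, is_symmetric=False):
    """Fill an n_references x n_queries score table with a double for-loop."""
    n_rows, n_cols = len(references), len(queries)
    if is_symmetric and n_rows != n_cols:
        raise ValueError("is_symmetric requires equally long inputs")
    scores = [[0.0] * n_cols for _ in range(n_rows)]
    for i, reference in enumerate(references):
        # With symmetry, only the upper triangle (j >= i) is computed.
        start = i if is_symmetric else 0
        for j in range(start, n_cols):
            scores[i][j] = pair(reference, queries[j])
            if is_symmetric:
                scores[j][i] = scores[i][j]  # mirror into the lower triangle
    return scores


calls = []  # records every call to the pair function


def jaccard(mz_a, mz_b):
    calls.append(1)
    union = set(mz_a) | set(mz_b)
    return len(set(mz_a) & set(mz_b)) / len(union) if union else 0.0


spectra = [[100.0, 150.0], [100.0, 140.0], [150.0, 200.0]]
full = naive_matrix(spectra, spectra, jaccard)
n_full = len(calls)
calls.clear()
half = naive_matrix(spectra, spectra, jaccard, is_symmetric=True)
n_half = len(calls)
print(n_full, n_half, full == half)
```

For n spectra the symmetric path evaluates n(n+1)/2 pairs instead of n², which is where the "about 2x faster" figure comes from for large n.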

pair(reference: Spectrum, query: Spectrum) → float[source]: This will calculate the similarity score between two spectra.

score_datatype: alias of float64

sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)

Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.

Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.

Parameters:

references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
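To make the role of idx_row and idx_col concrete, the following sketch scores only the requested (reference, query) index pairs with a single for-loop. It is a simplified illustration: naive_sparse_scores and shared_peaks are hypothetical names, and the result is a plain list of scores aligned with the index arrays rather than the library's sparse structure.

```python
def naive_sparse_scores(references, queries, idx_row, idx_col, pair):
    """Score only the pairs (references[r], queries[c]) listed in the indices."""
    if len(idx_row) != len(idx_col):
        raise ValueError("idx_row and idx_col must have the same length")
    return [pair(references[r], queries[c]) for r, c in zip(idx_row, idx_col)]


def shared_peaks(mz_a, mz_b):
    # Toy pair function: number of identical m/z values.
    return len(set(mz_a) & set(mz_b))


references = [[100.0, 150.0], [120.0, 180.0]]
queries = [[100.0, 150.0, 200.0], [180.0]]
# Only two of the four possible pairs are evaluated: (0, 0) and (1, 1).
selected = naive_sparse_scores(references, queries, [0, 1], [0, 1], shared_peaks)
print(selected)
```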

to_dict() → dict: Return a dictionary representation of a similarity function.

class matchms.similarity.MetadataMatch(field: str, matching_type: str = 'equal_match', tolerance: float = 0.1)[source]

Bases: BaseSimilarity

Return True if metadata entries of a specified field match between two spectra.

This is supposed to be used to compare a wide range of possible metadata entries and use this to later select related or similar spectra.

Example to calculate scores between 2 pairs of spectra and iterate over the scores

import numpy as np
from matchms import calculate_scores
from matchms import Spectrum
from matchms.similarity import MetadataMatch

spectrum_1 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"instrument_type": "orbitrap",
                                "id": 1})
spectrum_2 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"instrument_type": "qtof",
                                "id": 2})
spectrum_3 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"instrument_type": "qtof",
                                "id": 3})
spectrum_4 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"instrument_type": "orbitrap",
                                "id": 4})
references = [spectrum_1, spectrum_2]
queries = [spectrum_3, spectrum_4]

similarity_score = MetadataMatch(field="instrument_type")
scores = calculate_scores(references, queries, similarity_score)

for (reference, query, score) in scores:
    print(f"Metadata match between {reference.get('id')} and {query.get('id')}" +
          f" is {bool(score[0])}")

Should output

Metadata match between 1 and 4 is True
Metadata match between 2 and 3 is True

__init__(field: str, matching_type: str = 'equal_match', tolerance: float = 0.1)[source]

Parameters:

field – Specify field name for metadata that should be compared.
matching_type – Specify how field entries should be matched. Can be one of [“equal_match”, “difference”]. “equal_match”: Entries must be exactly equal (default). “difference”: Entries are considered a match if their numerical difference is less than or equal to “tolerance”.
tolerance – Specify the tolerance up to which two values are counted as a match. This only applies to numerical values.
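The two matching types can be sketched as a small stand-alone function. This is a hedged illustration of the rule described in the parameter list, not the library code: the function name metadata_entries_match is hypothetical, and treating a missing entry as a non-match is an assumption.

```python
def metadata_entries_match(entry_ref, entry_query,
                           matching_type="equal_match", tolerance=0.1):
    """Return True if two metadata entries match under the chosen rule."""
    if entry_ref is None or entry_query is None:
        return False  # assumption: a missing entry never matches
    if matching_type == "equal_match":
        return entry_ref == entry_query
    if matching_type == "difference":
        # Numerical entries match if they differ by at most `tolerance`.
        return abs(float(entry_ref) - float(entry_query)) <= tolerance
    raise ValueError("matching_type must be 'equal_match' or 'difference'")


same_instrument = metadata_entries_match("qtof", "qtof")
close_rt = metadata_entries_match(5.0, 5.05, "difference", tolerance=0.1)
far_rt = metadata_entries_match(5.0, 5.3, "difference", tolerance=0.1)
print(same_instrument, close_rt, far_rt)
```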

keep_score(score): In the .matrix method scores will be collected in a sparse way. Overwrite this method here if values other than False or 0 should not be stored in the final collection.

matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False) → ndarray[source]

Compare the metadata entries of the specified field between all references and queries.

Parameters:

references – List/array of reference spectra.
queries – List/array of query spectra.
array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy” and will return a numpy array. “sparse” will return a COO-sparse array.
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.

pair(reference: Spectrum, query: Spectrum) → float[source]

Compare the metadata entries of the specified field between reference and query spectrum.

Parameters:

reference – Single reference spectrum.
query – Single query spectrum.

score_datatype: alias of bool

sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)

Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.

Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.

Parameters:

references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.

to_dict() → dict: Return a dictionary representation of a similarity function.

class matchms.similarity.ModifiedCosineGreedy(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]

Bases: BaseSimilarity

Calculate an approximate modified cosine score between mass spectra.

This implementation solves the peak assignment in a greedy way and is therefore an approximation. See ModifiedCosineHungarian for the exact assignment variant.

The modified cosine score aims at quantifying the similarity between two mass spectra. Two peaks are considered a potential match if their m/z ratios lie within the given tolerance, or if their m/z ratios lie within the tolerance once a mass-shift is applied. The mass shift is the difference in precursor m/z between the two spectra.

See Watrous et al. [PNAS, 2012, https://www.pnas.org/content/109/26/E1743] for further details.
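The matching rule and the greedy assignment can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the matchms implementation: spectra are given as lists of (m/z, intensity) tuples, mz_power is fixed at 0 and intensity_power at 1, candidate pairs are taken in order of decreasing intensity product, and the function name greedy_modified_cosine is hypothetical.

```python
import math


def greedy_modified_cosine(peaks_ref, peaks_query, precursor_ref,
                           precursor_query, tolerance=0.1):
    """Return (score, number of matched peaks) for two peak lists."""
    shift = precursor_ref - precursor_query
    candidates = []
    for i, (mz_r, int_r) in enumerate(peaks_ref):
        for j, (mz_q, int_q) in enumerate(peaks_query):
            direct = abs(mz_r - mz_q) <= tolerance
            shifted = abs(mz_r - (mz_q + shift)) <= tolerance
            if direct or shifted:
                candidates.append((int_r * int_q, i, j))
    # Greedy step: take the largest products first, use each peak only once.
    candidates.sort(reverse=True)
    used_ref, used_query = set(), set()
    total, n_matches = 0.0, 0
    for weight, i, j in candidates:
        if i not in used_ref and j not in used_query:
            used_ref.add(i)
            used_query.add(j)
            total += weight
            n_matches += 1
    norm = (math.sqrt(sum(v * v for _, v in peaks_ref))
            * math.sqrt(sum(v * v for _, v in peaks_query)))
    return (total / norm if norm > 0 else 0.0), n_matches


ref = [(100.0, 1.0), (150.0, 0.5), (200.0, 0.2)]
query = [(100.0, 1.0), (160.0, 0.5), (210.0, 0.2)]
# Precursors differ by 10: two query peaks only match after the mass shift.
score_shifted, n_shifted = greedy_modified_cosine(ref, query, 300.0, 310.0)
# Equal precursors: no shift, so only the peak at m/z 100 matches.
score_plain, n_plain = greedy_modified_cosine(ref, query, 300.0, 300.0)
print(n_shifted, n_plain)
```

In this toy case the peak at m/z 100 matches directly, while the other two peaks match only once the precursor difference is applied.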

__init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]

Initialize approximate modified cosine.

Parameters:

tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.
mz_power – The power to raise mz to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.

keep_score(score): In the .matrix method scores will be collected in a sparse way. Overwrite this method here if values other than False or 0 should not be stored in the final collection.

matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False, progress_bar: bool = True) → ndarray

Optional: Provide optimized method to calculate an np.array of similarity scores for given reference and query spectra. If no method is added here, the following naive implementation (i.e. a double for-loop) is used.

Parameters:

references – List of reference objects
queries – List of query objects
array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy” and will return a numpy array. “sparse” will return a COO-sparse array.
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.

pair(reference: Spectrum, query: Spectrum) → Tuple[float, int][source]: Calculate approximate modified cosine score between two spectra.

sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)

Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.

Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.

Parameters:

references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.

to_dict() → dict: Return a dictionary representation of a similarity function.

class matchms.similarity.ModifiedCosineHungarian(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]

Bases: BaseSimilarity

Calculate exact modified cosine score between mass spectra.

The modified cosine score quantifies similarity between two mass spectra with optional precursor-based mass shift. Potential matches are all peak pairs that are within tolerance either unshifted or shifted by precursor_mz(reference) - precursor_mz(query).

Peak assignment is solved globally via Hungarian assignment (linear sum assignment), which yields an exact one-to-one maximum-weight matching.
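The difference between a greedy and an exact assignment shows up on a small weight matrix. The sketch below is illustrative only: it finds the exact optimum by brute force over all permutations, which stands in for the Hungarian algorithm and is feasible only for tiny square matrices (scipy.optimize.linear_sum_assignment solves the same problem efficiently). The function names and the example weights are hypothetical, and a weight of 0 marks a peak pair that is not within tolerance.

```python
from itertools import permutations


def greedy_assignment(weights):
    """Pick the largest remaining weight first; each row/column used once."""
    pairs = sorted(
        ((w, i, j) for i, row in enumerate(weights)
         for j, w in enumerate(row) if w > 0),
        reverse=True,
    )
    used_rows, used_cols, total = set(), set(), 0.0
    for w, i, j in pairs:
        if i not in used_rows and j not in used_cols:
            used_rows.add(i)
            used_cols.add(j)
            total += w
    return total


def exact_assignment(weights):
    """Exhaustive search over one-to-one assignments (square matrix only)."""
    n = len(weights)
    return max(
        sum(weights[i][perm[i]] for i in range(n))
        for perm in permutations(range(n))
    )


# Rows: reference peaks, columns: query peaks, entries: intensity products.
weights = [[0.9, 0.8],
           [0.8, 0.0]]
greedy_total = greedy_assignment(weights)
exact_total = exact_assignment(weights)
print(greedy_total, exact_total)
```

Here the greedy strategy takes the 0.9 entry first and is left with no valid partner for the second peak, whereas the exact assignment pairs the two 0.8 entries for a larger total.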

See Watrous et al. [PNAS, 2012, https://www.pnas.org/content/109/26/E1743] for the modified cosine concept.

__init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]

Initialize exact modified cosine.

Parameters:

tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.
mz_power – The power to raise mz to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.

keep_score(score): In the .matrix method scores will be collected in a sparse way. Overwrite this method here if values other than False or 0 should not be stored in the final collection.

matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False, progress_bar: bool = True) → ndarray

Optional: Provide optimized method to calculate an np.array of similarity scores for given reference and query spectra. If no method is added here, the following naive implementation (i.e. a double for-loop) is used.

Parameters:

references – List of reference objects
queries – List of query objects
array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy” and will return a numpy array. “sparse” will return a COO-sparse array.
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.

pair(reference: Spectrum, query: Spectrum) → Tuple[float, int][source]: Calculate exact modified cosine score between two spectra.

sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)

Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.

Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.

Parameters:

references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.

to_dict() → dict: Return a dictionary representation of a similarity function.

class matchms.similarity.NeutralLossesCosine(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0, ignore_peaks_above_precursor: bool = True)[source]

Bases: BaseSimilarity

Calculate ‘neutral losses cosine score’ between mass spectra.

The neutral losses cosine score aims at quantifying the similarity between two mass spectra. The score is calculated by finding best possible matches between peaks of two spectra. Two peaks are considered a potential match if their m/z ratios lie within the given ‘tolerance’ once a mass-shift is applied. The mass shift is the difference in precursor-m/z between the two spectra. In general, ModifiedCosineGreedy is recommended over NeutralLossesCosine because it will on average deliver more reliable results.

__init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0, ignore_peaks_above_precursor: bool = True)[source]

Parameters:

tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.
mz_power – The power to raise mz to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.
ignore_peaks_above_precursor – By default this is set to True, meaning that peaks with m/z values larger than the precursor-m/z will be ignored (since those would correspond to negative “neutral losses”).
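Comparing peaks after applying the precursor mass shift is equivalent to comparing their neutral losses, i.e. precursor m/z minus peak m/z. The conversion and the effect of ignore_peaks_above_precursor can be sketched as follows; this is an illustration with a hypothetical helper name (neutral_losses) and peaks given as (m/z, intensity) tuples, not the library code.

```python
def neutral_losses(peaks, precursor_mz, ignore_peaks_above_precursor=True):
    """Convert (m/z, intensity) peaks into sorted (neutral loss, intensity)."""
    losses = []
    for mz, intensity in peaks:
        loss = precursor_mz - mz
        if ignore_peaks_above_precursor and loss < 0:
            # Peak lies above the precursor: its "loss" would be negative.
            continue
        losses.append((loss, intensity))
    return sorted(losses)


peaks = [(50.0, 0.3), (120.0, 1.0), (210.0, 0.1)]
kept = neutral_losses(peaks, precursor_mz=200.0)
all_losses = neutral_losses(peaks, precursor_mz=200.0,
                            ignore_peaks_above_precursor=False)
print(kept)
print(all_losses)
```

Two spectra can then be compared by matching these loss values within the tolerance, exactly as fragment m/z values are matched in a regular cosine score.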

keep_score(score): In the .matrix method scores will be collected in a sparse way. Overwrite this method here if values other than False or 0 should not be stored in the final collection.

matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False, progress_bar: bool = True) → ndarray

Optional: Provide optimized method to calculate an np.array of similarity scores for given reference and query spectra. If no method is added here, the following naive implementation (i.e. a double for-loop) is used.

Parameters:

references – List of reference objects
queries – List of query objects
array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy” and will return a numpy array. “sparse” will return a COO-sparse array.
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.

pair(reference: Spectrum, query: Spectrum) → Tuple[float, int][source]

Calculate neutral losses cosine score between two spectra.

Parameters:

reference – Single reference spectrum.
query – Single query spectrum.

Return type:

Tuple with cosine score and number of matched peaks.

sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)

Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.

Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.

Parameters:

references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.

to_dict() → dict: Return a dictionary representation of a similarity function.

class matchms.similarity.ParentMassMatch(tolerance: float = 0.1)[source]

Bases: BaseSimilarity

Return True if spectra match in parent mass (within tolerance), and False otherwise.

Example to calculate scores between 2 pairs of spectra and iterate over the scores

import numpy as np
from matchms import calculate_scores
from matchms import Spectrum
from matchms.similarity import ParentMassMatch

spectrum_1 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"id": "1", "parent_mass": 100})
spectrum_2 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"id": "2", "parent_mass": 110})
spectrum_3 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"id": "3", "parent_mass": 103})
spectrum_4 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"id": "4", "parent_mass": 111})
references = [spectrum_1, spectrum_2]
queries = [spectrum_3, spectrum_4]

similarity_score = ParentMassMatch(tolerance=5.0)
scores = calculate_scores(references, queries, similarity_score)

for (reference, query, score) in scores:
    print(f"Parentmass match between {reference.get('id')} and {query.get('id')}" +
          f" is {bool(score[0])}")

Should output

Parentmass match between 1 and 3 is True
Parentmass match between 2 and 4 is True

__init__(tolerance: float = 0.1)[source]

Parameters:

tolerance – Specify the tolerance below which two masses are counted as a match.

keep_score(score): In the .matrix method scores will be collected in a sparse way. Overwrite this method here if values other than False or 0 should not be stored in the final collection.

matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False) → ndarray[source]

Compare parent masses between all references and queries.

Parameters:

references – List/array of reference spectra.
queries – List/array of query spectra.
array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy” and will return a numpy array. “sparse” will return a COO-sparse array.
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.

pair(reference: Spectrum, query: Spectrum) → float[source]

Compare parent masses between reference and query spectrum.

Parameters:

reference – Single reference spectrum.
query – Single query spectrum.

score_datatype: alias of bool

sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)

Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.

Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.

Parameters:

references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.

to_dict() → dict: Return a dictionary representation of a similarity function.

class matchms.similarity.PrecursorMzMatch(tolerance: float = 0.1, tolerance_type: str = 'Dalton')[source]

Bases: BaseSimilarity

Return True if spectra match in precursor m/z (within tolerance), and False otherwise. The match within tolerance can be calculated based on an absolute m/z difference (tolerance_type="Dalton") or based on a relative difference in ppm (tolerance_type="ppm").

Example to calculate scores between 2 pairs of spectra and iterate over the scores

import numpy as np
from matchms import calculate_scores
from matchms import Spectrum
from matchms.similarity import PrecursorMzMatch

spectrum_1 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"id": "1", "precursor_mz": 100})
spectrum_2 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"id": "2", "precursor_mz": 110})
spectrum_3 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"id": "3", "precursor_mz": 103})
spectrum_4 = Spectrum(mz=np.array([]),
                      intensities=np.array([]),
                      metadata={"id": "4", "precursor_mz": 111})
references = [spectrum_1, spectrum_2]
queries = [spectrum_3, spectrum_4]

similarity_score = PrecursorMzMatch(tolerance=5.0, tolerance_type="Dalton")
scores = calculate_scores(references, queries, similarity_score)

for (reference, query, score) in scores:
    print(f"Precursor m/z match between {reference.get('id')} and {query.get('id')}" +
          f" is {bool(score[0])}")

Should output

Precursor m/z match between 1 and 3 is True
Precursor m/z match between 2 and 4 is True

__init__(tolerance: float = 0.1, tolerance_type: str = 'Dalton')[source]

Parameters:

tolerance – Specify the tolerance below which two m/z values are counted as a match.
tolerance_type – Choose between a fixed tolerance in Dalton ("Dalton") or a relative difference in ppm ("ppm").
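The two tolerance types can be sketched as a stand-alone check. This is a hedged illustration: the function name precursor_mz_match is hypothetical, and computing the ppm difference relative to the mean of the two m/z values is an assumption about how the relative difference is defined.

```python
def precursor_mz_match(mz_ref, mz_query, tolerance=0.1,
                       tolerance_type="Dalton"):
    """Return True if two precursor m/z values lie within the tolerance."""
    difference = abs(mz_ref - mz_query)
    if tolerance_type == "Dalton":
        return difference <= tolerance
    if tolerance_type == "ppm":
        # Assumption: the relative difference refers to the mean m/z.
        mean_mz = (mz_ref + mz_query) / 2
        return difference / mean_mz * 1e6 <= tolerance
    raise ValueError("tolerance_type must be 'Dalton' or 'ppm'")


dalton_hit = precursor_mz_match(100.0, 103.0, tolerance=5.0)
dalton_miss = precursor_mz_match(110.0, 103.0, tolerance=5.0)
ppm_hit = precursor_mz_match(500.0, 500.002, tolerance=5.0,
                             tolerance_type="ppm")   # about 4 ppm apart
ppm_miss = precursor_mz_match(500.0, 500.01, tolerance=5.0,
                              tolerance_type="ppm")  # about 20 ppm apart
print(dalton_hit, dalton_miss, ppm_hit, ppm_miss)
```

A ppm tolerance scales with m/z, so the same setting is stricter in absolute terms for small precursors than for large ones.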

keep_score(score): In the .matrix method scores will be collected in a sparse way. Overwrite this method here if values other than False or 0 should not be stored in the final collection.

matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False) → ndarray[source]

Compare precursor m/z values between all references and queries.

Parameters:

references – List/array of reference spectra.
queries – List/array of query spectra.
array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy” and will return a numpy array. “sparse” will return a COO-sparse array.
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.

pair(reference: Spectrum, query: Spectrum) → float[source]

Compare precursor m/z between reference and query spectrum.

Parameters:

reference – Single reference spectrum.
query – Single query spectrum.

score_datatype: alias of bool

sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)

Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.

Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.

Parameters:

references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.

to_dict() → dict: Return a dictionary representation of a similarity function.

Submodules