matchms.similarity package
Functions for computing spectra similarities
Matchms provides similarity measures for comparing mass spectra and their metadata. The recommended high-level entry points for peak-based cosine scoring are Cosine and ModifiedCosine. These classes choose an appropriate implementation internally and are intended as the default choice for most workflows. Users who need a specific algorithmic variant can select one of the explicit implementations directly, for example CosineLinear, CosineFlash, CosineGreedy, or CosineHungarian.
Available similarity functions include:
- cosine-based peak similarity (Cosine, CosineLinear, CosineFlash, CosineGreedy, CosineHungarian)
- modified cosine similarity for spectra with shifted fragment peaks (ModifiedCosine, CosineFlash with matching_mode="hybrid", ModifiedCosineGreedy, ModifiedCosineHungarian)
- neutral-loss-based peak similarity (NeutralLossesCosine)
- fast embedding-based or approximate similarity methods (BinnedEmbeddingSimilarity, CosineBlink, FlashEntropy)
- simple precursor or parent-mass matching (PrecursorMzMatch, ParentMassMatch)
- molecular-structure similarity based on metadata such as SMILES or InChIKey (FingerprintSimilarity)
- metadata-based matching for user-defined fields, for example exact matches in instrument_type or numerical matches within a tolerance for fields such as retention_time or collision_energy (MetadataMatch)
Custom similarity measures can be added by subclassing BaseSimilarity. Similarities that also provide sparse score computation should subclass BaseSimilarityWithSparse.
External similarity measures, such as Spec2Vec, can also be used together with matchms workflows.
- class matchms.similarity.BinnedEmbeddingSimilarity(similarity: str = 'cosine', max_mz: float = 1005, bin_width: float = 1, intensity_power: float = 1)[source]
Bases: BaseEmbeddingSimilarity
Compare spectra by cosine or Euclidean similarity of binned intensities.
Spectra are converted to fixed-length vectors by summing intensities in equally spaced m/z bins. Each vector is normalized to its maximum bin intensity when that maximum is positive. Empty spectra, spectra without peaks in the configured m/z range, and spectra with only zero intensities produce a zero vector instead of NaNs.
- Parameters:
similarity – Similarity measure used for comparing embeddings. Supported values are "cosine" and "euclidean".
max_mz – Maximum m/z value to include. Values outside [0, max_mz] are ignored.
bin_width – Width of each m/z bin.
intensity_power – Power applied to peak intensities before binning.
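The binning described above can be sketched in plain Python. This is an illustration only, not the matchms implementation: the helper names binned_embedding and cosine are hypothetical, and the bin count (ceil(max_mz / bin_width)) and the handling of a peak exactly at max_mz are assumptions.

```python
import math


def binned_embedding(peaks, max_mz=1005.0, bin_width=1.0, intensity_power=1.0):
    """Sum (mz, intensity) peaks into equal-width bins, then scale by the largest bin."""
    n_bins = math.ceil(max_mz / bin_width)  # assumed bin count
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        if not 0 <= mz <= max_mz:
            continue  # values outside [0, max_mz] are ignored
        index = min(int(mz // bin_width), n_bins - 1)  # assumed edge handling
        vector[index] += intensity ** intensity_power
    top = max(vector)
    if top > 0:  # empty or all-zero spectra stay a zero vector (no NaNs)
        vector = [value / top for value in vector]
    return vector


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


embedding = binned_embedding([(100.2, 10.0), (100.7, 30.0), (250.0, 20.0)])
empty = binned_embedding([])
```

The two peaks near m/z 100 fall into the same bin and are summed before the vector is scaled, which is why the embedding is insensitive to small m/z differences within one bin.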
- __init__(similarity: str = 'cosine', max_mz: float = 1005, bin_width: float = 1, intensity_power: float = 1)[source]
- build_ann_index(reference_spectra: Iterable[Spectrum] | None = None, embeddings_path: str | Path | None = None, k: int = 100, index_backend: str = 'pynndescent', **index_kwargs) Any
Build an ANN index for the input spectra.
- Parameters:
reference_spectra (Optional[Iterable[SpectrumType]]) – List of reference spectra to build the ANN index for.
embeddings_path (Optional[Union[str, Path]]) – If embeddings are already computed, provide the path to the numpy file.
k (int, optional) – Number of nearest neighbors to use for the ANN index.
index_backend (str, optional) – Backend to use for ANN index. Currently only “pynndescent” is supported.
**index_kwargs – Additional keyword arguments passed to the index constructor.
- Returns:
The constructed ANN index.
- Return type:
Any
- Raises:
ImportError – If pynndescent is not installed.
ValueError – If an unsupported index_backend is specified.
- compute_embeddings(spectra: Iterable[Spectrum]) ndarray[source]
Convert spectra into binned embeddings.
- Parameters:
spectra (Iterable[SpectrumType]) – The spectra to convert into embeddings.
- Returns:
Array of shape (n_spectra, n_bins) containing the binned embeddings.
- Return type:
np.ndarray
- compute_similarity_matrix_from_embeddings(embeddings_1: ndarray, embeddings_2: ndarray | None = None) ndarray
Compute a raw NumPy similarity matrix from precomputed embeddings.
This helper keeps the old raw-array use case available without changing the public matrix() contract inherited from BaseSimilarity.
- get_anns(query_spectra: Iterable[Spectrum] | ndarray, k: int = 100) tuple[ndarray, ndarray]
Get approximate nearest neighbors for input spectra.
- Parameters:
query_spectra (Union[Iterable[SpectrumType], np.ndarray]) – Query spectra or their embeddings.
k (int, optional) – Number of nearest neighbors to return.
- Returns:
Neighbor indices and similarity scores.
- Return type:
Tuple[np.ndarray, np.ndarray]
- Raises:
ValueError – If no index is built or k is larger than index k.
- get_embeddings(spectra: Iterable[Spectrum] | None = None, npy_path: str | Path | None = None) ndarray
Get embeddings either by computing them or loading from disk.
- Parameters:
spectra – List of spectra to compute embeddings for.
npy_path – Path to load/save embeddings from/to. If provided, embeddings are loaded from disk if it exists, otherwise they are computed and saved on disk to the provided path.
- Returns:
Embeddings array.
- Return type:
np.ndarray
- Raises:
ValueError – If neither spectra nor npy_path is provided.
- get_index_anns() tuple[ndarray, ndarray]
Get nearest neighbors for all points in the index.
- Returns:
Neighbor indices and similarity scores.
- Return type:
Tuple[np.ndarray, np.ndarray]
- Raises:
ValueError – If unsupported index backend is used.
- load_ann_index(path: str | Path) Any
Load an ANN index from disk.
- Parameters:
path (Union[str, Path]) – Path to load the index from.
- Returns:
The loaded ANN index.
- Return type:
Any
- Raises:
ValueError – If loaded index similarity metric doesn’t match current metric.
- static load_embeddings(npy_path: str | Path) ndarray
Load embeddings from a numpy file.
- Parameters:
npy_path (Union[str, Path]) – Path to the numpy file.
- Returns:
Embeddings array.
- Return type:
np.ndarray
- Raises:
ValueError – If loaded array is not 2D.
- matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True) Scores
Compute similarity matrix between spectra_1 and spectra_2.
- Parameters:
spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself.
score_fields – Requested score fields. Embedding similarities expose only ("score",).
progress_bar – Included for API compatibility. Embeddings are computed in batch and this implementation currently does not display a progress bar.
- Returns:
Similarity matrix.
- Return type:
Scores
- Raises:
ValueError – If array_type is not “numpy” or is_symmetric is False.
- pair(spectrum_1: Spectrum, spectrum_2: Spectrum) float
Compute similarity between a pair of spectra.
- Parameters:
spectrum_1 (SpectrumType) – Reference spectrum.
spectrum_2 (SpectrumType) – Query spectrum.
- save_ann_index(path: str | Path) None
Save the ANN index to disk.
- Parameters:
path (Union[str, Path]) – Path to save the index to.
- Raises:
ValueError – If no index exists to save.
- score_datatype
alias of
float64
- sparse_matrix(spectra_1, spectra_2=None, idx_row=None, idx_col=None, score_fields=None, score_filter=None, progress_bar: bool = True)
Sparse score computation is not available for this similarity.
- class matchms.similarity.Cosine(tolerance: float = 0.1, intensity_power: float = 1.0, use_hungarian: bool = False, noise_cutoff: float = 0.01)[source]
Bases: BaseSimilarity
Calculate Cosine scores between mass spectra.
This is the central Cosine class of matchms. The Cosine score aims at quantifying the similarity between two mass spectra. Two peaks are considered a potential match if their m/z ratios lie within the given tolerance.
Matchms provides various implementations of the Cosine score, which are combined here in what we believe to be the best choice for most users.
By default, use_hungarian is set to False, which means that the greedy algorithm is used to find the best matches. This is typically faster than the Hungarian algorithm, and for most applications the results are very similar. If you need the exact optimal solution, set use_hungarian to True, which uses the Hungarian algorithm to find the best matches.
- __init__(tolerance: float = 0.1, intensity_power: float = 1.0, use_hungarian: bool = False, noise_cutoff: float = 0.01)[source]
Initialize cosine score class.
- Parameters:
tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.
use_hungarian – Whether to use the Hungarian algorithm to find the best matches. The default is False, which means that the greedy algorithm is used to find the best matches. The greedy algorithm is typically faster than the Hungarian algorithm, and for most applications the results are very similar.
noise_cutoff – Minimum relative intensity for a peak to be considered. Default is 0.01. Will only be used if use_hungarian is False.
- matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True, n_jobs: int = -1)[source]
Calculate matrix of Cosine scores.
- Parameters:
spectra_1 – First collection of input spectra.
spectra_2 – Second collection of input spectra. If None, compare spectra_1 against itself.
score_fields – Requested score fields. Only ("score",) is supported.
progress_bar – When True, show a progress bar.
n_jobs – Number of parallel jobs to run. Default is -1, which means that all available CPUs minus one will be used.
- Returns:
Dense score matrix as a Scores object.
- Return type:
Scores
- pair(spectrum_1: Spectrum, spectrum_2: Spectrum) tuple[float, int][source]
Calculate the Cosine score between two spectra.
- class matchms.similarity.CosineBlink(tolerance: float = 0.01, bin_width: float = 0.001, mz_power: float = 0.0, intensity_power: float = 1.0, clip_to_one: bool = True, use_numba: bool = True, prefilter: bool = True, min_relative_intensity: float = 0.01, crop_above_precursor: bool = True, remove_zero_intensities: bool = True, top_k: int | None = None, batch_size: int = 1024, sparse_score_min: float = 0.0)[source]
Bases: BaseSimilarity
BLINK-style approximate cosine similarity for mass spectra with fast .pair() and .matrix(). This score is based on the BLINK method proposed by Harwood et al. (2023, https://www.nature.com/articles/s41598-023-40496-9).
Integer binning with bin_width (Da); tolerance window is ± floor(tolerance/bin_width) bins.
Per-spectrum L2 normalization (after optional mz/intensity weighting).
Blur only one side (spectra_2 in .matrix(), smaller spectrum in .pair()).
- Parameters:
tolerance – True m/z tolerance (Da). Peaks within +/- tolerance are considered matches. Default 0.01.
bin_width – Discretization width (Da). Default 0.001 (1 mDa). Effective radius R=floor(tolerance/bin_width).
mz_power – Power for mz weighting (intensity *= mz**mz_power). Default 0.0.
intensity_power – Power for intensity weighting before normalization. Default 1.0 (set 0.5 for sqrt scaling).
clip_to_one – Clip score to [0,1]. Default True.
use_numba (bool) – Use numba-accelerated pairwise kernel when available. Default True.
prefilter (bool) – Apply BLINK-like pre-filtering (remove <1% base peak, > precursor m/z, zeros). Default True.
min_relative_intensity (float) – Relative base-peak threshold for prefilter. Default 0.01 (1%).
crop_above_precursor (bool) – Drop fragments > precursor m/z if available in metadata. Default True.
remove_zero_intensities (bool) – Remove peaks with intensity <= 0. Default True.
top_k (Optional[int]) – Keep only top-K most intense fragments after other filters (per spectrum). Default None.
batch_size (int) – Number of query spectra per batch in .matrix(). Default 1024.
sparse_score_min (float) – When array_type=’sparse’, drop scores < sparse_score_min. Default 0.0.
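The three ingredients listed for this class (integer binning, per-spectrum L2 normalization, and blurring one side by ±R bins) can be sketched as follows. This is a simplified illustration under stated assumptions, not the matchms or BLINK code: the helper names are hypothetical, peaks are assigned to bins by rounding (an assumption), the second spectrum is always the blurred one, and prefiltering and m/z weighting are left out.

```python
import math
from collections import defaultdict


def to_bins(peaks, bin_width, intensity_power=1.0):
    """Map (mz, intensity) peaks to integer bins and L2-normalize the result."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[round(mz / bin_width)] += intensity ** intensity_power  # assumed rounding rule
    norm = math.sqrt(sum(value * value for value in binned.values()))
    return {b: value / norm for b, value in binned.items()} if norm > 0 else {}


def blur(binned, radius):
    """Copy every non-zero bin to its neighbours within +/- radius bins."""
    blurred = defaultdict(float)
    for b, value in binned.items():
        for offset in range(-radius, radius + 1):
            blurred[b + offset] += value
    return blurred


def blink_like_score(peaks_1, peaks_2, tolerance=0.01, bin_width=0.001, clip_to_one=True):
    """Dot product of one binned spectrum with the blurred version of the other."""
    radius = math.floor(tolerance / bin_width)  # R = floor(tolerance / bin_width)
    reference = to_bins(peaks_1, bin_width)
    query = blur(to_bins(peaks_2, bin_width), radius)
    score = sum(value * query.get(b, 0.0) for b, value in reference.items())
    return min(score, 1.0) if clip_to_one else score


spectrum_a = [(100.000, 1.0), (200.000, 0.5)]
spectrum_b = [(100.004, 1.0), (200.003, 0.5)]  # shifted by a few mDa
spectrum_c = [(100.050, 1.0), (200.050, 0.5)]  # shifted by 50 mDa

close_score = blink_like_score(spectrum_a, spectrum_b)
far_score = blink_like_score(spectrum_a, spectrum_c)
```

Because blurring can let one peak overlap several neighbours, the raw dot product may exceed 1 for dense spectra, which is the situation clip_to_one addresses.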
- __init__(tolerance: float = 0.01, bin_width: float = 0.001, mz_power: float = 0.0, intensity_power: float = 1.0, clip_to_one: bool = True, use_numba: bool = True, prefilter: bool = True, min_relative_intensity: float = 0.01, crop_above_precursor: bool = True, remove_zero_intensities: bool = True, top_k: int | None = None, batch_size: int = 1024, sparse_score_min: float = 0.0)[source]
- matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True) Scores[source]
All-vs-all BLINK-style cosine scores.
Implementation:
- Build a global dense bin axis in integer bins from min to max across refs+queries (rows ~ (max_bin - min_bin + 1)), which keeps matrices sparse.
- Build a CSR intensity matrix for refs (rows=bins, cols=ref spectra) after per-spectrum L2 normalization.
- For spectra_2, build per-batch blurred CSR by expanding each nonzero to its ±R neighbors.
- Multiply: scores_batch = (I_ref.T @ I_qry_blur), accumulate into the final output.
- Parameters:
spectra_1 – List of input spectra.
spectra_2 – List of input spectra.
score_fields – Requested score fields.
- Returns:
Dense Scores object.
- Return type:
Scores
- pair(spectrum_1: Spectrum, spectrum_2: Spectrum) tuple[float, int][source]
Calculate BLINK-style cosine between two spectra.
- Parameters:
spectrum_1 – Single reference spectrum.
spectrum_2 – Single query spectrum.
- score_datatype
alias of
float32
- class matchms.similarity.CosineFlash(*args, dtype: dtype = <class 'numpy.float64'>, **kwargs)[source]
Bases: _BaseFlashSimilarity
Flash Cosine similarity following the original Flash Entropy (Li & Fiehn, 2023) with a fast .matrix() that builds a library-wide index over ‘queries’ and streams all ‘references’ through it. This corresponds to the “CosineGreedy” scoring logic but with the same fast Flash path as Flash Entropy.
- Key options:
matching_mode: ‘fragment’, ‘neutral_loss’, or ‘hybrid’ (fragment-priority).
tolerance in Da or symmetric ppm (use_ppm=True).
- cleanup: remove precursor & > (precursor_mz - 1.6), 1% noise removal,
entropy weighting, normalize ∑I’ = 0.5, optional within-peak merge.
- Notes:
.pair() works but is not the fast path. Use .matrix().
For identity-search behavior, pass identity_precursor_tolerance (Da or ppm).
- Parameters:
matching_mode – Matching mode: ‘fragment’, ‘neutral_loss’, or ‘hybrid’ (default is ‘fragment’).
tolerance – Matching tolerance in Da or ppm (use_ppm=True). Default is 0.02.
use_ppm – If True, interpret tolerance as parts-per-million. Default is False.
intensity_power – The power to raise intensity to in the cosine function. The default is 1 (no weighting).
remove_precursor – If True, remove precursor peak and peaks within precursor_window. Default is False.
precursor_window – If remove_precursor is True, remove peaks within this window around the precursor m/z. Default is 1.6 Da (as suggested by Li & Fiehn(2023)).
noise_cutoff – If > 0, remove peaks with intensities below this fraction of the maximum intensity. Default is 0.01 (1%).
normalize_to_half – If True, normalize intensities such that the sum of intensities is 0.5. Default is False.
merge_within – If > 0, merge peaks within this distance (in Da) to a single peak. Default is 0.
identity_precursor_tolerance – If not None, enforce identity search behavior by requiring the precursor m/z of the query to be within this tolerance of the reference precursor m/z.
identity_use_ppm – If True, interpret identity_precursor_tolerance as ppm. Default is False.
dtype – Data type for the output scores. Default is np.float64, which properly accounts for the highest-resolution MS/MS data (even far beyond current MS/MS possibilities!). To save memory, np.float32 can be used instead, which is sufficient for peak resolutions up to about 8,000,000.
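The resolution limit quoted for np.float32 is consistent with the size of its significand: a 32-bit float stores 24 significant bits, so neighbouring values differ by roughly one part in 2**23, about 8.4 million. The sketch below checks this with the standard library only; to_float32 is a hypothetical helper that rounds a Python float to single precision.

```python
import struct


def to_float32(value):
    """Round a Python float (double precision) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]


relative_steps = 2 ** 23  # 8388608, close to the quoted 8,000,000

mz = 1000.0
fine_shift = mz + mz / 50_000_000   # distinguishing it needs a resolution of 5e7
coarse_shift = mz + mz / 1_000_000  # distinguishing it needs a resolution of 1e6

print(to_float32(fine_shift) == to_float32(mz))    # True: the shift is lost in float32
print(to_float32(coarse_shift) == to_float32(mz))  # False: the shift survives
```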
- matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True, n_jobs: int = -1)[source]
Calculate matrix of Flash Cosine scores.
- Parameters:
spectra_1 – First collection of input spectra.
spectra_2 – Second collection of input spectra. If None, compare spectra_1 against itself.
score_fields – Requested score fields. Only ("score",) is supported.
progress_bar – When True, show a progress bar.
n_jobs – Number of parallel jobs to run. Default is -1, which means that all available CPUs minus one will be used.
- Returns:
Dense score matrix as a Scores object.
- Return type:
Scores
- pair(spectrum_1: Spectrum, spectrum_2: Spectrum) ndarray[source]
Calculate the similarity for one pair of spectra.
- Parameters:
spectrum_1 – First spectrum.
spectrum_2 – Second spectrum.
- Returns:
Similarity result for one pair. The returned value should be compatible with self.score_datatype.
- Return type:
score
Examples
- Scalar score:
return np.asarray(score, dtype=self.score_datatype)
- Structured score:
return np.asarray((score, matches), dtype=self.score_datatype)
- score_datatype
alias of
float64
- class matchms.similarity.CosineGreedy(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0, noise_cutoff: float = 0.01)[source]
Bases: BaseSimilarityWithSparse
Calculate ‘cosine similarity score’ between two spectra.
The cosine score aims at quantifying the similarity between two mass spectra. The score is calculated by finding the best possible matches between peaks of two spectra. Two peaks are considered a potential match if their m/z ratios lie within the given ‘tolerance’. The underlying peak assignment problem is solved here in a ‘greedy’ way. This can be notably faster, but occasionally deviates slightly from the fully correct solution (as obtained with the Hungarian algorithm, see CosineHungarian). In practice this rarely affects similarity scores notably, in particular for smaller tolerances.
For example
    import numpy as np
    from matchms import Spectrum
    from matchms.similarity import CosineGreedy

    spectrum_1 = Spectrum(mz=np.array([100, 150, 200.]),
                          intensities=np.array([0.7, 0.2, 0.1]),
                          metadata={"precursor_mz": 200.0})
    spectrum_2 = Spectrum(mz=np.array([100, 140, 190.]),
                          intensities=np.array([0.4, 0.2, 0.1]),
                          metadata={"precursor_mz": 190.0})

    # Construct the similarity function
    cosine_greedy = CosineGreedy(tolerance=0.2)
    score = cosine_greedy.pair(spectrum_1, spectrum_2)

    print(f"Cosine score is {score['score']:.2f} with {score['matches']} matched peaks")
Should output
Cosine score is 0.83 with 1 matched peaks
Unlike in matchms < 1.0, this method also applies a noise filter by default, which removes peaks whose relative intensity is below noise_cutoff. This is typically highly beneficial for the performance of the greedy algorithm, and for most applications the results are very similar to the exact assignment variant. To disable this noise filtering, set noise_cutoff to 0 or None.
- __init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0, noise_cutoff: float = 0.01)[source]
- Parameters:
tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.
mz_power – The power to raise m/z to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.
noise_cutoff – Minimum relative intensity for a peak to be considered. Default is 0.01.
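The greedy assignment and the noise filter can be sketched without matchms as shown below. This is a minimal illustration, not the library code: greedy_cosine is a hypothetical helper, mz_power is left out, and two details are assumptions (the filter keeps peaks at or above noise_cutoff times the base peak, and the normalization uses the filtered peaks).

```python
import math


def greedy_cosine(peaks_1, peaks_2, tolerance=0.1, intensity_power=1.0, noise_cutoff=0.01):
    """Greedy cosine score for two lists of (mz, intensity) peaks."""

    def prepare(peaks):
        if not peaks:
            return []
        base_peak = max(intensity for _, intensity in peaks)
        threshold = (noise_cutoff or 0.0) * base_peak  # 0 or None disables the filter
        return [(mz, intensity ** intensity_power)
                for mz, intensity in peaks if intensity >= threshold]

    a, b = prepare(peaks_1), prepare(peaks_2)
    # All peak pairs within tolerance, largest intensity product first.
    candidates = sorted(
        ((ia * ib, i, j)
         for i, (mz_a, ia) in enumerate(a)
         for j, (mz_b, ib) in enumerate(b)
         if abs(mz_a - mz_b) <= tolerance),
        reverse=True,
    )
    used_a, used_b, total, matches = set(), set(), 0.0, 0
    for product, i, j in candidates:
        if i not in used_a and j not in used_b:  # each peak is matched at most once
            used_a.add(i)
            used_b.add(j)
            total += product
            matches += 1
    norm = (math.sqrt(sum(v * v for _, v in a))
            * math.sqrt(sum(v * v for _, v in b)))
    return (total / norm if norm > 0 else 0.0), matches


score, matches = greedy_cosine(
    [(100.0, 0.7), (150.0, 0.2), (200.0, 0.1)],
    [(100.0, 0.4), (140.0, 0.2), (190.0, 0.1)],
    tolerance=0.2,
)
print(f"{score:.2f} with {matches} matched peaks")  # 0.83 with 1 matched peaks
```

The input mirrors the spectra of the CosineGreedy example, so the printed values agree with the documented output.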
- keep_score(score) bool
Return whether a score should be retained in sparse outputs.
This defines the default sparse retention behavior. Users can override it per call via score_filter=....
Default behavior:
- scalar score: keep if score != 0
- structured score: keep if all fields are non-zero
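The default retention rule and the per-call override can be pictured with a small stand-in. This sketch does not use matchms: a plain dict plays the role of a structured score, and sparse_entries is a hypothetical helper that mimics how a score_filter callable would replace the default rule.

```python
def keep_score(score):
    """Default rule: drop zero scalars, and structured scores with any zero field."""
    if isinstance(score, dict):  # stand-in for a structured score
        return all(value != 0 for value in score.values())
    return score != 0  # scalar score


def sparse_entries(scores, score_filter=None):
    """Keep only the (row, col) entries accepted by the filter."""
    keep = score_filter or keep_score
    return {pair: score for pair, score in scores.items() if keep(score)}


scores = {
    (0, 0): {"score": 1.0, "matches": 3},
    (0, 1): {"score": 0.0, "matches": 0},
    (1, 1): {"score": 0.4, "matches": 1},
}

default_kept = sparse_entries(scores)
strict_kept = sparse_entries(scores, score_filter=lambda s: s["score"] >= 0.5)
```

With the default rule the all-zero entry is dropped; the custom filter additionally removes the entry scoring 0.4.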
- matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True)
Calculate a dense similarity matrix.
- Parameters:
spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself. For commutative similarities this automatically uses a symmetric optimization.
score_fields – Score fields to return. None means return all available fields. For scalar scores, only ("score",) is valid. For structured scores, this can be a subset such as ("score",).
progress_bar – When True, show a progress bar. Default is True.
- Returns:
Dense score result wrapped in a Scores container.
- Return type:
Scores
- pair(spectrum_1: Spectrum, spectrum_2: Spectrum) tuple[float, int][source]
Calculate cosine score between two spectra.
- Parameters:
spectrum_1 – First spectrum.
spectrum_2 – Second spectrum.
- Returns:
Tuple with cosine score and number of matched peaks. The score can be accessed as score["score"] and the number of matched peaks as score["matches"].
- Return type:
Score
- sparse_matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, idx_row: ArrayLike | None = None, idx_col: ArrayLike | None = None, score_fields: Sequence[str] | None = None, score_filter: Callable[[ndarray], bool] | None = None, progress_bar: bool = True)
Calculate sparse similarity results.
Filtering is applied to the full score before score field projection.
- Parameters:
spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself.
idx_row – Row indices of pairs to compute. If None and idx_col is also None, all pairwise comparisons are considered and only retained scores are stored.
idx_col – Column indices of pairs to compute. Must have the same shape as idx_row.
score_fields – Score fields to return. None means return all available fields. For scalar scores, only ("score",) is valid. For structured scores, this can be a subset such as ("score",).
score_filter – Optional callable receiving the full score and returning whether it should be retained. If None, keep_score() is used.
progress_bar – When True, show a progress bar.
- Returns:
Sparse score result wrapped in a Scores container.
- Return type:
Scores
- class matchms.similarity.CosineHungarian(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]
Bases: BaseSimilarityWithSparse
Calculate ‘cosine similarity score’ between two spectra using the Hungarian algorithm.
The cosine score quantifies the similarity between two mass spectra by finding the optimal one-to-one matching between their peaks. Two peaks are considered a potential match if their m/z ratios lie within the given tolerance.
The peak assignment is solved using the Hungarian algorithm (scipy.optimize.linear_sum_assignment), which finds the assignment that maximises the sum of intensity products. This is mathematically optimal but can be notably slower than the greedy heuristic in CosineGreedy.
- __init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]
- Parameters:
tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.
mz_power – The power to raise m/z to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.
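A small constructed case shows how the greedy heuristic can fall short of the optimal assignment. The sketch below is illustrative only: to stay dependency-free it finds the optimum by brute force over all one-to-one assignments, which stands in for the Hungarian algorithm and is feasible only for tiny inputs. All helper names are hypothetical.

```python
from itertools import permutations


def candidate_products(peaks_1, peaks_2, tolerance):
    """Intensity products of all peak pairs within tolerance, keyed by (i, j)."""
    return {
        (i, j): ia * ib
        for i, (mz_a, ia) in enumerate(peaks_1)
        for j, (mz_b, ib) in enumerate(peaks_2)
        if abs(mz_a - mz_b) <= tolerance
    }


def greedy_total(products):
    """Pick pairs by descending product, using each peak at most once."""
    used_i, used_j, total = set(), set(), 0.0
    for (i, j), product in sorted(products.items(), key=lambda item: item[1], reverse=True):
        if i not in used_i and j not in used_j:
            used_i.add(i)
            used_j.add(j)
            total += product
    return total


def optimal_total(products, n_1, n_2):
    """Best sum over all one-to-one assignments (brute force, tiny inputs only)."""
    columns = list(range(n_2)) + [None] * n_1  # None means "left unmatched"
    best = 0.0
    for assignment in permutations(columns, n_1):
        total = sum(products.get((i, j), 0.0) for i, j in enumerate(assignment))
        best = max(best, total)
    return best


peaks_a = [(100.00, 1.0), (100.10, 0.8)]
peaks_b = [(99.95, 0.9), (100.05, 1.0)]
products = candidate_products(peaks_a, peaks_b, tolerance=0.1)

greedy = greedy_total(products)                                 # takes the 1.0 pair first
optimal = optimal_total(products, len(peaks_a), len(peaks_b))   # 0.9 + 0.8
```

Greedy takes the largest product (1.0) first, which blocks both remaining pairs. The optimal assignment gives up that pair and collects 0.9 + 0.8 instead. Cases like this require several peaks within the tolerance window of each other, which is why smaller tolerances make deviations rarer.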
- keep_score(score) bool
Return whether a score should be retained in sparse outputs.
This defines the default sparse retention behavior. Users can override it per call via score_filter=....
Default behavior:
- scalar score: keep if score != 0
- structured score: keep if all fields are non-zero
- matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True)
Calculate a dense similarity matrix.
- Parameters:
spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself. For commutative similarities this automatically uses a symmetric optimization.
score_fields – Score fields to return. None means return all available fields. For scalar scores, only ("score",) is valid. For structured scores, this can be a subset such as ("score",).
progress_bar – When True, show a progress bar. Default is True.
- Returns:
Dense score result wrapped in a Scores container.
- Return type:
Scores
- pair(spectrum_1: Spectrum, spectrum_2: Spectrum) tuple[float, int][source]
Calculate cosine score between two spectra.
- Parameters:
spectrum_1 – Single spectrum.
spectrum_2 – Single spectrum.
- Returns:
Tuple with cosine score and number of matched peaks.
- sparse_matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, idx_row: ArrayLike | None = None, idx_col: ArrayLike | None = None, score_fields: Sequence[str] | None = None, score_filter: Callable[[ndarray], bool] | None = None, progress_bar: bool = True)
Calculate sparse similarity results.
Filtering is applied to the full score before score field projection.
- Parameters:
spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself.
idx_row – Row indices of pairs to compute. If None and idx_col is also None, all pairwise comparisons are considered and only retained scores are stored.
idx_col – Column indices of pairs to compute. Must have the same shape as idx_row.
score_fields – Score fields to return. None means return all available fields. For scalar scores, only ("score",) is valid. For structured scores, this can be a subset such as ("score",).
score_filter – Optional callable receiving the full score and returning whether it should be retained. If None, keep_score() is used.
progress_bar – When True, show a progress bar.
- Returns:
Sparse score result wrapped in a Scores container.
- Return type:
Scores
- class matchms.similarity.CosineLinear(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]
Bases: BaseSimilarity
Calculate ‘linear cosine similarity score’ between two spectra.
This implements the CosineLinear similarity from SIRIUS (Böcker lab), which achieves O(n+m) time complexity by requiring spectra to be “well-separated” (consecutive peaks more than 2x tolerance apart). A preprocessing step (sirius_merge_close_peaks) enforces this invariant by greedily merging close peaks in descending intensity order.
For example
    import numpy as np
    from matchms import Spectrum
    from matchms.similarity import CosineLinear

    reference = Spectrum(mz=np.array([100, 150, 200.]),
                         intensities=np.array([0.7, 0.2, 0.1]))
    query = Spectrum(mz=np.array([100, 140, 190.]),
                     intensities=np.array([0.4, 0.2, 0.1]))

    cosine_linear = CosineLinear(tolerance=0.2)
    score = cosine_linear.pair(reference, query)

    print(f"CosineLinear score is {score['score']:.2f} with {score['matches']} matched peaks")
Should output
CosineLinear score is 0.83 with 1 matched peaks
- __init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]
- Parameters:
tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1. Peaks closer than 2 * tolerance are merged before scoring.
mz_power – The power to raise m/z to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.
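Why well-separated spectra allow linear-time matching can be seen in a short sketch. Once peaks within each spectrum are more than 2 * tolerance apart, every peak has at most one partner in the other spectrum, so a single two-pointer pass over the sorted peak lists finds all matches. The code below is an illustration, not the matchms or SIRIUS implementation. The helper names are hypothetical, and the merge rule (add the weaker peak's intensity to the stronger peak and keep the stronger peak's m/z) is an assumption.

```python
import math


def merge_close_peaks(peaks, tolerance):
    """Greedily merge peaks so that survivors are more than 2 * tolerance apart."""
    kept = []  # [mz, intensity] entries, strongest first
    for mz, intensity in sorted(peaks, key=lambda peak: peak[1], reverse=True):
        for survivor in kept:
            if abs(survivor[0] - mz) <= 2 * tolerance:
                survivor[1] += intensity  # assumed merge rule
                break
        else:
            kept.append([mz, intensity])
    return sorted((mz, intensity) for mz, intensity in kept)


def linear_cosine(peaks_1, peaks_2, tolerance=0.1):
    """Cosine score via one two-pointer pass over two well-separated peak lists."""
    a = merge_close_peaks(peaks_1, tolerance)
    b = merge_close_peaks(peaks_2, tolerance)
    i = j = matches = 0
    total = 0.0
    while i < len(a) and j < len(b):
        difference = a[i][0] - b[j][0]
        if abs(difference) <= tolerance:
            total += a[i][1] * b[j][1]
            matches += 1
            i += 1
            j += 1
        elif difference < 0:
            i += 1  # a[i] lies below every remaining peak of b
        else:
            j += 1
    norm = (math.sqrt(sum(v * v for _, v in a))
            * math.sqrt(sum(v * v for _, v in b)))
    return (total / norm if norm > 0 else 0.0), matches


merged = merge_close_peaks([(100.0, 1.0), (100.1, 0.5), (200.0, 0.3)], tolerance=0.1)
score, matches = linear_cosine(
    [(100.0, 0.7), (150.0, 0.2), (200.0, 0.1)],
    [(100.0, 0.4), (140.0, 0.2), (190.0, 0.1)],
    tolerance=0.2,
)
```

For the spectra of the CosineLinear example no peaks need merging, and the pass finds the single match at m/z 100.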
- matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True)[source]
Optimized matrix computation that precomputes merged spectra.
Each spectrum is merged once (N+M calls to sirius_merge_close_peaks) instead of 2*N*M times in the naive double-loop approach.
- pair(reference: Spectrum, query: Spectrum) tuple[float, int][source]
Calculate linear cosine score between two spectra.
- Parameters:
reference – Single reference spectrum.
query – Single query spectrum.
- Returns:
Tuple with cosine score and number of matched peaks.
- Return type:
Score
- class matchms.similarity.FingerprintSimilarity(fingerprint_generator, similarity_measure: str = 'tanimoto', set_empty_scores: float | int | str = 'nan', ignore_stereochemistry: bool = False, count: bool = False, folded: bool = True, return_csr: bool = False, invalid_policy: str = 'raise', **fingerprint_config_kwargs)[source]
Bases: BaseSimilarity
Calculate similarity between molecules based on molecular fingerprints.
Fingerprints can either be provided explicitly as Fingerprints objects or computed internally from input spectra.
This class no longer expects fingerprints to be stored directly in spectrum metadata. Instead, it uses a Fingerprints container.
Currently supported similarity measures are:
- "cosine"
- "tanimoto"
Notes
Tanimoto is used in its generalized form and therefore also works for count/weighted fingerprints.
Fingerprints may be stored densely (NumPy) or sparsely (CSR).
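Two generalizations of the Tanimoto coefficient to count or weighted vectors are in common use, and the notes above do not say which one is meant. The sketch below shows both as plain Python. Neither is taken from the matchms source, the function names are hypothetical, and returning NaN for all-zero input is an illustrative choice. Both forms reduce to the usual set-based Tanimoto (Jaccard) score for binary fingerprints and differ only on counts.

```python
def tanimoto_min_max(a, b):
    """Generalized Tanimoto as sum(min) / sum(max), also known as Ruzicka similarity."""
    denominator = sum(max(x, y) for x, y in zip(a, b))
    if denominator == 0:
        return float("nan")
    return sum(min(x, y) for x, y in zip(a, b)) / denominator


def tanimoto_dot(a, b):
    """Generalized Tanimoto as a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(x * y for x, y in zip(a, b))
    denominator = sum(x * x for x in a) + sum(y * y for y in b) - dot
    if denominator == 0:
        return float("nan")
    return dot / denominator


binary_a, binary_b = [1, 1, 0, 1], [1, 0, 1, 1]
count_a, count_b = [2, 0, 1], [1, 1, 1]

print(tanimoto_min_max(binary_a, binary_b), tanimoto_dot(binary_a, binary_b))  # 0.5 0.5
print(tanimoto_min_max(count_a, count_b), tanimoto_dot(count_a, count_b))      # 0.5 0.6
```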
- __init__(fingerprint_generator, similarity_measure: str = 'tanimoto', set_empty_scores: float | int | str = 'nan', ignore_stereochemistry: bool = False, count: bool = False, folded: bool = True, return_csr: bool = False, invalid_policy: str = 'raise', **fingerprint_config_kwargs)[source]
- Parameters:
fingerprint_generator – A chemap-compatible fingerprint generator.
similarity_measure – Choose similarity measure from "cosine" or "tanimoto". The default is "tanimoto".
set_empty_scores – Define what should be returned instead of a similarity score in cases where fingerprints are missing. The default is "nan", which will return np.nan in such cases.
ignore_stereochemistry – Passed to internally created Fingerprints objects.
count – Passed to internally created Fingerprints objects.
folded – Passed to internally created Fingerprints objects.
return_csr – Passed to internally created Fingerprints objects.
invalid_policy – Passed to internally created Fingerprints objects.
**fingerprint_config_kwargs – Additional keyword arguments passed to internally created Fingerprints objects.
- matrix(spectra_1: Sequence[Spectrum] | None = None, spectra_2: Sequence[Spectrum] | None = None, fingerprints_1: Fingerprints | None = None, fingerprints_2: Fingerprints | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True) Scores[source]
Calculate matrix of fingerprint-based similarity scores.
- Parameters:
spectra_1 – First collection of spectra. Used only if fingerprints_1 is not given.
spectra_2 – Second collection of spectra. Used only if fingerprints_2 is not given. If None and fingerprints_2 is None, compare the first input against itself.
fingerprints_1 – Optional precomputed Fingerprints object for the first input.
fingerprints_2 – Optional precomputed Fingerprints object for the second input. If None, compare the first input against itself.
score_fields – Requested score fields. Only ("score",) is supported.
progress_bar – Included for API compatibility. Not used here.
- Returns:
Dense score matrix as a Scores object.
- Return type:
Scores
- pair(spectrum_1: Spectrum, spectrum_2: Spectrum)[source]
Pairwise fingerprint similarity is not supported in this API.
FingerprintSimilarity works on precomputed Fingerprints containers or computes fingerprints internally for collections of spectra in matrix().
Use matrix(…) instead.
- score_datatype
alias of
float64
- class matchms.similarity.FlashEntropy(*args, normalize_to_half: bool = True, **kwargs)[source]
Bases: _BaseFlashSimilarity
Flash entropy similarity (Li & Fiehn, 2023) with a fast .matrix() that builds a library-wide index over ‘queries’ and streams all ‘references’ through it.
- Key options:
matching_mode: ‘fragment’, ‘neutral_loss’, or ‘hybrid’ (fragment-priority).
tolerance in Da or symmetric ppm (use_ppm=True).
- cleanup: remove precursor & > (precursor_mz - 1.6), 1% noise removal,
entropy weighting, normalize ∑I’ = 0.5, optional within-peak merge.
- Notes:
.pair() works but is not the fast path. Use .matrix().
For identity-search behavior, pass identity_precursor_tolerance (Da or ppm).
- Parameters:
matching_mode – Matching mode: ‘fragment’, ‘neutral_loss’, or ‘hybrid’ (default is ‘fragment’).
tolerance – Matching tolerance in Da or ppm (use_ppm=True). Default is 0.02.
use_ppm – If True, interpret tolerance as parts-per-million. Default is False.
remove_precursor – If True, remove precursor peak and peaks within precursor_window. Default is False.
precursor_window – If remove_precursor is True, remove peaks within this window around the precursor m/z. Default is 1.6 Da, as suggested by Li & Fiehn (2023).
noise_cutoff – If > 0, remove peaks with intensities below this fraction of the maximum intensity. Default is 0.01 (1%).
normalize_to_half – If True, normalize intensities such that the sum of intensities is 0.5. Default is True.
merge_within – If > 0, merge peaks within this distance (in Da) to a single peak. Default is 0.
identity_precursor_tolerance – If not None, enforce identity search behavior by requiring the precursor m/z of the query to be within this tolerance of the reference precursor m/z.
identity_use_ppm – If True, interpret identity_precursor_tolerance as ppm. Default is False.
dtype – Data type for the output scores. Default is np.float64, which has enough precision for even the highest-resolution MS/MS data (far beyond current MS/MS possibilities). To save memory, np.float32 can be used instead, which is sufficient for peak resolutions up to about 8,000,000.
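The cleanup and scoring steps described above can be illustrated with a minimal pure-Python sketch. It assumes peaks are given as (m/z, intensity) tuples whose m/z values within one spectrum are further apart than the tolerance. Only the noise cutoff and the normalization to a summed intensity of 0.5 are applied; precursor removal, entropy weighting, and peak merging are left out. The helper names `clean_peaks` and `entropy_similarity` are hypothetical and not part of matchms, and the formula is the unweighted entropy similarity, which may differ in detail from the library implementation.

```python
import math


def clean_peaks(peaks, noise_cutoff=0.01, total=0.5):
    """Drop peaks below noise_cutoff * max intensity, then scale the sum to `total`."""
    if not peaks:
        return []
    max_intensity = max(intensity for _, intensity in peaks)
    kept = [
        (mz, intensity)
        for mz, intensity in peaks
        if intensity > 0 and intensity >= noise_cutoff * max_intensity
    ]
    intensity_sum = sum(intensity for _, intensity in kept)
    if intensity_sum == 0:
        return []
    # Sort by m/z so the two-pointer matching below can walk both lists once.
    return [(mz, intensity * total / intensity_sum) for mz, intensity in sorted(kept)]


def entropy_similarity(peaks_a, peaks_b, tolerance=0.02):
    """Unweighted entropy similarity of two peak lists (fragment matching only)."""
    a = clean_peaks(peaks_a)
    b = clean_peaks(peaks_b)
    score = 0.0
    i = j = 0
    while i < len(a) and j < len(b):
        mz_a, int_a = a[i]
        mz_b, int_b = b[j]
        if abs(mz_a - mz_b) <= tolerance:
            merged = int_a + int_b
            # Contribution of one matched pair: f(a + b) - f(a) - f(b), f(x) = x * log2(x)
            score += (
                merged * math.log2(merged)
                - int_a * math.log2(int_a)
                - int_b * math.log2(int_b)
            )
            i += 1
            j += 1
        elif mz_a < mz_b:
            i += 1
        else:
            j += 1
    return score


spectrum = [(100.0, 2.0), (200.0, 2.0)]
identical_score = entropy_similarity(spectrum, spectrum)
disjoint_score = entropy_similarity(spectrum, [(300.0, 1.0)])
```

Because each cleaned spectrum sums to 0.5, two identical spectra give a score of 1 and two spectra without shared peaks give 0 in this sketch.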
- matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True, n_jobs: int = -1)[source]
Calculate matrix of Flash Entropy scores.
- Parameters:
spectra_1 – First collection of input spectra.
spectra_2 – Second collection of input spectra. If None, compare spectra_1 against itself.
score_fields – Requested score fields. Only ("score",) is supported.
progress_bar – When True, show a progress bar.
n_jobs – Number of parallel jobs to run. Default is -1, which means that all available CPUs minus one will be used.
- Returns:
Dense score matrix as a Scores object.
- Return type:
Scores
- pair(spectrum_1: Spectrum, spectrum_2: Spectrum) ndarray[source]
Compute Flash Entropy for a single (reference, query) pair. Uses the same preprocessing and scoring logic as the matrix path, but builds a tiny 1-spectrum library from the query.
Note: this is not the intended fast path; prefer .matrix() when scoring many pairs.
- score_datatype
alias of float32
- class matchms.similarity.MetadataMatch(field: str, matching_type: str = 'equal_match', tolerance: float = 0.1, tolerance_type: str = 'Dalton')[source]
Bases: BaseSimilarityWithSparse
Return True if metadata entries of a specified field match between two spectra.
This class can compare a wide range of metadata entries, and the resulting matches can later be used to select related or similar spectra.
Matching can be done by:
exact equality (matching_type="equal_match")
numerical difference within a tolerance (matching_type="difference")
For numerical differences, the tolerance can be interpreted as:
absolute difference in Dalton / raw units (tolerance_type="Dalton")
relative difference in ppm (tolerance_type="ppm")
Example: calculate scores between two lists of spectra and inspect the score matrix.
import numpy as np
from matchms import Spectrum
from matchms.similarity import MetadataMatch

# Four peak-less spectra that differ only in their metadata.
spectrum_1 = Spectrum(
    mz=np.array([]),
    intensities=np.array([]),
    metadata={"instrument_type": "orbitrap", "id": 1},
)
spectrum_2 = Spectrum(
    mz=np.array([]),
    intensities=np.array([]),
    metadata={"instrument_type": "qtof", "id": 2},
)
spectrum_3 = Spectrum(
    mz=np.array([]),
    intensities=np.array([]),
    metadata={"instrument_type": "qtof", "id": 3},
)
spectrum_4 = Spectrum(
    mz=np.array([]),
    intensities=np.array([]),
    metadata={"instrument_type": "orbitrap", "id": 4},
)
spectra_1 = [spectrum_1, spectrum_2]
spectra_2 = [spectrum_3, spectrum_4]

# Compare the "instrument_type" field by exact equality (the default).
similarity = MetadataMatch(field="instrument_type")
scores = similarity.matrix(spectra_1, spectra_2)
score_array = scores.to_array()

for i, spectrum_1 in enumerate(spectra_1):
    for j, spectrum_2 in enumerate(spectra_2):
        print(
            f"Metadata match between {spectrum_1.get('id')} and "
            f"{spectrum_2.get('id')} is {bool(score_array[i, j])}"
        )
Should output
Metadata match between 1 and 3 is False
Metadata match between 1 and 4 is True
Metadata match between 2 and 3 is True
Metadata match between 2 and 4 is False
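The matching rules described above (exact equality, or a numerical difference within a Dalton or ppm tolerance) can be sketched for a single pair of values as follows. This is an illustration only: the function `metadata_match` is hypothetical, treating missing values as non-matching is an assumption, and so is the choice to express the ppm difference relative to the mean of the two values.

```python
def metadata_match(value_1, value_2, matching_type="equal_match",
                   tolerance=0.1, tolerance_type="Dalton"):
    """Return True if two metadata values match under the chosen rule."""
    if value_1 is None or value_2 is None:
        # Assumption: a missing entry never counts as a match.
        return False
    if matching_type == "equal_match":
        return value_1 == value_2
    if matching_type != "difference":
        raise ValueError(f"Unknown matching_type: {matching_type}")

    difference = abs(float(value_1) - float(value_2))
    if tolerance_type == "Dalton":
        # Absolute difference in raw units.
        return difference <= tolerance
    if tolerance_type == "ppm":
        # Assumption: relative difference with respect to the mean of both values.
        mean_value = abs(float(value_1) + float(value_2)) / 2
        if mean_value == 0:
            return difference == 0
        return difference / mean_value * 1e6 <= tolerance
    raise ValueError(f"Unknown tolerance_type: {tolerance_type}")


same_instrument = metadata_match("orbitrap", "orbitrap")
close_retention_time = metadata_match(5.0, 5.05, "difference", tolerance=0.1)
within_10_ppm = metadata_match(500.0, 500.004, "difference", 10, "ppm")
within_5_ppm = metadata_match(500.0, 500.004, "difference", 5, "ppm")
```

In the last two calls the values differ by roughly 8 ppm, so the pair matches at a 10 ppm tolerance but not at 5 ppm.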
- __init__(field: str, matching_type: str = 'equal_match', tolerance: float = 0.1, tolerance_type: str = 'Dalton')[source]
- Parameters:
field – Specify field name for metadata that should be compared.
matching_type – Specify how field entries should be matched. Can be one of ["equal_match", "difference"].
"equal_match": entries must be exactly equal (default).
"difference": entries are considered a match if their numerical difference is less than or equal to tolerance.
tolerance – Specify the tolerance within which two values are counted as a match. This only applies to numerical values.
tolerance_type – Choose between a fixed tolerance in Dalton / raw units ("Dalton") or a relative difference in ppm ("ppm"). This only applies when matching_type="difference".
- keep_score(score) bool
Return whether a score should be retained in sparse outputs.
This defines the default sparse retention behavior. Users can override it per call via score_filter=....
Default behavior:
- scalar score: keep if score != 0
- structured score: keep if all fields are non-zero
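The default retention rule stated above, and the effect of overriding it with a custom filter, can be sketched in plain Python. Structured scores are modelled here as `(score, matches)` tuples and the sparse result as a dictionary keyed by `(row, column)`; both representations, as well as the helper `collect_sparse`, are assumptions made for illustration and do not reflect how matchms stores scores internally.

```python
def keep_score(score):
    """Default rule: keep non-zero scalars, or structured scores with no zero field."""
    if isinstance(score, tuple):
        return all(field != 0 for field in score)
    return score != 0


def collect_sparse(dense_scores, score_filter=None):
    """Keep only the entries accepted by score_filter (or by keep_score if None)."""
    keep = score_filter if score_filter is not None else keep_score
    return {
        (i, j): score
        for i, row in enumerate(dense_scores)
        for j, score in enumerate(row)
        if keep(score)
    }


# Toy structured scores: (cosine score, number of matched peaks)
scores = [
    [(0.9, 3), (0.0, 0)],
    [(0.4, 1), (0.7, 0)],
]

default_kept = collect_sparse(scores)
strict_kept = collect_sparse(
    scores, score_filter=lambda s: s[0] >= 0.5 and s[1] >= 2
)
```

With the default rule, the entry `(0.7, 0)` is dropped because one of its fields is zero, while the stricter custom filter keeps only `(0.9, 3)`.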
- matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True) Scores[source]
Compare metadata entries between all spectra in spectra_1 and spectra_2.
- Parameters:
spectra_1 – First collection of input spectra.
spectra_2 – Second collection of input spectra. If None, compare spectra_1 against itself.
score_fields – Requested score fields. Only ("score",) is supported.
progress_bar – Included for API compatibility. Not used here because this optimized implementation does not iterate pairwise in Python.
- pair(spectrum_1: Spectrum, spectrum_2: Spectrum)[source]
Compare metadata entries between two spectra.
- Parameters:
spectrum_1 – First spectrum.
spectrum_2 – Second spectrum.
- sparse_matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, idx_row=None, idx_col=None, score_fields: Sequence[str] | None = None, score_filter: Callable[[ndarray], bool] | None = None, progress_bar: bool = True) Scores[source]
Compare metadata entries and return sparse scores.
This method uses optimized metadata matching when no explicit indices are provided. If explicit idx_row and idx_col are given, it falls back to the generic sparse implementation from BaseSimilarityWithSparse.
- class matchms.similarity.ModifiedCosine(tolerance: float = 0.1, intensity_power: float = 1.0, use_hungarian: bool = False, noise_cutoff: float = 0.01)[source]
Bases: BaseSimilarity
Calculate an approximate modified cosine score between mass spectra.
This is the central Modified Cosine class in matchms. The Modified Cosine score aims at quantifying the similarity between two mass spectra. Two peaks are considered a potential match if their m/z ratios lie within the given tolerance, or if their m/z ratios lie within the tolerance once a mass shift is applied. The mass shift is the difference in precursor m/z between the two spectra.
Matchms provides various implementations of the Modified Cosine score, which are combined here in what we believe to be the typical best choice for most users.
By default, the parameter use_hungarian is set to False, which means that the greedy algorithm is used to find the best matches. This is typically faster than the Hungarian algorithm, and for most applications the results are very similar. If you need the exact optimal solution, you can set use_hungarian to True, which will use the Hungarian algorithm to find the best matches.
For more conceptual context, see Watrous et al. [PNAS, 2012, https://www.pnas.org/content/109/26/E1743].
- __init__(tolerance: float = 0.1, intensity_power: float = 1.0, use_hungarian: bool = False, noise_cutoff: float = 0.01)[source]
Initialize the modified cosine score class.
- Parameters:
tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.
use_hungarian – Whether to use the Hungarian algorithm to find the best matches. The default is False, which means that the greedy algorithm is used to find the best matches. The greedy algorithm is typically faster than the Hungarian algorithm, and for most applications the results are very similar.
noise_cutoff – Minimum relative intensity for a peak to be considered. Default is 0.01. Will only be used if use_hungarian is False.
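To make the matching rule concrete, here is a minimal pure-Python sketch of a greedy modified cosine for one pair of spectra. It assumes peaks are (m/z, intensity) tuples, takes the mass shift as the precursor m/z of the first spectrum minus that of the second, and leaves out the noise cutoff. The function `modified_cosine_greedy` is a hypothetical stand-in for illustration, not the matchms implementation.

```python
import math


def modified_cosine_greedy(peaks_1, peaks_2, precursor_mz_1, precursor_mz_2,
                           tolerance=0.1, intensity_power=1.0):
    """Return (score, number of matched peaks) using greedy peak assignment."""
    shift = precursor_mz_1 - precursor_mz_2

    # Collect candidate pairs: direct matches and matches after the mass shift.
    candidates = []
    for i, (mz_1, int_1) in enumerate(peaks_1):
        for j, (mz_2, int_2) in enumerate(peaks_2):
            direct = abs(mz_1 - mz_2) <= tolerance
            shifted = abs(mz_1 - (mz_2 + shift)) <= tolerance
            if direct or shifted:
                weight = (int_1 ** intensity_power) * (int_2 ** intensity_power)
                candidates.append((weight, i, j))

    # Greedy assignment: take the heaviest pair first, use each peak at most once.
    candidates.sort(reverse=True)
    used_1, used_2 = set(), set()
    total, n_matches = 0.0, 0
    for weight, i, j in candidates:
        if i not in used_1 and j not in used_2:
            used_1.add(i)
            used_2.add(j)
            total += weight
            n_matches += 1

    norm_1 = math.sqrt(sum(int_1 ** (2 * intensity_power) for _, int_1 in peaks_1))
    norm_2 = math.sqrt(sum(int_2 ** (2 * intensity_power) for _, int_2 in peaks_2))
    if norm_1 == 0 or norm_2 == 0:
        return 0.0, 0
    return total / (norm_1 * norm_2), n_matches


peaks_a = [(100.0, 1.0), (214.0, 0.5)]
peaks_b = [(100.0, 1.0), (200.0, 0.5)]
with_shift = modified_cosine_greedy(peaks_a, peaks_b, 314.0, 300.0)
without_shift = modified_cosine_greedy(peaks_a, peaks_b, 300.0, 300.0)
```

In this toy case the peak at m/z 214 only finds its partner at m/z 200 once the precursor difference of 14 is taken into account, so `with_shift` reports two matched peaks and `without_shift` only one.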
- matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True, n_jobs: int = -1)[source]
Calculate matrix of Modified Cosine scores.
- Parameters:
spectra_1 – First collection of input spectra.
spectra_2 – Second collection of input spectra. If None, compare spectra_1 against itself.
score_fields – Requested score fields. Only ("score",) is supported.
progress_bar – When True, show a progress bar.
n_jobs – Number of parallel jobs to run. Default is -1, which means that all available CPUs minus one will be used.
- Returns:
Dense score matrix as a Scores object.
- Return type:
Scores
- pair(spectrum_1: Spectrum, spectrum_2: Spectrum) tuple[float, int][source]
Calculate approximate modified cosine score between two spectra.
- class matchms.similarity.ModifiedCosineGreedy(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0, noise_cutoff: float = 0.01)[source]
Bases: BaseSimilarityWithSparse
Calculate an approximate modified cosine score between mass spectra.
This implementation solves the peak assignment in a greedy way and is therefore an approximation. See ModifiedCosineHungarian for the exact assignment variant.
The modified cosine score aims at quantifying the similarity between two mass spectra. Two peaks are considered a potential match if their m/z ratios lie within the given tolerance, or if their m/z ratios lie within the tolerance once a mass-shift is applied. The mass shift is the difference in precursor m/z between the two spectra.
See Watrous et al. [PNAS, 2012, https://www.pnas.org/content/109/26/E1743] for further details.
Unlike in matchms < 1.0, this method also applies a noise filter by default, which removes peaks with intensity below a certain cutoff. This is typically highly beneficial for the performance of the greedy algorithm, and for most applications the results are very similar to the exact assignment variant. If you want to disable this noise filtering, you can set noise_cutoff to 0 or None.
- __init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0, noise_cutoff: float = 0.01)[source]
Initialize approximate modified cosine.
- Parameters:
tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.
mz_power – The power to raise mz to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.
noise_cutoff – Minimum relative intensity for a peak to be considered. Default is 0.01.
- keep_score(score) bool
Return whether a score should be retained in sparse outputs.
This defines the default sparse retention behavior. Users can override it per call via score_filter=....
Default behavior:
- scalar score: keep if score != 0
- structured score: keep if all fields are non-zero
- matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True)
Calculate a dense similarity matrix.
- Parameters:
spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself. For commutative similarities this automatically uses a symmetric optimization.
score_fields – Score fields to return.
- None means return all available fields.
- For scalar scores, only ("score",) is valid.
- For structured scores, this can be a subset such as ("score",).
progress_bar – When True, show a progress bar. Default is True.
- Returns:
Dense score result wrapped in a Scores container.
- Return type:
Scores
- pair(spectrum_1: Spectrum, spectrum_2: Spectrum) tuple[float, int][source]
Calculate approximate modified cosine score between two spectra.
- sparse_matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, idx_row: ArrayLike | None = None, idx_col: ArrayLike | None = None, score_fields: Sequence[str] | None = None, score_filter: Callable[[ndarray], bool] | None = None, progress_bar: bool = True)
Calculate sparse similarity results.
Filtering is applied to the full score before score field projection.
- Parameters:
spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself.
idx_row – Row indices of pairs to compute. If None and idx_col is also None, all pairwise comparisons are considered and only retained scores are stored.
idx_col – Column indices of pairs to compute. Must have the same shape as idx_row.
score_fields – Score fields to return.
- None means return all available fields.
- For scalar scores, only ("score",) is valid.
- For structured scores, this can be a subset such as ("score",).
score_filter – Optional callable receiving the full score and returning whether it should be retained. If None, keep_score() is used.
progress_bar – When True, show a progress bar.
- Returns:
Sparse score result wrapped in a Scores container.
- Return type:
Scores
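The interplay of explicit pair indices, filtering, and field projection can be illustrated with a small standalone sketch. Only the pairs listed in `idx_row` and `idx_col` are scored, the filter sees the full structured score, and the requested field is extracted afterwards. The names `sparse_pair_scores` and `toy_pair`, the dictionary-based score, and the set-overlap scoring are all hypothetical and chosen purely for illustration.

```python
def sparse_pair_scores(pair_function, items_1, items_2, idx_row, idx_col,
                       score_filter, field="score"):
    """Score only the requested (row, column) pairs and keep the filtered ones."""
    if len(idx_row) != len(idx_col):
        raise ValueError("idx_row and idx_col must have the same length.")
    kept = {}
    for row, col in zip(idx_row, idx_col):
        full_score = pair_function(items_1[row], items_2[col])
        # The filter is applied to the full score, before projecting to one field.
        if score_filter(full_score):
            kept[(row, col)] = full_score[field]
    return kept


def toy_pair(peaks_1, peaks_2):
    """Toy structured score: overlap fraction plus the number of shared m/z values."""
    shared = len(set(peaks_1) & set(peaks_2))
    union = len(set(peaks_1) | set(peaks_2))
    return {"score": shared / union if union else 0.0, "matches": shared}


spectra_1 = [(100, 150, 200), (100, 300)]
spectra_2 = [(100, 150, 250), (300, 400)]

result = sparse_pair_scores(
    toy_pair,
    spectra_1,
    spectra_2,
    idx_row=[0, 1, 1],
    idx_col=[0, 0, 1],
    score_filter=lambda s: s["matches"] >= 2,
)
```

Although only the "score" field is returned, the filter can still use "matches", which is why filtering before projection matters. The pair (0, 1) is never scored because it is not listed in the index arrays.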
- class matchms.similarity.ModifiedCosineHungarian(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]
Bases: BaseSimilarityWithSparse
Calculate exact modified cosine score between mass spectra.
The modified cosine score quantifies similarity between two mass spectra with optional precursor-based mass shift. Potential matches are all peak pairs that are within tolerance either unshifted or shifted by precursor_mz(reference) - precursor_mz(query).
Peak assignment is solved globally via Hungarian assignment (linear sum assignment), which yields an exact one-to-one maximum-weight matching.
See Watrous et al. [PNAS, 2012, https://www.pnas.org/content/109/26/E1743] for the modified cosine concept.
- __init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]
Initialize exact modified cosine.
- Parameters:
tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.
mz_power – The power to raise mz to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.
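The difference between greedy and exact assignment can be seen on a tiny weight matrix, where entry `[i][j]` is the intensity product of candidate pair (i, j) and 0 marks pairs outside the tolerance. The sketch below uses a brute-force search over permutations as a stand-in for a Hungarian solver, which is only practical for very small inputs, and it assumes the matrix has no more rows than columns. Both helper functions are hypothetical and not part of matchms.

```python
from itertools import permutations


def greedy_assignment(weights):
    """Pick the heaviest remaining pair until no unused row/column pair is left."""
    pairs = sorted(
        ((w, r, c) for r, row in enumerate(weights) for c, w in enumerate(row) if w > 0),
        reverse=True,
    )
    used_rows, used_cols = set(), set()
    total = 0.0
    for w, r, c in pairs:
        if r not in used_rows and c not in used_cols:
            used_rows.add(r)
            used_cols.add(c)
            total += w
    return total


def exact_assignment(weights):
    """Brute-force maximum-weight one-to-one assignment (rows <= columns assumed)."""
    n_rows, n_cols = len(weights), len(weights[0])
    best = 0.0
    for cols in permutations(range(n_cols), n_rows):
        total = sum(weights[r][c] for r, c in enumerate(cols))
        best = max(best, total)
    return best


weights = [
    [0.9, 0.8],
    [0.8, 0.0],
]
greedy_total = greedy_assignment(weights)
exact_total = exact_assignment(weights)
```

Here the greedy strategy commits to the 0.9 pair first and then has nothing left to match, whereas the exact assignment pairs the two 0.8 entries for a larger total. Such cases are what separates the approximate from the exact variant.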
- keep_score(score) bool
Return whether a score should be retained in sparse outputs.
This defines the default sparse retention behavior. Users can override it per call via score_filter=....
Default behavior:
- scalar score: keep if score != 0
- structured score: keep if all fields are non-zero
- matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True)
Calculate a dense similarity matrix.
- Parameters:
spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself. For commutative similarities this automatically uses a symmetric optimization.
score_fields – Score fields to return.
- None means return all available fields.
- For scalar scores, only ("score",) is valid.
- For structured scores, this can be a subset such as ("score",).
progress_bar – When True, show a progress bar. Default is True.
- Returns:
Dense score result wrapped in a Scores container.
- Return type:
Scores
- pair(spectrum_1: Spectrum, spectrum_2: Spectrum) tuple[float, int][source]
Calculate exact modified cosine score between two spectra.
- sparse_matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, idx_row: ArrayLike | None = None, idx_col: ArrayLike | None = None, score_fields: Sequence[str] | None = None, score_filter: Callable[[ndarray], bool] | None = None, progress_bar: bool = True)
Calculate sparse similarity results.
Filtering is applied to the full score before score field projection.
- Parameters:
spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself.
idx_row – Row indices of pairs to compute. If None and idx_col is also None, all pairwise comparisons are considered and only retained scores are stored.
idx_col – Column indices of pairs to compute. Must have the same shape as idx_row.
score_fields – Score fields to return.
- None means return all available fields.
- For scalar scores, only ("score",) is valid.
- For structured scores, this can be a subset such as ("score",).
score_filter – Optional callable receiving the full score and returning whether it should be retained. If None, keep_score() is used.
progress_bar – When True, show a progress bar.
- Returns:
Sparse score result wrapped in a Scores container.
- Return type:
Scores
- class matchms.similarity.NeutralLossesCosine(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0, ignore_peaks_above_precursor: bool = True)[source]
Bases: BaseSimilarityWithSparse
Calculate ‘neutral losses cosine score’ between mass spectra.
The neutral losses cosine score aims at quantifying the similarity between two mass spectra. The score is calculated by finding the best possible matches between peaks of two spectra. Two peaks are considered a potential match if their m/z ratios lie within the given ‘tolerance’ once a mass shift is applied. The mass shift is the difference in precursor m/z between the two spectra. In general, ModifiedCosineGreedy is recommended over NeutralLossesCosine because it will on average deliver more reliable results.
- __init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0, ignore_peaks_above_precursor: bool = True)[source]
- Parameters:
tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.
mz_power – The power to raise mz to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.
ignore_peaks_above_precursor – By default this is set to True, meaning that peaks with m/z values larger than the precursor-m/z will be ignored (since those would correspond to negative “neutral losses”).
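A short sketch may help to show what matching on neutral losses means. Each peak is converted to a loss (precursor m/z minus peak m/z), peaks above the precursor are optionally dropped, and losses of two spectra are then compared within the tolerance. The helper names are hypothetical, peaks are assumed to be (m/z, intensity) tuples, and the sketch stops at listing candidate pairs rather than computing a full score.

```python
def neutral_losses(peaks, precursor_mz, ignore_peaks_above_precursor=True):
    """Convert (m/z, intensity) peaks into (neutral loss, intensity) pairs."""
    losses = []
    for mz, intensity in peaks:
        if ignore_peaks_above_precursor and mz > precursor_mz:
            # Such a peak would correspond to a negative "neutral loss".
            continue
        losses.append((precursor_mz - mz, intensity))
    return losses


def matching_loss_pairs(losses_1, losses_2, tolerance=0.1):
    """Return index pairs whose neutral losses agree within the tolerance."""
    return [
        (i, j)
        for i, (loss_1, _) in enumerate(losses_1)
        for j, (loss_2, _) in enumerate(losses_2)
        if abs(loss_1 - loss_2) <= tolerance
    ]


spectrum_a = [(100.0, 1.0), (250.0, 0.4), (350.0, 0.2)]  # precursor m/z 300
spectrum_b = [(114.0, 0.8), (200.0, 0.5)]                # precursor m/z 314

losses_a = neutral_losses(spectrum_a, precursor_mz=300.0)
losses_b = neutral_losses(spectrum_b, precursor_mz=314.0)
candidate_pairs = matching_loss_pairs(losses_a, losses_b)
```

The peaks at m/z 100 and m/z 114 both correspond to a neutral loss of 200, so they form the only candidate pair, and the peak at m/z 350 is ignored because it lies above its precursor.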
- keep_score(score) bool
Return whether a score should be retained in sparse outputs.
This defines the default sparse retention behavior. Users can override it per call via score_filter=....
Default behavior:
- scalar score: keep if score != 0
- structured score: keep if all fields are non-zero
- matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True)
Calculate a dense similarity matrix.
- Parameters:
spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself. For commutative similarities this automatically uses a symmetric optimization.
score_fields – Score fields to return.
- None means return all available fields.
- For scalar scores, only ("score",) is valid.
- For structured scores, this can be a subset such as ("score",).
progress_bar – When True, show a progress bar. Default is True.
- Returns:
Dense score result wrapped in a Scores container.
- Return type:
Scores
- pair(spectrum_1: Spectrum, spectrum_2: Spectrum) tuple[float, int][source]
Calculate neutral losses cosine score between two spectra.
- Parameters:
spectrum_1 – Single reference spectrum.
spectrum_2 – Single query spectrum.
- Returns:
Tuple with cosine score and number of matched peaks.
- sparse_matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, idx_row: ArrayLike | None = None, idx_col: ArrayLike | None = None, score_fields: Sequence[str] | None = None, score_filter: Callable[[ndarray], bool] | None = None, progress_bar: bool = True)
Calculate sparse similarity results.
Filtering is applied to the full score before score field projection.
- Parameters:
spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself.
idx_row – Row indices of pairs to compute. If None and idx_col is also None, all pairwise comparisons are considered and only retained scores are stored.
idx_col – Column indices of pairs to compute. Must have the same shape as idx_row.
score_fields – Score fields to return.
- None means return all available fields.
- For scalar scores, only ("score",) is valid.
- For structured scores, this can be a subset such as ("score",).
score_filter – Optional callable receiving the full score and returning whether it should be retained. If None, keep_score() is used.
progress_bar – When True, show a progress bar.
- Returns:
Sparse score result wrapped in a Scores container.
- Return type:
Scores
- class matchms.similarity.ParentMassMatch(tolerance: float = 0.1)[source]
Bases: MetadataMatch
Return True if spectra match in parent mass, and False otherwise.
- __init__(tolerance: float = 0.1)[source]
- Parameters:
tolerance – Specify the tolerance below which two parent masses are counted as a match.
- keep_score(score) bool
Return whether a score should be retained in sparse outputs.
This defines the default sparse retention behavior. Users can override it per call via score_filter=....
Default behavior:
- scalar score: keep if score != 0
- structured score: keep if all fields are non-zero
- matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True) Scores
Compare metadata entries between all spectra in spectra_1 and spectra_2.
- Parameters:
spectra_1 – First collection of input spectra.
spectra_2 – Second collection of input spectra. If None, compare spectra_1 against itself.
score_fields – Requested score fields. Only ("score",) is supported.
progress_bar – Included for API compatibility. Not used here because this optimized implementation does not iterate pairwise in Python.
- pair(spectrum_1: Spectrum, spectrum_2: Spectrum)
Compare metadata entries between two spectra.
- Parameters:
spectrum_1 – First spectrum.
spectrum_2 – Second spectrum.
- sparse_matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, idx_row=None, idx_col=None, score_fields: Sequence[str] | None = None, score_filter: Callable[[ndarray], bool] | None = None, progress_bar: bool = True) Scores
Compare metadata entries and return sparse scores.
This method uses optimized metadata matching when no explicit indices are provided. If explicit idx_row and idx_col are given, it falls back to the generic sparse implementation from BaseSimilarityWithSparse.
- class matchms.similarity.PrecursorMzMatch(tolerance: float = 0.1, tolerance_type: str = 'Dalton')[source]
Bases: MetadataMatch
Return True if spectra match in precursor m/z, and False otherwise.
The match within tolerance can be calculated based on an absolute m/z difference (tolerance_type="Dalton") or based on a relative difference in ppm (tolerance_type="ppm").
- __init__(tolerance: float = 0.1, tolerance_type: str = 'Dalton')[source]
- Parameters:
tolerance – Specify the tolerance below which two precursor m/z values are counted as a match.
tolerance_type – Choose between a fixed tolerance in Dalton ("Dalton") or a relative difference in ppm ("ppm").
- keep_score(score) bool
Return whether a score should be retained in sparse outputs.
This defines the default sparse retention behavior. Users can override it per call via score_filter=....
Default behavior:
- scalar score: keep if score != 0
- structured score: keep if all fields are non-zero
- matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True) Scores
Compare metadata entries between all spectra in spectra_1 and spectra_2.
- Parameters:
spectra_1 – First collection of input spectra.
spectra_2 – Second collection of input spectra. If None, compare spectra_1 against itself.
score_fields – Requested score fields. Only ("score",) is supported.
progress_bar – Included for API compatibility. Not used here because this optimized implementation does not iterate pairwise in Python.
- pair(spectrum_1: Spectrum, spectrum_2: Spectrum)
Compare metadata entries between two spectra.
- Parameters:
spectrum_1 – First spectrum.
spectrum_2 – Second spectrum.
- sparse_matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, idx_row=None, idx_col=None, score_fields: Sequence[str] | None = None, score_filter: Callable[[ndarray], bool] | None = None, progress_bar: bool = True) Scores
Compare metadata entries and return sparse scores.
This method uses optimized metadata matching when no explicit indices are provided. If explicit idx_row and idx_col are given, it falls back to the generic sparse implementation from BaseSimilarityWithSparse.
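As an illustration of how tolerance-based matching can avoid scoring every pair individually, the sketch below sorts one list of precursor m/z values and uses binary search to find all values within an absolute (Dalton) tolerance. This is one possible approach under stated assumptions, not a description of how matchms implements its optimized path, and the function `precursor_matches` is hypothetical.

```python
from bisect import bisect_left, bisect_right


def precursor_matches(precursors_1, precursors_2, tolerance=0.1):
    """Return sorted (row, column) index pairs with |mz_1 - mz_2| <= tolerance."""
    # Sort the second list once, remembering the original positions.
    order = sorted(range(len(precursors_2)), key=precursors_2.__getitem__)
    sorted_values = [precursors_2[k] for k in order]

    matches = []
    for i, value in enumerate(precursors_1):
        # Binary search for the window [value - tolerance, value + tolerance].
        low = bisect_left(sorted_values, value - tolerance)
        high = bisect_right(sorted_values, value + tolerance)
        matches.extend((i, order[k]) for k in range(low, high))
    return sorted(matches)


precursors_1 = [100.0, 250.0]
precursors_2 = [250.05, 99.95, 400.0]
matched_pairs = precursor_matches(precursors_1, precursors_2, tolerance=0.1)
```

The returned index pairs are exactly the entries that a sparse result would need to store, so non-matching pairs never have to be visited.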
Submodules
- matchms.similarity.BaseEmbeddingSimilarity module
BaseEmbeddingSimilarity: BaseEmbeddingSimilarity.index, BaseEmbeddingSimilarity.index_backend, BaseEmbeddingSimilarity.index_kwargs, BaseEmbeddingSimilarity.index_k, BaseEmbeddingSimilarity.__init__(), BaseEmbeddingSimilarity.build_ann_index(), BaseEmbeddingSimilarity.compute_embeddings(), BaseEmbeddingSimilarity.compute_similarity_matrix_from_embeddings(), BaseEmbeddingSimilarity.get_anns(), BaseEmbeddingSimilarity.get_embeddings(), BaseEmbeddingSimilarity.get_index_anns(), BaseEmbeddingSimilarity.is_structured_score, BaseEmbeddingSimilarity.load_ann_index(), BaseEmbeddingSimilarity.load_embeddings(), BaseEmbeddingSimilarity.matrix(), BaseEmbeddingSimilarity.pair(), BaseEmbeddingSimilarity.save_ann_index(), BaseEmbeddingSimilarity.score_datatype, BaseEmbeddingSimilarity.sparse_matrix(), BaseEmbeddingSimilarity.store_embeddings(), BaseEmbeddingSimilarity.to_dict()
- matchms.similarity.BaseSimilarity module
- matchms.similarity.BinnedEmbeddingSimilarity module
BinnedEmbeddingSimilarity: BinnedEmbeddingSimilarity.__init__(), BinnedEmbeddingSimilarity.build_ann_index(), BinnedEmbeddingSimilarity.compute_embeddings(), BinnedEmbeddingSimilarity.compute_similarity_matrix_from_embeddings(), BinnedEmbeddingSimilarity.get_anns(), BinnedEmbeddingSimilarity.get_embeddings(), BinnedEmbeddingSimilarity.get_index_anns(), BinnedEmbeddingSimilarity.is_structured_score, BinnedEmbeddingSimilarity.load_ann_index(), BinnedEmbeddingSimilarity.load_embeddings(), BinnedEmbeddingSimilarity.matrix(), BinnedEmbeddingSimilarity.n_bins, BinnedEmbeddingSimilarity.pair(), BinnedEmbeddingSimilarity.save_ann_index(), BinnedEmbeddingSimilarity.score_datatype, BinnedEmbeddingSimilarity.sparse_matrix(), BinnedEmbeddingSimilarity.store_embeddings(), BinnedEmbeddingSimilarity.to_dict()
- matchms.similarity.Cosine module
- matchms.similarity.CosineBlink module
- matchms.similarity.CosineGreedy module
- matchms.similarity.CosineHungarian module
- matchms.similarity.CosineLinear module
- matchms.similarity.FingerprintSimilarity module
- matchms.similarity.FlashSimilarity module
- matchms.similarity.MetadataMatch module
- matchms.similarity.ModifiedCosine module
- matchms.similarity.ModifiedCosineGreedy module
- matchms.similarity.ModifiedCosineHungarian module
- matchms.similarity.NeutralLossesCosine module
- matchms.similarity.ParentMassMatch module
- matchms.similarity.PrecursorMzMatch module
- matchms.similarity.cosine_linear_functions module
- matchms.similarity.flash_utils module
- matchms.similarity.spectrum_similarity_functions module
- matchms.similarity.vector_similarity_functions module