matchms.similarity package
Functions for computing spectra similarities
Matchms provides a number of frequently used similarity scores to compare mass spectra. These include:
- scores based on comparing peak positions and intensities (CosineGreedy, ModifiedCosineGreedy, ModifiedCosineHungarian)
- simple scores that only assess precursor m/z or parent mass matches (PrecursorMzMatch or ParentMassMatch)
- scores assessing molecular similarity if structures (SMILES, InChIKey) are given as metadata (FingerprintSimilarity)
- a score for assessing matches in user-defined metadata fields, which can be used to find equal entries (e.g. instrument_type) or numerical values within a specified tolerance (e.g. retention_time, collision energy) (MetadataMatch)
It is also easy to add custom similarity measures or to import external ones (such as Spec2Vec).
- class matchms.similarity.BinnedEmbeddingSimilarity(similarity: str = 'cosine', max_mz: float = 1005, bin_width: float = 1, intensity_power: float = 1)[source]
Bases: BaseEmbeddingSimilarity
A similarity measure that bins spectra into a fixed number of bins and uses the binned intensities as embedding features. By default, the similarity between spectra is computed as the cosine similarity between their binned representations.
- Parameters:
similarity (str, optional) – The similarity measure to use for comparing embeddings. Default is “cosine”. Options are “cosine” or “euclidean”.
max_mz (float, optional) – The maximum m/z value to consider when binning. Default is 1005.
bin_width (float, optional) – The width of each bin in m/z units. Default is 1.
intensity_power (float, optional) – The power to which peak intensities are raised. Default is 1.
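As a rough illustration of the binning idea described above, the following sketch turns one peak list into a fixed-length vector and compares two such vectors. The helper names `bin_spectrum` and `cosine`, and the choice to sum intensities that fall into the same bin, are assumptions made for illustration and not the library's implementation.

```python
import numpy as np


def bin_spectrum(mz, intensities, max_mz=1005.0, bin_width=1.0, intensity_power=1.0):
    """Hypothetical helper: sum (powered) intensities into fixed-width m/z bins."""
    mz = np.asarray(mz, dtype=float)
    intensities = np.asarray(intensities, dtype=float) ** intensity_power
    n_bins = int(np.ceil(max_mz / bin_width))
    embedding = np.zeros(n_bins)
    keep = (mz >= 0) & (mz < max_mz)  # peaks at or beyond max_mz are ignored
    bin_idx = np.floor(mz[keep] / bin_width).astype(int)
    np.add.at(embedding, bin_idx, intensities[keep])  # accumulate repeated bins
    return embedding


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


a = bin_spectrum([100.2, 150.7, 200.1], [0.7, 0.2, 0.1])
b = bin_spectrum([100.4, 140.3, 190.9], [0.4, 0.2, 0.1])
print(a.shape, round(cosine(a, b), 2))
```

With a bin width of 1, the two peaks near m/z 100 land in the same bin, so only that bin contributes to the dot product.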
- __init__(similarity: str = 'cosine', max_mz: float = 1005, bin_width: float = 1, intensity_power: float = 1)[source]
- build_ann_index(reference_spectra: Iterable[Spectrum] | None = None, embeddings_path: str | Path | None = None, k: int = 100, index_backend: str = 'pynndescent', **index_kwargs) Any
Build an ANN index for the reference spectra.
- Parameters:
reference_spectra (Optional[Iterable[SpectrumType]]) – List of reference spectra to build the ANN index for.
embeddings_path (Optional[Union[str, Path]]) – If embeddings are already computed, provide the path to the numpy file.
k (int, optional) – Number of nearest neighbors to use for the ANN index.
index_backend (str, optional) – Backend to use for ANN index. Currently only “pynndescent” is supported.
**index_kwargs – Additional keyword arguments passed to the index constructor.
- Returns:
The constructed ANN index.
- Return type:
Any
- Raises:
ImportError – If pynndescent is not installed.
ValueError – If an unsupported index_backend is specified.
- compute_embeddings(spectra: Iterable[Spectrum]) ndarray[source]
Convert spectra into binned embeddings.
- Parameters:
spectra (Iterable[SpectrumType]) – The spectra to convert into embeddings.
- Returns:
Array of shape (n_spectra, n_bins) containing the binned embeddings.
- Return type:
np.ndarray
- get_anns(query_spectra: Iterable[Spectrum] | ndarray, k: int = 100) Tuple[ndarray, ndarray]
Get approximate nearest neighbors for query spectra.
- Parameters:
query_spectra (Union[Iterable[SpectrumType], np.ndarray]) – Query spectra or their embeddings.
k (int, optional) – Number of nearest neighbors to return.
- Returns:
Neighbor indices and similarity scores.
- Return type:
Tuple[np.ndarray, np.ndarray]
- Raises:
ValueError – If no index is built or k is larger than index k.
- get_embeddings(spectra: Iterable[Spectrum] | None = None, npy_path: str | Path | None = None) ndarray
Get embeddings either by computing them or loading from disk.
- Parameters:
spectra – List of spectra to compute embeddings for.
npy_path – Path to load/save embeddings from/to. If provided, embeddings are loaded from disk if it exists, otherwise they are computed and saved on disk to the provided path.
- Returns:
Embeddings array.
- Return type:
np.ndarray
- Raises:
ValueError – If neither spectra nor npy_path is provided.
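The load-or-compute behaviour described for npy_path can be pictured with a small caching helper. This is only a sketch under the assumption that embeddings are stored with `numpy.save`; the function name `load_or_compute` and its `compute_fn` argument are hypothetical.

```python
from pathlib import Path
import tempfile

import numpy as np


def load_or_compute(compute_fn, npy_path=None):
    """Hypothetical helper mirroring the load/compute/save behaviour described above."""
    if npy_path is not None and Path(npy_path).exists():
        return np.load(npy_path)          # reuse cached embeddings
    embeddings = compute_fn()             # compute from scratch
    if npy_path is not None:
        np.save(npy_path, embeddings)     # cache for the next call
    return embeddings


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embeddings.npy"
    first = load_or_compute(lambda: np.arange(6.0).reshape(2, 3), path)
    # The second call ignores compute_fn because the file now exists.
    second = load_or_compute(lambda: np.zeros((2, 3)), path)
    print(np.array_equal(first, second))
```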
- get_index_anns() Tuple[ndarray, ndarray]
Get nearest neighbors for all points in the index.
- Returns:
Neighbor indices and similarity scores.
- Return type:
Tuple[np.ndarray, np.ndarray]
- Raises:
ValueError – If unsupported index backend is used.
- keep_score(score)
In the .matrix method, scores are collected in a sparse way. Override this method if values other than False or 0 should not be stored in the final collection.
- load_ann_index(path: str | Path) Any
Load an ANN index from disk.
- Parameters:
path (Union[str, Path]) – Path to load the index from.
- Returns:
The loaded ANN index.
- Return type:
Any
- Raises:
ValueError – If loaded index similarity metric doesn’t match current metric.
- static load_embeddings(npy_path: str | Path) ndarray
Load embeddings from a numpy file.
- Parameters:
npy_path (Union[str, Path]) – Path to the numpy file.
- Returns:
Embeddings array.
- Return type:
np.ndarray
- Raises:
ValueError – If loaded array is not 2D.
- matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = True) ndarray
Compute similarity matrix between reference and query spectra.
- Parameters:
references – List of reference spectra.
queries – List of query spectra.
array_type – Type of array to return. Must be “numpy”.
is_symmetric – Whether the matrix is symmetric. Must be True.
- Returns:
Similarity matrix.
- Return type:
np.ndarray
- Raises:
ValueError – If array_type is not “numpy” or is_symmetric is False.
- pair(reference: Spectrum, query: Spectrum) float
Compute similarity between a pair of spectra.
- Parameters:
reference (SpectrumType) – Reference spectrum.
query (SpectrumType) – Query spectrum.
- Returns:
Similarity score between the spectra.
- Return type:
float
- save_ann_index(path: str | Path) None
Save the ANN index to disk.
- Parameters:
path (Union[str, Path]) – Path to save the index to.
- Raises:
ValueError – If no index exists to save.
- score_datatype
alias of
float64
- sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)
Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.
Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.
- Parameters:
references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
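The naive for-loop mentioned above can be sketched as follows. The function `naive_sparse_scores` and its `pair_fn` argument are hypothetical stand-ins for illustration; the toy "spectra" are plain numbers so the example runs without matchms.

```python
import numpy as np


def naive_sparse_scores(references, queries, idx_row, idx_col, pair_fn):
    """Hypothetical sketch: score only the requested (row, col) pairs."""
    scores = np.zeros(len(idx_row), dtype=float)
    for k, (i, j) in enumerate(zip(idx_row, idx_col)):
        scores[k] = pair_fn(references[i], queries[j])
    return scores


# Toy "spectra" are plain numbers here; the score is a stand-in function.
refs = [1.0, 2.0, 3.0]
qrys = [1.0, 4.0]
print(naive_sparse_scores(refs, qrys, [0, 2], [0, 1], lambda r, q: min(r, q) / max(r, q)))
```

Only the pairs (0, 0) and (2, 1) are evaluated, which is the point of passing explicit indices instead of computing the full matrix.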
- class matchms.similarity.BlinkCosine(tolerance: float = 0.01, bin_width: float = 0.001, mz_power: float = 0.0, intensity_power: float = 1.0, clip_to_one: bool = True, use_numba: bool = True, prefilter: bool = True, min_relative_intensity: float = 0.01, crop_above_precursor: bool = True, remove_zero_intensities: bool = True, top_k: int | None = None, batch_size: int = 1024, sparse_score_min: float = 0.0)[source]
Bases: BaseSimilarity
BLINK-style approximate cosine similarity for mass spectra with fast .pair() and .matrix(). This score is implemented based on the method BLINK, proposed by Harwood et al. (2023, https://www.nature.com/articles/s41598-023-40496-9).
Integer binning with bin_width (Da); tolerance window is ± floor(tolerance/bin_width) bins.
Per-spectrum L2 normalization (after optional mz/intensity weighting).
Blur only one side (queries in .matrix(), smaller spectrum in .pair()).
Pairwise returns (score, ~matches). Matrix returns only scores.
- Parameters:
tolerance – True m/z tolerance (Da). Peaks within +/- tolerance are considered matches. Default 0.01.
bin_width – Discretization width (Da). Default 0.001 (1 mDa). Effective radius R=floor(tolerance/bin_width).
mz_power – Power for mz weighting (intensity *= mz**mz_power). Default 0.0.
intensity_power – Power for intensity weighting before normalization. Default 1.0 (set 0.5 for sqrt scaling).
clip_to_one – Clip score to [0,1]. Default True.
use_numba (bool) – Use numba-accelerated pairwise kernel when available. Default True.
prefilter (bool) – Apply BLINK-like pre-filtering (remove <1% base peak, > precursor m/z, zeros). Default True.
min_relative_intensity (float) – Relative base-peak threshold for prefilter. Default 0.01 (1%).
crop_above_precursor (bool) – Drop fragments > precursor m/z if available in metadata. Default True.
remove_zero_intensities (bool) – Remove peaks with intensity <= 0. Default True.
top_k (Optional[int]) – Keep only top-K most intense fragments after other filters (per spectrum). Default None.
Batching (matrix path):
batch_size (int) – Number of query spectra per batch in .matrix(). Default 1024.
sparse_score_min (float) – When array_type=’sparse’, drop scores < sparse_score_min. Default 0.0.
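The pre-filtering options listed above can be read as a sequence of boolean masks over the peak list. The sketch below is one possible reading; the helper name `prefilter_peaks` and the order in which the filters are applied are assumptions, not the library's code.

```python
import numpy as np


def prefilter_peaks(mz, intensities, precursor_mz=None,
                    min_relative_intensity=0.01, top_k=None):
    """Hypothetical sketch of the pre-filtering options listed above."""
    mz = np.asarray(mz, dtype=float)
    intensities = np.asarray(intensities, dtype=float)
    keep = intensities > 0                                     # remove_zero_intensities
    if keep.any():
        keep &= intensities >= min_relative_intensity * intensities.max()
    if precursor_mz is not None:
        keep &= mz <= precursor_mz                             # crop_above_precursor
    mz, intensities = mz[keep], intensities[keep]
    if top_k is not None and len(mz) > top_k:
        strongest = np.sort(np.argsort(intensities)[-top_k:])  # keep m/z order
        mz, intensities = mz[strongest], intensities[strongest]
    return mz, intensities


print(prefilter_peaks([50, 100, 150, 250], [0.005, 1.0, 0.0, 0.5], precursor_mz=200.0))
```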
- __init__(tolerance: float = 0.01, bin_width: float = 0.001, mz_power: float = 0.0, intensity_power: float = 1.0, clip_to_one: bool = True, use_numba: bool = True, prefilter: bool = True, min_relative_intensity: float = 0.01, crop_above_precursor: bool = True, remove_zero_intensities: bool = True, top_k: int | None = None, batch_size: int = 1024, sparse_score_min: float = 0.0)[source]
- keep_score(score)
In the .matrix method, scores are collected in a sparse way. Override this method if values other than False or 0 should not be stored in the final collection.
- matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False)[source]
All-vs-all BLINK-style cosine scores.
Implementation:
- Build a global dense bin axis in integer bins from min to max across refs+queries (rows ~ (max_bin - min_bin + 1)), which keeps matrices sparse.
- Build a CSR intensity matrix for refs (rows=bins, cols=ref spectra) after per-spectrum L2 normalization.
- For queries, build per-batch blurred CSR by expanding each nonzero to its ±R neighbors.
- Multiply: scores_batch = (I_ref.T @ I_qry_blur), accumulate into the final output.
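The steps above can be sketched with small dense arrays. This is a simplified illustration only: it uses dense NumPy matrices instead of CSR, skips batching, and assumes peaks are already converted to integer bin indices. The function name `blink_like_matrix` is hypothetical.

```python
import numpy as np


def blink_like_matrix(ref_bins, qry_bins, n_bins, radius):
    """Hypothetical dense sketch: each spectrum is (bin_indices, intensities)."""
    def to_matrix(spectra, blur):
        out = np.zeros((n_bins, len(spectra)))
        for col, (bins, weights) in enumerate(spectra):
            weights = np.asarray(weights, dtype=float)
            weights = weights / np.linalg.norm(weights)      # per-spectrum L2 norm
            for b, w in zip(bins, weights):
                lo, hi = max(b - blur, 0), min(b + blur, n_bins - 1)
                out[lo:hi + 1, col] += w                     # expand to +/- radius bins
        return out

    i_ref = to_matrix(ref_bins, blur=0)            # references stay sharp
    i_qry_blur = to_matrix(qry_bins, blur=radius)  # only queries are blurred
    return np.clip(i_ref.T @ i_qry_blur, 0.0, 1.0)


radius = int(np.floor(0.5 / 0.25))  # R = floor(tolerance / bin_width)
references = [([400, 600], [3.0, 4.0])]
queries = [([401, 600], [3.0, 4.0]), ([410, 700], [1.0, 1.0])]
scores = blink_like_matrix(references, queries, n_bins=1000, radius=radius)
print(scores)
```

Because only one side is blurred, a sharp reference peak can overlap several blurred query peaks, which is why the result is clipped to [0, 1] in this sketch.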
- Parameters:
references – List of reference spectra.
queries – List of query spectra.
array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy” and will return a numpy array. “sparse” will return a COO-sparse array.
- Returns:
If array_type == ‘numpy’: dense (n_ref, n_query).
If array_type == ‘sparse’: COO sparse (n_ref, n_query), dropping scores < sparse_score_min.
- Return type:
- pair(reference: Spectrum, query: Spectrum) Tuple[float, int][source]
Calculate BLINK-style cosine between two spectra.
- Parameters:
reference – Single reference spectrum.
query – Single query spectrum.
- sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)
Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.
Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.
- Parameters:
references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
- class matchms.similarity.CosineGreedy(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]
Bases: BaseSimilarity
Calculate ‘cosine similarity score’ between two spectra.
The cosine score aims at quantifying the similarity between two mass spectra. The score is calculated by finding the best possible matches between peaks of two spectra. Two peaks are considered a potential match if their m/z ratios lie within the given ‘tolerance’. The underlying peak assignment problem is solved here in a ‘greedy’ way. This can be notably faster, but may occasionally deviate slightly from the optimal solution (which the Hungarian algorithm finds, see CosineHungarian). In practice this will rarely affect similarity scores notably, in particular for smaller tolerances.
For example
    import numpy as np
    from matchms import Spectrum
    from matchms.similarity import CosineGreedy

    reference = Spectrum(mz=np.array([100, 150, 200.]),
                         intensities=np.array([0.7, 0.2, 0.1]),
                         metadata={"precursor_mz": 200.0})
    query = Spectrum(mz=np.array([100, 140, 190.]),
                     intensities=np.array([0.4, 0.2, 0.1]),
                     metadata={"precursor_mz": 190.0})

    # Use factory to construct a similarity function
    cosine_greedy = CosineGreedy(tolerance=0.2)

    score = cosine_greedy.pair(reference, query)

    print(f"Cosine score is {score['score']:.2f} with {score['matches']} matched peaks")
Should output
Cosine score is 0.83 with 1 matched peaks
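To make the greedy assignment more concrete, here is a minimal NumPy sketch that works on raw peak arrays. It assumes mz_power = 0 and intensity_power = 1, and the function name `greedy_cosine` is hypothetical; it illustrates the idea and is not the matchms implementation.

```python
import numpy as np


def greedy_cosine(mz_ref, int_ref, mz_qry, int_qry, tolerance=0.1):
    """Hypothetical sketch of greedy peak matching followed by a cosine score."""
    mz_ref, int_ref = np.asarray(mz_ref, float), np.asarray(int_ref, float)
    mz_qry, int_qry = np.asarray(mz_qry, float), np.asarray(int_qry, float)
    # All candidate pairs whose m/z values lie within the tolerance.
    rows, cols = np.nonzero(np.abs(mz_ref[:, None] - mz_qry[None, :]) <= tolerance)
    products = int_ref[rows] * int_qry[cols]
    used_ref, used_qry, total, matches = set(), set(), 0.0, 0
    for k in np.argsort(-products):               # strongest pair first
        i, j = int(rows[k]), int(cols[k])
        if i not in used_ref and j not in used_qry:
            used_ref.add(i)
            used_qry.add(j)
            total += products[k]
            matches += 1
    norm = np.linalg.norm(int_ref) * np.linalg.norm(int_qry)
    return (total / norm if norm > 0 else 0.0), matches


score, matches = greedy_cosine([100, 150, 200.0], [0.7, 0.2, 0.1],
                               [100, 140, 190.0], [0.4, 0.2, 0.1], tolerance=0.2)
print(f"{score:.2f} with {matches} matched peaks")
```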
- __init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]
- Parameters:
tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.
mz_power – The power to raise m/z to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.
- keep_score(score)
In the .matrix method, scores are collected in a sparse way. Override this method if values other than False or 0 should not be stored in the final collection.
- matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False, progress_bar: bool = True) ndarray
Optional: Provide optimized method to calculate an np.array of similarity scores for given reference and query spectra. If no method is added here, the following naive implementation (i.e. a double for-loop) is used.
- Parameters:
references – List of reference objects
queries – List of query objects
array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy” and will return a numpy array. “sparse” will return a COO-sparse array.
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
- pair(reference: Spectrum, query: Spectrum) Tuple[float, int][source]
Calculate cosine score between two spectra.
- Parameters:
reference – Single reference spectrum.
query – Single query spectrum.
- Returns:
Tuple with cosine score and number of matched peaks.
- Return type:
Score
- sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)
Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.
Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.
- Parameters:
references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
- class matchms.similarity.CosineHungarian(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]
Bases: BaseSimilarity
Calculate ‘cosine similarity score’ between two spectra using the Hungarian algorithm.
The cosine score quantifies the similarity between two mass spectra by finding the optimal one-to-one matching between their peaks. Two peaks are considered a potential match if their m/z ratios lie within the given tolerance.
The peak assignment is solved using the Hungarian algorithm (scipy.optimize.linear_sum_assignment), which finds the assignment that maximises the sum of intensity products. This is mathematically optimal but can be notably slower than the greedy heuristic in CosineGreedy.
- __init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]
- Parameters:
tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.
mz_power – The power to raise m/z to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.
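The optimal assignment described for this class can be illustrated on raw peak arrays with scipy.optimize.linear_sum_assignment. The sketch assumes mz_power = 0 and intensity_power = 1; the function name `hungarian_cosine` is hypothetical and the code is not the matchms implementation. The toy data are chosen so that a strategy taking the strongest pair first would match only one pair (product 1.0), whereas the optimal assignment matches two pairs (0.6 + 0.5).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def hungarian_cosine(mz_ref, int_ref, mz_qry, int_qry, tolerance=0.1):
    """Hypothetical sketch: optimal one-to-one peak matching, then cosine score."""
    mz_ref, int_ref = np.asarray(mz_ref, float), np.asarray(int_ref, float)
    mz_qry, int_qry = np.asarray(mz_qry, float), np.asarray(int_qry, float)
    within = np.abs(mz_ref[:, None] - mz_qry[None, :]) <= tolerance
    # Pairs outside the tolerance get zero weight, so they never add to the sum.
    weights = np.where(within, np.outer(int_ref, int_qry), 0.0)
    rows, cols = linear_sum_assignment(weights, maximize=True)
    matched = within[rows, cols]                  # ignore out-of-tolerance assignments
    total = weights[rows, cols][matched].sum()
    norm = np.linalg.norm(int_ref) * np.linalg.norm(int_qry)
    return (total / norm if norm > 0 else 0.0), int(matched.sum())


score, matches = hungarian_cosine([100.0, 100.2], [0.6, 1.0],
                                  [100.1, 100.3], [1.0, 0.5], tolerance=0.15)
print(f"{score:.3f} with {matches} matched peaks")
```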
- keep_score(score)
In the .matrix method, scores are collected in a sparse way. Override this method if values other than False or 0 should not be stored in the final collection.
- matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False, progress_bar: bool = True) ndarray
Optional: Provide optimized method to calculate an np.array of similarity scores for given reference and query spectra. If no method is added here, the following naive implementation (i.e. a double for-loop) is used.
- Parameters:
references – List of reference objects
queries – List of query objects
array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy” and will return a numpy array. “sparse” will return a COO-sparse array.
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
- pair(reference: Spectrum, query: Spectrum) Tuple[float, int][source]
Calculate cosine score between two spectra.
- Parameters:
reference – Single reference spectrum.
query – Single query spectrum.
- Returns:
Tuple with cosine score and number of matched peaks.
- sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)
Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.
Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.
- Parameters:
references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
- class matchms.similarity.CosineLinear(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]
Bases: BaseSimilarity
Calculate ‘linear cosine similarity score’ between two spectra.
This implements the CosineLinear similarity from SIRIUS (Böcker lab), which achieves O(n+m) time complexity by requiring spectra to be “well-separated” (consecutive peaks more than 2x tolerance apart). A preprocessing step (sirius_merge_close_peaks) enforces this invariant by greedily merging close peaks in descending intensity order.
For example
    import numpy as np
    from matchms import Spectrum
    from matchms.similarity import CosineLinear

    reference = Spectrum(mz=np.array([100, 150, 200.]),
                         intensities=np.array([0.7, 0.2, 0.1]))
    query = Spectrum(mz=np.array([100, 140, 190.]),
                     intensities=np.array([0.4, 0.2, 0.1]))

    cosine_linear = CosineLinear(tolerance=0.2)

    score = cosine_linear.pair(reference, query)

    print(f"CosineLinear score is {score['score']:.2f} with {score['matches']} matched peaks")
Should output
CosineLinear score is 0.83 with 1 matched peaks
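The reason well-separated spectra allow linear-time matching is that each peak can have at most one partner within the tolerance, so a single two-pointer sweep over both sorted peak lists suffices. The sketch below shows only that sweep; it assumes both inputs are sorted by m/z and already well separated (the merging step is not shown), and the function name `linear_cosine` is hypothetical.

```python
import numpy as np


def linear_cosine(mz_ref, int_ref, mz_qry, int_qry, tolerance=0.1):
    """Hypothetical sketch: one-pass matching for sorted, well-separated peaks."""
    i = j = matches = 0
    dot = 0.0
    while i < len(mz_ref) and j < len(mz_qry):
        diff = mz_ref[i] - mz_qry[j]
        if abs(diff) <= tolerance:     # the only possible partner, so take it
            dot += int_ref[i] * int_qry[j]
            matches += 1
            i += 1
            j += 1
        elif diff < 0:                 # reference peak is too low, advance it
            i += 1
        else:                          # query peak is too low, advance it
            j += 1
    norm = np.linalg.norm(int_ref) * np.linalg.norm(int_qry)
    return (dot / norm if norm > 0 else 0.0), matches


score, matches = linear_cosine([100, 150, 200.0], [0.7, 0.2, 0.1],
                               [100, 140, 190.0], [0.4, 0.2, 0.1], tolerance=0.2)
print(f"{score:.2f} with {matches} matched peaks")
```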
- __init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]
- Parameters:
tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1. Peaks closer than 2 * tolerance are merged before scoring.
mz_power – The power to raise m/z to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.
- keep_score(score)
In the .matrix method, scores are collected in a sparse way. Override this method if values other than False or 0 should not be stored in the final collection.
- matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False, progress_bar: bool = True) ndarray[source]
Optimized matrix computation that precomputes merged spectra.
Each spectrum is merged once (N+M calls to sirius_merge_close_peaks) instead of 2*N*M times in the naive double-loop approach.
- pair(reference: Spectrum, query: Spectrum) ndarray[source]
Calculate linear cosine score between two spectra.
- Parameters:
reference – Single reference spectrum.
query – Single query spectrum.
- Returns:
Tuple with cosine score and number of matched peaks.
- Return type:
Score
- sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)
Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.
Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.
- Parameters:
references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
- class matchms.similarity.FingerprintSimilarity(similarity_measure: str = 'jaccard', set_empty_scores: float | int | str = 'nan')[source]
Bases: BaseSimilarity
Calculate similarity between molecules based on their fingerprints.
For this similarity measure to work, fingerprints are expected to be derived by running add_fingerprint().
Code example:
    import numpy as np
    from matchms import calculate_scores
    from matchms import Spectrum
    from matchms.filtering import add_fingerprint
    from matchms.similarity import FingerprintSimilarity

    spectrum_1 = Spectrum(mz=np.array([], dtype="float"),
                          intensities=np.array([], dtype="float"),
                          metadata={"smiles": "CCC(C)C(C(=O)O)NC(=O)CCl"})
    spectrum_2 = Spectrum(mz=np.array([], dtype="float"),
                          intensities=np.array([], dtype="float"),
                          metadata={"smiles": "CC(C)C(C(=O)O)NC(=O)CCl"})
    spectrum_3 = Spectrum(mz=np.array([], dtype="float"),
                          intensities=np.array([], dtype="float"),
                          metadata={"smiles": "C(C(=O)O)(NC(=O)O)S"})
    spectra = [spectrum_1, spectrum_2, spectrum_3]

    # Add fingerprints
    spectra = [add_fingerprint(x, nbits=256) for x in spectra]

    # Specify type and calculate similarities
    similarity_measure = FingerprintSimilarity("jaccard")
    scores = calculate_scores(spectra, spectra, similarity_measure)

    print(np.round(scores.scores.to_array(), 3).tolist())
Should output
[[1.0, 0.878, 0.415], [0.878, 1.0, 0.444], [0.415, 0.444, 1.0]]
- __init__(similarity_measure: str = 'jaccard', set_empty_scores: float | int | str = 'nan')[source]
- Parameters:
similarity_measure – Choose the similarity measure from “cosine”, “dice”, “jaccard”. The default is “jaccard”.
set_empty_scores – Define what should be given instead of a similarity score in cases where fingerprints are missing. The default is “nan”, which will return np.nan’s in such cases.
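For binary fingerprints, the three measures named above reduce to simple counts of set bits. The following sketch uses the standard definitions; the function name `fingerprint_similarity` is hypothetical, and returning NaN for an all-zero fingerprint is an assumption made for this illustration.

```python
import numpy as np


def fingerprint_similarity(fp_a, fp_b, measure="jaccard"):
    """Hypothetical sketch for binary fingerprints given as 0/1 arrays."""
    a, b = np.asarray(fp_a, dtype=bool), np.asarray(fp_b, dtype=bool)
    both = np.count_nonzero(a & b)
    n_a, n_b = np.count_nonzero(a), np.count_nonzero(b)
    if n_a == 0 or n_b == 0:
        return float("nan")                       # undefined for an all-zero fingerprint
    if measure == "jaccard":
        return both / np.count_nonzero(a | b)     # |A and B| / |A or B|
    if measure == "dice":
        return 2 * both / (n_a + n_b)             # 2 |A and B| / (|A| + |B|)
    if measure == "cosine":
        return both / np.sqrt(n_a * n_b)          # |A and B| / sqrt(|A| |B|)
    raise ValueError(f"Unknown measure: {measure}")


fp_1, fp_2 = [1, 1, 0, 1], [1, 0, 0, 1]
for name in ("jaccard", "dice", "cosine"):
    print(name, round(fingerprint_similarity(fp_1, fp_2, name), 3))
```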
- keep_score(score)
In the .matrix method, scores are collected in a sparse way. Override this method if values other than False or 0 should not be stored in the final collection.
- matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False) array[source]
Calculate matrix of fingerprint based similarity scores.
- Parameters:
references – List of reference spectra.
queries – List of query spectra.
array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy” and will return a numpy array. “sparse” will return a COO-sparse array.
- pair(reference: Spectrum, query: Spectrum) float[source]
Calculate fingerprint based similarity score between two spectra.
- Parameters:
reference – Single reference spectrum.
query – Single query spectrum.
- score_datatype
alias of
float64
- sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)
Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.
Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.
- Parameters:
references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
- class matchms.similarity.FlashSimilarity(score_type: str = 'spectral_entropy', matching_mode: str = 'fragment', tolerance: float = 0.02, use_ppm: bool = False, remove_precursor: bool = False, precursor_window: float = 1.6, noise_cutoff: float = 0.01, normalize_to_half: bool = True, merge_within: float = 0, identity_precursor_tolerance: float | None = None, identity_use_ppm: bool = False, dtype: dtype = <class 'numpy.float64'>)[source]
Bases: BaseSimilarity
Flash entropy similarity (Li & Fiehn, 2023) with a fast .matrix() that builds a library-wide index over ‘queries’ and streams all ‘references’ through it.
- Key options:
matching_mode: ‘fragment’, ‘neutral_loss’, or ‘hybrid’ (fragment-priority).
tolerance in Da or symmetric ppm (use_ppm=True).
cleanup: remove the precursor and peaks > (precursor_mz - 1.6), 1% noise removal, entropy weighting, normalize ∑I’ = 0.5, optional within-peak merge.
- Notes:
.pair() works but is not the fast path. Use .matrix().
For identity-search behavior, pass identity_precursor_tolerance (Da or ppm).
- Parameters:
score_type – Score type: ‘spectral_entropy’ (default) or ‘cosine’.
matching_mode – Matching mode: ‘fragment’, ‘neutral_loss’, or ‘hybrid’ (default is ‘fragment’). Choose “hybrid” in combination with score_type=”cosine” to compute the modified cosine score.
tolerance – Matching tolerance in Da or ppm (use_ppm=True). Default is 0.02.
use_ppm – If True, interpret tolerance as parts-per-million. Default is False.
remove_precursor – If True, remove precursor peak and peaks within precursor_window. Default is False.
precursor_window – If remove_precursor is True, remove peaks within this window around the precursor m/z. Default is 1.6 Da (as suggested by Li & Fiehn (2023)).
noise_cutoff – If > 0, remove peaks with intensities below this fraction of the maximum intensity. Default is 0.01 (1%).
normalize_to_half – If True, normalize intensities such that the sum of intensities is 0.5. Default is True.
merge_within – If > 0, merge peaks within this distance (in Da) to a single peak. Default is 0.
identity_precursor_tolerance – If not None, enforce identity search behavior by requiring the precursor m/z of the query to be within this tolerance of the reference precursor m/z.
identity_use_ppm – If True, interpret identity_precursor_tolerance as ppm. Default is False.
dtype – Data type for the output scores. Default is np.float64, which properly accounts for the highest-resolution MS/MS data (even far beyond current MS/MS possibilities). To save memory, np.float32 can be used instead, which is sufficient for peak resolutions up to about 8,000,000.
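The clean-up options above (precursor removal, noise cutoff, normalization to 0.5) can be pictured as a short filtering pipeline. The sketch below is a simplified reading that omits entropy weighting and peak merging; the function name `clean_spectrum` and the order of the steps are assumptions, not the library's implementation.

```python
import numpy as np


def clean_spectrum(mz, intensities, precursor_mz=None, precursor_window=1.6,
                   noise_cutoff=0.01, normalize_to_half=True):
    """Hypothetical sketch of the clean-up steps listed above (no entropy weighting)."""
    mz = np.asarray(mz, dtype=float)
    intensities = np.asarray(intensities, dtype=float)
    keep = intensities > 0
    if precursor_mz is not None:                  # behaviour with remove_precursor=True
        keep &= mz <= precursor_mz - precursor_window
    if noise_cutoff > 0 and keep.any():
        keep &= intensities >= noise_cutoff * intensities[keep].max()
    mz, intensities = mz[keep], intensities[keep]
    total = intensities.sum()
    if normalize_to_half and total > 0:
        intensities = 0.5 * intensities / total   # intensities now sum to 0.5
    return mz, intensities


clean_mz, clean_int = clean_spectrum([50, 100, 150, 199, 200],
                                     [0.001, 1.0, 3.0, 0.5, 3.0],
                                     precursor_mz=200.0)
print(clean_mz, clean_int)
```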
- __init__(score_type: str = 'spectral_entropy', matching_mode: str = 'fragment', tolerance: float = 0.02, use_ppm: bool = False, remove_precursor: bool = False, precursor_window: float = 1.6, noise_cutoff: float = 0.01, normalize_to_half: bool = True, merge_within: float = 0, identity_precursor_tolerance: float | None = None, identity_use_ppm: bool = False, dtype: dtype = <class 'numpy.float64'>)[source]
- keep_score(score)
In the .matrix method, scores are collected in a sparse way. Override this method if values other than False or 0 should not be stored in the final collection.
- matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False, n_jobs: int = -1) ndarray[source]
Calculate matrix of Flash entropy similarity scores.
- Parameters:
references – List of reference spectra.
queries – List of query spectra.
array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy” and will return a numpy array. “sparse” will return a SparseStacked COO-style array.
is_symmetric – If True, the matrix will be symmetric (i.e., references and queries must have the same length). This has no effect on runtime here.
n_jobs – Number of parallel jobs to run. Default is -1, which means that all available CPUs minus one will be used.
- pair(reference: Spectrum, query: Spectrum) ndarray[source]
Compute Flash similarity for a single (reference, query) pair. Uses the same preprocessing and scoring logic as the matrix path, but builds a tiny 1-spectrum library from the query.
Caution: this is not the intended fast path; use .matrix() instead.
- score_datatype
alias of
float32
- sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)
Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.
Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.
- Parameters:
references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
- class matchms.similarity.IntersectMz(scaling: float = 1.0)[source]
Bases: BaseSimilarity
Example score illustrating how to build a custom spectrum similarity score.
IntersectMz counts all exact m/z matches between the peaks of two spectra and divides this count by the number of unique peaks found across both spectra.
Example of how matchms similarity functions can be used:
    import numpy as np
    from matchms import Spectrum
    from matchms.similarity import IntersectMz

    spectrum_1 = Spectrum(mz=np.array([100, 150, 200.]),
                          intensities=np.array([0.7, 0.2, 0.1]))
    spectrum_2 = Spectrum(mz=np.array([100, 140, 190.]),
                          intensities=np.array([0.4, 0.2, 0.1]))

    # Construct a similarity function
    similarity_measure = IntersectMz(scaling=1.0)

    score = similarity_measure.pair(spectrum_1, spectrum_2)

    print(f"IntersectMz score is {score:.2f}")
Should output
IntersectMz score is 0.20
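The counting rule described above (exact peak matches divided by all unique peaks) can be sketched without matchms. This is a minimal illustration, not the library's implementation: the helper name intersect_mz_score is hypothetical, m/z values are compared by exact equality, and returning 0.0 for two empty spectra is an assumption.

```python
def intersect_mz_score(mz_1, mz_2, scaling=1.0):
    """Count exact m/z matches and divide by the number of unique m/z values."""
    set_1, set_2 = set(mz_1), set(mz_2)
    n_unique = len(set_1 | set_2)
    if n_unique == 0:
        return 0.0  # assumption: two empty peak lists score zero
    return scaling * len(set_1 & set_2) / n_unique


score = intersect_mz_score([100.0, 150.0, 200.0], [100.0, 140.0, 190.0])
print(f"IntersectMz score is {score:.2f}")  # one shared peak out of five unique peaks
```

With the two example spectra, only m/z 100 is shared and five distinct m/z values exist overall, which gives 1/5 = 0.20.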
- __init__(scaling: float = 1.0)[source]
Constructor. Here, function parameters are defined.
- Parameters:
scaling – Scale scores to maximum possible score being ‘scaling’.
- keep_score(score)
In the .matrix method scores will be collected in a sparse way. Overwrite this method here if values other than False or 0 should not be stored in the final collection.
- matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False, progress_bar: bool = True) ndarray
Optional: Provide an optimized method to calculate an np.array of similarity scores for the given reference and query spectra. If no method is added here, the following naive implementation (i.e. a double for-loop) is used.
- Parameters:
references – List of reference objects
queries – List of query objects
array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy” and will return a numpy array. “sparse” will return a COO-sparse array.
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
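The naive double for-loop and the is_symmetric shortcut can be sketched in plain Python. This is a simplified illustration under stated assumptions: naive_matrix and toy_pair are hypothetical names, plain numbers stand in for spectra, and nested lists stand in for the numpy or sparse output.

```python
def naive_matrix(references, queries, pair, is_symmetric=False):
    """Fill a dense score table by calling pair() for each (reference, query)."""
    n_rows, n_cols = len(references), len(queries)
    if is_symmetric and n_rows != n_cols:
        raise ValueError("is_symmetric=True needs identical references and queries")
    scores = [[0.0] * n_cols for _ in range(n_rows)]
    calls = 0
    for i, reference in enumerate(references):
        # With symmetry, only the upper triangle (j >= i) is computed.
        start = i if is_symmetric else 0
        for j in range(start, n_cols):
            scores[i][j] = pair(reference, queries[j])
            calls += 1
            if is_symmetric:
                scores[j][i] = scores[i][j]  # mirror, since score[i,j] = score[j,i]
    return scores, calls


def toy_pair(a, b):
    # Hypothetical symmetric score on plain numbers, standing in for spectra.
    return 1.0 / (1.0 + abs(a - b))


items = [1.0, 2.0, 4.0, 8.0]
full, full_calls = naive_matrix(items, items, toy_pair)
half, half_calls = naive_matrix(items, items, toy_pair, is_symmetric=True)
print(full_calls, half_calls)  # 16 10
print(full == half)  # True
```

For n spectra the symmetric path needs n(n+1)/2 pair evaluations instead of n², which is where the "about 2x faster" figure comes from for large n.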
- pair(reference: Spectrum, query: Spectrum) float[source]
This will calculate the similarity score between two spectra.
- score_datatype
alias of float64
- sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)
Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.
Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.
- Parameters:
references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
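The index-driven for-loop can be illustrated as follows. This is a sketch only: naive_sparse_scores and jaccard are hypothetical helpers, sets of m/z values stand in for Spectrum objects, and the result is a plain list aligned with idx_row and idx_col rather than a sparse array.

```python
def naive_sparse_scores(references, queries, idx_row, idx_col, pair):
    """Score only the requested (row, column) index pairs."""
    if len(idx_row) != len(idx_col):
        raise ValueError("idx_row and idx_col must have the same length")
    return [pair(references[r], queries[c]) for r, c in zip(idx_row, idx_col)]


def jaccard(a, b):
    # Toy score on sets of m/z values: shared peaks over all unique peaks.
    return len(a & b) / len(a | b)


references = [{100.0, 150.0}, {100.0, 200.0}]
queries = [{100.0, 150.0}, {300.0}]

# Score three of the four possible pairs: (0, 0), (1, 0) and (1, 1).
scores = naive_sparse_scores(references, queries, [0, 1, 1], [0, 0, 1], jaccard)
print(scores)  # [1.0, 0.3333333333333333, 0.0]
```

Only the listed pairs are evaluated, so the cost scales with len(idx_row) rather than with the full references × queries grid.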
- class matchms.similarity.MetadataMatch(field: str, matching_type: str = 'equal_match', tolerance: float = 0.1)[source]
Bases:
BaseSimilarity
Return True if metadata entries of a specified field match between two spectra.
This is supposed to be used to compare a wide range of possible metadata entries and use this to later select related or similar spectra.
Example to calculate scores between 2 pairs of spectra and iterate over the scores
    import numpy as np
    from matchms import calculate_scores
    from matchms import Spectrum
    from matchms.similarity import MetadataMatch

    spectrum_1 = Spectrum(mz=np.array([]),
                          intensities=np.array([]),
                          metadata={"instrument_type": "orbitrap", "id": 1})
    spectrum_2 = Spectrum(mz=np.array([]),
                          intensities=np.array([]),
                          metadata={"instrument_type": "qtof", "id": 2})
    spectrum_3 = Spectrum(mz=np.array([]),
                          intensities=np.array([]),
                          metadata={"instrument_type": "qtof", "id": 3})
    spectrum_4 = Spectrum(mz=np.array([]),
                          intensities=np.array([]),
                          metadata={"instrument_type": "orbitrap", "id": 4})
    references = [spectrum_1, spectrum_2]
    queries = [spectrum_3, spectrum_4]

    similarity_score = MetadataMatch(field="instrument_type")
    scores = calculate_scores(references, queries, similarity_score)

    for (reference, query, score) in scores:
        print(f"Metadata match between {reference.get('id')} and {query.get('id')}" +
              f" is {bool(score[0])}")
Should output
    Metadata match between 1 and 4 is True
    Metadata match between 2 and 3 is True
- __init__(field: str, matching_type: str = 'equal_match', tolerance: float = 0.1)[source]
- Parameters:
field – Specify field name for metadata that should be compared.
matching_type – Specify how field entries should be matched. Can be one of [“equal_match”, “difference”]. “equal_match”: Entries must be exactly equal (default). “difference”: Entries are considered a match if their numerical difference is less than or equal to “tolerance”.
tolerance – Specify tolerance below which two values are counted as match. This only applies to numerical values.
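The two matching types can be sketched as a single comparison function. This is an illustration of the rules stated in the parameter list, not the matchms source: metadata_match is a hypothetical helper, and treating missing or non-numerical entries as non-matches is an assumption.

```python
def metadata_match(entry_ref, entry_query, matching_type="equal_match", tolerance=0.1):
    """Compare two metadata entries using the rule selected by matching_type."""
    if entry_ref is None or entry_query is None:
        return False  # assumption: a missing entry never matches
    if matching_type == "equal_match":
        return entry_ref == entry_query
    if matching_type == "difference":
        try:
            return abs(float(entry_ref) - float(entry_query)) <= tolerance
        except (TypeError, ValueError):
            return False  # assumption: non-numerical entries never match
    raise ValueError("matching_type must be 'equal_match' or 'difference'")


print(metadata_match("orbitrap", "qtof"))  # False
print(metadata_match("qtof", "qtof"))  # True
print(metadata_match(100.0, 100.5, "difference", tolerance=1.0))  # True
print(metadata_match(100.0, 102.0, "difference", tolerance=1.0))  # False
```

"equal_match" suits categorical fields such as instrument_type, while "difference" suits numerical fields such as retention time, where small deviations should still count as a match.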
- keep_score(score)
In the .matrix method scores will be collected in a sparse way. Overwrite this method here if values other than False or 0 should not be stored in the final collection.
- matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False) ndarray[source]
Compare the selected metadata field between all references and queries.
- Parameters:
references – List/array of reference spectra.
queries – List/array of query spectra.
array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy” and will return a numpy array. “sparse” will return a COO-sparse array.
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
- pair(reference: Spectrum, query: Spectrum) float[source]
Compare the selected metadata field between reference and query spectrum.
- Parameters:
reference – Single reference spectrum.
query – Single query spectrum.
- sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)
Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.
Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.
- Parameters:
references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
- class matchms.similarity.ModifiedCosineGreedy(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]
Bases:
BaseSimilarity
Calculate an approximate modified cosine score between mass spectra.
This implementation solves the peak assignment in a greedy way and is therefore an approximation. See ModifiedCosineHungarian for the exact assignment variant.
The modified cosine score aims at quantifying the similarity between two mass spectra. Two peaks are considered a potential match if their m/z ratios lie within the given tolerance, or if their m/z ratios lie within the tolerance once a mass-shift is applied. The mass shift is the difference in precursor m/z between the two spectra.
See Watrous et al. [PNAS, 2012, https://www.pnas.org/content/109/26/E1743] for further details.
- __init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]
Initialize approximate modified cosine.
- Parameters:
tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.
mz_power – The power to raise mz to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.
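How tolerance, the precursor mass shift, and the two power parameters interact can be sketched in plain Python. This is a simplified stand-in, not the matchms implementation: find_pairs and modified_cosine_greedy are hypothetical functions, peaks are (mz, intensity) tuples, and the shift is assumed to be applied to the query m/z as precursor(reference) - precursor(query).

```python
import math


def find_pairs(peaks_ref, peaks_query, tolerance, shift=0.0):
    """Index pairs (i, j) with |mz_ref - (mz_query + shift)| <= tolerance."""
    return {
        (i, j)
        for i, (mz_ref, _) in enumerate(peaks_ref)
        for j, (mz_query, _) in enumerate(peaks_query)
        if abs(mz_ref - (mz_query + shift)) <= tolerance
    }


def modified_cosine_greedy(peaks_ref, precursor_ref, peaks_query, precursor_query,
                           tolerance=0.1, mz_power=0.0, intensity_power=1.0):
    """Return (score, number of matched peaks) for lists of (mz, intensity) peaks."""
    def weight(peak):
        mz, intensity = peak
        return (mz ** mz_power) * (intensity ** intensity_power)

    def product(pair):
        return weight(peaks_ref[pair[0]]) * weight(peaks_query[pair[1]])

    shift = precursor_ref - precursor_query
    candidates = find_pairs(peaks_ref, peaks_query, tolerance)
    candidates |= find_pairs(peaks_ref, peaks_query, tolerance, shift)

    used_ref, used_query, total, n_matches = set(), set(), 0.0, 0
    # Greedy: visit candidate pairs from the largest product down, using each peak once.
    for i, j in sorted(sorted(candidates), key=product, reverse=True):
        if i in used_ref or j in used_query:
            continue
        used_ref.add(i)
        used_query.add(j)
        total += product((i, j))
        n_matches += 1

    norm = math.sqrt(sum(weight(p) ** 2 for p in peaks_ref))
    norm *= math.sqrt(sum(weight(p) ** 2 for p in peaks_query))
    return (total / norm if norm > 0 else 0.0), n_matches


reference = [(100.0, 1.0), (200.0, 0.5)]
query = [(100.0, 1.0), (210.0, 0.5)]

score_shifted, n_shifted = modified_cosine_greedy(reference, 300.0, query, 310.0)
score_plain, n_plain = modified_cosine_greedy(reference, 300.0, query, 300.0)
print(f"{score_shifted:.2f} {n_shifted}")  # 1.00 2
print(f"{score_plain:.2f} {n_plain}")  # 0.80 1
```

With precursors 300 and 310, the query peak at m/z 210 lines up with the reference peak at m/z 200 after the shift of -10, so both peaks match. With equal precursors, only the peak at m/z 100 matches and the score drops.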
- keep_score(score)
In the .matrix method scores will be collected in a sparse way. Overwrite this method here if values other than False or 0 should not be stored in the final collection.
- matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False, progress_bar: bool = True) ndarray
Optional: Provide an optimized method to calculate an np.array of similarity scores for the given reference and query spectra. If no method is added here, the following naive implementation (i.e. a double for-loop) is used.
- Parameters:
references – List of reference objects
queries – List of query objects
array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy” and will return a numpy array. “sparse” will return a COO-sparse array.
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
- pair(reference: Spectrum, query: Spectrum) Tuple[float, int][source]
Calculate approximate modified cosine score between two spectra.
- sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)
Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.
Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.
- Parameters:
references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
- class matchms.similarity.ModifiedCosineHungarian(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]
Bases:
BaseSimilarity
Calculate exact modified cosine score between mass spectra.
The modified cosine score quantifies similarity between two mass spectra with optional precursor-based mass shift. Potential matches are all peak pairs that are within tolerance either unshifted or shifted by precursor_mz(reference) - precursor_mz(query).
Peak assignment is solved globally via Hungarian assignment (linear sum assignment), which yields an exact one-to-one maximum-weight matching.
See Watrous et al. [PNAS, 2012, https://www.pnas.org/content/109/26/E1743] for the modified cosine concept.
- __init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0)[source]
Initialize exact modified cosine.
- Parameters:
tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.
mz_power – The power to raise mz to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.
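The difference between greedy and exact peak assignment shows up on a small weight matrix. The sketch below is an illustration only: instead of the Hungarian algorithm it brute-forces all permutations, which gives the same optimum on tiny square inputs, and both function names are hypothetical.

```python
from itertools import permutations


def greedy_assignment(weights):
    """Pick the largest remaining weight first; each row and column is used once."""
    cells = sorted(
        ((w, i, j) for i, row in enumerate(weights) for j, w in enumerate(row) if w > 0),
        reverse=True,
    )
    used_rows, used_cols, total = set(), set(), 0.0
    for w, i, j in cells:
        if i not in used_rows and j not in used_cols:
            used_rows.add(i)
            used_cols.add(j)
            total += w
    return total


def exact_assignment(weights):
    """Brute-force the best one-to-one assignment (square matrix, tiny inputs only)."""
    n = len(weights)
    return max(
        sum(weights[i][cols[i]] for i in range(n))
        for cols in permutations(range(n))
    )


# Rows are reference peaks, columns are query peaks, entries are matched-peak products.
weights = [[0.9, 0.8],
           [0.8, 0.1]]
greedy_total = greedy_assignment(weights)
exact_total = exact_assignment(weights)
print(f"greedy {greedy_total:.1f}")  # greedy 1.0
print(f"exact {exact_total:.1f}")  # exact 1.6
```

Greedy takes 0.9 first and is then forced to accept 0.1, while the exact assignment pairs both peaks through the 0.8 entries. This is the kind of case where the Hungarian variant can score higher than the greedy one.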
- keep_score(score)
In the .matrix method scores will be collected in a sparse way. Overwrite this method here if values other than False or 0 should not be stored in the final collection.
- matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False, progress_bar: bool = True) ndarray
Optional: Provide an optimized method to calculate an np.array of similarity scores for the given reference and query spectra. If no method is added here, the following naive implementation (i.e. a double for-loop) is used.
- Parameters:
references – List of reference objects
queries – List of query objects
array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy” and will return a numpy array. “sparse” will return a COO-sparse array.
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
- pair(reference: Spectrum, query: Spectrum) Tuple[float, int][source]
Calculate exact modified cosine score between two spectra.
- sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)
Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.
Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.
- Parameters:
references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
- class matchms.similarity.NeutralLossesCosine(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0, ignore_peaks_above_precursor: bool = True)[source]
Bases:
BaseSimilarity
Calculate ‘neutral losses cosine score’ between mass spectra.
The neutral losses cosine score aims at quantifying the similarity between two mass spectra. The score is calculated by finding best possible matches between peaks of two spectra. Two peaks are considered a potential match if their m/z ratios lie within the given ‘tolerance’ once a mass-shift is applied. The mass shift is the difference in precursor-m/z between the two spectra. In general, ModifiedCosineGreedy is recommended over NeutralLossesCosine because it will on average deliver more reliable results.
- __init__(tolerance: float = 0.1, mz_power: float = 0.0, intensity_power: float = 1.0, ignore_peaks_above_precursor: bool = True)[source]
- Parameters:
tolerance – Peaks will be considered a match when <= tolerance apart. Default is 0.1.
mz_power – The power to raise mz to in the cosine function. The default is 0, in which case the peak intensity products will not depend on the m/z ratios.
intensity_power – The power to raise intensity to in the cosine function. The default is 1.
ignore_peaks_above_precursor – By default this is set to True, meaning that peaks with m/z values larger than the precursor-m/z will be ignored (since those would correspond to negative “neutral losses”).
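Matching peaks after applying the precursor mass shift is equivalent to matching neutral losses, i.e. precursor m/z minus peak m/z. The conversion and the effect of ignore_peaks_above_precursor can be sketched as follows; neutral_losses is a hypothetical helper and peaks are plain (mz, intensity) tuples.

```python
def neutral_losses(peaks, precursor_mz, ignore_peaks_above_precursor=True):
    """Convert (mz, intensity) peaks into sorted (loss, intensity) pairs."""
    losses = []
    for mz, intensity in peaks:
        loss = precursor_mz - mz
        # A peak above the precursor would give a negative "neutral loss".
        if ignore_peaks_above_precursor and loss < 0:
            continue
        losses.append((loss, intensity))
    return sorted(losses)


peaks = [(100.0, 0.5), (182.0, 1.0), (205.0, 0.2)]
kept = neutral_losses(peaks, precursor_mz=200.0)
everything = neutral_losses(peaks, 200.0, ignore_peaks_above_precursor=False)
print(kept)  # [(18.0, 1.0), (100.0, 0.5)]
print(everything)  # [(-5.0, 0.2), (18.0, 1.0), (100.0, 0.5)]
```

The peak at m/z 205 lies above the precursor at m/z 200, so it is dropped by default and only appears, with a loss of -5, when the flag is switched off.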
- keep_score(score)
In the .matrix method scores will be collected in a sparse way. Overwrite this method here if values other than False or 0 should not be stored in the final collection.
- matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False, progress_bar: bool = True) ndarray
Optional: Provide an optimized method to calculate an np.array of similarity scores for the given reference and query spectra. If no method is added here, the following naive implementation (i.e. a double for-loop) is used.
- Parameters:
references – List of reference objects
queries – List of query objects
array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy” and will return a numpy array. “sparse” will return a COO-sparse array.
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
- pair(reference: Spectrum, query: Spectrum) Tuple[float, int][source]
Calculate neutral losses cosine score between two spectra.
- Parameters:
reference – Single reference spectrum.
query – Single query spectrum.
- Return type:
Tuple with cosine score and number of matched peaks.
- sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)
Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.
Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.
- Parameters:
references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
- class matchms.similarity.ParentMassMatch(tolerance: float = 0.1)[source]
Bases:
BaseSimilarity
Return True if spectra match in parent mass (within tolerance), and False otherwise.
Example to calculate scores between 2 spectra and iterate over the scores
    import numpy as np
    from matchms import calculate_scores
    from matchms import Spectrum
    from matchms.similarity import ParentMassMatch

    spectrum_1 = Spectrum(mz=np.array([]),
                          intensities=np.array([]),
                          metadata={"id": "1", "parent_mass": 100})
    spectrum_2 = Spectrum(mz=np.array([]),
                          intensities=np.array([]),
                          metadata={"id": "2", "parent_mass": 110})
    spectrum_3 = Spectrum(mz=np.array([]),
                          intensities=np.array([]),
                          metadata={"id": "3", "parent_mass": 103})
    spectrum_4 = Spectrum(mz=np.array([]),
                          intensities=np.array([]),
                          metadata={"id": "4", "parent_mass": 111})
    references = [spectrum_1, spectrum_2]
    queries = [spectrum_3, spectrum_4]

    similarity_score = ParentMassMatch(tolerance=5.0)
    scores = calculate_scores(references, queries, similarity_score)

    for (reference, query, score) in scores:
        print(f"Parentmass match between {reference.get('id')} and {query.get('id')}" +
              f" is {bool(score[0])}")
Should output
    Parentmass match between 1 and 3 is True
    Parentmass match between 2 and 4 is True
- __init__(tolerance: float = 0.1)[source]
- Parameters:
tolerance – Specify tolerance below which two masses are counted as match.
- keep_score(score)
In the .matrix method scores will be collected in a sparse way. Overwrite this method here if values other than False or 0 should not be stored in the final collection.
- matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False) ndarray[source]
Compare parent masses between all references and queries.
- Parameters:
references – List/array of reference spectra.
queries – List/array of query spectra.
array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy” and will return a numpy array. “sparse” will return a COO-sparse array.
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
- pair(reference: Spectrum, query: Spectrum) float[source]
Compare parent masses between reference and query spectrum.
- Parameters:
reference – Single reference spectrum.
query – Single query spectrum.
- sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)
Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.
Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.
- Parameters:
references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
- class matchms.similarity.PrecursorMzMatch(tolerance: float = 0.1, tolerance_type: str = 'Dalton')[source]
Bases:
BaseSimilarity
Return True if spectra match in precursor m/z (within tolerance), and False otherwise. The match within tolerance can be calculated based on an absolute m/z difference (tolerance_type="Dalton") or based on a relative difference in ppm (tolerance_type="ppm").
Example to calculate scores between 2 pairs of spectra and iterate over the scores
    import numpy as np
    from matchms import calculate_scores
    from matchms import Spectrum
    from matchms.similarity import PrecursorMzMatch

    spectrum_1 = Spectrum(mz=np.array([]),
                          intensities=np.array([]),
                          metadata={"id": "1", "precursor_mz": 100})
    spectrum_2 = Spectrum(mz=np.array([]),
                          intensities=np.array([]),
                          metadata={"id": "2", "precursor_mz": 110})
    spectrum_3 = Spectrum(mz=np.array([]),
                          intensities=np.array([]),
                          metadata={"id": "3", "precursor_mz": 103})
    spectrum_4 = Spectrum(mz=np.array([]),
                          intensities=np.array([]),
                          metadata={"id": "4", "precursor_mz": 111})
    references = [spectrum_1, spectrum_2]
    queries = [spectrum_3, spectrum_4]

    similarity_score = PrecursorMzMatch(tolerance=5.0, tolerance_type="Dalton")
    scores = calculate_scores(references, queries, similarity_score)

    for (reference, query, score) in scores:
        print(f"Precursor m/z match between {reference.get('id')} and {query.get('id')}" +
              f" is {bool(score[0])}")
Should output
    Precursor m/z match between 1 and 3 is True
    Precursor m/z match between 2 and 4 is True
- __init__(tolerance: float = 0.1, tolerance_type: str = 'Dalton')[source]
- Parameters:
tolerance – Specify tolerance below which two m/z are counted as match.
tolerance_type – Choose between a fixed tolerance in Dalton ("Dalton") or a relative difference in ppm ("ppm").
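The two tolerance types differ in how the m/z difference is scaled before it is compared with tolerance. The sketch below is a hedged illustration: precursor_mz_match is a hypothetical helper, and taking the ppm difference relative to the mean of the two m/z values is an assumption that may not match the matchms implementation.

```python
def precursor_mz_match(mz_ref, mz_query, tolerance=0.1, tolerance_type="Dalton"):
    """Return True when two precursor m/z values agree within tolerance."""
    difference = abs(mz_ref - mz_query)
    if tolerance_type == "Dalton":
        return difference <= tolerance
    if tolerance_type == "ppm":
        # Assumption: relative difference is taken against the mean of both values.
        mean_mz = (mz_ref + mz_query) / 2
        return difference / mean_mz * 1e6 <= tolerance
    raise ValueError("tolerance_type must be 'Dalton' or 'ppm'")


print(precursor_mz_match(100.0, 103.0, tolerance=5.0))  # True
print(precursor_mz_match(110.0, 103.0, tolerance=5.0))  # False
print(precursor_mz_match(500.0, 500.004, tolerance=10.0, tolerance_type="ppm"))  # True, about 8 ppm
print(precursor_mz_match(500.0, 500.01, tolerance=10.0, tolerance_type="ppm"))  # False, about 20 ppm
```

A ppm tolerance scales with m/z, so the same setting allows a larger absolute difference at m/z 1000 than at m/z 100. A Dalton tolerance stays constant across the mass range.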
- keep_score(score)
In the .matrix method scores will be collected in a sparse way. Overwrite this method here if values other than False or 0 should not be stored in the final collection.
- matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False) ndarray[source]
Compare precursor m/z between all references and queries.
- Parameters:
references – List/array of reference spectra.
queries – List/array of query spectra.
array_type – Specify the output array type. Can be “numpy” or “sparse”. Default is “numpy” and will return a numpy array. “sparse” will return a COO-sparse array.
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
- pair(reference: Spectrum, query: Spectrum) float[source]
Compare precursor m/z between reference and query spectrum.
- Parameters:
reference – Single reference spectrum.
query – Single query spectrum.
- sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)
Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.
Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.
- Parameters:
references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
Submodules
- matchms.similarity.BaseEmbeddingSimilarity module
BaseEmbeddingSimilarity
BaseEmbeddingSimilarity.index
BaseEmbeddingSimilarity.index_backend
BaseEmbeddingSimilarity.index_kwargs
BaseEmbeddingSimilarity.index_k
BaseEmbeddingSimilarity.__init__()
BaseEmbeddingSimilarity.build_ann_index()
BaseEmbeddingSimilarity.compute_embeddings()
BaseEmbeddingSimilarity.get_anns()
BaseEmbeddingSimilarity.get_embeddings()
BaseEmbeddingSimilarity.get_index_anns()
BaseEmbeddingSimilarity.keep_score()
BaseEmbeddingSimilarity.load_ann_index()
BaseEmbeddingSimilarity.load_embeddings()
BaseEmbeddingSimilarity.matrix()
BaseEmbeddingSimilarity.pair()
BaseEmbeddingSimilarity.save_ann_index()
BaseEmbeddingSimilarity.score_datatype
BaseEmbeddingSimilarity.sparse_array()
BaseEmbeddingSimilarity.store_embeddings()
BaseEmbeddingSimilarity.to_dict()
- matchms.similarity.BaseSimilarity module
- matchms.similarity.BinnedEmbeddingSimilarity module
BinnedEmbeddingSimilarity
BinnedEmbeddingSimilarity.__init__()
BinnedEmbeddingSimilarity.build_ann_index()
BinnedEmbeddingSimilarity.compute_embeddings()
BinnedEmbeddingSimilarity.get_anns()
BinnedEmbeddingSimilarity.get_embeddings()
BinnedEmbeddingSimilarity.get_index_anns()
BinnedEmbeddingSimilarity.keep_score()
BinnedEmbeddingSimilarity.load_ann_index()
BinnedEmbeddingSimilarity.load_embeddings()
BinnedEmbeddingSimilarity.matrix()
BinnedEmbeddingSimilarity.pair()
BinnedEmbeddingSimilarity.save_ann_index()
BinnedEmbeddingSimilarity.score_datatype
BinnedEmbeddingSimilarity.sparse_array()
BinnedEmbeddingSimilarity.store_embeddings()
BinnedEmbeddingSimilarity.to_dict()
- matchms.similarity.BlinkCosine module
- matchms.similarity.CosineGreedy module
- matchms.similarity.CosineHungarian module
- matchms.similarity.CosineLinear module
- matchms.similarity.FingerprintSimilarity module
- matchms.similarity.FlashSimilarity module
- matchms.similarity.IntersectMz module
- matchms.similarity.MetadataMatch module
- matchms.similarity.ModifiedCosineGreedy module
- matchms.similarity.ModifiedCosineHungarian module
- matchms.similarity.NeutralLossesCosine module
- matchms.similarity.ParentMassMatch module
- matchms.similarity.PrecursorMzMatch module
- matchms.similarity.cosine_linear_functions module
- matchms.similarity.flash_utils module
- matchms.similarity.spectrum_similarity_functions module
- matchms.similarity.vector_similarity_functions module