matchms.similarity.FingerprintSimilarity module

class matchms.similarity.FingerprintSimilarity.FingerprintSimilarity(similarity_measure: str = 'jaccard', set_empty_scores: float | int | str = 'nan')[source]

Bases: BaseSimilarity

Calculate similarity between molecules based on their fingerprints.

For this similarity measure to work, fingerprints are expected to be derived by running add_fingerprint().

Code example:

import numpy as np
from matchms import calculate_scores
from matchms import Spectrum
from matchms.filtering import add_fingerprint
from matchms.similarity import FingerprintSimilarity

spectrum_1 = Spectrum(mz=np.array([], dtype="float"),
                      intensities=np.array([], dtype="float"),
                      metadata={"smiles": "CCC(C)C(C(=O)O)NC(=O)CCl"})

spectrum_2 = Spectrum(mz=np.array([], dtype="float"),
                      intensities=np.array([], dtype="float"),
                      metadata={"smiles": "CC(C)C(C(=O)O)NC(=O)CCl"})

spectrum_3 = Spectrum(mz=np.array([], dtype="float"),
                      intensities=np.array([], dtype="float"),
                      metadata={"smiles": "C(C(=O)O)(NC(=O)O)S"})

spectrums = [spectrum_1, spectrum_2, spectrum_3]
# Add fingerprints
spectrums = [add_fingerprint(x, nbits=256) for x in spectrums]

# Specify type and calculate similarities
similarity_measure = FingerprintSimilarity("jaccard")
scores = calculate_scores(spectrums, spectrums, similarity_measure)
print(np.round(scores.scores.to_array(), 3))

Should output

[[1.    0.878 0.415]
 [0.878 1.    0.444]
 [0.415 0.444 1.   ]]
__init__(similarity_measure: str = 'jaccard', set_empty_scores: float | int | str = 'nan')[source]
Parameters:
  • similarity_measure – Choose the similarity measure from “cosine”, “dice”, or “jaccard”. The default is “jaccard”.

  • set_empty_scores – Value to return instead of a similarity score when fingerprints are missing. The default is “nan”, which returns np.nan in such cases.
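As a rough illustration of the three measures and of the empty-score fallback, the sketch below applies the textbook definitions to binary fingerprints held in plain Python lists. It is not the matchms implementation: the names fingerprint_score and MEASURES are hypothetical, the fallback is passed as a float rather than the string “nan”, and returning 0.0 for all-zero fingerprints is an assumption made here.

```python
import math


def jaccard(fp_a, fp_b):
    """Shared on-bits divided by the bits set in either fingerprint."""
    both = sum(1 for x, y in zip(fp_a, fp_b) if x and y)
    either = sum(1 for x, y in zip(fp_a, fp_b) if x or y)
    return both / either if either else 0.0  # assumption: 0.0 for two empty vectors


def dice(fp_a, fp_b):
    """Twice the shared on-bits divided by the total number of on-bits."""
    both = sum(1 for x, y in zip(fp_a, fp_b) if x and y)
    total = sum(1 for x in fp_a if x) + sum(1 for y in fp_b if y)
    return 2 * both / total if total else 0.0


def cosine(fp_a, fp_b):
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(fp_a, fp_b))
    norm = math.sqrt(sum(x * x for x in fp_a)) * math.sqrt(sum(y * y for y in fp_b))
    return dot / norm if norm else 0.0


# Hypothetical lookup table, mirroring the three accepted measure names.
MEASURES = {"cosine": cosine, "dice": dice, "jaccard": jaccard}


def fingerprint_score(fp_ref, fp_query, similarity_measure="jaccard",
                      set_empty_scores=float("nan")):
    """Return the chosen similarity, or the fallback if a fingerprint is missing."""
    if fp_ref is None or fp_query is None:
        return set_empty_scores
    return MEASURES[similarity_measure](fp_ref, fp_query)


fp_a = [1, 1, 0, 1, 0, 0, 1, 0]
fp_b = [1, 0, 0, 1, 1, 0, 1, 0]
print(fingerprint_score(fp_a, fp_b, "jaccard"))  # 3 shared bits / 5 bits in either
print(fingerprint_score(fp_a, fp_b, "dice"))     # 2 * 3 / (4 + 4)
print(fingerprint_score(fp_a, fp_b, "cosine"))   # 3 / (2 * 2)
print(fingerprint_score(fp_a, None))             # falls back to nan
```

On binary vectors with equal bit counts, dice and cosine coincide, while jaccard is stricter because the union appears in its denominator.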

keep_score(score)

In the .matrix method, scores are collected in a sparse way. Override this method if values other than False or 0 should also be excluded from the final collection.
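The idea of overriding keep_score can be sketched with toy classes. Everything below is hypothetical and independent of matchms: ToySimilarity assumes the default rule is simply “keep anything that is not False or 0”, ThresholdSimilarity is an invented subclass, and collect_sparse only imitates how a sparse collection might consult the method.

```python
class ToySimilarity:
    """Hypothetical stand-in for a similarity class; not part of matchms."""

    def keep_score(self, score):
        # Assumed default rule: drop False and 0, keep everything else.
        return score != 0


class ThresholdSimilarity(ToySimilarity):
    """Hypothetical subclass that additionally drops scores below a cutoff."""

    def __init__(self, min_score=0.5):
        self.min_score = min_score

    def keep_score(self, score):
        return score >= self.min_score


def collect_sparse(dense_scores, similarity):
    """Collect (row, col, score) entries for the scores that pass keep_score."""
    kept = []
    for i, row in enumerate(dense_scores):
        for j, score in enumerate(row):
            if similarity.keep_score(score):
                kept.append((i, j, score))
    return kept


dense = [[1.0, 0.0], [0.3, 0.8]]
print(collect_sparse(dense, ToySimilarity()))        # the 0.0 entry is dropped
print(collect_sparse(dense, ThresholdSimilarity()))  # 0.0 and 0.3 are dropped
```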

matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = False) → array[source]

Calculate matrix of fingerprint based similarity scores.

Parameters:
  • references – List of reference spectrums.

  • queries – List of query spectrums.

  • array_type – Specify the output array type. Can be “numpy” or “sparse”. The default is “numpy”, which returns a numpy array; “sparse” returns a COO-sparse array.
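To make the difference between the two output layouts concrete, here is a minimal stdlib-only sketch. A dense matrix stores every score, whereas the COO (coordinate) layout stores parallel lists of row indices, column indices, and values for the non-zero entries only. The helpers dense_scores and to_coo are hypothetical names, nested lists stand in for arrays, and the Jaccard function is a textbook version rather than the one matchms uses.

```python
def jaccard(fp_a, fp_b):
    """Textbook Jaccard index on binary fingerprints (illustrative only)."""
    both = sum(1 for x, y in zip(fp_a, fp_b) if x and y)
    either = sum(1 for x, y in zip(fp_a, fp_b) if x or y)
    return both / either if either else 0.0


def dense_scores(references, queries):
    """Dense layout: one row per reference, one column per query."""
    return [[jaccard(r, q) for q in queries] for r in references]


def to_coo(dense):
    """COO layout: parallel lists of row index, column index, and value."""
    rows, cols, data = [], [], []
    for i, row in enumerate(dense):
        for j, value in enumerate(row):
            if value != 0:  # zeros are implicit in a sparse layout
                rows.append(i)
                cols.append(j)
                data.append(value)
    return rows, cols, data


fingerprints = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 0, 1]]
dense = dense_scores(fingerprints, fingerprints)
rows, cols, data = to_coo(dense)
print(rows)  # row index of each stored score
print(cols)  # column index of each stored score
print(data)  # the stored scores themselves
```

The sparse layout pays off when most pairs score zero, since only the non-zero entries occupy memory.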

pair(reference: Spectrum, query: Spectrum) → float[source]

Calculate fingerprint based similarity score between two spectra.

Parameters:
  • reference – Single reference spectrum.

  • query – Single query spectrum.

score_datatype

alias of float64

sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False)

Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.

Compute similarity scores for pairs of reference and query spectrums as given by the indices idx_row (references) and idx_col (queries). If no method is added here, a naive implementation (i.e. a for-loop) is used.

Parameters:
  • references – List of reference objects

  • queries – List of query objects

  • idx_row – List/array of row indices

  • idx_col – List/array of column indices

  • is_symmetric – Set to True when references and queries are identical (as for instance for an all-vs-all comparison). By using the fact that score[i,j] = score[j,i] the calculation will be about 2x faster.
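A naive for-loop over the index pairs could look roughly like the sketch below. It is an assumption about the general shape, not the library's code: naive_sparse_scores is a hypothetical name, binary lists stand in for spectrum fingerprints, the pairwise function is a textbook Jaccard index, and the is_symmetric shortcut is left out for brevity.

```python
def jaccard(fp_a, fp_b):
    """Textbook Jaccard index on binary fingerprints (illustrative only)."""
    both = sum(1 for x, y in zip(fp_a, fp_b) if x and y)
    either = sum(1 for x, y in zip(fp_a, fp_b) if x or y)
    return both / either if either else 0.0


def naive_sparse_scores(references, queries, idx_row, idx_col, pair=jaccard):
    """Score only the (reference, query) pairs selected by idx_row and idx_col."""
    if len(idx_row) != len(idx_col):
        raise ValueError("idx_row and idx_col must have the same length")
    scores = []
    for row, col in zip(idx_row, idx_col):
        # One pairwise call per requested entry; nothing else is computed.
        scores.append(pair(references[row], queries[col]))
    return scores


references = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1]]
queries = [[1, 1, 0, 0], [0, 1, 1, 0]]
idx_row = [0, 2, 1]
idx_col = [1, 0, 1]
scores = naive_sparse_scores(references, queries, idx_row, idx_col)
print(scores)  # one score per (idx_row[k], idx_col[k]) pair
```

Because each requested pair triggers a separate function call, a vectorized override is worthwhile when many index pairs are requested.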

to_dict() → dict

Return a dictionary representation of a similarity function.