matchms.similarity.BinnedEmbeddingSimilarity module

class matchms.similarity.BinnedEmbeddingSimilarity.BinnedEmbeddingSimilarity(similarity: str = 'cosine', max_mz: float = 1005, bin_width: float = 1, intensity_power: float = 1)

Bases: BaseEmbeddingSimilarity

A similarity measure that maps each spectrum onto a fixed number of m/z bins and uses the binned intensities as embedding features. By default, the similarity between spectra is computed as the cosine similarity between their binned representations.

Parameters:
  • similarity (str, optional) – The similarity measure to use for comparing embeddings. Default is “cosine”. Options are “cosine” or “euclidean”.

  • max_mz (float, optional) – The maximum m/z value to consider when binning. Default is 1005.

  • bin_width (float, optional) – The width of each bin in m/z units. Default is 1.

  • intensity_power (float, optional) – The power to which peak intensities are raised. Default is 1.
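To make the binning concrete, here is a minimal NumPy sketch of the idea described above. It is not the library's implementation: the helper names `bin_spectrum` and `cosine_similarity`, the bin-count formula, the choice to sum intensities that share a bin, and the choice to drop peaks at or above `max_mz` are all assumptions made for illustration.

```python
import numpy as np


def bin_spectrum(mz, intensities, max_mz=1005.0, bin_width=1.0, intensity_power=1.0):
    """Hypothetical helper: map peaks onto a fixed-length intensity vector."""
    mz = np.asarray(mz, dtype=float)
    intensities = np.asarray(intensities, dtype=float)
    n_bins = int(np.ceil(max_mz / bin_width))  # assumed bin count
    keep = (mz >= 0) & (mz < max_mz)  # assumption: out-of-range peaks are ignored
    bin_idx = np.floor(mz[keep] / bin_width).astype(int)
    bin_idx = np.minimum(bin_idx, n_bins - 1)  # guard against floating-point round-up
    embedding = np.zeros(n_bins, dtype=np.float64)
    # np.add.at accumulates correctly when several peaks fall into the same bin
    np.add.at(embedding, bin_idx, intensities[keep] ** intensity_power)
    return embedding


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; defined as 0.0 if either is all zeros."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / norm) if norm > 0 else 0.0


# Two toy spectra whose peaks land in the same 1 m/z wide bins (100 and 250).
emb_a = bin_spectrum([100.2, 100.7, 250.1], [0.5, 0.5, 1.0])
emb_b = bin_spectrum([100.4, 250.9, 1200.0], [1.0, 1.0, 0.3])
score = cosine_similarity(emb_a, emb_b)
```

In this sketch a smaller `bin_width` separates the peaks at 100.2 and 100.7 into different bins, which is the trade-off the parameter controls: finer bins distinguish nearby peaks but produce longer, sparser vectors.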

__init__(similarity: str = 'cosine', max_mz: float = 1005, bin_width: float = 1, intensity_power: float = 1)

build_ann_index(reference_spectra: Iterable[Spectrum] | None = None, embeddings_path: str | Path | None = None, k: int = 100, index_backend: str = 'pynndescent', **index_kwargs) → Any

Build an ANN index for the reference spectra.

Parameters:
  • reference_spectra (Optional[Iterable[SpectrumType]]) – List of reference spectra to build the ANN index for.

  • embeddings_path (Optional[Union[str, Path]]) – If embeddings are already computed, provide the path to the numpy file.

  • k (int, optional) – Number of nearest neighbors to use for the ANN index.

  • index_backend (str, optional) – Backend to use for ANN index. Currently only “pynndescent” is supported.

  • **index_kwargs – Additional keyword arguments passed to the index constructor.

Returns:

The constructed ANN index.

Return type:

Any

Raises:
  • ImportError – If pynndescent is not installed.

  • ValueError – If an unsupported index_backend is specified.

compute_embeddings(spectra: Iterable[Spectrum]) → ndarray

Convert spectra into binned embeddings.

Parameters:

spectra (Iterable[SpectrumType]) – The spectra to convert into embeddings.

Returns:

Array of shape (n_spectra, n_bins) containing the binned embeddings.

Return type:

np.ndarray

get_anns(query_spectra: Iterable[Spectrum] | ndarray, k: int = 100) → Tuple[ndarray, ndarray]

Get approximate nearest neighbors for query spectra.

Parameters:
  • query_spectra (Union[Iterable[SpectrumType], np.ndarray]) – Query spectra or their embeddings.

  • k (int, optional) – Number of nearest neighbors to return.

Returns:

Neighbor indices and similarity scores.

Return type:

Tuple[np.ndarray, np.ndarray]

Raises:

ValueError – If no index has been built, or if k is larger than the k used to build the index.
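For intuition about the pair of arrays returned here, the following sketch computes exact nearest neighbors by brute force. It does not use or reproduce the ANN index. The function name `exact_neighbors` is hypothetical, and the sketch assumes cosine similarity, non-zero embedding rows, output arrays of shape (n_queries, k), and neighbors ordered from most to least similar.

```python
import numpy as np


def exact_neighbors(reference_embs, query_embs, k):
    """Hypothetical exact counterpart of an ANN lookup, using cosine similarity."""
    if k > len(reference_embs):
        raise ValueError("k cannot exceed the number of reference embeddings")
    # L2-normalise rows so that a dot product equals the cosine similarity
    ref = reference_embs / np.linalg.norm(reference_embs, axis=1, keepdims=True)
    qry = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    scores = qry @ ref.T  # shape (n_queries, n_references)
    order = np.argsort(-scores, axis=1)[:, :k]  # most similar first
    return order, np.take_along_axis(scores, order, axis=1)


references = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
queries = np.array([[2.0, 0.1, 0.0]])
neighbor_idx, neighbor_scores = exact_neighbors(references, queries, k=2)
```

The brute-force version costs one dot product per query-reference pair, which is the cost an approximate index is meant to avoid on large reference sets.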

get_embeddings(spectra: Iterable[Spectrum] | None = None, npy_path: str | Path | None = None) → ndarray

Get embeddings either by computing them or loading from disk.

Parameters:
  • spectra – List of spectra to compute embeddings for.

  • npy_path – Path to load embeddings from or save them to. If provided and the file exists, embeddings are loaded from disk; otherwise they are computed and saved to this path.

Returns:

Embeddings array.

Return type:

np.ndarray

Raises:

ValueError – If neither spectra nor npy_path is provided.
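The load-or-compute behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `load_or_compute` and `toy_embed` are hypothetical names, and raising ValueError when the file is missing and no items are given is an assumption of this sketch.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_or_compute(compute_fn, items=None, npy_path=None):
    """Hypothetical helper mirroring the load-or-compute behaviour described above."""
    if items is None and npy_path is None:
        raise ValueError("Provide items, npy_path, or both")
    if npy_path is not None and Path(npy_path).exists():
        embs = np.load(npy_path)
        if embs.ndim != 2:
            raise ValueError("Expected a 2D embeddings array")
        return embs
    if items is None:
        # Assumption of this sketch: nothing to load and nothing to compute from
        raise ValueError("File not found and no items given")
    embs = compute_fn(items)
    if npy_path is not None:
        np.save(npy_path, embs)  # cache for the next call
    return embs


calls = []  # records how often the embeddings are actually computed


def toy_embed(items):
    calls.append(len(items))
    return np.array([[float(x), float(x) ** 2] for x in items])


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "embeddings.npy"
    first = load_or_compute(toy_embed, items=[1, 2, 3], npy_path=cache)  # computes, saves
    second = load_or_compute(toy_embed, npy_path=cache)  # loads from disk
```

One consequence of this pattern is that a stale file silently wins over freshly passed items, so the cache file needs to be deleted or renamed whenever the binning parameters change.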

get_index_anns() → Tuple[ndarray, ndarray]

Get nearest neighbors for all points in the index.

Returns:

Neighbor indices and similarity scores.

Return type:

Tuple[np.ndarray, np.ndarray]

Raises:

ValueError – If an unsupported index backend is used.

keep_score(score)

In the .matrix method, scores are collected sparsely. Override this method if values other than False or 0 should also be excluded from the final collection.
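As an illustration of such an override, the sketch below uses a stand-in base class rather than the real matchms classes. `ToySimilarity`, `ThresholdedSimilarity`, and the 0.7 cutoff are hypothetical, and the sketch assumes that keep_score returns a truthy value for scores that should be stored.

```python
class ToySimilarity:
    """Hypothetical stand-in for a similarity class; not part of matchms."""

    def keep_score(self, score):
        # Behaviour described above: scores equal to False or 0 are not stored
        return score != 0


class ThresholdedSimilarity(ToySimilarity):
    """Hypothetical subclass that also discards weak matches."""

    def keep_score(self, score):
        return score >= 0.7


scores = [0.0, 0.35, 0.7, 0.92]
kept_default = [s for s in scores if ToySimilarity().keep_score(s)]
kept_thresholded = [s for s in scores if ThresholdedSimilarity().keep_score(s)]
```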

load_ann_index(path: str | Path) → Any

Load an ANN index from disk.

Parameters:

path (Union[str, Path]) – Path to load the index from.

Returns:

The loaded ANN index.

Return type:

Any

Raises:

ValueError – If the similarity metric of the loaded index does not match the current metric.

static load_embeddings(npy_path: str | Path) → ndarray

Load embeddings from a numpy file.

Parameters:

npy_path (Union[str, Path]) – Path to the numpy file.

Returns:

Embeddings array.

Return type:

np.ndarray

Raises:

ValueError – If the loaded array is not 2D.

matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = True) → ndarray

Compute similarity matrix between reference and query spectra.

Parameters:
  • references – List of reference spectra.

  • queries – List of query spectra.

  • array_type – Type of array to return. Must be “numpy”.

  • is_symmetric – Whether the matrix is symmetric. Must be True.

Returns:

Similarity matrix.

Return type:

np.ndarray

Raises:

ValueError – If array_type is not “numpy” or is_symmetric is False.

pair(reference: Spectrum, query: Spectrum) → float

Compute similarity between a pair of spectra.

Parameters:
  • reference (SpectrumType) – Reference spectrum.

  • query (SpectrumType) – Query spectrum.

Returns:

Similarity score between the spectra.

Return type:

float

save_ann_index(path: str | Path) → None

Save the ANN index to disk.

Parameters:

path (Union[str, Path]) – Path to save the index to.

Raises:

ValueError – If no index exists to save.

score_datatype

alias of float64

sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)

Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.

Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no optimized method is added here, the default naive implementation (a for-loop over the index pairs) is used.

Parameters:
  • references – List of reference objects

  • queries – List of query objects

  • idx_row – List/array of row indices

  • idx_col – List/array of column indices

  • is_symmetric – Set to True when references and queries are identical (for instance, in an all-vs-all comparison). Because score[i,j] = score[j,i], only half of the pairs need to be computed, which makes the calculation about 2x faster.

  • progress_bar – When True a progress bar is shown. Default is True.
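The naive loop itself is not reproduced on this page. A minimal sketch of what such a loop could look like is given below, assuming one pairwise call per (row, column) index pair and a flat output array. `naive_sparse_scores` and `toy_pair` are hypothetical names, and plain numbers stand in for spectra.

```python
import numpy as np


def naive_sparse_scores(pair_fn, references, queries, idx_row, idx_col):
    """Hypothetical stand-in for a naive loop: one pairwise call per index pair."""
    if len(idx_row) != len(idx_col):
        raise ValueError("idx_row and idx_col must have the same length")
    scores = np.zeros(len(idx_row), dtype=np.float64)
    for i, (row, col) in enumerate(zip(idx_row, idx_col)):
        scores[i] = pair_fn(references[row], queries[col])
    return scores


def toy_pair(reference, query):
    """Toy score for plain numbers: 1 / (1 + |difference|)."""
    return 1.0 / (1.0 + abs(reference - query))


references = [100.0, 200.0, 300.0]
queries = [100.0, 210.0]
scores = naive_sparse_scores(
    toy_pair, references, queries, idx_row=[0, 1, 2], idx_col=[0, 1, 1]
)
```

Only the requested pairs are scored, so the cost scales with the number of index pairs rather than with len(references) × len(queries).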

static store_embeddings(npy_path: str | Path, embs: ndarray) → None

Store embeddings in a numpy file.

Parameters:
  • npy_path (Union[str, Path]) – Path to save the embeddings to.

  • embs (np.ndarray) – Embeddings array to store.

to_dict() → dict

Return a dictionary representation of a similarity function.