matchms.similarity.BaseEmbeddingSimilarity module

class matchms.similarity.BaseEmbeddingSimilarity.BaseEmbeddingSimilarity(similarity: str = 'cosine')

Bases: BaseSimilarity

Base class for similarity measures that work with embeddings.

This class provides functionality for computing similarities between spectra based on their embeddings (vector representations). It supports cosine and euclidean similarity metrics, and includes approximate nearest neighbor (ANN) search capabilities.
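To make the two metrics concrete, here is a minimal NumPy sketch that computes cosine similarity and Euclidean distance for toy embedding vectors. It is not the matchms implementation: the helper names and vectors are made up for illustration, and how the class converts a Euclidean distance into a similarity score is not specified on this page.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line distance between two vectors (0.0 means identical)."""
    return float(np.linalg.norm(a - b))


# Toy embeddings (hypothetical values, not produced by any model).
emb_a = np.array([1.0, 0.0, 0.0])
emb_b = np.array([2.0, 0.0, 0.0])  # same direction as emb_a, different length
emb_c = np.array([0.0, 1.0, 0.0])  # orthogonal to emb_a

cos_ab = cosine_similarity(emb_a, emb_b)
cos_ac = cosine_similarity(emb_a, emb_c)
dist_ab = euclidean_distance(emb_a, emb_b)
```

Cosine similarity ignores vector length, so emb_a and emb_b count as identical in direction, while the Euclidean distance between them is non-zero.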

Parameters:

similarity (str) – The similarity measure to use for comparing embeddings. Default is “cosine”. Options are “cosine” or “euclidean”.

index

The ANN index object, once an index has been built.

Type:

object

index_backend

The backend used for ANN indexing, once an index has been built. Currently only “pynndescent” is supported.

Type:

str

index_kwargs

Additional arguments passed to the ANN index constructor, once an index has been built.

Type:

dict

index_k

Number of nearest neighbors used in the ANN index, once an index has been built.

Type:

int

__init__(similarity: str = 'cosine')
build_ann_index(reference_spectra: Iterable[Spectrum] | None = None, embeddings_path: str | Path | None = None, k: int = 100, index_backend: str = 'pynndescent', **index_kwargs) → Any

Build an ANN index for the reference spectra.

Parameters:
  • reference_spectra (Optional[Iterable[SpectrumType]]) – List of reference spectra to build the ANN index for.

  • embeddings_path (Optional[Union[str, Path]]) – Path to the NumPy file holding the embeddings, if they have already been computed.

  • k (int, optional) – Number of nearest neighbors to use for the ANN index.

  • index_backend (str, optional) – Backend to use for ANN index. Currently only “pynndescent” is supported.

  • **index_kwargs – Additional keyword arguments passed to the index constructor.

Returns:

The constructed ANN index.

Return type:

Any

Raises:
  • ImportError – If pynndescent is not installed.

  • ValueError – If an unsupported index_backend is specified.

abstractmethod compute_embeddings(spectra: Iterable[Spectrum]) → ndarray

Compute embeddings for a list of spectra.

Parameters:

spectra – List of spectra to compute embeddings for.

Returns:

Embeddings for the spectra. Shape: (n_spectra, n_embedding_features).

Return type:

np.ndarray
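The contract above (one row per spectrum, one column per embedding feature) can be sketched with a toy embedding function. Everything here is an assumption for illustration: ToySpectrum is a hypothetical stand-in for a matchms Spectrum, and binning intensities by m/z is only one simple way to produce a fixed-length vector, not what any particular subclass does.

```python
from dataclasses import dataclass
from typing import Iterable

import numpy as np


@dataclass
class ToySpectrum:
    """Hypothetical stand-in for a spectrum: peak m/z values and intensities."""
    mz: np.ndarray
    intensities: np.ndarray


def compute_embeddings(spectra: Iterable[ToySpectrum], n_bins: int = 4,
                       mz_max: float = 400.0) -> np.ndarray:
    """Embed each spectrum as summed intensities in equal-width m/z bins."""
    rows = []
    for spectrum in spectra:
        # Sum the intensities of all peaks that fall into each m/z bin.
        hist, _ = np.histogram(spectrum.mz, bins=n_bins, range=(0.0, mz_max),
                               weights=spectrum.intensities)
        rows.append(hist)
    # Stack into shape (n_spectra, n_embedding_features).
    return np.asarray(rows, dtype=np.float64)


spectra = [
    ToySpectrum(mz=np.array([50.0, 150.0]),
                intensities=np.array([1.0, 0.5])),
    ToySpectrum(mz=np.array([250.0, 350.0, 360.0]),
                intensities=np.array([0.2, 0.3, 0.4])),
]
embeddings = compute_embeddings(spectra)
```

A real subclass would implement compute_embeddings as a method and would typically call a trained model instead of binning peaks.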

get_anns(query_spectra: Iterable[Spectrum] | ndarray, k: int = 100) → Tuple[ndarray, ndarray]

Get approximate nearest neighbors for query spectra.

Parameters:
  • query_spectra (Union[Iterable[SpectrumType], np.ndarray]) – Query spectra or their embeddings.

  • k (int, optional) – Number of nearest neighbors to return.

Returns:

Neighbor indices and similarity scores.

Return type:

Tuple[np.ndarray, np.ndarray]

Raises:

ValueError – If no index has been built, or if k is larger than the k the index was built with (index_k).
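For intuition about what an ANN query approximates, the following sketch computes exact cosine nearest neighbors by brute force with NumPy. It does not use pynndescent or matchms. The function name is hypothetical, and the output layout (two arrays with one row per query and k columns) is an assumption about the return value described above.

```python
import numpy as np


def exact_cosine_neighbors(reference: np.ndarray, query: np.ndarray, k: int):
    """Return (indices, scores) of the k most cosine-similar reference rows."""
    if k > reference.shape[0]:
        raise ValueError("k cannot exceed the number of reference embeddings.")
    # Normalize rows so that a dot product equals the cosine similarity.
    ref_unit = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    qry_unit = query / np.linalg.norm(query, axis=1, keepdims=True)
    sims = qry_unit @ ref_unit.T  # shape (n_queries, n_references)
    # Sort each row by descending similarity and keep the first k columns.
    order = np.argsort(-sims, axis=1)[:, :k]
    scores = np.take_along_axis(sims, order, axis=1)
    return order, scores


# Toy reference and query embeddings (hypothetical values).
reference = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([[1.0, 0.1]])

indices, scores = exact_cosine_neighbors(reference, query, k=2)
```

Brute force scales with the product of query and reference counts, which is the cost an ANN index is meant to avoid on large libraries.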

get_embeddings(spectra: Iterable[Spectrum] | None = None, npy_path: str | Path | None = None) → ndarray

Get embeddings either by computing them or loading from disk.

Parameters:
  • spectra – List of spectra to compute embeddings for.

  • npy_path – Path used to load or save embeddings. If provided and the file exists, embeddings are loaded from it; otherwise they are computed and saved to this path.

Returns:

Embeddings array.

Return type:

np.ndarray

Raises:

ValueError – If neither spectra nor npy_path is provided.
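The load-or-compute behaviour described for npy_path can be sketched as a small caching helper. This is an illustration of the pattern under stated assumptions, not the matchms source: load_or_compute and fake_compute are hypothetical names, and only np.load and np.save are real NumPy calls.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional

import numpy as np


def load_or_compute(compute: Callable[[], np.ndarray],
                    npy_path: Optional[Path] = None) -> np.ndarray:
    """Load embeddings from npy_path if it exists, else compute and save."""
    if npy_path is not None and Path(npy_path).exists():
        return np.load(npy_path)
    embeddings = compute()
    if npy_path is not None:
        np.save(npy_path, embeddings)
    return embeddings


calls = []


def fake_compute() -> np.ndarray:
    """Hypothetical stand-in for an expensive embedding computation."""
    calls.append(1)
    return np.arange(6, dtype=np.float64).reshape(3, 2)


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / "embeddings.npy"
    first = load_or_compute(fake_compute, cache_file)   # computes and saves
    second = load_or_compute(fake_compute, cache_file)  # loads from disk
```

The second call never reaches fake_compute, which is the point of passing a path when the same spectra are embedded repeatedly.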

get_index_anns() → Tuple[ndarray, ndarray]

Get nearest neighbors for all points in the index.

Returns:

Neighbor indices and similarity scores.

Return type:

Tuple[np.ndarray, np.ndarray]

Raises:

ValueError – If an unsupported index backend is used.

keep_score(score)

In the .matrix method, scores are collected sparsely. Override this method if values other than False or 0 should also be excluded from the final collection.

load_ann_index(path: str | Path) → Any

Load an ANN index from disk.

Parameters:

path (Union[str, Path]) – Path to load the index from.

Returns:

The loaded ANN index.

Return type:

Any

Raises:

ValueError – If the similarity metric of the loaded index does not match the current metric.

static load_embeddings(npy_path: str | Path) → ndarray

Load embeddings from a numpy file.

Parameters:

npy_path (Union[str, Path]) – Path to the numpy file.

Returns:

Embeddings array.

Return type:

np.ndarray

Raises:

ValueError – If the loaded array is not 2D.
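The dimensionality check can be illustrated with a short sketch that loads an array and rejects anything that is not 2D. The helper names are hypothetical, and an in-memory buffer stands in for a file on disk so the example leaves nothing behind. The error message is illustrative and not taken from matchms.

```python
import io

import numpy as np


def load_2d_embeddings(source) -> np.ndarray:
    """Load an array with np.load and reject anything that is not 2D."""
    embs = np.load(source)
    if embs.ndim != 2:
        raise ValueError(f"Expected a 2D array, got {embs.ndim}D.")
    return embs


def to_buffer(array: np.ndarray) -> io.BytesIO:
    """Serialize an array in .npy format into an in-memory buffer."""
    buffer = io.BytesIO()
    np.save(buffer, array)
    buffer.seek(0)
    return buffer


# A (3, 5) array passes the check.
good = load_2d_embeddings(to_buffer(np.zeros((3, 5))))

# A 1D array is rejected.
try:
    load_2d_embeddings(to_buffer(np.zeros(5)))
    rejected = False
except ValueError:
    rejected = True
```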

matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = True) → ndarray

Compute similarity matrix between reference and query spectra.

Parameters:
  • references – List of reference spectra.

  • queries – List of query spectra.

  • array_type – Type of array to return. Must be “numpy”.

  • is_symmetric – Whether the matrix is symmetric. Must be True.

Returns:

Similarity matrix.

Return type:

np.ndarray

Raises:

ValueError – If array_type is not “numpy” or is_symmetric is False.

pair(reference: Spectrum, query: Spectrum) → float

Compute similarity between a pair of spectra.

Parameters:
  • reference (SpectrumType) – Reference spectrum.

  • query (SpectrumType) – Query spectrum.

Returns:

Similarity score between the spectra.

Return type:

float

save_ann_index(path: str | Path) → None

Save the ANN index to disk.

Parameters:

path (Union[str, Path]) – Path to save the index to.

Raises:

ValueError – If no index exists to save.

score_datatype

alias of float64

sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)

Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.

Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If this method is not overridden, a naive implementation (i.e. a for-loop) is used.

Parameters:
  • references – List of reference objects

  • queries – List of query objects

  • idx_row – List/array of row indices

  • idx_col – List/array of column indices

  • is_symmetric – Set to True when references and queries are identical (for instance, in an all-vs-all comparison). Because score[i,j] = score[j,i], the calculation is then about 2x faster.

  • progress_bar – When True a progress bar is shown. Default is True.
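The naive for-loop over index pairs can be sketched as follows. This is an illustration under stated assumptions rather than the matchms source: naive_sparse_scores and toy_pair are hypothetical names, strings stand in for spectra, and the symmetric shortcut and progress bar are left out.

```python
import numpy as np


def naive_sparse_scores(references, queries, idx_row, idx_col, pair_fn):
    """Score only the (reference, query) pairs selected by the two index lists."""
    scores = np.zeros(len(idx_row), dtype=np.float64)
    for n, (i, j) in enumerate(zip(idx_row, idx_col)):
        # idx_row indexes into references, idx_col indexes into queries.
        scores[n] = pair_fn(references[i], queries[j])
    return scores


def toy_pair(reference, query) -> float:
    """Hypothetical pair score: 1.0 for identical items, 0.0 otherwise."""
    return 1.0 if reference == query else 0.0


references = ["a", "b", "c"]
queries = ["a", "c"]
idx_row = [0, 1, 2]
idx_col = [0, 0, 1]

scores = naive_sparse_scores(references, queries, idx_row, idx_col, toy_pair)
```

Only three of the six possible pairs are scored here, which is why a sparse layout pays off when most pairs are of no interest.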

static store_embeddings(npy_path: str | Path, embs: ndarray) → None

Store embeddings in a numpy file.

Parameters:
  • npy_path (Union[str, Path]) – Path to save the embeddings to.

  • embs (np.ndarray) – Embeddings array to store.

to_dict() → dict

Return a dictionary representation of a similarity function.