matchms.similarity.BaseEmbeddingSimilarity module

class matchms.similarity.BaseEmbeddingSimilarity.BaseEmbeddingSimilarity(similarity: str = 'cosine')

Bases: BaseSimilarity

Base class for similarity measures that work with embeddings.

This class computes similarities between spectra from their embeddings (vector representations). It supports cosine and Euclidean similarity metrics and includes approximate nearest neighbor (ANN) search.

Parameters: similarity (str) – The similarity measure used to compare embeddings. Options are “cosine” and “euclidean”; the default is “cosine”.

index

The ANN index object, if an index has been built.

Type: object

index_backend

The backend used for ANN indexing, if an index has been built. Currently only “pynndescent” is supported.

Type: str

index_kwargs

Additional arguments passed to the ANN index constructor, if an index has been built.

Type: dict

index_k

Number of nearest neighbors used in the ANN index, if an index has been built.

Type: int

__init__(similarity: str = 'cosine')

build_ann_index(reference_spectra: Iterable[Spectrum] | None = None, embeddings_path: str | Path | None = None, k: int = 100, index_backend: str = 'pynndescent', **index_kwargs) → Any

Build an ANN index for the input spectra.

Parameters:

reference_spectra (Optional[Iterable[SpectrumType]]) – List of reference spectra to build the ANN index for.
embeddings_path (Optional[Union[str, Path]]) – If embeddings are already computed, provide the path to the numpy file.
k (int, optional) – Number of nearest neighbors to use for the ANN index.
index_backend (str, optional) – Backend to use for ANN index. Currently only “pynndescent” is supported.
**index_kwargs – Additional keyword arguments passed to the index constructor.

Returns:

The constructed ANN index.

Return type:

Any

Raises:

ImportError – If pynndescent is not installed.
ValueError – If an unsupported index_backend is specified.

abstractmethod compute_embeddings(spectra: Iterable[Spectrum]) → ndarray

Compute embeddings for a list of spectra.

Parameters: spectra – List of spectra to compute embeddings for.
Returns: Embeddings for the spectra. Shape: (n_spectra, n_embedding_features).
Return type: np.ndarray
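
As an illustration of the contract above, here is a minimal sketch of a subclass that implements compute_embeddings. It does not import matchms: EmbeddingSimilaritySketch is a hypothetical stand-in for the base class, BinnedIntensityEmbedding is an invented toy embedding, and each "spectrum" is assumed to be a plain (mz, intensities) pair of arrays rather than a Spectrum object.

```python
from abc import ABC, abstractmethod

import numpy as np


class EmbeddingSimilaritySketch(ABC):
    """Hypothetical stand-in for the base class; not the matchms implementation."""

    @abstractmethod
    def compute_embeddings(self, spectra):
        """Return an array of shape (n_spectra, n_embedding_features)."""


class BinnedIntensityEmbedding(EmbeddingSimilaritySketch):
    """Toy embedding (invented here): sum peak intensities into equal-width m/z bins."""

    def __init__(self, mz_max=100.0, n_bins=4):
        self.mz_max = mz_max
        self.n_bins = n_bins

    def compute_embeddings(self, spectra):
        rows = []
        for mz, intensities in spectra:
            # One histogram per spectrum, weighted by peak intensity.
            hist, _ = np.histogram(
                mz,
                bins=self.n_bins,
                range=(0.0, self.mz_max),
                weights=intensities,
            )
            rows.append(hist)
        # Stack into the required (n_spectra, n_embedding_features) layout.
        return np.asarray(rows, dtype=np.float64)


# Each toy "spectrum" is an (mz, intensities) pair; an assumption of this sketch.
spectra = [
    (np.array([10.0, 60.0]), np.array([1.0, 0.5])),
    (np.array([30.0, 80.0, 90.0]), np.array([0.2, 0.3, 0.4])),
]
embeddings = BinnedIntensityEmbedding().compute_embeddings(spectra)
```

The only requirement the sketch relies on is the documented return shape: one row per spectrum and one column per embedding feature.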

compute_similarity_matrix_from_embeddings(embeddings_1: ndarray, embeddings_2: ndarray | None = None) → ndarray

Compute a raw NumPy similarity matrix from precomputed embeddings.

This helper keeps the old raw-array use case available without changing the public matrix() contract inherited from BaseSimilarity.
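
To make the two metrics concrete, the following NumPy sketch computes a pairwise similarity matrix directly from embedding arrays. It is not the library's code: similarity_matrix is a hypothetical helper, and the mapping from Euclidean distance d to a similarity 1 / (1 + d) is one common convention assumed here, not something this page confirms.

```python
import numpy as np


def similarity_matrix(emb_1, emb_2=None, similarity="cosine"):
    """Pairwise similarities between rows of emb_1 and rows of emb_2."""
    emb_1 = np.asarray(emb_1, dtype=np.float64)
    # With no second array, compare the first array against itself.
    emb_2 = emb_1 if emb_2 is None else np.asarray(emb_2, dtype=np.float64)

    if similarity == "cosine":
        # Normalize rows to unit length; the dot product is then the cosine.
        # Rows are assumed to be non-zero vectors.
        unit_1 = emb_1 / np.linalg.norm(emb_1, axis=1, keepdims=True)
        unit_2 = emb_2 / np.linalg.norm(emb_2, axis=1, keepdims=True)
        return unit_1 @ unit_2.T
    if similarity == "euclidean":
        # Assumed convention: map a distance d to the similarity 1 / (1 + d).
        diff = emb_1[:, None, :] - emb_2[None, :, :]
        dist = np.linalg.norm(diff, axis=2)
        return 1.0 / (1.0 + dist)
    raise ValueError(f"Unsupported similarity: {similarity!r}")


embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
cos = similarity_matrix(embeddings)
euc = similarity_matrix(embeddings, similarity="euclidean")
```

Both results have shape (n_1, n_2). Under either convention a vector compared with itself scores 1.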

get_anns(query_spectra: Iterable[Spectrum] | ndarray, k: int = 100) → tuple[ndarray, ndarray]

Get approximate nearest neighbors for input spectra.

Parameters:

query_spectra (Union[Iterable[SpectrumType], np.ndarray]) – Query spectra or their embeddings.
k (int, optional) – Number of nearest neighbors to return.

Returns:

Neighbor indices and similarity scores.

Return type:

Tuple[np.ndarray, np.ndarray]

Raises:

ValueError – If no index has been built, or if k is larger than the k used to build the index.
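
The shape of the returned pair can be illustrated with an exact, brute-force search in NumPy. This is a stand-in for the approximate search, not the pynndescent-backed implementation: exact_neighbors is a hypothetical helper, cosine similarity is assumed, and returning neighbors in order of decreasing similarity is an assumption of the sketch.

```python
import numpy as np


def exact_neighbors(query, reference, k):
    """Return (indices, scores) of the k most cosine-similar reference rows."""
    if k > reference.shape[0]:
        raise ValueError("k is larger than the number of reference embeddings")
    # Unit-normalize rows so the dot product equals the cosine similarity.
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    r = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    sims = q @ r.T
    # Sort each row by descending similarity and keep the first k columns.
    order = np.argsort(-sims, axis=1, kind="stable")[:, :k]
    scores = np.take_along_axis(sims, order, axis=1)
    return order, scores


reference = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([[2.0, 0.1], [0.0, 3.0]])
indices, scores = exact_neighbors(query, reference, k=2)
```

Both arrays have shape (n_queries, k): row i of indices lists positions in the reference set, and row i of scores holds the matching similarities.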

get_embeddings(spectra: Iterable[Spectrum] | None = None, npy_path: str | Path | None = None) → ndarray

Get embeddings either by computing them or loading from disk.

Parameters:

spectra – List of spectra to compute embeddings for.
npy_path – Path of the numpy file used to cache embeddings. If the file exists, embeddings are loaded from it; otherwise they are computed from spectra and saved to this path.

Returns:

Embeddings array.

Return type:

np.ndarray

Raises:

ValueError – If neither spectra nor npy_path is provided.
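
The load-or-compute behavior described above can be sketched with NumPy alone. The names get_or_compute and toy_compute are hypothetical, and raising an error when the file is missing and there is nothing to compute from is an assumption of this sketch, since the page only documents the case where both arguments are absent.

```python
import tempfile
from pathlib import Path

import numpy as np


def get_or_compute(compute, items=None, npy_path=None):
    """Load embeddings from npy_path if it exists, else compute and cache them."""
    if items is None and npy_path is None:
        raise ValueError("Provide items, npy_path, or both.")
    if npy_path is not None and Path(npy_path).exists():
        return np.load(npy_path)
    if items is None:
        # Assumption of this sketch: a missing file with no items is an error.
        raise ValueError(f"{npy_path} does not exist and no items were given.")
    embeddings = compute(items)
    if npy_path is not None:
        np.save(npy_path, embeddings)
    return embeddings


calls = []  # records how often the (toy) embedding function actually runs


def toy_compute(items):
    calls.append(len(items))
    return np.array([[float(x), float(x) ** 2] for x in items])


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embeddings.npy"
    first = get_or_compute(toy_compute, items=[1, 2, 3], npy_path=path)
    # The second call finds the cached file and skips the computation.
    second = get_or_compute(toy_compute, npy_path=path)
```

The cache path should end in .npy, because np.save appends that extension to names that lack it, and the existence check would then look at a different file.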

get_index_anns() → tuple[ndarray, ndarray]

Get nearest neighbors for all points in the index.

Returns: Neighbor indices and similarity scores.
Return type: Tuple[np.ndarray, np.ndarray]
Raises: ValueError – If an unsupported index backend is used.

property is_structured_score: bool

Return True if this similarity uses a structured score dtype.

load_ann_index(path: str | Path) → Any

Load an ANN index from disk.

Parameters: path (Union[str, Path]) – Path to load the index from.
Returns: The loaded ANN index.
Return type: Any
Raises: ValueError – If the similarity metric of the loaded index does not match the current metric.

static load_embeddings(npy_path: str | Path) → ndarray

Load embeddings from a numpy file.

Parameters: npy_path (Union[str, Path]) – Path to the numpy file.
Returns: Embeddings array.
Return type: np.ndarray
Raises: ValueError – If the loaded array is not 2D.
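
The 2D requirement amounts to a dimensionality check after loading. Here is a minimal sketch, assuming embeddings are written with np.save; load_2d is a hypothetical helper and its error message is invented for illustration.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_2d(npy_path):
    """Load an array and reject anything that is not (n_spectra, n_features)."""
    arr = np.load(npy_path)
    if arr.ndim != 2:
        raise ValueError(f"Expected a 2D embeddings array, got {arr.ndim}D.")
    return arr


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.npy"
    bad = Path(tmp) / "bad.npy"
    np.save(good, np.zeros((3, 5)))  # valid: 3 spectra, 5 features
    np.save(bad, np.zeros(5))        # invalid: a flat vector
    loaded = load_2d(good)
    try:
        load_2d(bad)
        rejected = False
    except ValueError:
        rejected = True
```

A single embedding saved as a flat vector would fail this check; reshaping it to (1, n_features) before saving avoids that.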

matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True) → Scores

Compute similarity matrix between spectra_1 and spectra_2.

Parameters:

spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself.
score_fields – Requested score fields. Embedding similarities expose only ("score",).
progress_bar – Included for API compatibility. Embeddings are computed in batch and this implementation currently does not display a progress bar.

Returns:

Scores object holding the similarity matrix.

Return type:

Scores

Raises:

ValueError – If array_type is not “numpy” or is_symmetric is False.

pair(spectrum_1: Spectrum, spectrum_2: Spectrum) → float

Compute similarity between a pair of spectra.

Parameters:

spectrum_1 (SpectrumType) – Reference spectrum.
spectrum_2 (SpectrumType) – Query spectrum.

save_ann_index(path: str | Path) → None

Save the ANN index to disk.

Parameters: path (Union[str, Path]) – Path to save the index to.
Raises: ValueError – If no index exists to save.

score_datatype: alias of float64

sparse_matrix(spectra_1, spectra_2=None, idx_row=None, idx_col=None, score_fields=None, score_filter=None, progress_bar: bool = True)

Sparse score computation is not available for this similarity.

static store_embeddings(npy_path: str | Path, embeddings: ndarray) → None

Store embeddings in a numpy file.

Parameters:

npy_path (Union[str, Path]) – Path to save the embeddings to.
embeddings (np.ndarray) – Embeddings array to store.

to_dict() → dict

Return a dictionary representation of the similarity function.