matchms.similarity.BaseEmbeddingSimilarity module
- class matchms.similarity.BaseEmbeddingSimilarity.BaseEmbeddingSimilarity(similarity: str = 'cosine')[source]
Bases: BaseSimilarity
Base class for similarity measures that work with embeddings.
This class computes similarities between spectra from their embeddings (vector representations). It supports cosine and Euclidean similarity metrics and includes approximate nearest neighbor (ANN) search.
- Parameters:
similarity (str) – The similarity measure to use for comparing embeddings. Default is “cosine”. Options are “cosine” or “euclidean”.
- index_backend
The backend used for ANN indexing, available once an index has been built. Currently only “pynndescent” is supported.
- Type:
- index_kwargs
Additional arguments passed to the ANN index constructor, available once an index has been built.
- Type:
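The two metrics named in the class description can be illustrated on plain NumPy arrays. The following is a minimal sketch, not the library's implementation: the helper names are hypothetical, and it returns a raw Euclidean distance because this page does not state how the class converts distances into similarity scores.

```python
import numpy as np


def cosine_similarity_matrix(emb_1: np.ndarray, emb_2: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of two 2D arrays (hypothetical helper)."""
    # Normalize each row to unit length, then take dot products.
    unit_1 = emb_1 / np.linalg.norm(emb_1, axis=1, keepdims=True)
    unit_2 = emb_2 / np.linalg.norm(emb_2, axis=1, keepdims=True)
    return unit_1 @ unit_2.T


def euclidean_distance_matrix(emb_1: np.ndarray, emb_2: np.ndarray) -> np.ndarray:
    """Pairwise Euclidean distance between the rows of two 2D arrays (hypothetical helper)."""
    # Broadcasting yields an (n_1, n_2, n_features) array of differences.
    diff = emb_1[:, None, :] - emb_2[None, :, :]
    return np.linalg.norm(diff, axis=2)


references = np.array([[1.0, 0.0], [0.0, 2.0]])
queries = np.array([[3.0, 0.0], [1.0, 1.0]])

cos = cosine_similarity_matrix(references, queries)   # shape (2, 2)
dist = euclidean_distance_matrix(references, queries)  # shape (2, 2)
```

Cosine similarity ignores vector length, so the first reference and the first query score 1.0 even though they are 2.0 apart in Euclidean distance.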
- build_ann_index(reference_spectra: Iterable[Spectrum] | None = None, embeddings_path: str | Path | None = None, k: int = 100, index_backend: str = 'pynndescent', **index_kwargs) Any[source]
Build an ANN index for the input spectra.
- Parameters:
reference_spectra (Optional[Iterable[SpectrumType]]) – List of reference spectra to build the ANN index for.
embeddings_path (Optional[Union[str, Path]]) – If embeddings are already computed, provide the path to the numpy file.
k (int, optional) – Number of nearest neighbors to use for the ANN index.
index_backend (str, optional) – Backend to use for ANN index. Currently only “pynndescent” is supported.
**index_kwargs – Additional keyword arguments passed to the index constructor.
- Returns:
The constructed ANN index.
- Return type:
Any
- Raises:
ImportError – If pynndescent is not installed.
ValueError – If an unsupported index_backend is specified.
- abstractmethod compute_embeddings(spectra: Iterable[Spectrum]) ndarray[source]
Compute embeddings for a list of spectra.
- Parameters:
spectra – List of spectra to compute embeddings for.
- Returns:
Embeddings for the spectra. Shape: (n_spectra, n_embedding_features).
- Return type:
np.ndarray
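A subclass is expected to return one row per spectrum. As a rough sketch of that output contract, the function below bins peaks into a fixed-length vector. It is an assumption-laden stand-in: spectra are represented as plain (mz, intensities) pairs rather than Spectrum objects, and the function name, mz_max, and n_bins are hypothetical.

```python
import numpy as np


def binned_embeddings(spectra, mz_max=1000.0, n_bins=100):
    """Turn (mz, intensities) pairs into an (n_spectra, n_bins) array.

    Hypothetical stand-in for a compute_embeddings() implementation.
    """
    bin_width = mz_max / n_bins
    embeddings = np.zeros((len(spectra), n_bins), dtype=np.float64)
    for row, (mz, intensities) in enumerate(spectra):
        mz = np.asarray(mz, dtype=float)
        intensities = np.asarray(intensities, dtype=float)
        # Drop peaks outside the binned m/z range.
        keep = (mz >= 0) & (mz < mz_max)
        bins = (mz[keep] / bin_width).astype(int)
        # Sum the intensities of peaks that fall into the same bin.
        np.add.at(embeddings[row], bins, intensities[keep])
    return embeddings


toy_spectra = [
    ([100.0, 105.0, 500.0], [1.0, 0.5, 0.2]),
    ([250.0], [1.0]),
]
toy_embeddings = binned_embeddings(toy_spectra)  # shape (2, 100)
```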
- compute_similarity_matrix_from_embeddings(embeddings_1: ndarray, embeddings_2: ndarray | None = None) ndarray[source]
Compute a raw NumPy similarity matrix from precomputed embeddings.
This helper keeps the old raw-array use case available without changing the public matrix() contract inherited from BaseSimilarity.
- get_anns(query_spectra: Iterable[Spectrum] | ndarray, k: int = 100) tuple[ndarray, ndarray][source]
Get approximate nearest neighbors for input spectra.
- Parameters:
query_spectra (Union[Iterable[SpectrumType], np.ndarray]) – Query spectra or their embeddings.
k (int, optional) – Number of nearest neighbors to return.
- Returns:
Neighbor indices and similarity scores.
- Return type:
Tuple[np.ndarray, np.ndarray]
- Raises:
ValueError – If no index has been built, or if k is larger than the k used to build the index.
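The return value is a pair of arrays: neighbor indices and the matching scores, one row per query. The sketch below reproduces that shape with an exact brute-force search in NumPy instead of an ANN index, so it is a substitute technique rather than what get_anns() does internally. The function name and its ValueError condition are illustrative assumptions.

```python
import numpy as np


def exact_top_k(reference_embeddings, query_embeddings, k):
    """Exact cosine top-k search (hypothetical stand-in for an ANN query).

    Returns (indices, scores), each of shape (n_queries, k).
    """
    n_refs = reference_embeddings.shape[0]
    if k > n_refs:
        raise ValueError(f"k={k} exceeds the number of references ({n_refs}).")
    ref = reference_embeddings / np.linalg.norm(reference_embeddings, axis=1, keepdims=True)
    qry = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True)
    similarities = qry @ ref.T  # shape (n_queries, n_refs)
    # Sort each row by descending similarity and keep the first k columns.
    indices = np.argsort(-similarities, axis=1)[:, :k]
    scores = np.take_along_axis(similarities, indices, axis=1)
    return indices, scores


reference_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_embeddings = np.array([[2.0, 0.1]])
neighbor_indices, neighbor_scores = exact_top_k(reference_embeddings, query_embeddings, k=2)
```

An exact search scales with the number of references per query, which is the cost an ANN index is meant to avoid on large libraries.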
- get_embeddings(spectra: Iterable[Spectrum] | None = None, npy_path: str | Path | None = None) ndarray[source]
Get embeddings either by computing them or loading from disk.
- Parameters:
spectra – List of spectra to compute embeddings for.
npy_path – Path of the numpy file used to cache embeddings. If provided and the file exists, embeddings are loaded from it; otherwise they are computed and saved to this path.
- Returns:
Embeddings array.
- Return type:
np.ndarray
- Raises:
ValueError – If neither spectra nor npy_path is provided.
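The load-or-compute behavior described above can be sketched as a small caching helper. This is an illustration under stated assumptions, not the method's source: load_or_compute and fake_compute are hypothetical names, and raising an error when the file is missing and no inputs are given is a guess at a case this page does not document.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_or_compute(compute, items=None, npy_path=None):
    """Load embeddings from npy_path if it exists, else compute and save them."""
    if items is None and npy_path is None:
        raise ValueError("Provide items, npy_path, or both.")
    if npy_path is not None and Path(npy_path).exists():
        return np.load(npy_path)
    if items is None:
        # Assumption: nothing to load and nothing to compute from.
        raise ValueError(f"{npy_path} does not exist and no items were given.")
    embeddings = compute(items)
    if npy_path is not None:
        np.save(npy_path, embeddings)
    return embeddings


def fake_compute(items):
    """Hypothetical embedding function returning a (len(items), 2) array."""
    return np.array([[float(i), 2.0 * float(i)] for i in items])


with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "embeddings.npy"
    first = load_or_compute(fake_compute, items=[1, 2, 3], npy_path=cache_path)
    # The second call passes no items and reads the cached file instead.
    second = load_or_compute(fake_compute, npy_path=cache_path)
```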
- get_index_anns() tuple[ndarray, ndarray][source]
Get nearest neighbors for all points in the index.
- Returns:
Neighbor indices and similarity scores.
- Return type:
Tuple[np.ndarray, np.ndarray]
- Raises:
ValueError – If an unsupported index backend is used.
- load_ann_index(path: str | Path) Any[source]
Load an ANN index from disk.
- Parameters:
path (Union[str, Path]) – Path to load the index from.
- Returns:
The loaded ANN index.
- Return type:
Any
- Raises:
ValueError – If the similarity metric of the loaded index does not match the current metric.
- static load_embeddings(npy_path: str | Path) ndarray[source]
Load embeddings from a numpy file.
- Parameters:
npy_path (Union[str, Path]) – Path to the numpy file.
- Returns:
Embeddings array.
- Return type:
np.ndarray
- Raises:
ValueError – If the loaded array is not 2D.
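The 2D requirement follows from the (n_spectra, n_embedding_features) layout of embeddings. A minimal sketch of such a check, using a hypothetical helper name and error message rather than the library's own code:

```python
import tempfile
from pathlib import Path

import numpy as np


def load_2d_embeddings(npy_path):
    """Load a .npy file and reject anything that is not a 2D array."""
    embeddings = np.load(npy_path)
    if embeddings.ndim != 2:
        raise ValueError(f"Expected a 2D array, got {embeddings.ndim}D.")
    return embeddings


with tempfile.TemporaryDirectory() as tmp:
    good_path = Path(tmp) / "good.npy"
    bad_path = Path(tmp) / "bad.npy"
    np.save(good_path, np.zeros((4, 8)))  # 4 spectra, 8 features
    np.save(bad_path, np.zeros(8))        # 1D, should be rejected
    loaded = load_2d_embeddings(good_path)
    try:
        load_2d_embeddings(bad_path)
        rejected = False
    except ValueError:
        rejected = True
```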
- matrix(spectra_1: Sequence[Spectrum], spectra_2: Sequence[Spectrum] | None = None, score_fields: Sequence[str] | None = None, progress_bar: bool = True) Scores[source]
Compute similarity matrix between spectra_1 and spectra_2.
- Parameters:
spectra_1 – First collection of spectra.
spectra_2 – Second collection of spectra. If None, compare spectra_1 against itself.
score_fields – Requested score fields. Embedding similarities expose only ("score",).
progress_bar – Included for API compatibility. Embeddings are computed in batch and this implementation currently does not display a progress bar.
- Returns:
Similarity matrix.
- Return type:
Scores
- Raises:
ValueError – If array_type is not “numpy” or is_symmetric is False.
- pair(spectrum_1: Spectrum, spectrum_2: Spectrum) float[source]
Compute similarity between a pair of spectra.
- Parameters:
spectrum_1 (SpectrumType) – Reference spectrum.
spectrum_2 (SpectrumType) – Query spectrum.
- save_ann_index(path: str | Path) None[source]
Save the ANN index to disk.
- Parameters:
path (Union[str, Path]) – Path to save the index to.
- Raises:
ValueError – If no index exists to save.
- score_datatype
alias of float64
- sparse_matrix(spectra_1, spectra_2=None, idx_row=None, idx_col=None, score_fields=None, score_filter=None, progress_bar: bool = True)
Sparse score computation is not available for this similarity.