matchms.similarity.BaseEmbeddingSimilarity module
- class matchms.similarity.BaseEmbeddingSimilarity.BaseEmbeddingSimilarity(similarity: str = 'cosine')
Bases: BaseSimilarity
Base class for similarity measures that work with embeddings.
This class provides functionality for computing similarities between spectra based on their embeddings (vector representations). It supports cosine and euclidean similarity metrics, and includes approximate nearest neighbor (ANN) search capabilities.
- Parameters:
similarity (str) – The similarity measure to use for comparing embeddings. Default is “cosine”. Options are “cosine” or “euclidean”.
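As an illustration of the two metrics named above, the following minimal sketch compares two embedding vectors with NumPy. The helper names are hypothetical, and the mapping from Euclidean distance to a similarity, 1 / (1 + d), is an assumption chosen for illustration; the text above does not state which conversion the class uses.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_similarity(a, b):
    """Assumed distance-to-similarity mapping: 1 / (1 + Euclidean distance)."""
    return float(1.0 / (1.0 + np.linalg.norm(a - b)))

emb_a = np.array([1.0, 0.0, 1.0])
emb_b = np.array([2.0, 0.0, 2.0])  # same direction as emb_a, twice the length

cos_ab = cosine_similarity(emb_a, emb_b)     # ignores vector length
euc_ab = euclidean_similarity(emb_a, emb_b)  # penalizes the length difference
```

Cosine similarity treats the two vectors as identical because they point the same way, while the Euclidean variant does not. That difference is usually what decides between the two options.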
- index_backend
The backend used for ANN indexing (currently only “pynndescent” is supported). Available once an index has been built.
- Type:
str
- index_kwargs
Additional arguments passed to the ANN index constructor. Available once an index has been built.
- Type:
dict
- build_ann_index(reference_spectra: Iterable[Spectrum] | None = None, embeddings_path: str | Path | None = None, k: int = 100, index_backend: str = 'pynndescent', **index_kwargs) -> Any
Build an ANN index for the reference spectra.
- Parameters:
reference_spectra (Optional[Iterable[SpectrumType]]) – List of reference spectra to build the ANN index for.
embeddings_path (Optional[Union[str, Path]]) – If embeddings are already computed, provide the path to the numpy file.
k (int, optional) – Number of nearest neighbors to use for the ANN index.
index_backend (str, optional) – Backend to use for ANN index. Currently only “pynndescent” is supported.
**index_kwargs – Additional keyword arguments passed to the index constructor.
- Returns:
The constructed ANN index.
- Return type:
Any
- Raises:
ImportError – If pynndescent is not installed.
ValueError – If an unsupported index_backend is specified.
- abstractmethod compute_embeddings(spectra: Iterable[Spectrum]) -> ndarray
Compute embeddings for a list of spectra.
- Parameters:
spectra – List of spectra to compute embeddings for.
- Returns:
Embeddings for the spectra. Shape: (n_spectra, n_embedding_features).
- Return type:
np.ndarray
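To make the expected return shape concrete, here is a toy embedding function that sums peak intensities into fixed-width m/z bins. It is a sketch only: the function name, the binning scheme, and the use of plain (m/z, intensity) array pairs instead of Spectrum objects are all assumptions made so the example runs with NumPy alone. A real subclass would implement compute_embeddings with its own model.

```python
import numpy as np

def toy_compute_embeddings(spectra, mz_max=1000.0, n_bins=10):
    """Hypothetical embedding: sum peak intensities into fixed-width m/z bins.

    Each item in `spectra` is an (mz_array, intensity_array) pair,
    not a matchms Spectrum.
    """
    embeddings = np.zeros((len(spectra), n_bins), dtype=np.float64)
    bin_width = mz_max / n_bins
    for row, (mz, intensities) in enumerate(spectra):
        # Map each peak to a bin; clip peaks at or above mz_max into the last bin.
        bins = np.minimum((np.asarray(mz) / bin_width).astype(int), n_bins - 1)
        # np.add.at accumulates correctly when several peaks share a bin.
        np.add.at(embeddings[row], bins, intensities)
    return embeddings

toy_spectra = [
    (np.array([100.0, 150.0, 950.0]), np.array([0.5, 1.0, 0.2])),
    (np.array([420.0]), np.array([1.0])),
]
toy_embeddings = toy_compute_embeddings(toy_spectra)  # shape (2, 10)
```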
- get_anns(query_spectra: Iterable[Spectrum] | ndarray, k: int = 100) -> Tuple[ndarray, ndarray]
Get approximate nearest neighbors for query spectra.
- Parameters:
query_spectra (Union[Iterable[SpectrumType], np.ndarray]) – Query spectra or their embeddings.
k (int, optional) – Number of nearest neighbors to return.
- Returns:
Neighbor indices and similarity scores.
- Return type:
Tuple[np.ndarray, np.ndarray]
- Raises:
ValueError – If no index has been built or k is larger than the k used to build the index.
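For intuition about the pair of arrays returned, the sketch below computes exact cosine nearest neighbors by brute force, which is the result an ANN index approximates. The function name and the (n_queries, k) output layout are assumptions for illustration; this is not the pynndescent-backed implementation.

```python
import numpy as np

def exact_neighbors(reference_embeddings, query_embeddings, k):
    """Exact cosine nearest neighbors; an ANN index approximates this result."""
    if k > len(reference_embeddings):
        raise ValueError("k exceeds the number of reference embeddings")
    ref = reference_embeddings / np.linalg.norm(reference_embeddings, axis=1, keepdims=True)
    qry = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True)
    scores = qry @ ref.T                        # shape (n_queries, n_references)
    order = np.argsort(-scores, axis=1)[:, :k]  # best match first
    return order, np.take_along_axis(scores, order, axis=1)

reference_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_embeddings = np.array([[2.0, 0.1]])
neighbor_idx, neighbor_scores = exact_neighbors(
    reference_embeddings, query_embeddings, k=2
)
```

Brute force costs one comparison per reference for every query, which is the cost an approximate index is meant to avoid on large reference libraries.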
- get_embeddings(spectra: Iterable[Spectrum] | None = None, npy_path: str | Path | None = None) -> ndarray
Get embeddings either by computing them or loading from disk.
- Parameters:
spectra – List of spectra to compute embeddings for.
npy_path – Path to load embeddings from or save them to. If provided and the file exists, embeddings are loaded from disk; otherwise they are computed and saved to this path.
- Returns:
Embeddings array.
- Return type:
np.ndarray
- Raises:
ValueError – If neither spectra nor npy_path is provided.
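The load-or-compute behavior described above can be sketched as a small caching helper. This is an illustration under stated assumptions, not the library's code: load_or_compute and fake_compute are hypothetical names, and raising when the file is missing and no items are given is a choice made here, since that case is not specified above.

```python
import tempfile
from pathlib import Path

import numpy as np

def load_or_compute(compute_fn, items=None, npy_path=None):
    """Return cached embeddings if npy_path exists, else compute and cache them."""
    if items is None and npy_path is None:
        raise ValueError("Provide items, npy_path, or both.")
    if npy_path is not None and Path(npy_path).exists():
        return np.load(npy_path)
    if items is None:
        # Assumption of this sketch: nothing to load and nothing to compute from.
        raise ValueError(f"{npy_path} does not exist and no items were given.")
    embeddings = compute_fn(items)
    if npy_path is not None:
        np.save(npy_path, embeddings)
    return embeddings

compute_calls = []

def fake_compute(items):
    """Stand-in for an embedding model; records each call."""
    compute_calls.append(len(items))
    return np.ones((len(items), 4))

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "embeddings.npy"
    first = load_or_compute(fake_compute, items=["a", "b"], npy_path=cache_file)
    second = load_or_compute(fake_compute, npy_path=cache_file)  # read from disk
```

The second call never reaches fake_compute, so an expensive embedding model runs once per reference library.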
- get_index_anns() -> Tuple[ndarray, ndarray]
Get nearest neighbors for all points in the index.
- Returns:
Neighbor indices and similarity scores.
- Return type:
Tuple[np.ndarray, np.ndarray]
- Raises:
ValueError – If an unsupported index backend is used.
- keep_score(score)
In the .matrix method, scores are collected sparsely. Override this method if values other than False or 0 should not be stored in the final collection.
- load_ann_index(path: str | Path) -> Any
Load an ANN index from disk.
- Parameters:
path (Union[str, Path]) – Path to load the index from.
- Returns:
The loaded ANN index.
- Return type:
Any
- Raises:
ValueError – If the similarity metric of the loaded index does not match the current metric.
- static load_embeddings(npy_path: str | Path) -> ndarray
Load embeddings from a numpy file.
- Parameters:
npy_path (Union[str, Path]) – Path to the numpy file.
- Returns:
Embeddings array.
- Return type:
np.ndarray
- Raises:
ValueError – If the loaded array is not 2D.
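The 2D requirement can be checked in a few lines. The sketch below uses an in-memory buffer instead of a file so it leaves nothing on disk; load_2d and roundtrip are hypothetical helper names, and the error message is illustrative.

```python
import io

import numpy as np

def load_2d(source):
    """Load a .npy payload and insist on shape (n_spectra, n_features)."""
    array = np.load(source)
    if array.ndim != 2:
        raise ValueError(f"Expected a 2D embeddings array, got {array.ndim}D.")
    return array

def roundtrip(array):
    """Save to an in-memory buffer, then load it back through load_2d."""
    buffer = io.BytesIO()
    np.save(buffer, array)
    buffer.seek(0)
    return load_2d(buffer)

ok = roundtrip(np.zeros((3, 5)))  # accepted: 3 spectra, 5 features
try:
    roundtrip(np.zeros(5))        # 1D array
    rejected = False
except ValueError:
    rejected = True
```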
- matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = True) -> ndarray
Compute similarity matrix between reference and query spectra.
- Parameters:
references – List of reference spectra.
queries – List of query spectra.
array_type – Type of array to return. Must be “numpy”.
is_symmetric – Whether the matrix is symmetric. Must be True.
- Returns:
Similarity matrix.
- Return type:
np.ndarray
- Raises:
ValueError – If array_type is not “numpy” or is_symmetric is False.
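As a rough picture of a dense similarity matrix built from embeddings, the sketch below computes an all-vs-all cosine matrix with NumPy. The function name and the (n_references, n_queries) orientation are assumptions of this example, not details taken from the method above.

```python
import numpy as np

def cosine_similarity_matrix(reference_embeddings, query_embeddings):
    """Cosine similarity of every reference row against every query row."""
    ref = reference_embeddings / np.linalg.norm(reference_embeddings, axis=1, keepdims=True)
    qry = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True)
    return ref @ qry.T  # assumed layout: (n_references, n_queries)

all_embeddings = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
similarity = cosine_similarity_matrix(all_embeddings, all_embeddings)
```

When the same embeddings serve as references and queries, the result is symmetric with ones on the diagonal.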
- pair(reference: Spectrum, query: Spectrum) -> float
Compute similarity between a pair of spectra.
- Parameters:
reference (SpectrumType) – Reference spectrum.
query (SpectrumType) – Query spectrum.
- Returns:
Similarity score between the spectra.
- Return type:
float
- save_ann_index(path: str | Path) -> None
Save the ANN index to disk.
- Parameters:
path (Union[str, Path]) – Path to save the index to.
- Raises:
ValueError – If no index exists to save.
- score_datatype
alias of float64
- sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)
Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.
Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, a naive implementation (i.e. a for-loop) is used.
- Parameters:
references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (for instance, in an all-vs-all comparison). Because score[i,j] = score[j,i], the calculation is then about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
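The naive for-loop fallback and the symmetric shortcut can be sketched in plain Python as follows. The names naive_sparse_scores and toy_pair are hypothetical, the items are plain numbers rather than spectra, and the dictionary output stands in for whatever sparse container the library actually returns.

```python
def naive_sparse_scores(references, queries, idx_row, idx_col, pair_fn,
                        is_symmetric=False):
    """Score only the requested (row, col) pairs, one pair_fn call at a time."""
    scores = {}
    for row, col in zip(idx_row, idx_col):
        if is_symmetric and (col, row) in scores:
            # score[i, j] == score[j, i], so reuse the mirrored entry.
            scores[(row, col)] = scores[(col, row)]
            continue
        scores[(row, col)] = pair_fn(references[row], queries[col])
    return scores

pair_calls = []

def toy_pair(reference, query):
    """Toy similarity on numbers; records each call to show the savings."""
    pair_calls.append((reference, query))
    return 1.0 / (1.0 + abs(reference - query))

items = [1.0, 2.0, 4.0]
idx_row = [0, 1, 0, 2]
idx_col = [1, 0, 2, 0]
sparse_scores = naive_sparse_scores(
    items, items, idx_row, idx_col, toy_pair, is_symmetric=True
)
```

Four pairs are requested but toy_pair runs only twice, which is where the roughly 2x speedup for symmetric input comes from.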