matchms.similarity.BinnedEmbeddingSimilarity module
- class matchms.similarity.BinnedEmbeddingSimilarity.BinnedEmbeddingSimilarity(similarity: str = 'cosine', max_mz: float = 1005, bin_width: float = 1, intensity_power: float = 1)[source]
Bases: BaseEmbeddingSimilarity
A similarity measure that bins spectra into a fixed number of bins and uses the binned intensities as embedding features. By default, the similarity between spectra is computed as the cosine similarity between their binned representations.
- Parameters:
similarity (str, optional) – The similarity measure to use for comparing embeddings. Default is “cosine”. Options are “cosine” or “euclidean”.
max_mz (float, optional) – The maximum m/z value to consider when binning. Default is 1005.
bin_width (float, optional) – The width of each bin in m/z units. Default is 1.
intensity_power (float, optional) – The power to which peak intensities are raised. Default is 1.
- __init__(similarity: str = 'cosine', max_mz: float = 1005, bin_width: float = 1, intensity_power: float = 1)[source]
- build_ann_index(reference_spectra: Iterable[Spectrum] | None = None, embeddings_path: str | Path | None = None, k: int = 100, index_backend: str = 'pynndescent', **index_kwargs) Any
Build an ANN index for the reference spectra.
- Parameters:
reference_spectra (Optional[Iterable[SpectrumType]]) – List of reference spectra to build the ANN index for.
embeddings_path (Optional[Union[str, Path]]) – If embeddings are already computed, provide the path to the numpy file.
k (int, optional) – Number of nearest neighbors to use for the ANN index.
index_backend (str, optional) – Backend to use for ANN index. Currently only “pynndescent” is supported.
**index_kwargs – Additional keyword arguments passed to the index constructor.
- Returns:
The constructed ANN index.
- Return type:
Any
- Raises:
ImportError – If pynndescent is not installed.
ValueError – If an unsupported index_backend is specified.
- compute_embeddings(spectra: Iterable[Spectrum]) ndarray[source]
Convert spectra into binned embeddings.
- Parameters:
spectra (Iterable[SpectrumType]) – The spectra to convert into embeddings.
- Returns:
Array of shape (n_spectra, n_bins) containing the binned embeddings.
- Return type:
np.ndarray
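The binning described above can be sketched with NumPy alone. This is a minimal illustration, not the matchms implementation: the helper name `bin_spectrum` is hypothetical, and the choices to sum intensities that fall into the same bin, to drop peaks at or above max_mz, and to use ceil(max_mz / bin_width) bins are assumptions made for the example.

```python
import numpy as np


def bin_spectrum(mz, intensities, max_mz=1005.0, bin_width=1.0, intensity_power=1.0):
    """Return a fixed-length vector of power-scaled intensities summed per m/z bin.

    Hypothetical helper for illustration; the library may bin differently.
    """
    mz = np.asarray(mz, dtype=float)
    scaled = np.asarray(intensities, dtype=float) ** intensity_power
    n_bins = int(np.ceil(max_mz / bin_width))
    # Assumption: peaks outside [0, max_mz) are dropped.
    keep = (mz >= 0) & (mz < max_mz)
    bin_idx = np.floor(mz[keep] / bin_width).astype(int)
    vector = np.zeros(n_bins)
    # np.add.at accumulates correctly when several peaks share a bin.
    np.add.at(vector, bin_idx, scaled[keep])
    return vector


# Two toy spectra given as (m/z values, intensities).
spectra = [
    ([100.2, 100.7, 250.0], [0.5, 0.25, 1.0]),
    ([100.4, 1200.0], [1.0, 0.8]),  # the peak at m/z 1200 lies beyond max_mz
]
embeddings = np.stack([bin_spectrum(mz, inten) for mz, inten in spectra])
print(embeddings.shape)  # (2, 1005)
```

With the default max_mz of 1005 and bin_width of 1, each spectrum becomes a 1005-dimensional vector, which matches the documented (n_spectra, n_bins) shape.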
- get_anns(query_spectra: Iterable[Spectrum] | ndarray, k: int = 100) Tuple[ndarray, ndarray]
Get approximate nearest neighbors for query spectra.
- Parameters:
query_spectra (Union[Iterable[SpectrumType], np.ndarray]) – Query spectra or their embeddings.
k (int, optional) – Number of nearest neighbors to return.
- Returns:
Neighbor indices and similarity scores.
- Return type:
Tuple[np.ndarray, np.ndarray]
- Raises:
ValueError – If no index has been built, or if k is larger than the k used to build the index.
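To make the shape of the returned pair (neighbor indices, similarity scores) concrete, the sketch below runs an exact brute-force k-nearest-neighbor search by cosine similarity. It is not an approximate search and does not use pynndescent; the function name `exact_top_k` is hypothetical, and the sketch assumes every embedding has a non-zero norm.

```python
import numpy as np


def exact_top_k(reference_embeddings, query_embeddings, k):
    """Return (indices, scores) of the k most cosine-similar references per query.

    Hypothetical brute-force helper; assumes no all-zero embedding rows.
    """
    ref = np.asarray(reference_embeddings, dtype=float)
    qry = np.asarray(query_embeddings, dtype=float)
    if k > ref.shape[0]:
        raise ValueError("k cannot exceed the number of reference embeddings")
    # Normalize rows so that a dot product equals the cosine similarity.
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    qry = qry / np.linalg.norm(qry, axis=1, keepdims=True)
    sims = qry @ ref.T
    # Sort each row by descending similarity and keep the first k columns.
    indices = np.argsort(-sims, axis=1)[:, :k]
    scores = np.take_along_axis(sims, indices, axis=1)
    return indices, scores


references = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[2.0, 0.1]]
indices, scores = exact_top_k(references, queries, k=2)
print(indices.shape, scores.shape)
```

Both arrays have one row per query and k columns, with neighbors ordered from most to least similar.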
- get_embeddings(spectra: Iterable[Spectrum] | None = None, npy_path: str | Path | None = None) ndarray
Get embeddings either by computing them or loading from disk.
- Parameters:
spectra – List of spectra to compute embeddings for.
npy_path – Path used to load or save the embeddings. If provided and the file exists, embeddings are loaded from it; otherwise they are computed and saved to this path.
- Returns:
Embeddings array.
- Return type:
np.ndarray
- Raises:
ValueError – If neither spectra nor npy_path is provided.
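The load-or-compute behavior described for npy_path follows a common caching pattern, sketched below with NumPy. The helper `load_or_compute` and its `compute` callback are hypothetical names for this illustration and are not part of matchms; the 2D check mirrors the error documented for load_embeddings.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_or_compute(npy_path, compute):
    """Load a 2D array from npy_path if it exists; otherwise compute and save it.

    Hypothetical helper illustrating the caching pattern.
    """
    npy_path = Path(npy_path)
    if npy_path.exists():
        embeddings = np.load(npy_path)
        if embeddings.ndim != 2:
            raise ValueError("Expected a 2D embeddings array")
        return embeddings
    embeddings = np.asarray(compute())
    np.save(npy_path, embeddings)
    return embeddings


calls = []


def compute():
    # Record each call so the caching effect is observable.
    calls.append(1)
    return np.arange(6, dtype=float).reshape(2, 3)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embeddings.npy"
    first = load_or_compute(path, compute)   # computes and writes the file
    second = load_or_compute(path, compute)  # reads the file, skips compute

print(len(calls))  # 1
```

The second call returns the same array without recomputing it, which is the reason for passing a path when the reference set is large.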
- get_index_anns() Tuple[ndarray, ndarray]
Get nearest neighbors for all points in the index.
- Returns:
Neighbor indices and similarity scores.
- Return type:
Tuple[np.ndarray, np.ndarray]
- Raises:
ValueError – If an unsupported index backend is used.
- keep_score(score)
In the .matrix method, scores are collected sparsely. Override this method if values other than False or 0 should also be left out of the final collection.
- load_ann_index(path: str | Path) Any
Load an ANN index from disk.
- Parameters:
path (Union[str, Path]) – Path to load the index from.
- Returns:
The loaded ANN index.
- Return type:
Any
- Raises:
ValueError – If the similarity metric of the loaded index does not match the current metric.
- static load_embeddings(npy_path: str | Path) ndarray
Load embeddings from a numpy file.
- Parameters:
npy_path (Union[str, Path]) – Path to the numpy file.
- Returns:
Embeddings array.
- Return type:
np.ndarray
- Raises:
ValueError – If loaded array is not 2D.
- matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = True) ndarray
Compute similarity matrix between reference and query spectra.
- Parameters:
references – List of reference spectra.
queries – List of query spectra.
array_type – Type of array to return. Must be “numpy”.
is_symmetric – Whether the matrix is symmetric. Must be True.
- Returns:
Similarity matrix.
- Return type:
np.ndarray
- Raises:
ValueError – If array_type is not “numpy” or is_symmetric is False.
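For the default "cosine" setting, a similarity matrix between two sets of embeddings can be computed as a product of row-normalized arrays. The sketch below is an illustration under assumptions, not the library's code: the function name `cosine_similarity_matrix` is hypothetical, and assigning a similarity of 0 to all-zero embeddings is a choice made here to avoid division by zero.

```python
import numpy as np


def cosine_similarity_matrix(ref_embeddings, query_embeddings):
    """Return an (n_references, n_queries) array of cosine similarities.

    Hypothetical helper; all-zero rows are assumed to score 0 against everything.
    """
    ref = np.asarray(ref_embeddings, dtype=float)
    qry = np.asarray(query_embeddings, dtype=float)
    ref_norm = np.linalg.norm(ref, axis=1, keepdims=True)
    qry_norm = np.linalg.norm(qry, axis=1, keepdims=True)
    # Divide only where the norm is non-zero; other rows stay at zero.
    ref_unit = np.divide(ref, ref_norm, out=np.zeros_like(ref), where=ref_norm > 0)
    qry_unit = np.divide(qry, qry_norm, out=np.zeros_like(qry), where=qry_norm > 0)
    return ref_unit @ qry_unit.T


references = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
queries = [[3.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
scores = cosine_similarity_matrix(references, queries)
print(scores.shape)  # (2, 3)
```

Rows correspond to references and columns to queries, so scores[i, j] compares reference i with query j.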
- pair(reference: Spectrum, query: Spectrum) float
Compute similarity between a pair of spectra.
- Parameters:
reference (SpectrumType) – Reference spectrum.
query (SpectrumType) – Query spectrum.
- Returns:
Similarity score between the spectra.
- Return type:
float
- save_ann_index(path: str | Path) None
Save the ANN index to disk.
- Parameters:
path (Union[str, Path]) – Path to save the index to.
- Raises:
ValueError – If no index exists to save.
- score_datatype
alias of float64
- sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)
Optional: Provide an optimized method to calculate a sparse matrix of similarity scores.
Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If this method is not overridden, a naive implementation (a for-loop over the index pairs) is used.
- Parameters:
references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (for instance, in an all-vs-all comparison). Because score[i,j] = score[j,i], the calculation is then about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
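The naive for-loop mentioned above can be sketched as follows. This is an illustration of the idea only: the function `naive_sparse_scores` and its `pair` argument are hypothetical, plain numbers stand in for spectra, and the is_symmetric shortcut and the progress bar are left out.

```python
def naive_sparse_scores(references, queries, idx_row, idx_col, pair):
    """Score only the (reference, query) pairs selected by idx_row and idx_col.

    Hypothetical helper; `pair` is any function that scores two objects.
    """
    if len(idx_row) != len(idx_col):
        raise ValueError("idx_row and idx_col must have the same length")
    scores = []
    for row, col in zip(idx_row, idx_col):
        # One score per requested pair, in the order the indices are given.
        scores.append(pair(references[row], queries[col]))
    return scores


# Placeholder objects and a placeholder scoring function.
references = [1.0, 2.0, 3.0]
queries = [10.0, 20.0]
scores = naive_sparse_scores(
    references,
    queries,
    idx_row=[0, 2, 1],
    idx_col=[1, 0, 1],
    pair=lambda reference, query: reference * query,
)
print(scores)  # [20.0, 30.0, 40.0]
```

Only three of the six possible pairs are scored here, which is the point of passing explicit index lists instead of computing the full matrix.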