matchms.similarity.BinnedEmbeddingSimilarity module

class matchms.similarity.BinnedEmbeddingSimilarity.BinnedEmbeddingSimilarity(similarity: str = 'cosine', max_mz: float = 1005, bin_width: float = 1, intensity_power: float = 1)[source]

Bases: BaseEmbeddingSimilarity

A similarity measure that converts each spectrum into a fixed-length vector by binning its peaks along the m/z axis, and uses the binned intensities as embedding features. By default, the similarity between spectra is computed as the cosine similarity between their binned representations.

Parameters:

similarity (str, optional) – The similarity measure to use for comparing embeddings. Default is “cosine”. Options are “cosine” or “euclidean”.
max_mz (float, optional) – The maximum m/z value to consider when binning. Default is 1005.
bin_width (float, optional) – The width of each bin in m/z units. Default is 1.
intensity_power (float, optional) – The power to which peak intensities are raised. Default is 1.
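
To make the two similarity options concrete, here is a minimal sketch in plain Python of cosine similarity and Euclidean distance between two binned vectors. It is not the matchms implementation: the helper names are hypothetical, returning 0.0 for an all-zero vector is an assumption, and how the library turns a Euclidean distance into a similarity score is not stated here, so only the raw distance is shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors.

    Assumption: returns 0.0 when either vector is all zeros.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Plain Euclidean distance (a distance, not a similarity score)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy binned vectors: b is a scaled copy of a, c shares no bins with a.
a = [0.0, 3.0, 4.0]
b = [0.0, 6.0, 8.0]
c = [5.0, 0.0, 0.0]

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
print(euclidean_distance(a, b))
```

Cosine similarity ignores overall intensity scale, which is why `a` and `b` count as identical here even though their Euclidean distance is non-zero.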

__init__(similarity: str = 'cosine', max_mz: float = 1005, bin_width: float = 1, intensity_power: float = 1)[source]

build_ann_index(reference_spectra: Iterable[Spectrum] | None = None, embeddings_path: str | Path | None = None, k: int = 100, index_backend: str = 'pynndescent', **index_kwargs) → Any

Build an ANN index for the reference spectra.

Parameters:

reference_spectra (Optional[Iterable[SpectrumType]]) – List of reference spectra to build the ANN index for.
embeddings_path (Optional[Union[str, Path]]) – If embeddings are already computed, provide the path to the numpy file.
k (int, optional) – Number of nearest neighbors to use for the ANN index.
index_backend (str, optional) – Backend to use for ANN index. Currently only “pynndescent” is supported.
**index_kwargs – Additional keyword arguments passed to the index constructor.

Returns:

The constructed ANN index.

Return type:

Any

Raises:

ImportError – If pynndescent is not installed.
ValueError – If an unsupported index_backend is specified.

compute_embeddings(spectra: Iterable[Spectrum]) → ndarray[source]

Convert spectra into binned embeddings.

Parameters:

spectra (Iterable[SpectrumType]) – The spectra to convert into embeddings.

Returns:

Array of shape (n_spectra, n_bins) containing the binned embeddings.

Return type:

np.ndarray
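
The binning step can be pictured with the following sketch, written in plain Python to stay dependency-free. It is an illustration under stated assumptions rather than the library's code: bins are assumed to start at m/z 0, the number of bins is assumed to be ceil(max_mz / bin_width), peaks outside that range are dropped, and intensities that fall into the same bin are summed. The function names `bin_spectrum` and `bin_spectra` are hypothetical.

```python
import math


def bin_spectrum(mzs, intensities, max_mz=1005.0, bin_width=1.0, intensity_power=1.0):
    """Turn one peak list into a fixed-length vector of binned intensities."""
    n_bins = math.ceil(max_mz / bin_width)  # assumed bin count
    vector = [0.0] * n_bins
    for mz, intensity in zip(mzs, intensities):
        idx = int(mz // bin_width)  # bin i covers [i * bin_width, (i + 1) * bin_width)
        if 0 <= idx < n_bins:  # peaks outside the range are dropped (assumption)
            vector[idx] += intensity ** intensity_power  # summed per bin (assumption)
    return vector


def bin_spectra(peak_lists, **kwargs):
    """Bin several peak lists; the result has shape (n_spectra, n_bins) as nested lists."""
    return [bin_spectrum(mzs, ints, **kwargs) for mzs, ints in peak_lists]


# Each entry is (m/z values, intensities) for one toy spectrum.
peak_lists = [
    ([100.2, 100.7, 250.0], [0.5, 0.25, 1.0]),
    ([99.9, 2000.0], [1.0, 1.0]),
]
embeddings = bin_spectra(peak_lists)
print(len(embeddings), len(embeddings[0]))
```

With the default settings every spectrum maps to a vector of the same length, which is what allows the vectors to be stacked and compared directly.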

get_anns(query_spectra: Iterable[Spectrum] | ndarray, k: int = 100) → Tuple[ndarray, ndarray]

Get approximate nearest neighbors for query spectra.

Parameters:

query_spectra (Union[Iterable[SpectrumType], np.ndarray]) – Query spectra or their embeddings.
k (int, optional) – Number of nearest neighbors to return.

Returns:

Neighbor indices and similarity scores.

Return type:

Tuple[np.ndarray, np.ndarray]

Raises:

ValueError – If no index has been built, or if k is larger than the k used to build the index.
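
For intuition about the shape of the returned values, the sketch below performs an exact (brute-force) nearest-neighbor search over a handful of embeddings. This is deliberately not an approximate search and does not use pynndescent; it only shows the idea of returning, per query, the indices of the k most similar references together with their scores. The function names and the error message are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity; returns 0.0 for all-zero vectors (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def exact_neighbors(queries, references, k):
    """Return (indices, scores), each with one row of length k per query."""
    if k > len(references):
        raise ValueError("k cannot exceed the number of reference embeddings")
    all_indices, all_scores = [], []
    for query in queries:
        scores = [cosine_similarity(query, ref) for ref in references]
        # Rank reference positions from most to least similar, keep the top k.
        order = sorted(range(len(references)), key=lambda i: scores[i], reverse=True)[:k]
        all_indices.append(order)
        all_scores.append([scores[i] for i in order])
    return all_indices, all_scores


references = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[1.0, 0.1]]
indices, scores = exact_neighbors(queries, references, k=2)
print(indices, scores)
```

An approximate index trades some of this exactness for speed, which matters once the reference set is too large to compare against exhaustively.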

get_embeddings(spectra: Iterable[Spectrum] | None = None, npy_path: str | Path | None = None) → ndarray

Get embeddings either by computing them or loading from disk.

Parameters:

spectra – List of spectra to compute embeddings for.
npy_path – Path to load embeddings from or save them to. If provided, embeddings are loaded from this file when it exists; otherwise they are computed and saved to this path.

Returns:

Embeddings array.

Return type:

np.ndarray

Raises:

ValueError – If neither spectra nor npy_path is provided.
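
The load-or-compute behavior described above can be sketched as follows. This is a hedged illustration using NumPy's `np.save` and `np.load`, not the method's actual body: `load_or_compute` and `toy_compute` are hypothetical names, and raising an error when the file is missing and no spectra are given is an assumption about an edge case the description does not cover.

```python
from pathlib import Path
import tempfile

import numpy as np


def load_or_compute(compute, spectra=None, npy_path=None):
    """Load embeddings from npy_path if the file exists, else compute and save them."""
    if spectra is None and npy_path is None:
        raise ValueError("Provide spectra, npy_path, or both.")
    if npy_path is not None and Path(npy_path).exists():
        return np.load(npy_path)
    if spectra is None:
        # Assumption: nothing to load and nothing to compute from.
        raise ValueError("File not found and no spectra given.")
    embeddings = np.asarray(compute(spectra), dtype=float)
    if npy_path is not None:
        np.save(npy_path, embeddings)
    return embeddings


calls = []


def toy_compute(spectra):
    """Hypothetical embedding function: one feature per item (its length)."""
    calls.append(len(spectra))
    return [[float(len(s))] for s in spectra]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embeddings.npy"
    first = load_or_compute(toy_compute, spectra=["ab", "cde"], npy_path=path)
    second = load_or_compute(toy_compute, npy_path=path)  # served from disk

print(first.shape, len(calls))
```

The second call never reaches `toy_compute`, which is the point of passing a path: repeated runs reuse the stored array instead of recomputing it.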

get_index_anns() → Tuple[ndarray, ndarray]

Get nearest neighbors for all points in the index.

Returns:

Neighbor indices and similarity scores.

Return type:

Tuple[np.ndarray, np.ndarray]

Raises:

ValueError – If an unsupported index backend is used.

keep_score(score): In the .matrix method scores will be collected in a sparse way. Overwrite this method here if values other than False or 0 should not be stored in the final collection.

load_ann_index(path: str | Path) → Any

Load an ANN index from disk.

Parameters:

path (Union[str, Path]) – Path to load the index from.

Returns:

The loaded ANN index.

Return type:

Any

Raises:

ValueError – If the similarity metric of the loaded index does not match the current metric.

static load_embeddings(npy_path: str | Path) → ndarray

Load embeddings from a numpy file.

Parameters:

npy_path (Union[str, Path]) – Path to the numpy file.

Returns:

Embeddings array.

Return type:

np.ndarray

Raises:

ValueError – If the loaded array is not 2D.

matrix(references: List[Spectrum], queries: List[Spectrum], array_type: str = 'numpy', is_symmetric: bool = True) → ndarray

Compute similarity matrix between reference and query spectra.

Parameters:

references – List of reference spectra.
queries – List of query spectra.
array_type – Type of array to return. Must be “numpy”.
is_symmetric – Whether the matrix is symmetric. Must be True.

Returns:

Similarity matrix.

Return type:

np.ndarray

Raises:

ValueError – If array_type is not “numpy” or is_symmetric is False.

pair(reference: Spectrum, query: Spectrum) → float

Compute similarity between a pair of spectra.

Parameters:

reference (SpectrumType) – Reference spectrum.
query (SpectrumType) – Query spectrum.

Returns:

Similarity score between the spectra.

Return type:

float

save_ann_index(path: str | Path) → None

Save the ANN index to disk.

Parameters:

path (Union[str, Path]) – Path to save the index to.

Raises:

ValueError – If no index exists to save.

score_datatype: alias of float64

sparse_array(references: List[Spectrum], queries: List[Spectrum], idx_row, idx_col, is_symmetric: bool = False, progress_bar: bool = True)

Optional: provide an optimized method to calculate a sparse array of similarity scores.

Compute similarity scores for pairs of reference and query spectra as given by the indices idx_row (references) and idx_col (queries). If no method is added here, the following naive implementation (i.e. a for-loop) is used.

Parameters:

references – List of reference objects
queries – List of query objects
idx_row – List/array of row indices
idx_col – List/array of column indices
is_symmetric – Set to True when references and queries are identical (for instance, in an all-vs-all comparison). Because score[i,j] = score[j,i] in that case, the calculation will be about 2x faster.
progress_bar – When True a progress bar is shown. Default is True.
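
A naive for-loop of the kind mentioned above might look like the following sketch. It is not the library's fallback code: the function name is hypothetical, the result is a plain list and not the library's sparse structure, and reusing a cached score for mirrored index pairs is an assumption meant to illustrate where the roughly 2x saving for symmetric inputs can come from.

```python
def naive_sparse_scores(pair, references, queries, idx_row, idx_col, is_symmetric=False):
    """Score only the (row, col) pairs given by idx_row and idx_col."""
    if len(idx_row) != len(idx_col):
        raise ValueError("idx_row and idx_col must have the same length")
    cache = {}
    scores = []
    for row, col in zip(idx_row, idx_col):
        # For symmetric input, (i, j) and (j, i) share one cache entry.
        key = (min(row, col), max(row, col)) if is_symmetric else (row, col)
        if key not in cache:
            cache[key] = pair(references[row], queries[col])
        scores.append(cache[key])
    return scores


calls = []


def toy_pair(reference, query):
    """Hypothetical symmetric score on plain numbers, standing in for spectra."""
    calls.append((reference, query))
    return 1.0 / (1.0 + abs(reference - query))


items = [1.0, 2.0, 4.0]
scores = naive_sparse_scores(
    toy_pair, items, items, idx_row=[0, 1, 2], idx_col=[1, 0, 2], is_symmetric=True
)
print(scores, len(calls))
```

Three index pairs are requested, but the pair function runs only twice because (1, 0) reuses the score already computed for (0, 1).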

static store_embeddings(npy_path: str | Path, embs: ndarray) → None

Store embeddings in a numpy file.

Parameters:

npy_path (Union[str, Path]) – Path to save the embeddings to.
embs (np.ndarray) – Embeddings array to store.

to_dict() → dict: Return a dictionary representation of a similarity function.