matchms.SpectraCollection module

class matchms.SpectraCollection.SpectraCollection(spectra: list[Spectrum] | Generator[Spectrum, None, None], mz_precision=1e-06)[source]

Bases: object

Central collection object for matchms spectra datasets.

A SpectraCollection stores many spectra in a synchronized, table-like representation. It separates spectrum-level metadata from peak data while preserving a shared row order between both components.

This class synchronizes:

metadata: tabular data, kept internally as a pandas DataFrame
fragments: peak data, stored in a fragment backend (currently CSRFragmentCollection)

Rows correspond to spectra. Metadata row i and fragment row i always describe the same spectrum. Operations such as slicing, filtering, sorting, dropping, and deduplication are applied to both metadata and fragments so that this alignment is preserved.

Compared with a plain list[Spectrum], this representation is intended to support efficient collection-level operations, including metadata-based filtering, fragment-based filtering, m/z range slicing, sorting, hashing, and summary statistics.

Individual rows can still be accessed as regular Spectrum objects. These objects are reconstructed from the stored metadata row and the corresponding fragment row.

Notes

The fragment backend may use an internal representation that differs from the original input spectra. In particular, the default CSR backend stores fragments as a binned sparse matrix. Reconstructed spectra therefore contain m/z values derived from the backend representation, for example bin centers, rather than necessarily the exact original input m/z values.

The central invariant of this class is:

len(metadata) == len(fragments) == n_spectra

and for every row index i:

metadata.iloc[i] corresponds to fragments.get_row(i).

Direct modifications of internal metadata or fragment storage should be avoided. Use collection-level methods such as filter, sort, drop, and add_metadata to preserve row alignment and invalidate cached values correctly.
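The row-alignment invariant can be illustrated with a small, library-independent sketch. ToyCollection below is a hypothetical stand-in written in plain Python; it is not the matchms implementation and does not use pandas or the CSR backend. It only shows the idea that one set of row indices is applied to both components, so metadata row i and fragment row i stay paired.

```python
from dataclasses import dataclass


@dataclass
class ToyCollection:
    """Hypothetical stand-in for a collection with row-aligned storage."""

    metadata: list   # one dict per spectrum
    fragments: list  # one list of (m/z, intensity) tuples per spectrum

    def __post_init__(self):
        # Enforce len(metadata) == len(fragments) == n_spectra.
        if len(self.metadata) != len(self.fragments):
            raise ValueError("metadata and fragments must have the same number of rows")

    def __len__(self):
        return len(self.metadata)

    def filter(self, mask):
        """Return a new collection keeping only the rows where mask is True."""
        if len(mask) != len(self):
            raise ValueError("mask length must match the number of spectra")
        keep = [i for i, flag in enumerate(mask) if flag]
        # The same row indices are applied to both components.
        return ToyCollection(
            metadata=[self.metadata[i] for i in keep],
            fragments=[self.fragments[i] for i in keep],
        )


toy = ToyCollection(
    metadata=[
        {"id": "a", "ms_level": 1},
        {"id": "b", "ms_level": 2},
        {"id": "c", "ms_level": 2},
    ],
    fragments=[
        [(50.0, 1.0)],
        [(60.0, 2.0), (70.0, 3.0)],
        [(80.0, 4.0)],
    ],
)

# Keep only MS2 rows; metadata and fragments are filtered together.
ms2 = toy.filter([row["ms_level"] == 2 for row in toy.metadata])
```

Editing toy.metadata or toy.fragments directly would bypass the length check, which is the kind of misalignment the collection-level methods are meant to prevent.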

__init__(spectra: list[Spectrum] | Generator[Spectrum, None, None], mz_precision=1e-06)[source]

apply_to_metadata_rows(func, *args, row_mask=None, inplace: bool = False, drop_missing_updates: bool = True, **kwargs)[source]

Apply a metadata function to selected rows and merge the result back.

This is a convenience wrapper around self.metadata.apply_to_rows. It only modifies metadata and does not change fragments.

bin_to_mz(bin_idx: ndarray | int) → ndarray[source]

Convert bin indices to mz values.

Uses the mz_precision of SpectraCollection and calculates the mz value at the center of the bin.

Parameters: bin_idx – Bin indices/columns to convert.
Returns: The mz values at the center of the specified bins.
Return type: np.ndarray

describe() → DataFrame[source]

Generate descriptive statistics for the spectra collection.

Calculates key metrics for each spectrum in the collection, including peak counts, total ion intensity, average m/z, and Shannon entropy based on peak intensities. It then computes summary statistics (count, mean, std, min, max, etc.) of these metrics over the entire collection.

Returns:

A DataFrame containing summary statistics for the following columns:

‘peak_counts’: Number of detected peaks per spectrum.
‘intensity_sums’: Total ion current (TIC) per spectrum.
‘intensity_entropy’: Shannon entropy of peak intensities, quantifying the spectral complexity/information density.

Return type:

pd.DataFrame
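As a rough illustration of the per-spectrum metrics, the sketch below computes a peak count, an intensity sum, and a Shannon entropy for one list of intensities. The function name spectrum_metrics is hypothetical, and the entropy convention (intensities normalized to sum to 1, natural logarithm) is an assumption; the source does not state which logarithm base or normalization describe() uses.

```python
import math


def spectrum_metrics(intensities):
    """Per-spectrum metrics for a list of peak intensities (illustrative only)."""
    total = sum(intensities)
    if total > 0:
        # Normalize intensities to a probability distribution, skipping zeros.
        probs = [value / total for value in intensities if value > 0]
        # Assumed convention: Shannon entropy with the natural logarithm.
        entropy = sum(-p * math.log(p) for p in probs)
    else:
        entropy = 0.0
    return {
        "peak_counts": len(intensities),
        "intensity_sums": total,
        "intensity_entropy": entropy,
    }


# Four equally intense peaks: the entropy under this convention is ln(4).
flat = spectrum_metrics([1.0, 1.0, 1.0, 1.0])

# A single peak carries no uncertainty, so the entropy is zero.
single = spectrum_metrics([5.0])
```

Under this convention, a spectrum dominated by one peak has entropy near zero, while many peaks of similar intensity give a higher value.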

drop(indices: list[int] | ndarray, inplace: bool = False)[source]

Removes specified rows (spectra) from both fragments and metadata.

Parameters:

indices (list[int] | np.ndarray) – Indices of the rows to remove.
inplace (bool) – If True, modify this collection in place; if False, return a new SpectraCollection. Defaults to False.

drop_duplicates(inplace: bool = False)[source]

Drops duplicates by spectra hashes.

Parameters:

inplace (bool) – If True, modify this collection in place; if False, return a new SpectraCollection. Defaults to False.
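Hash-based deduplication can be sketched as follows. Both helper functions are hypothetical: the source does not say what goes into a spectrum hash or which of several duplicates is retained, so this example assumes a SHA-256 digest over rounded (m/z, intensity) pairs and keeps the first occurrence of each hash.

```python
import hashlib


def peak_hash(peaks):
    """Hash (m/z, intensity) pairs. Hypothetical; not the matchms spectrum hash."""
    text = ";".join(f"{mz:.6f}:{intensity:.6f}" for mz, intensity in peaks)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def first_occurrence_indices(all_peaks):
    """Return the row indices to keep: the first row seen for each distinct hash."""
    seen = set()
    keep = []
    for i, peaks in enumerate(all_peaks):
        digest = peak_hash(peaks)
        if digest not in seen:
            seen.add(digest)
            keep.append(i)
    return keep


rows = [
    [(100.0, 1.0), (150.5, 0.3)],
    [(200.0, 0.8)],
    [(100.0, 1.0), (150.5, 0.3)],  # same peaks as row 0
]
kept = first_occurrence_indices(rows)
```

The resulting index list plays the role of a row selection that would be applied to metadata and fragments alike.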

drop_empty_spectra(inplace: bool = False)[source]

Removes spectra without peaks.

Parameters:

inplace (bool) – If True, modify this collection in place; if False, return a new SpectraCollection. Defaults to False.

drop_metadata(columns: str | list[str], inplace: bool = False, errors: str = 'raise')[source]

Remove one or more metadata columns.

Spectrum fragments and the number/order of spectra are left unchanged.

Parameters:

columns – Metadata column name or list of column names to remove.
inplace – If True, modify this collection and return None. If False, return a new collection with the selected metadata columns removed.
errors – Error handling passed to pandas.DataFrame.drop(). Use "raise" (default) to raise a KeyError for missing columns or "ignore" to silently skip them.

Returns:

A new collection if inplace=False; otherwise None.

Return type:

SpectraCollection or None

filter(mask: ndarray | Series | list[bool], inplace: bool = False)[source]

Filters SpectraCollection by keeping only the spectra where the mask is True.

This method synchronizes the filtering of both fragments and metadata. It uses boolean indexing from NumPy and Pandas.

Parameters:

mask (np.ndarray | pd.Series | list[bool]) – Boolean mask of the same length as the collection. Rows where the mask is True will be kept; all others will be removed.
inplace (bool) – If True, modifies this collection and returns None. If False (default), returns a new filtered SpectraCollection instance.

Returns:

A new filtered instance if inplace is False, otherwise None.

Return type:

SpectraCollection | None

Raises:

ValueError: If the length of the mask does not match the number of spectra in the collection.

Example:

>>> # Filter by metadata
>>> filtered_coll = coll.filter(coll.metadata["ms_level"] == 2)
>>>
>>> # Filter by fragment properties
>>> coll.filter(coll.fragments.sum() > 500, inplace=True)
>>>
>>> # Using an external vectorized filter function
>>> mask = filter_min_peaks(coll, n_required=10)
>>> coll.filter(mask, inplace=True)

harmonize_metadata_columns(inplace: bool = False)[source]

Harmonize metadata column names to matchms key style.

mz_to_bin(mz: ndarray | float) → ndarray[source]

Convert mz values into bins.

Uses the mz_precision of SpectraCollection and maps mz values into integer bins by flooring them.

Parameters: mz – The mz values to bin.
Returns: Bin indices as np.int64.
Return type: np.ndarray
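The relationship between mz_to_bin and bin_to_mz can be sketched in plain Python. The formulas below, floor(mz / precision) for the bin index and (bin + 0.5) * precision for the bin center, are an assumption based on the descriptions "flooring" and "center of the bin"; the exact arithmetic used by the library is not given in the source. The function names and the coarse precision of 0.5 are chosen only to keep the numbers readable (the documented default is 1e-06).

```python
import math

MZ_PRECISION = 0.5  # deliberately coarse for readability; documented default is 1e-06


def mz_to_bin_sketch(mz, precision=MZ_PRECISION):
    """Assumed mapping: floor the m/z value into an integer bin index."""
    return math.floor(mz / precision)


def bin_to_mz_sketch(bin_idx, precision=MZ_PRECISION):
    """Assumed inverse: the m/z value at the center of the bin."""
    return (bin_idx + 0.5) * precision


original = 100.3
idx = mz_to_bin_sketch(original)   # bin covering [100.0, 100.5)
center = bin_to_mz_sketch(idx)     # center of that bin

# The round trip does not return the input exactly; under these formulas the
# difference is bounded by half the bin width.
error = abs(center - original)
```

This is why spectra reconstructed from a binned backend may report bin centers rather than the exact input m/z values.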

sort(by: str | list[str], on: str = 'metadata', inplace: bool = False, **kwargs)[source]

Sorts SpectraCollection (fragments AND metadata) by either metadata keyword(s) or fragment function.

Parameters:

by (str | list[str]) – Either metadata column name(s) or a method name in FragmentsProxy (e.g., ‘sum’).
on (str) – ‘metadata’ (default) or ‘fragments’.
inplace (bool) – If True, sort this collection in place; if False, return a new, sorted SpectraCollection. Defaults to False.
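Sorting both components together amounts to computing one row permutation and applying it to metadata and fragments. The sketch below uses plain lists and a hypothetical helper sort_order; it does not call matchms, and the metadata key precursor_mz is used only as an example column.

```python
def sort_order(values, reverse=False):
    """Return the row permutation that sorts values (stable sort)."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=reverse)


metadata = [
    {"id": "a", "precursor_mz": 300.1},
    {"id": "b", "precursor_mz": 120.4},
    {"id": "c", "precursor_mz": 210.0},
]
fragments = [
    [(50.0, 1.0)],
    [(60.0, 2.0)],
    [(70.0, 3.0)],
]

# Sorting "on metadata": the key is a metadata column.
order = sort_order([row["precursor_mz"] for row in metadata])
sorted_metadata = [metadata[i] for i in order]
sorted_fragments = [fragments[i] for i in order]

# Sorting "on fragments": the key is a per-row fragment statistic,
# here the summed intensity, in descending order.
sums = [sum(intensity for _, intensity in peaks) for peaks in fragments]
order_by_sum = sort_order(sums, reverse=True)
```

Because the same permutation indexes both lists, each metadata row still sits next to its own peaks after sorting.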

to_json(file: str, export_style: str = 'matchms', append: bool = False) → None[source]

Export the spectra collection to a JSON file.

Parameters:

file – Path to the output file.
export_style – Metadata key style used during export. One of "matchms", "massbank", "nist", "riken", or "gnps". Default is "matchms".
append – JSON export does not support appending. If True, a ValueError is raised.

to_mgf(file: str, export_style: str = 'matchms', append: bool = False) → None[source]

Export the spectra collection to an MGF file.

Parameters:

file – Path to the output file.
export_style – Metadata key style used during export. One of "matchms", "massbank", "nist", "riken", or "gnps". Default is "matchms".
append – If True, append to an existing file. Default is False.

to_msp(file: str, export_style: str = 'matchms', append: bool = False) → None[source]

Export the spectra collection to an MSP file.

Parameters:

file – Path to the output file.
export_style – Metadata key style used during export. One of "matchms", "massbank", "nist", "riken", or "gnps". Default is "matchms".
append – If True, append to an existing file. Default is False.