Water Quality
The wq submodule contains datasets that represent surface water chemistry at various locations worldwide. Currently, it includes 16 water quality datasets, but we anticipate this number will increase in the future. The spatial and temporal coverage of these datasets is detailed in the following table.
List of datasets
| Dataset | Class / Function Name | Variables Covered | Temporal Coverage | Spatial Coverage | Reference |
|---|---|---|---|---|---|
| Busan Beach | | 14 | 2018 - 2019 | Busan, S. Korea | |
| Buzzards Bay | | 16 | 1992 - 2018 | Buzzards Bay (USA) | |
| CamelsChem | | 28 | 1980 - 2018 | Continental USA | |
| CamelsCHChem | | 40 | 1980 - 2020 | Switzerland | |
| Surface Water Chemistry | | 24 | 1960 - 2022 | Global | |
| Global River Water Quality Archive | | 42 | 1898 - 2020 | Global | |
| water QUAlity, DIscharge and Catchment Attributes | | 10 | 1950 - 2018 | Germany | |
| river chemistry for US coasts | | 21 | 1850 - 2020 | USA | |
| Ecoli Mekong River | | 10 | 2011 - 2021 | Mekong river (Houay Pano) | |
| Ecoli Mekong River (Laos) | | 10 | 2011 - 2021 | Mekong River (Laos) | |
| Ecoli Houay Pano (Laos) | | 10 | 2011 - 2021 | Houay Pano (Laos) | |
| Global River Methane | | 1 | 1973 - 2021 | Global | |
| Oligotrend | | 17 | 1986 - 2022 | Global | |
| Sylt Roads | | 15 | 1973 - 2019 | Sylt, North Sea (Germany) | |
| San Francisco Bay | | 18 | 1969 - 2015 | San Francisco (USA) | |
| Selune River, France | | 5 | 2021 - 2022 | Selune River (France) | |
| Siberian Rivers Chemistry | | 30 | 1991 - 2012 | Siberian Rivers (Russia) | |
| White Clay Creek | | 2 | 1973 - 2019 | White Clay Creek (USA) | |
Functions and Classes
- class aqua_fetch.BuzzardsBay(path=None, **kwargs)
Bases: Datasets
Water quality measurements in Buzzards Bay from 1992 to 2018. For more details on the data, see Jakuba et al. The data is downloaded from the MBLWHOI Library.
Examples
>>> from aqua_fetch import BuzzardsBay
>>> ds = BuzzardsBay()
>>> doc = ds.doc()
>>> doc.shape
(11092, 4)
>>> chla = ds.chla()
>>> chla.shape
(1028, 10)
- __init__(path=None, **kwargs)
- Parameters:
name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping
- class aqua_fetch.CamelsChem(path=None, **kwargs)
Bases: Datasets
Water quality data from the USA following the work of Sterle et al., 2024. This dataset has 18 water chemistry parameters from 1980-01-01 to 2018-12-31. The data is downloaded from HydroShare. Out of 671 stations, 155 stations have no water quality data. The wet deposition data consists of 12 parameters from 1985 to 2018.
Examples
>>> from aqua_fetch import CamelsChem
>>> dataset = CamelsChem(path='/path/to/dataset')
>>> stns = dataset.stations()
>>> len(stns)
671
>>> stns[0:10]
['1591400', '6350000', ... '11274500', '7295000']
>>> len(dataset.parameters)
28
>>> dataset.parameters
['cl_mg/l', 'na_mg/l', ... 'doc_mg/l']
>>> # get longitude and latitude of stations
>>> coords = dataset.stn_coords()
>>> coords.shape
(115, 2)
>>> data = dataset.fetch_atm_dep()  # get atmospheric deposition data for all catchments
>>> type(data)  # the returned data is a dictionary with catchment names as keys
dict
>>> len(data)
671
>>> data = dataset.fetch_atm_dep(stations='1591400', parameters='cl')
>>> data['1591400'].shape
(34, 8)
>>> data = dataset.fetch_atm_dep(stations=['1591400', '6350000'], parameters=['cl', 'na'])
>>> data['1591400'].shape
(34, 16)
>>> data['6350000'].shape
(34, 16)
- __init__(path=None, **kwargs)
- Parameters:
name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping
- fetch(stations: str | List[str] = 'all', parameters: str | List[str] = 'all') -> Dict[str, DataFrame]
fetches the data for the given stations and parameters
- Parameters:
- Returns:
dictionary of dataframes for each station
- Return type:
Dict[str, pd.DataFrame]
Examples
>>> ds = CamelsChem(path='/path/to/data')
>>> data = ds.fetch(stations=['1591400', '6350000'], parameters=['cl_mg/l', 'na_mg/l'])
>>> data = ds.fetch('1591400', 'cl_mg/l')['1591400']
>>> data.shape  # (55, 1)
>>> # get all parameters for a station
>>> data = ds.fetch('1591400')['1591400']
>>> data.shape  # (55, 28)
>>> all_data = ds.fetch()  # get all parameters of all stations
>>> len(all_data)  # 516
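Because fetch returns a dictionary keyed by station id, a common next step is to stack the per-station frames into one table. The following is a minimal sketch that uses small synthetic frames in place of the real fetch output (the dates and values are made up); it relies only on pandas.concat and is not part of aqua_fetch.

```python
import pandas as pd

# Synthetic stand-in for the dictionary returned by ``fetch``;
# station ids are taken from the example above, values are made up.
idx = pd.date_range("2000-01-01", periods=3, freq="D")
data = {
    "1591400": pd.DataFrame({"cl_mg/l": [1.0, 2.0, 3.0]}, index=idx),
    "6350000": pd.DataFrame({"cl_mg/l": [4.0, 5.0, 6.0]}, index=idx),
}

# Stack the per-station frames; the dictionary keys become the outer index level.
stacked = pd.concat(data, names=["station", "time"])

# Select one station back out of the stacked frame.
one_station = stacked.loc["1591400"]
```

Keeping the station id as an index level makes it straightforward to group or filter by station later without losing track of where each row came from.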
- fetch_atm_dep(stations: str | List[str] = 'all', parameters: str | List[str] = 'all') -> Dict[str, DataFrame]
fetches the atmospheric deposition data for the given stations and parameters
- Parameters:
- Returns:
dictionary of dataframes for each station
- Return type:
Dict[str, pd.DataFrame]
Examples
>>> ds = CamelsChem(path='/mnt/datawaha/hyex/atr/data')
>>> # get data for a single station and a single parameter
>>> data = ds.fetch_atm_dep(stations='1591400', parameters='cl')
>>> print(data['1591400'].shape)  # (34, 8)
>>> # get data for multiple stations and multiple parameters
>>> data = ds.fetch_atm_dep(stations=['1591400', '6350000'], parameters=['cl', 'na'])
>>> print(data['1591400'].shape)  # (34, 16)
>>> print(data['6350000'].shape)  # (34, 16)
>>> # get data for all stations and for all parameters
>>> data = ds.fetch_atm_dep()
>>> print(len(data))  # 671
- class aqua_fetch.CamelsCHChem(path=None, **kwargs)
Bases: Datasets
Data of over 40 water quality parameters from 115 Swiss catchments following the work of Nascimento et al., 2025. The dataset is downloaded from Zenodo. The water quality parameters are available as (discontinuous) time series from 1980-01-01 to 2020-12-31.
Examples
>>> from aqua_fetch import CamelsCHChem
>>> dataset = CamelsCHChem(path='/path/to/data')
>>> stns = dataset.stations()
>>> len(stns)
115
>>> # find out names of stations
>>> stns[0:10]
['2009', '2011', '2016', '2018', ... '2044']
>>> # get longitude and latitude of stations
>>> coords = dataset.stn_coords()
>>> coords.shape
(115, 2)
>>> # get catchment-averaged parameters for catchment with the name/id 2009
>>> data = dataset.fetch_catch_avg('2009')
>>> type(data)  # the return data is a dictionary with catchment name as key
dict
>>> len(data)
1
>>> data.keys()
'2009'
>>> data['2009'].shape
(209, 32)
>>> # get data for three catchments
>>> data = dataset.fetch_catch_avg(['2009', '2011', '2018'])
>>> data.keys()
dict_keys(['2009', '2011', '2018'])
>>> [val.shape for val in data.values()]
[(209, 32), (209, 32), (209, 32)]
>>> data['2009'].columns.tolist()
['cereal', 'maize', 'sugarbeet', ... 'gve_ha', 'delta_2h']
>>> # find out start and end dates
>>> data['2009'].index[0], data['2009'].index[-1]
(Timestamp('1970-01-01'), Timestamp('2020-12-15'))
>>> # get water quality time series
>>> data = dataset.fetch_wq_ts(stations=['2009', '2011'])
>>> data['2009'].shape
(14610, 4)
>>> data['2011'].shape
(14610, 4)
>>> data['2011'].columns
Index(['temp_sensor', 'pH_sensor', 'ec_sensor', 'O2C_sensor'], dtype='object')
>>> data = dataset.fetch_wq_ts()
>>> len(data)
115
>>> data['2009'].index[0], data['2009'].index[-1]
(Timestamp('1981-01-01 00:00:00'), Timestamp('2020-12-31 00:00:00'))
>>> # get isotope data
>>> data = dataset.fetch_isotope(stations=['2009', '2016'])
>>> data['2009'].shape
(452, 4)
>>> data['2016'].shape
(450, 4)
>>> data['2009'].columns
Index(['date_start', 'date_end', 'delta_2h', 'delta_18o'], dtype='object')
- __init__(path=None, **kwargs)
- Parameters:
name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping
- fetch(stations: str | List[str] = 'all', parameters: str | List[str] = 'all') -> Dict[str, DataFrame]
fetches the data for the given stations and parameters
- Parameters:
- Returns:
dictionary of dataframes for each station
- Return type:
Dict[str, pd.DataFrame]
Examples
>>> ds = CamelsCHChem(path='/path/to/data')
>>> data = ds.fetch(stations=['2009', '2011'], parameters='swisscrops')
>>> print(data['2009'].shape)  # (209, 32)
>>> print(data['2011'].shape)  # (209, 32)
- fetch_catch_avg(stations: str | List[str] = 'all') -> Dict[str, DataFrame]
fetches the catchment-averaged data for the given stations. This covers agricultural, atmospheric deposition, landcover, livestock and rainwater isotope data for each catchment. The agricultural and atmospheric deposition (1990-2020), landcover and livestock data are yearly, but the rainwater isotope data has discontinuous timesteps.
- Parameters:
stations (Union[str, List[str]]) – list of stations to fetch data for
- Returns:
dictionary of dataframes for each station
- Return type:
Dict[str, pd.DataFrame]
Examples
>>> ds = CamelsCHChem(path='/path/to/data')
>>> data = ds.fetch_catch_avg(stations=['2009', '2011'])
>>> print(data['2009'].shape)  # (209, 32)
>>> print(data['2011'].shape)  # (209, 32)
- fetch_wq_ts(stations: str | List[str] = 'all', timestep: str = 'D') -> Dict[str, DataFrame]
fetches the water quality time series data for the given station(s) at daily (D) or hourly (H) timestep. This data consists of water temperature, pH, electrical conductivity and O2C parameters for the given station(s).
- Parameters:
- Returns:
dictionary of dataframes for each station
- Return type:
Dict[str, pd.DataFrame]
Examples
>>> ds = CamelsCHChem(path='/path/to/data')
>>> data = ds.fetch_wq_ts('2009')['2009']
>>> print(data.shape)  # (14610, 4)
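The 14610 rows in the daily example match the number of calendar days from 1981-01-01 to 2020-12-31, the index range shown in the class example. The sketch below checks that count and shows one way an hourly series could be aggregated to daily values. The hourly data is synthetic, and using a daily mean is an assumption made for illustration, not necessarily how the library derives its daily series.

```python
import numpy as np
import pandas as pd

# Number of daily timesteps between the first and last index values quoted above.
n_days = len(pd.date_range("1981-01-01", "2020-12-31", freq="D"))

# Synthetic hourly temperature series covering two full days (made-up values).
hourly_index = pd.date_range("2020-01-01", periods=48, freq=pd.Timedelta(hours=1))
hourly = pd.DataFrame({"temp_sensor": np.arange(48, dtype=float)}, index=hourly_index)

# Aggregate to daily means (an assumed aggregation, for illustration only).
daily = hourly.resample("D").mean()
```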
- class aqua_fetch.GRiMeDB(path=None, **kwargs)
Bases: Datasets
Global river database of methane concentrations and fluxes from 5029 stations of 305 rivers following Stanley et al., 2023.
Examples
>>> from aqua_fetch import GRiMeDB
>>> ds = GRiMeDB(path='/path/to/dataset')
>>> ds.stations()
>>> ds.streams
>>> coords = ds.stn_coords()
>>> coords.shape
(5029, 2)
>>> conc = ds.concentrations(streams=['Indus River'])
>>> conc.shape
(2, 59)
>>> conc = ds.concentrations(parameters=['Q', 'NO3', 'NH4', 'TN', 'SRP', 'TP', 'DOC'])
>>> conc.shape
(25052, 7)
>>> fluxes = ds.fluxes()
>>> fluxes.shape
(7298, 52)
>>> fluxes['Site_ID'].nunique()
1903
>>> sites = ds.sites()
>>> sites['Site_ID'].nunique()
5029
>>> sites['Stream_Name'].nunique()
2722
- __init__(path=None, **kwargs)
- Parameters:
name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping
- concentrations(stations: str | List[str] = 'all', streams: str | List[str] = 'all', parameters: str | List[str] = 'all')
Get concentrations data.
- Parameters:
stations (Union[str, List[str]], optional) – station ID or list of station IDs, by default “all”. If given, then streams must not be given. Check .stations() method for available stations.
streams (Union[str, List[str]], optional) – stream name or list of stream names, by default “all”. If given, then stations must not be given. Check .streams attribute for available streams.
parameters (Union[str, List[str]], optional) – parameters to return, by default “all”. Check .parameters attribute for available parameters.
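The rule that stations and streams cannot both be given can be sketched as a small argument check. The helper below is hypothetical and is not part of aqua_fetch; it only illustrates the documented constraint, treating the default "all" as meaning that no filter was requested.

```python
from typing import List, Union


def check_exclusive(
    stations: Union[str, List[str]] = "all",
    streams: Union[str, List[str]] = "all",
) -> str:
    """Hypothetical helper: report which filter is active, or raise if both are set."""
    stations_given = stations != "all"
    streams_given = streams != "all"
    if stations_given and streams_given:
        raise ValueError("give either stations or streams, not both")
    if stations_given:
        return "stations"
    if streams_given:
        return "streams"
    return "none"
```

For example, check_exclusive(streams=["Indus River"]) selects the stream filter, while passing both arguments raises a ValueError.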
- fluxes(stations: str | List[str] = 'all') -> DataFrame
returns fluxes data as a pandas.DataFrame
- class aqua_fetch.GRQA(download_source: bool = False, path=None, **kwargs)
Bases: Datasets
Global River Water Quality Archive following the work of Virro et al., 2021. This dataset comprises 42 parameters for 94955 sites across 116 countries.
Examples
>>> from aqua_fetch import GRQA
>>> ds = GRQA(path="/mnt/datawaha/hyex/atr/data")
>>> ds.parameters
['TPP', 'PON', 'TEMP', 'TSS', ...]
>>> print(len(ds.parameters))
42
>>> len(ds.countries)
116
>>> len(ds.stations())
94955
>>> coords = ds.stn_coords()
>>> coords.shape
(94955, 2)
>>> country = "Pakistan"
>>> len(ds.fetch_parameter('TEMP', country=country))
1324
>>> df = ds.fetch_parameter("TEMP", country=country)
>>> print(df.shape)
(1324, 38)
>>> df = ds.fetch_parameter("NH4N", country=country)
>>> print(df.shape)
(28, 36)
- __init__(download_source: bool = False, path=None, **kwargs)
- Parameters:
download_source (bool) – whether to download source data or not
- fetch_parameter(parameter: str = 'COD', site_name: List[str] | str = None, country: List[str] | str = None, st: int | str | DatetimeIndex = None, en: int | str | DatetimeIndex = None) -> DataFrame
- Parameters:
- Returns:
- Return type:
pd.DataFrame
Example
>>> from aqua_fetch import GRQA
>>> dataset = GRQA()
>>> df = dataset.fetch_parameter()
>>> # fetch data for only one country
>>> cod_pak = dataset.fetch_parameter("COD", country="Pakistan")
>>> # fetch data for only one site
>>> cod_kotri = dataset.fetch_parameter("COD", site_name="Indus River - at Kotri")
>>> # we can find out the number of data points and sites available for a specific country as below
>>> for para in dataset.parameters:
...     data = dataset.fetch_parameter(para, country="Germany")
...     if len(data) > 0:
...         print(f"{para}, {data.shape}, {len(data['site_name'].unique())}")
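Counting observations and distinct sites, and restricting a frame to a period, can be tried out on a small synthetic table before running it on real data. In the sketch below only the site_name column name comes from the example above; the value column, the dates and the site names are made up, and the date slice is only analogous in spirit to the st and en arguments.

```python
import pandas as pd

# Synthetic stand-in for a frame returned by ``fetch_parameter``; only the
# ``site_name`` column name comes from the example above, the rest is made up.
df = pd.DataFrame(
    {
        "site_name": ["A", "A", "B", "C"],
        "value": [1.2, 1.5, 0.7, 2.1],
    },
    index=pd.to_datetime(["2001-01-01", "2001-02-01", "2001-01-15", "2003-06-01"]),
)

# Number of observations and number of distinct sites.
n_obs = len(df)
n_sites = df["site_name"].nunique()

# Restrict to a period by slicing the sorted datetime index.
subset = df.sort_index().loc["2001-01-01":"2001-12-31"]
```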
- class aqua_fetch.Oligotrend(path=None, **kwargs)
Bases: Datasets
A global database of multi-decadal (1986-2023) time series of chlorophyll-a and 16 other parameters, including N and P, from 1846 unique monitoring locations across estuaries (n=238), lakes (n=687), and rivers (n=969). The dataset consists of 4.3 million observations; most time series cover the period 1986-2022 and comprise at least 15 years of Chl-a observations. For more details, see Minaudo et al., 2025 (https://doi.org/10.5194/essd-17-3411-2025). The data is fetched from the EDI data portal.
Examples
>>> from aqua_fetch import Oligotrend
>>> ds = Oligotrend(path='/path/to/data')
>>> # get names of parameters in the dataset
>>> ds.parameters()
>>> len(ds.parameters())
17
>>> # get list of stations in the dataset
>>> ds.stations()
>>> len(ds.stations())
1846
>>> len(ds.lakes())
685
>>> len(ds.rivers())
924
>>> len(ds.estuaries())
237
>>> # get parameters of a single station
>>> data = ds.fetch_stn_parameters('lake_atlanticoceanseaboard_usa12721')
>>> data.shape
(303, 3)
>>> # get all parameters for specific stations
>>> data = ds.fetch_stns_parameters(['river_ebro_9027', 'river_elbe_elbe_10'])
>>> data['river_ebro_9027'].shape
(287, 8)
>>> data['river_elbe_elbe_10'].shape
(8154, 12)
>>> # get only the 'chla' parameter for the stations
>>> data1 = ds.fetch_stns_parameters(['river_ebro_9027', 'river_elbe_elbe_10'],
...                                   parameters=['chla'])
>>> data1['river_ebro_9027'].shape
(177, 1)
>>> data1['river_elbe_elbe_10'].shape
(413, 1)
- __init__(path=None, **kwargs)
- Parameters:
name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping
- fetch_stn_parameters(stn: str, parameters: str | List[str] = 'all')
Examples
>>> stn_df = ds.fetch_stn_parameters('lake_atlanticoceanseaboard_usa12721')
>>> stn_df.shape
(303, 3)
- fetch_stns_parameters(stns: str | List[str], parameters: str | List[str] = 'all') -> Dict[str, DataFrame]
Fetches the parameters for the given stations.
- Parameters:
- Returns:
A dictionary with the station id as key and a dataframe of parameters as value.
- Return type:
Dict[str, pd.DataFrame]
Examples
>>> data = ds.fetch_stns_parameters(['river_ebro_9027', 'river_elbe_elbe_10'])
>>> data['river_ebro_9027'].shape
(287, 8)
>>> data['river_elbe_elbe_10'].shape
(8154, 12)
>>> data = ds.fetch_stns_parameters(['river_ebro_9027', 'river_elbe_elbe_10'], 'chla')
>>> data['river_ebro_9027'].shape
(177, 1)
>>> data['river_elbe_elbe_10'].shape
(413, 1)
- get_stations(parameter: str, ecosystm: str = 'river') -> Series
Returns a list of stations that have the specified parameter.
Examples
>>> chla_stns = ds.get_stations('chla')
>>> len(chla_stns)
969
- class aqua_fetch.Quadica(path=None, **kwargs)
Bases: Datasets
This is a dataset of 10 water quality parameters from 1386 stations in Germany from 1950 to 2018, following the work of Ebeling et al., 2022. The data is available at monthly and annual timesteps, but the monthly time series are not continuous. The following parameters are available in this dataset:
Q : Discharge
NO3 : Nitrate
NO3N : Nitrate-N
NMin : Nitrogen mineralization
TN : Total Nitrogen
PO4 : Phosphate
PO4P : Phosphate-P
TP : Total Phosphorus
DOC : Dissolved Organic Carbon
TOC : Total Organic Carbon
Examples
>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> len(dataset.stations())
1386
>>> coords = dataset.stn_coords()
>>> coords.shape
(1386, 2)
>>> df = dataset.wrtds_monthly()
>>> df.shape
(50186, 47)
>>> df = dataset.wrtds_annual()
>>> df.shape
(4213, 46)
>>> df = dataset.pet()
>>> df.shape
(828, 1386)
>>> df = dataset.avg_temp()
>>> df.shape
(828, 1388)
>>> df = dataset.precipitation()
>>> df.shape
(828, 1388)
>>> df = dataset.catchment_attributes()
>>> df.shape
(1386, 112)
>>> df = dataset.metadata()
>>> df.shape
(1386, 60)
>>> df = dataset.monthly_medians()
>>> df.shape
(16629, 18)
>>> df = dataset.annual_medians()
>>> df.shape
(24393, 18)
>>> df = dataset.fetch_monthly()
>>> df[0].shape
(50186, 47)
- __init__(path=None, **kwargs)
- Parameters:
name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping
- annual_medians() -> DataFrame
Annual medians over the whole time series of water quality variables and discharge
- Returns:
a dataframe of shape (24393, 18)
- Return type:
pd.DataFrame
- avg_temp(stations: List[int] | int = None, st: str | int | DatetimeIndex = None, en: str | int | DatetimeIndex = None) -> DataFrame
monthly median average temperatures starting from 1950-01 to 2018-09
- Parameters:
stations – name of stations for which data is to be retrieved. By default, data for all stations is retrieved.
st (optional) – starting point of data. By default, the data starts from 1950-01
en (optional) – end point of data. By default, the data ends at 2018-09
- Returns:
a pandas.DataFrame of shape (time_steps, stations). With default input arguments, the shape is (828, 1386)
- Return type:
pd.DataFrame
Examples
>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> df = dataset.avg_temp()  # -> (828, 1388)
- catchment_attributes(parameters: List[str] | str = None, stations: List[int] | int = None) -> DataFrame
Returns static physical catchment attributes in the form of dataframe.
- Parameters:
parameters (list/str, optional, (default=None)) – name/names of static attributes to fetch
stations (list/int, optional (default=None)) – name/names of stations whose static/physical parameters are to be read
- Returns:
a pandas.DataFrame of shape (stations, parameters). With default input arguments, shape is (1386, 113)
- Return type:
pd.DataFrame
Examples
>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> cat_features = dataset.catchment_attributes()
>>> # get attributes of only selected stations
>>> dataset.catchment_attributes(stations=[1,2,3])
- fetch_monthly(parameters: List[str] | str = None, stations: List[int] | int = 'all', median: bool = True, fnc: bool = True, fluxes: bool = True, precipitation: bool = True, avg_temp: bool = True, pet: bool = True, only_continuous: bool = True, cat_features: bool = True, max_nan_tol: int | None = 0) -> Tuple[DataFrame, DataFrame]
Fetches monthly concentrations of water quality parameters.
- Parameters:
parameters (str/list, optional (default=None)) – name or names of water quality parameters to fetch. By default, the following parameters are considered: NO3, NO3N, TN, Nmin, PO4, PO4P, TP, DOC, TOC
stations (int/list, optional (default=None)) – name or names of stations whose data is to be fetched
median (bool, optional (default=True)) – whether to fetch median concentration values or not
fnc (bool, optional (default=True)) – whether to fetch flow normalized concentrations or not
fluxes (bool, optional (default=True)) – Setting this to true will add two parameters i.e. mean_Flux_FEATURE and mean_FNFlux_FEATURE
precipitation (bool, optional (default=True)) – whether to fetch average monthly precipitation or not
avg_temp (bool, optional (default=True)) – whether to fetch average monthly temperature or not
pet (bool, optional (default=True)) – whether to fetch potential evapotranspiration data or not
only_continuous (bool, optional (default=True)) – If true, will return data for only those stations that have continuous monthly timeseries data from 1993-01-01 to 2013-01-01.
cat_features (bool, optional (default=True)) – whether to fetch catchment parameters or not.
max_nan_tol (int, optional (default=0)) – setting this value to 0 will remove the whole time-series with any missing values. If None, no time-series with NaNs values will be removed.
- Returns:
two dataframes whose length is the same but whose columns are different:
a pandas.DataFrame of timeseries of parameters (stations*timesteps, dynamic_features)
a pandas.DataFrame of static parameters (stations*timesteps, catchment_features)
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
Examples
>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> mon_dyn, mon_cat = dataset.fetch_monthly(max_nan_tol=None)
>>> # However, mon_dyn contains data for all parameters, many of which have
>>> # a large number of NaNs. If we want to fetch data only related to TN without any
>>> # missing values, we can do as below
>>> mon_dyn_tn, mon_cat_tn = dataset.fetch_monthly(parameters="TN", max_nan_tol=0)
>>> # if we want to find out how many catchments are included in mon_dyn_tn
>>> len(mon_dyn_tn['OBJECTID'].unique())
25
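One possible reading of max_nan_tol is a per-station cap on missing values: a station's series is kept only if its NaN count does not exceed the tolerance, and None disables the filter. The sketch below applies that reading to synthetic data. The helper filter_by_nan_tol and the column name median_C_TN are hypothetical, and the library's actual implementation may differ.

```python
from typing import Optional

import numpy as np
import pandas as pd


def filter_by_nan_tol(df: pd.DataFrame, column: str, max_nan_tol: Optional[int]) -> pd.DataFrame:
    """Hypothetical helper: drop stations whose ``column`` has too many NaNs."""
    if max_nan_tol is None:
        return df
    # Count missing values per station.
    nan_counts = df[column].isna().groupby(df["OBJECTID"]).sum()
    keep = nan_counts[nan_counts <= max_nan_tol].index
    return df[df["OBJECTID"].isin(keep)]


# Synthetic long-format data: two stations, three monthly values each.
df = pd.DataFrame(
    {
        "OBJECTID": [1, 1, 1, 2, 2, 2],
        "median_C_TN": [1.0, 1.1, 1.2, 2.0, np.nan, 2.2],
    }
)

strict = filter_by_nan_tol(df, "median_C_TN", max_nan_tol=0)
lenient = filter_by_nan_tol(df, "median_C_TN", max_nan_tol=None)
```

Under this reading, a tolerance of 0 keeps only station 1 in the synthetic data, while None keeps every row.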
- metadata() -> DataFrame
fetches the metadata about the stations as pandas’ dataframe. Each row represents metadata about one station and each column represents one feature. The R2 and pbias are regression coefficients and percent bias of WRTDS models for each parameter.
- Returns:
a dataframe of shape (1386, 60)
- Return type:
pd.DataFrame
- monthly_medians(parameters: List[str] | str = None, stations: List[int] | int = None) -> DataFrame
This function reads the c_months.csv file which contains the monthly medians over the whole time series of water quality variables and discharge
- Parameters:
parameters (list/str, optional, (default=None)) – name/names of parameters
stations (list/int, optional (default=None)) – stations for which
- Returns:
a dataframe of shape (16629, 18). 15 of the 18 columns represent a water chemistry parameter. The 16629 rows correspond approximately to 1386*12 (= 16632), where 1386 is the number of stations and 12 is the number of months.
- Return type:
pd.DataFrame
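Monthly medians over the whole time series can be pictured as one median per station and calendar month, so a complete table holds stations times 12 rows. The sketch below computes such a table from synthetic data with a pandas groupby. The column names OBJECTID, month and TN are placeholders chosen for illustration, and this is not the library's own code.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series for two stations over three years (made-up values).
index = pd.date_range("2000-01-01", periods=36, freq="MS")
frames = []
for station in (1, 2):
    frames.append(
        pd.DataFrame({"OBJECTID": station, "month": index.month, "TN": np.arange(36.0) + station})
    )
long = pd.concat(frames, ignore_index=True)

# Median per station and calendar month across all years.
medians = long.groupby(["OBJECTID", "month"])["TN"].median().reset_index()
```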
- pet(stations: List[str] | str = 'all', st: str | int | DatetimeIndex = None, en: str | int | DatetimeIndex = None) -> DataFrame
average monthly potential evapotranspiration starting from 1950-01 to 2018-09
- Returns:
a dataframe of shape (828, 1386), where 828 is the number of months from 1950-01 to 2018-09 and 1386 is the number of stations
- Return type:
pd.DataFrame
Examples
>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> df = dataset.pet()  # -> (828, 1386)
- precipitation(stations: List[int] | int = None, st: str | int | DatetimeIndex = None, en: str | int | DatetimeIndex = None) -> DataFrame
sums of precipitation starting from 1950-01 to 2018-09
- Parameters:
stations – name of stations for which data is to be retrieved. By default, data for all stations is retrieved.
st (optional) – starting point of data. By default, the data starts from 1950-01
en (optional) – end point of data. By default, the data ends at 2018-09
- Returns:
a dataframe of shape (828, 1388)
- Return type:
pd.DataFrame
Examples
>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> df = dataset.precipitation()  # -> (828, 1388)
- stn_coords() -> DataFrame
Returns the coordinates of all the stations in the dataset in wgs84 projection.
- Returns:
A dataframe with columns ‘lat’, ‘long’
- Return type:
pd.DataFrame
- to_DataSet(target: str = 'TP', input_features: list = None, split: str = 'temporal', lookback: int = 24, **ds_args)
This function prepares data for a machine learning prediction problem. It returns an instance of ai4water.preprocessing.DataSetPipeline which can be given to model.fit or model.predict
- Parameters:
target (str, optional (default="TP")) – parameter to consider as target
input_features (list, optional) – names of input parameters
split (str, optional (default="temporal")) – if temporal, validation and test sets are taken from the data of each station and then concatenated. If spatial, training, validation and test sets are decided based upon stations.
lookback (int)
**ds_args – key word arguments
- Returns:
an instance of DataSetPipeline
- Return type:
ai4water.preprocessing.DataSet
Example
>>> from aqua_fetch import Quadica
>>> # initialize the Quadica class
>>> dataset = Quadica()
>>> # define the input parameters
>>> inputs = ['median_Q', 'OBJECTID', 'avg_temp', 'precip', 'pet']
>>> # prepare data for TN as target
>>> dsp = dataset.to_DataSet("TN", inputs, lookback=24)
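The difference between the temporal and spatial values of split can be illustrated on a toy table. This is a sketch under assumptions: the data is synthetic, the step column and the choice of holding out one timestep or one station are made up for illustration, and it does not reproduce what ai4water does internally.

```python
import pandas as pd

# Synthetic long-format data: three stations with four timesteps each.
df = pd.DataFrame(
    {
        "OBJECTID": [s for s in (1, 2, 3) for _ in range(4)],
        "step": [t for _ in (1, 2, 3) for t in range(4)],
    }
)

# Temporal split: hold out the last timestep of every station, then concatenate.
is_last = df.groupby("OBJECTID")["step"].transform("max") == df["step"]
temporal_train, temporal_test = df[~is_last], df[is_last]

# Spatial split: hold out entire stations.
test_stations = [3]
in_test = df["OBJECTID"].isin(test_stations)
spatial_train, spatial_test = df[~in_test], df[in_test]
```

In the temporal case every station contributes rows to both sets, whereas in the spatial case a station appears in exactly one set.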
- wrtds_annual(parameters: str | list = None, st: str | int | DatetimeIndex = None, en: str | int | DatetimeIndex = None) -> DataFrame
Annual median concentrations, flow-normalized concentrations, and mean fluxes estimated using Weighted Regressions on Time, Discharge, and Season (WRTDS) for stations with enough data availability.
- Parameters:
parameters (optional)
st (optional) – starting point of data. By default, the data starts from 1992
en (optional) – end point of data. By default, the data ends at 2013
- Returns:
a dataframe of shape (4213, 46)
- Return type:
pd.DataFrame
Examples
>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> df = dataset.wrtds_annual()
- wrtds_monthly(parameters: str | list = None, stations: List[str] | str = 'all', st: str | int | DatetimeIndex = None, en: str | int | DatetimeIndex = None) -> DataFrame
Monthly median concentrations, flow-normalized concentrations and mean fluxes of water chemistry parameters. These are estimated using Weighted Regressions on Time, Discharge, and Season (WRTDS) for stations with enough data availability. This data is available for a total of 140 stations. The data does not start and end at the same time for all stations, so some stations have more data points than others. The largest number of data points for a station is 576 and the smallest is 244.
- Parameters:
parameters (str/list, optional)
stations (int/list optional (default=None)) – name/names of stations whose data is to be retrieved.
st (optional) – starting point of data. By default, the data starts from 1992-09
en (optional) – end point of data. By default, the data ends at 2013-12
- Returns:
a dataframe of shape (50186, 47)
- Return type:
pd.DataFrame
Examples
>>> from aqua_fetch import Quadica
>>> dataset = Quadica()
>>> df = dataset.wrtds_monthly()
- class aqua_fetch.RC4USCoast(path=None, *args, **kwargs)
Bases: Datasets
Monthly river water chemistry (N, P, SiO2, DO, etc.), discharge and temperature of 140 monitoring sites of US coasts from 1950 to 2020 following the work of Gomez et al., 2022.
Examples
>>> from aqua_fetch import RC4USCoast
>>> dataset = RC4USCoast()
>>> len(dataset.stations)
140
>>> len(dataset.parameters)
27
>>> stn_coords = dataset.stn_coords()
>>> stn_coords.shape
(140, 2)
- __init__(path=None, *args, **kwargs)
- Parameters:
path – path where the data is already downloaded. If None, the data will be downloaded to disk.
- fetch_chem(parameter, stations: List[int] | int | str = 'all', as_dataframe: bool = False, st: int | str | DatetimeIndex = None, en: int | str | DatetimeIndex = None)
Returns water chemistry parameters from one or more stations.
- Parameters:
stations (list, str) – name/names of stations from which the parameters are to be fetched
as_dataframe (bool (default=False)) – whether to return data as pandas.DataFrame or xarray.Dataset
st – start time of data to be fetched. The default starting date is 19500101
en – end time of data to be fetched. The default end date is 20201201
- Return type:
pd.DataFrame or xarray Dataset
Examples
>>> from aqua_fetch import RC4USCoast
>>> ds = RC4USCoast()
>>> data = ds.fetch_chem(['temp', 'do'])
>>> data
>>> data = ds.fetch_chem(['temp', 'do'], as_dataframe=True)
>>> data.shape  # this is a multi-indexed dataframe
(119280, 4)
>>> data = ds.fetch_chem(['temp', 'do'], st="19800101", en="20181230")
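The 119280 rows of the multi-indexed frame equal 852 monthly timesteps (from the default start date 19500101 to the default end date 20201201) times 140 stations. The sketch below reproduces that count with a synthetic index. The level names, the level order and the station ids 1 to 140 are assumptions made for illustration, not taken from the library.

```python
import pandas as pd

# Monthly timesteps between the default start and end dates quoted above.
time = pd.date_range("1950-01-01", "2020-12-01", freq="MS")

# Placeholder station ids 1..140 (the real ids may differ).
stations = range(1, 141)

# One row per (time, station) pair; the level order is an assumption.
index = pd.MultiIndex.from_product([time, stations], names=["time", "station"])
```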
- fetch_q(stations: int | List[int] | str | ndarray = 'all', as_dataframe: bool = True, nv=0, st: int | str | DatetimeIndex = None, en: int | str | DatetimeIndex = None)
returns discharge data
- Parameters:
stations – stations for which q is to be fetched
as_dataframe (bool (default=True)) – whether to return the data as pd.DataFrame or as xarray.Dataset
nv (int (default=0))
st – start time of data to be fetched. The default starting date is 19500101
en – end time of data to be fetched. The default end date is 20201201
Examples
>>> from aqua_fetch import RC4USCoast
>>> ds = RC4USCoast()
>>> # get data of all stations as DataFrame
>>> q = ds.fetch_q("all")
>>> q.shape  # where 140 is the number of stations
(852, 140)
>>> # get data of only two stations
>>> q = ds.fetch_q([1,10])
>>> q.shape
(852, 2)
>>> # get data as xarray Dataset
>>> q = ds.fetch_q("all", as_dataframe=False)
>>> type(q)
xarray.core.dataset.Dataset
>>> # getting data between specific periods
>>> data = ds.fetch_q("all", st="20000101", en="20181230")
- property parameters: List[str]
returns names of parameters
Examples
>>> from aqua_fetch import RC4USCoast
>>> ds = RC4USCoast()
>>> len(ds.parameters)
27
- class aqua_fetch.RiverChemSiberia(path=None, **kwargs)
Bases: Datasets
A database of water chemistry in eastern Siberian rivers following Liu et al., 2022. The dataset consists of meteorological data, water chemistry data, and shapefiles of 7 basins in eastern Siberia. The data was collected from 1991 to 2012. The dataset is available at figshare. The following parameters are available in the dataset:
La
Lo
Ca2+
Mg2+
K+
Na+
Cl-
SO42-
HCO3-
TDS
pH
River
Basin
Subbasin
Tannual
Tmonthly
Pannual
Pmonthly
Lithology
Permafrost type
IB
Discharge
Ori_ID
Li
Sr
As
Ba
Si
87Sr/86Sr
δ18O-H2O
δ2H-H2O
Examples
>>> from aqua_fetch import RiverChemSiberia
>>> ds = RiverChemSiberia()
>>> ds.stations()
['Selenga-Baikal', 'Angara', 'Lena', 'Eastern-Siberia', 'Kolyma', 'Yana', 'Indigirka']
>>> len(ds.parameters)
34
- __init__(path=None, **kwargs)
- Parameters:
name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping
- boundary() DataFrame[source]
Returns the boundary data of the water chemistry in eastern Siberian rivers.
- class aqua_fetch.SyltRoads(path=None, **kwargs)[source]
Bases: Datasets
Dataset of physico-hydro-chemical time series data at Sylt Roads from 1973 - 2019 following Rick et al., 2023. The following parameters are available:
location
Depth water [m]
Sal
Temp [°C]
[PO4]3- [µmol/l]
[NH4]+ [µmol/l]
[NO2]- [µmol/l]
[NO3]- [µmol/l]
Si(OH)4 [µmol/l]
SPM [mg/l]
pH
O2 [µmol/l]
Chl a [µg/l]
DON [µmol/l]
DOP [µmol/l]
DIN [µmol/l]
Examples
>>> from aqua_fetch import SyltRoads
>>> ds = SyltRoads()
- __init__(path=None, **kwargs)[source]
- Parameters:
name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping
- fetch(parameters: str | List[str] = 'all') DataFrame[source]
Fetch the data from the dataset
- Parameters:
parameters (str or List[str], optional) – Parameters to fetch. The default is 'all', which fetches all parameters
- Returns:
DataFrame containing the data
- Return type:
pd.DataFrame
Examples
>>> from aqua_fetch import SyltRoads
>>> ds = SyltRoads()
>>> df = ds.fetch()
>>> df.shape
(5710, 16)
>>> len(ds.parameters)
16
>>> ds.fetch(['Sal', 'Temp [°C]', 'pH']).shape
(5710, 3)
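The `parameters` argument accepts `'all'`, a single name, or a list of names. The following is a minimal sketch of how such an argument could be normalized into a list of column names. The helper `resolve_parameters` is hypothetical and is not part of aqua_fetch; the library's own handling may differ.

```python
from typing import List, Sequence, Union


def resolve_parameters(
    requested: Union[str, Sequence[str]],
    available: Sequence[str],
) -> List[str]:
    """Hypothetical helper: expand 'all' or validate a selection of names."""
    if isinstance(requested, str):
        if requested == "all":
            return list(available)
        requested = [requested]  # a single name becomes a one-item list

    unknown = [name for name in requested if name not in available]
    if unknown:
        raise ValueError(f"unknown parameters: {unknown}")
    return list(requested)


# Three of the SyltRoads column names, used here for illustration only.
available = ["Sal", "Temp [°C]", "pH"]

print(resolve_parameters("all", available))          # ['Sal', 'Temp [°C]', 'pH']
print(resolve_parameters("pH", available))           # ['pH']
print(resolve_parameters(["Sal", "pH"], available))  # ['Sal', 'pH']
```

Raising an error for unknown names, rather than silently dropping them, is a design choice of this sketch.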
- class aqua_fetch.SanFranciscoBay(path=None, **kwargs)[source]
Bases: Datasets
Time series of water quality parameters from 59 stations in San Francisco Bay from 1969 - 2015. For details on the data see Cloern et al., 2017 and Schraga et al., 2017. The following parameters are available:
Depth
Discrete_Chlorophyll
Ratio_DiscreteChlorophyll_Pheopigment
Calculated_Chlorophyll
Discrete_Oxygen
Calculated_Oxygen
Oxygen_Percent_Saturation
Discrete_SPM
Calculated_SPM
Extinction_Coefficient
Salinity
Temperature
Sigma_t
Nitrite
Nitrate_Nitrite
Ammonium
Phosphate
Silicate
Examples
>>> from aqua_fetch import SanFranciscoBay
>>> ds = SanFranciscoBay()
>>> data = ds.data()
>>> data.shape
(212472, 19)
>>> stations = ds.stations()
>>> len(stations)
59
>>> parameters = ds.parameters()
>>> len(parameters)
18
... # fetch data for station 18
>>> stn18 = ds.fetch(stations='18')
>>> stn18.shape
(13944, 18)
- __init__(path=None, **kwargs)[source]
- Parameters:
name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping
- class aqua_fetch.SeluneRiver(path=None, **kwargs)[source]
Bases: Datasets
Dataset of physico-chemical variables measured at different levels during 2021 and 2022 to characterize the hyporheic zone of the Selune River, Manche, Normandie, France, following Moustapha Ba et al., 2023. The data is available at data.gouv.fr. The following variables are available:
water level
temperature
conductivity
oxygen
pressure
- __init__(path=None, **kwargs)[source]
- Parameters:
name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping
- class aqua_fetch.SWatCh(remove_csv_after_download: bool = False, path: str | PathLike = None, **kwargs)[source]
Bases: Datasets
The Surface Water Chemistry (SWatCh) database of 27 variables from 26322 locations, as introduced in Lobke et al., 2022. Note that not all variables are available at all locations. The following variables are available in the dataset:
Total Phosphorus, mixed forms
Sulfate
pH
Temperature, water
Chloride
Magnesium
Calcium
Sodium
Potassium
Aluminum
Nitrate
Nitrite
Fluoride
Hardness, carbonate
Iron
Ammonium
Organic carbon
Bicarbonate
Orthophosphate
Gran acid neutralizing capacity
Alkalinity, total
Inorganic carbon
Carbonate
Alkalinity, carbonate
Hardness, non-carbonate
Carbon Dioxide, free CO2
Alkalinity, Phenolphthalein (total hydroxide+1/2 carbonate)
Examples
>>> from aqua_fetch import SWatCh
>>> ds = SWatCh()
>>> df = ds.fetch()
>>> df.shape
(3901296, 6)
>>> len(ds.parameters)
22
>>> len(ds.sites)
26322
>>> coords = ds.stn_coords()
>>> coords.shape
(26322, 2)
- __init__(remove_csv_after_download: bool = False, path: str | PathLike = None, **kwargs)[source]
- Parameters:
remove_csv_after_download (bool (default=False)) – if True, the csv will be removed after downloading and processing.
- fetch(parameters: list | str = None, station_id: list | str = None, station_names: list | str = None) DataFrame[source]
- Parameters:
parameters (str/list (default=None)) – Names of parameters to fetch. By default, name, value, val_unit, location, lat, and long are read.
station_id (str/list (default=None)) – ID or IDs of the stations for which the data is to be fetched. By default, the data for all stations is fetched. If given, then station_names should not be given.
station_names (str/list (default=None)) – name or names of the stations for which the data is to be fetched. By default, the data for all stations is fetched. If given, then station_id should not be given.
- Return type:
pd.DataFrame
Examples
>>> from aqua_fetch import SWatCh
>>> ds = SWatCh()
>>> df = ds.fetch()
>>> df.shape
(3901296, 6)
>>> st_name = "Jordan Lake"
>>> df = df[df['location'] == st_name]
>>> df.shape
(4, 6)
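Because `station_id` and `station_names` should not be given together, a caller may want to check the arguments before calling `fetch`. The sketch below shows one way to express that rule. The helper `pick_station_filter` is hypothetical and only illustrates the constraint described above; it is not how SWatCh implements it.

```python
from typing import List, Optional, Tuple, Union

StationArg = Optional[Union[str, List[str]]]


def pick_station_filter(
    station_id: StationArg = None,
    station_names: StationArg = None,
) -> Optional[Tuple[str, List[str]]]:
    """Hypothetical helper: return (kind, values), or None for all stations."""
    if station_id is not None and station_names is not None:
        raise ValueError("give either station_id or station_names, not both")

    for kind, value in (("id", station_id), ("name", station_names)):
        if value is not None:
            # accept a single string as well as a list of strings
            values = [value] if isinstance(value, str) else list(value)
            return kind, values
    return None  # no filter: every station is selected


print(pick_station_filter(station_names="Jordan Lake"))  # ('name', ['Jordan Lake'])
print(pick_station_filter())                              # None
```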
- property names: dict
Returns a Python dictionary that maps the parameter names used in this class to their original names in the SWatCh dataset.
- class aqua_fetch.WhiteClayCreek(path=None, **kwargs)[source]
Bases: Datasets
Time series of water quality parameters from White Clay Creek.
chl-a : 2001 - 2012
Dissolved Organic Carbon : 1977 - 2017
- __init__(path=None, **kwargs)[source]
- Parameters:
name – str (default=None) name of dataset
units – str, (default=None) the unit system being used
path – str (default=None) path where the data is available (manually downloaded). If None, it will be downloaded
processes – int number of processes to use for parallel processing
verbosity – int determines the amount of information to be printed
remove_zip – bool whether to remove the zip files after unzipping
- aqua_fetch.busan_beach(inputs: list = None, target: list | str = 'tetx_coppml') DataFrame[source]
Loads the antibiotic resistance genes (ARG) data from a recreational beach in Busan, South Korea, along with environmental variables.
The data is a multivariate time series collected over a period of 2 years during several precipitation events. The environmental data has a frequency of 30 minutes, while the ARG data is sampled discontinuously. The data and its pre-processing are described in detail in Jang et al., 2021.
- Parameters:
inputs –
features to use as input. By default all environmental data is used which consists of following parameters
tide_cm
wat_temp_c
sal_psu
air_temp_c
pcp_mm
pcp3_mm
pcp6_mm
pcp12_mm
wind_dir_deg
wind_speed_mps
air_p_hpa
mslp_hpa
rel_hum
target –
feature/features to use as target/output. By default, tetx_coppml is used as the target. One or more of the following can be used as targets:
ecoli
16s
inti1
Total_args
tetx_coppml
sul1_coppml
blaTEM_coppml
aac_coppml
Total_otus
otu_5575
otu_273
otu_94
- Returns:
a pandas.DataFrame with inputs and target, indexed with a pandas.DatetimeIndex
- Return type:
pd.DataFrame
Examples
>>> from aqua_fetch import busan_beach
>>> dataframe = busan_beach()
>>> dataframe.shape
(1446, 14)
>>> dataframe = busan_beach(target=['tetx_coppml', 'sul1_coppml'])
>>> dataframe.shape
(1446, 15)
See usage here for more details.
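The column counts in the example follow from the arguments: 13 default inputs plus one target gives 14 columns, and two targets give 15. The sketch below reproduces that arithmetic. The helper `expected_columns` is hypothetical, and the ordering of inputs before targets is an assumption rather than documented behavior.

```python
from typing import List, Optional, Union

# The 13 environmental inputs listed in the busan_beach documentation.
DEFAULT_INPUTS = [
    "tide_cm", "wat_temp_c", "sal_psu", "air_temp_c", "pcp_mm", "pcp3_mm",
    "pcp6_mm", "pcp12_mm", "wind_dir_deg", "wind_speed_mps", "air_p_hpa",
    "mslp_hpa", "rel_hum",
]


def expected_columns(
    inputs: Optional[List[str]] = None,
    target: Union[str, List[str]] = "tetx_coppml",
) -> List[str]:
    """Hypothetical helper: inputs first, then targets (order is assumed)."""
    chosen_inputs = DEFAULT_INPUTS if inputs is None else list(inputs)
    chosen_targets = [target] if isinstance(target, str) else list(target)
    return chosen_inputs + chosen_targets


print(len(expected_columns()))                                       # 14
print(len(expected_columns(target=["tetx_coppml", "sul1_coppml"])))  # 15
```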
- aqua_fetch.ecoli_mekong(st: str | Timestamp | int = '20110101', en: str | Timestamp | int = '20211231', parameters: str | list = None, overwrite=False) DataFrame[source]
E. coli data from the Mekong River (Houay Pano) area from 2011 to 2021, following Boithias et al., 2022.
- Parameters:
st (optional) – starting time. The default starting point is 2011-05-25 10:00:00
en (optional) – end time. The default end point is 2021-05-25 15:41:00
parameters (str, optional) – names of features to use. Use all to get all features. By default, the following input features are selected:
station_name: name of station/catchment where the observation was made
T: temperature
EC: electrical conductance
DOpercent: dissolved oxygen concentration
DO: dissolved oxygen saturation
pH: pH
ORP: oxidation-reduction potential
Turbidity: turbidity
TSS: total suspended sediment concentration
E-coli_4dilutions: Escherichia coli concentration
overwrite (bool) – whether to overwrite the downloaded file or not
- Returns:
with default parameters, the shape is (1602, 10)
- Return type:
pd.DataFrame
Examples
>>> from aqua_fetch import ecoli_mekong
>>> ecoli_data = ecoli_mekong()
>>> ecoli_data.shape
(1602, 10)
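The `st` and `en` arguments accept a string, an integer, or a timestamp. The following minimal sketch shows how `YYYYMMDD` values of those types could be normalized and used to clip a time series. The helpers `to_datetime` and `clip_period` are hypothetical, the records are made-up illustrative values, and the library's own parsing may differ.

```python
from datetime import datetime
from typing import List, Tuple, Union

Record = Tuple[datetime, float]


def to_datetime(value: Union[str, int, datetime]) -> datetime:
    """Hypothetical helper: parse 'YYYYMMDD' given as str or int."""
    if isinstance(value, datetime):
        return value
    return datetime.strptime(str(value), "%Y%m%d")


def clip_period(
    records: List[Record],
    st: Union[str, int, datetime] = "20110101",
    en: Union[str, int, datetime] = "20211231",
) -> List[Record]:
    """Keep records with start <= time <= end (end parses to midnight)."""
    start, end = to_datetime(st), to_datetime(en)
    return [(t, v) for t, v in records if start <= t <= end]


# Made-up observations, for illustration only.
records = [
    (datetime(2010, 12, 31, 9, 0), 1.0),
    (datetime(2015, 6, 1, 12, 0), 2.0),
    (datetime(2022, 1, 1, 0, 0), 3.0),
]

print(len(clip_period(records)))               # 1
print(len(clip_period(records, st=20100101)))  # 2
```

Since `pandas.Timestamp` is a subclass of `datetime.datetime`, a timestamp passed to this sketch would be returned unchanged by `to_datetime`.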
- aqua_fetch.ecoli_mekong_laos(st: str | Timestamp | int = '20110101', en: str | Timestamp | int = '20211231', parameters: str | list = None, station_name: str = None, overwrite=False) DataFrame[source]
E. coli data from the Mekong River (Northern Laos).
- Parameters:
- Returns:
with default parameters, the shape is (1131, 10)
- Return type:
pd.DataFrame
Examples
>>> from aqua_fetch import ecoli_mekong_laos
>>> ecoli = ecoli_mekong_laos()
>>> ecoli.shape
(1131, 10)
- aqua_fetch.ecoli_houay_pano(st: str | Timestamp | int = '20110101', en: str | Timestamp | int = '20211231', parameters: str | list = None, overwrite=False) DataFrame[source]
E. coli data from the Mekong River (Houay Pano) area.
- Parameters:
st (optional) – starting time. The default starting point is 2011-05-25 10:00:00
en (optional) – end time. The default end point is 2021-05-25 15:41:00
parameters (str, optional) – names of features to use. Use all to get all features. By default, the following input features are selected:
station_name: name of station/catchment where the observation was made
T: temperature
EC: electrical conductance
DOpercent: dissolved oxygen concentration
DO: dissolved oxygen saturation
pH: pH
ORP: oxidation-reduction potential
Turbidity: turbidity
TSS: total suspended sediment concentration
E-coli_4dilutions: Escherichia coli concentration
overwrite (bool) – whether to overwrite the downloaded file or not
- Returns:
with default parameters, the shape is (413, 10)
- Return type:
pd.DataFrame
Examples
>>> from aqua_fetch import ecoli_houay_pano
>>> ecoli = ecoli_houay_pano()
>>> ecoli.shape
(413, 10)
- aqua_fetch.ecoli_mekong_2016(st: str | Timestamp | int = '20160101', en: str | Timestamp | int = '20161231', parameters: str | list = None, overwrite=False) DataFrame[source]
E. coli data from 29 catchments of the Mekong River, collected in 2016.
- Parameters:
- Returns:
with default parameters, the shape is (58, 10)
- Return type:
pd.DataFrame
Examples
>>> from aqua_fetch import ecoli_mekong_2016
>>> ecoli = ecoli_mekong_2016()
>>> ecoli.shape
(58, 10)