Pulling and caching market data¶
Grab bars from Yahoo Finance and save them to DuckDB. That's the headline use case — this page walks it end-to-end in plain English, then shows the same pattern for other providers and stores.
60-second tour — YF into DuckDB¶
from fundcloud.data import YF, DuckDB
YF(["SPY", "AAPL"]).sync_to(DuckDB("warehouse.duckdb"), key="us_eq")
Three things just happened:
- Fundcloud asked Yahoo Finance for daily bars for SPY and AAPL.
- It wrote them to warehouse.duckdb on disk, in a table named us_eq.
- If us_eq already existed, overlapping days were deduplicated by timestamp — so you can run the same line every morning without producing duplicate rows.
Read it back whenever you need the data:
from fundcloud.data import DuckDB
duck = DuckDB("warehouse.duckdb")
bars = duck.read("us_eq", start="2024-01-01", columns=["close"])
start / end accept strings or pd.Timestamp. columns=["close"]
trims the frame to just the close field across all symbols.
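For example, these two calls are equivalent (the date is arbitrary):
import pandas as pd
bars = duck.read("us_eq", start="2024-01-01")                 # string date
bars = duck.read("us_eq", start=pd.Timestamp("2024-01-01"))   # same thing as a pd.Timestamp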
Running it every day¶
For a scheduled job that keeps the cache fresh, ask the store where it left off yesterday and pull forward from there:
import pandas as pd
from fundcloud.data import YF, DuckDB
duck = DuckDB("warehouse.duckdb")
last = duck.last_index("us_eq") # most recent bar we have
start = last + pd.Timedelta(days=1) if last else None # None = full history on first run
YF(["SPY", "AAPL"]).sync_to(duck, key="us_eq", start=start)
sync_to defaults to mode="upsert", which dedupes by timestamp. If
you pass a start that overlaps with what's already cached, nothing
breaks — overlapping rows are replaced, not duplicated.
Swap DuckDB for Parquet¶
The second argument to sync_to is the storage backend. Parquet works
the same way — files on disk instead of one DB file:
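from fundcloud.data import YF, Parquet
# Same one-liner as the DuckDB tour above, pointed at a folder of Parquet files
YF(["SPY", "AAPL"]).sync_to(Parquet("data/"), key="us_eq")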
Parquet("data/") keeps one .parquet per key under data/.
DuckDB("warehouse.duckdb") keeps one table per key inside a single
DuckDB file. Pick whichever matches your workflow — the rest of your
code is unchanged.
Other data providers¶
Every provider uses the same pattern. Swap YF for whichever one you
have keys for:
| Provider | Class | Install | Auth |
|---|---|---|---|
| Yahoo Finance | YF | uv add 'fundcloud[data-yf]' | none |
| FinancialModelingPrep | FMP | uv add 'fundcloud[data-fmp]' | FMP_API_KEY env var |
| Alpha Vantage | AV | uv add 'fundcloud[data-av]' | ALPHAVANTAGE_API_KEY env var |
| Binance | Binance | uv add 'fundcloud[data-bn]' | none |
Adjusted vs raw equity prices¶
The three equity providers (YF, FMP, AV) default to adjusted
close prices — i.e. dividends and splits folded into a continuous
total-return series. This is the right default for backtests: raw prices drop on every ex-dividend date, which a backtest would misread as a loss.
Pass adjust=False if you genuinely need the as-traded price:
total_return = YF("SGOV").read() # adjusted (default)
as_traded = YF("SGOV", adjust=False).read() # raw, with dividend dips
total_return = FMP("AAPL").read() # uses FMP's `adjClose`
as_traded = FMP("AAPL", adjust=False).read() # FMP's raw `close`
total_return = AV("IBM").read() # `..._ADJUSTED` endpoint
as_traded = AV("IBM", adjust=False).read() # `TIME_SERIES_DAILY` (free-tier)
Binance doesn't take adjust — crypto doesn't have dividends or
corporate actions to adjust for.
Default window — 1 year if you don't specify one¶
Network providers can return decades of history. To keep accidental
calls cheap, a bare read() defaults to one year:
YF("SPY").read() # last ~365 days
YF("SPY").read(start="2010-01-01") # explicit — full history since 2010
YF("SPY").read(end="2024-12-31") # 2023-12-31 → 2024-12-31
The default applies to every network backend (YF, FMP, AV,
Binance). Cache backends (Parquet, DuckDB, Memory, CSV) are
not affected — a bare .read("us_eq") on them returns whatever is
cached, which is the more useful default for local data.
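For contrast, a bare read on the cache store returns everything under the key:
duck.read("us_eq")   # full cached history, no one-year cap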
Syncing the other network providers into the same warehouse looks identical:
from fundcloud.data import FMP, AV, Binance, DuckDB
duck = DuckDB("warehouse.duckdb")
FMP(["AAPL", "MSFT"]).sync_to(duck, key="us_tech") # needs FMP_API_KEY
AV("IBM").sync_to(duck, key="ibm_av") # needs ALPHAVANTAGE_API_KEY
Binance(["BTCUSDT"], interval="1h").sync_to(duck, key="btc_hourly")
Other places to cache¶
Same sync_to signature, different destination:
| Store | Class | Good for |
|---|---|---|
| DuckDB | DuckDB("file.duckdb") | single-file warehouse, SQL over cached data |
| Parquet | Parquet("folder/") | one file per key, cloud / DVC sync |
| In-memory | Memory() | tests, notebooks, short-lived scripts |
| CSV | CSV("file.csv") | read-only by default; use Parquet/DuckDB for writes |
Reading what you've cached¶
All cache stores expose the same small read API:
from fundcloud.data import DuckDB, Parquet
duck = DuckDB("warehouse.duckdb")
duck.keys() # ['us_eq', 'us_tech', 'btc_hourly']
duck.exists("us_eq") # True
duck.last_index("us_eq") # Timestamp('2026-04-18')
# Time-slice + column-prune read
window = duck.read("us_eq", start="2024-01-01", end="2024-03-31", columns=["close"])
# Copy one key to another store
duck.sync_to(Parquet("snapshots/"), key="us_eq")
# Remove a key
duck.delete("us_eq")
Column naming — same shape from every provider¶
Each provider returns OHLCV in the same canonical shape, so downstream code doesn't have to special-case yfinance vs FMP vs Alpha Vantage:
| What you get | Example |
|---|---|
| Lowercase field names | open, high, low, close, volume |
| Snake-cased extras | adj_close (from yfinance's Adj Close, FMP's adjClose) |
| Canonical order | open → high → low → close → volume → adj_close |
| Two-level columns: (field, symbol) | bars[("close", "SPY")] |
Apply the same normalisation to your own frames (e.g. a CSV with weird headers) with the helpers Fundcloud uses internally:
from fundcloud.data import normalize_ohlcv_columns, canonicalize_ohlcv_order
cleaned = canonicalize_ohlcv_order(normalize_ohlcv_columns(my_csv_frame))
Many datasets at once — Catalog¶
When you're pulling three, five, ten datasets regularly, wiring each
by hand gets noisy. Catalog is the bookkeeping layer: register each
dataset once with its source and settings, then refresh them all with
one call.
from fundcloud.data import Catalog, DuckDB, YF, Binance, FMP
cat = Catalog(store=DuckDB("warehouse.duckdb"))
cat.register("us_eq", YF(["SPY", "AAPL"]), tags=("equity",))
cat.register("crypto", Binance(["BTCUSDT"], interval="1h"))
cat.register("macro", FMP(["CPI"]))
cat.refresh_all() # pull new rows for every dataset
cat.load("us_eq", start="2024-01-01") # read from DuckDB, no network call
cat.describe() # DataFrame: one row per dataset
Per-dataset refresh settings¶
Each dataset can carry defaults so you don't have to pass them on every call:
- start — earliest date to pull on the first run. Ignored on subsequent refreshes (the cache watermark takes over).
- end — latest date. Usually omitted to mean "today".
- lookback — how far back to re-pull each refresh. Useful when upstream corrects recent bars (stock splits, exchange revisions) — set it to a few days so corrections flow through. Default is 0.
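A sketch of registering with those settings, assuming register accepts them as keyword arguments (check the Data API reference for the exact parameter names):
cat.register(
    "us_eq",
    YF(["SPY", "AAPL"]),
    start="2015-01-01",   # backfill from here on the very first refresh
    lookback=5,           # re-pull a few trailing days (units assumed) so corrections flow through
    tags=("equity",),
)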
Save the catalog to a file¶
Catalog.to_spec() returns a plain dict — easy to round-trip through
YAML for a config-driven pipeline:
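A minimal sketch of the round-trip, assuming PyYAML is installed (the catalog.yaml filename is arbitrary):
import yaml
spec = cat.to_spec()                 # plain dict describing every registered dataset
with open("catalog.yaml", "w") as f:
    yaml.safe_dump(spec, f)          # human-editable config file
with open("catalog.yaml") as f:
    loaded = yaml.safe_load(f)       # same dict back
Rebuilding a Catalog from the loaded dict depends on the spec format; see the Data API reference.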
Write modes — when to use which¶
Every write (whether you call .write() directly or via sync_to /
Catalog.refresh) accepts a mode argument:
| mode | What it does | When to use |
|---|---|---|
| overwrite | Replace everything under that key | Initial load, full re-sync |
| upsert | Append + dedupe by timestamp | Daily refresh (default for sync_to) |
| append | Concatenate raw, no dedupe | You've validated no overlap is possible |
| error | Fail if key already exists | One-shot bootstrap, refuse to clobber |
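For example, reusing the duck store from the earlier snippets:
YF(["SPY", "AAPL"]).sync_to(duck, key="us_eq", mode="overwrite")   # full re-sync
YF(["SPY", "AAPL"]).sync_to(duck, key="us_eq")                     # daily refresh; mode="upsert" is the default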
Read-only safety¶
All network providers (YF, FMP, AV, Binance) are read-only —
the .write() method raises ReadOnlyError on them. You can't
accidentally try to push data back to Yahoo.
For your local stores, set read_only=True to get the same guarantee
on a production read path:
from fundcloud.data import Parquet
prod = Parquet("data/", read_only=True)
prod.read("us_eq") # ✅ fine
prod.write("us_eq", fresh_df) # ❌ raises ReadOnlyError
Reference — the Backend protocol¶
Everything above — YF, DuckDB, Parquet, Memory, CSV, FMP,
AV, Binance — implements the same small protocol. You only need to
read this section if you're writing your own backend or curious what
sync_to is really doing.
from typing import ClassVar, Protocol
from pandas import DataFrame, Timestamp

class Backend(Protocol):
    name: ClassVar[str]   # short id ("yf", "duckdb", ...)
    read_only: bool

    # Available on every backend
    def read(self, key=None, *, start=None, end=None, columns=None) -> DataFrame: ...
    def keys(self) -> list[str]: ...
    def exists(self, key: str) -> bool: ...
    def last_index(self, key=None) -> Timestamp | None: ...

    # On read-write backends; raises ReadOnlyError when read_only=True
    def write(self, key, df, *, mode='overwrite') -> None: ...
    def delete(self, key) -> None: ...

    # One-liner source → sink transfer (read + write rolled together)
    def sync_to(self, sink, *, key=None, start=None, end=None, mode='upsert') -> DataFrame: ...
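For a concrete (if toy) illustration, the sketch below is a dict-backed store that satisfies the protocol. It is not the library's implementation (the built-in Memory store already covers this case); it assumes frames are timestamp-indexed pandas DataFrames and raises a plain RuntimeError where the real stores raise ReadOnlyError.
from __future__ import annotations
from typing import ClassVar
import pandas as pd

class DictStore:
    """Toy read-write backend: one timestamp-indexed DataFrame per key, held in a dict."""
    name: ClassVar[str] = "dictstore"

    def __init__(self, read_only: bool = False):
        self.read_only = read_only
        self._frames: dict[str, pd.DataFrame] = {}

    def keys(self) -> list[str]:
        return list(self._frames)

    def exists(self, key: str) -> bool:
        return key in self._frames

    def read(self, key=None, *, start=None, end=None, columns=None) -> pd.DataFrame:
        df = self._frames[key].loc[start:end]          # None bounds mean "open-ended"
        return df if columns is None else df[columns]

    def last_index(self, key=None):
        return self._frames[key].index.max() if self.exists(key) else None

    def write(self, key, df, *, mode="overwrite") -> None:
        if self.read_only:
            raise RuntimeError("read-only store")      # the real stores raise ReadOnlyError
        if mode == "upsert" and self.exists(key):      # append + dedupe by timestamp
            merged = pd.concat([self._frames[key], df])
            df = merged[~merged.index.duplicated(keep="last")].sort_index()
        self._frames[key] = df                         # "append" / "error" modes omitted for brevity

    def delete(self, key) -> None:
        self._frames.pop(key, None)

    def sync_to(self, sink, *, key=None, start=None, end=None, mode="upsert") -> pd.DataFrame:
        frame = self.read(key, start=start, end=end)   # read from self, write to the sink
        sink.write(key, frame, mode=mode)
        return frame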
Full auto-generated class reference: Data API reference.