Pulling and caching market data¶
Grab bars from Yahoo Finance and save them to DuckDB. That's the headline use case — this page walks it end-to-end in plain English, then shows the same pattern for other providers and stores.
60-second tour — YF into DuckDB¶
from fundcloud.data import YF, DuckDB
YF(["SPY", "AAPL"]).sync_to(DuckDB("warehouse.duckdb"), key="us_eq")
Three things just happened:
- Fundcloud asked Yahoo Finance for daily bars for SPY and AAPL.
- It wrote them to warehouse.duckdb on disk, in a table named us_eq.
- If us_eq already existed, overlapping days were deduplicated by timestamp — so you can run the same line every morning without producing duplicate rows.
Read it back whenever you need the data:
from fundcloud.data import DuckDB
duck = DuckDB("warehouse.duckdb")
bars = duck.read("us_eq", start="2024-01-01", columns=["close"])
start / end accept strings or pd.Timestamp. columns=["close"]
trims the frame to just the close field across all symbols.
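For example, these two calls are equivalent (the date is arbitrary):
import pandas as pd
bars = duck.read("us_eq", start="2024-01-01")                 # string date
bars = duck.read("us_eq", start=pd.Timestamp("2024-01-01"))   # same thing as a pd.Timestamp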
Running it every day¶
For a scheduled job that keeps the cache fresh, ask the store where it left off yesterday and pull forward from there:
import pandas as pd
from fundcloud.data import YF, DuckDB
duck = DuckDB("warehouse.duckdb")
last = duck.last_index("us_eq") # most recent bar we have
start = last + pd.Timedelta(days=1) if last else None # None = full history on first run
YF(["SPY", "AAPL"]).sync_to(duck, key="us_eq", start=start)
sync_to defaults to mode="upsert", which dedupes by timestamp. If
you pass a start that overlaps with what's already cached, nothing
breaks — overlapping rows are replaced, not duplicated.
Swap DuckDB for Parquet¶
The second argument to sync_to is the storage backend. Parquet works
the same way — files on disk instead of one DB file:
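from fundcloud.data import YF, Parquet
# Same one-liner as the DuckDB tour above, pointed at a folder of Parquet files
YF(["SPY", "AAPL"]).sync_to(Parquet("data/"), key="us_eq")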
Parquet("data/") keeps one .parquet per key under data/.
DuckDB("warehouse.duckdb") keeps one table per key inside a single
DuckDB file. Pick whichever matches your workflow — the rest of your
code is unchanged.
Other data providers¶
Every provider uses the same pattern. Swap YF for whichever one you
have keys for:
| Provider | Class | Install | Auth |
|---|---|---|---|
| Yahoo Finance | YF | uv add 'fundcloud[data-yf]' | none |
| FinancialModelingPrep | FMP | uv add 'fundcloud[data-fmp]' | FMP_API_KEY env var |
| Alpha Vantage | AV | uv add 'fundcloud[data-av]' | ALPHAVANTAGE_API_KEY env var |
| Binance | Binance | uv add 'fundcloud[data-bn]' | none |
Adjusted vs raw equity prices¶
The three equity providers (YF, FMP, AV) default to adjusted
close prices — i.e. dividends and splits folded into a continuous
total-return series. This is the right default for backtests: raw prices drop on every ex-dividend date, which a backtest would misread as a loss.
Pass adjust=False if you genuinely need the as-traded price:
total_return = YF("SGOV").read() # adjusted (default)
as_traded = YF("SGOV", adjust=False).read() # raw, with dividend dips
total_return = FMP("AAPL").read() # uses FMP's `adjClose`
as_traded = FMP("AAPL", adjust=False).read() # FMP's raw `close`
total_return = AV("IBM").read() # `..._ADJUSTED` endpoint
as_traded = AV("IBM", adjust=False).read() # `TIME_SERIES_DAILY` (free-tier)
Binance doesn't take adjust — crypto doesn't have dividends or
corporate actions to adjust for.
Default window — 1 year if you don't specify one¶
Network providers can return decades of history. To keep accidental
calls cheap, a bare read() defaults to one year:
YF("SPY").read() # last ~365 days
YF("SPY").read(start="2010-01-01") # explicit — full history since 2010
YF("SPY").read(end="2024-12-31") # 2023-12-31 → 2024-12-31
The default applies to every network backend (YF, FMP, AV,
Binance). Cache backends (Parquet, DuckDB, Memory, CSV) are
not affected — a bare .read("us_eq") on them returns whatever is
cached, which is the more useful default for local data.
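For contrast, a bare read on the cache store returns everything under the key:
duck.read("us_eq")   # full cached history, no one-year cap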
Syncing the other network providers into the same warehouse looks identical:
from fundcloud.data import FMP, AV, Binance, DuckDB
duck = DuckDB("warehouse.duckdb")
FMP(["AAPL", "MSFT"]).sync_to(duck, key="us_tech") # needs FMP_API_KEY
AV("IBM").sync_to(duck, key="ibm_av") # needs ALPHAVANTAGE_API_KEY
Binance(["BTCUSDT"], interval="1h").sync_to(duck, key="btc_hourly")
Other places to cache¶
Same sync_to signature, different destination:
| Store | Class | Good for |
|---|---|---|
| DuckDB | DuckDB("file.duckdb") | single-file warehouse, SQL over cached data |
| Parquet | Parquet("folder/") | one file per key, cloud / DVC sync |
| In-memory | Memory() | tests, notebooks, short-lived scripts |
| CSV | CSV("file.csv") | read-only by default; use Parquet/DuckDB for writes |
Reading what you've cached¶
All cache stores expose the same small read API:
from fundcloud.data import DuckDB, Parquet
duck = DuckDB("warehouse.duckdb")
duck.keys() # ['us_eq', 'us_tech', 'btc_hourly']
duck.exists("us_eq") # True
duck.last_index("us_eq") # Timestamp('2026-04-18')
# Time-slice + column-prune read
window = duck.read("us_eq", start="2024-01-01", end="2024-03-31", columns=["close"])
# Copy one key to another store
duck.sync_to(Parquet("snapshots/"), key="us_eq")
# Remove a key
duck.delete("us_eq")
Column naming — same shape from every provider¶
Each provider returns OHLCV in the same canonical shape, so downstream code doesn't have to special-case yfinance vs FMP vs Alpha Vantage:
| What you get | Example |
|---|---|
| Lowercase field names | open, high, low, close, volume |
| Snake-cased extras | adj_close (from yfinance's Adj Close, FMP's adjClose) |
| Canonical order | open → high → low → close → volume → adj_close |
| Two-level columns: (field, symbol) | bars[("close", "SPY")] |
Apply the same normalisation to your own frames (e.g. a CSV with weird headers) with the helpers Fundcloud uses internally:
from fundcloud.data import normalize_ohlcv_columns, canonicalize_ohlcv_order
cleaned = canonicalize_ohlcv_order(normalize_ohlcv_columns(my_csv_frame))
Many datasets at once — Catalog¶
When you're pulling three, five, ten datasets regularly, wiring each
by hand gets noisy. Catalog is the bookkeeping layer: register each
dataset once with its source and settings, then refresh them all with
one call.
from fundcloud.data import Catalog, DuckDB, YF, Binance, FMP
cat = Catalog(store=DuckDB("warehouse.duckdb"))
cat.register("us_eq", YF(["SPY", "AAPL"]), tags=("equity",))
cat.register("crypto", Binance(["BTCUSDT"], interval="1h"))
cat.register("macro", FMP(["CPI"]))
cat.refresh_all() # pull new rows for every dataset
cat.load("us_eq", start="2024-01-01") # read from DuckDB, no network call
cat.describe() # DataFrame: one row per dataset
Per-dataset refresh settings¶
Each dataset can carry defaults so you don't have to pass them on every call:
- start — earliest date to pull on the first run. Ignored on subsequent refreshes (the cache watermark takes over).
- end — latest date. Usually omitted to mean "today".
- lookback — how far back to re-pull each refresh. Useful when upstream corrects recent bars (stock splits, exchange revisions) — set it to a few days so corrections flow through. Default is 0.
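A sketch of registering with those settings, assuming register accepts them as keyword arguments (check the Data API reference for the exact parameter names):
cat.register(
    "us_eq",
    YF(["SPY", "AAPL"]),
    start="2015-01-01",   # backfill from here on the very first refresh
    lookback=5,           # re-pull a few trailing days (units assumed) so corrections flow through
    tags=("equity",),
)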
Save the catalog to a file¶
Catalog.to_spec() returns a plain dict — easy to round-trip through
YAML for a config-driven pipeline:
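A minimal sketch of the round-trip, assuming PyYAML is installed (the catalog.yaml filename is arbitrary):
import yaml
spec = cat.to_spec()                 # plain dict describing every registered dataset
with open("catalog.yaml", "w") as f:
    yaml.safe_dump(spec, f)          # human-editable config file
with open("catalog.yaml") as f:
    loaded = yaml.safe_load(f)       # same dict back
Rebuilding a Catalog from the loaded dict depends on the spec format; see the Data API reference.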
Write modes — when to use which¶
Every write (whether you call .write() directly or via sync_to /
Catalog.refresh) accepts a mode argument:
| mode | What it does | When to use |
|---|---|---|
| overwrite | Replace everything under that key | Initial load, full re-sync |
| upsert | Append + dedupe by timestamp | Daily refresh (default for sync_to) |
| append | Concatenate raw, no dedupe | You've validated no overlap is possible |
| error | Fail if key already exists | One-shot bootstrap, refuse to clobber |
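For example, reusing the duck store from the earlier snippets:
YF(["SPY", "AAPL"]).sync_to(duck, key="us_eq", mode="overwrite")   # full re-sync
YF(["SPY", "AAPL"]).sync_to(duck, key="us_eq")                     # daily refresh; mode="upsert" is the default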
Read-only safety¶
All network providers (YF, FMP, AV, Binance) are read-only —
the .write() method raises ReadOnlyError on them. You can't
accidentally try to push data back to Yahoo.
For your local stores, set read_only=True to get the same guarantee
on a production read path:
from fundcloud.data import Parquet
prod = Parquet("data/", read_only=True)
prod.read("us_eq") # ✅ fine
prod.write("us_eq", fresh_df) # ❌ raises ReadOnlyError
Reference — the Backend protocol¶
Everything above — YF, DuckDB, Parquet, Memory, CSV, FMP,
AV, Binance — implements the same small protocol. You only need to
read this section if you're writing your own backend or curious what
sync_to is really doing.
from typing import ClassVar, Protocol
from pandas import DataFrame, Timestamp

class Backend(Protocol):
    name: ClassVar[str]   # short id ("yf", "duckdb", ...)
    read_only: bool

    # Available on every backend
    def read(self, key=None, *, start=None, end=None, columns=None) -> DataFrame: ...
    def keys(self) -> list[str]: ...
    def exists(self, key: str) -> bool: ...
    def last_index(self, key=None) -> Timestamp | None: ...

    # On read-write backends; raises ReadOnlyError when read_only=True
    def write(self, key, df, *, mode='overwrite') -> None: ...
    def delete(self, key) -> None: ...

    # One-liner source → sink transfer (read + write rolled together)
    def sync_to(self, sink, *, key=None, start=None, end=None, mode='upsert') -> DataFrame: ...
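For a concrete (if toy) illustration, the sketch below is a dict-backed store that satisfies the protocol. It is not the library's implementation (the built-in Memory store already covers this case); it assumes frames are timestamp-indexed pandas DataFrames and raises a plain RuntimeError where the real stores raise ReadOnlyError.
from __future__ import annotations
from typing import ClassVar
import pandas as pd

class DictStore:
    """Toy read-write backend: one timestamp-indexed DataFrame per key, held in a dict."""
    name: ClassVar[str] = "dictstore"

    def __init__(self, read_only: bool = False):
        self.read_only = read_only
        self._frames: dict[str, pd.DataFrame] = {}

    def keys(self) -> list[str]:
        return list(self._frames)

    def exists(self, key: str) -> bool:
        return key in self._frames

    def read(self, key=None, *, start=None, end=None, columns=None) -> pd.DataFrame:
        df = self._frames[key].loc[start:end]          # None bounds mean "open-ended"
        return df if columns is None else df[columns]

    def last_index(self, key=None):
        return self._frames[key].index.max() if self.exists(key) else None

    def write(self, key, df, *, mode="overwrite") -> None:
        if self.read_only:
            raise RuntimeError("read-only store")      # the real stores raise ReadOnlyError
        if mode == "upsert" and self.exists(key):      # append + dedupe by timestamp
            merged = pd.concat([self._frames[key], df])
            df = merged[~merged.index.duplicated(keep="last")].sort_index()
        self._frames[key] = df                         # "append" / "error" modes omitted for brevity

    def delete(self, key) -> None:
        self._frames.pop(key, None)

    def sync_to(self, sink, *, key=None, start=None, end=None, mode="upsert") -> pd.DataFrame:
        frame = self.read(key, start=start, end=end)   # read from self, write to the sink
        sink.write(key, frame, mode=mode)
        return frame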
Full auto-generated class reference: Data API reference.