How do you pull ten years of one-minute bars for a few hundred symbols without your script timing out, OOM-ing on a laptop, or silently dropping rows when a split lands in the middle of the range? The honest answer is that the offset-based pagination most developers reach for first will betray you on all three counts, and the fix is to lean on cursor-based pagination the way SiftingIO's historical endpoints expose it.
This post walks through the failure modes of naive pagination, the shape of a cursor response on /v1/<asset_class>/bars, and a small Python loop that streams bars to disk while respecting rate-limit headers. The same pattern applies whether you're pulling AAPL one-minute bars, EUR/USD ticks, or BTC/USDT trades; the unified schema means the loop doesn't change.
Why offset pagination breaks on long ranges#
A classic ?page=N&size=1000 pattern looks fine on a 10,000-row pull and falls apart on a 5-million-row pull. To serve page N, the server has to scan and discard N times 1,000 rows before returning anything, so each request gets linearly more expensive as the offset grows and the total work for the full pull grows quadratically with the page count. By page 500 the request is doing far more work to discard rows than to return them, and the gateway will start cutting you off at 30 seconds.
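A quick back-of-the-envelope makes that concrete. Here's the total number of rows the server has to scan to serve a 5-million-row pull at 1,000 rows per page under naive skip-then-return execution:

page_size = 1_000
pages = 5_000_000 // page_size

# Page k scans k * page_size rows: (k - 1) * page_size discarded, page_size returned.
scanned = sum(k * page_size for k in range(1, pages + 1))
print(f"{scanned:,} rows scanned")  # 12,502,500,000 -- roughly 2,500x the 5M rows you wanted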
There's a worse problem. Bars get adjusted retroactively when corporate actions are confirmed. If a split on AAPL clears while you're on page 47 of 200 and the adjustment inserts a corrected bar ahead of your position, the row that was at offset 47,000 a minute ago is now at offset 47,001, and your next page either skips a bar or returns a duplicate. You won't notice until a backtest comes back with a 0.0001 percent return discrepancy and you spend a week looking for the bug.
Cursor pagination sidesteps both. The server hands you an opaque token that points to a specific row by primary key, not by ordinal position. The next page picks up exactly where the previous one ended, regardless of inserts or adjustments elsewhere in the table.
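To see the difference in miniature, here's a toy Python model of both strategies against a table that mutates mid-pull. The integer keys stand in for bar timestamps; the point is that the offset page shifts under you while the keyset page does not:

rows = list(range(1, 11))   # primary keys of bars already in the table
page1 = rows[0:3]           # offset paging: first page is [1, 2, 3]

rows.insert(0, 0)           # a retroactive correction inserts an earlier row

page2_offset = rows[3:6]    # [3, 4, 5] -- bar 3 comes back a second time
last_key = page1[-1]
page2_keyset = [k for k in rows if k > last_key][:3]  # [4, 5, 6] -- no dup, no gap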
The cursor response shape#
A call to the historical bars endpoint with a date range that exceeds one page returns a next_cursor field alongside the data. The exact resource name lives in /docs, but the shape is consistent across asset classes:
curl -H "Authorization: Bearer $SIFTING_KEY" \
"https://api.sifting.io/v1/equities/bars?symbol=AAPL&interval=1m&from=2024-01-01&to=2024-12-31&limit=5000"
The response carries the bars and a cursor when there's more to fetch:
{
  "symbol": "AAPL",
  "asset_class": "equity",
  "interval": "1m",
  "bars": [
    {"ts": "2024-01-02T14:30:00Z", "o": 187.15, "h": 187.42, "l": 187.02, "c": 187.31, "v": 1842310},
    {"ts": "2024-01-02T14:31:00Z", "o": 187.31, "h": 187.55, "l": 187.20, "c": 187.48, "v": 612045}
  ],
  "next_cursor": "eyJ0cyI6IjIwMjQtMDEtMDJUMTQ6MzI6MDBaIn0",
  "has_more": true
}
When has_more is false, next_cursor is null and you stop. Pass the cursor back as ?cursor=<token> on the next request and the server resumes from the exact bar after the last one returned.
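Treat the token as opaque: its internals are an implementation detail that can change without notice. Purely as an illustration of what "points to a specific row by primary key" means, though, the sample token above happens to be base64url-encoded JSON, and you can peek at it:

import base64
import json

token = "eyJ0cyI6IjIwMjQtMDEtMDJUMTQ6MzI6MDBaIn0"
pad = "=" * (-len(token) % 4)  # restore the stripped base64 padding
print(json.loads(base64.urlsafe_b64decode(token + pad)))
# {'ts': '2024-01-02T14:32:00Z'} -- the timestamp the next page resumes from

Don't build anything on that structure; pass the token back verbatim.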
A streaming loop in Python#
The goal here is to pull a year of one-minute bars for AAPL and write them to a Parquet file without holding everything in memory. Pandas plus pyarrow is enough; no special SDK required.
import os
import time

import requests
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

API = "https://api.sifting.io/v1/equities/bars"
KEY = os.environ["SIFTING_KEY"]

def fetch_bars(symbol, start, end, interval="1m", limit=5000):
    """Yield one page of bars at a time, following next_cursor to the end."""
    cursor = None
    while True:
        params = {
            "symbol": symbol,
            "interval": interval,
            "from": start,
            "to": end,
            "limit": limit,
        }
        if cursor:
            params["cursor"] = cursor
        r = requests.get(API, params=params,
                         headers={"Authorization": f"Bearer {KEY}"},
                         timeout=30)
        if r.status_code == 429:
            # Rate limited: honor Retry-After, then retry the same page.
            wait = int(r.headers.get("Retry-After", "5"))
            time.sleep(wait)
            continue
        r.raise_for_status()
        data = r.json()
        yield data["bars"]
        if not data.get("has_more"):
            break
        cursor = data["next_cursor"]

writer = None
schema = None
for batch in fetch_bars("AAPL", "2024-01-01", "2024-12-31"):
    if not batch:
        continue
    df = pd.DataFrame(batch)
    table = pa.Table.from_pandas(df)
    if writer is None:
        # Lock the schema from the first page and cast later pages to it,
        # so a page that happens to infer different types can't break the file.
        schema = table.schema
        writer = pq.ParquetWriter("aapl_2024_1m.parquet", schema)
    writer.write_table(table.cast(schema))
if writer:
    writer.close()
The generator yields one page at a time, each page goes straight to disk, and peak memory stays bounded by limit rather than by the size of the year. On a laptop this runs in a few minutes for a single symbol and stays under 100 MB resident. For a watchlist of 500 symbols, wrap the outer loop in a thread pool with a small concurrency (4 to 8 workers) and respect the X-RateLimit-Remaining header so you don't get a 429 cascade halfway through.
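Here's one way that wrapper can look, reusing fetch_bars from above. The watchlist, the per-symbol file naming, and the worker count are illustrative choices, not anything the API prescribes:

from concurrent.futures import ThreadPoolExecutor, as_completed

def pull_symbol(symbol):
    # One independent cursor loop per symbol; cursors can't be shared.
    writer = None
    schema = None
    for batch in fetch_bars(symbol, "2024-01-01", "2024-12-31"):
        if not batch:
            continue
        table = pa.Table.from_pandas(pd.DataFrame(batch))
        if writer is None:
            schema = table.schema
            writer = pq.ParquetWriter(f"{symbol.lower()}_2024_1m.parquet", schema)
        writer.write_table(table.cast(schema))
    if writer:
        writer.close()
    return symbol

watchlist = ["AAPL", "MSFT", "NVDA"]  # ...extend to your full list
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(pull_symbol, s) for s in watchlist]
    for fut in as_completed(futures):
        print("finished", fut.result())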
Common pitfalls#
Three things bite developers on their first long pull.
First, the from and to parameters are inclusive on both ends. If you split a year into monthly chunks and ask for from=2024-01-01&to=2024-01-31 followed by from=2024-01-31&to=2024-02-29, the bar at midnight on the 31st shows up twice. Either make the chunk boundaries exclusive (from=2024-02-01) or dedupe on the timestamp after the fact. Cursor pagination avoids this entirely because the cursor encodes the last-seen primary key, not a calendar boundary.
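If you do chunk by calendar anyway, here's a sketch of non-overlapping monthly boundaries plus the dedupe fallback. The pandas calls are standard; only the chunking scheme itself is an illustrative choice:

import pandas as pd

# Each chunk starts the day after the previous one ends, so the
# inclusive from/to boundaries never collide.
starts = pd.date_range("2024-01-01", "2025-01-01", freq="MS")
chunks = [
    (a.strftime("%Y-%m-%d"), (b - pd.Timedelta(days=1)).strftime("%Y-%m-%d"))
    for a, b in zip(starts[:-1], starts[1:])
]
# [('2024-01-01', '2024-01-31'), ('2024-02-01', '2024-02-29'), ...]

# Or pull overlapping chunks and dedupe on the bar timestamp afterward:
# df = pd.concat(frames).drop_duplicates(subset="ts").sort_values("ts")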
Second, the cursor token is opaque and tied to the original query parameters. Changing interval or symbol mid-loop invalidates the cursor and the next request returns a 400 with a message about a malformed cursor, not the more obvious "query mismatch". If you need to pull multiple symbols, run separate loops per symbol; do not try to share a cursor across them.
Third, watch the X-RateLimit-Remaining and X-RateLimit-Reset headers. The free tier and lower paid tiers will return 429s well before your pull finishes if you fire requests in a tight loop. The example above checks Retry-After on a 429, which is the defensive minimum. A better pattern is to inspect X-RateLimit-Remaining on every response and sleep proactively when it drops below a threshold, so you never trip a 429 and waste a round trip.
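A sketch of that proactive check, which drops into the loop in place of the bare requests.get. The header names are the ones this post describes; treating X-RateLimit-Reset as a Unix timestamp is an assumption to verify against /docs for your plan:

import time
import requests

def get_throttled(url, params, headers, min_remaining=5):
    r = requests.get(url, params=params, headers=headers, timeout=30)
    remaining = int(r.headers.get("X-RateLimit-Remaining", "1000000"))
    if remaining < min_remaining:
        # Assumed semantics: X-RateLimit-Reset is the epoch second the
        # window reopens. Sleep through it instead of tripping a 429.
        reset = float(r.headers.get("X-RateLimit-Reset", "0"))
        time.sleep(max(reset - time.time(), 1.0))
    return r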
When to reach for batch exports instead#
If you find yourself running the same multi-year, multi-symbol pull on a schedule, the per-request loop is the wrong tool. Historical batch exports are designed for that case and deliver a compressed file rather than tens of thousands of paginated responses. The cursor pattern in this post is for ad-hoc and incremental pulls; the batch endpoint is for the once-a-day or once-a-week refresh.
For the API shape, query parameters, and current rate-limit ceilings per plan, read the docs.
