Skip to main content

Technical Deep Dive

How Argus Pulse Uses TimescaleDB to Process APC Data at Scale

March 2026

Transit agencies collect millions of stop-level boarding and alighting records every year. Storing that data is one problem. Querying it fast enough to produce NTD reports, validate quality, and surface route-level trends is a different problem entirely. This article explains how Argus Pulse uses TimescaleDB on PostgreSQL 16 to handle both.

The Problem with APC Data

Let's be direct about the situation most transit agencies find themselves in. You have automatic passenger counters on your buses. Every day, those sensors generate a record for every stop on every trip on every route. For a mid-size agency, that is 50 million or more rows per year of stop-level observation data. Each row carries boardings, alightings, onboard load, GPS coordinates, door open and close times, dwell seconds, deviation from schedule, and segment-level mileage.

That data sits somewhere. Maybe in flat files on an SFTP server. Maybe in a general-purpose relational database that was not designed for time-series workloads. And when your NTD analyst needs to calculate Unlinked Passenger Trips, Passenger Miles Traveled, Revenue Miles, and Revenue Hours for a federal submission, the query takes minutes. Or it times out. Or someone exports to a spreadsheet and does it manually.

This is not a hypothetical. This is what we hear from transit planning directors, IT managers, and procurement teams every time we talk about APC data. The volume is not the hard part. The hard part is making that volume queryable, validatable, and reportable without building a custom data warehouse from scratch.

Why TimescaleDB

Argus Pulse stores all stop-level APC observations in a TimescaleDB hypertable running on PostgreSQL 16. TimescaleDB is an open-source time-series extension for PostgreSQL. It is not a separate database. It runs inside PostgreSQL, uses standard SQL, and works with existing PostgreSQL tooling, backups, and monitoring.

The reason we chose TimescaleDB over a standalone time-series database or a generic PostgreSQL table is straightforward. APC data is time-series data. Every observation is anchored to a service date. Queries almost always filter by date range. And the access pattern is append-heavy: new data arrives daily, and historical data is rarely modified. TimescaleDB is built for exactly this pattern. It partitions data into chunks by time interval, prunes irrelevant chunks at query time, and compresses older data automatically.

Critically, it does all of this while remaining PostgreSQL. Your DBA already knows how to manage it. Your backup strategy does not change. Your compliance team does not need to evaluate a new database product. From their perspective, it is PostgreSQL with an extension.

The Core Schema: stop_observations

The central table in Argus Pulse is stop_observations. This is a TimescaleDB hypertable partitioned by service_date with a one-month chunk interval. One row represents one vehicle at one stop on one trip on one date.

Each record carries the data that downstream reporting needs: boardings, alightings, onboard load, scheduled and actual arrival times, dwell seconds, GPS coordinates, segment miles, segment hours, cumulative miles, cumulative hours, passenger miles, passenger hours, deviation from schedule in seconds, and an on-time performance classification (early, on-time, or late).

stop_observations hypertable (simplified)

service_date         DATE NOT NULL        -- partition key, 1-month chunks
trip_id              VARCHAR(64)          -- matched to GTFS schedule
route_id             VARCHAR(64)          -- FK to routes
stop_id              VARCHAR(64)          -- FK to stops
stop_sequence        INTEGER              -- order within trip
vehicle_id           VARCHAR(64)          -- APC sensor source
boardings            INTEGER              -- door count in
alightings           INTEGER              -- door count out
onboard_load         INTEGER              -- running passenger count
scheduled_arrival    TIME                 -- from GTFS stop_times
actual_arrival       TIMESTAMPTZ          -- from APC timestamp
deviation_seconds    INTEGER              -- actual minus scheduled
otp_status           VARCHAR(16)          -- EARLY, ON_TIME, LATE
segment_miles        DOUBLE PRECISION     -- distance this segment
passenger_miles      DOUBLE PRECISION     -- load x segment miles
dwell_seconds        INTEGER              -- door open duration
gps_latitude         DOUBLE PRECISION
gps_longitude        DOUBLE PRECISION

The composite primary key is (id, service_date), which is required for TimescaleDB hypertable partitioning. Five indexes cover the common query patterns: trip lookups, route-date range scans, stop-level drilldowns, vehicle filtering, and route-date-sequence ordering for sequential trip reconstruction.

When a query asks for all stop observations on Route 10 between January 1 and March 31, TimescaleDB automatically prunes the chunks for April through December. The query only touches three monthly partitions instead of scanning the entire table. For a table with 50 million rows per year, this is the difference between a query that returns in milliseconds and one that takes minutes.

Two-Tier Storage: Stops and Trips

Pulse separates atomic stop-level data from aggregated trip-level data. The stop_observations hypertable holds the raw sensor grain: 50+ million rows per year. A second table, trip_observations, holds pre-aggregated trip-level metrics: total boardings, total alightings, max load, revenue miles, revenue hours, passenger miles, stops served, and validation flags. This table has roughly 1 to 3 million rows per year.

This separation matters because different questions need different grains. When a route planner wants to see which stops have the highest boardings on Route 10, that is a stop-level query against the hypertable. When the NTD analyst needs total UPT by route by month, that is a trip-level aggregation. Running both queries against the same 50-million-row table would be wasteful. The two-tier approach lets each table optimize for its access pattern.

Trip observations also carry a validation_flags column stored as JSONB. This column holds an array of flag codes produced by the business rules engine during processing: INSUFFICIENT_STOPS, INSUFFICIENT_BOARDINGS, LOW_GPS_QUALITY, and others. The is_valid boolean on each trip observation is the result of evaluating all applicable rules. Only valid trips feed into factored ridership and NTD calculations.

Configurable Validation Rules

Every transit agency has different tolerances for data quality. One agency flags a trip as suspect if fewer than two stops were served. Another sets that threshold at five. The OTP threshold for "late" might be 300 seconds at one agency and 600 at another.

Pulse stores all validation rules in a business_rules table with JSONB configuration. Each rule has a name, a category (stop validation, trip validation, OTP, diagnostics, route configuration), a version number, and a JSON config object that holds the actual thresholds. Rules are loaded into memory at application startup and applied during data processing without per-record database lookups.

Six default rule sets ship with the platform: minimum dwell time validation (default 3 seconds), trip validity criteria (minimum stops served, minimum boardings, GPS quality floor), express route handling (separate factoring for express services), deadhead filtering (pullout, pull-in, layover, training, deadhead trip exclusion), on-time performance thresholds (early at -60 seconds, late at +300 seconds, applied at timepoints only), and data quality diagnostics (zero-count alerts, negative load alerts, excessive load caps, maximum boardings per stop).

Every rule change is versioned and recorded in an audit log with the previous value, new value, who changed it, and when. If an agency needs to understand why their Q3 NTD numbers differ from Q2, they can trace it back to a specific rule change on a specific date.

Data Quality Scanning

An automated alert scanner runs hourly against the previous day's data. It checks for eight anomaly types: zero-count trips (valid trips with no boardings and no alightings), negative passenger loads (physically impossible sensor readings), missing stop sequences (gaps in the expected stop order within a trip), excessive load (above a configurable threshold, default 80 passengers), excessive per-stop boardings (above a configurable threshold, default 100), trip validity failures, deadhead trips incorrectly marked as revenue, and excessive schedule deviation.

Each alert is stored with its severity (info, warning, or critical), the affected route, trip, vehicle, and stop, a human-readable message, and a JSONB details field with the specific values that triggered the alert. Alerts are de-duplicated: the scanner will not create a second alert for the same trip, date, and alert type if an unresolved alert already exists. Analysts can resolve alerts with notes, and the resolution is tracked with who resolved it and when.

Factoring and NTD Rollups

Not every scheduled trip gets an APC observation. Sensors fail. Vehicles get substituted. Trips get missed. The NTD requires agencies to account for this gap through data factoring: scaling observed ridership to represent the full scheduled service.

Pulse calculates factored ridership per route per day using a straightforward formula. The factoring ratio is the number of scheduled trips divided by the number of valid observed trips. Factored boardings equal the raw boardings multiplied by the factoring ratio. The same ratio applies to passenger miles traveled. Missed trips are calculated as scheduled trips minus actual trips plus unscheduled trips.

The factoring step runs as a post-processing operation, not inline during ingestion. This is a deliberate architectural choice. It means an agency can re-run factoring with different parameters (after correcting a business rule, for example) without re-ingesting the underlying APC data. Raw trip observations are never modified by the factoring process.

NTD monthly summaries aggregate factored data by route, month, and mode (motor bus, commuter bus, or bus rapid transit). Each monthly summary includes UPT, factored UPT, PMT, revenue miles, revenue hours, scheduled trips, actual trips, valid trips, and data yield percentage. Annual summaries roll up from monthly. The entire chain is traceable: any number in an NTD submission can be traced back through the factoring layer, through the trip observations, down to the individual stop-level sensor readings that produced it.

APC Vendor Integration

Pulse ingests raw APC data via automated SFTP connections. The platform supports data from major APC hardware vendors including Clever Devices, Dilax, Eurotech, and Infodev. Clever Devices systems export raw APC data from CleverReports using the Raw-APC-Data-Schema format. Other vendors follow similar file-based export patterns that the ingestion pipeline normalizes into the common stop_observations schema.

GTFS schedule data is imported separately from a standard GTFS ZIP feed. The importer parses routes, stops, calendar entries, trips, and stop times, then derives patterns (route variants by direction and shape) and pattern-stop sequences with distance-from-origin calculations. Trip matching links observed APC trips to their scheduled counterparts using route, time, and pattern alignment.

Every file ingestion is logged with the file name, file size, record counts (total, processed, failed), processing status, and any error details. SFTP deliveries are tracked separately with delivery type (daily stop data, daily trip data, diagnostics, or NTD exports), retry counts, and delivery timestamps.

The Backend Stack

Argus Pulse runs on Kotlin with Quarkus 3 and Vert.x for reactive, non-blocking I/O. The database layer uses Hibernate Reactive with Panache for ORM operations and a reactive PostgreSQL client for direct SQL execution. Flyway manages 18 versioned database migrations that build the schema incrementally. Authentication uses JWT tokens with OAuth2, stored in HTTP-only cookies, with optional TOTP-based multi-factor authentication. The REST API exposes 24+ endpoints covering trips, stops, ridership aggregations, NTD metrics, diagnostics, configuration, data import, and export.

The system runs in containers: TimescaleDB on PostgreSQL 16, the Quarkus backend, and a Next.js frontend. All configuration is externalized through environment variables. No proprietary infrastructure dependencies. No vendor lock-in on the database layer. If an agency's IT team wants to inspect the database, it is PostgreSQL. Standard tooling, standard backups, standard monitoring.

What This Means for Transit Agencies

If you are evaluating APC data processing platforms, here is what matters. Your agency generates time-series data. The system that stores and queries that data should be optimized for time-series access patterns, not adapted from a general-purpose relational model after the fact. Monthly partitioning, automatic chunk pruning, and time-aware indexing are not features you should have to build yourself.

Your validation rules should be configurable without code changes. Your NTD numbers should be traceable from the annual summary down to the individual sensor reading. Your factoring methodology should be re-runnable without re-ingesting raw data. And the database underneath all of it should be something your existing team already knows how to operate.

That is what Argus Pulse delivers. Not a black box. Not a proprietary data warehouse. PostgreSQL 16 with TimescaleDB, a well-defined schema, configurable business rules, and a clear processing pipeline from sensor to NTD submission.

Learn more about Argus Pulse for your transit agency.