The io_pipeline module contains internal helpers for transforming raw JSON data into structured DataFrames. These functions handle column renaming, type coercion, and data validation.
This module contains internal implementation details. The API is subject to change. Most users should use the high-level Session API instead.

Overview

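The I/O pipeline follows this flow: raw JSON is optionally validated (_validate_json_payload), converted into a DataFrame (_create_lap_df for lap data, _create_session_df for session-level data), and lap DataFrames are then post-processed (_process_lap_df).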

Functions

_validate_json_payload

def _validate_json_payload(
    path: str,
    data: dict[str, Any]
) -> dict[str, Any]
Validate raw JSON payload using Pydantic schemas if validation is enabled in the global config.
Parameters:
  • path: Resource path for error context (e.g., “laps/VER/19_tel.json”)
  • data: Raw JSON dictionary
Returns:
  • Validated JSON dictionary
Raises:
  • InvalidDataError: If validation fails
This function uses the global config singleton. The underlying implementation in async_fetch.py accepts a config parameter, but the exported version in io_pipeline.py uses the global config automatically.

_extract_driver_codes

def _extract_driver_codes(drivers: list[dict] | None) -> set[str]
Extract set of driver codes from drivers payload.
Parameters:
  • drivers: List of driver dictionaries from session JSON, or None
Returns:
  • Set of 3-letter driver codes (e.g., {"VER", "HAM"})
Example:
drivers = [
    {"driver": "VER", "dn": "33", "team": "Red Bull Racing"},
    {"driver": "HAM", "dn": "44", "team": "Mercedes"}
]
codes = _extract_driver_codes(drivers)
# Returns: {"VER", "HAM"}
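The behavior described above could be sketched as follows. This is a minimal illustration, not the module's actual implementation: extract_driver_codes_sketch is a hypothetical stand-in, and it assumes each entry stores its code under the "driver" key, as in the example payload.

```python
from __future__ import annotations

from typing import Any


def extract_driver_codes_sketch(drivers: list[dict[str, Any]] | None) -> set[str]:
    """Collect driver codes; None or an empty list yields an empty set."""
    if not drivers:
        return set()
    # Assumption: entries without a usable "driver" key are skipped, not errors.
    return {d["driver"] for d in drivers if d.get("driver")}


drivers = [
    {"driver": "VER", "dn": "33", "team": "Red Bull Racing"},
    {"driver": "HAM", "dn": "44", "team": "Mercedes"},
]
print(sorted(extract_driver_codes_sketch(drivers)))  # ['HAM', 'VER']
print(extract_driver_codes_sketch(None))             # set()
```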

_extract_driver_info_map

def _extract_driver_info_map(
    drivers: list[dict] | None
) -> dict[str, dict]
Extract driver metadata from drivers payload into a lookup map.
Parameters:
  • drivers: List of driver dictionaries from session JSON, or None
Returns:
  • Dictionary mapping driver codes to raw metadata dictionaries containing:
    • driver: 3-letter driver code
    • dn: Driver number (as string)
    • team: Team name
    • first_name: Driver’s first name
    • last_name: Driver’s last name
    • team_color: Hex color code
    • headshot_url: URL to driver photo
Example:
drivers = [
    {
        "driver": "VER",
        "dn": "33",
        "team": "Red Bull Racing",
        "first_name": "Max",
        "last_name": "Verstappen",
        "team_color": "#3671C6",
        "headshot_url": "https://..."
    }
]
info_map = _extract_driver_info_map(drivers)
# Returns: {"VER": {"driver": "VER", "dn": "33", ...}}
The returned dictionary contains raw JSON keys (snake_case), not the renamed DataFrame columns (PascalCase). Column renaming happens in _process_lap_df.
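To illustrate the raw-key point, here is a small sketch of building and querying such a lookup map. The helper name extract_driver_info_map_sketch is hypothetical, and the code assumes entries are keyed by their "driver" field; the real function may differ.

```python
from __future__ import annotations

from typing import Any


def extract_driver_info_map_sketch(
    drivers: list[dict[str, Any]] | None,
) -> dict[str, dict[str, Any]]:
    """Map each driver code to its raw metadata dictionary."""
    if not drivers:
        return {}
    return {d["driver"]: d for d in drivers if d.get("driver")}


info_map = extract_driver_info_map_sketch(
    [{"driver": "VER", "dn": "33", "team": "Red Bull Racing"}]
)

# Look up metadata with the raw JSON keys ("team", "dn"), not "Team" or "DriverNumber".
print(info_map["VER"]["team"])    # Red Bull Racing
print("Team" in info_map["VER"])  # False
```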

_create_lap_df

def _create_lap_df(
    lap_data: dict,
    driver: str,
    team: str,
    lib: str
) -> DataFrame
Create a DataFrame from raw lap data JSON with driver and team metadata.
Parameters:
  • lap_data: Dictionary of lap data arrays (not a list of dicts). Keys are internal JSON field names like "lap", "time", "s1", etc.
  • driver: 3-letter driver code (e.g., “VER”)
  • team: Team name (e.g., “Red Bull Racing”)
  • lib: DataFrame library to use (“pandas” or “polars”)
Returns:
  • DataFrame with raw lap timing data (before column renaming)
Raw columns created (before renaming):
  • lap: Lap number (1-indexed)
  • time: Lap time in seconds
  • s1, s2, s3: Sector times
  • compound: Tire compound
  • life: Tire age in laps
  • stint: Stint number
  • pb: Personal best flag
  • vi1, vi2, vfl, vst: Speed trap values
  • status: Track status code
  • pos: Position at lap end
  • del: Lap deleted flag
  • delR: Deletion reason
  • ff1G: FastF1 generated flag
  • Driver: Driver code (added by this function)
  • Team: Team name (added by this function)
Example:
# 2021 Belgian GP Race - Verstappen lap data
lap_data = {
    "lap": [1, 2, 3],
    "time": [132.765, 108.901, 107.523],
    "s1": [44.123, 35.234, 34.987],
    "s2": [48.234, 38.123, 37.891],
    "s3": [40.408, 35.544, 34.645],
    "compound": ["INTERMEDIATE", "INTERMEDIATE", "INTERMEDIATE"],
    "life": [1, 2, 3],
    "stint": [1, 1, 1]
}
df = _create_lap_df(lap_data, "VER", "Red Bull Racing", "pandas")
This function does NOT rename columns. Raw JSON keys are preserved. Use _process_lap_df to apply column renaming and type coercion.
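Because lap_data is column-oriented (one array per field) rather than a list of per-lap dictionaries, it can help to see the two shapes side by side. The helpers below are hypothetical utilities for illustration only and are not part of io_pipeline.

```python
from __future__ import annotations

from typing import Any


def columns_to_rows(columns: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Turn {"lap": [1, 2], ...} into [{"lap": 1, ...}, {"lap": 2, ...}]."""
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*columns.values())]


def rows_to_columns(rows: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Inverse conversion: the column-oriented shape that _create_lap_df expects."""
    keys = list(rows[0]) if rows else []
    return {key: [row.get(key) for row in rows] for key in keys}


lap_data = {
    "lap": [1, 2, 3],
    "time": [132.765, 108.901, 107.523],
    "compound": ["INTERMEDIATE", "INTERMEDIATE", "INTERMEDIATE"],
}
rows = columns_to_rows(lap_data)
print(rows[0])  # {'lap': 1, 'time': 132.765, 'compound': 'INTERMEDIATE'}
print(rows_to_columns(rows) == lap_data)  # True
```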

_create_session_df

def _create_session_df(
    data: dict[str, Any],
    rename_map: dict[str, str],
    lib: str
) -> DataFrame
Create a DataFrame from session-level data (weather, race control messages, etc.).
Parameters:
  • data: Raw data dictionary with arrays
  • rename_map: Column rename mapping (e.g., WEATHER_RENAME_MAP or RACE_CONTROL_RENAME_MAP)
  • lib: DataFrame library to use (“pandas” or “polars”)
Returns:
  • DataFrame with renamed columns according to the provided rename map
Example:
from tif1.core_utils.constants import WEATHER_RENAME_MAP

weather_data = {
    "wT": [0, 60, 120],
    "wAT": [18.5, 18.7, 18.9],
    "wTT": [22.1, 22.3, 22.5]
}
df = _create_session_df(weather_data, WEATHER_RENAME_MAP, "pandas")
# Columns: Time, AirTemp, TrackTemp
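The renaming step can be pictured as a plain key substitution applied before the DataFrame is built. The sketch below uses a hypothetical three-entry map inferred from the example above; the real WEATHER_RENAME_MAP may contain more entries, and passing unmapped keys through unchanged is an assumption.

```python
from __future__ import annotations

from typing import Any

# Hypothetical subset, inferred from the example; not the real constant.
WEATHER_RENAME_SKETCH = {"wT": "Time", "wAT": "AirTemp", "wTT": "TrackTemp"}


def rename_columns_sketch(
    data: dict[str, Any], rename_map: dict[str, str]
) -> dict[str, Any]:
    """Rename keys found in rename_map; keep any other key as-is (assumption)."""
    return {rename_map.get(key, key): values for key, values in data.items()}


weather_data = {"wT": [0, 60, 120], "wAT": [18.5, 18.7, 18.9], "wTT": [22.1, 22.3, 22.5]}
renamed = rename_columns_sketch(weather_data, WEATHER_RENAME_SKETCH)
print(list(renamed))  # ['Time', 'AirTemp', 'TrackTemp']
```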

_process_lap_df

def _process_lap_df(
    lap_df: DataFrame,
    lib: str
) -> DataFrame
Post-process lap DataFrame by renaming columns, applying type coercion, and reordering columns.
Parameters:
  • lap_df: Raw lap DataFrame from _create_lap_df
  • lib: DataFrame library (“pandas” or “polars”)
Returns:
  • Processed DataFrame with:
    • Renamed columns (raw JSON keys → PascalCase)
    • Proper data types (timedelta for lap times, float64 for numeric, etc.)
    • Categorical types for Driver, Team, Compound, TrackStatus (pandas only by default)
    • FastF1-compatible column order
Transformations applied:
  1. Column renaming via LAP_RENAME_MAP (e.g., "lap" → "LapNumber", "time" → "LapTime")
  2. Type coercion:
    • LapTime: float seconds → timedelta64[ns]
    • Time, Sector1Time, etc.: float seconds → timedelta64[ns]
    • Numeric columns: → float64
    • Boolean columns: → bool
  3. Add LapTimeSeconds column (float representation of LapTime)
  4. Apply categorical types (pandas only, unless polars_lap_categorical config is enabled)
  5. Reorder columns to match FastF1 convention
Column order (FastF1-compatible):
[
    "Time", "Driver", "DriverNumber", "LapTime", "LapNumber",
    "Stint", "PitOutTime", "PitInTime",
    "Sector1Time", "Sector2Time", "Sector3Time",
    "Sector1SessionTime", "Sector2SessionTime", "Sector3SessionTime",
    "SpeedI1", "SpeedI2", "SpeedFL", "SpeedST",
    "IsPersonalBest", "Compound", "TyreLife", "FreshTyre",
    "Team", "LapStartTime", "LapStartDate",
    "TrackStatus", "Position", "Deleted", "DeletedReason",
    "FastF1Generated", "IsAccurate",
    "WeatherTime", "AirTemp", "Humidity", "Pressure", "Rainfall",
    "TrackTemp", "WindDirection", "WindSpeed",
    "LapTimeSeconds", "QualifyingSession"
]
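The five transformation steps listed for _process_lap_df can be approximated in pandas as shown below. This is a sketch under stated assumptions, not the library's code: RENAME_SKETCH and ORDER_SKETCH are small hypothetical subsets of the real rename map and column order, and process_lap_df_sketch is an illustrative name.

```python
import pandas as pd

# Hypothetical subsets of LAP_RENAME_MAP and the FastF1-compatible column order.
RENAME_SKETCH = {"lap": "LapNumber", "time": "LapTime", "s1": "Sector1Time", "compound": "Compound"}
ORDER_SKETCH = ["Driver", "LapTime", "LapNumber", "Sector1Time", "Compound", "Team", "LapTimeSeconds"]


def process_lap_df_sketch(raw: pd.DataFrame) -> pd.DataFrame:
    # 1. Rename raw JSON keys to PascalCase column names.
    df = raw.rename(columns=RENAME_SKETCH)
    # 3. Keep a float copy of the lap time before converting it to a timedelta.
    df["LapTimeSeconds"] = df["LapTime"].astype("float64")
    # 2. Coerce float seconds to timedeltas and lap numbers to float64.
    for col in ("LapTime", "Sector1Time"):
        df[col] = pd.to_timedelta(df[col], unit="s")
    df["LapNumber"] = df["LapNumber"].astype("float64")
    # 4. Apply categorical types to low-cardinality string columns.
    for col in ("Driver", "Team", "Compound"):
        df[col] = df[col].astype("category")
    # 5. Reorder, keeping only columns that are actually present.
    return df[[col for col in ORDER_SKETCH if col in df.columns]]


raw = pd.DataFrame({
    "lap": [1, 2],
    "time": [90.5, 89.2],
    "s1": [30.0, 29.5],
    "compound": ["SOFT", "SOFT"],
    "Driver": ["VER", "VER"],
    "Team": ["Red Bull Racing", "Red Bull Racing"],
})
processed = process_lap_df_sketch(raw)
print(list(processed.columns))
```

Computing LapTimeSeconds before the timedelta conversion avoids a round trip through timedelta arithmetic; whether the real function does the same is not stated in this page.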

Column naming conventions

The I/O pipeline transforms raw JSON keys to FastF1-compatible column names:
| JSON Key | DataFrame Column | Type | Description |
|---|---|---|---|
| lap | LapNumber | float64 | Lap number (1-indexed) |
| time | LapTime | timedelta64[ns] | Lap time |
| s1 | Sector1Time | timedelta64[ns] | Sector 1 time |
| s2 | Sector2Time | timedelta64[ns] | Sector 2 time |
| s3 | Sector3Time | timedelta64[ns] | Sector 3 time |
| compound | Compound | str/category | Tire compound (SOFT, MEDIUM, HARD, INTERMEDIATE, WET) |
| life | TyreLife | float64 | Tire age in laps |
| stint | Stint | float64 | Stint number |
| pb | IsPersonalBest | bool | Personal best lap flag |
| vi1 | SpeedI1 | float64 | Speed trap 1 (km/h) |
| vi2 | SpeedI2 | float64 | Speed trap 2 (km/h) |
| vfl | SpeedFL | float64 | Finish line speed (km/h) |
| vst | SpeedST | float64 | Speed trap (km/h) |
| status | TrackStatus | str/category | Track status code |
| pos | Position | float64 | Position at lap end |
| del | Deleted | boolean | Lap deleted flag |
| delR | DeletedReason | str | Reason for deletion |
| ff1G | FastF1Generated | bool | FastF1 generated data flag |
| sesT | Time | timedelta64[ns] | Session time at lap end |
| dNum | DriverNumber | str | Driver number |
| pout | PitOutTime | timedelta64[ns] | Pit out time |
| pin | PitInTime | timedelta64[ns] | Pit in time |
The complete mapping is defined in LAP_RENAME_MAP in src/tif1/core_utils/constants.py. Both validated (snake_case) and raw (abbreviated) JSON keys are supported.

Library Support

The pipeline supports both pandas and polars libraries:
# Create DataFrame with pandas lib
lap_data = {"lap": [1, 2], "time": [90.5, 89.2]}
df_pandas = _create_lap_df(lap_data, "VER", "Red Bull Racing", "pandas")
processed = _process_lap_df(df_pandas, "pandas")

# Create DataFrame with polars lib
df_polars = _create_lap_df(lap_data, "VER", "Red Bull Racing", "polars")
processed = _process_lap_df(df_polars, "polars")
Library-specific optimizations:
  • pandas: Uses pd.DataFrame(data, copy=False) to avoid copying input data where possible
  • polars: Uses pl.DataFrame(data, strict=False) with schema inference
  • pandas: Applies categorical types by default for Driver, Team, Compound, TrackStatus
  • polars: Categorical types disabled by default (enable with polars_lap_categorical config)

Data Validation

When validate_data is enabled in config, _validate_json_payload validates raw JSON using Pydantic schemas:
  1. Required fields: Ensures all required fields are present in JSON
  2. Type checking: Validates data types match schema definitions
  3. Value ranges: Checks values are within expected ranges
  4. Referential integrity: Validates driver codes, lap numbers, etc.
Example validation error:
from tif1 import InvalidDataError

try:
    validated = _validate_json_payload("laps/VER", invalid_data)
except InvalidDataError as e:
    print(e)
    # InvalidDataError: Invalid data at laps/VER
    #   - Missing required field: lap
    #   - Invalid type for time: expected float, got str
Validation is controlled by the validate_data config option. When disabled, raw JSON is passed through without validation for maximum performance.
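The on/off behavior can be sketched without Pydantic. The example below swaps in plain Python checks for the Pydantic schemas, and every name in it (InvalidDataErrorSketch, REQUIRED_FIELDS_SKETCH, validate_payload_sketch, the explicit validate_data argument) is hypothetical; the real function reads the flag from the global config.

```python
from __future__ import annotations

from typing import Any


class InvalidDataErrorSketch(Exception):
    """Stand-in for tif1's InvalidDataError."""


# Hypothetical required fields; the real schemas define their own.
REQUIRED_FIELDS_SKETCH = ("lap", "time")


def validate_payload_sketch(
    path: str, data: dict[str, Any], validate_data: bool
) -> dict[str, Any]:
    if not validate_data:
        # Validation disabled: pass the payload through untouched.
        return data
    problems = [
        f"Missing required field: {name}"
        for name in REQUIRED_FIELDS_SKETCH
        if name not in data
    ]
    times = data.get("time", [])
    if any(t is not None and not isinstance(t, (int, float)) for t in times):
        problems.append("Invalid type for time: expected float")
    if problems:
        raise InvalidDataErrorSketch(f"Invalid data at {path}: " + "; ".join(problems))
    return data


bad_payload = {"time": ["fast"]}
print(validate_payload_sketch("laps/VER", bad_payload, validate_data=False) is bad_payload)  # True
try:
    validate_payload_sketch("laps/VER", bad_payload, validate_data=True)
except InvalidDataErrorSketch as exc:
    print(exc)
```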

Performance Considerations

The I/O pipeline is heavily optimized for speed:
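  • Low-copy construction: Uses copy=False in pandas to avoid unnecessary copies; polars uses strict=False so that values not matching the inferred type are cast or set to null instead of raising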
  • Batch processing: Processes all laps at once, not row-by-row
  • Vectorized operations: Uses numpy/pandas vectorization for type coercion
  • Minimal allocations: Reuses arrays where possible, avoids intermediate copies
  • Lazy categorical: Categorical types applied only when beneficial
Typical performance (pandas lib):
  • Process 50 laps: ~2-5ms
  • Process 1000 laps: ~20-40ms
  • Full session (20 drivers × 50 laps): ~100-200ms
For maximum performance, disable validation (validate_data=False) and use pandas. Polars is faster for very large datasets (>10k laps) but has higher overhead for small datasets.

Internal Implementation

The pipeline maintains two sets of column names:
  • JSON keys: Abbreviated keys like "lap", "s1", "vi1" (raw) or snake_case like "lap_number", "sector_1_time" (validated)
  • DataFrame columns: PascalCase like "LapNumber", "Sector1Time", "SpeedI1"
Renaming happens in _process_lap_df() using LAP_RENAME_MAP from core_utils/constants.py. The map supports both raw and validated JSON keys for maximum compatibility.
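One way to picture a map that accepts both key styles is to list each target column under both its raw and its validated key. The entries below are a hypothetical excerpt built only from keys named in this section, not the contents of the real LAP_RENAME_MAP.

```python
# Hypothetical excerpt: raw and validated keys point at the same column name.
DUAL_KEY_MAP_SKETCH = {
    "lap": "LapNumber",
    "lap_number": "LapNumber",
    "s1": "Sector1Time",
    "sector_1_time": "Sector1Time",
}

raw_payload = {"lap": [1, 2], "s1": [30.1, 29.8]}
validated_payload = {"lap_number": [1, 2], "sector_1_time": [30.1, 29.8]}


def to_columns(payload: dict) -> dict:
    """Apply the map; both payload styles end up with identical column names."""
    return {DUAL_KEY_MAP_SKETCH.get(key, key): values for key, values in payload.items()}


print(to_columns(raw_payload) == to_columns(validated_payload))  # True
```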
The pipeline coerces types to ensure FastF1 compatibility:
  • Lap times (float seconds) → timedelta64[ns]
  • Session times (float seconds) → timedelta64[ns]
  • Lap numbers → float64 (not int, to allow NaN)
  • Boolean flags → bool (fillna False for non-nullable)
  • Deleted flag → boolean (nullable bool)
  • Categorical data → category (pandas only by default)
  • Driver numbers → str (not int, to preserve leading zeros)
Missing values are handled gracefully:
  • Numeric fields: NaN (pandas) or null (polars)
  • String fields: empty string or null
  • Boolean fields: False (fillna applied)
  • Deleted field: null (nullable boolean)
  • Timedelta fields: NaT (not-a-time)
The pipeline never raises errors for missing optional fields. Only validation (when enabled) can raise InvalidDataError for missing required fields.
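The coercion and missing-value rules above follow from standard pandas behavior, which the short demonstration below shows. It uses only public pandas calls and illustrates the dtype choices; it is not an excerpt of the pipeline's code.

```python
import pandas as pd

# Integers with a gap fall back to float64, which is why lap numbers are float64.
lap_numbers = pd.Series([1, 2, None])
print(lap_numbers.dtype)  # float64

# The nullable "boolean" dtype can hold a missing value (used for Deleted).
deleted = pd.Series([True, None, False], dtype="boolean")
print(deleted.isna().tolist())  # [False, True, False]

# Plain bool cannot hold a missing value, so gaps are filled with False first.
personal_best = pd.Series([True, None], dtype="boolean").fillna(False).astype(bool)
print(personal_best.tolist())  # [True, False]

# Missing float seconds become NaT after the timedelta conversion.
lap_times = pd.to_timedelta(pd.Series([90.5, float("nan")]), unit="s")
print(lap_times.isna().tolist())  # [False, True]
```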
_create_lap_df normalizes mismatched array lengths (required in Python 3.12+):
  • Calculates max length across all arrays
  • Pads short arrays with None values
  • Replicates scalar values to match max length
This ensures both pandas and polars can construct DataFrames without errors.
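The three normalization steps can be sketched in plain Python as follows. normalize_lengths_sketch is a hypothetical helper, and treating only lists and tuples as arrays (with everything else as a scalar) is an assumption of this sketch.

```python
from __future__ import annotations

from typing import Any


def normalize_lengths_sketch(data: dict[str, Any]) -> dict[str, list[Any]]:
    """Pad short arrays with None and replicate scalars to the longest array length."""
    arrays = {k: list(v) for k, v in data.items() if isinstance(v, (list, tuple))}
    # Step 1: find the longest array (0 if the payload holds only scalars).
    max_len = max((len(v) for v in arrays.values()), default=0)
    normalized: dict[str, list[Any]] = {}
    for key, value in data.items():
        if key in arrays:
            # Step 2: pad short arrays with None.
            column = arrays[key]
            normalized[key] = column + [None] * (max_len - len(column))
        else:
            # Step 3: replicate scalars to the common length.
            normalized[key] = [value] * max_len
    return normalized


ragged = {"lap": [1, 2, 3], "time": [90.5], "compound": "SOFT"}
print(normalize_lengths_sketch(ragged))
# {'lap': [1, 2, 3], 'time': [90.5, None, None], 'compound': ['SOFT', 'SOFT', 'SOFT']}
```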
Last modified on March 5, 2026