Part 5: Hard Real-Time Edge AI for Automotive Inspection: Designing the Inference and Control-Plane Split

Inline pass/fail decisions stay resident on the edge - deterministic, PLC-integrated, and cloud-independent. The cloud handles model lifecycle, rollout orchestration, and governance asynchronously, never touching the control path. The article compares edge-only and edge-plus-cloud hybrid patterns across their operational consequences, covering latency budgets, PLC integration, CI/CD with hardware-in-the-loop validation, canary rollout, typed rollback triggers, and resilient OTA updates.

Operating Edge AI at Scale - Versioning, Shadow Mode, Monitoring, and Drift

Deploying a model to a production line is not the finish line; it is the start of a continuous operational cycle of monitoring, validation, and controlled evolution. In automotive quality control, this cycle carries real stakes: a model that silently degrades allows defects to escape into the supply chain; a model update that introduces over-rejection generates scrap, throughput loss, and operator distrust.

Part 5 closes the series with the full operational layer.

We cover a two-layer model registry that separates source models from compiled device-class artifacts; shadow and canary deployment patterns that let you test new models on live production traffic with zero risk to line decisions; five specific, quantitative rollback triggers; and a three-layer monitoring framework that spans device health, inference-path performance, and model drift detection.

This is the operational architecture that makes edge AI sustainable at fleet scale, not just on day one.

Model versioning, shadow mode, and rollback

In automotive, a flawed model can either leak defects (safety/brand risk) or cause over-rejection (throughput and cost risk), so you must combine strong versioning with risk-free deployment patterns.

Model registry and metadata

Use a central registry (custom or via ML platform) that stores a two-layer artifact structure — separating the source model from its compiled, device-specific deployment artifacts. Conflating these two layers is one of the most common causes of deployment failures and traceability gaps in production edge AI programmes: a source model and its compiled edge artifact are different objects with different versioning, compatibility, and lifecycle concerns, and the registry must treat them as such.

Layer 1 — Source model record

The source model record captures everything about the model as produced by the training pipeline, independent of any deployment target:

  • Model identifier and lineage: A unique identifier (e.g., defect-detector-v18) with an explicit parent version pointer (defect-detector-v17) enabling full lineage traversal. Every model must be traceable back to its origin training run.
  • Training provenance: Training dataset snapshot reference (immutable dataset version ID, not a mutable path), labeling run ID, annotation schema version, and the specific data split used for training vs. validation vs. test. Without immutable dataset references, reproducing a model from its registry entry is impossible.
  • Evaluation metrics per product family: Precision, recall, false-reject rate, and F1 per defect class, broken down by product family and line variant. A model that performs well on BIW weld inspection may degrade significantly on a different part geometry; metrics must be scoped accordingly.
  • Source artifact: The canonical exported source format: typically ONNX with explicit opset version recorded (e.g., defect-detector-v18_opset17.onnx). This is the input to all downstream compilation steps and must be stored immutably — never overwritten.
  • Deployment status: Per site/line state machine: staged → shadow → canary → production → rolled-back, with timestamps and operator IDs for each transition.
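The per-site/line state machine above lends itself to a small guard in the registry service. A minimal sketch, assuming an in-memory record with a history list — the `ALLOWED` table, field names, and operator IDs are illustrative, not a specific registry API:

```python
from datetime import datetime, timezone

# Allowed transitions for the per-site/line deployment state machine.
# The exact edges (e.g., shadow -> staged for a demotion) are assumptions.
ALLOWED = {
    "staged": {"shadow"},
    "shadow": {"canary", "staged"},
    "canary": {"production", "rolled-back"},
    "production": {"rolled-back"},
    "rolled-back": {"staged"},
}

def transition(record: dict, new_state: str, operator_id: str) -> dict:
    """Validate and apply a deployment-status transition, keeping an audit trail."""
    current = record["state"]
    if new_state not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {new_state}")
    record["state"] = new_state
    record["history"].append({
        "from": current,
        "to": new_state,
        "at": datetime.now(timezone.utc).isoformat(),
        "operator": operator_id,
    })
    return record

rec = {"model": "defect-detector-v18", "site": "plant-a/line-3",
       "state": "staged", "history": []}
transition(rec, "shadow", "op-117")
transition(rec, "canary", "op-117")
```

Rejecting illegal transitions at write time is what makes the timestamps and operator IDs trustworthy as an audit trail.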

Layer 2 — Compiled edge artifact record (one per target device class)

For each source model, the registry maintains one or more compiled edge artifact records, one per supported target device class.

These are distinct registry entries, not attachments to the source model record, because they have independent versioning, compatibility constraints, and deployment lifecycles:

  • Compiled artifact: The device-class-specific runtime artifact produced from the source model:
    • TensorRT serialised engine (.engine) for NVIDIA Jetson or x86 GPU targets — built with explicit TRT version, CUDA version, cuDNN version, and GPU architecture flags (e.g., sm_87 for Jetson Orin).
    • OpenVINO IR (.xml + .bin) for Intel IPC targets converted with explicit OpenVINO version and target hardware flag (e.g., CPU, GPU, MYRIAD).
    • ONNX Runtime pre-optimised session cache for targets where TRT/OpenVINO are not used.
  • Calibration profile: For INT8 quantised artifacts, the calibration dataset reference, calibration method (e.g., entropy, percentile, minmax), and per-layer scale/zero-point tensors must be stored alongside the compiled artifact. A TensorRT INT8 engine without its calibration provenance cannot be reproduced, and its accuracy characteristics cannot be audited or re-validated after a hardware change.
  • Supported accelerator class: Explicit hardware target specification: device family (e.g., Jetson Orin NX 8GB), GPU architecture (e.g., Ampere, sm_87), and JetPack/driver version it was compiled for. A TRT engine compiled for sm_87 will not run on sm_80 (Jetson AGX Xavier); the registry must prevent cross-class deployment at the artifact resolution step.
  • Runtime compatibility manifest: A machine-readable compatibility record specifying the exact runtime stack the compiled artifact requires:
# registry/artifacts/defect-detector-v18/jetson-orin-nx-8gb/compatibility.yaml
source_model: defect-detector-v18_opset17.onnx
source_model_sha256: a3f1c8...
compiled_artifact: defect-detector-v18_orin-nx_trt8.6_cuda12.0.engine
artifact_sha256: b7e2d4...
accelerator_class: jetson-orin-nx-8gb
gpu_architecture: sm_87
runtime:
  tensorrt: "8.6.1"
  cuda: "12.0"
  cudnn: "8.9"
  ort: "1.17.3"        # if using TRT EP via ORT
  jetpack: "6.0"
  driver_min: "535.86" # minimum compatible NVIDIA driver
quantization:
  precision: int8
  calibration_ref: cal-dataset-bih-weld-v3
  calibration_sha256: c9a1f2...
validation:
  hil_rig: lab-jetson-orin-nx-01
  hil_run_id: hil-2026-03-27-0412
  p99_latency_ms: 38.4
  gpu_mem_peak_mb: 2841
  passed: true

  • Deployment status per site/line: Independent of the source model's deployment status. A source model may be in production on Line A while its compiled artifact for a new device class is still in staged on Line B. The two statuses must be tracked separately.
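At artifact-resolution time, the registry can enforce the cross-class guard by failing closed on any device-class or architecture mismatch. A minimal sketch under assumed names — `ARTIFACTS` and `resolve_artifact` are illustrative; a real registry would query its own store:

```python
# Hypothetical in-memory registry index: (model, device_class) -> artifact record.
ARTIFACTS = {
    ("defect-detector-v18", "jetson-orin-nx-8gb"): {
        "path": "defect-detector-v18_orin-nx_trt8.6_cuda12.0.engine",
        "gpu_architecture": "sm_87",
    },
}

def resolve_artifact(model: str, node_device_class: str, node_arch: str) -> dict:
    """Resolve a compiled artifact for a node; fail closed on any mismatch."""
    record = ARTIFACTS.get((model, node_device_class))
    if record is None:
        raise LookupError(f"no compiled artifact of {model} for {node_device_class}")
    if record["gpu_architecture"] != node_arch:
        # e.g. an sm_87 engine must never be scheduled onto an sm_80 node
        raise LookupError(
            f"architecture mismatch: {record['gpu_architecture']} != {node_arch}"
        )
    return record
```

The point of the two-key lookup is that a source model name alone is never deployable; only the (model, device class) pair resolves to something a node can run.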

Why this two-layer structure matters in practice: the source model record carries the accuracy and provenance story, while each compiled artifact record carries the hardware-compatibility and runtime story; promoting, rolling back, or re-compiling one never silently mutates the other.

Shadow deployment on a live line

Shadow deployment lets you test a new model on real production traffic with zero effect on line decisions. The active model continues to drive all PLC actuation; the candidate model runs in parallel on the same input frames and its predictions are logged for offline analysis.

No candidate prediction ever reaches the IO adapter or the PLC during shadow mode — this is a non-negotiable invariant.

The three operational steps are:

  • Mirror the input — At inference time, the same preprocessed frame (already assigned its FrameContext correlation ID from acquisition) is passed to both the active session and the candidate session. Both run on the local GPU; the additional inference cost is the candidate model's latency, which must fit within the station's available compute headroom without pushing the active model's p99 latency over its actuation budget.
  • Log both predictions — Every frame produces a shadow log entry pairing the active and candidate scores, labels, and latencies under the shared frame ID. This log is the primary artefact of the shadow run.
  • Analyse before promotion — The shadow log is reviewed for score divergence, label disagreement rate, tail latency behaviour, and defect-class-specific miss patterns before the candidate is eligible for canary promotion.

What to log in shadow mode — and what is realistic to retain

Shadow logging in a vision-inspection environment requires explicit decisions about what data is captured and how long it is kept. The instinct to log everything — including raw frames — is understandable but almost always impractical:

  • At 30 fps with a 1080p colour camera, uncompressed frames consume approximately 180 MB/s. Even compressed at 10:1, retaining all frames for a 10-minute shadow run requires ~10 GB of local storage per run. At 100 parts per minute across an 8-hour shift, full-frame retention requires ~290 GB per day — far beyond the typical local disk capacity of an edge node, and expensive to transfer to the cloud.

The practical approach is a tiered logging strategy that retains everything you need for analysis and nothing you do not:

  • Tier 1 — Score records: frame ID, both models' scores, labels, and latencies for every frame. Kilobytes per frame; always retained.
  • Tier 2 — Compressed frames, retained only on label disagreement or when either score falls in the uncertainty band (e.g., 0.40–0.60).
  • Tier 3 — Full-frame capture, disabled by default and enabled only for short, targeted bursts.

This tiered approach ensures that the frames most valuable for shadow analysis — disagreements and boundary cases — are always retained, while the high-volume, high-agreement frames that carry little analytical value are discarded after their score record is written.
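The storage figures above follow directly from frame geometry, and it is worth keeping the arithmetic executable so it can be re-run when the camera or compression settings change. A small sketch, assuming 1080p RGB at 30 fps and a 10:1 compression ratio (both stated in the text):

```python
BYTES_PER_FRAME = 1920 * 1080 * 3   # uncompressed 1080p RGB, one byte per channel
FPS = 30

def raw_rate_mb_s() -> float:
    """Uncompressed acquisition rate in MB/s."""
    return BYTES_PER_FRAME * FPS / 1e6

def run_storage_gb(minutes: float, compression_ratio: float = 10.0) -> float:
    """Storage needed to retain every frame of a shadow run, compressed."""
    frames = FPS * minutes * 60
    return frames * BYTES_PER_FRAME / compression_ratio / 1e9
```

`raw_rate_mb_s()` lands near the ~180 MB/s figure, and `run_storage_gb(10)` near the ~10 GB per 10-minute run; swapping in a 4K sensor or a different codec ratio updates the budget immediately.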

Configuration — complete and self-contained

Rather than presenting a partial YAML snippet that assumes undefined state, the shadow configuration is best expressed as a complete, validated config file that the inference service loads at startup and the deployment orchestrator updates during mode transitions:

# /etc/auto-qc/model-config.yaml
# Managed by deployment orchestrator — do not edit manually.
# Validated against schema at service startup; invalid config prevents startup.

active:
  model_name: "defect-detector-v17"
  artifact_path: "/opt/models/defect-detector-v17_orin-nx_trt8.6.engine"
  threshold: 0.52
  device_class: "jetson-orin-nx-8gb"

# Candidate is only loaded and run if mode is 'shadow' or 'candidate_active'.
# If artifact_path does not exist at startup, service starts in active_only mode
# and logs a warning — it does not fail to start.
candidate:
  model_name: "defect-detector-v18"
  artifact_path: "/opt/models/defect-detector-v18_orin-nx_trt8.6.engine"
  threshold: 0.52
  device_class: "jetson-orin-nx-8gb"

# active_only      — candidate session not loaded; no shadow logging.
# shadow           — both sessions loaded; active drives PLC; candidate logged only.
# candidate_active — candidate drives PLC; active session retained for rollback.
# Transition to candidate_active requires explicit orchestrator command;
# the inference service cannot self-promote.
mode: "shadow"

shadow_logging:
  score_log: true            # always on; negligible cost
  divergence_frames: true    # retain compressed frame on label disagreement
  boundary_band_low: 0.40    # retain frame if either score falls in [0.40, 0.60]
  boundary_band_high: 0.60
  full_frame_capture: false  # disabled by default; enable only for burst capture
  retention_days: 30
  max_disk_mb: 4096          # shadow log evicts oldest entries if limit reached

 

Implementation — complete and self-consistent

The following implementation is complete and handles all mode and session state combinations explicitly; there are no assumed globals or undefined fallback paths:

# shadow_inference.py
import time
from dataclasses import dataclass
from typing import Optional

import numpy as np
import onnxruntime as ort

from frame_context import FrameContext
from shadow_logger import ShadowLogger


@dataclass
class ModelHandle:
    model_name: str
    session: ort.InferenceSession
    input_name: str
    output_name: str
    threshold: float


@dataclass
class ShadowConfig:
    mode: str                         # active_only | shadow | candidate_active
    active: ModelHandle
    candidate: Optional[ModelHandle]  # None when mode is active_only
    score_log: bool = True
    divergence_frames: bool = True
    boundary_band: tuple[float, float] = (0.40, 0.60)
    full_frame_capture: bool = False


@dataclass
class InferenceResult:
    label: str
    score: float
    latency_ms: float
    model_name: str


def _run_session(handle: ModelHandle, frame: np.ndarray) -> tuple[float, float]:
    """Run a single session; return (score, latency_ms)."""
    start = time.perf_counter()
    outputs = handle.session.run(
        [handle.output_name], {handle.input_name: frame}
    )
    latency_ms = (time.perf_counter() - start) * 1000.0
    # ORT returns a list of output arrays; the model is assumed to emit a
    # single scalar defect score as its first output.
    return float(np.asarray(outputs[0]).reshape(-1)[0]), latency_ms


def infer(
    frame: np.ndarray,
    ctx: FrameContext,
    config: ShadowConfig,
    logger: ShadowLogger,
) -> InferenceResult:
    """
    Run inference according to the current shadow config.
    Only the active model drives the return value in shadow mode.
    Candidate results are logged but never returned to the caller.
    """
    active_score, active_latency = _run_session(config.active, frame)
    active_label = "defective" if active_score >= config.active.threshold else "ok"

    # --- Shadow / candidate path ---
    if config.mode in ("shadow", "candidate_active") and config.candidate is not None:
        cand_score, cand_latency = _run_session(config.candidate, frame)
        cand_label = (
            "defective" if cand_score >= config.candidate.threshold else "ok"
        )

        # Determine whether to retain the frame image alongside the score log.
        retain_frame = False
        if config.divergence_frames and active_label != cand_label:
            retain_frame = True

        # Retain boundary cases: either score inside the uncertainty band.
        lo, hi = config.boundary_band
        if lo <= active_score <= hi or lo <= cand_score <= hi:
            retain_frame = True

        if config.full_frame_capture:
            retain_frame = True

        logger.log_shadow_pair(
            frame_ctx=ctx,
            active_score=active_score,
            active_label=active_label,
            active_latency_ms=active_latency,
            active_model=config.active.model_name,
            cand_score=cand_score,
            cand_label=cand_label,
            cand_latency_ms=cand_latency,
            cand_model=config.candidate.model_name,
            retain_frame=retain_frame,
            frame=frame if retain_frame else None,
        )

        # --- Candidate drives the return value only in candidate_active mode ---
        # The transition to candidate_active is set by the orchestrator via
        # config reload (/reload-config endpoint); the inference service cannot
        # self-promote. This prevents accidental promotion during a shadow run.
        if config.mode == "candidate_active":
            return InferenceResult(
                label=cand_label,
                score=cand_score,
                latency_ms=cand_latency,
                model_name=config.candidate.model_name,
            )

    return InferenceResult(
        label=active_label,
        score=active_score,
        latency_ms=active_latency,
        model_name=config.active.model_name,
    )
 
Canary and automated rollback

When shadow-mode analysis confirms that the candidate model meets the promotion criteria (acceptable divergence rate, no regression on critical defect classes, latency within budget on the HIL rig), promote it to canary.

Canary promotion routes a bounded subset of the fleet (e.g., one station on one line, or one shift's worth of production) to the candidate model while the remainder continues on the active model.  

The canary window is a live production trial under controlled exposure: the candidate drives real PLC actuation on real parts, and its behaviour is measured against five distinct rollback trigger categories, each of which has different detection mechanisms, response times, and risk profiles.

Canary scope and duration

Scope the canary to the smallest unit of the fleet that gives statistically meaningful volume: typically one station running one product family for a minimum of one full shift (8 hours) or a defined part count (e.g., 10,000 inspected parts), whichever comes first.

Extend the canary window if production volume is low or if the product mix during the canary period does not represent the full variant range the model will encounter in production.
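The "whichever comes first" rule is easy to get wrong in ad-hoc scripts, so it is worth encoding explicitly. A minimal sketch — the `CanaryWindow` name and default bounds mirror the text above; nothing here is a specific orchestrator API:

```python
from dataclasses import dataclass

@dataclass
class CanaryWindow:
    """Canary completes on whichever bound is reached first."""
    min_parts: int = 10_000       # defined part count
    min_shift_hours: float = 8.0  # one full shift
    parts_inspected: int = 0
    hours_elapsed: float = 0.0

    def complete(self) -> bool:
        return (self.parts_inspected >= self.min_parts
                or self.hours_elapsed >= self.min_shift_hours)
```

Extending the window for low volume or an unrepresentative product mix then amounts to raising `min_parts` or `min_shift_hours` before the run starts.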

The five rollback trigger categories

Generic "SLO breach" language is not precise enough to act on in an automotive inspection context. Each failure mode has a different detection signal, a different urgency, and a different appropriate response. Treat these as five separate monitoring channels, each with its own alert threshold and rollback policy:

1. Latency breach

What it is: The candidate model's p99 end-to-end inference latency — measured from frame acquisition to PLC write — exceeds the station's documented actuation budget.

Why it is distinct: A model that is more accurate but slower may cause missed rejections not because of wrong predictions but because the decision arrives after the part has passed the actuator. Latency degradation can also be gradual — the model performs within budget on a cold device but drifts over budget as the GPU thermals rise during a full shift.

Detection: Prometheus histogram on p99_latency_ms per station, measured continuously during the canary window. Compare against the HIL-validated p99 baseline recorded at promotion time.

Rollback threshold and policy:

latency_breach: 
  trigger: p99_latency_ms > actuation_budget_ms  # station-specific, from design doc 
  sustained_window: 60s       # breach must persist for 60s to exclude transient spikes 
  immediate_trigger: p99_latency_ms > actuation_budget_ms * 1.5  # hard ceiling — instant rollback 
  action: rollback_to_active 
  notify: ops_team, quality_team 
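The sustained-window and hard-ceiling logic in the policy above can be sketched against a rolling series of p99 scrapes. Function and parameter names are illustrative; a real deployment would express this as Prometheus alert rules rather than application code:

```python
def latency_rollback(samples, budget_ms, sustained_s=60, period_s=5):
    """
    samples: most recent p99 readings, one per scrape period (period_s seconds),
    oldest first. Returns "immediate", "sustained", or None.
    """
    # Hard ceiling: any single reading over 1.5x budget triggers instant rollback.
    if any(s > budget_ms * 1.5 for s in samples):
        return "immediate"
    # Sustained breach: every scrape in the last sustained_s seconds over budget.
    need = sustained_s // period_s
    recent = samples[-need:]
    if len(recent) == need and all(s > budget_ms for s in recent):
        return "sustained"
    return None
```

For example, twelve consecutive 5-second scrapes over an 80 ms budget satisfy the 60-second sustained window, while a single transient spike does not.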

 
2. Confidence-distribution anomaly

What it is: The candidate model's output score distribution shifts significantly from its shadow-mode baseline: scores cluster near 0 or 1 when they previously spread across the distribution, or the mean score drifts upward or downward without a corresponding change in ground-truth defect rate.

Why it is distinct: Confidence-distribution shifts often precede visible accuracy degradation; they are an early warning that the model is encountering inputs outside its training distribution (e.g., a lighting change, a fixture adjustment, or a new part variant introduced without retraining). Acting on distribution shifts before they manifest as missed defects or over-rejection is the difference between a proactive canary and a reactive incident.

Detection: Track the rolling score distribution histogram (p10, p25, p50, p75, p90) per defect class during the canary window. Compare against the shadow-mode baseline distribution using a statistical distance metric (e.g., KL divergence or Population Stability Index). A PSI > 0.2 on any defect class is conventionally treated as a significant distribution shift requiring investigation.
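PSI itself is a short computation over two histograms with identical bins. A minimal sketch — the epsilon guard against empty bins is an implementation choice, not part of the PSI definition:

```python
import math

def psi(baseline_counts, current_counts, eps=1e-6):
    """
    Population Stability Index between two score histograms with identical bins.
    PSI > 0.2 is conventionally treated as a significant distribution shift.
    """
    b_total = sum(baseline_counts)
    c_total = sum(current_counts)
    total = 0.0
    for b, c in zip(baseline_counts, current_counts):
        p = max(b / b_total, eps)   # baseline bin proportion, clamped
        q = max(c / c_total, eps)   # current bin proportion, clamped
        total += (q - p) * math.log(q / p)
    return total
```

Identical distributions score 0; a distribution whose mass has migrated to the opposite end of the score range scores well above the 0.2 rollback threshold.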

Rollback threshold and policy:

confidence_distribution_anomaly: 
  metric: population_stability_index   # computed per defect class, rolling 30-min window 
  warning_threshold: 0.10              # flag for investigation; do not yet roll back 
  rollback_threshold: 0.20             # significant shift; suspend canary, revert to active 
  action: suspend_canary_and_investigate 
  notify: ml_team, quality_team 
  # NOTE: PSI breach triggers investigation, not blind rollback; the shift may indicate 
  # a genuine process change (e.g., new part batch) rather than model failure. 
  # Quality team must adjudicate before full rollback or canary continuation. 

 
3. Over-rejection

What it is: The candidate model rejects a significantly higher proportion of parts than the active model on the same product family, without a corresponding confirmed increase in actual defect rate.

Why it is distinct: Over-rejection has direct, measurable commercial and operational impact: scrap cost, rework cost, line throughput reduction, and operator confidence erosion. In automotive programmes, a sudden increase in reject rate is immediately visible to production supervisors and will generate pressure to override or disable the inspection system — making over-rejection a threat not just to quality but to the long-term viability of the AI inspection programme.

Detection: Track reject_rate_pct per station per product family in a rolling window during the canary window. Compare against the active model's reject rate on the same product family over the preceding 5 shifts (the rolling baseline). A reject rate increase beyond the threshold that cannot be explained by a confirmed upstream process change triggers rollback.
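The dual relative/absolute threshold can be sketched as follows; names and defaults mirror the YAML policy below, and this is illustrative rather than a specific monitoring API:

```python
def over_rejection_breach(candidate_rate_pct, baseline_rate_pct,
                          rel_threshold=0.15, abs_threshold_pts=3.0):
    """
    True if the candidate reject rate breaches either the relative (+15%)
    or the absolute (+3 percentage points) threshold vs the rolling baseline.
    """
    relative = candidate_rate_pct > baseline_rate_pct * (1 + rel_threshold)
    absolute = candidate_rate_pct - baseline_rate_pct > abs_threshold_pts
    return relative or absolute
```

Both bounds are needed: the relative bound catches regressions on low-reject-rate families where a few percentage points would be an enormous relative jump would go unnoticed by an absolute check, and the absolute bound catches large jumps on high-reject-rate families where +15% relative is a big scrap bill.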

Rollback threshold and policy:

over_rejection: 
  metric: reject_rate_pct 
  baseline_window: 5_shifts           # rolling baseline from active model 
  rollback_threshold_relative: +15%   # candidate reject rate > baseline + 15% 
  rollback_threshold_absolute: +3pct  # or absolute increase > 3 percentage points 
  sustained_window: 30min             # must persist for 30 minutes to exclude shift-start variation 
  action: rollback_to_active 
  notify: ops_team, quality_team, production_supervisor 
  # NOTE: Before rollback executes, system checks whether a confirmed upstream 
  # process change (e.g., new material batch, tooling change) was logged in the 
  # MES during the canary window. If yes, alert is escalated for human adjudication 
  # rather than automatic rollback. 

 
4. Defect leakage

What it is: The candidate model misses defects that the active model would have caught — confirmed by downstream quality events: re-inspection station escapes, end-of-line measurement failures, or customer-reported field escapes traceable to parts inspected during the canary window.

Why it is distinct: Defect leakage is the highest-severity rollback trigger in an automotive context: it represents parts with confirmed defects that passed inspection and entered the supply chain or reached end customers. Unlike over-rejection, which is a cost and throughput problem, defect leakage is a safety, warranty, and regulatory compliance problem. It must trigger an immediate rollback with no sustained-window grace period, and the incident must be escalated to quality engineering regardless of the decision on the model.

Detection: Requires a feedback loop from downstream quality gates back to the inspection system, typically via the MES or a dedicated quality event bus. Parts are tracked by carrier ID or part serial number; a downstream escape event is joined to the canary window inspection log via the frame correlation ID established at acquisition.
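The join itself is a straightforward lookup once part IDs are shared across systems. A minimal sketch, assuming the canary inspection log is keyed by part ID; all names here are illustrative:

```python
def confirmed_escapes(escape_events, canary_log, candidate_model):
    """
    Join downstream escape events to the canary inspection log.
    escape_events: iterable of dicts carrying a part serial / carrier "part_id".
    canary_log: dict mapping part ID -> inspection record (frame ID, model name).
    Returns only the escapes attributable to the candidate model.
    """
    confirmed = []
    for event in escape_events:
        record = canary_log.get(event["part_id"])
        if record is not None and record["model"] == candidate_model:
            # Attach the frame correlation ID for the incident review.
            confirmed.append({**event, "frame_id": record["frame_id"]})
    return confirmed
```

A non-empty result is what the zero-tolerance policy below acts on; escapes joined to the active model, or to parts outside the canary window, do not count against the candidate.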

Rollback threshold and policy:

defect_leakage: 
  trigger: confirmed_escape_count >= 1  # zero tolerance — any confirmed escape triggers immediate rollback 
  sustained_window: none                # immediate — no grace period 
  action: immediate_rollback_to_active 
  notify: ml_team, quality_team, production_supervisor, quality_manager 
  post_rollback: mandatory_incident_review 
  # NOTE: Confirmed escape = downstream quality event joined to a frame inspected 
  # by the candidate model during the canary window. Unconfirmed escapes (suspect 
  # but not yet verified) trigger a canary suspension pending investigation. 

5. System-health failure

What it is: The edge node running the candidate model exhibits infrastructure-level degradation (GPU memory exhaustion, thermal throttling, process crashes, watchdog timeouts, or disk saturation) that is not present on nodes running the active model.

Why it is distinct: System-health failures indicate that the candidate model or its serving configuration is incompatible with the production hardware environment in a way that was not caught by HIL validation: for example, a memory leak in the candidate's session management, higher sustained GPU memory consumption that causes OOM under thermal load, or a larger model footprint that causes disk pressure on nodes with smaller SSDs.

Detection: Prometheus gauges on GPU utilisation, GPU memory, CPU utilisation, thermal zone temperature, process restart count, and disk utilisation. Compare canary nodes against active-model nodes of the same device class during the same time window.

Rollback threshold and policy:

system_health_failure: 
  triggers: 
    - metric: gpu_memory_used_mb 
      threshold: 90%_of_total          # or absolute: 7372 MB on 8GB device 
      sustained_window: 5min 
    - metric: thermal_throttle_active 
      threshold: true 
      sustained_window: 2min 
    - metric: inference_process_restarts 
      threshold: 1                     # any restart during canary is a signal 
      sustained_window: none           # immediate 
    - metric: disk_used_pct 
      threshold: 85% 
      sustained_window: 10min 
  action: rollback_to_active 
  notify: ops_team, ml_team 

 
Applying AWS MLOps rollback guidance to edge inspection risk

AWS MLOps guidance explicitly identifies canary, shadow, and blue-green deployment strategies alongside three rollback options: revert to prior model, fallback to heuristics, and roll forward to a patched version.

Applied to automotive edge inspection, these map to concrete responses:

The key principle that AWS guidance establishes, and that applies directly to edge inspection, is that rollback strategies must be defined, tested, and rehearsed before the canary begins, not designed during an incident. On an automotive line, a rollback decision made under production pressure without a pre-defined playbook will be made incorrectly.

The five trigger categories above, with their specific thresholds and actions, are the pre-defined playbook.

Monitoring: from device health to data drift

Effective monitoring for edge AI inspection requires three distinct, non-overlapping layers, each answering a different operational question, owned by a different team, and acting on a different time horizon.

Conflating them into a single "monitoring" bucket makes it harder to identify which layer is signalling a problem and who should respond.

Layer 1 — Device Health

Device health monitoring covers the physical and OS-level state of each edge node. For automotive factory deployments this is especially important for fanless industrial PCs and Jetson modules operating in environments with welding heat, vibration, and dust ingress — conditions that cause thermal throttling and disk saturation long before they cause outright hardware failure.

# monitoring.py — Layer 1: device health metrics 
from prometheus_client import Counter, Gauge, Histogram, start_http_server 
GPU_UTIL = Gauge( 
    "autoqc_gpu_utilization_pct", 
    "GPU utilization percentage (0–100)", 
) 
GPU_MEM_USED = Gauge( 
    "autoqc_gpu_memory_used_mb", 
    "GPU memory currently in use (MB)", 
) 
CPU_UTIL = Gauge( 
    "autoqc_cpu_utilization_pct", 
    "CPU utilization percentage (0–100)", 
) 
THERMAL_ZONE = Gauge( 
    "autoqc_thermal_zone_celsius", 
    "Thermal zone temperature in degrees Celsius", 
    ["zone"],                    # e.g. "gpu", "cpu", "board" 
) 
DISK_USED_PCT = Gauge( 
    "autoqc_disk_used_pct", 
    "Disk utilization percentage (0–100)", 
    ["mount"],                   # e.g. "/", "/opt/models" 
) 
PROC_RESTARTS = Counter( 
    "autoqc_process_restarts_total", 
    "Total inference process restarts since node boot", 
) 
def record_device_health( 
    gpu_util_pct: float, 
    gpu_mem_used_mb: float, 
    cpu_util_pct: float, 
    thermal_readings: dict[str, float],   # e.g. {"gpu": 72.5, "board": 61.0} 
    disk_readings: dict[str, float],      # e.g. {"/": 42.3, "/opt/models": 61.7} 
) -> None: 
    """Update all device health gauges. Call on a regular polling interval.""" 
    GPU_UTIL.set(gpu_util_pct) 
    GPU_MEM_USED.set(gpu_mem_used_mb) 
    CPU_UTIL.set(cpu_util_pct) 
    for zone, temp in thermal_readings.items(): 
        THERMAL_ZONE.labels(zone=zone).set(temp) 
    for mount, pct in disk_readings.items(): 
        DISK_USED_PCT.labels(mount=mount).set(pct) 

Export these metrics on a local Prometheus scrape port. Display on a local cell-level dashboard so line supervisors can see device health without cloud connectivity. Forward to a central fleet dashboard for cross-site visibility when connectivity is available.

Layer 2 — Inference-Path Performance

Inference-path monitoring covers the end-to-end timing and throughput of the inspection pipeline — from frame acquisition through to PLC write.

This is where you validate in production that your p99 latency budget is being met on every cycle, and where you detect per-stage bottlenecks before they compound into actuation failures. Instrument at each stage boundary using the FrameContext timestamps assigned at acquisition (as defined in Step 1 of the edge inference stack):

# monitoring.py — Layer 2: inference-path performance metrics 
from prometheus_client import Counter, Histogram, start_http_server 
from frame_context import FrameContext 
# Buckets aligned to automotive actuation budgets (1–200 ms range). 
# Fine resolution below 50 ms where budget breaches are most consequential. 
STAGE_LATENCY = Histogram( 
    "autoqc_stage_latency_ms", 
    "Per-stage pipeline latency in milliseconds", 
    ["stage"],                   # see VALID_STAGES below 
    buckets=[1, 2, 5, 10, 20, 30, 50, 75, 100, 150, 200, 500], 
) 
VALID_STAGES = frozenset({ 
    "acquisition_to_preprocess", 
    "preprocess", 
    "inference", 
    "decision_to_plc_write", 
    "end_to_end", 
}) 
PREDICTIONS = Counter( 
    "autoqc_predictions_total", 
    "Total predictions by outcome label and model version", 
    ["label", "model_version"],  # label values: "ok", "defective", "error", "timeout" 
) 
STALE_DROPS = Counter( 
    "autoqc_stale_decision_drops_total", 
    "Decisions discarded by the IO adapter due to staleness threshold breach", 
    ["station_id"], 
) 
def init_metrics(port: int = 9100) -> None: 
    """Start the Prometheus HTTP scrape server on the given port.""" 
    start_http_server(port) 
def record_inference( 
    ctx: FrameContext, 
    label: str, 
    model_version: str, 
) -> None: 
    """ 
    Record per-stage latencies from FrameContext timestamps and 
    increment the prediction counter. 
    All FrameContext timestamps are in nanoseconds (int). 
    Latency values are converted to milliseconds before observation. 
    """ 
    ns_to_ms = 1_000_000.0 
    STAGE_LATENCY.labels(stage="acquisition_to_preprocess").observe( 
        (ctx.preprocess_start_ns - ctx.hw_timestamp_ns) / ns_to_ms 
    ) 
    STAGE_LATENCY.labels(stage="preprocess").observe( 
        (ctx.preprocess_end_ns - ctx.preprocess_start_ns) / ns_to_ms 
    ) 
    STAGE_LATENCY.labels(stage="inference").observe( 
        (ctx.inference_end_ns - ctx.inference_start_ns) / ns_to_ms 
    ) 
    STAGE_LATENCY.labels(stage="decision_to_plc_write").observe( 
        (ctx.plc_write_ts_ns - ctx.decision_ts_ns) / ns_to_ms 
    ) 
    STAGE_LATENCY.labels(stage="end_to_end").observe( 
        (ctx.plc_write_ts_ns - ctx.hw_timestamp_ns) / ns_to_ms 
    ) 
    PREDICTIONS.labels( 
        label=label, 
        model_version=model_version, 
    ).inc() 
def record_stale_drop(station_id: str) -> None: 
    """Increment stale-decision drop counter for the given station.""" 
    STALE_DROPS.labels(station_id=station_id).inc() 

 

Display per-stage latency histograms (p50/p95/p99) on both the local cell dashboard and the central fleet dashboard. Alert on p99 end-to-end latency approaching the actuation budget threshold — this is a leading indicator of missed rejections, not a lagging one.

Layer 3 — Model and Data Behaviour

Model and data behaviour monitoring covers whether the model's predictions remain accurate and well-calibrated as real-world production inputs evolve. This layer cannot be fully automated: it requires ground-truth feedback from downstream quality events, human review of distribution anomalies, and a defined escalation path to retraining when drift is confirmed.

Ownership is shared between ML engineering and quality engineering.

Model behaviour: track per defect class, per product family, and per shift:

  • Score distribution histogram: Record the rolling distribution of model output scores (p10, p25, p50, p75, p90) per defect class. Sharp shifts (scores clustering near 0.5, or the mean drifting significantly) indicate the model is encountering inputs outside its training distribution, or that the process or lighting has changed. Compare against the shadow-mode baseline established at promotion time.
  • Prediction class distribution: Track the proportion of frames classified as each label over time. A sudden increase in a specific defect class (e.g., weld porosity) may indicate a genuine upstream process problem or model drift; both require investigation, with different responses.
  • Confidence boundary rate: The proportion of frames where the model score falls within the uncertainty band (e.g., 0.40–0.60). A rising boundary rate is a leading indicator of accuracy degradation before it manifests in precision/recall metrics.
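A minimal sketch of the score-distribution snapshot and boundary-rate trigger, assuming scores arrive per defect class. The window size, band, and class name are illustrative assumptions:

```python
from collections import defaultdict, deque

WINDOW = 10_000                      # rolling frames per defect class
BAND_LO, BAND_HI = 0.40, 0.60        # uncertainty band from the text
BOUNDARY_TRIGGER = 0.15              # boundary-rate alert threshold

_scores = defaultdict(lambda: deque(maxlen=WINDOW))

def record_score(defect_class: str, score: float) -> None:
    _scores[defect_class].append(score)

def snapshot(defect_class: str) -> dict:
    """Rolling percentiles plus boundary rate for one defect class."""
    s = sorted(_scores[defect_class])
    pct = lambda q: s[min(len(s) - 1, int(q / 100 * len(s)))]
    boundary = sum(BAND_LO <= x <= BAND_HI for x in s) / len(s)
    return {"p10": pct(10), "p25": pct(25), "p50": pct(50),
            "p75": pct(75), "p90": pct(90),
            "boundary_rate": boundary,
            "boundary_alert": boundary > BOUNDARY_TRIGGER}
```

The snapshot dictionary is what gets compared against the shadow-mode baseline recorded at promotion time.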

Label feedback and in-field accuracy:


Where the line has a re-inspection station, end-of-line measurement system, or CMM, tie inspection decisions back to downstream ground-truth outcomes using the frame correlation ID established at acquisition:

  • Join edge pass/fail decisions to downstream quality events by frame ID or part carrier ID.
  • Compute in-field precision and recall per defect class on the joined subset; even a 5–10% sample rate provides statistically meaningful accuracy estimates.
  • Track false-reject rate trends per product family; rising false rejects are the earliest quantitative signal of over-rejection risk before the canary rollback threshold is reached.
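The join-and-score step can be sketched as follows; the record shapes and the "defect"/"ok" label values are assumptions for illustration:

```python
def in_field_metrics(decisions: dict, ground_truth: dict) -> dict:
    """
    decisions:    frame_id -> predicted label ("defect" / "ok")
    ground_truth: frame_id -> actual label from re-inspection / CMM
    Only frames present in both maps (the sampled subset) are scored.
    """
    tp = fp = fn = 0
    for frame_id, actual in ground_truth.items():
        pred = decisions.get(frame_id)
        if pred is None:
            continue  # no edge decision recorded for this frame
        if pred == "defect" and actual == "defect":
            tp += 1
        elif pred == "defect" and actual == "ok":
            fp += 1  # false reject: a good part was scrapped
        elif pred == "ok" and actual == "defect":
            fn += 1  # escape: a defect leaked downstream
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return {"precision": precision, "recall": recall,
            "false_rejects": fp, "escapes": fn,
            "sample_size": tp + fp + fn}
```

Tracking `false_rejects` and `escapes` separately matters because, as noted above, they carry different costs: throughput loss versus defect leakage.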

Data drift and retraining triggers: three concrete quantitative signals:

  • Population Stability Index (PSI) > 0.20 on any input feature distribution (e.g., brightness, contrast, defect morphology), sustained over a full shift.
  • In-field precision or recall falls more than 3 percentage points below the model's validated evaluation metrics on any critical defect class.
  • Confidence boundary rate rises above 15% of total frames on a product family the model was trained on.
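The PSI trigger can be sketched in a few lines; the bin edges are an assumption (in practice they come from the training-time reference distribution), and the epsilon smoothing guards against empty bins:

```python
import math

def psi(reference, live, edges, eps=1e-4):
    """Population Stability Index over shared bin edges."""
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v > e for e in edges)] += 1  # bin index for v
        return [max(c / len(values), eps) for c in counts]
    ref_p, live_p = proportions(reference), proportions(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_p, live_p))

def drift_trigger(reference, live, edges, threshold=0.20):
    """True when PSI exceeds the 0.20 retraining trigger from the text."""
    return psi(reference, live, edges) > threshold
```

The "sustained over a full shift" condition from the trigger list is enforced outside this function, by requiring the flag to hold across consecutive evaluation windows.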

In hybrid architectures this layer is operationally easier: sampled frames, score logs, and feature statistics stream to the central data lake, so drift detection runs centrally without additional edge tooling. In edge-only architectures, drift monitoring requires periodic manual export of score logs and representative frame samples for offline analysis.


Designing for intermittent connectivity and harsh conditions

Automotive plants are noisy RF and electrical environments, with welding, large motors, and maintenance activities causing frequent micro-outages and transient network issues. Your architecture should assume:

  • Inference runs 24/7 without cloud: All models and configs required for normal operation are present locally, with no runtime dependency on cloud APIs.
  • Store-and-forward telemetry: Metrics and image samples buffer locally (disk or embedded TSDB) and sync to the cloud via a dedicated agent with retries and backoff.
  • Resilient updates: Artifacts download to a dedicated staging location, never directly into the active path, and are checksum-verified and smoke-tested before activation. This keeps the active model untouched at every stage; a power loss during download or testing leaves the node running the current production model with no intervention required.
    • Activation is atomic: the new symlink is created at a temporary name and renamed over the active link (ln -sfn <new-artifact> active.tmp && mv -T active.tmp active), which resolves to a single kernel-level rename(2) call. There is no window during which neither the old nor the new artifact is available. The previous active model is archived as last-known-good before the switch, never after, so it is always available as a rollback target.
    • On every startup, the inference service runs a boot-time validation pass before opening any port or accepting frames: it loads the active artifact, runs a synthetic inference pass, and verifies the output. If validation fails (corrupt artifact, driver mismatch, or incomplete activation), the service automatically repoints the active symlink to the last-known-good model and retries. If last-known-good also fails, the service must not start and must emit a structured error requiring manual intervention. A node with no validated model must never silently pass parts.
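A sketch of the atomic activation and boot-time fallback, assuming a directory of model artifacts with `active` and `last_known_good` symlinks; the layout, names, and the `load_and_smoke_test` callable are illustrative assumptions:

```python
import os

def _point(link_path: str, target: str) -> None:
    """Atomically (re)point a symlink: create at a temp name, then
    rename over the live link; a single rename(2) call, never a gap."""
    tmp = link_path + ".tmp"
    if os.path.lexists(tmp):
        os.unlink(tmp)
    os.symlink(target, tmp)
    os.replace(tmp, link_path)

def activate(models_dir: str, new_artifact_dir: str) -> None:
    """Archive the current model as last-known-good BEFORE switching,
    then atomically repoint 'active' at the new artifact."""
    active = os.path.join(models_dir, "active")
    if os.path.islink(active):
        _point(os.path.join(models_dir, "last_known_good"),
               os.path.realpath(active))
    _point(active, new_artifact_dir)

def boot_validate(models_dir: str, load_and_smoke_test) -> str:
    """Boot-time gate, run before opening any port: validate the active
    artifact, fall back to last-known-good, refuse to start otherwise."""
    for name in ("active", "last_known_good"):
        link = os.path.join(models_dir, name)
        if not os.path.islink(link):
            continue
        target = os.path.realpath(link)
        if load_and_smoke_test(target):
            if name == "last_known_good":
                _point(os.path.join(models_dir, "active"), target)
            return target
    raise RuntimeError("no validated model: manual intervention required")
```

`os.replace` maps to rename(2), so readers of the `active` link always resolve either the old or the new artifact, never a missing one.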

A simple sync agent pattern:

  • Watches a local directory for new .jsonl (metrics) and .tar.gz (image batches).
  • Compresses and uploads bundles to the cloud gateway when a threshold is reached or a timer fires.  
  • Marks bundles as “sent” in a small local DB; deletion is only allowed once the cloud acknowledges ingestion.  
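A sketch of that bookkeeping, assuming a small SQLite state table and an `upload` callable that returns True only when the cloud acknowledges ingestion; the schema and file naming are illustrative assumptions:

```python
import os
import sqlite3

SCHEMA = """CREATE TABLE IF NOT EXISTS bundles (
    path  TEXT PRIMARY KEY,
    state TEXT NOT NULL DEFAULT 'pending'  -- 'pending' or 'sent'
)"""

def sync_cycle(conn: sqlite3.Connection, outbox: str, upload) -> None:
    """One agent cycle: register new bundles, retry pending uploads,
    and delete only bundles the cloud has acknowledged."""
    # 1. Register any new .jsonl / .tar.gz bundles in the outbox.
    for name in os.listdir(outbox):
        if name.endswith((".jsonl", ".tar.gz")):
            conn.execute("INSERT OR IGNORE INTO bundles(path) VALUES (?)",
                         (os.path.join(outbox, name),))
    # 2. Attempt upload of everything still pending; mark 'sent' only
    #    on an acknowledged upload.
    for (path,) in conn.execute(
            "SELECT path FROM bundles WHERE state = 'pending'").fetchall():
        try:
            if upload(path):
                conn.execute(
                    "UPDATE bundles SET state = 'sent' WHERE path = ?",
                    (path,))
        except OSError:
            pass  # network error: leave pending, retry next cycle
    # 3. Delete acknowledged bundles from disk and forget them.
    for (path,) in conn.execute(
            "SELECT path FROM bundles WHERE state = 'sent'").fetchall():
        if os.path.exists(path):
            os.remove(path)
        conn.execute("DELETE FROM bundles WHERE path = ?", (path,))
    conn.commit()
```

Because deletion happens only in step 3, after an acknowledged upload, a crash or outage at any point leaves the bundle on disk to be retried on the next cycle.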

Idempotency and duplicate-safe ingestion on the cloud side


In factory OT environments, network interruptions frequently occur mid-upload: the bundle is transmitted, the cloud ingests it, but the acknowledgement never reaches the edge node.

The sync agent correctly treats the missing acknowledgement as a failure and retries the upload on the next cycle. Without idempotency on the cloud side, this retry delivers a duplicate bundle that is ingested a second time, creating duplicate metric entries, double-counted defect events, and inflated inference volumes that corrupt fleet-wide analytics and make incident reconstruction unreliable.

The cloud ingestion endpoint must therefore be designed to be idempotent by bundle ID: every bundle is assigned a unique, deterministic ID at creation time on the edge node (e.g., a SHA-256 of the bundle contents, or a structured ID combining station ID, timestamp, and sequence number).

The cloud gateway checks this ID against a deduplication store before processing:

  • If the bundle ID has not been seen before: ingest the bundle, record the ID, and return an acknowledgement.
  • If the bundle ID has already been ingested: return an acknowledgement immediately without re-processing. The edge agent receives its acknowledgement, marks the bundle as sent, and moves on.

This pattern ensures that any number of retries produces exactly one ingested record per bundle, regardless of how many times the upload is attempted. The deduplication store requires only the bundle ID and ingestion timestamp: a lightweight entry that can be retained for a rolling window (e.g., 7 days) covering the maximum realistic retry period before being expired.
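A minimal sketch of the dedup gate, using a content-hash bundle ID and an in-memory dict standing in for the real deduplication store; the class and method names are illustrative assumptions:

```python
import hashlib
import time

def bundle_id(contents: bytes) -> str:
    """Deterministic ID computed on the edge node at bundle creation."""
    return hashlib.sha256(contents).hexdigest()

class IngestGateway:
    def __init__(self, retention_s: float = 7 * 24 * 3600):
        self._seen = {}            # bundle_id -> ingestion timestamp
        self.retention_s = retention_s
        self.ingested = 0          # bundles actually processed

    def ingest(self, bid: str, contents: bytes) -> str:
        """Any number of retries yields exactly one processed record."""
        self._expire()
        if bid in self._seen:
            return "ack"           # duplicate: ack without re-processing
        self._process(contents)
        self._seen[bid] = time.time()
        return "ack"

    def _process(self, contents: bytes) -> None:
        self.ingested += 1         # stand-in for real ingestion

    def _expire(self) -> None:
        cutoff = time.time() - self.retention_s
        self._seen = {k: t for k, t in self._seen.items() if t >= cutoff}
```

In production the `_seen` map would live in a durable key-value store shared across gateway instances, but the retry semantics are the same.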

Without this guarantee, connectivity instability, which is normal rather than exceptional on a factory OT network, becomes a source of data quality corruption that compounds silently over time and is expensive to reconcile after the fact.

End-to-end blueprint: from cloud training to edge control

Putting it all together, a reference blueprint for an automotive quality control system looks like this:

  • Data flows upward (metrics + samples) for learning and analysis.
  • Models and configs flow downward in controlled, versioned deployments with shadow/canary/rollback capabilities.
  • The real-time control loop remains strictly local, meeting 10–50 ms response time requirements even at 100+ parts per minute.

Technical design checklist (factory‑floor ready)

When you design or review an automotive AI quality‑control system, validate these items explicitly.

  • Latency budget defined and enforced:
    Document line speed, part spacing, actuator travel time, and a hard upper bound for inspection latency (for example, p99 end‑to‑end ≤ 50 ms), then verify it with real load tests on the target hardware.
  • Edge inference for all inline decisions:
    Ensure all pass/fail decisions that drive ejectors or line stops are executed on edge hardware; reserve cloud services strictly for training, analytics, fleet coordination, and configuration—not the primary real‑time control loop.
  • Hardware sized for workload and environment:
    Use industrial PCs or Jetson‑class modules with sufficient GPU/Tensor cores, thermal headroom, ingress protection, and IO (GigE Vision, digital IO, fieldbus) for your camera rates and model complexity.
  • End‑to‑end CI/CD pipeline in place:
    Implement automated tests (unit, regression on golden images, performance), reproducible multi‑arch builds, signed artifacts, and push‑button deployment to edge nodes via k3s, IoT Edge, or a similar orchestrator.
  • Model lifecycle and governance formalized:
    Maintain a central model registry with a two-layer artifact structure separating source models from compiled, device-class-specific deployment artifacts, with the following non-negotiable properties:
    • Versioned artifacts with promotion criteria: Every source model and its compiled edge artifacts progress through an explicit state machine: staging → shadow → canary → production → rolled-back. No artifact reaches production without passing HIL validation gates at each promotion step. Promotion decisions are recorded with operator ID, timestamp, and the HIL run ID that authorised the transition.
    • Runtime compatibility locked per compiled artifact: Each compiled artifact (TensorRT engine, OpenVINO IR, ORT cache) carries a machine-readable compatibility manifest specifying the exact runtime stack it was built for: ORT version, CUDA, cuDNN, TensorRT, JetPack, and minimum driver version. The deployment orchestrator validates this manifest against the target node's installed stack before staging the artifact; a mismatch is a hard block, not a warning.
    • Signed artifacts: Every model artifact is signed at the point of registry entry (e.g., SHA-256 manifest, cosign signature) and the signature is verified by the edge node's deployment agent before the artifact is moved from staging to the active path. An artifact whose signature cannot be verified is discarded without activation. This prevents deployment of tampered, corrupted, or misrouted artifacts and satisfies software supply chain integrity requirements under IATF 16949 and similar frameworks.
    • Compiled artifact traceability, not just source model traceability: The audit trail must record which compiled artifact ran on which station and when, not just which source model version was deployed. A source model version (e.g., defect-detector-v18) may have multiple compiled artifacts for different device classes, each built with different quantization, calibration, and runtime parameters. Recording only the source model version leaves the audit trail ambiguous: two stations running defect-detector-v18 may have been running different TRT engines with different INT8 calibration profiles, producing different inference behaviour on identical inputs. The registry must log: compiled artifact ID, device class, station ID, activation timestamp, deactivation timestamp, and the operator or orchestrator that performed the deployment.

      A complete audit query for any production incident must be able to answer: "Which compiled artifact, built from which source model, validated on which HIL rig, signed by which registry entry, was active on Station A at 14:15 on March 27 — and what was its runtime compatibility manifest?" If your registry cannot answer this query end-to-end, your governance is incomplete.
  • Safe rollout patterns and deterministic rollback:
    Standardize on shadow deployments followed by tightly scoped canaries, with SLO‑driven automatic rollback (latency, false reject/miss rates, drift) and a simple, documented path for manual rollback during incidents.
  • Comprehensive, multi‑layer monitoring:
    Instrument device health (CPU/GPU, thermals, disk), latency histograms, prediction distributions, and drift indicators across all three monitoring layers: device health, inference-path performance, and model/data behaviour. Dashboards and alerts must be available at two independent visibility levels, and the local level must function without any dependency on WAN connectivity:
    • Local visibility, cell and plant level (WAN-independent): Each edge node runs a local Prometheus scrape endpoint and a locally hosted dashboard (e.g., Grafana served from the edge node or a plant-floor server on the OT network). Line supervisors and process engineers can see device health, inference latency, reject rates, and active model version at the cell or line level without requiring internet connectivity or cloud access. Local alerts (visual on the HMI, audible at the station, or pushed over the plant's internal network) fire independently of WAN state. WAN loss must not blind plant operators to inspection health. A monitoring architecture where all alerting routes through the cloud goes dark the moment the factory loses internet connectivity, exactly when an operator most needs situational awareness. Local visibility is therefore a safety and operational requirement, not a convenience feature.
    • Central visibility, fleet level (WAN-dependent, non-critical path): Aggregated metrics forward to a central time-series database and fleet dashboard when connectivity is available, enabling cross-line, cross-plant analytics: comparative reject rates, fleet-wide latency trends, model version distribution across stations, and drift indicators per product family. Central dashboards provide the ML engineering and operations management view. Their unavailability during WAN outages is acceptable, because local visibility at the cell level continues uninterrupted, but they must resume automatically when connectivity is restored, with no manual re-sync required.
      Alerts must be tiered to match their visibility level: latency budget breaches, process restarts, and thermal throttling alert locally in real time; fleet-wide drift indicators and cross-line reject rate anomalies alert centrally on a shift-level cadence. No alert that requires immediate line-operator action should depend on a cloud routing path.
  • Connectivity‑aware and failure‑aware design:
    Use store‑and‑forward telemetry, resilient OTA updates (staging + checksum + smoke tests), and clearly defined behavior under partial failures or WAN loss so that inline inspection continues safely even when the cloud does not.
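The runtime-compatibility gate described under the model lifecycle item above can be sketched as a straightforward manifest comparison; the field names and version strings below are illustrative assumptions, not a fixed schema:

```python
def parse_version(v: str) -> tuple:
    """'535.104' -> (535, 104) for numeric comparison."""
    return tuple(int(x) for x in v.split("."))

def check_compatibility(manifest: dict, node_stack: dict) -> list:
    """Return the list of mismatches; deployment proceeds only if empty.
    Exact-match fields must be identical; 'min_driver' is a floor."""
    mismatches = []
    for field in ("ort", "cuda", "cudnn", "tensorrt", "jetpack"):
        if manifest.get(field) != node_stack.get(field):
            mismatches.append(
                f"{field}: artifact built for {manifest.get(field)}, "
                f"node has {node_stack.get(field)}")
    if parse_version(node_stack["driver"]) < parse_version(manifest["min_driver"]):
        mismatches.append(
            f"driver: node {node_stack['driver']} below "
            f"minimum {manifest['min_driver']}")
    return mismatches
```

The orchestrator treats a non-empty result as a hard block: the artifact is never staged to that node, and the mismatch list becomes the structured error recorded against the deployment attempt.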

If these aspects are explicitly addressed in your design documents and implementation, you can move from fragile cloud‑centric PoCs to robust, low‑latency edge deployments that remain reliable under real automotive factory‑floor conditions.

Closing the Loop: From Edge AI Architecture to Production Reality

Across this five-part series, we have traced the full lifecycle of edge AI for automotive quality control — from the physical constraints that make cloud AI structurally incompatible with inline inspection, through architecture selection, inference stack design, deployment engineering, and fleet-scale operations.

The consistent thread across all five parts is this: edge AI reliability is not a property of any single component — it is an emergent property of the entire system, designed with explicit latency budgets, defined failure semantics, enforced architectural boundaries, and continuous operational discipline.

A correct model with a brittle deployment pipeline will fail. A robust pipeline with poor observability will degrade silently. The programmes that succeed are those that treat every layer — hardware, software, MLOps, and operations — as a first-class engineering concern from day one.

If you are starting a new edge AI inspection programme, begin with Part 1 and let the physical constraints drive your architecture. If you are operating an existing programme and hitting operational challenges, the monitoring framework in Part 5 and the deployment engineering patterns in Part 4 are the right entry points.

Sonal Dwevedi & Tharun Mathew