
Inline pass/fail decisions stay resident on the edge - deterministic, PLC-integrated, and cloud-independent. The cloud handles model lifecycle, rollout orchestration, and governance asynchronously, never touching the control path. The article compares edge-only and edge-plus-cloud hybrid patterns across their operational consequences, covering latency budgets, PLC integration, CI/CD with hardware-in-the-loop validation, canary rollout, typed rollback triggers, and resilient OTA updates.
A well-designed inference stack running on the right hardware is necessary, but it only delivers value if it can be built, tested, deployed, and updated reliably across an entire fleet of edge devices.
That is the deployment engineering problem, and it is where many edge AI programmes stumble.
Part 4 covers the full deployment engineering layer: the FastAPI inference service that wraps the ONNX Runtime with hot-reload, timeout handling, and observability; the CI/CD pipeline that validates model artifacts before they are ever eligible for promotion; the hardware-in-the-loop gate that enforces latency compliance on real target hardware; and the orchestrator decision that determines how desired-state manifests are delivered and enforced across your edge fleet. Each section provides concrete, implementable patterns, not just guidance.
The FastAPI inference service below is designed as an internal control surface: for modularity, it provides a clean, testable interface between the preprocessing stage and the inference runtime within the edge node's local process boundary. It is not the interface used to deliver pass/fail decisions to the PLC.
As established in Step 5, the PLC integration layer uses a dedicated IO adapter with a deterministic write loop, bounded queues, and staleness checks, none of which are appropriate to implement over HTTP.
If you are reading this service as the real-time path to the PLC, it is not: the decision output from this service feeds the IO adapter, which owns the PLC communication boundary.
app/edge_inference_service.py

```python
import os
import time
import threading
from pathlib import Path

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI, HTTPException, status
from pydantic import BaseModel, Field, field_validator

from monitoring import init_metrics, record_inference

# ---------------------------------------------------------------------------
# Configuration — all values from environment, with safe defaults
# ---------------------------------------------------------------------------
MODEL_DIR = Path(os.getenv("MODEL_DIR", "/opt/models"))
ACTIVE_MODEL = os.getenv("ACTIVE_MODEL", "defect-detector-v17")
ORT_THREADS = int(os.getenv("ORT_THREADS", "2"))
METRICS_PORT = int(os.getenv("METRICS_PORT", "9100"))
INFER_TIMEOUT_S = float(os.getenv("INFER_TIMEOUT_S", "0.040"))  # 40 ms hard ceiling

# Expected input dimensions — must match training export
INPUT_CHANNELS = 3
INPUT_HEIGHT = 512
INPUT_WIDTH = 512
EXPECTED_NUMEL = INPUT_CHANNELS * INPUT_HEIGHT * INPUT_WIDTH  # 786,432 elements

# Tensor value bounds — reject inputs outside training distribution
TENSOR_MIN = float(os.getenv("TENSOR_MIN", "0.0"))
TENSOR_MAX = float(os.getenv("TENSOR_MAX", "1.0"))

# ---------------------------------------------------------------------------
# Session management — supports hot reload without service restart
# ---------------------------------------------------------------------------
_session_lock = threading.Lock()
_active_session: ort.InferenceSession | None = None
_active_model_name: str = ""


def _build_session(model_name: str) -> ort.InferenceSession:
    model_path = MODEL_DIR / f"{model_name}.onnx"
    if not model_path.exists():
        raise FileNotFoundError(f"Model artifact not found: {model_path}")
    opts = ort.SessionOptions()
    opts.intra_op_num_threads = ORT_THREADS
    opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    return ort.InferenceSession(
        model_path.as_posix(),
        sess_options=opts,
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )


def load_session(model_name: str) -> None:
    """Load or hot-reload the inference session under a write lock."""
    global _active_session, _active_model_name
    new_session = _build_session(model_name)  # build outside lock — expensive
    with _session_lock:
        _active_session = new_session
        _active_model_name = model_name


def get_session() -> tuple[ort.InferenceSession, str]:
    """Return current session and model name under a read lock."""
    with _session_lock:
        if _active_session is None:
            raise RuntimeError("Inference session not initialised")
        return _active_session, _active_model_name


# Initial load at startup
load_session(ACTIVE_MODEL)

# ---------------------------------------------------------------------------
# Application
# ---------------------------------------------------------------------------
app = FastAPI(title="Edge Inference Service")
init_metrics(port=METRICS_PORT)


# ---------------------------------------------------------------------------
# Request / Response models with bounded validation
# ---------------------------------------------------------------------------
class InferenceRequest(BaseModel):
    # Flattened CHW float32 tensor: must be exactly C x H x W elements
    tensor: list[float] = Field(
        ...,
        min_length=EXPECTED_NUMEL,
        max_length=EXPECTED_NUMEL,
        description=(
            f"Flattened CHW float32 tensor, exactly {EXPECTED_NUMEL} elements "
            f"({INPUT_CHANNELS}x{INPUT_HEIGHT}x{INPUT_WIDTH})"
        ),
    )
    threshold: float = Field(default=0.5, ge=0.0, le=1.0)
    correlation_id: str | None = Field(default=None, max_length=128)

    @field_validator("tensor")
    @classmethod
    def check_value_bounds(cls, v: list[float]) -> list[float]:
        arr = np.asarray(v, dtype=np.float32)
        if arr.min() < TENSOR_MIN or arr.max() > TENSOR_MAX:
            raise ValueError(
                f"Tensor values must be in [{TENSOR_MIN}, {TENSOR_MAX}]; "
                f"got min={arr.min():.4f}, max={arr.max():.4f}. "
                "Ensure preprocessing normalization matches training."
            )
        return v


class InferenceResponse(BaseModel):
    defective: bool
    score: float
    latency_ms: float
    model_version: str
    correlation_id: str | None = None


class ReloadRequest(BaseModel):
    model_name: str = Field(..., min_length=1, max_length=128)


# ---------------------------------------------------------------------------
# Inference endpoint
# ---------------------------------------------------------------------------
@app.post("/infer", response_model=InferenceResponse)
def infer(req: InferenceRequest) -> InferenceResponse:
    session, model_name = get_session()
    input_name = session.get_inputs()[0].name
    output_name = session.get_outputs()[0].name
    img = np.asarray(req.tensor, dtype=np.float32).reshape(
        1, INPUT_CHANNELS, INPUT_HEIGHT, INPUT_WIDTH
    )

    # --- Timeout-bounded inference via thread + join ---
    result_holder: dict = {}
    exc_holder: dict = {}

    def _run() -> None:
        try:
            result_holder["scores"] = session.run([output_name], {input_name: img})[0]
        except Exception as exc:
            exc_holder["error"] = exc

    t = threading.Thread(target=_run, daemon=True)
    start = time.perf_counter()
    t.start()
    t.join(timeout=INFER_TIMEOUT_S)
    latency_ms = (time.perf_counter() - start) * 1000.0

    if t.is_alive():
        # Inference did not complete within budget — circuit open
        record_inference(latency_ms, label="timeout", model_version=model_name)
        raise HTTPException(
            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
            detail=f"Inference exceeded {INFER_TIMEOUT_S * 1000:.0f} ms timeout. "
            "Downstream adapter must apply configured fail-safe.",
        )
    if "error" in exc_holder:
        record_inference(latency_ms, label="error", model_version=model_name)
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=f"Inference runtime error: {exc_holder['error']}",
        )

    defect_score = float(result_holder["scores"][0][0])
    defective = defect_score >= req.threshold
    label = "defective" if defective else "ok"
    record_inference(latency_ms, label=label, model_version=model_name)
    return InferenceResponse(
        defective=defective,
        score=defect_score,
        latency_ms=latency_ms,
        model_version=model_name,
        correlation_id=req.correlation_id,
    )


# ---------------------------------------------------------------------------
# Session reload endpoint — hot-swap model without service restart
# ---------------------------------------------------------------------------
@app.post("/reload-model", status_code=status.HTTP_200_OK)
def reload_model(req: ReloadRequest) -> dict:
    """
    Hot-reload a new model artifact without restarting the service.

    Called by the deployment orchestrator after staging and validating a new
    model version. In-flight requests complete against the old session before
    the lock is acquired.
    """
    try:
        load_session(req.model_name)
    except FileNotFoundError as exc:
        raise HTTPException(status_code=status.HTTP_404_NOT_FOUND, detail=str(exc))
    return {"status": "reloaded", "active_model": req.model_name}


# ---------------------------------------------------------------------------
# Health and readiness
# ---------------------------------------------------------------------------
@app.get("/healthz")
def health() -> dict:
    _, model_name = get_session()
    return {"status": "ok", "active_model": model_name}
```
This pattern is portable across Jetson modules, industrial x86 PCs, and other accelerator targets, and it keeps the control surface small and testable.
To move from lab script to repeatable deployment, treat edge inspection as software-plus-model, with the same rigour as other production services. For automotive edge AI specifically, the pipeline must go further than typical software CI/CD: it must validate latency and resource behaviour on the actual target hardware class before any artifact is eligible for promotion to a production line.
A robust CI pipeline should maintain an explicit compatibility matrix per supported device class (e.g., Jetson Orin NX 8GB + JetPack 6.0, x86 IPC + CUDA 12.2 + TRT 8.6) and run a model import smoke test against each matrix entry before the artifact is promoted. A model that passes regression on the build server but fails to load on the target device class is a deployment failure, not a model failure, and it is entirely preventable at CI time.
```yaml
# .github/workflows/edge-ci.yml (excerpt)
- name: Validate model artifact
  run: |
    # Record model version for this step and export it for later steps
    MODEL_VERSION=$(cat model_registry/active_version.txt)
    echo "MODEL_VERSION=${MODEL_VERSION}" >> "$GITHUB_ENV"
    mkdir -p dist/models
    cp artifacts/defect-detector.onnx \
       dist/models/defect-detector-${MODEL_VERSION}.onnx
    # Validate ONNX opset compatibility against target ORT version
    python scripts/validate_onnx_opset.py \
      --model dist/models/defect-detector-${MODEL_VERSION}.onnx \
      --target-ort-version 1.17.3
    # Run compatibility smoke test per device class
    python scripts/smoke_test_compat.py \
      --model dist/models/defect-detector-${MODEL_VERSION}.onnx \
      --device-matrix config/device_compatibility_matrix.yaml
    # Record artifact checksum (integrity check, not a cryptographic signature)
    sha256sum dist/models/defect-detector-${MODEL_VERSION}.onnx \
      > dist/models/defect-detector-${MODEL_VERSION}.onnx.sha256
```
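The `config/device_compatibility_matrix.yaml` consumed by the smoke test is not shown in the article. One plausible shape, built from the device classes named above (all field names are illustrative):

```yaml
# config/device_compatibility_matrix.yaml — illustrative sketch
device_classes:
  - name: jetson-orin-nx-8gb
    os_image: jetpack-6.0
    onnxruntime: 1.17.3
    execution_providers: [CUDAExecutionProvider, CPUExecutionProvider]
  - name: x86-ipc
    cuda: "12.2"
    tensorrt: "8.6"
    onnxruntime: 1.17.3
    execution_providers: [TensorrtExecutionProvider, CUDAExecutionProvider, CPUExecutionProvider]
```

Keeping the matrix in version control means adding a new device class is a reviewable diff, and the smoke test automatically covers it on the next pipeline run.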

Hardware-in-the-loop (HIL) validation: the gate between lab and factory
Passing unit tests and regression tests on a CI build server is necessary but not sufficient for automotive edge AI. A model that meets accuracy thresholds on a CPU-based build server may exceed its latency budget on a loaded Jetson Orin NX running at thermal steady state, or consume enough GPU memory to cause OOM failures when multiple inference sessions are co-resident. The gap between lab benchmark and factory behaviour is one of the most common causes of deployment failures in production edge AI programmes.
Promotion to the factory staging environment must require a hardware-in-the-loop (HIL) validation stage executed on a representative hardware rig: a device of the same SKU, OS image, driver stack, and thermal class as the production edge nodes:
```yaml
# Promotion gate — must pass before artifact enters staging registry
hil_validation:
  latency:
    p99_max_ms: 45          # derived from station actuation budget
    p95_max_ms: 30
    duration_minutes: 10
    frame_rate_fps: 30
  resources:
    gpu_memory_max_mb: 3072
    gpu_util_sustained_max_pct: 80
    thermal_throttle_allowed: false
  recovery:
    power_loss_test: required
    expected_recovery_model: last_known_good
```
On the factory side, a deployment orchestrator manages desired-state manifests, artifact delivery, and rollout sequencing across the edge fleet. The orchestrators are not interchangeable: k3s, Azure IoT Edge, and custom agents solve fundamentally different problems across four dimensions that matter in automotive factory environments:
Selection guidance:

Deploying edge AI reliably requires treating the model artifact and the serving code as jointly versioned, jointly tested software, not separately managed artefacts. The CI pipeline must validate model-runtime compatibility per device class before promotion.
The HIL gate must enforce p99 latency compliance on real hardware, not just on build servers. The orchestrator must be chosen based on fleet size, cloud platform alignment, and operational complexity tolerance, not on what is most familiar.
The most expensive deployment failures in production edge AI are those that pass every lab test and then fail silently on the first production line because the target hardware class, thermal state, or co-resident workload was never included in the validation gate. HIL fidelity is non-negotiable.
Deployment is not the end of the lifecycle: it is the beginning of operations.
Part 5 closes the series with the full operational layer: a two-tier model registry with compiled artifact traceability, shadow and canary deployment patterns for zero-risk model promotion, five concrete rollback triggers with quantitative thresholds, and a three-layer monitoring framework that covers device health, inference-path performance, and data drift direction.