
Inline pass/fail decisions stay resident on the edge - deterministic, PLC-integrated, and cloud-independent. The cloud handles model lifecycle, rollout orchestration, and governance asynchronously, never touching the control path. The article compares edge-only and edge-plus-cloud hybrid patterns across their operational consequences, covering latency budgets, PLC integration, CI/CD with hardware-in-the-loop validation, canary rollout, typed rollback triggers, and resilient OTA updates.
A well-designed inference stack running on the right hardware is necessary, but it only delivers value if it can be built, tested, deployed, and updated reliably across an entire fleet of edge devices.
That is the deployment engineering problem, and it is where many edge AI programmes stumble.
Part 4 covers the full deployment engineering layer: the FastAPI inference service that wraps the ONNX Runtime with hot-reload, timeout handling, and observability; the CI/CD pipeline that validates model artifacts before they are ever eligible for promotion; the hardware-in-the-loop gate that enforces latency compliance on real target hardware; and the orchestrator decision that determines how desired-state manifests are delivered and enforced across your edge fleet. Each section provides concrete, implementable patterns, not just guidance.
The FastAPI inference service below is designed as an internal control surface: for modularity, it provides a clean, testable interface between the preprocessing stage and the inference runtime within the edge node's local process boundary. It is not the interface used to deliver pass/fail decisions to the PLC.
As established in Step 5, the PLC integration layer uses a dedicated IO adapter with a deterministic write loop, bounded queues, and staleness checks, none of which are appropriate to implement over HTTP.
If you are reading this service as the real-time path to the PLC, it is not: the decision output from this service feeds the IO adapter, which owns the PLC communication boundary.
app/edge_inference_service.py

```python
import os
import time
import threading
from pathlib import Path

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI, HTTPException, status
from pydantic import BaseModel, Field, field_validator

from monitoring import init_metrics, record_inference

# ---------------------------------------------------------------------------
# Configuration — all values from environment, with safe defaults
# ---------------------------------------------------------------------------
MODEL_DIR = Path(os.getenv("MODEL_DIR", "/opt/models"))
ACTIVE_MODEL = os.getenv("ACTIVE_MODEL", "defect-detector-v17")
ORT_THREADS = int(os.getenv("ORT_THREADS", "2"))
METRICS_PORT = int(os.getenv("METRICS_PORT", "9100"))
INFER_TIMEOUT_S = float(os.getenv("INFER_TIMEOUT_S", "0.040"))  # 40 ms hard ceiling

# Expected input dimensions — must match training export
INPUT_CHANNELS = 3
INPUT_HEIGHT = 512
INPUT_WIDTH = 512
EXPECTED_NUMEL = INPUT_CHANNELS * INPUT_HEIGHT * INPUT_WIDTH  # 786,432 elements

# Tensor value bounds — reject inputs outside training distribution
TENSOR_MIN = float(os.getenv("TENSOR_MIN", "0.0"))
TENSOR_MAX = float(os.getenv("TENSOR_MAX", "1.0"))

# ---------------------------------------------------------------------------
# Session management — supports hot reload without service restart
# ---------------------------------------------------------------------------
_session_lock = threading.Lock()
_active_session: ort.InferenceSession | None = None
_active_model_name: str = ""


def _build_session(model_name: str) -> ort.InferenceSession:
    model_path = MODEL_DIR / f"{model_name}.onnx"
    if not model_path.exists():
        raise FileNotFoundError(f"Model artifact not found: {model_path}")
    opts = ort.SessionOptions()
    opts.intra_op_num_threads = ORT_THREADS
    opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    return ort.InferenceSession(
        model_path.as_posix(),
        sess_options=opts,
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
    )


def load_session(model_name: str) -> None:
    """Load or hot-reload the inference session under a write lock."""
    global _active_session, _active_model_name
    new_session = _build_session(model_name)  # build outside lock — expensive
    with _session_lock:
        _active_session = new_session
        _active_model_name = model_name


def get_session() -> tuple[ort.InferenceSession, str]:
    """Return current session and model name under a read lock."""
    with _session_lock:
        if _active_session is None:
            raise RuntimeError("Inference session not initialised")
        return _active_session, _active_model_name


# Initial load at startup
load_session(ACTIVE_MODEL)

# ---------------------------------------------------------------------------
# Application
# ---------------------------------------------------------------------------
app = FastAPI(title="Edge Inference Service")
init_metrics(port=METRICS_PORT)


# ---------------------------------------------------------------------------
# Request / Response models with bounded validation
# ---------------------------------------------------------------------------
class InferenceRequest(BaseModel):
    # Flattened CHW float32 tensor: must be exactly C x H x W elements
    tensor: list[float] = Field(
        ...,
        min_length=EXPECTED_NUMEL,
        max_length=EXPECTED_NUMEL,
        description=(
            f"Flattened CHW float32 tensor, exactly {EXPECTED_NUMEL} elements "
            f"({INPUT_CHANNELS}x{INPUT_HEIGHT}x{INPUT_WIDTH})"
        ),
    )
    threshold: float = Field(default=0.5, ge=0.0, le=1.0)
    correlation_id: str | None = Field(default=None, max_length=128)

    @field_validator("tensor")
    @classmethod
    def check_value_bounds(cls, v: list[float]) -> list[float]:
        arr = np.asarray(v, dtype=np.float32)
        if arr.min() < TENSOR_MIN or arr.max() > TENSOR_MAX:
            raise ValueError(
                f"Tensor values must be in [{TENSOR_MIN}, {TENSOR_MAX}]; "
                f"got min={arr.min():.4f}, max={arr.max():.4f}. "
                "Ensure preprocessing normalization matches training."
            )
        return v


class InferenceResponse(BaseModel):
    defective: bool
    score: float
    latency_ms: float
    model_version: str
    correlation_id: str | None = None


class ReloadRequest(BaseModel):
    model_name: str = Field(..., min_length=1, max_length=128)


# ---------------------------------------------------------------------------
# Inference endpoint
# ---------------------------------------------------------------------------
@app.post("/infer", response_model=InferenceResponse)
def infer(req: InferenceRequest) -> InferenceResponse:
    session, model_name = get_session()
    input_name = session.get_inputs()[0].name
    output_name = session.get_outputs()[0].name
    img = np.asarray(req.tensor, dtype=np.float32).reshape(
        1, INPUT_CHANNELS, INPUT_HEIGHT, INPUT_WIDTH
    )

    # --- Timeout-bounded inference via thread + join ---
    result_holder: dict = {}
    exc_holder: dict = {}

    def _run() -> None:
        try:
            result_holder["scores"] = session.run([output_name], {input_name: img})[0]
        except Exception as exc:
            exc_holder["error"] = exc

    t = threading.Thread(target=_run, daemon=True)
    start = time.perf_counter()
    t.start()
    t.join(timeout=INFER_TIMEOUT_S)
    latency_ms = (time.perf_counter() - start) * 1000.0

    if t.is_alive():
        # Inference did not complete within budget — circuit open
        record_inference(latency_ms, label="timeout", model_version=model_name)
        raise HTTPException(
            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
            detail=f"Inference exceeded {INFER_TIMEOUT_S * 1000:.0f} ms timeout. "
            "Downstream adapter must apply configured fail-safe.",
        )
    if "error" in exc_holder:
        record_inference(latency_ms, label="error", model_version=model_name)
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=f"Inference runtime error: {exc_holder['error']}",
        )

    defect_score = float(result_holder["scores"][0][0])
    defective = defect_score >= req.threshold
    label = "defective" if defective else "ok"
    record_inference(latency_ms, label=label, model_version=model_name)
    return InferenceResponse(
        defective=defective,
        score=defect_score,
        latency_ms=latency_ms,
        model_version=model_name,
        correlation_id=req.correlation_id,
    )


# ---------------------------------------------------------------------------
# Session reload endpoint — hot-swap model without service restart
# ---------------------------------------------------------------------------
@app.post("/reload-model", status_code=status.HTTP_200_OK)
def reload_model(req: ReloadRequest) -> dict:
    """
    Hot-reload a new model artifact without restarting the service.

    Called by the deployment orchestrator after staging and validating a new
    model version. In-flight requests complete against the old session before
    the lock is acquired.
    """
    try:
        load_session(req.model_name)
    except FileNotFoundError as exc:
        raise HTTPException(status_code=status.HTTP_404_NOT_FOUND, detail=str(exc))
    return {"status": "reloaded", "active_model": req.model_name}


# ---------------------------------------------------------------------------
# Health and readiness
# ---------------------------------------------------------------------------
@app.get("/healthz")
def health() -> dict:
    _, model_name = get_session()
    return {"status": "ok", "active_model": model_name}
```
This pattern is portable across Jetson modules, industrial x86 PCs, and other accelerator targets, and it keeps the control surface small and testable.
To move from lab script to repeatable deployment, treat edge inspection as software-plus-model, with the same rigour as other production services. For automotive edge AI specifically, the pipeline must go further than typical software CI/CD: it must validate latency and resource behaviour on the actual target hardware class before any artifact is eligible for promotion to a production line.
A robust CI pipeline should maintain an explicit compatibility matrix per supported device class (e.g., Jetson Orin NX 8GB + JetPack 6.0, x86 IPC + CUDA 12.2 + TRT 8.6) and run a model import smoke test against each matrix entry before the artifact is promoted. A model that passes regression on the build server but fails to load on the target device class is a deployment failure, not a model failure, and it is entirely preventable at CI time.
```yaml
# .github/workflows/edge-ci.yml (excerpt)
- name: Validate model artifact
  run: |
    # Record model version for this step and export it for later steps
    MODEL_VERSION=$(cat model_registry/active_version.txt)
    echo "MODEL_VERSION=${MODEL_VERSION}" >> "$GITHUB_ENV"
    mkdir -p dist/models
    cp artifacts/defect-detector.onnx \
       dist/models/defect-detector-${MODEL_VERSION}.onnx
    # Validate ONNX opset compatibility against target ORT version
    python scripts/validate_onnx_opset.py \
      --model dist/models/defect-detector-${MODEL_VERSION}.onnx \
      --target-ort-version 1.17.3
    # Run compatibility smoke test per device class
    python scripts/smoke_test_compat.py \
      --model dist/models/defect-detector-${MODEL_VERSION}.onnx \
      --device-matrix config/device_compatibility_matrix.yaml
    # Record artifact checksum (integrity check, not a cryptographic signature)
    sha256sum dist/models/defect-detector-${MODEL_VERSION}.onnx \
      > dist/models/defect-detector-${MODEL_VERSION}.onnx.sha256
```
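The `config/device_compatibility_matrix.yaml` consumed by the smoke test is not shown in the article. One plausible shape, built from the device classes named above (all field names are illustrative):

```yaml
# config/device_compatibility_matrix.yaml — illustrative sketch
device_classes:
  - name: jetson-orin-nx-8gb
    os_image: jetpack-6.0
    onnxruntime: 1.17.3
    execution_providers: [CUDAExecutionProvider, CPUExecutionProvider]
  - name: x86-ipc
    cuda: "12.2"
    tensorrt: "8.6"
    onnxruntime: 1.17.3
    execution_providers: [TensorrtExecutionProvider, CUDAExecutionProvider, CPUExecutionProvider]
```

Keeping the matrix in version control means adding a new device class is a reviewable diff, and the smoke test automatically covers it on the next pipeline run.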

Hardware-in-the-loop (HIL) validation: the gate between lab and factory
Passing unit tests and regression tests on a CI build server is necessary but not sufficient for automotive edge AI. A model that meets accuracy thresholds on a CPU-based build server may exceed its latency budget on a loaded Jetson Orin NX running at thermal steady state, or consume enough GPU memory to cause OOM failures when multiple inference sessions are co-resident. The gap between lab benchmark and factory behaviour is one of the most common causes of deployment failures in production edge AI programmes.
Promotion to the factory staging environment must require a hardware-in-the-loop (HIL) validation stage executed on a representative hardware rig: a device of the same SKU, OS image, driver stack, and thermal class as the production edge nodes:
```yaml
# Promotion gate — must pass before artifact enters staging registry
hil_validation:
  latency:
    p99_max_ms: 45          # derived from station actuation budget
    p95_max_ms: 30
    duration_minutes: 10
    frame_rate_fps: 30
  resources:
    gpu_memory_max_mb: 3072
    gpu_util_sustained_max_pct: 80
    thermal_throttle_allowed: false
  recovery:
    power_loss_test: required
    expected_recovery_model: last_known_good
```
On the factory side, a deployment orchestrator manages desired-state manifests, artifact delivery, and rollout sequencing across the edge fleet. The orchestrators are not interchangeable: k3s, Azure IoT Edge, and custom agents solve fundamentally different problems across four dimensions that matter in automotive factory environments:
Selection guidance:

Deploying edge AI reliably requires treating the model artifact and the serving code as jointly versioned, jointly tested software, not separately managed artefacts. The CI pipeline must validate model-runtime compatibility per device class before promotion.
The HIL gate must enforce p99 latency compliance on real hardware, not just on build servers. The orchestrator must be chosen based on fleet size, cloud platform alignment, and operational complexity tolerance, not on what is most familiar.
The most expensive deployment failures in production edge AI are those that pass every lab test and then fail silently on the first production line because the target hardware class, thermal state, or co-resident workload was never included in the validation gate. HIL fidelity is non-negotiable.
Deployment is not the end of the lifecycle: it is the beginning of operations.
Part 5 closes the series with the full operational layer: a two-tier model registry with compiled artifact traceability, shadow and canary deployment patterns for zero-risk model promotion, five concrete rollback triggers with quantitative thresholds, and a three-layer monitoring framework that covers device health, inference-path performance, and data drift direction.