
Many AI‑assisted legacy modernisation programs fail not because of bad tools, but because AI is thrown at code with no context: thousands of files, zero documentation, lost tribal knowledge, and incident history scattered across ticketing systems. To make AI coding assistants reliable in these environments, you have to build technical knowledge layers that encode not just what the code does, but why it exists and how it behaves in production. This article explains how to structure repositories, documentation, runbooks, and incident logs so that AI assistants can reason about legacy systems in cross‑industry IT / DevOps landscapes (banking, manufacturing, telecom, government) instead of just autocompleting syntax.
Legacy environments don’t just suffer from missing documentation; they suffer from architectural amnesia and decades of semantic drift between what the system was supposed to do and what the code actually does today. Business rules have been patched for new products, regulations, and edge cases, often without updating specs or diagrams, so the implementation gradually diverges from the original intent that lives in old design docs or retired engineers’ heads.
On paper, many of these systems still “work,” but the semantics of fields, workflows, and failure modes have shifted: flags repurposed for new meanings, tables overloaded with legacy and modern data, feature toggles that became permanent, and hotfixes that encode regulatory or commercial constraints no one remembers explicitly. This semantic drift is especially acute in regulated domains like banking, insurance, and telecom, where compliance changes accumulate as ad‑hoc code paths rather than clearly documented domain models.
Off‑the‑shelf AI coding assistants are trained for next‑token prediction, not for reconstructing institutional memory. Without access to the real, current ground truth (regulatory constraints, contractual obligations, operational invariants), they will happily propose syntactically correct changes that violate subtle but critical business rules encoded in legacy conditionals, batch schedules, or data flows. In multi‑repo monoliths with unknown dependencies and outdated frameworks, this means a “safe‑looking” refactor in one service can silently break settlement flows, reporting obligations, or SLAs three systems away.
That’s why naive “AI over code” approaches routinely fail in modernisation projects: they operate on text tokens stripped of the evolving semantics that give those tokens meaning in the current business, regulatory, and operational context. Without engineered knowledge layers that reattach code to up‑to‑date domain intent, incident history, and compliance constraints, AI assistants are guessing in the dark.
A technical knowledge layer is everything you build around the code that lets humans and machines understand system behavior over time, not just within a single file. In legacy environments, you can think of it as a stack of structured artifacts: code and configuration, architecture decisions, tests, runbooks, and incident histories.
AI agents and coding assistants can then be attached through retrieval‑augmented generation (RAG), embedding indexes, and code understanding tools that process these layers as a unified context graph.
Before you wire any AI assistant into your IDE or CI pipeline, you need a knowledge architecture that decides what “context” actually means in your shop. Experience from legacy modernisation programs shows that teams that treat this as an architecture problem, not a documentation afterthought, get faster and safer outcomes.
Two design principles have emerged as critical:
From there you define your knowledge entities (service, feature, incident, runbook, config, schema, dashboard) and relationships (service‑uses‑schema, incident‑references‑service, runbook‑resolves‑incident‑type) so they can be indexed and retrieved in a consistent way.
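To make that vocabulary concrete, here is a minimal in‑memory sketch of entities and typed relationships. The `Entity`, `Relation`, and `KnowledgeGraph` names and the `billing-db-schema` ID are hypothetical; a real context engine could persist the same structure in a graph database or alongside an embedding index.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Literal, Optional

# Entity kinds and relationship types from the text; class names are hypothetical.
EntityKind = Literal["service", "feature", "incident", "runbook", "config", "schema", "dashboard"]
RelationType = Literal[
    "service-uses-schema", "incident-references-service", "runbook-resolves-incident-type"
]


@dataclass(frozen=True)
class Entity:
    id: str
    kind: EntityKind


@dataclass(frozen=True)
class Relation:
    source: str
    type: RelationType
    target: str


class KnowledgeGraph:
    """In-memory graph of knowledge entities and typed relationships."""

    def __init__(self) -> None:
        self.entities: dict[str, Entity] = {}
        self.edges: dict[str, list[Relation]] = defaultdict(list)

    def add_entity(self, entity: Entity) -> None:
        self.entities[entity.id] = entity

    def add_relation(self, relation: Relation) -> None:
        # Reject edges to unknown entities so the graph stays consistent.
        for endpoint in (relation.source, relation.target):
            if endpoint not in self.entities:
                raise KeyError(f"unknown entity: {endpoint}")
        # Index each edge from both ends so lookups work in either direction.
        self.edges[relation.source].append(relation)
        self.edges[relation.target].append(relation)

    def neighbours(self, entity_id: str, kind: Optional[str] = None) -> list[Entity]:
        """Entities linked to entity_id, optionally filtered by kind."""
        linked = []
        for rel in self.edges.get(entity_id, []):
            other = rel.target if rel.source == entity_id else rel.source
            entity = self.entities[other]
            if kind is None or entity.kind == kind:
                linked.append(entity)
        return linked


# Usage: which incidents reference billing-service?
graph = KnowledgeGraph()
graph.add_entity(Entity("billing-service", "service"))
graph.add_entity(Entity("billing-db-schema", "schema"))  # hypothetical schema entity
graph.add_entity(Entity("INC-2024-0321", "incident"))
graph.add_relation(Relation("billing-service", "service-uses-schema", "billing-db-schema"))
graph.add_relation(Relation("INC-2024-0321", "incident-references-service", "billing-service"))
print([e.id for e in graph.neighbours("billing-service", kind="incident")])  # ['INC-2024-0321']
```

Indexing edges from both ends keeps queries such as "all incidents for this service" and "all services touched by this incident" equally cheap.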
A practical way to expose knowledge layers to AI is to standardize the filesystem layout across legacy systems and surface cross‑links in machine‑readable form. Even if your code lives in multiple repos, each repo (or simulated mono‑view) should follow a predictable structure.
A typical pattern for a legacy modernisation workspace:
```text
legacy-platform/
├── services/
│   ├── billing-service/
│   │   ├── src/
│   │   ├── tests/
│   │   ├── docs/
│   │   │   ├── feature-maps/
│   │   │   └── adr/
│   │   ├── runbooks/
│   │   ├── config/
│   │   └── incidents/
│   └── inventory-batch/
│       ├── src/
│       ├── tests/
│       ├── docs/
│       └── runbooks/
├── infra/
│   ├── terraform/
│   ├── k8s/
│   └── legacy-hosts/
├── global-docs/
│   ├── domain-model/
│   ├── context-maps/
│   └── glossary/
└── observability/
    ├── dashboards/
    ├── alerts/
    └── slo/
```
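A predictable layout only helps if it stays predictable. One way to keep it that way, sketched here as an assumption rather than an established tool, is a CI check that reports service directories missing required folders. `REQUIRED_DIRS` and `check_layout` are hypothetical; the required set uses only the folders that both example services share in the tree above.

```python
from pathlib import Path

# Hypothetical policy: folders that both example services share in the layout.
REQUIRED_DIRS = {"src", "tests", "docs", "runbooks"}


def check_layout(services_root: Path) -> dict[str, list[str]]:
    """Map each service directory to the required folders it is missing."""
    problems: dict[str, list[str]] = {}
    for service_dir in sorted(p for p in services_root.iterdir() if p.is_dir()):
        missing = sorted(d for d in REQUIRED_DIRS if not (service_dir / d).is_dir())
        if missing:
            problems[service_dir.name] = missing
    return problems
```

A CI job could call `check_layout(Path("legacy-platform/services"))` and fail the build whenever the returned dict is non‑empty, which catches layout drift before it degrades retrieval.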
The key is to make relationships explicit and queryable:
```yaml
# services/billing-service/service.yml
apiVersion: platform.example.com/v1
kind: ServiceMetadata
metadata:
  name: billing-service
  labels:
    domain: payments
    tier: tier-1
    language: java
  annotations:
    repoUrl: https://git.example.com/legacy/billing-service
    runbookUrl: https://wiki.example.com/runbooks/billing-db-failure
    dashboardUrl: https://grafana.example.com/d/abc123/billing
spec:
  # Explicit blast-radius metadata for change-risk assessment
  blast_radius:
    level: high
    label: high_impact_on_settlement_engine
    rationale: >
      Failures or regressions can block settlement, revenue recognition,
      and regulatory reporting for all card transactions.
  description: >
    Handles invoicing, payment capture, and refunds for all customer orders.
  owner:
    team: payments-squad
    slack_channel: "#team-payments"
    email: payments-squad@example.com
  tier: tier-1
  lifecycle: production
  dependencies:
    - service: customer-service
      type: hard
      description: Reads customer billing addresses
    - service: payment-gateway
      type: hard
      description: Charges credit cards
  slos:
    - name: availability
      target: 99.9
      window: 30d
      indicator: http_availability
  deployments:
    - environment: production
      cluster: prod-cluster-1
      namespace: billing
      replicas: 4
```
This structure enables code understanding tools and RAG pipelines to retrieve not just nearby code, but the right runbooks, incidents, and architectural context when answering an AI query.
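Because the metadata is machine‑readable, a pre‑merge check can turn it into a change‑risk summary for reviewers or for an AI assistant's prompt. The sketch below works on the YAML after it has been parsed into a Python dict (for example with PyYAML's `yaml.safe_load`, not shown). The `change_risk_summary` function and the rule that a `high` blast radius or any `hard` dependency requires owner review are assumptions for illustration.

```python
from typing import Any


def change_risk_summary(service: dict[str, Any]) -> dict[str, Any]:
    """Summarize change risk from a parsed service.yml-style dict.

    Hypothetical policy: a high blast radius, or any hard dependency,
    means the change needs review by the owning team.
    """
    spec = service.get("spec", {})
    blast = spec.get("blast_radius", {})
    hard_deps = [d["service"] for d in spec.get("dependencies", []) if d.get("type") == "hard"]
    annotations = service.get("metadata", {}).get("annotations", {})
    return {
        "service": service["metadata"]["name"],
        "blast_radius": blast.get("level", "unknown"),
        "hard_dependencies": hard_deps,
        "requires_owner_review": blast.get("level") == "high" or bool(hard_deps),
        "owner_channel": spec.get("owner", {}).get("slack_channel"),
        # Surface the runbook so a reviewer (or an AI assistant) can cite it.
        "runbook": annotations.get("runbookUrl"),
    }


# Subset of the billing-service metadata above, already parsed into a dict.
billing = {
    "metadata": {
        "name": "billing-service",
        "annotations": {"runbookUrl": "https://wiki.example.com/runbooks/billing-db-failure"},
    },
    "spec": {
        "blast_radius": {"level": "high"},
        "owner": {"slack_channel": "#team-payments"},
        "dependencies": [
            {"service": "customer-service", "type": "hard"},
            {"service": "payment-gateway", "type": "hard"},
        ],
    },
}
print(change_risk_summary(billing))
```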
Most legacy systems suffer from documentation debt: if docs exist at all, they usually explain “how the function works,” not “why this behavior is necessary for the business.” For AI assistants, this is fatal: models are already good at reconstructing local syntax, but they cannot infer tacit business rules without explicit hints.

Effective intent‑oriented documentation for AI has several layers:
1. Functional specifications derived from code: AI‑assisted reverse engineering tools can read sprawling legacy codebases, identify architectures and inter‑module dependencies, and generate functional specifications that turn “black box” code into blueprints. These specs should describe preconditions, postconditions, side effects, and business invariants in domain language, not just parameter lists.
2. Characterization tests as executable intent: For legacy modules, characterization tests that capture current behavior are often more valuable than idealized unit tests. Letting AI propose initial test skeletons and documentation, and then confirming both against the module's observed behavior, creates an executable contract that humans and AI can rely on when refactoring. These tests become the primary line of defense when an AI assistant suggests a refactor: they check that any proposed change preserves existing, business‑critical behavior.
3. Architecture Decision Records (ADRs): ADRs encode why certain design choices were made (e.g., “Why this batch job runs at 02:00,” “Why currency conversion is cached in memory”). When indexed, they give AI assistants high‑value signals about constraints such as regulatory requirements, performance trade‑offs, or historical incidents that shaped the current design.
Characterization test example
Characterization tests capture current (often messy) behavior, so refactoring and AI suggestions can be validated against real expectations.
```python
# tests/test_legacy_discount_engine.py
import pytest

from legacy.discount_engine import calculate_discount


@pytest.mark.parametrize(
    "customer_tier, cart_total, expected",
    [
        ("GOLD", 100.00, 0.15),    # 15% for GOLD customers
        ("SILVER", 100.00, 0.10),  # 10% for SILVER
        ("BRONZE", 100.00, 0.05),  # 5% default
        ("GOLD", 49.99, 0.00),     # no discount below 50
    ],
)
def test_calculate_discount_characterization(customer_tier, cart_total, expected):
    """
    Characterization tests: document the current behavior of the legacy
    discount engine before any refactoring.
    """
    result = calculate_discount(customer_tier, cart_total)
    # Compare with a tolerance, since the legacy engine returns floats
    assert result == pytest.approx(expected)
```
4. Scenario‑driven narratives: Short narratives describing real workflows (“how an insurance claim passes through the system”), anchored with sequence diagrams and links to code, APIs, and data schemas, help AI map textual descriptions to implementation artifacts. These are especially useful when paired with replay or video‑to‑code tools that extract flows from real user sessions.
When you feed these layers into your RAG index, prompts can ask the AI not only “What does this function do?” but also “Is this change consistent with the historical intent and documented invariants for this feature?”
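Hand‑written parametrized cases, like the characterization test above, cover the behavior you already know about. When the input space is larger, a common complement is a golden‑master (snapshot) test: record what the legacy code returns for a broad sample of inputs once, then flag every case whose output changes after a refactor. The helper names and JSON snapshot format below are hypothetical, and `legacy_discount` is a stand‑in for the real legacy function. Outputs are compared exactly, so even small floating‑point differences are reported, which is usually what you want from a behavioral baseline.

```python
import json
from pathlib import Path
from typing import Any, Callable, Iterable, Sequence


def record_golden_master(fn: Callable[..., Any], cases: Iterable[Sequence[Any]], path: Path) -> None:
    """Run the legacy function on every case and store its observed outputs as JSON.

    Inputs and outputs must be JSON-serializable (strings, numbers, lists, dicts).
    """
    snapshot = [{"args": list(args), "result": fn(*args)} for args in cases]
    path.write_text(json.dumps(snapshot, indent=2))


def diff_against_golden_master(fn: Callable[..., Any], path: Path) -> list[dict[str, Any]]:
    """Re-run every recorded case; return the ones whose output no longer matches."""
    changed = []
    for entry in json.loads(path.read_text()):
        actual = fn(*entry["args"])
        if actual != entry["result"]:
            changed.append({"args": entry["args"], "expected": entry["result"], "actual": actual})
    return changed


# Hypothetical stand-in for legacy.discount_engine.calculate_discount,
# mirroring the behavior captured in the parametrized test.
def legacy_discount(customer_tier: str, cart_total: float) -> float:
    if cart_total < 50:
        return 0.0
    return {"GOLD": 0.15, "SILVER": 0.10}.get(customer_tier, 0.05)
```

In a test suite you would record the snapshot once, commit the JSON file, and then assert that `diff_against_golden_master(...)` returns an empty list for the refactored code.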
Runbooks are where organizations already write down procedures, but they are often unstructured, out of date, or spread across wikis and PDFs. Modern incident management best practices treat runbooks as living, actionable documents with clear steps, decision points, verification checks, and links to tooling.
For AI, runbooks should:
An example runbook structure that works well for both humans and AI:
```markdown
# DATABASE_FAILURE_RUNBOOK

service_id: billing-service
severity: P1
primary_dashboards: [grafana:billing-db-latency, grafana:billing-db-errors]
related_incidents: [INC-2024-0321, INC-2023-1107]

## Initial Verification
- Check the error rate on grafana:billing-db-errors
- Run kubectl logs on pods with label app=billing-api
- Validate connectivity from app pods to DB_BILLING with nc

## Initial Response
- If connection pool exhaustion is suspected, compare max_connections with current usage
- If lock contention is suspected, inspect pg_locks for long-running transactions

## Recovery Options
- Temporarily increase the pool size (see config/db/pool.yml)
- Scale read replicas (run scripts/scale_billing_replicas.sh)

## Post-recovery Verification
- Ensure the error rate stays below 0.1% for 15 minutes
- Validate that the reconciliation dashboard shows no data inconsistencies
```
With structured templates like this, AI agents can automatically suggest candidate runbooks when an incident occurs and, more importantly, map operational patterns back to code and configuration hotspots.
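Once the header fields are structured, suggesting candidate runbooks is mostly a matching problem. A minimal sketch, assuming runbooks put `key: value` header lines (with `[a, b]` lists) before the first `##` section, as in the template above; `parse_runbook_header`, `suggest_runbooks`, and the overlap‑based scoring rule are hypothetical.

```python
import re
from typing import Iterable

# Header lines look like "key: value"; keys are word characters only.
HEADER_LINE = re.compile(r"^(\w+):\s*(.+)$")


def parse_runbook_header(text: str) -> dict[str, object]:
    """Extract `key: value` header fields that appear before the first `##` section."""
    fields: dict[str, object] = {}
    for line in text.splitlines():
        if line.startswith("##"):
            break  # the header ends at the first section heading
        match = HEADER_LINE.match(line.strip())
        if not match:
            continue  # skip the title line, blank lines, and free text
        key, value = match.groups()
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            # Bracketed values become lists: "[a, b]" -> ["a", "b"]
            fields[key] = [item.strip() for item in value[1:-1].split(",") if item.strip()]
        else:
            fields[key] = value
    return fields


def suggest_runbooks(runbooks: dict[str, str], service_id: str, incident_ids: Iterable[str] = ()) -> list[str]:
    """Rank runbooks for an alert on service_id.

    Only runbooks for the same service qualify; those sharing more related
    incidents with the current alert rank higher (hypothetical scoring rule).
    """
    wanted = set(incident_ids)
    scored = []
    for name, text in runbooks.items():
        header = parse_runbook_header(text)
        if header.get("service_id") != service_id:
            continue
        related = set(header.get("related_incidents", []))
        scored.append((len(related & wanted), name))
    # Highest overlap first; ties broken by name for a stable order.
    return [name for _, name in sorted(scored, key=lambda s: (-s[0], s[1]))]


RUNBOOK = """# DATABASE_FAILURE_RUNBOOK
service_id: billing-service
severity: P1
primary_dashboards: [grafana:billing-db-latency, grafana:billing-db-errors]
related_incidents: [INC-2024-0321, INC-2023-1107]
## Initial Verification
Check error rate on grafana:billing-db-errors
"""
```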
Incident records are a gold mine of implicit business and system behavior—but only if they are recorded in a consistent, machine-readable format. DevOps incident management guidance recommends chronological timelines with standardized fields (timestamp, actor, action, result) and clear severity classifications.
Best practices for AI‑ready incident logging:
Incident event JSON schema
```json
{
  "$id": "https://example.com/schemas/incident-event.schema.json",
  "type": "object",
  "required": ["id", "incident_id", "timestamp", "source", "type", "description"],
  "properties": {
    "id": { "type": "string" },
    "incident_id": { "type": "string" },
    "timestamp": { "type": "string", "format": "date-time" },
    "source": {
      "type": "string",
      "enum": ["monitoring", "deployment", "chat", "status_page", "runbook", "manual"]
    },
    "type": {
      "type": "string",
      "enum": [
        "alert_fired",
        "alert_acknowledged",
        "alert_resolved",
        "deployment_started",
        "deployment_completed",
        "rollback_initiated",
        "escalation",
        "action_taken",
        "decision_made",
        "root_cause_identified"
      ]
    },
    "actor": {
      "type": "object",
      "required": ["type", "id"],
      "properties": {
        "type": { "type": "string", "enum": ["human", "system", "automation"] },
        "id": { "type": "string" },
        "name": { "type": "string" }
      }
    },
    "description": { "type": "string" },
    "metadata": { "type": "object" },
    "confidence": {
      "type": "string",
      "enum": ["automated", "manual", "inferred"],
      "default": "manual"
    }
  }
}
```
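In production you would typically validate events with a JSON Schema library. To show what the schema enforces without extra dependencies, the sketch below hand‑checks a subset of it: the required fields, the `source`, `type`, `actor.type`, and `confidence` enums, and a parseable timestamp. `validate_event` is a hypothetical name, and this is not a full JSON Schema implementation.

```python
from datetime import datetime
from typing import Any

# Enum values copied from the incident-event schema.
SOURCES = {"monitoring", "deployment", "chat", "status_page", "runbook", "manual"}
EVENT_TYPES = {
    "alert_fired", "alert_acknowledged", "alert_resolved",
    "deployment_started", "deployment_completed", "rollback_initiated",
    "escalation", "action_taken", "decision_made", "root_cause_identified",
}
ACTOR_TYPES = {"human", "system", "automation"}
CONFIDENCE = {"automated", "manual", "inferred"}
REQUIRED = ("id", "incident_id", "timestamp", "source", "type", "description")


def validate_event(event: dict[str, Any]) -> list[str]:
    """Return a list of schema violations (empty if the event looks valid)."""
    errors = [f"missing field: {f}" for f in REQUIRED if f not in event]
    if "source" in event and event["source"] not in SOURCES:
        errors.append(f"invalid source: {event['source']}")
    if "type" in event and event["type"] not in EVENT_TYPES:
        errors.append(f"invalid type: {event['type']}")
    if "timestamp" in event:
        try:
            # fromisoformat() only accepts a trailing "Z" from Python 3.11 on.
            datetime.fromisoformat(str(event["timestamp"]).replace("Z", "+00:00"))
        except ValueError:
            errors.append(f"invalid timestamp: {event['timestamp']}")
    actor = event.get("actor")
    if actor is not None:
        if actor.get("type") not in ACTOR_TYPES:
            errors.append("invalid actor.type")
        if "id" not in actor:
            errors.append("missing field: actor.id")
    # confidence is optional and defaults to "manual", as in the schema.
    if event.get("confidence", "manual") not in CONFIDENCE:
        errors.append(f"invalid confidence: {event['confidence']}")
    return errors
```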
Timeline event types (TypeScript)
```typescript
// incident-timeline-collector.ts
export type EventSource =
  | "monitoring"
  | "deployment"
  | "chat"
  | "status_page"
  | "runbook"
  | "manual";

export type EventType =
  | "alert_fired"
  | "alert_acknowledged"
  | "alert_resolved"
  | "deployment_started"
  | "deployment_completed"
  | "rollback_initiated"
  | "escalation"
  | "action_taken"
  | "decision_made"
  | "root_cause_identified";

export interface TimelineEvent {
  id: string;
  incidentId: string;
  timestamp: Date; // serialized as an ISO 8601 date-time string in the JSON schema
  source: EventSource;
  type: EventType;
  description: string;
  actor?: {
    type: "human" | "system" | "automation";
    id: string;
    name?: string; // optional, matching the JSON schema
  };
  metadata: Record<string, unknown>;
  confidence: "automated" | "manual" | "inferred";
}
```
An example timeline entry format that tools and AI can both consume:
```text
[2025-11-09T15:23:00Z] actor=db-team action=identified_connection_pool_exhaustion result=increased_pool_limit_by_50_percent service_id=billing-service incident_id=INC-2025-1109
```
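The `key=value` entry format is easy to convert into the structured incident event defined by the schema above. A minimal parser, assuming each entry is a bracketed ISO 8601 timestamp followed by space‑separated `key=value` pairs with no spaces inside values (true of the example, not guaranteed in general). `parse_timeline_entry`, `to_incident_event`, and the mapping of a manual entry to an `action_taken` event by a human actor are hypothetical.

```python
import re
from datetime import datetime
from typing import Any

# "[timestamp] key=value key=value ..."
ENTRY = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<rest>.*)$")


def parse_timeline_entry(line: str) -> dict[str, Any]:
    """Parse one timeline line into a dict of fields plus a datetime timestamp."""
    match = ENTRY.match(line.strip())
    if not match:
        raise ValueError(f"not a timeline entry: {line!r}")
    # Values must not contain spaces; split each pair on the first "=" only.
    fields: dict[str, Any] = dict(
        pair.split("=", 1) for pair in match.group("rest").split() if "=" in pair
    )
    # Replace "Z" so datetime.fromisoformat() also works before Python 3.11.
    fields["timestamp"] = datetime.fromisoformat(match.group("ts").replace("Z", "+00:00"))
    return fields


def to_incident_event(fields: dict[str, Any], event_id: str) -> dict[str, Any]:
    """Map parsed fields onto the incident-event JSON schema.

    Hypothetical mapping: manual entries become "action_taken" events by a human actor.
    """
    return {
        "id": event_id,
        "incident_id": fields["incident_id"],
        "timestamp": fields["timestamp"].isoformat(),
        "source": "manual",
        "type": "action_taken",
        "description": f"{fields['action']}: {fields['result']}",
        "actor": {"type": "human", "id": fields["actor"]},
        "metadata": {"service_id": fields.get("service_id")},
        "confidence": "manual",
    }
```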
Over time, AI agents can analyze hundreds of such incidents to detect recurrent failure modes, predict probable root causes for new alerts, and suggest safer defaults or refactoring candidates in the code.
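Detecting recurrent failure modes does not require a model to start with; plain frequency counts over structured entries already surface candidates worth showing to an AI agent or an engineer. The sketch below assumes entries have already been parsed into dicts with `service_id`, `action`, and `incident_id` keys (fields of the timeline format above), and flags any `(service_id, action)` pair seen in at least `min_incidents` distinct incidents. The function name and default threshold are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Mapping


def recurrent_failure_modes(
    entries: Iterable[Mapping[str, str]], min_incidents: int = 3
) -> list[tuple[str, str, int]]:
    """Return (service_id, action, incident_count) for patterns seen in >= min_incidents incidents."""
    incidents_by_pattern: dict[tuple[str, str], set[str]] = defaultdict(set)
    for entry in entries:
        pattern = (entry["service_id"], entry["action"])
        # Count distinct incidents, not raw lines, so one noisy incident cannot dominate.
        incidents_by_pattern[pattern].add(entry["incident_id"])
    recurring = [
        (service_id, action, len(incidents))
        for (service_id, action), incidents in incidents_by_pattern.items()
        if len(incidents) >= min_incidents
    ]
    # Most frequent first, then alphabetical for a stable order.
    return sorted(recurring, key=lambda r: (-r[2], r[0], r[1]))
```

A pattern such as repeated connection pool exhaustion on billing-service would then be a candidate for a safer default in the pool configuration or for a targeted refactor.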
Once code, docs, runbooks, and incidents are structured, you can expose them to AI coding assistants and operator copilots via retrieval‑augmented generation (RAG) and context engines. RAG retrieves from your actual knowledge sources (repositories, wikis, schema docs, incident knowledge bases), so the AI answers from your ground truth instead of guessing from generic training data.
A typical architecture for legacy environments:
1. Ingestion pipelines
2. Context engine / knowledge graph
3. AI assistant integration
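The context‑engine step is where vector search and the knowledge graph meet. One simple approach, sketched here as an assumption rather than a reference design: take the chunks returned by similarity search, read the `service_id` recorded in each chunk's metadata, and append the runbooks and incidents linked to that service so the assistant sees operational context next to the code. The `Chunk` type, the `links` mapping, and `expand_context` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    source: str
    metadata: dict[str, str] = field(default_factory=dict)


def expand_context(
    hits: list[Chunk],
    links: dict[str, list[Chunk]],
    max_extra: int = 3,
) -> list[Chunk]:
    """Return the vector-search hits plus linked artifacts for each service they touch.

    `links` maps a service_id to related runbook/incident chunks, e.g. built from
    service.yml annotations or incident records.
    """
    result = list(hits)
    seen = {c.source for c in hits}
    for hit in hits:
        service_id = hit.metadata.get("service_id")
        added = 0
        for linked in links.get(service_id, []):
            if added >= max_extra:
                break  # cap extra context per hit to protect the prompt budget
            if linked.source not in seen:
                result.append(linked)
                seen.add(linked.source)
                added += 1
    return result
```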
Simple ingestion script example (Python)
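```python
# ingest_legacy_service.py
import glob
from pathlib import Path

from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS  # requires the faiss-cpu package
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

ROOT = Path("legacy-platform/services/billing-service")

# Each glob pattern maps to an artifact type stored in chunk metadata,
# so retrieval can tell code apart from runbooks and incident records.
PATTERNS = {
    "src/**/*.java": "code",
    "docs/**/*.md": "docs",
    "runbooks/**/*.md": "runbook",
    "incidents/**/*.md": "incident",
    "config/**/*.yml": "config",
}


def load_documents():
    for pattern, artifact_type in PATTERNS.items():
        for path in glob.glob(str(ROOT / pattern), recursive=True):
            # If decoding fails, retry with a detected encoding (needs chardet);
            # legacy files are often not UTF-8.
            doc = TextLoader(path, autodetect_encoding=True).load()[0]
            doc.metadata.update({"service_id": ROOT.name, "artifact_type": artifact_type})
            yield doc


docs = list(load_documents())

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=200,
    # Prefer Markdown heading boundaries, then paragraphs, lines, words, characters
    separators=["\n## ", "\n# ", "\n\n", "\n", " ", ""],
)
chunks = splitter.split_documents(docs)

vector_store = FAISS.from_documents(chunks, OpenAIEmbeddings())
vector_store.save_local("indexes/billing-service")
```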
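Legacy repositories are large, so re‑embedding everything on every run gets slow and expensive. A common refinement, sketched here with the standard library only, is to keep a manifest of content hashes and re‑ingest only files that changed. The manifest format and the `plan_reindex`, `load_manifest`, and `save_manifest` helpers are hypothetical; you would feed the changed files into the loader above and remove index entries for deleted paths in whatever way your vector store supports.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file's bytes; any edit changes the digest."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def plan_reindex(
    paths: list[Path], previous: dict[str, str]
) -> tuple[list[Path], list[str], dict[str, str]]:
    """Compare current files with the previous manifest (path -> digest).

    Returns (files to re-embed, paths removed since last run, new manifest).
    """
    current = {str(p): file_digest(p) for p in paths}
    changed = [p for p in paths if previous.get(str(p)) != current[str(p)]]
    removed = sorted(set(previous) - set(current))
    return changed, removed, current


def load_manifest(path: Path) -> dict[str, str]:
    return json.loads(path.read_text()) if path.exists() else {}


def save_manifest(path: Path, manifest: dict[str, str]) -> None:
    # Save only after indexing succeeds, so a failed run is retried next time.
    path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
```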
Retrieval + answer synthesis example
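```python
# answer_with_context.py
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Load the index with the same embedding model used at ingestion time.
vector_store = FAISS.load_local(
    "indexes/billing-service",
    OpenAIEmbeddings(),
    # Loading uses pickle; only enable this for indexes you built yourself.
    allow_dangerous_deserialization=True,
)
retriever = vector_store.as_retriever(search_kwargs={"k": 5})

llm = ChatOpenAI(model="gpt-4.1")

qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff",  # concatenate retrieved chunks into a single prompt
    return_source_documents=True,
)

question = "Is it safe to change the discount rules for GOLD customers?"
response = qa.invoke({"query": question})
print(response["result"])

# List the files the answer was grounded in, so reviewers can check them.
for doc in response["source_documents"]:
    print("-", doc.metadata.get("source"))
```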
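The chain above relies on a generic question‑answering prompt. For change‑safety questions it can help to control the prompt yourself, so the model is told to cite its sources and to say when the retrieved context does not cover the question. A dependency‑free sketch of such a prompt builder; the wording, the `build_grounded_prompt` name, the `(source, text)` input format, and the character budget are assumptions, and the resulting string would be sent to whichever chat model you use.

```python
def build_grounded_prompt(
    question: str, snippets: list[tuple[str, str]], max_chars: int = 6000
) -> str:
    """Assemble a prompt that restricts the model to the retrieved (source, text) snippets."""
    blocks, used = [], 0
    for i, (source, text) in enumerate(snippets, start=1):
        block = f"[{i}] source: {source}\n{text.strip()}\n"
        if used + len(block) > max_chars:
            break  # stay within a rough context budget
        blocks.append(block)
        used += len(block)
    context = "\n".join(blocks) if blocks else "(no context retrieved)"
    return (
        "You are reviewing a change to a legacy system.\n"
        "Answer ONLY from the numbered context below and cite sources as [n].\n"
        "If the context does not cover business rules, tests, ADRs, or incidents "
        "relevant to the question, say so explicitly instead of guessing.\n\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
    )
```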
The same knowledge layer architecture applies across industries—even though the tech stacks and regulatory constraints differ.
In all cases, the pattern is the same: AI understands the system only to the extent that your technical knowledge layers are rich, connected, and machine-readable.
Organizations that succeed with AI‑assisted legacy development usually follow a phased roadmap:
1. Assessment and architecture mapping
2. Standardize structure and templates
3. Build the knowledge layer and RAG index
4. Integrate AI assistants incrementally
5. Close the loop with post‑incident learning
Over time, this creates a virtuous cycle where AI improves understanding and modernisation of legacy systems, and modernisation in turn generates cleaner artifacts that AI can leverage even more effectively.
A robust AI‑assisted development strategy in legacy environments ultimately depends less on which model you choose and more on how well you’ve engineered your knowledge layers around the system. When code, configs, architecture decisions, runbooks, tests, and incident histories are structured, linked, and machine‑readable, AI assistants can reason about real business intent and operational behavior instead of merely predicting the next token. That shift from syntax completion to context‑aware decision support is what makes modernisation safer, faster, and genuinely sustainable in complex, high‑risk estates.
For cross‑industry IT and DevOps teams, the path forward is clear: standardize repository layouts and metadata, treat runbooks and incident logs as first‑class technical assets, and expose everything through a RAG‑powered context engine that your AI tooling can reliably query. Start with a thin slice (a critical service or business capability) to prove that AI can accelerate understanding and refactoring without increasing incident rates, then iteratively expand coverage across domains and platforms. Over time, each incident resolved, test added, and ADR written compounds into a richer knowledge graph, steadily improving both developer productivity and AI decision quality.
In practice, building technical knowledge layers is not just a documentation initiative; it is a core modernisation capability that turns brittle legacy stacks into systems your AI agents can safely collaborate with. Organizations that invest in this foundation today will be able to plug in new AI tools, models, and workflows tomorrow without starting from scratch, because their real competitive advantage lives in the structured understanding of their own systems.