RAG over enterprise data sounds simple in the demo and gets complicated in week three. Here's what building a RAG system over a CAFM (computer-aided facilities management) system looked like: chunking, retrieval evaluation, hallucination guardrails, the cost story, and the things I'd do differently next time.
The brief
The original ask was disarmingly simple: "Can our maintenance supervisors ask plain-English questions about asset history?" In practice that meant questions like "What was the last service done on Pump P-1031?", "Which contractor handled it?", "When is the next PPM (planned preventive maintenance) due?" and "What spares were used?", answered without having to navigate the CAFM UI, run reports, or wait for someone in IT to extract data.
Three constraints framed everything that followed. The CAFM data is live and operational, not a snapshot. Some of the data is sensitive (vendor pricing, internal contractor disputes, safety findings). And the audit requirement was non-negotiable: every answer the system gave had to be traceable back to a specific source record.
What we built
The architecture, in five components:
- A nightly ingest job that pulls structured records from the CAFM database (assets, work orders, PPM schedules, service history, contractor records) and serialises each record into a domain-specific text representation.
- An embedding pass over each serialised record, with metadata (record type, asset ID, date, site, status) attached to every vector.
- A vector store with metadata filtering — so a query about "Pump P-1031 last year" pre-filters by asset and date range before retrieval, rather than relying on semantic similarity alone.
- A retrieval + LLM layer that builds the answer from retrieved records, with a strict "cite the work-order number" instruction.
- A guardrail layer that rejects or flags answers without citations, refuses to invent asset numbers, and routes sensitive queries (cost, contractor performance) to a more restrictive permission-checked path.
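The filter-then-rank behaviour of the vector store component can be sketched in a few lines. This is a minimal in-memory illustration under stated assumptions, not the production setup: the `Chunk` class, the `retrieve` function and its parameters are hypothetical names, and a real vector store would apply the metadata filter through its own query API.

```python
from dataclasses import dataclass
from datetime import date
from math import sqrt


@dataclass
class Chunk:
    """One embedded record with the metadata used for pre-filtering (hypothetical shape)."""
    text: str
    vector: list[float]
    asset_id: str
    record_date: date


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(chunks, query_vector, asset_id, start, end, k=10):
    """Filter by asset and date range first, then rank the survivors semantically."""
    candidates = [
        c for c in chunks
        if c.asset_id == asset_id and start <= c.record_date <= end
    ]
    candidates.sort(key=lambda c: cosine(query_vector, c.vector), reverse=True)
    return candidates[:k]
```

Because the filter runs before the similarity ranking, a record for a different pump or an out-of-range date cannot appear in the results, however similar its text is to the query.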
Chunking and indexing approach
This is where most RAG projects go wrong. Generic chunking strategies ("split documents into 500-token windows with 50-token overlap") assume your source data is documents. CAFM data isn't documents. It's relational records. Asset records, work orders, PPM schedules and contractor records all have natural boundaries and natural relationships.
What worked: chunking by record type and entity. One chunk per asset (with its full hierarchy, location, criticality, and core metadata). One chunk per work order (with status, dates, technician, parts used, and a textualised description of the failure and fix). One chunk per contractor (with their service history aggregated). The metadata on each chunk — asset ID, work-order ID, date, type — let the retrieval layer filter precisely before semantic ranking did its work.
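As an illustration of one chunk per work order, here is a rough sketch of the serialisation step. The field names (`id`, `asset_id`, `parts_used` and so on) and the `serialise_work_order` function are assumptions made for the example; the real CAFM schema will differ.

```python
def serialise_work_order(wo: dict) -> dict:
    """Turn one work-order row into a text chunk plus filterable metadata.

    All dictionary keys here are hypothetical; map them to your own schema.
    """
    parts = ", ".join(wo["parts_used"]) or "none recorded"
    text = (
        f"Work order {wo['id']} on asset {wo['asset_id']}. "
        f"Status: {wo['status']}. Completed: {wo['completed']}. "
        f"Technician: {wo['technician']}. Parts used: {parts}. "
        f"Description: {wo['description']}"
    )
    metadata = {
        "record_type": "work_order",
        "work_order_id": wo["id"],
        "asset_id": wo["asset_id"],
        "date": wo["completed"],
    }
    return {"text": text, "metadata": metadata}
```

The text is what gets embedded; the metadata is what the retrieval layer filters on. Keeping the work-order ID in both places means the LLM can cite it and the guardrail can verify it.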
The lesson
Generic chunking strategies fail on relational data. The structure of your CAFM tables is the chunking strategy — lean into it.
Retrieval evaluation
The most important lesson of the whole build: build your eval set before you build the retrieval. We started the other way and lost two weeks.
What an eval set looks like: 50-100 question/answer pairs that real supervisors would ask, with the specific record (or records) that should be retrieved as the ground truth. "What was the last work order on Pump P-1031?" → WO #12482, completed 2026-02-14. "Which contractor did the AHU servicing in Tower A in March?" → Acme Mechanical, WO #12530-12537.
With that set, you can measure recall@k (did the right record show up in the top k retrieval results?) and answer accuracy (did the LLM produce the right number/date/name?) on every iteration. Without it, you're tuning on vibes.
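Measuring recall@k over such a set takes very little code. The sketch below assumes each eval item is a dict with `question` and `expected_ids` keys and that `retrieve` returns a ranked list of record IDs; those names are illustrative. It counts a question as a hit only when every expected record appears in the top k, which is one reasonable reading for multi-record answers.

```python
def recall_at_k(eval_set, retrieve, k=10):
    """Fraction of eval questions whose expected records all appear in the top k results.

    eval_set: list of {"question": str, "expected_ids": list[str]} (hypothetical shape)
    retrieve: callable taking a question and returning a ranked list of record IDs
    """
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    hits = 0
    for item in eval_set:
        top_k = set(retrieve(item["question"])[:k])
        if set(item["expected_ids"]) <= top_k:
            hits += 1
    return hits / len(eval_set)
```

Run it after every change to chunking, filtering or ranking, and keep the per-question misses: the questions that fail are usually more informative than the aggregate number.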
What we found: recall@10 around 78% out of the box, 92% after metadata-filtered retrieval, 96% after we added a re-ranker. Each step was worth its added cost; we wouldn't have known without the eval set.
Hallucination guardrails
The strict no-no's on a CAFM system are: never invent an asset number, never fabricate a work-order ID, never paraphrase a safety procedure, never make up a contractor name. The cost of a wrong answer in maintenance operations is real: people send technicians to the wrong assets, raise spurious purchase orders, and miss compliance windows.
We enforced this with a citation-required prompt pattern: every numerical or factual claim in the answer had to reference a specific work-order ID, asset ID, or PPM schedule ID. The post-generation guardrail layer parsed the answer, verified each cited ID existed in the CAFM database (a cheap integrity check), and rejected answers that cited IDs that didn't exist.
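A minimal sketch of that post-generation check, for work-order citations only. It assumes citations appear in the answer as `WO #12345` and uses a plain set of known IDs to stand in for the database lookup; the `check_citations` name and the citation format are assumptions, and a range such as `WO #12530-12537` would need extra handling because only the first number is captured.

```python
import re

# Assumed citation format: "WO #" followed by digits.
CITATION = re.compile(r"\bWO #(\d+)\b")


def check_citations(answer: str, known_ids: set[str]) -> tuple[bool, list[str]]:
    """Return (ok, unknown_ids) for an LLM answer.

    ok is False when the answer cites nothing, or cites any ID that is not
    in known_ids (a stand-in for an existence check against the CAFM database).
    """
    cited = CITATION.findall(answer)
    if not cited:
        return False, []
    unknown = [c for c in cited if c not in known_ids]
    return (not unknown), unknown
```

An answer that fails the check can be rejected outright or routed back for regeneration; either way the unverified ID never reaches the supervisor.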
For safety procedures and compliance content, we restricted the system to retrieve and quote rather than summarise. The system would return the source SOP verbatim with the source document name — not a paraphrase. Less elegant, much safer.
The cost story
Honest accounting on a representative month, ~3,000 supervisor queries:
- Embedding cost. Only changed records are re-embedded each night. ~$40/month at this volume.
- Vector storage. ~$15/month for the index size we ended up with.
- LLM tokens per query. Average prompt size ~2,500 tokens (instructions + retrieved context), average completion ~300 tokens. At the model tier we used, this came to ~$0.02 per query — ~$60/month at 3,000 queries.
- Re-ranker calls. ~$25/month.
- Total cloud cost: ~$140/month. Plus engineering time amortised over the build — the operating cost was the easy part.
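The arithmetic behind that total, using only the figures listed above (the variable names are just for illustration):

```python
QUERIES_PER_MONTH = 3_000
LLM_COST_PER_QUERY = 0.02  # USD, from the per-query estimate above

monthly_costs = {
    "embedding": 40.0,
    "vector_storage": 15.0,
    "llm_tokens": QUERIES_PER_MONTH * LLM_COST_PER_QUERY,
    "re_ranker": 25.0,
}

total = sum(monthly_costs.values())
cost_per_query = total / QUERIES_PER_MONTH  # all-in cloud cost per query

print(f"LLM tokens: ${monthly_costs['llm_tokens']:.2f}/month")
print(f"Total: ${total:.2f}/month, about ${cost_per_query:.3f} per query all-in")
```

All-in, that works out to a little under five cents of cloud cost per query, with LLM tokens the largest single line item.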
What I'd flag for budget conversations: the cloud cost is small relative to the value (a supervisor saving 10 minutes per query, several times a day, pays this back many times over). The bigger cost was the eight weeks of build, eval, tuning, and guardrails. Plan for that, not the inference bill.
What I'd do differently
- Build the eval set first. Two weeks lost because we built retrieval before knowing what good looked like.
- Start with metadata-filtered retrieval, not pure semantic. CAFM queries almost always have anchor entities (asset, date, site). Filter first, semantically rank within the filter. Faster, more accurate, cheaper.
- Restrict scope on launch. We tried to support free-form queries from day one. We should have launched with a fixed set of supported question patterns and expanded based on what supervisors actually asked.
- Permissioned context. Bake the user's RBAC into the retrieval layer, not the response layer. Sensitive records shouldn't reach the LLM at all if the user can't see them.
- Plan the de-bias review. A supervisor at one site noticed that the system phrased things slightly differently when their site appeared in the retrieved context. It turned out to be innocuous, but it took a day to investigate. Build the review channel for these reports into the rollout, not afterwards.
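The "Permissioned context" point can be sketched as a filter that runs before any prompt is built. This is an assumption-laden illustration: the `Record` class, the `required_role` field and the role names are hypothetical, and a real deployment would derive permissions from the CAFM system's own RBAC model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    record_id: str
    text: str
    required_role: str  # hypothetical: the role needed to view this record


def permitted(records, user_roles):
    """Drop every record the user is not allowed to see."""
    return [r for r in records if r.required_role in user_roles]


def build_context(records, user_roles):
    """Build the LLM context from permitted records only.

    Filtering happens here, in the retrieval layer, so restricted
    text never enters the prompt in the first place.
    """
    return "\n".join(
        f"[{r.record_id}] {r.text}" for r in permitted(records, user_roles)
    )
```

With this shape, a response-layer check becomes a second line of defence rather than the only one: the model cannot leak a record it was never shown.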
Related reading on the data side: Asset hierarchy design, CAFM data migration strategy, and RBAC design for CAFM systems — the access-control layer this RAG system has to respect.
Muhammad Abbas
CMMS / CAFM Manager & Independent Advisor · 22+ years in enterprise tech.