AI-Assisted Cost Coding in Hexagon EAM: Stopping Miscoded POs at the Source

A case study from inside a Hexagon EAM deployment. How an OpenAI integration on the PO entry path catches miscoded purchase orders before they reach finance, and what we got wrong on the first attempt.

Muhammad Abbas · May 20, 2026

Cost codes are the small invisible decision points where enterprise budgeting quietly breaks. Every PO line has to land on a code, every coding error sits in the system until someone notices at month-end, and the cumulative drag on project budgets and finance integrity is significant. This is the story of plugging an LLM into the PO entry path inside Hexagon EAM to catch those errors at the source, with honest notes on what worked and what didn't.

The hidden cost of miscoded POs

Miscoded purchase orders rarely fail loudly. They sit in the system tagged to the wrong project, the wrong cost centre, or the wrong GL category, and the impact only surfaces in three places: at month-end close when finance reconciles, at quarter-end when a project sponsor asks why their budget burned faster than planned, and during audits when a sample line has to be re-classified after the fact.

On the deployment I'll describe in this post, the finance team estimated that around eight to twelve percent of PO lines required some kind of re-classification before close. Some of that was minor (a typo on a sub-code). Some of it was material: a capex item booked as opex, a maintenance line charged to a project that had no budget for it. The cleanup work consumed days of finance time every month, and several project overruns traced back to errors that had been sitting in the system for weeks.

Where the errors come from

Before the AI work, I ran a short root-cause exercise with the procurement and finance teams. The pattern that emerged was not random user carelessness. It was structural.

  • Long dropdowns and codes that look alike. A typical chart of accounts in an asset-heavy organisation runs into the hundreds of active codes. Many of them are siblings (MAINT-CIVIL vs MAINT-CIVIL-ROADS vs MAINT-CIVIL-BUILDINGS). A tired user at 4pm picks the first plausible one.
  • New users guessing. Onboarding rarely covers the chart of accounts in any depth. New PO creators learn by trial, error, and finance pushback. The first six months of their PO history is the most error-prone.
  • Time pressure at quarter-end. The same week finance most needs accuracy is the week PO creators are pushing through the highest volume.
  • Cost codes that grow without communication. Finance adds a new sub-code mid-year, no one tells procurement, and POs that should hit the new code continue to hit the old generic parent.
  • "I'll fix it later." A user knows the code is approximate but plans to correct it after a clarifying conversation with finance. The clarifying conversation never happens.

Why rules-based validation falls short

The traditional fix is configuration: tighten validation rules in Hexagon EAM, restrict which codes a user can pick based on their department, and build approval workflows that require finance sign-off on the code before the PO is approved. We had all of this in place and it still left the miscode rate at around ten percent of PO lines.

Three reasons rules-based validation hits a ceiling:

  • Hard rules block too much. A real PO might legitimately need an unusual code (a one-off project, a transfer between cost centres). Blocking these in the UI forces workarounds that are themselves error-prone.
  • Soft warnings get ignored. "Are you sure?" dialogs train users to click through. After the third one, the message has no information value.
  • Approval workflows catch errors too late. By the time finance reviews a coded PO, the requester has moved on. The reviewer rarely has enough context to know whether the code is right, only that it looks plausible. Most miscoded POs sail through approval.

Rules are necessary, but they are not sufficient. They are good at preventing impossible codes. They are bad at picking the right code from a list of plausible ones.

The framing that unlocked this

"Picking the right code from a list of plausible ones" is a classification problem with rich context (line description, vendor, item, requester, history). Rules don't solve it well. Language models solve it well, if you give them the right context.

The pattern: LLM suggestion at PO entry

The system we built sits in the PO line entry path inside Hexagon EAM. As soon as a user fills in the line description, vendor, and item category, an OpenAI API call runs in the background. By the time the user reaches the cost code field, the system has three suggested codes ready, each with a confidence score and a one-line reason.

The user sees the three suggestions inline. They can accept the top one with a single click, pick the second or third, or override entirely. If they override the top suggestion, the form asks for a one-sentence reason. That reason is logged and reviewed quarterly.

The principle: never auto-apply, always show the reasoning, always capture the override. The system is a smart assistant, not an autopilot. The PO creator is still accountable for the code. The system reduces the cognitive load of finding the right one in a hurry.

Architecture inside Hexagon EAM

The integration sits as middleware between the Hexagon EAM PO entry form and the save action. The flow:

  1. Hexagon EAM PO line form fires a webhook on field-blur of the description, vendor, and item-category fields.
  2. Middleware service assembles the context for the LLM: PO line details, requester's department, similar past POs from the last 90 days, and the full chart of accounts with descriptions.
  3. OpenAI API call goes out with a structured prompt requesting JSON output: an array of three code suggestions, each with a confidence score and a reasoning string.
  4. Response parsing validates that every returned code exists in the active chart of accounts (a cheap integrity check that catches hallucinations) and rejects any suggestion below a confidence threshold.
  5. UI render back into the Hexagon EAM form as a panel below the code field with the three suggestions and a one-click apply.
  6. Audit log records every suggestion offered, every code accepted, and every override (with the user's stated reason).
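
Steps 2 to 4 of the flow above can be sketched roughly as follows. This is a minimal illustration, not the deployed middleware: the function names, the payload keys, the "suggestions" JSON shape, and the 0.6 threshold are all assumptions, and `llm` is a hypothetical callable standing in for whatever wrapper makes the actual OpenAI API call.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Suggestion:
    code: str
    confidence: float
    reasoning: str


def build_context(line: dict, department: str, similar_pos: list, coa: dict) -> dict:
    """Step 2: gather everything the model needs into one payload."""
    return {
        "po_line": line,
        "requester_department": department,
        "similar_past_pos": similar_pos,  # e.g. POs from the last 90 days
        "chart_of_accounts": coa,         # code -> description
    }


def parse_suggestions(raw: str, active_codes: set, min_confidence: float) -> list:
    """Step 4: keep only well-formed suggestions for codes that really exist."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []  # advisory system: a bad response simply means no suggestions
    items = payload.get("suggestions") if isinstance(payload, dict) else None
    if not isinstance(items, list):
        return []
    kept = []
    for item in items:
        if not isinstance(item, dict):
            continue
        code, confidence = item.get("code"), item.get("confidence")
        if not isinstance(code, str) or code not in active_codes:
            continue  # hallucinated or disabled code
        if not isinstance(confidence, (int, float)) or confidence < min_confidence:
            continue  # too weak to show
        kept.append(Suggestion(code, float(confidence), str(item.get("reasoning", ""))))
    kept.sort(key=lambda s: s.confidence, reverse=True)
    return kept[:3]


def suggest_codes(context: dict, llm: Callable[[str], str],
                  active_codes: set, min_confidence: float = 0.6) -> list:
    """Step 3 plus step 4: call the model (injected), then validate its answer."""
    raw = llm(json.dumps(context))
    return parse_suggestions(raw, active_codes, min_confidence)


if __name__ == "__main__":
    coa = {"MAINT-CIVIL-ROADS": "Road repairs", "MAINT-CIVIL-BUILDINGS": "Building repairs"}

    def fake_llm(prompt: str) -> str:  # stand-in for the real API call
        return json.dumps({"suggestions": [
            {"code": "MAINT-CIVIL-ROADS", "confidence": 0.9, "reasoning": "Asphalt is road work"},
            {"code": "MADE-UP-CODE", "confidence": 0.95, "reasoning": "Does not exist"},
            {"code": "MAINT-CIVIL-BUILDINGS", "confidence": 0.2, "reasoning": "Weak match"},
        ]})

    ctx = build_context({"description": "Asphalt resurfacing"}, "Civil", [], coa)
    print([s.code for s in suggest_codes(ctx, fake_llm, set(coa))])  # ['MAINT-CIVIL-ROADS']
```

Injecting the model call as a plain function keeps the validation logic testable without network access, which matters when the suggestion pass has to fail quietly rather than block the form.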

Two design choices were load-bearing. First, caching the chart of accounts (with descriptions and parent-child relationships) and refreshing it nightly, not per-request. Re-reading the full COA from Hexagon EAM on every request would be wasteful, but a stale COA produces stale suggestions. Nightly refresh was the right cadence for this deployment. Second, treating the suggestion pass as advisory: the PO can be saved without ever accepting a suggestion. The user is not blocked, and the system is not in the critical path.
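
One way to picture the caching choice is the sketch below. It is an assumption-laden illustration: the real deployment may well have used a scheduled nightly job, whereas this version swaps in a lazy max-age check (reload on first use after 24 hours). `CoaCache`, `loader`, and `clock` are hypothetical names, and `loader` stands for whatever call actually reads the chart of accounts out of Hexagon EAM.

```python
from datetime import datetime, timedelta
from typing import Callable, Optional


class CoaCache:
    """Holds the chart of accounts in memory and reloads it once it is a day old."""

    def __init__(self, loader: Callable[[], dict],
                 max_age: timedelta = timedelta(hours=24),
                 clock: Callable[[], datetime] = datetime.now):
        self._loader = loader      # hypothetical: reads the COA from the source system
        self._max_age = max_age
        self._clock = clock        # injectable so the refresh logic can be tested
        self._data: Optional[dict] = None
        self._loaded_at: Optional[datetime] = None

    def get(self) -> dict:
        now = self._clock()
        stale = self._loaded_at is None or now - self._loaded_at >= self._max_age
        if stale:
            self._data = self._loader()
            self._loaded_at = now
        return self._data


if __name__ == "__main__":
    cache = CoaCache(loader=lambda: {"MAINT": "Maintenance"})
    cache.get()  # first call loads from the source
    cache.get()  # later calls within 24 hours reuse the cached copy
```

The trade-off is the one described above: a sub-code that finance adds at 10am is invisible to the suggestions until the next refresh.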

The prompt engineering that mattered

Three things turned a mediocre prompt into a useful one:

  • The full COA hierarchy in context, not just a flat list. Cost codes have parents and children. Sending "MAINT-CIVIL-ROADS" without telling the LLM that it sits under "MAINT-CIVIL" (which sits under "MAINT") meant the model couldn't reason about hierarchy. Including the tree, with a short description on each node, made the suggestions sharply more accurate.
  • Few-shot examples of correct codings. For each major category, we included two or three examples of "this kind of line goes to this code, because..." The examples taught the model the organisation's specific conventions (which a generic LLM has no way to know). When we added new examples after seeing common errors, the model corrected on the next deployment.
  • Forced structured JSON output with a reasoning field. Free-form responses produced suggestions you couldn't parse reliably. Forcing JSON gave us clean machine-readable output. The required reasoning field had a second benefit: when the model has to articulate why it picked a code, its picks improve. And humans can sanity-check the reasoning at a glance.
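
The three points above can be combined into a single prompt builder. The sketch below is illustrative only: the node layout (`parent`, `description`), the example fields (`line`, `code`, `why`), and the exact prompt wording are assumptions, not the prompt used on the deployment.

```python
import json
from typing import Optional


def render_tree(nodes: dict, parent: Optional[str] = None, depth: int = 0) -> list:
    """Render the COA as indented lines so parent-child links are visible to the model."""
    lines = []
    for code, node in sorted(nodes.items()):
        if node["parent"] == parent:
            lines.append(f"{'  ' * depth}{code}: {node['description']}")
            lines.extend(render_tree(nodes, code, depth + 1))
    return lines


def build_prompt(nodes: dict, examples: list, po_line: dict) -> str:
    """Hierarchy, few-shot examples, and a forced JSON shape in one prompt."""
    tree = "\n".join(render_tree(nodes))
    shots = "\n".join(
        f'- "{e["line"]}" -> {e["code"]}, because {e["why"]}' for e in examples
    )
    return (
        "You suggest cost codes for purchase order lines.\n\n"
        f"Chart of accounts (indented by hierarchy):\n{tree}\n\n"
        f"Examples of correct codings:\n{shots}\n\n"
        f"PO line:\n{json.dumps(po_line)}\n\n"
        "Reply with JSON only, in the form "
        '{"suggestions": [{"code": "...", "confidence": 0.0, "reasoning": "..."}]} '
        "with exactly three suggestions."
    )


if __name__ == "__main__":
    coa = {
        "MAINT": {"parent": None, "description": "Maintenance"},
        "MAINT-CIVIL": {"parent": "MAINT", "description": "Civil maintenance"},
        "MAINT-CIVIL-ROADS": {"parent": "MAINT-CIVIL", "description": "Road repairs"},
    }
    examples = [{"line": "Pothole repair, access road",
                 "code": "MAINT-CIVIL-ROADS",
                 "why": "it is road surface work"}]
    print(build_prompt(coa, examples, {"description": "Asphalt resurfacing"}))
```

Keeping the few-shot examples as data rather than prose makes it straightforward to append new ones when a recurring error shows up.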

What didn't work on the first attempt: trying to pre-compute embeddings for each cost code and retrieve "similar" codes for the prompt. The LLM did better when it could see the whole chart of accounts than when we tried to be clever about narrowing it down. For a COA of a few hundred codes, just send the lot.

Guardrails and human-in-the-loop

Five guardrails kept the system trustworthy:

  1. Top-3 only, never auto-apply. The user always has to click. The system never picks for them.
  2. Confidence threshold for suggesting at all. If the model's top suggestion has confidence below a set threshold, the suggestion panel shows "no high-confidence match, please pick manually" instead of forcing a weak guess.
  3. Code existence check. The middleware verifies every returned code exists in the active chart of accounts and is not disabled. Hallucinations get caught here before they ever reach the user.
  4. Override-with-reason. When a user overrides the top suggestion, they write one sentence about why. Those reasons are categorised quarterly and fed back into the few-shot examples.
  5. Quarterly drift review. Once a quarter, finance picks a sample of accepted suggestions and a sample of overrides, and reviews accuracy. If the model is drifting, the prompt and examples get refreshed.
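
Guardrails 1, 2, and 4 are simple enough to express in a few lines. The sketch below is hypothetical: `panel_for` and `record_choice` are illustrative names, the dictionary shapes are assumed, and it treats any choice other than the top suggestion (including the second or third option) as an override, which is one reading of the rule rather than a confirmed detail of the deployment.

```python
from typing import Optional

NO_MATCH_MESSAGE = "no high-confidence match, please pick manually"


def panel_for(suggestions: list, threshold: float) -> dict:
    """Guardrails 1 and 2: offer at most three options, or an explicit 'no match'."""
    ranked = sorted(suggestions, key=lambda s: s["confidence"], reverse=True)[:3]
    if not ranked or ranked[0]["confidence"] < threshold:
        return {"message": NO_MATCH_MESSAGE, "options": []}
    return {"message": None, "options": ranked}


def record_choice(panel: dict, chosen_code: str, reason: Optional[str] = None) -> dict:
    """Guardrail 4: overriding the top suggestion must carry a stated reason."""
    top = panel["options"][0]["code"] if panel["options"] else None
    overridden = top is not None and chosen_code != top
    if overridden and not (reason and reason.strip()):
        raise ValueError("override of the top suggestion requires a reason")
    # The returned record is what would be written to the audit log.
    return {"top_suggestion": top, "chosen": chosen_code,
            "overridden": overridden, "reason": reason}


if __name__ == "__main__":
    panel = panel_for([{"code": "MAINT-CIVIL-ROADS", "confidence": 0.9},
                       {"code": "MAINT-CIVIL", "confidence": 0.5}], threshold=0.6)
    print(record_choice(panel, "MAINT-CIVIL", reason="Work spans roads and buildings"))
```

Nothing in this code applies a code to the PO. The user's click supplies `chosen_code`, which is the point of guardrail 1.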

The test that proved it worked

We measured the miscode rate at month-end before the system rolled out, and at the same point three months after. The headline number dropped to roughly a third of the baseline, and the finance team's cleanup time at close fell proportionally. PO entry time did not increase. The system is a net help, not a tax on the user.

What to measure

If you're considering something similar, instrument these from day one. Without numbers you cannot tell whether the system is earning its cost.

  • Miscode rate at month-end. The headline metric. The percentage of PO lines that finance has to re-classify before close. This is what you exist to reduce.
  • PO entry time. If the system slows down the user, adoption collapses and you have a different problem. Measure time from form open to form save, before and after.
  • Suggestion acceptance rate. The percentage of POs where the user accepted the top suggestion. If this is very high, the model is confident and right. If it's very low, either the COA is hard to model or the prompt needs work.
  • Override-with-reason analysis. Categorise the reasons on a fixed cadence (quarterly on this deployment). They tell you where the model is consistently wrong and where the COA itself may need clarification.
  • Downstream cost of corrections. The finance team's hours per month spent on re-classification. This is the dollar-equivalent of the headline metric and the figure to put in front of a CFO.
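
Most of these metrics fall straight out of the audit log. The sketch below assumes a hypothetical log shape (one dictionary per PO line with `top_suggestion`, `chosen`, `overridden`, and a reviewer-assigned `category`), and the numbers in the usage example are made up purely to show the arithmetic.

```python
from collections import Counter


def miscode_rate(total_lines: int, reclassified_lines: int) -> float:
    """Share of PO lines finance had to re-classify before close."""
    return reclassified_lines / total_lines if total_lines else 0.0


def acceptance_rate(audit_log: list) -> float:
    """Share of lines with a suggestion where the user took the top one."""
    offered = [e for e in audit_log if e["top_suggestion"] is not None]
    if not offered:
        return 0.0
    accepted = sum(1 for e in offered if e["chosen"] == e["top_suggestion"])
    return accepted / len(offered)


def override_reasons(audit_log: list) -> Counter:
    """Count overrides by the category a reviewer assigned to the stated reason."""
    return Counter(e.get("category", "uncategorised")
                   for e in audit_log if e.get("overridden"))


if __name__ == "__main__":
    # Illustrative data only, not figures from the deployment.
    log = [
        {"top_suggestion": "A", "chosen": "A", "overridden": False},
        {"top_suggestion": "A", "chosen": "B", "overridden": True, "category": "new sub-code"},
        {"top_suggestion": None, "chosen": "C", "overridden": False},
    ]
    print(miscode_rate(total_lines=1000, reclassified_lines=100))
    print(acceptance_rate(log))
    print(override_reasons(log))
```

PO entry time and finance hours need their own instrumentation (form timestamps and timesheets), so they are left out here.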

For the broader framing on measurement and justification, see the companion piece on how to build the ROI business case.

Where this would break

This pattern is not universally applicable. Four situations where I would not recommend it:

  • The chart of accounts itself is unclear. If the codes overlap, are inconsistent across sites, or are constantly being renamed, the AI will confidently produce wrong answers. Fix the COA first. The AI assists a clean COA, but it cannot rescue a broken one.
  • Very low PO volume. If your organisation does fifty POs a month, the engineering effort and ongoing operating cost won't pay back. Rules and good training are enough.
  • Highly regulated environments without LLM approval. Some sectors (defence, certain government tiers) cannot send descriptive PO data to a third-party API. A self-hosted model is technically feasible but the build cost goes up substantially.
  • Cost codes that change weekly. If the COA is in flux, the nightly cache breaks down. The system can be adapted but the maintenance overhead grows.

The honest limitation

An LLM that suggests cost codes is only as good as the chart of accounts it is reading from. If your COA is the real problem, this project will surface it loudly. That's a feature, not a bug, but it can be uncomfortable when the data quality conversation lands on someone else's desk.

Natural extensions

The cost code use case is the cleanest place to start. Once the pattern is proven, the same approach extends naturally to other points in the PO and asset workflow:

  • Asset assignment. Which asset record this work order belongs to. Same pattern: line description plus past work-order history points at the right asset. Read more on asset hierarchy design for the structure this would sit on.
  • Project / WBS code suggestion. The same logic applied to project breakdown structures.
  • Vendor selection nudge. "For this kind of item, your three previous lowest-cost vendors were these." Less classification, more decision support, same architecture.
  • GL category mapping for finance integration. Pre-validating how a PO will land in the GL when it flows into finance. Pairs naturally with three-way matching on the downstream side.
  • Procurement workflow optimisation. Same data, broader framing, covered in the procurement workflow pillar.

Related reading on the AI side: Document AI in procurement, practical prompt engineering, and AI governance, the framework that should sit over any deployment like this one.

Muhammad Abbas

CMMS / CAFM Manager & Independent Advisor · 22+ years in enterprise tech.
