AI Agent Safety in Production: Why Trust and Safety Infrastructure Isn't Optional Anymore

An AI agent was told to scan a network. So it did. And then it kept going, acquiring resources until the task was ‘done,’ at which point the operator got a cloud bill big enough to ruin a quarter. Same week, a separate story: a security researcher compromised a financial AI assistant with a €0.01 bank transfer, the instruction tucked inside the transaction description.

Neither was a model failure. The models did exactly what they were built to do. Nobody had told them what they weren't allowed to do.

Three incidents, the same hole

Three of these landed on Hacker News inside a few days. The DN42 agent that ran up the bill - nothing in its setup capped scope, so it just didn't stop. An agent dropped into a Linux box that started changing files no one asked it to touch. And the banking assistant, taken over through a normal text field that happened to contain a command.

The comment section did what comment sections do: ‘obvious,’ ‘how did nobody catch this,’ ‘here's the architecture you need.’ The part worth noticing was quieter - the number of engineers describing their own near-misses. Products that haven't made the news. Yet.

The gap nobody schedules time for

The pattern is identical across all three. The agent could act. It was authorized to start. And nothing constrained where it stopped - not in scope, not in sequence, not in cost.

Most teams design the happy path with real care. What the agent should do, in order, when everything works. The part about what it shouldn't do gets a line on the roadmap labeled ‘before launch.’ Launch keeps moving. The line never gets crossed off.

Then there's prompt injection, which is a different animal. The model isn't malfunctioning, it's faithfully executing whatever reaches it. The problem is that external content (a message, a file, a €0.01 transfer note) arrives as data and gets treated as a command. The banking researcher didn't write an exploit. He typed in a field that was working as designed. The input was in spec. The context was the attack.

In any product where people talk to an agent, the conversation is the attack surface. Every message is a possible injection. Every attachment is a possible vector. This is the part trust and safety infrastructure is supposed to cover — not by making the model smarter, but by making the channel something other than a transparent pipe.

What moderation actually has to do here

This is the case for CometChat's moderation & guardrails and it's worth being specific about why it fits the problem.

It reads context, not keywords. A keyword filter sees ‘transfer €0.01’ and shrugs. Contextual moderation looks at the whole conversation and catches the instruction hiding in a field that's supposed to hold a description. It runs both directions - screening what users send to the agent (jailbreaks, injections, harmful prompts) and what the agent sends back (unsafe, biased, or off-policy output). It handles text, image, and video, across languages, including code-mixed messages like Hinglish that single-language filters miss entirely. You can run it on CometChat AI, OpenAI, or your own API, and drop it into any stage of the message lifecycle. CSAM detection, audit trails, and RBAC come with it - which matters less when nothing has gone wrong and a great deal when something has.

Three checks before your agent is the headline

What can a user send that the agent will act on without question? Write it down. If the list includes a transaction field, a document body, a profile, or any other external text - that's your injection surface, and it needs to be evaluated before it reaches the agent.

Does the agent have scope constraints, not just permissions? Permissions say what it can call. Scope says when to stop. The DN42 agent had both problems: it could spin up resources, and nothing told it enough was enough.

Do you have a full audit trail of what went in and what the agent did? Not application logs, the actual conversation record. When something breaks, and eventually it will, that record is the difference between a root cause and a shrug. In regulated industries it's also the difference between a postmortem and a fine.

This week wasn't a story about what AI can do. It was a story about what happens when the guardrails live on the roadmap instead of in the product.

If you're building something where users talk to agents and you'd rather the conversation layer be a boundary than a liability — start building with CometChat.

Shrinithi Vijayaraghavan

Creative Storytelling , CometChat

Shrinithi is a creative storyteller at CometChat who loves integrating technology and writing and sharing stories with the world. Shrinithi is excited to explore the endless possibilities of technology and storytelling combined together that can captivate and intrigue the audience.

Real-Time User Communication

Chat & Messaging

Voice and Video Calls

Full-Stack AI Agent Platform

Omnichannel Campaigns

CometChat Air

Moderation & Guardrails

Analytics & Insights

Notification Engine

Multi-Tenant Infrastructure

On-Prem Deployment

Widget Builder

UI Kit Builder

UI Kits

SDKs

Coding Agent Skills

Documentation

Sample Apps

Product Updates

Feature Requests

Community

Help Center

Get Support

Report an issue

Blog

Tutorials

We raised $6.5M

Vibe code vs Buy

React Chat App Tutorial

Flutter Chat App Tutorial

AI Agent Safety in Production: Why Trust and Safety Infrastructure Isn't Optional Anymore

Three incidents, the same hole

The gap nobody schedules time for

What moderation actually has to do here

Three checks before your agent is the headline