A retrieval augmented generation pipeline, or RAG pipeline, connects a large language model to external knowledge at answer time. Instead of relying only on the model's training data, the system retrieves relevant documents, passes them into the prompt, and asks the model to answer using that context.
RAG is useful when answers need to reflect current, private, or domain-specific information: product documentation, support articles, policies, contracts, catalogues, incident notes, or internal runbooks. It can reduce hallucinations and make AI systems easier to update because teams can change the source material without retraining the model.
Building a good RAG pipeline is not just a data engineering task. It is also an application security and governance task. The system needs trusted sources, access control, safe retrieval, useful evaluation, and monitoring once real users depend on it.
RAG systems often sit close to sensitive information. A customer assistant may retrieve account-specific support data. An employee assistant may search HR documents or security runbooks. A sales assistant may use contract language and pricing guidance. If retrieval is wrong, stale, or over-permissive, the AI system can expose information or produce misleading answers.
RAG also changes how public and private content is valued. Documents that were once ordinary web pages may become source material for AI products. Public sites may see scraping from systems building retrieval indexes. Some of that collection may be legitimate, but some may violate policy, overload routes, or extract commercially sensitive data. Site owners should understand LLM web scrapers and crawler controls before assuming all AI-related traffic is harmless.
Start with a specific use case, not a generic "chat with our data" goal. A RAG system for public documentation has different requirements from a system that answers account questions, drafts incident reports, or helps security teams investigate suspicious traffic.
Define the users, allowed data, expected questions, unacceptable answers, and escalation path. Decide whether the system may only answer, or whether it can also trigger actions through APIs. If actions are involved, the RAG pipeline becomes part of an application workflow and needs stronger controls.
Good early questions include:
RAG quality depends heavily on source quality. Collect documents from approved repositories, remove duplicates, archive stale material, and identify the owner of each collection. If a document is not trusted enough for a human employee to rely on, it is probably not trusted enough for a RAG system to cite.
Break documents into chunks that preserve meaning. Chunks that are too small may lose context. Chunks that are too large may include irrelevant text and reduce retrieval precision. Include metadata such as title, source, owner, publication date, access level, product area, and document type.
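The chunking guidance above can be sketched as a word-window splitter with overlap, so that neighbouring chunks share some context. This is a minimal illustration, not a recommended production chunker: the `Chunk` class, the `chunk_document` function, and the default sizes are all hypothetical names and values chosen for the example, and real systems often split on headings or sentences instead of raw word counts.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_document(text: str, metadata: dict,
                   max_words: int = 200, overlap: int = 40) -> list[Chunk]:
    """Split text into overlapping word windows and copy metadata onto each chunk."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks: list[Chunk] = []
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        # Each chunk carries the document metadata plus its own position.
        chunks.append(Chunk(" ".join(window), {**metadata, "chunk_index": len(chunks)}))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the document
    return chunks
```

Carrying fields such as access level and publication date on every chunk is what later lets retrieval filter by permission or freshness without going back to the source document.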
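Access control should be designed before indexing. A common mistake is to put all documents into one vector database and filter only after retrieval. With that design, restricted chunks are still fetched on every query, so a filter bug, a reranking step, or a debug log can expose them, and filtering after the top results are selected can leave a permitted user with too few relevant chunks. Safer designs apply permissions during retrieval so users cannot retrieve chunks they are not allowed to see.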
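As a rough sketch of applying permissions during retrieval, the example below filters an in-memory index by group membership before any similarity scoring happens. The `IndexedChunk` type, the `allowed_groups` field, and the `retrieve` function are assumptions made for illustration; a real vector database would express the same idea through its own metadata filter mechanism.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedChunk:
    text: str
    vector: tuple[float, ...]
    allowed_groups: frozenset[str]  # hypothetical access-control metadata


def dot(a, b) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vector, index: list[IndexedChunk],
             user_groups: set[str], k: int = 3) -> list[IndexedChunk]:
    """Return the top-k chunks the user is allowed to see."""
    # The permission check runs BEFORE scoring, so restricted chunks
    # never enter the candidate set at all.
    visible = [c for c in index if c.allowed_groups & user_groups]
    ranked = sorted(visible, key=lambda c: dot(query_vector, c.vector), reverse=True)
    return ranked[:k]
```

Because the filter comes first, a highly similar but restricted chunk cannot displace permitted results or appear in intermediate logs.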
Teams should also plan for data poisoning and source manipulation. If public pages, user-generated content, tickets, or external documents can enter the index, treat them as untrusted until reviewed. Version indexes so a bad source update can be rolled back.
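One way to picture index versioning is a store that keeps every published snapshot and a pointer to the active one, so a rollback is just a pointer change. The `VersionedIndex` class below is a hypothetical in-memory sketch; real deployments would more likely version collections or aliases in their search backend.

```python
class VersionedIndex:
    """Keeps immutable snapshots of the index so a bad update can be undone."""

    def __init__(self):
        self._versions = []  # each entry is a tuple of chunks
        self._active = -1    # position of the snapshot currently served

    def publish(self, chunks) -> int:
        """Store a new snapshot, make it active, and return its version number."""
        self._versions.append(tuple(chunks))
        self._active = len(self._versions) - 1
        return self._active

    def rollback(self, version: int) -> None:
        """Point the active index back at an earlier snapshot."""
        if not 0 <= version < len(self._versions):
            raise ValueError(f"unknown version: {version}")
        self._active = version

    @property
    def active(self):
        return self._versions[self._active] if self._active >= 0 else ()
```

If a poisoned source is discovered after publishing, calling `rollback` with the last known good version restores the earlier content without re-ingesting anything.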
A typical RAG pipeline has five parts.
First, ingest documents from approved sources. This may include crawling, API import, file upload, or database export.
Second, transform the content. Clean formatting, split documents into chunks, attach metadata, and remove material that should not be indexed.
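A small part of the transform step, cleaning whitespace and dropping exact duplicates, might look like the sketch below. The function names are hypothetical, and hashing only catches documents that are identical after normalisation; near-duplicates need a different technique.

```python
from __future__ import annotations

import hashlib
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(documents: list[str]) -> list[str]:
    """Return cleaned documents, skipping empties and case-insensitive exact repeats."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        cleaned = normalize(doc)
        if not cleaned:
            continue
        digest = hashlib.sha256(cleaned.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same content already kept
        seen.add(digest)
        unique.append(cleaned)
    return unique
```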
Third, create embeddings. An embedding model converts each chunk into a vector representation that can be searched by semantic similarity.
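To make "searched by semantic similarity" concrete, the sketch below compares vectors with cosine similarity, a common choice for this comparison. The vectors in the usage note are hand-written toy values, not output from any real embedding model, and the function names are illustrative.

```python
from __future__ import annotations

import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query: list[float], vectors: dict[str, list[float]]) -> str:
    """Return the key whose vector is closest to the query (vectors must be non-empty)."""
    return max(vectors, key=lambda name: cosine_similarity(query, vectors[name]))
```

With toy vectors such as `{"refunds": [0.9, 0.1, 0.0], "shipping": [0.1, 0.9, 0.0]}`, a query vector of `[0.8, 0.2, 0.0]` points in nearly the same direction as the "refunds" entry, so that chunk would be retrieved first.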
Fourth, retrieve relevant chunks when the user asks a question. Retrieval may use vector search, keyword search, filters, reranking, or a combination. For security-sensitive systems, retrieval must respect user permissions and source trust.
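When vector search and keyword search are combined, their result lists need to be merged somehow. One widely used option is reciprocal rank fusion, which scores each document by summing 1 / (k + rank) across the lists. The sketch below assumes both searches have already applied permission filters and return document IDs in ranked order; the function name is illustrative and the smoothing constant of 60 is a conventional default, not something the source article prescribes.

```python
from __future__ import annotations


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked ID lists into one list ordered by fused score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties are broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Documents that appear near the top of both lists rise above documents that only one method found, without needing the two methods' raw scores to be comparable.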
Fifth, generate the answer. The prompt should instruct the model to use retrieved context, cite or reference sources where appropriate, and say when the available context is insufficient.
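The prompt instructions described above could be assembled along these lines. The wording of the instructions, the `build_prompt` name, and the `source` and `text` keys are assumptions for illustration; the exact phrasing that works best depends on the model in use.

```python
from __future__ import annotations


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Build a grounded prompt from retrieved chunks with 'source' and 'text' keys."""
    if chunks:
        context = "\n\n".join(
            f"[{i}] ({chunk['source']})\n{chunk['text']}"
            for i, chunk in enumerate(chunks, start=1)
        )
    else:
        context = "(no context retrieved)"
    return (
        "Answer the question using only the context below.\n"
        "Cite sources by their bracketed number.\n"
        "If the context does not contain the answer, "
        "say that you do not have enough information.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering each chunk gives the model a simple way to cite, and gives reviewers a way to check each claim against its source.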
RAG evaluation should measure more than whether the answer sounds fluent. At minimum, test retrieval quality, answer faithfulness, completeness, refusal behavior, and access control.
Retrieval quality asks whether the system found the right sources. Faithfulness asks whether the answer is grounded in those sources rather than invented. Completeness asks whether the answer includes the important details needed by the user. Refusal behavior asks whether the system declines unsafe or unsupported requests. Access-control testing confirms that users cannot retrieve data outside their permissions.
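Two of these checks are easy to automate once a labelled question set exists. The sketch below computes recall at k for retrieval quality and lists any retrieved IDs outside a user's permitted set for access-control testing. Both function names are hypothetical, and faithfulness, completeness, and refusal behavior usually need human or model-based grading that is not shown here.

```python
from __future__ import annotations


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def leaked_ids(retrieved: list[str], permitted: set[str]) -> list[str]:
    """IDs that were returned even though the test user may not see them."""
    return [doc_id for doc_id in retrieved if doc_id not in permitted]
```

An access-control test passes only when `leaked_ids` returns an empty list for every test user and query.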
Use realistic question sets from support tickets, search logs, internal FAQs, and incident reviews. Include adversarial tests: ambiguous questions, outdated documents, conflicting sources, malicious instructions inside retrieved text, and attempts to reveal restricted information.
RAG systems often rely on APIs for document ingestion, search, authentication, and answer delivery. Those APIs need ordinary security controls: authentication, authorization, schema validation, rate limits, monitoring, and clear error handling. The foundations are covered in the related guides "what is API security" and "what is REST API security".
If the RAG system is exposed publicly, also plan for automated abuse. Attackers may enumerate prompts, extract snippets from the index, test for prompt injection, or use the system as a search interface for sensitive data. Apply rate limits, abuse detection, session controls, and alerting around unusual query patterns.
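As one example of the rate limits mentioned above, a sliding-window limiter tracks recent request times per client and rejects requests beyond a threshold. The class below is an in-memory sketch with hypothetical names; a multi-server deployment would need shared state, and the caller passes the current time explicitly (for example from `time.monotonic()`) so the logic is easy to test.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within any window_seconds span."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client_id -> timestamps of accepted requests

    def allow(self, client_id: str, now: float) -> bool:
        hits = self._hits[client_id]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

Rejected requests are also a useful signal for alerting, since a client that repeatedly hits the limit may be enumerating prompts or extracting index content.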
For websites that publish source material, monitor crawler behavior. AI-related crawlers may request documentation, product pages, articles, or pricing information at scale. Teams can start with the guide on how to detect AI crawlers, then use the guide on how to block AI crawlers to decide whether to allow, limit, challenge, or block each one.
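A first pass at crawler monitoring is often a user-agent match that maps known crawler tokens to one of the four actions. The token list below is a short, illustrative sample of publicly documented crawler names, not a complete or current inventory, and the policy mapping and function names are hypothetical. User-agent strings can be spoofed, so this kind of check is a starting signal and not proof of identity.

```python
from __future__ import annotations

# Illustrative sample only; operators publish and change these tokens over time.
AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "CCBot", "PerplexityBot")


def match_ai_crawler(user_agent: str) -> str | None:
    """Return the first known crawler token found in the user agent, if any."""
    lowered = user_agent.lower()
    for token in AI_CRAWLER_TOKENS:
        if token.lower() in lowered:
            return token
    return None


def decide(user_agent: str, policy: dict[str, str], default: str = "allow") -> str:
    """Pick allow, limit, challenge, or block for a request."""
    token = match_ai_crawler(user_agent)
    if token is None:
        return default  # not a recognised AI crawler
    # Recognised crawlers without an explicit rule are rate limited.
    return policy.get(token, "limit")
```

For instance, with a policy of `{"GPTBot": "block"}`, a request whose user agent contains `GPTBot/1.0` would be blocked, while a recognised crawler with no explicit rule would fall back to `"limit"`.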
Before moving a RAG pipeline into production, confirm:
RAG can make AI systems more useful and more grounded, but only when retrieval is governed. The strongest RAG pipelines combine good data hygiene, careful access control, measured evaluation, and operational monitoring.
© PEAKHOUR.IO PTY LTD 2025 ABN 76 619 930 826 All rights reserved.