Crawling an internal website

Question

Martin Kafka

Hi,

I am building a PoC for a RAG-based agent in Foundry. One source is an internal hub: the agent should crawl its pages for region-specific information, and access should be authenticated through Entra ID so that a German user does not see information meant for a Swedish one. How should I approach this properly in the PoC, and then carry it over to PROD?

Thanks

2 answers

Answer 1

Hi @Martin Kafka

For a PoC I'd skip crawling at index time. Instead, use a custom API tool that fetches the internal page at query time, authenticated as the calling user.

Here's how I'd wire it up:

1. The user asks a question, and the agent calls your custom API tool with the user's Entra token.
2. The API fetches the internal hub page via the OBO (On-Behalf-Of) flow, so the region filtering follows naturally from the user's own Entra ID identity.
3. The content comes back to the agent as context.
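To make the OBO step more concrete, here is a minimal Python sketch of the token exchange the custom API would perform. It only builds the HTTP requests and does not send them. The function names, tenant ID, client credentials, and hub URL are placeholders I made up for illustration; the form fields follow the Microsoft identity platform OBO flow. In a real service you would more likely let a library such as MSAL do this exchange (MSAL for Python exposes `acquire_token_on_behalf_of`) and keep the secret in a vault.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Grant type the Microsoft identity platform uses for the OBO exchange.
OBO_GRANT_TYPE = "urn:ietf:params:oauth:grant-type:jwt-bearer"


def build_obo_request(tenant_id: str, client_id: str, client_secret: str,
                      user_token: str, downstream_scope: str) -> Request:
    """Build (but do not send) the request that swaps the caller's token
    for a token scoped to the internal hub. All arguments are placeholders."""
    form = {
        "grant_type": OBO_GRANT_TYPE,
        "client_id": client_id,
        "client_secret": client_secret,
        "assertion": user_token,          # token the agent passed to the tool
        "scope": downstream_scope,        # scope of the internal hub (assumed)
        "requested_token_use": "on_behalf_of",
    }
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    return Request(
        url,
        data=urlencode(form).encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def build_hub_request(page_url: str, hub_access_token: str) -> Request:
    """Build the request for the hub page, sent as the user with the
    access token returned by the OBO exchange."""
    return Request(page_url, headers={"Authorization": f"Bearer {hub_access_token}"})
```

The point of the design is that the hub only ever sees a token that represents the calling user, so whatever region rules the hub already enforces apply unchanged.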

This way you're not storing region-restricted content in a shared index, which gets complicated fast. For PROD I'd go with the same pattern but tighten up the OBO flow and add caching for content that isn't user-specific.
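As a rough illustration of the caching idea, here is a small Python sketch that caches only pages marked as shared and always refetches user-specific ones. `SharedPageCache`, `fetch_page`, and the `user_specific` flag are hypothetical names; how you decide whether a page is user-specific depends on your hub and is assumed to be known by the caller.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class SharedPageCache:
    """Tiny in-memory TTL cache for pages that are identical for every user."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> Optional[str]:
        entry = self._entries.get(url)
        if entry is None:
            return None
        stored_at, content = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[url]  # entry expired, drop it
            return None
        return content

    def put(self, url: str, content: str) -> None:
        self._entries[url] = (self._clock(), content)


def fetch_page(url: str, user_specific: bool, cache: SharedPageCache,
               fetch: Callable[[str], str]) -> str:
    """Serve shared pages from the cache; never cache user-specific ones."""
    if user_specific:
        return fetch(url)  # content depends on the caller, so always refetch
    cached = cache.get(url)
    if cached is not None:
        return cached
    content = fetch(url)
    cache.put(url, content)
    return content
```

Keeping user-specific responses out of the cache entirely is the conservative choice: one user's region content can never be served to another by accident.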

If you really want to pre-index, look into Azure AI Search security trimming: tag each doc with its allowed regions/groups and filter at query time using the user's claims. But honestly, for a PoC the custom tool approach is simpler and gets the auth right out of the box.

Answer 2

Hi Martin Kafka, and thanks for sharing your issue here on the Q&A portal.

For this PoC, don’t let the agent ‘freely crawl’ the internal site at answer time. That’s risky and hard to secure.

Instead, crawl/index the site with a controlled backend process, store the content in Azure AI Search, and add metadata such as region, country, department, securityGroup, or allowedUsers. Then, at query time, filter results based on the signed-in user’s Entra ID claims.
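A small Python sketch of the tagging step, assuming the crawler already knows each page's region and allowed groups. The field names `region` and `allowedGroups` mirror the ones used in this answer; `tag_document`, `make_document_key`, and the URL-safe Base64 key are my own illustrative choices (the encoding keeps the key within the characters Azure AI Search accepts for document keys), not a fixed schema.

```python
import base64
from typing import Dict, List


def make_document_key(url: str) -> str:
    """URL-safe Base64 of the page URL, so the key contains only
    letters, digits, '-', '_' and '='."""
    return base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii")


def tag_document(url: str, content: str, region: str,
                 allowed_groups: List[str]) -> Dict[str, object]:
    """Attach access metadata to a crawled page before it is indexed."""
    if not allowed_groups:
        # Fail closed: refuse to index a page with no access metadata.
        raise ValueError(f"no allowed groups for {url}")
    return {
        "id": make_document_key(url),
        "url": url,
        "content": content,
        "region": region,                              # e.g. 'DE' or 'SE'
        "allowedGroups": sorted(set(allowed_groups)),  # Entra group IDs
    }
```

Refusing to index a page without access metadata means an untagged page fails loudly at crawl time instead of silently becoming visible to everyone.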

The flow: user signs in with Entra ID -> app gets the user’s claims/groups -> app calls the agent/search layer -> search query includes a security filter -> agent only sees docs the user is allowed to see.

A German user gets a filter like: region eq 'DE'

Or, group-based: allowedGroups/any(g: g eq 'group-id-for-germany')
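Here is a minimal Python sketch of building those two filter strings from the user's claims. The helper names are hypothetical, and the field names `region` and `allowedGroups` are assumed to match your index. Values are quoted in code (OData escapes a single quote by doubling it) so a claim value cannot break out of the filter expression.

```python
from typing import Iterable


def odata_quote(value: str) -> str:
    """Quote a string literal for OData; single quotes are escaped by doubling."""
    return "'" + value.replace("'", "''") + "'"


def region_filter(region: str) -> str:
    """Filter for the simple PoC case: one region per user."""
    return f"region eq {odata_quote(region)}"


def group_filter(group_ids: Iterable[str]) -> str:
    """Filter matching docs whose allowedGroups contains any of the user's groups."""
    ids = list(group_ids)
    if not ids:
        # Fail closed: a user with no groups must not get an unfiltered query.
        raise ValueError("user has no groups; refusing to build a filter")
    inner = " or ".join(f"g eq {odata_quote(g)}" for g in ids)
    return f"allowedGroups/any(g: {inner})"
```

For example, `region_filter("DE")` gives `region eq 'DE'`, and `group_filter(["group-id-for-germany"])` gives `allowedGroups/any(g: g eq 'group-id-for-germany')`. The resulting string would be passed as the filter of the search query by your app, never by the model.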

Don’t rely only on the prompt saying ‘don’t show Sweden data to German users’. That’s not security. The filter has to happen before docs are passed to the model.

For PROD, I’d use ACL/security trimming with Azure AI Search and keep the crawler as a trusted service using a managed identity. If the source site uses Entra auth, the crawler should either use app permissions approved by the content owner, or crawl per content partition with clear ownership. Don’t scrape with a random user token unless you want audit/compliance pain later.

In short: crawl/index centrally, tag content with access metadata, enforce filters in search, then let the agent answer only from filtered results. For the PoC you can start with region filtering; for PROD, move to group/user ACLs.

Regards,

Alex
