Why AI Crawlability Is Different From Google SEO
Google's crawler is sophisticated enough to render JavaScript and handle complex SPA behavior. Most AI engines are not. ChatGPT, Perplexity, and Gemini retrieve pages using HTTP GET requests and read the raw HTML response. If the raw HTML is empty, blocked, or returns a non-200 status, the page does not exist for those engines.
AEO content that lives behind JavaScript rendering is invisible to AI citation. All the optimization work -- direct answer blocks, FAQ sections, schema markup -- means nothing if the crawler cannot reach the page.
The test: if curl cannot read your page content, AI engines probably cannot either. A site fully accessible to Google may still be partially invisible to AI search engines.
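That test can be sketched as a tiny helper -- `visible_to_ai` is a hypothetical name, and "Your headline" stands in for any phrase your rendered page is known to contain:

```shell
# Hypothetical helper: reads raw HTML on stdin, checks for a phrase ($1)
# that your rendered page is known to contain.
visible_to_ai() {
  if grep -qi "$1"; then echo "VISIBLE"; else echo "INVISIBLE"; fi
}

# A JS-only SPA shell fails even though a browser renders it fine:
echo '<div id="root"></div><script src="/bundle.js"></script>' | visible_to_ai "Your headline"  # INVISIBLE

# A server-rendered page passes:
echo '<h1>Your headline</h1><p>Intro copy...</p>' | visible_to_ai "Your headline"               # VISIBLE
```

In practice, pipe live output into it: curl -s https://yourdomain.com/your-page | visible_to_ai "Your headline".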
The 5-Point AI Crawlability Audit
1. 200 status with full HTML content
Every public route must return HTTP 200 with actual page content in the response body. Run: curl -s -o /dev/null -w "%{http_code}" https://yourdomain.com/your-page. A 200 with an empty body still fails -- check that the response size (curl's %{size_download}) is at least a few hundred bytes.
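As a sketch, the pass/fail logic can be separated from the network call -- `check_route` is a hypothetical helper, and the 500-byte threshold is an assumption to tune against your smallest real page:

```shell
# Hypothetical helper: PASS only on a 200 status AND a non-trivial body.
# The 500-byte threshold is an assumption -- tune it to your smallest real page.
check_route() {  # args: http_status body_bytes
  if [ "$1" = "200" ] && [ "$2" -gt 500 ]; then echo "PASS"; else echo "FAIL"; fi
}

check_route 200 14230   # full HTML    -> PASS
check_route 200 0       # empty body   -> FAIL
check_route 404 14230   # broken route -> FAIL
```

Feed it real numbers from curl: read -r code bytes <<<"$(curl -s -o /dev/null -w '%{http_code} %{size_download}' "$url")" then check_route "$code" "$bytes".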
2. robots.txt allows all crawlers
Fetch your robots.txt directly: curl https://yourdomain.com/robots.txt. The simplest fully open policy is two lines: User-agent: * followed by Allow: /. Any Disallow rule that covers your content pages will prevent citation.
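A minimal sketch of this check, assuming the file arrives on stdin (`audit_robots` is a hypothetical name, and it only catches a blanket Disallow: / -- partial disallows on content paths still need a manual read):

```shell
# Minimal sketch: flags a robots.txt with no wildcard group or a blanket disallow.
# Only catches 'Disallow: /' exactly -- partial disallows still need a manual read.
audit_robots() {
  local txt; txt=$(cat)
  if ! printf '%s\n' "$txt" | grep -q '^User-agent: \*'; then
    echo "FAIL: no wildcard User-agent group"; return
  fi
  if printf '%s\n' "$txt" | grep -q '^Disallow: /$'; then
    echo "FAIL: site fully disallowed"; return
  fi
  echo "PASS"
}

printf 'User-agent: *\nAllow: /\n' | audit_robots    # PASS
printf 'User-agent: *\nDisallow: /\n' | audit_robots  # FAIL: site fully disallowed
```

Run it against a live file with curl -s https://yourdomain.com/robots.txt | audit_robots.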
3. No middleware blocking non-browser agents
Some Express applications include middleware that blocks requests based on user agent or origin. Check your server file for any middleware that references user-agent, req.headers, or origin in a way that could restrict access. Remove any such restrictions from public routes.
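A quick way to surface such middleware is a recursive grep over the server code. The sketch below builds a throwaway fixture file just to show the kind of line the search turns up -- the Mozilla check and the file path are illustrative, not from any real project:

```shell
# Build a throwaway fixture showing the kind of middleware to look for.
demo=$(mktemp -d)
cat > "$demo/index.js" <<'EOF'
// Blocks anything without a browser-like UA -- remove from public routes:
app.use((req, res, next) => {
  if (!/Mozilla/.test(req.headers['user-agent'] || '')) return res.status(403).end();
  next();
});
EOF

# The actual audit step: point this at your real server directory instead of "$demo".
matches=$(grep -rniE "user-agent|req\.headers|origin" "$demo" | wc -l | tr -d ' ')
echo "suspect lines: $matches"   # suspect lines: 1
```

Every hit needs a judgment call: logging the user agent is fine; returning a 403 based on it is not.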
4. Meta tags exist in raw HTML
Fetch your page with curl and search for your title and meta description in the output. If they are absent or generic, your meta tags are being injected by JavaScript after load and AI crawlers cannot see them. Switch to server-side injection.
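The check can be sketched as a helper (`check_meta` is a hypothetical name) that greps raw HTML on stdin for a non-empty title and a meta description:

```shell
# Hypothetical helper: reads raw HTML on stdin, reports whether a non-empty
# <title> and a meta description are present before any JavaScript runs.
check_meta() {
  local html; html=$(cat)
  printf '%s' "$html" | grep -qi '<title>[^<]' && echo "title: PASS" || echo "title: FAIL"
  printf '%s' "$html" | grep -qi 'name="description"' && echo "description: PASS" || echo "description: FAIL"
}

echo '<title>AI Crawlability Audit</title><meta name="description" content="...">' | check_meta
echo '<title></title><div id="root"></div>' | check_meta
```

Point it at a live page with curl -s https://yourdomain.com/your-page | check_meta.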
5. Sitemap is accurate and accessible
Fetch your sitemap: curl https://yourdomain.com/sitemap.xml. Verify it lists all your live content pages and that none of the URLs return 404. Remove any deprecated or redirecting URLs -- they waste crawl budget and make the sitemap a less trustworthy signal.
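The URL-extraction half of that check can be sketched with grep and sed -- `extract_locs` is a hypothetical name, and it assumes a flat sitemap with one <loc> per URL, not a sitemap index file:

```shell
# Hypothetical helper: pulls every <loc> URL out of a sitemap on stdin.
# Assumes a flat sitemap, not a sitemap index file.
extract_locs() {
  grep -o '<loc>[^<]*</loc>' | sed -e 's|<loc>||' -e 's|</loc>||'
}

printf '<urlset><url><loc>https://yourdomain.com/</loc></url><url><loc>https://yourdomain.com/guide</loc></url></urlset>' | extract_locs
```

Then check each extracted URL with curl -s -o /dev/null -w '%{http_code}\n' "$url" -- anything other than 200 should be fixed or removed from the sitemap.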
The Replit Audit Prompt
If you are on Replit, paste this into the AI assistant to run the full audit automatically:
Replit audit prompt
Run a full site crawlability audit. Check:
1. All public routes return 200 with actual HTML content -- list each URL, status code, and response size in bytes.
2. robots.txt is correct and accessible at /robots.txt.
3. Any middleware blocking requests based on user agent or IP.
4. Any CORS, auth, or rate limiting on public GET routes.
5. Return a pass/fail for each check and fix any failures automatically.
Frequently Asked Questions
How do I check if AI engines can crawl my website?
Run a 5-point audit: verify every public route returns 200 with full HTML content, confirm robots.txt allows all crawlers, check for middleware blocking non-browser user agents, verify meta tags exist in raw HTML before JavaScript, and confirm your sitemap lists all live pages.
Why would a website that ranks on Google still be invisible to AI engines?
Google's crawler renders JavaScript. Most AI engines like ChatGPT and Perplexity do not -- they read raw HTML responses. Meta tags injected by JavaScript, content loaded after page render, and SPAs without server-side rendering are often invisible to AI crawlers even if they rank well on Google.
What should robots.txt contain for AI crawlability?
A User-agent: * group with Allow: / is the simplest fully open policy. Any Disallow rules that cover your content pages will prevent AI engines from indexing them. Add a Sitemap: line pointing to your sitemap URL.
How do I test if my meta tags are visible to AI crawlers?
Use curl to fetch your page URL and search the raw terminal output for your title and meta description. If they are absent, your meta tags are being injected by JavaScript after load and need to be moved to server-side injection.
Does a 200 status with empty body pass the crawlability check?
No. AI engines need actual HTML content in the response body to index and cite a page. A 200 response with an empty body or only a loading spinner means the page content is being rendered by JavaScript after load -- which most AI crawlers cannot see.