When someone asks an AI assistant for a plumber's pricing norms, a dentist's advice on a cracked tooth, or which local accountant handles S corps, the answer comes from somewhere. Sometimes that somewhere is a small business site like yours.
How the assistant got there is not magic, and it is not a secret algorithm you need to outsmart. It is a pipeline with stages you can observe, and being honest about the stages matters more than any trick.
The pipeline has distinct stages
It helps to separate what actually happens, because each stage has different rules.
Stage one: crawling. Bots like GPTBot, ClaudeBot, and PerplexityBot fetch pages from the open web, either to build training data or to feed retrieval indexes. If your site blocks them, times out, or hides content behind scripts they do not execute, this stage fails silently.
Stage two: retrieval. When a user asks a question, many assistants search a live index, often built on or alongside traditional search results, and pull in a handful of pages that look relevant. This is the stage where most current citations are born. It looks a lot more like search ranking than like model training.
Stage three: selection and citation. From the retrieved pages, the assistant decides what to use and what to cite. Pages that answer the question directly, state facts plainly, and are easy to extract from get used. Pages that bury the answer get skipped, even when they were retrieved.
The practical consequence: most of what earns AI citations today is the same work that earns search visibility, plus an extra premium on clarity and extractability.
What you can actually control
The controllable surface is unglamorous.
Be reachable. Check whether your robots.txt blocks AI crawlers, and decide on purpose rather than by default. Some site owners block them deliberately, which is a legitimate choice. What you want to avoid is blocking them accidentally through an old catch-all rule or an overeager firewall setting.
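One way to check for an accidental block is to run your robots.txt through Python's standard-library parser and ask what it says for each bot. This is a minimal sketch, not a full audit: the bot list is only the three crawlers named earlier, `example.com` is a placeholder, and it tests the robots.txt rules only, not firewall or CDN behavior.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens mentioned in this article; other AI crawlers exist.
AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot"]


def ai_bot_access(robots_txt: str, url: str = "https://example.com/") -> dict:
    """Return {bot_name: allowed?} for one URL, given robots.txt text.

    `url` is a placeholder; pass a real page URL from your own site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in AI_BOTS}


if __name__ == "__main__":
    # Hypothetical robots.txt: one bot blocked by name, everyone else allowed.
    sample = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"
    for bot, allowed in ai_bot_access(sample).items():
        print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```

With the sample rules, GPTBot is reported as blocked and the other two as allowed. Swapping in an old catch-all rule (`User-agent: *` followed by `Disallow: /`) shows all three blocked, which is the accidental case worth catching.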
Be readable without execution. Retrieval systems vary in how well they handle JavaScript-heavy pages. If your core business facts, services, location, and answers only exist after client-side rendering, you are betting that every retrieval system renders. Server-rendered or static content removes the bet.
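A rough way to test this is to look at the HTML your server returns before any script runs and check whether your core facts are in it. The sketch below assumes you have already saved that raw HTML as a string. The helper name `missing_facts` and the sample business details are hypothetical, and matching is plain substring search, so treat the result as a hint rather than a verdict.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text outside <script> and <style> tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def missing_facts(raw_html: str, facts: list) -> list:
    """Return the facts that do NOT appear in the unrendered page text."""
    extractor = _TextExtractor()
    extractor.feed(raw_html)
    extractor.close()
    # Collapse whitespace and compare case-insensitively.
    text = " ".join(" ".join(extractor.chunks).split()).lower()
    return [fact for fact in facts if fact.lower() not in text]


if __name__ == "__main__":
    # Hypothetical client-rendered shell: the name exists only inside a script.
    shell = '<div id="root"></div><script>render("Acme Plumbing")</script>'
    print(missing_facts(shell, ["Acme Plumbing"]))
```

An empty list means every fact was present in the unrendered markup. Anything returned is content that currently depends on a fetcher executing your scripts.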
Answer questions where the question is asked. Assistants cite pages that resolve a query directly. A page titled around the actual question, opening with the actual answer, then expanding, is structurally easy to cite. A page that meanders toward the point is structurally easy to skip.
Keep facts findable and consistent. State your name, location, services, hours, and any prices you publish in plain text, and keep them identical across pages. Assistants synthesizing an answer about your business will use what they can extract, and inconsistency reads as unreliability.
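Consistency is easy to spot-check for a single fact. As an illustration only, the sketch below groups pages by the phone number they display. It assumes US-style ten-digit numbers and page text you have already extracted; the function name, paths, and numbers are all hypothetical.

```python
import re

# Assumption: US-style numbers such as (555) 010-1234, 555.010.1234, 555-010-1234.
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


def phone_variants(pages: dict) -> dict:
    """Map each phone number (digits only) to the set of pages showing it."""
    seen = {}
    for path, text in pages.items():
        for match in PHONE_RE.findall(text):
            digits = re.sub(r"\D", "", match)  # normalize away formatting
            seen.setdefault(digits, set()).add(path)
    return seen


if __name__ == "__main__":
    pages = {
        "/": "Call (555) 010-1234 today",
        "/contact": "Phone: 555.010.1234",
        "/about": "Call 555-010-9999",
    }
    variants = phone_variants(pages)
    if len(variants) > 1:
        print("Inconsistent phone numbers:", variants)
```

More than one key in the result means at least one page disagrees with the others. The same pattern extends to hours or addresses, though those need looser matching than a regular expression for digits.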
What you cannot control, said plainly
You cannot make an assistant cite you. There is no submission form, no paid placement for citations, and no markup that forces inclusion.
It is also worth being honest about the popular tactics. Adding structured data and llms.txt files gets sold as an AI visibility lever. The published evidence so far does not support treating either as a citation lever. Structured data remains worthwhile as general search hygiene, and that is the honest framing: foundational, not magical.
Anyone promising guaranteed AI citations is selling something the pipeline does not offer.
How to know if any of this is working
This is where measurement discipline matters, because the stages produce different evidence and conflating them creates fake progress.
Crawl evidence lives in your server or CDN logs: which AI bots hit which pages, how often, and whether they come back. A bot visit means the crawler can reach you. It does not mean you are being cited.
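Pulling that evidence out of a log can be a few lines of scripting. The sketch below assumes the common "combined" access log format used by Apache and nginx and matches bots by user-agent substring. The sample log lines and user-agent strings are made up for illustration, and because user agents can be spoofed, a match is a hint rather than proof of identity.

```python
import re
from collections import Counter

# User-agent tokens mentioned in this article; extend as needed.
AI_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot")

# Combined log format:
# ip - - [time] "METHOD path proto" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def ai_bot_hits(log_lines) -> Counter:
    """Count requests per (bot, path, status) for known AI crawler tokens."""
    hits = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        user_agent = match["ua"].lower()
        for bot in AI_BOTS:
            if bot.lower() in user_agent:
                hits[(bot, match["path"], int(match["status"]))] += 1
                break
    return hits


if __name__ == "__main__":
    # Illustrative lines only; real user-agent strings differ.
    sample = [
        '203.0.113.5 - - [10/May/2025:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
        '203.0.113.9 - - [10/May/2025:10:05:00 +0000] "GET /contact HTTP/1.1" 403 199 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    ]
    for (bot, path, status), count in ai_bot_hits(sample).items():
        print(bot, path, status, count)
```

Keeping the status code in the key matters: a bot that repeatedly receives 403 or 5xx responses is reaching the server but not the content.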
Citation evidence is harder and currently messier: assistants naming or linking your site in answers, and referral traffic from assistant interfaces where they pass referrers. It is sparse, uneven across tools, and not something to build weekly KPIs on yet.
Keeping those two honest and separate is the core of useful AI visibility monitoring. Crawl proof says the systems can see you. Citation proof says the content is strong enough to get used. Site Clinic tracks them as distinct signals for exactly this reason, and the same separation is worth maintaining however you measure.
One scope note for the developers and technical owners reading this: log-based crawler monitoring is a narrow observational check. It tells you what touched the site. It is not an audit of content quality, and it cannot tell you why a page was or was not cited.
The boring strategy that holds up
Strip the hype away and the strategy for a small business site is short:
- let the crawlers in, on purpose
- serve your core content without requiring script execution
- write pages that answer real questions in the first paragraph
- keep business facts consistent and in plain text
- watch your logs for crawl activity, and treat citations as a lagging signal
None of that is a growth hack. All of it compounds, because it is the same work that makes the site better for humans and for traditional search. The retrieval layer is new. The fundamentals it rewards are not.
