AI’s Sources of Truth: How Chatbots Cite Health Information

Large language models (LLMs) have rapidly become a default source for health advice, from basic questions such as "What causes migraines?" to more complex ones like "What are the treatment options for autoimmune disorders?" However, the sources from which these AI models acquire their medical knowledge remain unclear.
Our study, “AI’s Sources of Truth: How Chatbots Cite Health Information”, set out to answer this question. Using a wide array of health-related prompts, we collected 5,472 citations generated by four web-enabled models: ChatGPT (GPT-4o with browsing), Google Gemini (2.5 Flash), Claude (Sonnet 4), and Perplexity (Sonar mode).
By analyzing the citations produced by these widely used chatbots, the study identifies the most frequently cited websites, examines how recent the sources are and what kinds of content they contain, and notes whether the information is freely accessible or paywalled. The results offer a clear picture of both the reliability and the gaps in AI-driven healthcare information.
Top Websites Cited by LLMs
The first thing to note is that, when asked health-related questions, LLMs draw on a widely distributed array of sources. Even the most frequently cited domain, PubMed Central (pmc.ncbi.nlm.nih.gov) with 385 citations, accounts for only about 7% of the citations analyzed.
That said, the evidence also suggests LLMs favor a small, concentrated group of highly trusted domains. PubMed Central dominates this group by a wide margin, underscoring AI’s heavy reliance on peer-reviewed, open-access research. Institutional health publishers, Cleveland Clinic (174 mentions) and Mayo Clinic (163), also feature prominently, reflecting a strong preference for patient-friendly, medically reviewed resources. The government-backed NCBI database (150) rounds out the group, reinforcing the pattern of official, authoritative references.
Beyond the “big four,” the ranking blends academic publishers (ScienceDirect, Nature, arXiv) with health media outlets (Healthline, WebMD, Medical News Today, Verywell Health) and prominent public health authorities (CDC, NHS, WHO, Heart.org), highlighting the role of official guidelines in shaping chatbot responses. Notably, YouTube (47 mentions), the world’s second-largest search engine after Google, breaks into the top 20. Its inclusion shows that video explainers and expert-led talks with detailed transcripts sometimes surfaced, suggesting the platform's strategic significance as an LLM data source, despite being user-generated content, which LLMs generally treat as a lower-value source.
Nearly one in three citations (30.7%) comes from health media sources such as Mayo Clinic, Cleveland Clinic, or Healthline. Close behind are commercial and affiliate-driven sites (23.1%), whether corporate blogs, e-commerce platforms, or product-linked pages. Academic and research sources, perhaps surprisingly, rank only third at 22.9%, suggesting that AI favors content that has already translated specialist language into accessible, consumer-facing terms rather than doing that translation itself.
Interestingly, general news (3.7%) and social or user-generated content (1.6%) barely make an impression, suggesting that mainstream journalism and anecdotal experiences are less likely to be valued by LLMs.

Looking at the chatbots one by one sharpens the contrast. ChatGPT leans toward health media (35.8%), though almost a quarter of its references (23.0%) still come from academic literature. Claude, meanwhile, takes a more balanced approach: 29.7% from health media and 28.9% from academic research.
Gemini stands out for relying more on government and NGO content (24.9%) than its peers, making it the most policy-driven of the chatbots. Health media still leads at 30.8%, with commercial/paid sources (23.7%) not far behind. Perplexity is, surprisingly, the only model where commercial content comes first (30.5%); it also leads in citing social or user-generated content, at 3.7% versus 1.9% for Claude.

When it comes to credibility, Domain Rating (DR) scores provide a useful lens. The data reveals that 62.4% of all citations come from domains with the highest authority ratings (DR 81-100), sites such as NIH, PubMed, and Mayo Clinic that users tend to trust instinctively. By contrast, only 2.7% of citations come from the lowest tier (DR 0-20), where reliability is most questionable.

While most chatbots ground their health responses in highly authoritative sources, they weigh domain authority differently. ChatGPT (68.0%) and Claude (67.4%) show the strongest preference for top-rated domains, with most of their citations falling in the elite category. Gemini (56.0%) and Perplexity (57.9%) use these top sources less often and draw more from mid-tier domains in the DR 41-80 range. Interestingly, Perplexity cites the largest share of low-authority websites (3.3%), in line with its heavier reliance on commercial and user-generated content.
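For readers who want to reproduce this kind of breakdown, below is a minimal Python sketch of how citations can be bucketed into the DR tiers used above and turned into per-tier shares. The `dr_score` field name and record layout are illustrative assumptions, not the study's actual schema.

```python
# Minimal sketch: bucket citations into Domain Rating (DR) tiers and
# compute each tier's share of the citation set. Field names are
# illustrative, not the study's actual schema.
from collections import Counter

def dr_tier(score: int) -> str:
    """Map a 0-100 Domain Rating score to the tiers used above."""
    if score <= 20:
        return "0-20 (low authority)"
    if score <= 40:
        return "21-40"
    if score <= 80:
        return "41-80 (mid-tier)"
    return "81-100 (elite)"

def tier_shares(citations: list[dict]) -> dict[str, float]:
    """Return each DR tier's fraction of all citations."""
    counts = Counter(dr_tier(c["dr_score"]) for c in citations)
    total = sum(counts.values())
    return {tier: n / total for tier, n in counts.items()}

# Toy example with two citations
sample = [{"dr_score": 93}, {"dr_score": 55}]
print(tier_shares(sample))  # {'81-100 (elite)': 0.5, '41-80 (mid-tier)': 0.5}
```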

The Recency of Sources Referenced by LLMs
Analyzing the publication years of the 5,400+ citations reveals that chatbots predominantly reference recent research and articles, with almost two-thirds of citations dated 2024 or 2025. All four chatbots draw most heavily from 2025 (40% of all citations), with counts falling off sharply for each earlier year.
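A similar calculation produces this recency breakdown. The sketch below computes each publication year's share of the citation set; the `pub_year` field is an assumed column name for illustration.

```python
# Minimal sketch: share of citations per publication year, newest
# first. Records without a year are skipped.
from collections import Counter

def year_distribution(citations: list[dict]) -> dict[int, float]:
    counts = Counter(c["pub_year"] for c in citations if c.get("pub_year"))
    total = sum(counts.values())
    return {year: n / total
            for year, n in sorted(counts.items(), reverse=True)}

sample = [{"pub_year": 2025}, {"pub_year": 2025}, {"pub_year": 2024}]
print(year_distribution(sample))  # {2025: 0.666..., 2024: 0.333...}
```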

Content Features
When it comes to the content of the health advice returned, chatbots lean more on interpreters of science than on the science itself. Across all four models, 59% of references point to summary content such as health media sites, explainers, and consumer guides, compared with 41% to peer-reviewed research. In other words, LLMs are more likely to quote "what the science means" than the raw studies behind it.
The preference is most pronounced in ChatGPT, which draws 62% of its citations from summaries and only 38% from research articles. Perplexity is nearly identical, with 63% summaries and 37% peer-reviewed, showing a clear bias toward consumer-friendly explanations. Claude has the highest rate of peer-reviewed citations at 47%, while Gemini sits in between, with 58% summaries and 42% research sources.
It is easy to see why LLMs favor summaries over studies: it makes answers more accessible to people's everyday lives. On the other hand, it risks distorting the original evidence when the cited interpretation is wrong or fails to capture the full picture, particularly in a discipline like medicine, where nuance and accuracy matter.

While the quality of the content returned is important, the number of supporting sources also matters. On average, each answer to a health-related question is backed by 12 to 15 citations.

Perplexity, which ranks last in terms of high-quality citations, ranks first in the volume of citations, averaging 14.97 citations per query. Its approach represents a philosophy of abundance, drawing from a broad array of sources: commercial sites and even user-generated content are incorporated to give a more complete picture. Coming second at 13.99 is Claude, which provides a more even balance of academic, government, and health media references. These are followed by ChatGPT with 13.59, and then Gemini with 12.29 citations per response.
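Averages like these follow directly from the citation records. As a rough illustration, the sketch below computes mean citations per query for each model, assuming each record carries hypothetical `model` and `query_id` fields.

```python
# Minimal sketch: mean citations per response for each model.
# `model` and `query_id` are assumed field names for illustration.
from collections import defaultdict

def avg_citations_per_query(citations: list[dict]) -> dict[str, float]:
    queries = defaultdict(set)   # model -> distinct query ids seen
    totals = defaultdict(int)    # model -> total citation count
    for c in citations:
        totals[c["model"]] += 1
        queries[c["model"]].add(c["query_id"])
    return {m: totals[m] / len(queries[m]) for m in totals}

sample = [
    {"model": "Perplexity", "query_id": 1},
    {"model": "Perplexity", "query_id": 1},
    {"model": "Perplexity", "query_id": 2},
]
print(avg_citations_per_query(sample))  # {'Perplexity': 1.5}
```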
Content Accessibility
Of the 5,400+ citations in our analysis, 99.3% can be classified as “open access”, meaning they were freely readable. That makes LLMs highly "top of funnel": they focus not on knowledge behind subscription walls, but on information anyone can click and read.
ChatGPT is the exception: 2.4% of its citations linked to paywalled material. While still a small share, it suggests OpenAI is slightly more willing to use gated research or premium media than its competitors. By contrast, Claude, Gemini, and Perplexity almost never cite paywalled content, at under 0.3% each. Their outputs reflect a strong bias toward freely available sources, which makes responses more accessible but may also limit depth when important research sits behind a paywall.

One of the greatest frustrations in online research is clicking on a source only to hit a dead end. Luckily, chatbots appear to have largely solved that problem: of the 5,470+ citations analyzed, only 0.2% were broken URLs. Across all four models, the vast majority of links resolved correctly, showing that LLMs are surprisingly reliable at surfacing live, working sources. For users, that means fewer dead ends and more confidence that the evidence behind an AI's answer can actually be verified.
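Checking link health at this scale is straightforward to script. Below is a minimal sketch of one plausible approach using the `requests` library; the study's exact method may differ, and the timeout and HEAD-then-GET fallback are illustrative choices.

```python
# Minimal sketch: flag citation URLs that appear broken. Uses a HEAD
# request with a GET fallback, since some servers reject HEAD.
import requests

def is_broken(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL appears dead (4xx/5xx or no response)."""
    try:
        resp = requests.head(url, timeout=timeout, allow_redirects=True)
        if resp.status_code == 405:  # HEAD not allowed; retry with GET
            resp = requests.get(url, timeout=timeout,
                                allow_redirects=True, stream=True)
        return resp.status_code >= 400
    except requests.RequestException:
        return True  # timeouts and connection errors count as broken
```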
Conclusion
The data from this study paints a telling picture of how today's health chatbots construct their answers. They overwhelmingly prefer sources that are recent, authoritative, and freely accessible, and they favor summaries over original research. Most answers are backed by a dozen or more references, almost all from open sources and almost never through broken links.
However, there are clear differences between the models. Perplexity showed the strongest appetite for commercial and user-generated content. Claude comes closest to parity between research and summaries. ChatGPT draws heavily from health media and is the most likely to cite paywalled research. Gemini, meanwhile, leans more on government and NGO sources, making it the most policy-driven.
AI chatbots are drastically changing how we interact with healthcare information. Instead of wading through an endless list of URLs in a Google search, users can now be pointed directly to sources relevant to their question. Based on this study, we can be reasonably confident in the sources AI models surface; however, they are by no means immune to bias, so careful validation remains essential, especially for medical information.
Methodology
This study analyzed 5,472 unique citations generated by AI chatbots in response to health-related prompts. The goal was to better understand not just how often these systems cite sources, but what kinds of sources they prioritize when providing medical and health information.
Data Collection
We built a prompt set designed to mimic real-world health queries, ranging from general wellness advice to more technical medical topics. These prompts were run through four major web-enabled large language models during August 2025:
- ChatGPT (Web browsing mode, GPT-4o)
- Google Gemini (2.5 Flash)
- Claude (Sonnet 4)
- Perplexity (Sonar mode)
All links surfaced in chatbot responses were extracted and cleaned before classification. The final dataset comprised 1,497 citations from Perplexity, 1,217 from Gemini, 1,359 from ChatGPT, and 1,399 from Claude.
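As a rough illustration of that extraction-and-cleaning step, the sketch below pulls URLs out of raw response text and normalizes them before classification. The regex and the decision to strip query strings are assumptions for illustration, not the study's exact pipeline.

```python
# Minimal sketch: extract and normalize URLs from raw chatbot output
# before classification.
import re
from urllib.parse import urlsplit, urlunsplit

URL_RE = re.compile(r"""https?://[^\s)\]>"']+""")

def extract_urls(response_text: str) -> list[str]:
    """Pull URLs out of a response and strip tracking noise."""
    cleaned = []
    for raw in URL_RE.findall(response_text):
        parts = urlsplit(raw.rstrip(".,;"))
        # Drop query strings and fragments (often tracking parameters)
        cleaned.append(urlunsplit(
            (parts.scheme, parts.netloc, parts.path, "", "")))
    return sorted(set(cleaned))  # dedupe within a single response

text = "See https://pmc.ncbi.nlm.nih.gov/articles/PMC123/?utm_source=chat."
print(extract_urls(text))  # ['https://pmc.ncbi.nlm.nih.gov/articles/PMC123/']
```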
Source Categorization
Every citation was assigned to one of six groups to capture differences in authority, accessibility, and intent:
- Research / Academic – Peer-reviewed journals, preprints, academic publishers, universities, and scholarly databases.
- Government / NGO – Official government health websites and nonprofit or global health organizations.
- Health Media – Patient-facing resources from hospitals, medical centers, or health publishers.
- General News – Mainstream media outlets reporting on health issues and scientific developments.
- Social / UGC – Platforms featuring user-generated or anecdotal content such as Reddit or YouTube.
- Commercial / Paid – Corporate blogs, affiliate content, e-commerce sites, or pages with clear marketing intent.
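In practice, much of this classification can be driven by a domain lookup table. The sketch below shows the idea using a few sample domains drawn from the rankings above; the actual classification covered far more domains and edge cases.

```python
# Minimal sketch: map a citation's domain to one of the six categories.
# The domain lists are small illustrative samples, not the full rules.
from urllib.parse import urlsplit

CATEGORY_DOMAINS = {
    "Research / Academic": {"pmc.ncbi.nlm.nih.gov", "nature.com",
                            "arxiv.org", "sciencedirect.com"},
    "Government / NGO": {"cdc.gov", "nhs.uk", "who.int"},
    "Health Media": {"mayoclinic.org", "clevelandclinic.org",
                     "healthline.com", "webmd.com"},
    "Social / UGC": {"youtube.com", "reddit.com"},
}

def categorize(url: str) -> str:
    domain = urlsplit(url).netloc.removeprefix("www.")
    for category, domains in CATEGORY_DOMAINS.items():
        if domain in domains:
            return category
    # General News vs Commercial / Paid needs further rules; this
    # sketch falls back to manual review for anything unmatched.
    return "Unclassified (manual review)"
```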
Evidence Level
To evaluate rigor, citations were also tagged by evidence level:
- Peer-reviewed research – Primary studies, systematic reviews, and authoritative reports from government or NGOs.
- Summary content – Secondary or tertiary sources such as news articles, health guides, encyclopedias, or blogs that interpret or distill information.
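Since the evidence level largely follows from the source category, the tagging can be sketched as a simple mapping, with ambiguous cases left for manual review:

```python
# Minimal sketch: derive the evidence-level tag from the source
# category, mirroring the two-level scheme above. Real tagging would
# need per-page checks (e.g., a journal article vs. a university blog).
PEER_REVIEWED_CATEGORIES = {"Research / Academic", "Government / NGO"}

def evidence_level(category: str) -> str:
    if category in PEER_REVIEWED_CATEGORIES:
        return "Peer-reviewed research"
    return "Summary content"
```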
Limitations
The findings represent a snapshot of model behavior at a single point in time. Since generative AI systems are frequently updated and retrained, citation patterns may shift in future iterations.