Synthetic Audiences: How AI Respondents Actually Work
How large language models learn to simulate human survey respondents. Training data, algorithmic fidelity, replica conditioning, and multi-model ensembling.
A language model has never been to Iowa. It has never argued with a spouse about money, been laid off from a factory job, or switched political parties after watching a debate that changed its mind.
And yet. When a team at Stanford created AI agents from interviews with 1,052 real people and then asked those agents to fill out the General Social Survey, the agents matched the real participants' answers with 85% of the accuracy those same participants achieved when retaking the survey themselves two weeks later.
That number is remarkable. Not because it is perfect, but because of what it implies about synthetic audiences. Somewhere inside these models, something approximating human opinion has been encoded. Not through understanding. Not through lived experience. Through patterns in language.
This article explains what that "something" actually is. Not how accurate synthetic audiences are (we cover that in our accuracy guide), and not which tools to use or what they cost (see our complete guide for that). This is about the mechanism. How do language models learn to simulate human opinions? And what does understanding the mechanism tell you about when to trust the results?
Everything Starts with the Training Data
Before a language model can simulate anyone's opinion, it has to learn what opinions look like. This happens during pretraining, the phase where the model reads an enormous volume of text written by real humans.
The scale is hard to grasp. GPT-3 was trained on roughly 570 gigabytes of text. GPT-4 and Claude used datasets that are likely several times larger, though the exact numbers are not public. Meta's LLaMA was trained on a dataset that was 67% Common Crawl, 15% C4, 4.5% GitHub, 4.5% Wikipedia, 4.5% Books, 2.5% ArXiv, and 2.0% StackExchange.
What does that actually include? Reddit threads where people argue about politics, parenting, and whether pineapple belongs on pizza. Amazon reviews where customers explain exactly why they love or hate a product. Forum posts where professionals debate industry practices. News articles reflecting editorial positions. Twitter threads. Blog posts. Academic papers. Legal filings. Customer support transcripts. The collected opinions, preferences, complaints, and beliefs of millions of real people, expressed in their own words, organized by context.
This is not a structured database where "Person A believes X" is stored as a fact. It is unstructured language. But language carries information about who is speaking and what they believe. When someone writes "As a nurse in a rural hospital, I think the healthcare system is fundamentally broken," that single sentence encodes occupation, geography, and a political stance. When someone writes "I switched from Android to iPhone last year and I am never going back," that encodes brand preference, switching behavior, and product satisfaction.
The model processes billions of these sentences. It never builds an explicit database of opinions. What it stores, in the connections between its billions of parameters, are the statistical relationships between demographic contexts and the language patterns that tend to follow them.
This is the foundation that makes synthetic audiences possible. Not intelligence. Not understanding. Pattern recognition on a scale that has never existed before.
Algorithmic Fidelity: The Core Concept
In 2023, a team of researchers at Brigham Young University published a paper in Political Analysis that gave the field its foundational concept. Lisa Argyle and her colleagues called it "algorithmic fidelity."
The idea is straightforward. If you give a language model a detailed demographic description of a person and then ask it to respond as that person, the response should reflect not just surface-level stereotypes, but the deep, fine-grained correlations between demographics and attitudes that exist in real populations.
Argyle's team tested this by creating what they called "silicon samples." They took real sociodemographic backstories from participants in major U.S. surveys, fed them to GPT-3, and asked the model to answer the same survey questions. Then they compared the synthetic responses to the real ones.
The results showed that the model's outputs were not random guesses. They were "fine-grained and demographically correlated." A synthetic respondent described as a Black Democrat from an urban area answered differently from one described as a white Republican from a rural area, and those differences matched the actual patterns in the polling data.
The team evaluated this across four dimensions:
Pattern correspondence. Do the synthetic responses reflect the same statistical relationships between demographics and opinions that appear in real data? Yes. The model captured the relationships between race, party affiliation, education, geography, and policy preferences that show up in established survey findings.
Forward continuity. If you read a synthetic response, does it feel like a natural continuation of the demographic backstory? Yes. The tone, vocabulary, and content of responses tracked consistently with the replica description.
Backward continuity. If you show someone only the response without the backstory, can they correctly guess the demographics of the replica? Yes. Evaluators inferred race, party affiliation, and other demographics from the generated text at rates significantly above chance.
Social science Turing test. Can evaluators distinguish synthetic responses from real human responses? Not reliably. The synthetic text passed a Turing test in the context of survey research.
This paper established something important. The pattern recognition encoded in language models during pretraining is not superficial. The demographic information woven throughout billions of training documents has been absorbed in a way that allows the model to reproduce complex, multi-dimensional opinion structures when prompted with the right context.
John Horton, an economist at MIT, came at this from a different angle in his NBER working paper. He coined the term "homo silicus" and showed that GPT-based models could replicate established behavioral economics experiments. The model demonstrated downward-sloping demand, status quo bias, and fair-play preferences that tracked with the original human studies. When prompted as a "mathematician," the model accepted unfair offers in the ultimatum game. When prompted as a "legislator," it demanded even splits. The replica conditioning changed the economic behavior in ways that mapped onto real human patterns.
How Replica Conditioning Actually Works
Understanding algorithmic fidelity is one thing. Understanding the practical mechanism is another.
When a synthetic audience platform like Replicas generates a respondent, the process works like this.
First, you define the population. This can be broad: "American adults aged 25 to 45." Or specific: "IT directors at mid-market B2B companies who are actively evaluating cybersecurity vendors." The more context you provide, the more the model has to draw on when generating responses.
Second, the system generates individual replicas. Each replica gets a detailed backstory: age, gender, ethnicity, location, income, education, occupation, personality traits, lifestyle, and whatever other attributes are relevant to your research question. These backstories serve as the conditioning context. They are the prompt that tells the model which region of its learned probability space to draw from.
Think of it this way. The model has absorbed billions of sentences from millions of people. When you tell it "you are a 55-year-old retired teacher from rural Ohio with two adult children and a household income of $48,000," you are not handing it a script. You are narrowing the space of possible responses to those that are statistically consistent with what people like that actually say.
The quality of the backstory matters. Moon et al. (2024) developed what they called the "Anthology" approach to backstory construction and found that richer, more detailed backstories produced an 18% improvement in matching real response distributions and a 27% improvement in response consistency compared to simpler demographic prompts. Jiang et al. (2023) at NeurIPS showed that personality prompting could successfully induce specific personality traits in language models, with those traits generalizing to scenarios beyond the ones they were measured in.
Third, you ask your question. This can be anything you would ask a real respondent: a multiple-choice survey question, a Likert scale, an open-ended prompt, a product concept to evaluate, a pricing scenario. The model generates a response as the replica, drawing on the intersection of its training data and the conditioning context.
Fourth, you analyze the responses at scale. Because this process is automated, you can run it for thousands of replicas simultaneously. The result is a distribution of responses across your synthetic population, broken down by whatever segments you defined.
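The four-step workflow above can be sketched in a few lines. Everything here is illustrative: `complete` is a hypothetical stand-in for any chat-model API (it returns canned answers rather than calling a real model), and the backstory fields are examples, not Replicas' actual schema.

```python
import random
from collections import Counter

def complete(system_prompt: str, question: str) -> str:
    """Hypothetical stand-in for a chat-model API call; swap in any provider SDK."""
    # For illustration only: return a canned Likert response.
    return random.choice(["1", "2", "3", "4", "5"])

def make_replica(age: int, occupation: str, location: str, income: int) -> str:
    """Step 2: a backstory string that serves as the conditioning context."""
    return (f"You are a {age}-year-old {occupation} from {location} "
            f"with a household income of ${income:,}. "
            "Answer as this person would, in their own voice.")

# Steps 1-2: define a small synthetic population (example attributes).
replicas = [make_replica(random.randint(25, 45), "office manager",
                         "Columbus, Ohio", random.randrange(40_000, 90_000, 5_000))
            for _ in range(200)]

# Step 3: ask every replica the same question.
question = "On a scale of 1-5, how likely are you to try a new budgeting app?"
answers = [complete(persona, question) for persona in replicas]

# Step 4: analyze the distribution of responses, not just a single answer.
print(Counter(answers))
```

Because the whole loop is automated, scaling from 200 replicas to 20,000 is a one-character change, which is the practical appeal of the approach.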
Replicas automates this entire workflow and adds a critical step: it runs each replica through multiple language models rather than just one. Understanding why that matters requires knowing what happens when you rely on a single model.
Why One Model Is Not Enough
Every language model carries systematic biases. These biases come from three sources, and each model handles them differently.
Training data selection. GPT, Claude, Gemini, and LLaMA were all trained on overlapping but different datasets. Each company makes different decisions about what to include, what to filter out, and how to weight different sources. A model trained with more Reddit data will carry a different opinion landscape than one trained with more academic text or more international news.
Alignment training. After pretraining, models go through reinforcement learning from human feedback (RLHF) or similar alignment techniques. This is where the model learns to be helpful, harmless, and honest. But it also introduces what researchers at Stanford identified as a leftward political skew in RLHF-tuned models. Anthropic's own research found that by default, Claude's responses are most similar to opinions from the United States and certain European and South American countries. These alignment choices create systematic opinion shifts that differ between providers.
Architecture and scale. Larger models tend to produce more nuanced and diverse responses. Smaller models compress more aggressively and lose more of the tail distribution of opinions. A 7-billion-parameter model and a 70-billion-parameter model will produce different response profiles for the same replica.
The practical consequence: if you generate your entire synthetic audience using a single model, your entire audience shares that model's systematic blind spots. Every replica thinks through the same lens.
Using multiple models from different providers reduces this problem. A replica generated by GPT will carry OpenAI's training data biases and alignment choices. The same replica generated by Claude will carry Anthropic's. Generated by LLaMA, Meta's. Generated by Mistral, Mistral's. When you blend responses across models, the idiosyncratic biases of each individual model get diluted. What remains is the signal that is consistent across all of them: the genuine underlying pattern of human opinion.
This is the same logic behind polling aggregation. FiveThirtyEight does not trust any single pollster. They combine dozens of polls, each with its own methodological quirks, and the average is more accurate than any individual poll. Replicas applies this principle by routing each research question through 10 or more different models and synthesizing the results.
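The blending logic is easy to make concrete. A minimal sketch, assuming hypothetical approval rates from four unnamed models answering the same question for the same replica set:

```python
from statistics import mean

# Hypothetical per-model approval rates for one question.
# Each model carries its own systematic bias; none is ground truth.
model_results = {
    "model_a": 0.71,  # e.g. skews agreeable
    "model_b": 0.58,
    "model_c": 0.64,
    "model_d": 0.61,
}

# Simple unweighted blend: idiosyncratic biases partially cancel,
# the same way a polling average dampens any single pollster's house effect.
ensemble_estimate = mean(model_results.values())
spread = max(model_results.values()) - min(model_results.values())

print(f"ensemble: {ensemble_estimate:.2f}, cross-model spread: {spread:.2f}")
```

The spread is itself useful information: when the models disagree widely on a question, the ensemble number deserves less trust, which a single-model setup could never tell you.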
The Deepsona framework, published in late 2025, provided formal validation for this approach. Using six coordinated AI agents with multi-trait population definitions produced more accurate, more human-aligned predictions than single-model or single-trait approaches.
Brand, Israeli, and Ngwe (2023) at Harvard Business School demonstrated something else worth noting: GPT-based models produced willingness-to-pay estimates that were "strikingly similar" to real consumer conjoint studies. The model showed downward-sloping demand curves, diminishing marginal utility of wealth, and state dependence. Fine-tuning on prior survey data made the results even more realistic. Market research, specifically, appears to be a domain where the training data is rich enough to produce genuinely useful synthetic responses.
Where the Science Breaks Down
Understanding how synthetic audiences work also means understanding where and why they fail. The limitations are not random. They are predictable consequences of the mechanism described above.
The sycophancy problem
Models trained with RLHF learn that being agreeable gets positive feedback from human evaluators. This is helpful for a chatbot. It is terrible for survey research.
Research from Emporia found that B2B synthetic respondents display what they called "strong positive bias" and "herd mentality." A 2025 study at ACM UMAP confirmed that synthetic respondents "overestimated humans' tendencies to provide positive ratings and exhibited substantially reduced variance."
In plain language: ask a synthetic audience what they think of your product and they will be nicer about it than real people would be. They will rate things higher, complain less, and express fewer strong objections. If you are counting on honest negative feedback, a naive synthetic audience will let you down.
This is why Replicas built adversarial mode. Instead of accepting the default sycophantic tendencies, adversarial mode specifically generates the replicas most likely to object, resist, or push back on your concept. It forces the model out of its agreeable default. The logic is simple: if even the positively biased version of your audience objects to something, the real version absolutely will.
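At its core, the flip from default to adversarial conditioning is a change in how the replica is framed. A rough sketch of the idea (hypothetical prompt wording, not Replicas' actual implementation):

```python
# Shared backstory for the replica (example attributes).
BASE = ("You are a 38-year-old marketing manager evaluating a new "
        "project-management tool.")

# Default conditioning: the model's RLHF-trained agreeableness dominates,
# so reactions skew positive.
default_prompt = BASE + " Share your honest reaction."

# Adversarial conditioning: explicitly select for objections so the
# model is pushed out of its agreeable default.
adversarial_prompt = BASE + (
    " You are skeptical by nature and have been burned by overhyped "
    "software before. List your three strongest objections to this "
    "product before saying anything positive."
)
```

The point is not that the adversarial replica is more "accurate" in isolation, but that it surfaces the failure modes the agreeable default suppresses.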
The WEIRD skew
WEIRD stands for Western, Educated, Industrialized, Rich, and Democratic. It is a term from social psychology that describes the demographic group most overrepresented in research samples. Language model training data has the same problem, amplified.
The text these models learned from is disproportionately English-language, disproportionately from internet users (who skew younger and more affluent), and disproportionately from platforms like Reddit that have their own demographic profiles. A study published in Humanities and Social Sciences Communications found that models perform significantly better when simulating opinions of Western, English-speaking populations, with measurable disparities across gender, ethnicity, age, education, and social class.
Anthropic's GlobalOpinionQA research found that when you prompt a model to take the perspective of a specific country, the responses shift accordingly, but they can "reflect harmful cultural stereotypes" rather than authentic national perspectives.
What this means practically: if your research targets affluent, English-speaking, college-educated consumers in North America or Western Europe, the training data is relatively dense for that population and the synthetic responses will be more reliable. If your target audience is rural seniors in Southeast Asia or working-class communities in Eastern Europe, the model has far less to draw on and the results should be treated with much more caution.
Variance compression
Real human populations have wide opinion distributions. Some people love a product. Some hate it. Some are indifferent. Some hold unusual combinations of beliefs that do not fit neat demographic categories.
Language models compress this variance. They tend toward the mean and underrepresent the tails.
Bisbee et al. (2024) in Political Analysis demonstrated this quantitatively. In the American National Election Study, real responses had a standard deviation of 31.4 on feeling-thermometer questions. The same questions answered by ChatGPT replicas had a standard deviation of only 16.1. The average was close to right. The spread was roughly half of what it should have been.
Aher et al. (2023) found a related phenomenon they called "hyper-accuracy distortion." When they replicated a Wisdom of Crowds experiment, the LLM crowd was too coherent. Real crowds include noise, outliers, and wildly wrong guesses that paradoxically improve collective accuracy. The synthetic crowd lacked that productive randomness.
This matters because many research questions depend on understanding the distribution, not just the average. "What does the typical person think?" is a fundamentally different question from "What range of opinions exist?" Synthetic audiences are better at the first than the second.
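Bisbee et al.'s numbers make the compression easy to quantify. A minimal sketch, where the two standard deviations come from the study as cited above and `variance_ratio` is an illustrative helper for checking your own runs, not a published method:

```python
from statistics import stdev

# Reported by Bisbee et al. (2024) on ANES feeling-thermometer items.
human_sd = 31.4
synthetic_sd = 16.1

# How much of the real population's spread survives in the synthetic one.
compression = synthetic_sd / human_sd
print(f"synthetic spread is {compression:.0%} of the human spread")

def variance_ratio(synthetic: list, human: list) -> float:
    """Compare the spread of a synthetic sample against a human benchmark.
    A ratio well below 1.0 signals compressed variance."""
    return stdev(synthetic) / stdev(human)
```

A practical habit: before trusting any distribution-level finding from a synthetic audience, run this kind of check against whatever human benchmark data you have.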
The temporal freeze
A language model's knowledge has a cutoff date. It cannot know about events, trends, products, or cultural shifts that occurred after its training data was collected.
This is more than a matter of missing the latest news. It means the model's opinion patterns reflect a specific moment in time. If public sentiment about a brand shifted dramatically in the last six months due to a controversy or a product launch, the model will not know. If a new cultural conversation has changed how people think about a topic, the model is stuck in the past.
Bail (2024) in PNAS raised an additional concern: closed-source models from companies like OpenAI are "deprecated within 3 months to 1 year," meaning the exact model you validated against may not exist by the time you want to replicate the study. He called this "process reproducibility failure."
Prompt sensitivity
Small changes in how you word a question can produce meaningfully different response distributions from the same model with the same replica.
The NeurIPS 2024 paper from the Max Planck Institute tested 43 language models and found that responses were "highly sensitive to prompt perturbations" and exhibited "token biases and recency biases." A question phrased one way might yield a 60/40 split. The same question rephrased slightly might produce 70/30. Neither version is "wrong" in the way a human respondent could be wrong. The model simply does not have a fixed opinion to express. It has a probability distribution that shifts with context.
Verasight's research put it bluntly: "Researchers have no way of knowing in advance if the particular LLM or prompt are increasing or decreasing error relative to actual human responses."
This is where methodology and expertise matter. How you phrase the question, how you construct the replica backstory, how you structure the response options: all of these influence the output. The same technology in different hands can produce very different results.
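One practical defense is to run the same question under several neutral rewordings and report the range rather than a point estimate. A minimal sketch, where `ask` is a hypothetical stand-in that returns the share choosing option A under each phrasing (numbers invented for illustration):

```python
from statistics import mean, pstdev

def ask(phrasing: str) -> float:
    """Hypothetical stand-in: run one phrasing through the synthetic
    audience and return the share choosing option A."""
    return {"v1": 0.60, "v2": 0.70, "v3": 0.63}[phrasing]

# Run the same underlying question under several rewordings.
results = [ask(v) for v in ("v1", "v2", "v3")]

# Report the range, not a single number: if rewording moves the split
# from 60/40 to 70/30, the honest finding is "60-70%", not "64.3%".
print(f"mean {mean(results):.2f}, phrasing spread ±{pstdev(results):.2f}")
```

If the spread across phrasings is small, the finding is robust to wording; if it is large, the model has no stable opinion to report and the result should be treated as directional at best.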
Putting a Number on It
A paper published in 2025 by Huang, Wu, and Wang, accepted at ICML, asked a provocative question: "How many human survey respondents is a large language model worth?"
Their answer: at most 60 randomly selected people in the general U.S. population, for social opinion surveys.
That number puts the technology in perspective. A single model's output is not equivalent to surveying thousands of people. It is equivalent to surveying a few dozen. The value of synthetic audiences comes not from replacing large-scale human research with one model, but from generating many respondents across multiple models, analyzing the distributions, and using the results as directional signals.
Kim and Lee (2023) showed that fine-tuning a model on the General Social Survey (69,000 adults, 3,100 questions, spanning 1972 to 2021) could achieve a correlation of 0.98 at the aggregate public opinion level and an AUC of 0.86 at the individual level. But accuracy was significantly higher for white, higher-income, and more educated individuals, reflecting the demographic skew in both the training data and the survey data used for fine-tuning.
Suh et al. (2025), in a paper at ACL, showed that fine-tuning on their SubPOP dataset (3,362 questions, 70,000 subpopulation response pairs) reduced the gap between LLM and human responses by up to 46% compared to prompt engineering alone. The technology is improving. But even with these improvements, a NAACL 2025 paper that tested specialized models across global populations still concluded by "cautioning against the use of LLMs, specialized or not, for simulating survey response distributions today."
The honest picture: studies have shown results ranging from roughly 85% of human test-retest reliability (Park et al.) to aggregate-level correlations above 0.95 with real survey data (Kim and Lee), depending heavily on the population, the topic, the phrasing, and the methodology. For a detailed breakdown of those accuracy ranges and what they mean in practice, see our accuracy deep dive.
What Understanding the Mechanism Tells You
Knowing how synthetic audiences work changes how you should use them.
If you understand the training data, you understand the coverage. The model knows a lot about populations heavily represented in internet text and very little about populations that are not. This immediately tells you where to trust the results and where to be skeptical.
If you understand RLHF, you understand the positive bias. The model is optimized to be helpful and agreeable. Expecting it to deliver brutally honest negative feedback by default is expecting it to work against its training. You need to actively design for criticism, either through adversarial testing or through question design that invites disagreement.
If you understand variance compression, you focus on relative rankings rather than absolute numbers. "Concept A scored higher than Concept B across all segments" is a reliable finding from synthetic research. "Exactly 73% of people prefer Concept A" is not.
If you understand prompt sensitivity, you invest in question design. The exact wording of your questions matters more with synthetic respondents than with real ones. Running the same question with multiple phrasings and checking the consistency of results is a useful validation technique.
If you understand the multi model argument, you choose platforms that use multiple models. A single model gives you a single perspective with all of its systematic biases intact. Replicas uses 10+ models from different providers specifically because the research shows this reduces the impact of any single model's blind spots.
The bottom line is not complicated. Synthetic audiences are not artificial intelligence forming opinions. They are pattern matching engines drawing on an enormous corpus of real human expression. The quality of their output depends directly on how well the relevant population is represented in their training data, how carefully the replica conditioning is designed, and how thoughtfully the results are interpreted.
The science behind them is real and growing. Argyle et al. established that algorithmic fidelity exists. Park et al. showed it can reach 85% of human test-retest reliability. Brand et al. demonstrated that economic preferences like willingness to pay transfer realistically. ESOMAR's 2025 code update formally addresses synthetic data with new guidelines for transparency, accountability, and responsible use.
The limitations are also real and well documented. Bisbee et al. showed variance is compressed. Dominguez-Olmedo et al. questioned whether LLM survey responses reflect genuine opinion structure at all. Santurkar et al. showed that demographic steering does not fully close the representativeness gap.
Use synthetic audiences with an understanding of how they work. Calibrate your trust based on that understanding. And validate what matters most with real humans.
For a practical framework on when to use them and when not to, read our accuracy guide. For cost comparisons, competitor analysis, and expert commentary, see our complete guide.
Frequently Asked Questions
How do synthetic audiences differ from traditional market research replicas?
Traditional replicas are static profiles built from a handful of interviews. Someone on your team talks to 15 customers, finds patterns, and writes up three to five fictional profiles. They get outdated whenever behavior shifts, they represent a tiny sample, and you cannot ask them follow-up questions. Synthetic audiences are dynamic. They are generated from the statistical patterns embedded in language models trained on billions of real human expressions. You can create thousands of them, define any combination of demographics and psychographics, and survey them repeatedly. The underlying technology is now formally addressed in ESOMAR's 2025 code update, which added new definitions and guidelines for the responsible use of AI and synthetic data in research.
What does "algorithmic fidelity" mean?
Algorithmic fidelity is a term coined by Argyle et al. (2023) to describe the property that language models encode fine-grained, demographically correlated patterns from their training data. When you condition a model with a detailed replica description, the resulting responses are not random or generic. They reflect real statistical relationships between demographic profiles and attitudes, beliefs, and preferences. This is the scientific basis for why synthetic audiences produce useful results.
Why do synthetic audiences tend to give more positive responses than real people?
This is a consequence of how language models are trained. After the initial pretraining phase, models go through alignment training (commonly called RLHF) where human evaluators reward responses that are helpful and agreeable. This creates a systematic tendency toward positive, consensus-oriented responses. Emporia Research documented this as a "strong positive bias" in B2B research contexts. To counteract this, Replicas offers adversarial mode that specifically generates the replicas most likely to push back on your concept.
How does using multiple language models improve synthetic audience quality?
Each language model carries systematic biases from its training data, alignment process, and architecture. If you generate your entire synthetic audience with a single model, every respondent shares those same blind spots. Using multiple models from different providers (OpenAI, Anthropic, Google, Meta, Mistral) means each respondent is influenced by a different set of biases. When you aggregate across models, the idiosyncratic biases cancel out and the genuine human signal becomes stronger. Replicas uses 10+ models for this reason. The approach follows the same logic as polling aggregation, where combining multiple imperfect polls produces a more accurate picture than relying on any single one.
Can synthetic audiences fully replace human research?
Not entirely. The academic consensus in 2025 and 2026 cautions against using synthetic audiences as a full replacement for human research, particularly for high-stakes decisions, sensitive populations, and cases requiring precise quantification. The emerging best practice is a hybrid model: use synthetic audiences for speed, scale, and exploration (concept screening, message testing, audience segmentation), then validate the most important findings with targeted human research. For a detailed framework on when synthetic audiences add genuine value and when they fall short, see our practical guide.
References
Foundational Research
Argyle, L.P., Busby, E.C., Fulda, N., Gubler, J.R., Rytting, C., and Wingate, D. (2023). "Out of One, Many: Using Language Models to Simulate Human Samples." Political Analysis, 31(3), 337 to 351.
Horton, J.J. (2023). "Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus?" NBER Working Paper No. 31122.
Aher, G., Arriaga, R.I., and Kalai, A.T. (2023). "Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies." ICML 2023.
Brand, J., Israeli, A., and Ngwe, D. (2023). "Using GPT for Market Research." Harvard Business School Working Paper No. 23-062.
Park, J.S., et al. (2024). "Generative Agent Simulations of 1,000 People." Stanford University.
Opinion and Bias Research
Santurkar, S., Durmus, E., Ladhak, F., Lee, C., Liang, P., and Hashimoto, T. (2023). "Whose Opinions Do Language Models Reflect?" ICML 2023.
Durmus, E., et al. (2024). "Towards Measuring the Representation of Subjective Global Opinions in Language Models." Anthropic. COLM 2024.
Bisbee, J., Clinton, J.D., Dorff, C., Kenkel, B., and Larson, J.M. (2024). "Synthetic Replacements for Human Survey Data? The Perils of Large Language Models." Political Analysis, 32(4), 401 to 416.
Dominguez-Olmedo, R., Hardt, M., and Mendler-Dunner, C. (2024). "Questioning the Survey Responses of Large Language Models." NeurIPS 2024.
Methodology and Improvement
Moon, S., et al. (2024). "Virtual Personas for Language Models via an Anthology of Backstories."
Jiang, G., et al. (2023). "Evaluating and Inducing Personality in Pre-trained Language Models." NeurIPS 2023.
Kim, J., and Lee, B. (2023). "AI-Augmented Surveys: Leveraging Large Language Models and Surveys for Opinion Prediction."
Suh, J.J., et al. (2025). "Language Model Fine-Tuning on Scaled Survey Data for Predicting Distributions of Public Opinions." ACL 2025.
Huang, C., Wu, Y., and Wang, K. (2025). "How Many Human Survey Respondents is a Large Language Model Worth?" ICML 2025.
Malukas, M. (2025). "Deepsona: An Agent-Based Framework for Multi-Trait Synthetic Audiences." Research Square preprint.
Limitations and Critiques
Cao, Y., et al. (2025). "Specializing Large Language Models to Simulate Survey Response Distributions for Global Populations." NAACL 2025.
Emporia Research (2024). "Real Insights or Robotic Responses."
ACM UMAP (2025). "Simulating Human Opinions with Large Language Models."
Bail, C.A. (2024). "Can Generative AI Improve Social Science?" PNAS, 121(21).
Verasight (2025). "The Risks of Using LLM Imputation of Survey Data."
Performance and Biases of Large Language Models in Public Opinion Simulation. (2024). Humanities and Social Sciences Communications, 11, Article 1095.
Industry Standards
ICC/ESOMAR (2025). "ICC/ESOMAR International Code on Market, Opinion and Social Research."