Mistakes NLG Can Make and How AI Platforms Can Avoid Them

By Jeff Coyle On Jan 22, 2021

To err is human, but machines make mistakes, too. In this case, the machines we’re talking about are natural language generation (NLG) platforms. Sometimes their mistakes, in the form of bias, are subtle. At other times, they go badly wrong.

How does NLG work?

Historically, language experts created rules and models that drove natural language generation. This time-intensive and manual process was the de facto operational standard until relatively recently. Access to vast amounts of data, coupled with advances in machine learning and computing power, has caused a dramatic shift towards the commercial use of statistical natural language generation.

This approach, popularized by GPT-3, is based on distributional semantics or word association. All prior terms in the document statistically determine the next word in a sentence. What’s most likely to occur next depends on what has come before. Really, it’s an educated guess. This type of NLG model does not “understand” language, even though it may sometimes seem that way.

How do they do this?

This kind of system, known as a statistical generative language model, is created by analyzing a vast number of documents. The machine learning process observes how words occur together, how close they are to one another, and how frequently they are used. In principle, the more observations made, the greater the certainty in predicting how words occur.
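The counting idea described above can be sketched with a toy bigram model. This is a deliberate simplification: it looks only at the single preceding word, whereas models such as GPT-3 use neural networks conditioned on much more of the prior text. The function names and the three-sentence corpus are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(documents):
    """Count how often each word is followed by each other word."""
    following = defaultdict(Counter)
    for doc in documents:
        words = doc.lower().split()
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    return following


def next_word_probabilities(model, word):
    """Turn the raw follow-counts for `word` into probabilities."""
    counts = model.get(word.lower(), Counter())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


# Toy corpus, invented for illustration only.
corpus = [
    "the nurse said she was tired",
    "the nurse said she was late",
    "the nurse said he was busy",
]

model = train_bigram_model(corpus)
probs = next_word_probabilities(model, "said")
most_likely = max(probs, key=probs.get)
```

Because "she" follows "said" in two of the three sentences, the model rates it twice as likely as "he". Nothing in the code understands nurses or pronouns; the skew in the prediction is simply the skew in the text it counted.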

The issue is with the data used to train the models. “As for all biased results in data science predictions, it depends on the dataset we are training the models on,” says Rosaria Silipo, Ph.D., principal data scientist at KNIME. “If the dataset is a collection of biased texts, the generated texts will reflect that bias.” Train the natural language generation model with enough documents where “Muslim” occurs in close context with “terrorist” or “white people” with “KKK,” and it may take that to be the norm.

Where and how does bias occur in the process?

So far, the convention has been to train these NLG models on ever-larger sets of text data. Many models use the Common Crawl corpus, consisting of over 2.5 billion unfiltered pages, as their training data. For some, that’s just a start. They’ll essentially scrape everything they can from the internet, including the good, the bad, and the ugly.

It’s not just blog posts and news articles that make up these datasets: social media, forum posts, and the like account for a large portion. “We need to be really vigilant in knowing what kinds of content the training sets contain,” states Pure Strategy Inc Founder and CEO Briana Brownell.

While these massive sets of training data may improve a model’s predictive capability, there are downsides. The sheer size of the data makes it extremely difficult to check for toxic language, and there is a lot of it. The internet is more misogynistic, racist, and sexist than most people realize. Training an NLG model to view this as typical is not a good idea by any stretch of the imagination.

What are the types of bias and misrepresentation?

Scrape enough data from the internet, and soon your vocabulary will expand to include numerous words unfit for public consumption. But sometimes, even innocuous words strung together can form biased representations of gender, profession, race, and religion. Take, for example, the gender stereotype “blonde bombshell,” a term full of negative connotations built from a couple of commonplace words.

But when considering bias, “there is an important but subtle distinction between the action taken by a system versus the analysis,” explains Rayid Ghani, Professor in the Machine Learning Department and Public Policy at Carnegie Mellon University. “How the analysis will be used helps in determining what types of biases are more important than others to avoid.”

Language changes over time, and datasets based on past language usage may not reflect what is current. Take the medical profession, where at one time, nurses were women and doctors were men.

Obviously, that is no longer the case, but a model can perpetuate the stereotype if it is not retrained regularly on updated datasets.

What steps can be taken to mitigate these issues?

If the dataset used to train the language model contains biased text, that bias will be reflected in the generated text. Using a bigger dataset will not necessarily overcome it. At MarketMuse, we’ve found that a well-curated training dataset that’s scrubbed clean of toxic language works better than one that is far larger and unrefined.
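One very simple form of curation is to drop documents that contain terms from a blocklist before training. The sketch below is only an illustration of that idea and does not describe any vendor's actual pipeline; `BLOCKLIST`, `is_clean`, and `curate` are hypothetical names, and the listed terms are placeholders.

```python
import re

# Placeholder terms; a real blocklist would be compiled and reviewed by people.
BLOCKLIST = {"badword", "slur"}


def is_clean(text, blocklist=BLOCKLIST):
    """Return True if none of the document's words appear in the blocklist."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return tokens.isdisjoint(blocklist)


def curate(documents, blocklist=BLOCKLIST):
    """Keep only clean documents and report how many were removed."""
    kept = [doc for doc in documents if is_clean(doc, blocklist)]
    removed = len(documents) - len(kept)
    return kept, removed


raw_documents = [
    "A perfectly ordinary sentence.",
    "This one contains a badword in it.",
]
kept, removed = curate(raw_documents)
```

A keyword filter like this catches only explicit vocabulary. Bias that emerges from innocuous words in combination passes straight through, which is why filtering is one step among several rather than a complete answer.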

According to Christopher Penn, Chief Data Scientist at TrustInsights, “Almost all real-world datasets contain biases.” He believes the real question is whether those biases are harmful or illegal.

“Recirculating, for example, disinformation is not illegal, but it is harmful. If we were generating language about vaccines, for example, we would want to eliminate disinformation.”

Obviously, illegal biases are also of great concern. As Christopher points out, “In the United States, protected classes on which we may not discriminate – which includes the training data we provide to models – include race, national origin, sexual orientation, gender identity, veteran status, disability, and religion.”

Since large real-world datasets are known to contain biases, vendors should explore ways to measure their pre-trained models to determine the extent of discrimination. One possibility is StereoSet from MIT, a dataset of 17,000 sentences measuring bias across gender, race, religion, and profession.
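To make the idea of measuring a model concrete, here is a minimal sketch loosely in the spirit of that kind of benchmark, not StereoSet's actual code or data. It assumes you can obtain a likelihood-style score for a sentence from your model; `score_sentence` is a hypothetical callable, and the sentence pairs and scores below are invented.

```python
def stereotype_preference_rate(pairs, score_sentence):
    """Share of pairs where the stereotyped sentence gets the higher score.

    `pairs` holds (stereotyped, anti_stereotyped) sentence tuples.
    `score_sentence` is assumed to return a higher number for text
    the model considers more likely.
    """
    if not pairs:
        raise ValueError("need at least one sentence pair")
    preferred = sum(
        1 for stereo, anti in pairs
        if score_sentence(stereo) > score_sentence(anti)
    )
    return preferred / len(pairs)


# Stand-in scores for a pretend model; a real test would query the model.
toy_scores = {
    "The nurse said she was tired.": 0.9,
    "The nurse said he was tired.": 0.4,
    "The doctor said he was busy.": 0.8,
    "The doctor said she was busy.": 0.7,
    "The engineer fixed his laptop.": 0.6,
    "The engineer fixed her laptop.": 0.5,
    "The teacher graded her papers.": 0.3,
    "The teacher graded his papers.": 0.5,
}

pairs = [
    ("The nurse said she was tired.", "The nurse said he was tired."),
    ("The doctor said he was busy.", "The doctor said she was busy."),
    ("The engineer fixed his laptop.", "The engineer fixed her laptop."),
    ("The teacher graded her papers.", "The teacher graded his papers."),
]

rate = stereotype_preference_rate(pairs, toy_scores.get)
```

A rate near 0.5 would suggest the model has no systematic preference between the two versions, while a rate well above 0.5 suggests it leans toward the stereotyped phrasing. With only four invented pairs, the number here is purely illustrative.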

Make use of guardrails. In real life, these crash barriers keep automobiles on the road, reducing the risk of serious accidents. In NLG, guardrails serve a similar purpose, ensuring the generated content doesn’t go off course. Not using barriers is like giving the model carte blanche, which is never a good idea.
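One way to picture a guardrail is as a check that sits between the model and the reader: every draft is tested, unacceptable drafts are regenerated, and if nothing passes, the request is handed to a person. The sketch below assumes two hypothetical callables, `generate` (the model) and `is_acceptable` (the check); neither refers to a real library.

```python
def generate_with_guardrail(generate, prompt, is_acceptable, max_attempts=3):
    """Return the first acceptable draft, or flag the request for human review."""
    for attempt in range(1, max_attempts + 1):
        draft = generate(prompt)
        if is_acceptable(draft):
            return {"text": draft, "attempts": attempt, "needs_review": False}
    # No draft passed the check, so nothing is released automatically.
    return {"text": None, "attempts": max_attempts, "needs_review": True}


# Canned drafts stand in for a real model's output.
canned_drafts = iter(["draft with a forbidden claim", "a safe draft"])

result = generate_with_guardrail(
    generate=lambda prompt: next(canned_drafts),
    prompt="Write about vaccines",
    is_acceptable=lambda text: "forbidden" not in text,
)
```

The first canned draft fails the check and the second passes, so the result records two attempts. The fallback branch matters as much as the check itself: when the model cannot produce an acceptable draft, the safest default is to release nothing and ask a human.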

NLG models need to learn continually. It’s not enough to train a model once and use it evermore. Retrain the model regularly and fine-tune it to learn more about the subject for which it is creating a new piece of content. Ideally, this occurs with every new generation request.

Involve humans in the content creation process

Perhaps the most significant risk doesn’t come from the model itself but from its implementation. Many marketers see NLG as a way of removing humans from the content creation equation. I think they view it as a kind of easy button: push it, and out comes a publishable piece of content ready for consumption. This lack of oversight is where the danger lies.

“The solution is incredibly simple,” explains digital consultant Vip Sitaraman. “Pair the computer with a human. Insofar as there is always a human editor curating the works of natural language generation, there is no outsize risk.”

That parallels our experience in designing First Draft, our NLG platform. We see natural language generation as augmenting the work of writers, not replacing them. In creating our system, we account for interactivity at multiple steps of the process. We find that letting the user configure the content toward their goals and giving them editorial control at key stages is crucial to avoiding embarrassing mistakes.

Natural language generation is still in its infancy, and marketers should not have blind faith in the process. They will achieve the best outcome by taking a hands-on approach and treating NLG as a sophisticated writing aid. At the same time, vendors need to be aware of the potential for bias in the large datasets used to train language models and take appropriate action.