Everyone can cluster feedback now. That's the problem.

A year of TalkTalk reviews, and what the word cloud can't see.

A decade ago, sorting a stack of customer reviews into themes was a real job. You hired someone, or you bought a Qualtrics licence, and you waited. Today any half-decent language model does it over a coffee, and roughly four hundred startups will sell you the result with a gradient, a chatbot, and the word "AI" somewhere it wasn't last year.

The trouble is that the answer is nearly always the same, because customers, as a species, are not complicated. Point one of these tools at a telecom and it reports, gravely, that people dislike not reaching support, being charged for things they cancelled, and the internet not working. We pointed one at TalkTalk's last twelve months of Trustpilot reviews and got exactly that. NPS of minus seventy. Top complaints: rude agents, ignored callbacks, and a broadband service that does not work, which, for a broadband company, is a bold thing to have as your headline review. Anyone who has ever held a phone contract could have written that list from memory, for free, which is rather the problem.

Clustering tells you what your customers are unhappy about. By about the third week of running any business, you already know what your customers are unhappy about. The useful question is what is happening to those complaints: which problems are growing, which are shrinking, which one is quietly about to become the headline, and which of those movements are real rather than this month's noise. None of that is in the word cloud, and almost no tool will tell you.

So we did the rest of it. The TalkTalk dashboard behind this piece is public, if you want to dig through it yourself: app.sunbeam.cx/public/oqhjn2bv.

The number that says nothing is happening

TalkTalk's review score has sat in the same hole for a year. Minus sixty-five, minus seventy, back again. By that measure nothing has changed: it was bad in July and it is bad now. The cluster chart agrees, same themes on top, same misery. The score is the only part standing still. Underneath it, the whole thing rearranges, and that rearrangement is the story.

The pain is moving house

Read the complaints as a time series instead of a pile, and two things move in opposite directions.

One of the heaviest drags on TalkTalk's score is poor communication: unreturned calls, missed callbacks, the chatbot that loops you for hours.

That theme is enormous. It is also receding. Mentions per day fell by about a third across the year, and a Poisson test on the rate (corrected for the fact that we are testing dozens of themes at once, a few of which will look significant by luck) says that drop is real. q = 0.019.

While support complaints recede, billing complaints advance. Charges after cancellation, fees for paper bills nobody asked for, a new billing system that hides the invoice.

As a share of all complaints, billing climbed from roughly one in twenty-four to roughly one in fifteen (4.2% to 6.9%), and a proportion test says that climb is real too. q = 0.012.

So the locus of TalkTalk's pain is migrating, from the call centre to the billing system. Here is the part the score actively hides: the problem that is shrinking weighs about four times more than the problem that is growing. Net, the rating is quietly improving even while billing rots, which is why the headline number can sit flat all year and tell you nothing true. A company reading only its NPS would conclude it had a stable disaster. A company reading the change would know its support problem is genuinely shrinking and its billing problem is the next fire, well before that fire reaches the score.

That is the difference between counting complaints and understanding them, and you cannot get there by clustering harder.

Why you can believe any of this

Most "AI insight" products show you a bar that moved and call it a finding. Bars move; feedback is noisy, and across dozens of themes and twelve months something is always up and something is always down. The work is separating the real movements from the random ones, and that work is statistical, unglamorous, and the entire reason to trust a number at all.

Which is the moment to admit the most exciting thing we found was wrong. A change-point detector lit up in mid-November: outage complaints and "the engineer never showed up" complaints spiking together over the same five days. Two unrelated problems jumping at once smells like a real incident, the kind clustering would never surface. Internally, the word "flagship" was used. Then we tested it, and it didn't hold. TalkTalk customers complain about outages every week of the year, so a five-day bump is, statistically, a Tuesday, and when we went looking for an actual outage on those dates the best we found was a business-line wobble and a very promising tweet that turned out to be from the previous January. So we binned it. You get to trust the migration precisely because we throw away the things we would have loved to keep.

It isn't just TalkTalk

We ran the identical pipeline over a year of Virgin Media reviews as a check, and it is public too. Same result at the structural level: the composition of complaints is significantly different from one half of the year to the next, by a chi-square test on the whole distribution, with a p-value of about 0.0000000000000003. The mix is being repainted there too. The specifics differ and the pattern holds. Telecom feedback behaves like a slow-moving weather system, and the score on the front is the last thing to change.

How to actually do this on your own feedback

The method is general and mostly a matter of discipline. If you want to read your own reviews, tickets, or survey answers this way, here is the recipe.

Hold the taxonomy still. Theme your feedback once, then tag every new period against the same themes. Let the model re-cluster each month and "delivery delays" becomes "late deliveries" becomes "shipping issues," and you can no longer compare anything to anything. It is unglamorous, and it is the prerequisite for every step below.
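
A minimal sketch of what a frozen taxonomy can look like in practice. The theme names, the alias table and both function names are hypothetical, invented for illustration; the idea is that whatever label the model emits this month gets mapped back onto the same fixed list before anything is counted.

```python
from collections import Counter

# Hypothetical frozen taxonomy: decided once, then left alone.
THEMES = ("delivery_delays", "billing", "support")

# Hypothetical alias table: drifting labels mapped back to the frozen themes.
ALIASES = {
    "late deliveries": "delivery_delays",
    "shipping issues": "delivery_delays",
    "delivery delays": "delivery_delays",
}


def normalise(label):
    """Map a raw label onto the frozen taxonomy, or 'other' if it does not fit."""
    key = label.strip().lower()
    if key in THEMES:
        return key
    return ALIASES.get(key, "other")


def counts_by_period(tagged):
    """tagged: iterable of (period, raw_label) -> {period: Counter of themes}."""
    out = {}
    for period, label in tagged:
        out.setdefault(period, Counter())[normalise(label)] += 1
    return out


# Made-up example: three months, three different labels, one theme.
example = counts_by_period([
    ("2024-01", "delivery delays"),
    ("2024-02", "Late deliveries"),
    ("2024-03", "shipping issues"),
])
print(example)
```

Anything that lands in "other" is worth reviewing by hand now and then, but the reviewed result should be a new alias, not a new theme every month.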

Test rates, don't eyeball bars. For each theme, ask whether it is mentioned more often than before, and answer with a Poisson rate test. Then correct for multiple comparisons (Benjamini-Hochberg will do), because testing dozens of themes guarantees a few fake trends by chance. This is the step almost everyone skips, and it is why most dashboards are confidently wrong.
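
One way to sketch this step using only the standard library. It compares two Poisson rates with an exact conditional test (given the total count, the first period's count is binomial under the null), then applies Benjamini-Hochberg across themes. This is an assumption about a reasonable implementation, not a description of the exact test behind the numbers above; the function names and the theme counts are made up.

```python
import math


def _binom_pmf(k, n, p):
    # Computed in log space so large counts do not overflow.
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)


def poisson_rate_test(count_before, days_before, count_after, days_after):
    """Two-sided exact conditional test that two Poisson rates are equal."""
    n = count_before + count_after
    if n == 0:
        return 1.0
    p = days_before / (days_before + days_after)
    observed = _binom_pmf(count_before, n, p)
    total = 0.0
    for k in range(n + 1):
        pmf = _binom_pmf(k, n, p)
        # Add up every outcome that is no more likely than the observed one.
        if pmf <= observed * (1 + 1e-9):
            total += pmf
    return min(1.0, total)


def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


# Hypothetical counts: (mentions, days) in the first and second half-year.
themes = {
    "support": (300, 182, 200, 183),
    "billing": (40, 182, 55, 183),
    "outage": (120, 182, 118, 183),
}
p_vals = [poisson_rate_test(*v) for v in themes.values()]
for name, q in zip(themes, benjamini_hochberg(p_vals)):
    print(name, round(q, 4))
```

Only themes whose q-value clears your threshold count as movement; the rest are bars that happened to wiggle.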

Test share as well as count. A theme can take over while your total volume stays flat, by growing as a fraction of everything. A proportion test with a confidence interval catches it. Billing grew at TalkTalk as a share of all complaints even where its raw volume barely moved.
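
A hedged sketch of the share test: a two-proportion z-test with a normal-approximation confidence interval on the difference. The shares below echo the 4.2% and 6.9% quoted earlier, but the denominators of 1,000 are invented for illustration, so the p-value printed here is not the q = 0.012 reported above.

```python
import math


def two_proportion_test(hits_a, total_a, hits_b, total_b, z_crit=1.96):
    """Return (difference, (ci_low, ci_high), two-sided p-value) for share B minus share A."""
    p_a = hits_a / total_a
    p_b = hits_b / total_b
    diff = p_b - p_a

    # Pooled standard error for the test of "no difference".
    pooled = (hits_a + hits_b) / (total_a + total_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = diff / se_pooled
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail

    # Unpooled standard error for the interval around the difference.
    se_diff = math.sqrt(p_a * (1 - p_a) / total_a + p_b * (1 - p_b) / total_b)
    return diff, (diff - z_crit * se_diff, diff + z_crit * se_diff), p_value


# Hypothetical counts: 42 billing complaints out of 1000, then 69 out of 1000.
diff, ci, p = two_proportion_test(42, 1000, 69, 1000)
print(f"share change {diff:+.3f}, 95% CI ({ci[0]:.3f}, {ci[1]:.3f}), p = {p:.4f}")
```

The interval is the useful part. If it straddles zero, the share may not have moved at all, whatever the bar chart looks like.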

Find the date it moved. Change-point detection tells you when a theme stepped up or down, which lets you line it up against a release, a price change, an outage. Treat a single step on a noisy line as a question and go corroborate it (see: the outage we binned).
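
For illustration, the simplest possible change-point search: try every split of a daily-count series and keep the one that minimises squared error around the two segment means. This is a deliberately basic stand-in, not the detector used for the analysis above, and the series is made up.

```python
def best_single_change_point(series, min_segment=3):
    """Return (index, mean_before, mean_after) for the best single split, or None.

    index is the position of the first value in the second segment.
    """
    n = len(series)
    total = sum(series)
    total_sq = sum(x * x for x in series)
    best = None
    best_cost = float("inf")
    left_sum = 0.0
    left_sq = 0.0
    for i in range(1, n):
        x = series[i - 1]
        left_sum += x
        left_sq += x * x
        if i < min_segment or n - i < min_segment:
            continue  # skip splits that leave a segment too short to trust
        right_sum = total - left_sum
        right_sq = total_sq - left_sq
        # Sum of squared deviations from each segment's own mean.
        cost = (left_sq - left_sum ** 2 / i) + (right_sq - right_sum ** 2 / (n - i))
        if cost < best_cost:
            best_cost = cost
            best = (i, left_sum / i, right_sum / (n - i))
    return best


# Hypothetical daily mention counts with a step up on day 6.
daily = [2, 3, 2, 3, 2, 3, 8, 9, 8, 9, 8, 9]
print(best_single_change_point(daily))
```

Note that this always returns a split when the series is long enough, whether or not anything really changed, which is exactly why a detected step is a question to corroborate and not a finding.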

Check the whole deck. A chi-square test on the entire theme distribution tells you whether the mix as a whole has changed. A flat top-line can hide a completely reshuffled pack, which is precisely what happened here.
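
A sketch of the whole-distribution check: a chi-square test of homogeneity on a two-row table (first half of the year, second half) with one column per theme. The tail probability is computed from scratch so the example needs only the standard library. The function names and the counts are hypothetical, and the sketch assumes every theme has at least one mention overall.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    # Closed-form starting points: df = 2 (even case) or df = 1 (odd case).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    # Recurrence Q(a + 1, z) = Q(a, z) + z^a * exp(-z) / Gamma(a + 1).
    while a < df / 2.0:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return min(1.0, q)


def mix_shift_test(counts_first, counts_second):
    """Return (statistic, degrees_of_freedom, p_value) for two theme-count dicts."""
    themes = sorted(set(counts_first) | set(counts_second))
    rows = [
        [counts_first.get(t, 0) for t in themes],
        [counts_second.get(t, 0) for t in themes],
    ]
    row_totals = [sum(r) for r in rows]
    col_totals = [rows[0][j] + rows[1][j] for j in range(len(themes))]
    grand = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(len(themes)):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (rows[i][j] - expected) ** 2 / expected
    df = len(themes) - 1
    return stat, df, chi2_sf(stat, df)


# Hypothetical half-year theme counts with the same total and a different mix.
first_half = {"support": 50, "billing": 30, "outage": 20}
second_half = {"support": 20, "billing": 30, "outage": 50}
print(mix_shift_test(first_half, second_half))
```

Both halves in the made-up example total 100 complaints, so a volume chart would show nothing, while the test flags the reshuffle. If SciPy is installed, scipy.stats.chi2_contingency does the same job with less code.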

Tie it to a number you care about. This is the step nobody takes and the one that matters most. Score each theme by how far it drags a metric, weighted by how much of your feedback it covers. On public reviews that metric is the star rating. On your own data it is whatever you actually run on: average order value, occupancy, churn, refunds. That turns "customers mention billing" into "billing is worth this many points of the thing the board asks about," which is the only version of feedback analysis worth paying for.
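
One plausible reading of that scoring rule, sketched under stated assumptions: a theme's impact is its coverage (the share of feedback that mentions it) multiplied by its drag (the average metric among those mentions minus the overall average). This formula is an illustration and not necessarily the one behind the dashboard; the function name and the reviews are made up.

```python
def theme_impacts(reviews):
    """reviews: list of (metric_value, set_of_themes) -> {theme: impact}.

    impact = coverage * (mean metric where the theme appears - overall mean),
    so a negative number means the theme pulls the metric down.
    """
    total = len(reviews)
    overall = sum(value for value, _ in reviews) / total
    by_theme = {}
    for value, themes in reviews:
        for theme in themes:
            by_theme.setdefault(theme, []).append(value)
    impacts = {}
    for theme, values in by_theme.items():
        coverage = len(values) / total
        drag = sum(values) / len(values) - overall
        impacts[theme] = coverage * drag
    return impacts


# Hypothetical star ratings with the themes each review mentions.
sample = [
    (1, {"billing"}),
    (1, {"support"}),
    (2, {"support"}),
    (5, set()),
    (5, set()),
    (4, set()),
]
for theme, impact in sorted(theme_impacts(sample).items(), key=lambda kv: kv[1]):
    print(theme, round(impact, 3))
```

Swap the star rating for order value, churn or refunds and the same arithmetic applies. Tracking each theme's impact per period is what lets a large shrinking problem and a small growing one be weighed against each other.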

Then distrust the dramatic. The most eye-catching item on the page is usually the least robust, because drama and small samples travel together. Use a long enough window while you are at it. A ninety-day slice of this same data sold us a vulnerable-customer "surge" of nine times the baseline that a full year dissolved to nothing. A quarter will lie to you with great confidence. A year is harder to fool.

The actual frontier

Theming feedback is a solved, commodity problem, and a solved problem is a poor thing to build a company on. The valuable work is the two steps almost nobody takes: putting real statistics on what is changing, and tying those changes to a metric the business already cares about. Do both, and feedback becomes an early-warning system that tells you which fire to put out next, and what it is costing you to leave it burning.

That is the part we built, because it is the part worth building.

You can point Sunbeam at your own reviews, tickets and surveys, against your own metrics, and see what is moving and what it is worth. Free to try at sunbeam.cx.