minimaxir 17 hours ago

There is a flaw with the base problem: each tweet only has one label, while a tweet is often about many different things and can't be delineated so cleanly. Here's an alternate approach that both allows for multiple labels and lowers the marginal cost (albeit with a higher upfront cost) of each tweet classified.

1. Curate a large representative subsample of tweets.

2. Feed all of them to an LLM in a single call with a prompt along the lines of "generate N unique labels and their descriptions for the tweets provided". This bounds the problem space.

3. For each tweet, feed it to an LLM along with a prompt like "Here are labels and their corresponding descriptions: classify this tweet with up to X of those labels". This creates a synthetic dataset for training.

4. Encode each tweet as a vector as normal.

5. Then train a bespoke small model (e.g. an MLP) on the tweet embeddings to create a multilabel classifier, where the model predicts, for each label, the probability that it applies.

The small MLP will be super fast and cost effectively nothing beyond what it takes to create the embedding. It also avoids the time/cost of performing a vector search, or of maintaining a live vector database at all.
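
A minimal sketch of step 5, assuming you already have the tweet embeddings and the LLM-assigned label sets (the label names, array shapes, and random data here are purely illustrative):

    import numpy as np
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.neural_network import MLPClassifier

    # Toy stand-ins for the real data: tweet embeddings from step 4 and the
    # LLM-assigned label sets from step 3.
    rng = np.random.default_rng(0)
    label_pool = ["tech_jokes", "politics", "startups", "ai_hype", "rants"]
    embeddings = rng.normal(size=(600, 384))            # (n_tweets, embedding_dim)
    label_sets = [list(rng.choice(label_pool, size=rng.integers(1, 4), replace=False))
                  for _ in range(600)]

    # Multi-hot encode the label sets; MLPClassifier trains a multilabel head
    # when the target is an indicator matrix.
    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(label_sets)

    clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=200)
    clf.fit(embeddings, Y)

    # Per-label probabilities for a tweet embedding; threshold as needed.
    probs = clf.predict_proba(embeddings[:1])[0]
    print(dict(zip(mlb.classes_, probs.round(2))))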

  • nico 17 hours ago

    I just built a logistic regression classifier for emails and agree

    Just using embeddings, you can get really good classifiers very cheaply

    You can use small embedding models too, and engineer different features to be embedded as well

    Additionally, with email at least, depending on the categories you need, you only need about 50-100 examples for 95-100% accuracy

    And if you build a simple CLI tool to fetch/label emails, it’s pretty easy/fast to get the data
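
    A minimal sketch of that kind of classifier, with toy stand-ins for the data (in practice X comes from an embedding model run over your emails and y from your ~50-100 labeled examples per category):

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        # Toy data; replace with real email embeddings and your own labels.
        rng = np.random.default_rng(1)
        X = rng.normal(size=(200, 384))          # (n_emails, embedding_dim)
        y = rng.choice(["invoice", "newsletter", "personal", "spam"], size=200)

        clf = LogisticRegression(max_iter=1000).fit(X, y)
        print(clf.predict(X[:3]))                # predicted category per email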

    • tomrod 12 hours ago

      I'm interested to see examples! Is this shareable?

  • addaon 3 hours ago

    “Up to X” produces a relatively strong bias toward producing X yesses. “For each of these possible labels, write a sentence describing whether it applies or not, then summarize with the word Yes or No” does a bounded amount of thinking per label and removes the bias, at the cost of using more tokens (in your pre-processing phase) and requiring a bit of post-processing.
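
    A rough sketch of that per-label pattern as a prompt builder (the wording and label list are invented, not a tested prompt):

        # Illustrative only: one yes/no judgment per label, as described above.
        LABELS = ["tech_jokes", "politics", "startup_advice"]

        def build_prompt(tweet: str) -> str:
            lines = [f"Tweet: {tweet}", ""]
            for label in LABELS:
                lines.append(
                    f"For the label '{label}', write one sentence on whether it applies "
                    f"to the tweet, then end the line with 'Answer: Yes' or 'Answer: No'."
                )
            return "\n".join(lines)

        print(build_prompt("Just rewrote our whole backend in Rust, what could go wrong"))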

    • minimaxir 2 hours ago

      Those are just simple prompt examples: obviously more prompt engineering would be necessary.

      However, modern LLMs, even the cheaper ones, do handle the "up to X" constraint correctly without always returning exactly X labels.

  • meander_water 11 hours ago

    Why wouldn't you use OP's approach to build up the representative embeddings, and then train the MLP on that?

    That way you can effectively handle open sets and train a more accurate MLP model.

    With your approach I don't think you can get a representative list of N tweets which covers all possible categories. Even if you did, the LLM would be subject to context rot and token limits.

  • mattmanser 9 hours ago

    There are multiple labels per tweet in the code examples, so not sure where you got that from.

  • portaouflop 16 hours ago

    I am doing a similar thing for technical documentation: basically I want to recommend some docs at the end of each document. I wanted to use the same approach you outlined to generate labels for each document, and thus easily find some “further reading” to recommend for each.

    How big should my sample size be to be representative? It’s a fairly large list of docs across several products and deployment options. I wanted to pick a number of docs per product. Maybe I’ll skip steps 4/5, as I only need to repeat the process occasionally once I’ve labelled everything once.

    • minimaxir 15 hours ago

      If you're just generating labels from existing documents, you don't need that many data points, but the LLM may hallucinate labels if you have too few documents relative to the number of labels you want.

      For training the model downstream, the main constraint on dataset size is how many distinct labels you want for your use case. The rules of thumb are:

      a) ensuring that each label has a few samples

      b) at least N^2 data points total for N labels (e.g. roughly 2,500 data points for 50 labels) to avoid issues akin to the curse of dimensionality

sethkim 20 hours ago

Under-discussed superpower of LLMs is open-set labeling, which I sort of consider to be inverse classification. Instead of using a static set of pre-determined labels, you're using the LLM to find the semantic clusters within a corpus of unstructured data. It feels like "data mining" in the truest sense.

  • frenchmajesty 19 hours ago

    OP here. This is exactly right! You beautifully encapsulated the idea I stumbled upon.

  • alansaber 6 hours ago

    The problem is these don't bin properly.

pietz 4 hours ago

I enjoyed reading this, but it seems overly complex and at least slightly flawed.

Why not embed all tweets, cluster them with an algorithm of your choice and have an LLM provide names for each cluster?

Cheaper, better clusters and more accurate labels.
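
For concreteness, a sketch of that pipeline (the model choice, k, and the naming prompt are just examples; the LLM call is left as a stub):

    import numpy as np
    from sklearn.cluster import KMeans
    from sentence_transformers import SentenceTransformer

    tweets = [
        "I love cheeseburgers",
        "Just rewrote our backend in Rust",
        "Hot takes about tonight's election",
        "Best burger spots in Austin?",
    ]  # in practice: the full corpus

    model = SentenceTransformer("all-MiniLM-L6-v2")   # small local embedding model
    embeddings = model.encode(tweets, normalize_embeddings=True)

    k = 2                                             # pick k to suit your corpus
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)

    # For each cluster, sample a few tweets and ask an LLM to name the cluster.
    for c in range(k):
        examples = [t for t, lab in zip(tweets, km.labels_) if lab == c][:5]
        prompt = "Give a short snake_case label for tweets like:\n" + "\n".join(examples)
        # label = call_llm(prompt)   # hypothetical LLM call
        print(c, examples)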

  • frenchmajesty 4 hours ago

    OP here. I agree! I should've called out why I did _not_ follow that approach as many others have commented the same.

    The main reason is that I needed the classification to be ongoing. My system pulled in thousands of tweets per day, and they all needed to be classified as they came in for some downstream tasks.

    Thus, I couldn't embed all tweets, then cluster, then ...

    • pietz 2 hours ago

      Makes sense, I appreciate the comment. Well written article. Subscribed.

  • PaulHoule 3 hours ago

    k-Means clustering works very well on embeddings from models such as SBERT: if you feed in 20,000 documents and ask for k=20 clusters, the clusters are pretty good -- with the caveat that the clustering engine wants to make roughly equal-sized clusters, so if 5% of your articles are about Fútbol you will probably get a single Fútbol cluster, but if 20% of them are about the carbon cycle you will get four carbon-cycle clusters.

    There are other clustering algorithms that try to fit variable-size clusters or hierarchically organized clusters, which may or may not produce better clusters but generally take more resources than k-Means; k-Means is just getting started at 20,000 documents while others might be struggling at that point.

    Having the LLM write a title for the clusters is something you can do uniquely with big LLMs and prompt engineering.

    It's wrong to say "don't waste your time collecting the data to train and evaluate a model because you can always prompt a commercial LLM and it will be 'good enough'", because you at the very least need the evaluation data to prove that your system is 'good enough' and to decide whether one setup is better than another (swap out Gemini vs Llama vs Claude).

    In the end, though, you might wish that the classification is not something arbitrary that the system slapped on, but rather a "class" in some ontology which has certain attributes (e.g. a book can have a title, and a "heavy book" weighs more than 2 pounds by my definition). If you are going the formal ontology route you need the same evaluation data so you know you're not doing it wrong. If you've accepted that, though, you might as well collect more data and train a supervised model, and what I see in the literature is that the many-shot approach [1] still outperforms one-shot and few-shot.

    [1] which is on the scale of the training data in most applications

  • EForEndeavour 4 hours ago

    From my limited experience trying exactly this, it gets you 80% of the way there, then devolves into an infuriating and time-wasting exercise in endless iteration and prompting to sweep clustering parameters and labeling details to nail the remaining 20% needed for acceptance by downstream "customers" (i.e., nontechnical business people).

    If your end goal is to show an audience of nontechnical stakeholders an overview of your dataset in a static medium (like a slide), I would suggest you do the cluster labeling yourself, with the help of interactive tooling to make the semantic cluster structure explorable. One option is to throw the dataset into Apple's recently published and open-sourced Embedding Atlas (https://github.com/apple/embedding-atlas), take a screenshot of the cluster viz, poke around in the semantic space, and manually annotate the top 5-10 most interesting clusters right in Google Slides or PowerPoint. If you need more control over the embedding and projection steps (and you have a bit more time), write your own embedding and projection, then use something like Plotly to build a quick interactive viz just for yourself; drop a screenshot into a slide and annotate it. Feels super dumb, but is guaranteed to produce human-friendly output you can actually present confidently as part of your data story and get on with your life.

kgeist 12 hours ago

I did something similar: made an LLM generate a list of "blockers" per transcribed customer call, calculated the blockers' embeddings, and clustered them.

The OP has 6k labels and discusses time + cost, but what I found is:

- a small, good enough locally hosted embedding model can be faster than OpenAI's embedding models (provided you have a fast GPU available), and it doesn't cost anything

- for just 6k labels you don't need Pinecone at all; in Python it took me like a couple of seconds to do all the calculations in memory

For classification + embedding you can use locally hosted models; it's not a particularly complex task that requires huge models or huge GPUs. If you plan to do such classification tasks regularly, you can make a one-time investment (buy a GPU) and then run many experiments with your data without having to think about costs anymore.
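
To illustrate the second point above, a brute-force in-memory lookup at that scale is just a matrix-vector product (random vectors below stand in for real embeddings):

    import numpy as np

    # ~6k normalized label embeddings held in memory; no vector DB needed.
    rng = np.random.default_rng(0)
    label_vecs = rng.normal(size=(6000, 384))
    label_vecs /= np.linalg.norm(label_vecs, axis=1, keepdims=True)

    query = label_vecs[42] + 0.01 * rng.normal(size=384)   # a new, similar embedding
    query /= np.linalg.norm(query)

    sims = label_vecs @ query                 # cosine similarities in one shot
    best = int(np.argmax(sims))
    print(best, round(float(sims[best]), 3))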

  • meander_water 11 hours ago

    Agreed, I've run sentence-transformers/all-MiniLM-L6-v2 locally on CPU for a similar task, and it was approximately 2x faster than calling the OpenAI embedding API, not to mention free.

  • frenchmajesty 7 hours ago

    OP here. I agree with you. For production use we use VoyageAI, which is usually 2x faster than OpenAI at similar quality levels (p90 is < 200ms), but we're looking at spinning up a local embedding model in our cloud environment; that would make p95 < 100ms and make the cost negligible as well.

cadamsdotcom 2 hours ago

Neat! LLM-as-judge use cases could benefit from this too.

Normally you’d ask the judge LLM to “rate this output out of 5” or whatever the best practice is this week.

Vectorizing the output you’re trying to judge, then judging on semantic similarity to a desired output - instead of asking a judge “how good was this output” - avoids so many challenges. Instead of a “rating out of 5” you get more precise semantic similarity and you get it faster.
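
A small sketch of that idea, using a sentence-embedding model to score a candidate output against a reference instead of asking for a 1-5 rating (model name and texts are examples):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    reference = "Refunds are processed within 5 business days after approval."
    candidate = "Once approved, the refund takes about a week to reach your account."

    # Cosine similarity between the two embeddings; compare against a tuned
    # threshold instead of relying on a judge LLM's rating.
    score = util.cos_sim(model.encode(reference), model.encode(candidate)).item()
    print(f"semantic similarity: {score:.3f}")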

No doubt obvious to folks in the space, but seemed like a huge insight to me.

rorylaitila 4 hours ago

When I started cataloging my vintage ad collection (https://adretro.com), I originally started with a defined set of entities (like brands, or product categories). I used OpenAI vision to extract the categories the ad belongs to. However, I found that it would simply not be consistent in its classification. So I decided to let the model classify however it wants, and I map those results back to my desired ontology after the fact. The mapping is manual in my case, but I could see how I could use these techniques to dynamically cluster.

  • xyzzy_plugh 4 hours ago

    I would love to better understand what you mean by "classify however it wants." Is the output structured?

    • rorylaitila 3 hours ago

      Yeah, the output is json structured, but I mean the entity value that is returned. A simple case is classifying the Brand of the ad. It might return any of "Ford", "Ford Motor Company", "Ford Trucks", "The Ford Motor Company", "Lincoln Ford" even on very similar ads. Rather than try to enhance the prompt like "always use 'Ford Motor Company' for every kind of Ford" I just accept whatever the value is. I have a dictionary that maps all brands back to a canonical brand on my end.
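
      A minimal version of that mapping step (the variants listed are examples, not an exhaustive dictionary):

          # Map whatever brand string the model returns onto a canonical brand.
          CANONICAL_BRANDS = {
              "ford": "Ford Motor Company",
              "ford motor company": "Ford Motor Company",
              "ford trucks": "Ford Motor Company",
              "the ford motor company": "Ford Motor Company",
          }

          def canonical_brand(raw: str) -> str:
              key = raw.strip().lower()
              # Unknown variants fall through for manual review and a dictionary update.
              return CANONICAL_BRANDS.get(key, f"UNMAPPED: {raw}")

          print(canonical_brand("Ford Trucks"))   # -> Ford Motor Company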

      • AbstractH24 2 hours ago

        What are you using to build the dictionary? Particularly when it encounters something you've never seen before.

        This is really interesting to me.

dinobones 19 hours ago

Dunno if this passes the bootstrapping test.

This is sensitive to the initial candidate set of labels that the LLM generates.

Meaning if you ran this a few times over the same corpus, you'd probably get different performance depending on the order in which you input the data and the classification tags the LLM ultimately decided upon.

Here’s an idea that is order invariant: embed first, take samples from clusters, and ask the LLM to label the 5 or so samples you’ve taken. The clusters are serving as soft candidate labels and the LLM turns them into actual interpretable explicit labels.

barbazoo 3 hours ago

I'm really impressed by this. If I understand correctly, they classify the tweet but then embed the class in the vector space and find the closest neighbour. If it's really close it means it's semantically similar, and if it's close enough it's a “cache hit”. Beautiful.

But when I read this:

> In order to train an AI model to tweet like a real human

Ugh, we’re doing this again, trying to fool people into believing some AI Twitter account is a real person, presumably for personal gain. Am I wrong?

  • frenchmajesty 3 hours ago

    Hey OP here. You're not wrong! Leaving aside the philosophical debate (isn't every form of capitalist participation selfishly motivated?), the main motivator was to help me and my friends with a problem we struggled with.

    Many solo entrepreneurs you see on Twitter with large audiences are busy people, so they hire cheap labor from India / the Philippines to be their social media manager. These managers often take on the task of keeping up with the niches and drafting post ideas. The big issue is that the variance in quality of who you hire is very high, and it's also a mental and energy toll to manage an employee who works on the other side of the world.

    So the AI scours "here is what all the tech bros are talking about since 3 days ago" and then drafts 3-5 posts and shows them to me so I can curate. I get to keep my page and audience engaged while protecting my time for actual deep work instead of scrolling the feed all day.

radarsat1 6 hours ago

I'm curious, what is the use case for open-ended labeling like this? I can think of clustering, i.e. finding similar tweets, but that can also just be done via vector similarity. Otherwise maybe the labels contain interesting semantics, but 6000 sounds like too many to analyze by hand. Maybe you are using LLMs to do further clustering and working toward a graph or hierarchical "ontology" of tweets?

  • frenchmajesty 3 hours ago

    Hey OP here. The use-case is to give an Agent the ability to post on my behalf. It can use these class labels to figure out "what are my common niches", then come up with keyword search terms to find what's happening in those spaces, and then draft up some responses that I can curate, edit and post.

    This is the kind of work you typically hire cheap social media managers overseas to do through Fiverr. However, the variance in quality is very high, and the burden of managing people on the other side of the world can be a lot for solo entrepreneurs.

jawns 20 hours ago

If you already have your categories defined, you might even be able to skip a step and just compare embeddings.

I wrote a categorization script that sorts customer-service calls into one of 10 categories. I wrote descriptions of each category, then translated each into an embedding.

Then created embeddings for the call notes and matched to closest category using cosine_similarity.
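
A sketch of that setup (the categories, descriptions, and model choice are invented for illustration):

    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    # Each category is just a written description, embedded once up front.
    categories = {
        "billing_issue": "The customer has a question or complaint about charges or invoices.",
        "cancellation": "The customer wants to cancel or downgrade their service.",
        "technical_support": "The customer reports something broken or not working.",
    }

    model = SentenceTransformer("all-MiniLM-L6-v2")
    names = list(categories)
    category_vecs = model.encode(list(categories.values()))

    # Classify a call note by its nearest category description.
    note = "Caller says they were charged twice this month and wants a refund."
    sims = cosine_similarity(model.encode([note]), category_vecs)[0]
    print(names[sims.argmax()], round(float(sims.max()), 3))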

  • copypaper 18 hours ago

    I originally settled on doing this, but the problem is that you have to re-calculate everything if you ever add/remove a category. If your categories will always be static, that will work fine. But it's more than likely you'll eventually have to add another category down the line.

    If your categories are dynamic, the way OP handles it will be much cheaper as the number of tweets (or customer service calls in your case) grows, as long as the cache hit rate is >0%. Each tweet gets its own label, i.e. "joke_about_bad_technology_choices". Each of these labels gets put into a category, i.e. "tech_jokes". If you add/remove a category you would still need to re-calculate, but only the mapping of labels to categories, as opposed to every single tweet. Since similar tweets can share the same labels, you end up with fewer labels than total tweets. As you reach the asymptotic ceiling mentioned in OP's post, your cost to re-embed labels into categories also approaches an asymptotic ceiling.

    If the number of items you're categorizing is a couple thousand at most and you rarely add/remove categories, it's probably not worth the complexity. But in my case (and ops) it's worth it as the number of items grows infinitely.

  • nerdponx 20 hours ago

    How did you construct the embedding? Sum of individual token vectors, or something more sophisticated?

    • minimaxir 17 hours ago

      Modern embedding models (particularly those with context windows of 2048+ tokens) allow you to YOLO and just plop the entire text blob into them, and you can still get meaningful vectors.

      Formatting the input text to have a consistent schema is optional but recommended to get better comparisons between vectors.

    • olliepro 19 hours ago

      sentence embedding models are great for this type of thing.

  • kurttheviking 20 hours ago

    Out of curiosity, what embedding model did you use for this?

  • minimaxir 18 hours ago

    This works in a pinch but is much less reliable than using a curated set of representative examples from each targeted class.

  • svachalek 20 hours ago

    That was my first thought, why even generate tags? Curious to see if anyone's proved it's worse empirically though.

    • soldeace 19 hours ago

      In a recent project I was asked to create a user story classifier to identify whether stories were "new development" or "maintenance of existing features". I tried both approaches, embeddings + cosine distance vs. directly asking a language model to classify the user story. The embeddings approach was, despite being fueled by the most powerful SOTA embedding model available, surprisingly worse than simply asking GPT 4.1 to give me the correct label.

    • frenchmajesty 19 hours ago

      OP here. It depends on what you use it for. You do want the tags if you intend to generate data. Let's say you prompt an LLM to go tweet on your behalf for a week, with the ability to:

      - Fetch a list of my unique tags to get a sense of my topics of interests

      - Have the AI dig into those specific niches to see what people have been discussing lately

      - Craft a few random tweets that are topic-relevant and present them to me to curate

      That is a very powerful workflow, and it's hard to deliver on without the class labels.

  • TZubiri 8 hours ago

    I had this same idea mid 2024, but embeddings and cosine similarity are way less consistent; not even the classic king - man + woman = queen example works reliably. The latest embedding models from OpenAI are from like 2023. Did you actually try this? What embedding models work for this?

vessenes 7 hours ago

Arthur, question from your GitHub + essay:

In GitHub you show stats that say a "cache hit" is 200ms and a miss is 1-2s (LLM call).

I don't think I understand how you get a cache hit off a novel tweet. My understanding is that you

1) get a snake case category from an LLM

2) embed that category

3) check if it's close to something else in the embedding space via cosine similarity

4) if it is, replace og label with the closest in embedding space

5) if not, store it

Is that the right sequence? If it is, it looks to me like all paths start with an LLM, and therefore are not likely to be <200ms. Do I have the sequence right?

  • frenchmajesty 5 hours ago

    OP here. We embed both the label AND the tweet. So if tweet A is "I love burgers" and tweet B is "I love cheeseburgers", we ask our vector DB whether we have seen a tweet very similar to B before. If yes, we skip the LLM altogether (cache hit) and just reuse the class label that A has.
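
    In rough pseudocode-Python, that flow looks something like this (in-memory lists stand in for the vector DB, and embed() / llm_label() are assumed helpers, not the repo's actual API):

        import numpy as np

        MIN_SCORE = 0.8                     # similarity threshold for a cache hit

        stored_vecs: list[np.ndarray] = []  # embeddings of previously seen tweets
        stored_labels: list[str] = []       # label assigned to each stored tweet

        def classify(tweet: str, embed, llm_label) -> str:
            v = embed(tweet)
            v = v / np.linalg.norm(v)
            if stored_vecs:
                sims = np.stack(stored_vecs) @ v
                best = int(np.argmax(sims))
                if sims[best] >= MIN_SCORE:        # cache hit: reuse neighbor's label
                    return stored_labels[best]
            label = llm_label(tweet)               # cache miss: ask the LLM
            stored_vecs.append(v)                  # store for future lookups
            stored_labels.append(label)
            return label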

  • inanothertime 7 hours ago

    From what I understood, we check whether the "snake case category" from step (1) is already known to us (in the cache), so that no further processing is needed. Steps (2) and onward don't apply for categories that were already produced earlier.

pu_pe 11 hours ago

What about accuracy? Maybe I'm missing something, but the crucial piece of information missing here is whether the labels produced by both methods converge nicely. The fact that OP had >6000 categories using LLMs makes me wonder whether there is any validation at all, or whether you just let the LLM freestyle.

axpy906 20 hours ago

Arthur’s classifier will only be as accurate as its retrieval. The approach depends on the candidates being the correct ones for classification to work.

  • frenchmajesty 19 hours ago

    OP here. This is true. If you set your min_score to 0.99 you can have very high confidence in copy-pasting the label, but then this is not very useful. The big question is: how far can you get from 0.99 while still having satisfying results?

    • me_vinayakakv 14 hours ago

      Thanks for the article and approach. How did you come up with min_score at the end? Was it by trial and error?

dan_h 18 hours ago

This is very similar to how I've approached classifying RSS articles by topic on my personal project[1]. However to generate the embedding vector for each topic, I take the average vector of the top N articles tagged with that topic when sorted by similarity to the topic vector itself. Since I only consider topics created in the last few months, it helps adjust topics to account for semantic changes over time. It also helps with flagging topics that are "too similar" and merging them when clusters sufficiently overlap.
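
A numpy sketch of that topic-vector refresh (random normalized vectors stand in for real article embeddings; N is arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    article_vecs = rng.normal(size=(500, 384))
    article_vecs /= np.linalg.norm(article_vecs, axis=1, keepdims=True)
    topic_vec = article_vecs[0].copy()            # current topic vector

    # New topic vector = mean of the top-N articles most similar to it.
    N = 20
    sims = article_vecs @ topic_vec
    top_n = np.argsort(sims)[-N:]
    topic_vec = article_vecs[top_n].mean(axis=0)
    topic_vec /= np.linalg.norm(topic_vec)        # keep it unit-length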

There's certainly more tweaking that needs to be done but I've been pretty happy with the results so far.

1: jesterengine.com

nreece 17 hours ago

Am I understanding it right that for each new text (tweet) you generate its embedding first, try to match it against the existing vector embeddings of all other texts (full text or bag of words), and then send the text to the LLM for tag classification only if no match is found, otherwise classifying it with the same tag for which a match was found?

Will it be any better if you sent a list of existing tags with each new text to the LLM, and asked it to classify to one of them or generate a new tag? Possibly even skipping embeddings and vector search altogether.

  • me_vinayakakv 14 hours ago

    Yeah. But I think one problem could be if this list is very large and overflows the context.

    I was thinking of giving the LLM a tool `(query: string) => string[]` to retrieve a list of matching labels and check whether they already exist.

    But the above approach sounds similar to OP, where they use embeddings to achieve that.

  • liqilin1567 15 hours ago

    I think your understanding is correct.

    I actually built a project for tagging posts exactly the way you described.

rao-v 11 hours ago

You could probably speed this up a lot by getting the token log probs and finding your category based on the highest log prob token that is in the category (you may need multiple steps if your categories share token prefixes)

kpw94 18 hours ago

Nice!

So the cache check tries to find if a previously existing text embedding has >0.8 match with the current text.

If you get a cache hit here, iiuc, you return that matched text's label right away. But do you also insert a text embedding of the current text in the text embeddings table? Or do you only insert it in case of a cache miss?

From reading the GitHub readme it seems you only "store text embedding for future lookups" in the case of a cache miss. Is this by design, to keep the text embedding table from getting too big?

  • frenchmajesty 18 hours ago

    OP here. Yes, that's right. We also insert the current text embedding on misses, to expand the boundaries of the cluster.

    For instance: "I love McDonalds" (1.0), "I love burgers" (0.99), "I love cheeseburgers with ketchup" (?).

    This is a bad example but in this case the last text could end up right at the boundary of the similarity to that 1st label if we did not store the 2nd, which could cause a cluster miss we don't want.

    We only store the text on cache misses, though you could do both. I had not considered that idea but it makes sense. I'm not very concerned about the dataset size because vector storage is generally cheap (~$2/mo for 1M vectors) and the savings in $$$ not spent generating tokens covers that expense generously.

deepsquirrelnet 18 hours ago

I think a less order biased, more straightforward way would be just to vectorize everything, perform clustering and then label the clusters with the LLM.

  • frenchmajesty 18 hours ago

    OP here. Yes, that works too and gets you to the same result. It removes the risk of bias, but the trade-off is higher marginal cost and latency.

    The idea is also that this would be a classification system used in production, where you classify data as it comes in, so the "rolling labels" problem still exists there.

    In my experience though, you can dramatically reduce unwanted bias by tuning your cosine similarity filter.

vladde 5 hours ago

could someone please explain what this means?

> cluster the inconsistent labels by embedding them in a vector space

  • TheTaytay 5 hours ago

    For any label it generates, he uses an embedding model to generate the embedding vector for it (you could say it "embeds the label in a vector space"). Then he checks whether a previously generated label has a _very_ similar embedding. If you set some sort of "how close is close enough" threshold, you are effectively "clustering" all the generated labels, by saying "these 10 labels use slightly different words, but essentially mean the same thing."

deadbabe 5 hours ago

I just did the same thing, but instead of a vector space I used full-text search over sufficiently tokenized text and searched for possible existing keyword matches before assigning a label, and basically got the same or better results than the article at a fraction of the cost (no need to vectorize).

It’s fairly obvious tbh, an agent needs a way to search, not just expect it to produce the same labels that magically match prior results. Have we been so blinded we’ve forgotten this kind of stuff?

ur-whale 10 hours ago

PSA: DSU means Disjoint Set Union and PSA means Public Service Announcement.

TZubiri 8 hours ago

"Read the following tweet and provide a classification string to categorize it.

Your class label should be between 30 and 60 characters and be precise in snake_case format. For example:

- complain_about_political_party
- make_joke_about_zuckerberg_rebranding

Now, classify this tweet: {{tweet}}"

I stopped reading here. It's a bit obvious that you need to define your classification schema beforehand, not on a per-message basis, and if you do, you need a way to remember that schema; otherwise you will of course generate an inconsistent and non-orthogonal set of labels. I expected the next paragraphs to immediately fix this with something like

"Classify the tweet into one of: joke, rant, meme..." but instead the post went on to intellectualizing with math? It's like a chess player hanging a queen and then going on about bishop pairs and the london system