LLM-based sentiment analysis of Hacker News posts between Jan 2020 and June 2023

Reading Time: 12 minutes

tl;dr
We analyzed all Hacker News posts with more than five comments between January 2020 and June 2023.
Leveraging the Llama 3 70B LLM, we examined both the posts and their associated comments to gain insights
into how the Hacker News community engages with different topics. You can download the datasets
we produced at the bottom of the article.

Use the tool below to explore various topics and the sentiments they evoke. The sentiment column
reports the median sentiment score, 0 being the most negative sentiment and 9 the most positive.
Click on the column headers to discover topics that inspire strong reactions, whether positive or
negative, and to identify divisive subjects that tend to generate polarized commentary.

Motivation

If you have been following Hacker News for a while, you have likely developed an intuition
for the topics that the community loves – and topics the community loves to hate.
If you port Factorio to run on ZX Spectrum using Rust, you will be overwhelmed
with love and karma. On the other hand, you better don an asbestos suit if, instead
of raising a funding round, you sell your startup to a private equity company
so they can add telemetry in the codebase to power targeted ads.

But that’s just a hazy intuition! Since the community is a big believer in rationality
and data science (the phrase is divisive though), we would be in a stronger position
if we could back our hunch with proper data analysis.

Also, it would give us an excuse to play with large language models which, all the
AI hype aside, are mind-blowingly effective at practical tasks like this. And,
we happen to be developers of an open source tool,
Metaflow, written in Python,
which makes it fun and educational to hack projects like this.

Type the bolded phrases in the textbox above to see if we are optimizing for an
appropriate emotional response.

What topics trend on Hacker News?

As described in the implementation section below, we downloaded about 100,000 pages posted
on Hacker News to understand what content resonates with the community. We focus on posts
that gained at least 20 upvotes and more than five comments, prompting a large language model to come
up with ten phrases that best describe each page.

Here are the top 20 topics, aggregated from the 100,000 pages, along with the number of
posts covering each topic:

The top topics are hardly surprising. However, the community is known for having
diverse intellectual interests. The top 20 topics represent only 10% of the topics
covered across posts. You can see this for yourself using the tool above, which
covers all 14,000 topics that are associated with at least five posts.

Have the top topics evolved over time? You bet. Here are the top-10 trending
topics:

Even though the dataset only extends to June 2023, we can see a tsunami
of AI, natural language processing, and related topics. Sadly, layoffs started
trending mid-2022 as well. The rise of AI articles likely explains the growth
in the Technology topic as well.

What’s declining

In 2020-2022, the overwhelming macro-trend was everything related
to the COVID pandemic, which fortunately is not a pressing issue
anymore. Fascinatingly, August 2021 was a tumultuous month, with
Apple’s proposed CSAM scanning in particular
causing an uproar in posts related to Privacy and Apple.

If you want to take a trip down memory lane, consider the past
topics on the left which haven’t been covered once since January 2022:

| Before Jan 2022 📉 | After Jan 2022 📈 |
|---|---|
| George Floyd | GPT-4 |
| Herd immunity | Stable Diffusion |
| Antibodies | Russia-Ukraine War |
| IOS 14 | Ventura |
| Freenode | Bank Failure |
| Suez Canal | Midjourney |
| Wallstreetbets | Hiring Freeze |
| Hydroxychloroquine | FIDO Alliance |
| Infection rates | Cost of living crisis |

Correspondingly, the topics on the right didn’t exist before
January 2022. It doesn’t take a PhD in Economics to understand how
Wallstreetbets got replaced by Cost of living crisis,
Hiring freeze, and Bank failure.

But how do people feel about these topics?

Importantly, sharing or upvoting a post doesn’t imply endorsement – often the opposite. Hence
to truly understand the dynamics of a community, we need to analyze how people react to posts, as expressed
through their comments.

To do this, we reconstructed comment threads associated with the posts in the dataset and asked
an LLM to classify the sentiment of the discourse between 0 and 9, zero being an all-out flamewar
and nine indicating perfect harmony and positivity.

Our dutiful LLM took on the job of virtual community moderator and went through 100k comment threads in
about 9 hours, reading through 230M words spanning the full emotional gamut of bitterness,
passion, wisdom, humor, and love.

Here’s what the LLM came back with in terms of the distribution of sentiments:

Firstly, the LLM is utterly befuddled about what a neutral discussion looks like – there are no 5s in the results.
Or maybe this is a snarky note from the LLM, implying that humans are incapable of unemotional, unbiased discourse.

Secondly, the sentiments are clearly skewed
towards the positive side, which aligns with our personal experience with the site. The bias to
positivity and optimism is a key reason why we open Hacker News daily. There’s enough vitriol elsewhere.

It is also worth noting that at least some of the rare scores of one assigned by the machine seem to be bogus.
For instance, this post by BentoML was given a score of one,
although the sentiment seems majorly positive and supportive. All the usual caveats about LLMs apply.

Now with the topics and the sentiment scores at hand, we can finally give a scientific answer to the question of
what the community loves and loves to hate:

| Love 😍 | Hate 😠 |
|---|---|
| Programming | FTX |
| Computer Science | Police Misconduct |
| Open Source | Sam Bankman-Fried |
| Python | Xinjiang |
| Game Development | Torture |
| Rust | Employee Monitoring |
| Electronics | Cost Cutting |
| Mathematics | Racial Profiling |
| Functional Programming | Online Safety Bill |
| Programming Language | War on Terror |
| Physics | Atlassian |
| Embedded Systems | CSAM |
| Self Improvement | NYPD |
| Database | Alameda Research |
| Unix | International Students |
| Astronomy | TSA |
| Retro Computing | Earn It Act |
| Nostalgia | Car Features |
| Debugging | Bloatware |

Geeks, nerds, and hackers should find themselves right at home! Interestingly, while the community tends to be
visionary and forward-looking (with technical matters at least), it definitely has a soft spot for (technical) nostalgia.
While the modern world is amazing, we miss ZX Spectrum, Z80 assembly, and 8086 dearly. At least we find some
comfort in PICO-8.

On the anger-inducing side, most topics need no explanation. It is worth clarifying though that Hacker News does not
hate International Students, but the posts related to them tend to be overwhelmingly negative,
reflecting the community’s sympathy for the challenges faced by those studying abroad.

Comically, Hacker News is not a community of car lovers. When we talk about cars, it is because there’s
something wrong with them. For more insights like this, use the tool at the top of the page to explore
the diverse landscape of HN topics in detail.

Some topics are just divisive

Besides topics being unimodally love or hate-inducing, some topics are bimodal: Sometimes
a post about the topic generates a highly positive response, other times a flamewar. Examples include

- GNOME – KDE vs. GNOME – the war has been raging for 25 years.
- Google – a dominant force on the Internet, for both good and bad.
- Government regulations – damned if you do and damned if you don’t.
- Venture capital – the lifeblood of Silicon Valley and a source of endless gossip.

See more by sorting by the divisive column in the tool above. To rank highly on the divisiveness
score, a topic must be associated with negative and positive posts in roughly equal measure,
with few neutral ones.
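One plausible way to compute such a bimodality score is sketched below; the thresholds and the exact formula are our assumptions for illustration, not necessarily what the tool uses.

```python
# A sketch of a divisiveness score: reward topics whose sentiment scores
# split evenly between the negative (<=3) and positive (>=6) ends, with few
# neutral scores in between. Thresholds and formula are assumptions.
def divisiveness(scores):
    neg = sum(s <= 3 for s in scores)
    pos = sum(s >= 6 for s in scores)
    if not neg or not pos:
        return 0.0
    balance = min(neg, pos) / max(neg, pos)   # 1.0 when perfectly split
    polar_share = (neg + pos) / len(scores)   # 1.0 when nothing is neutral
    return balance * polar_share

print(divisiveness([1, 2, 8, 9]))  # 1.0: perfectly split, fully polarized
print(divisiveness([7, 8, 8, 9]))  # 0.0: uniformly positive
```

A uniformly loved or hated topic scores zero, and a topic buried in neutral scores stays low even if its extremes are balanced.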

Is the mood improving?

Cue Betteridge’s Law.

We can plot the average sentiment over time, as expressed in daily comment threads:

The data proves that things sucked in August 2021 (or maybe it was just the Apple debacle highlighted
in the chart above). Overall, there’s a clear but modest downward trend in the average sentiment.
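The daily aggregation behind such a chart is simple; a minimal sketch with made-up sample rows:

```python
from collections import defaultdict
from statistics import mean

# Average the per-thread sentiment scores by the day of the post.
# The (day, score) rows below are fabricated for illustration only.
rows = [
    ("2021-08-05", 3), ("2021-08-05", 2),
    ("2022-01-10", 7), ("2022-01-10", 6),
]
by_day = defaultdict(list)
for day, score in rows:
    by_day[day].append(score)
daily_avg = {day: mean(scores) for day, scores in sorted(by_day.items())}
print(daily_avg)  # {'2021-08-05': 2.5, '2022-01-10': 6.5}
```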

It
would require a deeper analysis to understand why this is the case. Putting aside an obvious hypothesis
that life is just getting worse (which might not be true),
an alternative hypothesis may be a variant of
Eternal September. It takes conscious and tireless
effort to maintain a positive mood in a growing community. Kudos to HN moderators for keeping the
community thriving and positive over the years!

Implementation

A reason to be excited and optimistic about the future is the very existence of this article.
While natural language processing and sentiment analysis have been around for decades,
the quality, versatility, and ease of use afforded by LLMs are absolutely unprecedented.

Attaining the quality
of topics and sentiment scores with a messy dataset like the one here would have required a PhD-thesis
level of effort just a few years ago – and very likely the results would have been worse. In contrast,
we developed all the code for this article in about seven hours. Processing 350M tokens with just
a decent-sized model would have required a supercomputer a decade ago, whereas in our case it took about 16 hours
using widely available hardware.

Most amazingly, all the building blocks, LLMs included, are available in open source! Let’s do a
quick overview (with code and data), showing how you can repeat the experiment at home.

Here’s what we did at the high level:

Each white box in the picture is a Metaflow flow, linked below:

1. HNSentimentInit creates a list of posts to analyze.
2. HNSentimentCrawl downloads the posts.
3. HNSentimentAnalyzePosts parses the posts and runs them through an LLM.
4. HNSentimentCommentData reconstructs comment threads based on a Hacker News dataset in Google BigQuery. We should/could have used this in step (1) too. Next time!
5. HNSentimentAnalyzeComments runs the comment threads through an LLM.

Data analyses and the charts shown above are produced in notebooks (the gray box in the picture).

Here’s what the flows do:

Downloading posts

First, we wanted to analyze topics covered by Hacker News posts that have
generated some discussion. Using a publicly
available dataset of HN posts
(thanks Julien!), we queried all posts between
January 2020 and June 2023 (the latest date available in this dataset) that had at
least 20 upvotes and more than five comments, resulting in about 100,000 posts.
Here’s the simple Metaflow flow
that did the job, much thanks to DuckDB.

Since the 100,000 posts are mostly on different domains, we can safely download
them in parallel without DDOS’ing the servers. It took only around 25 minutes to
download the pages with 100 parallel workers
(see here how).
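The fan-out itself needs nothing fancy. A sketch with a thread pool is below; fetch() is a stub standing in for an HTTP GET with a timeout so the example stays runnable offline, whereas the real crawl ran as parallel Metaflow tasks.

```python
from concurrent.futures import ThreadPoolExecutor

# Stub for an HTTP GET; a real fetch would use a timeout and error handling.
def fetch(url):
    return f"<html>contents of {url}</html>"

urls = [f"https://example.com/post/{i}" for i in range(10)]
# Mostly-distinct domains mean ~100 concurrent workers won't hammer any
# single server.
with ThreadPoolExecutor(max_workers=100) as pool:
    pages = list(pool.map(fetch, urls))
print(len(pages))  # one page per URL
```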

Large-scale document understanding with LLMs

Parsing the text content from random HTML pages used to be a massive PITA, but
BeautifulSoup makes it beautifully
straightforward.
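A minimal sketch of the extraction step, assuming we only want visible text with scripts and styles dropped:

```python
from bs4 import BeautifulSoup

# Strip script/style tags and collapse whitespace into clean running text.
def extract_text(html):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())

html = "<html><body><h1>Title</h1><script>var x=1;</script><p>Hello  world</p></body></html>"
print(extract_text(html))  # "Title Hello world"
```

Real pages are far messier, but BeautifulSoup's lenient parser copes with most of them.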

At this point, we have a relatively clean dataset of about 100,000 text documents with
more than 500M tokens in total. Processing the dataset once with a state-of-the-art LLM,
say, GPT-4o or Llama 3 70B on AWS Bedrock, would cost around $1,300 (using the OpenAI batch API),
not counting the output tokens. Besides the cost, time is a concern – we want to process
the dataset as fast as possible, so we can evaluate the results quickly, and rinse-and-repeat
if needed. Hence we don’t want to get rate-limited or otherwise bottlenecked by the APIs.

We have posted previously about our
success with NVIDIA NIM microservices that
provide a quickly growing library of LLMs and other GenAI models as prepackaged images,
fully optimized to take advantage of vLLM, TensorRT-LLM and the Triton Inference Server,
so you don’t have to spend time chasing the latest tricks with LLM
inference.

Since we had NIMs included in our Outerbounds deployment,
we just added @nim to our flow
to get access to a high-throughput LLM endpoint that costs only as much as the auto-scaling GPUs
it runs on, in this case, four H100 GPUs. Naturally you can hit an LLM endpoint of your
choosing – the options are many these days.

Prompting an LLM to produce a list of topics for each post

Our prompt is straightforward:

Assign 10 tags that best describe the following article.
Reply only the tags in the following format:
1. first tag
2. second tag
N. Nth tag

[First 5000 tokens from a web page]

The Llama 3 70B model we used has an 8,000-token context window, but we decided to
limit the number of tokens to 5,000 to account for differences in the tokenizer behavior,
making sure that we don’t go past the limit.
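The model's numbered reply can be parsed back into a clean tag list with a few lines; a sketch assuming replies follow the format requested in the prompt above:

```python
import re

# Pull the tag text out of lines shaped like "1. first tag", normalizing case.
def parse_tags(reply):
    tags = []
    for line in reply.splitlines():
        m = re.match(r"\s*\d+\.\s*(.+)", line)
        if m:
            tags.append(m.group(1).strip().lower())
    return tags

reply = "1. rust\n2. retro computing\n3. Game Development"
print(parse_tags(reply))  # ['rust', 'retro computing', 'game development']
```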

Processing about 140M input tokens in this manner took about 9 hours. We were able to
increase throughput to around 4,300 input tokens per second by hitting the model concurrently
with five workers, as neatly shown in our UI below, to take advantage of dynamic batching
and other optimizations.

Reconstructing comment threads

Instead of trying to download 100,000 comment pages directly from Hacker News, we
leveraged a Hacker News dataset in Google
BigQuery.
Annoyingly, comments in the database are not directly associated with their parent
post, so we had to implement a small function to reconstruct the comment
threads.
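The idea behind such a function can be sketched as follows; the dict of parent links is a stand-in for the exported rows, whose exact shape depends on the query.

```python
# Each comment row carries only its direct parent id, so we follow parent
# links upward until we reach an id with no parent: the root story.
def root_story(item_id, parent_of):
    while item_id in parent_of:
        item_id = parent_of[item_id]
    return item_id

parent_of = {11: 1, 12: 1, 13: 11}  # comments 11, 12 under story 1; 13 replies to 11
threads = {}
for comment_id in parent_of:
    threads.setdefault(root_story(comment_id, parent_of), []).append(comment_id)
print(threads)  # {1: [11, 12, 13]}
```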

We can’t run the function in BigQuery directly, but luckily we can easily export the data
as Parquet files. Loading the resulting 16M rows into DuckDB and scanning through them
was a breeze, aided by the fact that Metaflow knows how to load data
fast. We just
added @resources(disk=10000, cpu=8, memory=32000) to run the function on a large enough instance.

Prompting an LLM to analyze sentiment

With the comment threads at hand, we were able to run them through our
LLM with this prompt:

In the scale between 0-10 where 0 is the most negative sentiment
and 10 is the most positive sentiment, rank the following discussion.
Reply in this format:

SENTIMENT X

where X is the sentiment rating

[First 3000 tokens from a comment thread]

Getting a simple structured output like this seems to work without issues. We processed
some 230M input tokens in this manner, which took only about 7 hours as we
needed only two output tokens.
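Extracting the score from the two-token reply is a one-line regex; a sketch, with the fallback for malformed replies being our own assumption:

```python
import re

# Pull the integer out of a "SENTIMENT X" reply; return None if the reply
# doesn't match the requested format.
def parse_sentiment(reply):
    m = re.search(r"SENTIMENT\s+(\d+)", reply)
    return int(m.group(1)) if m else None

print(parse_sentiment("SENTIMENT 7"))  # 7
print(parse_sentiment("no idea"))      # None
```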

You can reproduce all the steps above using your favorite toolchain. Here are
key reasons why we used Metaflow and why you might want to consider it too:

Staying organized without effort – a big benefit compared to random Python
scripts or notebooks is that Metaflow persists all artifacts automatically, tracks all
executions, and keeps everything organized. We relied on Metaflow’s
namespaces to execute large and expensive
runs alongside prototypes, knowing that the two can’t interfere with each other. We
used Metaflow tags
to organize data sharing between flows while they were being developed independently.

Easy cloud scaling – crawling required horizontal
scaling,
DuckDB required vertical scaling,
and LLMs required a GPU backend.
Metaflow handled all the cases out of the box.

Highly available orchestrator – running a large dataset through an LLM can cost
thousands of dollars. You don’t want the run to fail because of random issues.
We relied on a highly available Argo Workflows
orchestrator that Metaflow
supports out of the box to keep the run running for hours.

You can do all of the above using open-source Metaflow, but we had a few
additional benefits from running the flows on Outerbounds Platform:
it is simply fun to develop code like this, notebooks included, with VSCode running on
cloud workstations;
scaling to the cloud is smooth sailing;
and @nim allowed us to hit LLMs without
worrying about cost or rate limiting.

If features like this sound relevant to your interests, we are happy
to get you started for free.

Dive deeper at home

There’s much more that can be analyzed and visualized with this dataset. Instead of spending
a few thousand dollars hitting OpenAI APIs, you can
download the topics and sentiments we created:

post-sentiment.json contains a mapping post_id -> sentiment_score

post-topics.json contains a mapping post_id -> [topics]

topics-data.json contains a cleaned and joined dataset based on the above JSONs,
powering the tool at the top of this article.
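Joining the two mappings into per-topic score lists takes only a few lines. A sketch is below; in practice you would json.load() the downloaded files, and the tiny inline samples here are made up to keep the sketch runnable.

```python
import json

# Inline stand-ins for the downloadable files: post_id -> score and
# post_id -> [topics]. Values are fabricated for illustration.
sentiment = json.loads('{"101": 8, "102": 2, "103": 7}')
topics = json.loads('{"101": ["rust"], "102": ["ftx"], "103": ["rust"]}')

# Group sentiment scores under each topic the post was tagged with.
by_topic = {}
for post_id, topic_list in topics.items():
    if post_id in sentiment:
        for t in topic_list:
            by_topic.setdefault(t, []).append(sentiment[post_id])
print(by_topic)  # {'rust': [8, 7], 'ftx': [2]}
```

From here, medians per topic reproduce the sentiment column in the tool.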

You can find metadata related to post IDs in this HuggingFace
dataset and in the
Google BigQuery Hacker News
dataset.
To view posts, simply open
https://news.ycombinator.com/item?id=[post_id].

For instance, it would be interesting to look into the correlation between post domains and
sentiments and topics. Our hunch is that certain domains produce predominantly positive
sentiments and vice versa. Or, do divisive topics garner more points?

If you create something fun with this data, please link back to this blog article and let us
know! Join Metaflow Slack and drop a note on #ask-metaflow.

To support our open-source efforts, please give Metaflow a star! 🤗
