Find Hacker News users whose comments most closely match yours.
This is based on TF-IDF-weighted keyword matching across comments. It covers data from Jan 1, 2020 through May 31, 2026. Keywords that are too rare or too broad across authors are filtered out before scoring.
Because of this filtering, many authors are NOT covered here. Full coverage would have yielded more than a few trillion records, and I don't have that much compute or disk.
Details of the process: https://hn-buddies.stupidlabs.lol/about-data
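For readers who want a feel for the approach, here is a minimal sketch of TF-IDF keyword matching between authors. It is not the site's actual pipeline: the cutoffs `min_df` and `max_df_ratio`, the use of cosine similarity as the match score, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs, min_df=2, max_df_ratio=0.8):
    """docs maps author -> list of tokens. Returns author -> {term: weight}.

    min_df and max_df_ratio are hypothetical cutoffs: a term is dropped if
    fewer than min_df authors use it (too rare) or if more than
    max_df_ratio of all authors use it (too broad).
    """
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each term once per author
    vocab = {t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio}

    vectors = {}
    for author, tokens in docs.items():
        tf = Counter(t for t in tokens if t in vocab)
        total = sum(tf.values())
        vectors[author] = {
            t: (c / total) * math.log(n / df[t]) for t, c in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def closest(author, vectors, k=3):
    """Return the k other authors with the highest similarity to `author`."""
    scores = [
        (other, cosine(vectors[author], vec))
        for other, vec in vectors.items()
        if other != author
    ]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]
```

Calling `closest("some_author", tfidf_vectors(docs))` on a small dictionary of tokenized comments returns the best-matching authors first. Note that an author whose keywords are all filtered out ends up with an empty vector and matches nobody, which is one way filtering can reduce coverage.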
---
You can also see who talks the most about a certain topic or keyword. For example:
NSA: https://hn-buddies.stupidlabs.lol/?keyword=nsa
Trump: https://hn-buddies.stupidlabs.lol/?keyword=trump
---
Global insights page: https://hn-buddies.stupidlabs.lol/insights
---
This also lets you uncover duplicate accounts. For example:
1. "fdklhhjf" and "selamcan"
2. "angkatoto" and "jalantoto"
3. "Donnakravo" and "dommakravosec" and "kravossedonna" and "kravosdonna"
This could mean you might take the “temperament” of the person posting (like “estj”) and map the 2nd and 4th letters to the 4th and 2nd (the 2nd is regarded as the “input” and the 4th as the “output”), so “S” is compatible with “P” and “N” is compatible with “J”.
And then give a bonus modification for the opposite of the others, so “E” likes “I” and vice versa (always a quiet dude hanging out with a talkative dude, though not exclusively). And “T” prefers the company of “F”, though not exclusively (see these as technical and creative).
This gives you compatible interfaces (input/output) and diverging (thus “more interesting”) social dispositions.
You could probably turn that into a good dating algorithm if it isn’t already, though it works for “pals” too!
The data comes from a daily-updated public BigQuery dataset: https://news.ycombinator.com/item?id=40644563
Quick glance: TF-IDF, cosine-similarity, the only thing missing is a nice UMAP :-)
Most of the authors are actually missing. Full processing would have yielded a multi-trillion-row dataset. I didn't really have that kind of compute available.
I even tried running the cross-join on BigQuery. After one hour, only about 3% was done, so I had to cancel it.
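To give a sense of why an all-pairs comparison blows up, here is a back-of-the-envelope sketch. It assumes the rows are author-to-author pairs, and the author count used below is a made-up round number, not the real size of the dataset.

```python
def pair_count(n_authors):
    """Number of unordered author pairs among n_authors (n choose 2)."""
    return n_authors * (n_authors - 1) // 2


# Hypothetical author count, chosen only to show the order of magnitude.
print(f"{pair_count(3_000_000):,}")  # 4,499,998,500,000
```

The count grows quadratically, so doubling the number of authors roughly quadruples the number of pairs to score.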