Starting from scratch: Training a 30M Topological Transformer

(tuned.org.uk)

77 points | by tuned 5 hours ago

5 comments

kouteiheika 7 minutes ago
If you want to prove (i.e. show that it works and/or it's faster in a real-world scenario) a new alternative to attention without breaking the bank then one of the best ways to do that would probably be to retrain an already existing model, just with swapped attention modules. Then once you have such a model you can do apples-to-apples benchmarks.
This has been done successfully in the past:
https://huggingface.co/featherless-ai/QRWKV-72B
Note that this is a 72B model which would be very expensive to train from scratch, but here they did the conversion for less than $2000.
ashirviskas 3 hours ago
I wonder: what if we just crammed more into the "tokens"? I am running an experiment of replacing discrete tokens with embeddings + a small byte encoder/decoder. That way you can use the embedding space much more efficiently and have it contain much more nuance.
Experiments I want to build on top of it:
1. Adding LSP context to the embeddings - that way the model could _see_ the syntax better, closer to how we use IDEs, and would not need to read/grep 25k lines just to find where something is used.
2. Experiments with different "compression" ratios. Each embedding could encode a different number of bytes, and we would not rely on a huge static token dictionary.
I'm aware that papers exist that explore these ideas, but so far no popular/good open source models employ this. Unless someone can prove me wrong.
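For readers who want something concrete, here is a minimal sketch of the "embeddings + small byte encoder" idea. It is not the commenter's code: the names `make_byte_table` and `encode_patches`, the fixed patch size, and the mean pooling are all assumptions made for illustration. A real encoder would be learned and would keep byte order, which mean pooling throws away.

```python
import random


def make_byte_table(dim, seed=0):
    """Hypothetical stand-in for a learned table: one vector per byte value."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(256)]


def encode_patches(text, table, patch_size=4):
    """Turn raw UTF-8 bytes into one pooled embedding per patch of bytes.

    No token dictionary is involved: the model input is a list of vectors,
    each summarising up to `patch_size` bytes.
    """
    data = text.encode("utf-8")
    dim = len(table[0])
    patches = []
    for start in range(0, len(data), patch_size):
        chunk = data[start:start + patch_size]
        # Mean-pool the byte vectors of this patch (an assumption, see above).
        pooled = [sum(table[b][d] for b in chunk) / len(chunk) for d in range(dim)]
        patches.append(pooled)
    return patches


table = make_byte_table(dim=8)
patches = encode_patches("hello world", table, patch_size=4)
print(len(patches), len(patches[0]))
```

Changing `patch_size` is the crude version of the "compression ratio" knob from point 2: larger patches mean fewer, denser embeddings per input.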
- Yemoshino 1 hour ago
  I found a few papers in this direction with perplexity like this one https://ceur-ws.org/Vol-4005/paper1.pdf and it doesn't seem to be that relevant for now.
  The progress of a handful of models seems to be so much better (because of limited compute we have only a handful of big ones, I presume) that these fine-tunings are just not yet relevant.
  I'm also curious what an English + Java + HTML + CSS + JavaScript only model would look like in size and speed, for example.
  Unfortunately, whenever I ask myself the question of fine-tuning tokens (just a few days ago this question came up again), deep diving takes too much time.
  Claude only got LSP support in November, I think, and it's not even clear to me to what extent. So despite the feeling that we are moving fast, tons of basic ideas haven't even made it in yet.
- appplication 1 hour ago
  Not an expert in the space, but I’m not sure you need to modify tokens to get the model to see syntax, you basically get that exact association from attention.
lostmsu 3 hours ago
Comparison with vanilla of the same size/flops budget?
- Lerc 3 hours ago
  I'm not sure if that is the right calculation.
  Provided the flops are not prohibitive, output quality per model byte might be better. In general people run the largest model they can.
  I certainly think trading speed for quality at the same size is worth looking at. Especially if it uses methods that can benefit from the efforts of others to improve speed in general.
  That said, the performance difference at 30M may not be representative of the performance difference at 30B.
  There are probably a lot of really good ideas out there waiting for someone to drop a few million in training to reveal how good they are on large sizes.
  - lostmsu 3 hours ago
    So no comparison?
keyle 3 hours ago
Does this make any sense, to anyone?
- kannanvijayan 3 hours ago
  I think this is an attempt to try to enrich the locality model in transformers.
  One of the weird things you do in transformers is add a position vector which captures the distance between the token being attended to and some other token.
  This is obviously not powerful enough to express non-linear relationships - like graph relationships.
  This person seems to be experimenting with doing pre-processing of the input token set, to linearly reorder it by some other heuristic that might map more closely to the actual underlying relationship between each token.
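  To make the "position vector" concrete, here is a minimal sketch of the sinusoidal scheme from the original Transformer paper, where a fixed vector per position is added to each token embedding. It illustrates the baseline being discussed, not the linked post's method; the function names are made up for the example.

```python
import math


def sinusoidal_position(pos, d_model):
    """Fixed position vector: sin on even dimensions, cos on odd dimensions."""
    vec = []
    for k in range(d_model):
        i = k // 2  # each sin/cos pair shares one frequency
        angle = pos / (10000 ** (2 * i / d_model))
        vec.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return vec


def add_positions(token_embeddings):
    """Add the position vector for index `pos` to the embedding at that index."""
    d_model = len(token_embeddings[0])
    return [
        [x + p for x, p in zip(emb, sinusoidal_position(pos, d_model))]
        for pos, emb in enumerate(token_embeddings)
    ]


print(sinusoidal_position(0, 4))
```

  The only structure this injects is the token's index along a line, which is the limitation the comment above is pointing at.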
  - adroniser 2 hours ago
    Adding the position vector is basic sure, but it's naive to think the model doesn't develop its own positional system bootstrapping on top of the barebones one.
- pwndByDeath 2 hours ago
  No, it's a new form of alchemy that turns electricity into hype. The technical jargon is more of a thieves' cant to help identify conmen to one another.
  - postflopclarity 1 hour ago
    that's a strange way to spell "no, I didn't understand the paper"
  - Yemoshino 1 hour ago
    Try to get over your AI hate.
    If you need help getting more out of AI, you can use ChatGPT and co. to go through papers and have paragraphs explained to you ELI5-style. 3Blue1Brown also has a few great videos about transformers and how they work.
- liteclient 3 hours ago
  it makes sense architecturally
  they replace dot-product attention with topology-based scalar distances derived from a laplacian embedding - that effectively reduces attention scoring to a 1D energy comparison which can save memory and compute
  that said, i’d treat the results with a grain of salt given there is no peer review, and benchmarks are only on a 30M-parameter model so far
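  A rough sketch of what "attention scoring as a 1D comparison" could look like, for anyone trying to picture it. This is not taken from the linked post: it assumes each token already has a single scalar (here called `energies`, standing in for a coordinate from some Laplacian-style embedding), and it assumes the score is the negative absolute difference between two scalars. The function name and the temperature parameter are hypothetical.

```python
import math


def scalar_distance_attention(energies, values, temperature=1.0):
    """Causal attention where score(i, j) = -|e_i - e_j| / temperature.

    energies: one float per token (assumed to be precomputed elsewhere).
    values:   one value vector (list of floats) per token.
    """
    dim = len(values[0])
    outputs = []
    for i, e_i in enumerate(energies):
        # Causal mask: position i only looks at positions j <= i.
        scores = [-abs(e_i - energies[j]) / temperature for j in range(i + 1)]
        # Numerically stable softmax over the visible positions.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append(
            [sum(weights[j] * values[j][d] for j in range(i + 1)) for d in range(dim)]
        )
    return outputs


out = scalar_distance_attention([0.0, 0.0, 5.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(out)
```

  Compared with dot-product attention, each score here needs one subtraction instead of a d-dimensional dot product, and only one number per token has to be kept around for scoring, which is where the claimed memory and compute savings would come from under these assumptions.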
  - reactordev 2 hours ago
    Yup, keyword here is “under the right conditions”.
    This may work well for their use case but fail horribly in others without further peer review and testing.
geoffbp 2 hours ago
I dug into this a bit (with AI ofc) and it spat this out. I found it an easy way to visualise and start to understand:
> Standard AI models (like GPT-4) treat data using Global Geometry. They imagine every word as a point floating in a massive, flat, high-dimensional room. To see how two words relate, they draw a straight line between them.
> Local Topology changes the "room" into a landscape (a manifold). Instead of a flat void, the data exists on a curved surface that has hills, valleys, and paths.
- xtiansimon 2 hours ago
  What is a "high-dimensional room"? A "room" is by definition three-dimensional in so far as we're using metaphor for description. Then to add this "high-dimensional" modifier does little for me, since the only visualizable high-dimensional cube is a tesseract, which still leaves you at 4-d.
  The presented counterpoint to this metaphor has the "room" change into a "landscape". The room is a "flat void" compared to a landscape with "hills, valleys, and paths". None of these landscape features evoke higher dimensionality in my imagination. Certainly not in the way, say, the metaphor of the "coastline" of Great Britain does when discussing the unusual properties of a fractal.
  These moves don't shift my railroad mind from one track onto another. So I wonder, if a metaphoric usage is not in some way universal, how can it be instructive?