• 0 Posts
  • 13 Comments
Joined 3 years ago
Cake day: June 9th, 2023


  • A token is the base unit of text that an LLM works with, and it has always been that way. The LLM does not work directly with characters; they are grouped into chunks, often smaller than a word, and this stream of tokens is what the LLM processes. This is also why LLMs have such trouble with spelling questions like “how many Rs in raspberry?”: they do not see the individual letters in the first place, so they do not know.

    No, LLMs do not all tokenize the same way. Different tokenizers are (or at least once were) one of the major ways they differed from each other. A simple tokenizer might split words into one token per syllable, but I think they’ve gotten much more complicated than that now.

    My understanding is very basic and out-of-date.
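To make the "chunks, not letters" point concrete, here is a minimal sketch of a toy tokenizer in Rust. Everything in it is an assumption for illustration: the vocabulary, the IDs, and the names `toy_vocab` and `tokenize` are made up, and the greedy longest-match rule is a stand-in, not how any particular LLM's tokenizer actually works (real ones are typically learned from data).

```rust
use std::collections::HashMap;

/// Hypothetical toy vocabulary: every piece and every ID here is invented.
fn toy_vocab() -> HashMap<&'static str, u32> {
    HashMap::from([
        ("r", 0), ("a", 1), ("s", 2), ("p", 3), ("b", 4), ("e", 5), ("y", 6),
        ("asp", 7), ("berry", 8), ("straw", 9),
    ])
}

/// Greedy longest-match tokenizer: at each position, take the longest
/// vocabulary piece that matches. Returns None if no piece matches at
/// some position.
fn tokenize(text: &str, vocab: &HashMap<&str, u32>) -> Option<Vec<u32>> {
    let mut ids = Vec::new();
    let mut start = 0;
    while start < text.len() {
        // Try the longest candidate first and shrink until a piece matches.
        let mut end = text.len();
        let mut matched = None;
        while end > start {
            if text.is_char_boundary(end) {
                if let Some(&id) = vocab.get(&text[start..end]) {
                    matched = Some((id, end));
                    break;
                }
            }
            end -= 1;
        }
        let (id, next) = matched?;
        ids.push(id);
        start = next;
    }
    Some(ids)
}
```

With this made-up vocabulary, `tokenize("raspberry", &toy_vocab())` gives the IDs for the pieces `r`, `asp`, `berry`. A model fed those three numbers has no direct view of the three separate letter Rs inside them, which is the spelling problem described above.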






  • it is the only thing giving them an advantage over the USA

    That’s really not true at all anymore. China is an absolute manufacturing powerhouse. Almost all of the industry that used to be the USA’s strength in the ’50s is China’s strength now.

    They haven’t been the cheapest source of labor for a while now, and they don’t need to be.

    Don’t get me wrong, the USA has other, newer strengths now: tech and design, among others. But they do appear to be throwing them away and ceding them to others, especially China, as hard and fast as they can.

    On the other hand, a humanoid shape for robots seems like an extreme waste of technical complexity and cost, so in my opinion this particular article mostly shows how China is also beating the USA at being faddish and dumb in following tech fashion.






  • But if you are doing something advanced, down at the hardware level

    This part is wrong. Otherwise, yes, correct.

    The “unsafe” code in Rust is allowed to access memory locations in ways that skip the compiler’s checks, and with them the guarantee that the memory location holds valid data. The programmer is on their own to ensure that.

    Which, as you say, is just the normal state of affairs for all C code.

    This is needed not because of hardware access but just because sometimes the proof that the access is safe is beyond what the compiler is able to represent.
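A common illustration of that last point, sketched in Rust with no hardware involved: handing out two mutable halves of one slice. The halves never overlap, but the borrow checker only sees "two mutable borrows of the same slice" and rejects the safe version. The name `split_halves` is made up for this sketch; the standard library already offers the same thing as `split_at_mut`.

```rust
/// Splits a slice into two non-overlapping mutable halves at `mid`.
/// Hypothetical helper for illustration; std provides `split_at_mut`.
fn split_halves(values: &mut [i32], mid: usize) -> (&mut [i32], &mut [i32]) {
    let len = values.len();
    // This runtime check is part of what makes the unsafe block sound.
    assert!(mid <= len, "mid must be within the slice");
    let ptr = values.as_mut_ptr();

    // SAFETY: `mid <= len` was checked above, so [0, mid) and [mid, len)
    // are both in bounds and do not overlap. The compiler cannot express
    // that disjointness itself, so the programmer vouches for it here.
    unsafe {
        (
            std::slice::from_raw_parts_mut(ptr, mid),
            std::slice::from_raw_parts_mut(ptr.add(mid), len - mid),
        )
    }
}
```

Callers only ever see the safe signature: `let (left, right) = split_halves(&mut data, 2);` lets both halves be written independently. The unsafe part stays small and comes with a written argument for why it is fine, which is about all the programmer can do when the proof is out of the compiler's reach.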