• Hackworth@lemmy.world
    4 hours ago

    My napkin math for running the full 671B-parameter R1 came out to something like three-quarters of a million dollars in hardware. But that (plus the much lower training cost) makes this a millionaire’s game rather than a billionaire’s. Plus, the distillations do seem better than anything else we have at the smaller sizes at the moment. That said, I’m more looking forward to the first use of DeepSeek’s methods with Google’s Titans architecture.
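For anyone who wants to redo this kind of napkin math themselves, here is a rough sketch of how it could go. It is not the exact calculation behind the figure above. The only number taken from the comment is the 671B parameter count; the function names, the 20% overhead factor, the 8-GPU node size, and every price are hypothetical placeholders you would swap for real quotes.

```python
import math

def gpus_needed(params_billion, bytes_per_param, gpu_mem_gb, overhead=1.2):
    """GPUs required just to hold the weights, padded by an assumed overhead factor.

    params_billion * bytes_per_param gives the weight size in GB
    (1e9 params * N bytes = N GB). The overhead factor is a guess meant to
    cover KV cache and activations; it is an assumption, not a measurement.
    """
    weights_gb = params_billion * bytes_per_param
    return math.ceil(weights_gb * overhead / gpu_mem_gb)

def hardware_cost(n_gpus, price_per_gpu_usd, gpus_per_node=8, node_overhead_usd=0):
    """Total cost when GPUs can only be bought in whole nodes.

    price_per_gpu_usd and node_overhead_usd are placeholders; plug in
    whatever a vendor actually quotes.
    """
    nodes = math.ceil(n_gpus / gpus_per_node)
    return nodes * (gpus_per_node * price_per_gpu_usd + node_overhead_usd)

if __name__ == "__main__":
    # 671B parameters at 1 byte each (8-bit weights) on hypothetical 80 GB cards.
    n = gpus_needed(params_billion=671, bytes_per_param=1, gpu_mem_gb=80)
    print("GPUs needed:", n)
    # Placeholder prices only, chosen to show the shape of the calculation.
    print("Cost (USD):", hardware_cost(n, price_per_gpu_usd=1000, node_overhead_usd=500))
```

The point of rounding up to whole nodes is that the bill jumps in steps, so the total is quite sensitive to the precision you run at and the memory per card.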