commit 012a6968953a338363e613a33560718504f7df66
Author: enidwojcik7454
Date:   Mon Feb 10 00:41:09 2025 +0800

    Add Understanding DeepSeek R1

diff --git a/Understanding-DeepSeek-R1.md b/Understanding-DeepSeek-R1.md
new file mode 100644
index 0000000..9a9f8cf
--- /dev/null
+++ b/Understanding-DeepSeek-R1.md
@@ -0,0 +1,92 @@
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that has been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, it also ships with fully MIT-licensed weights. That makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.
+
What makes DeepSeek-R1 particularly exciting is its transparency. Unlike the less open approaches of some industry leaders, DeepSeek has published a detailed training recipe in their paper.
The model is also extremely affordable, with input tokens costing just $0.14-0.55 per million (vs. o1's $15) and output tokens at $2.19 per million (vs. o1's $60).
+
Until around GPT-4, the common wisdom was that better models required more data and compute. While that's still true, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
+
## The Essentials
+
The DeepSeek-R1 paper presented multiple models, but the main ones among them are R1 and R1-Zero. Following these is a series of distilled models that, while interesting, I won't discuss here.
+
DeepSeek-R1 builds on two major ideas:
+
1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.
+
R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a `<think>` tag, before responding with a final summary.
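As a concrete illustration, here is a minimal sketch of how such an output can be split into the reasoning trace and the final answer. The `<think>` tag name follows R1's convention; the parsing helper itself is just an example, not part of DeepSeek's code:

```python
import re

def split_reasoning(output: str) -> tuple[str, str]:
    """Separate the chain-of-thought inside <think>...</think> from the final answer."""
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if not match:
        return "", output.strip()
    return match.group(1).strip(), output[match.end():].strip()

reasoning, answer = split_reasoning(
    "<think>21 * 2 = 42, so the result is 42.</think>The answer is 42."
)
print(answer)  # -> "The answer is 42."
```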
+
## R1-Zero vs R1
+
R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base without any supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
R1-Zero attains excellent accuracy but sometimes produces confusing outputs, such as mixing several languages in a single response. R1 fixes that by adding limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.
+
It is interesting how some languages may express certain ideas better, which leads the model to pick the most expressive language for the task.
+
## Training Pipeline
+
The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It shows how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.
+
It's interesting that their training pipeline differs from the usual one:
+
- The usual training approach: pretraining on a large dataset (training to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
- R1-Zero: Pretrained → RL
- R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages
+
- Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives a good model to start RL from.
- First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing the chain-of-thought into thinking tags). Once they were near convergence in the RL process, they moved on to the next step. The outcome of this step is a strong reasoning model but with weak general capabilities, e.g., poor formatting and language mixing.
- Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples. (A minimal sketch of this step follows the list.)
- Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general abilities.
- Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
- They also did model distillation for several Qwen and Llama models on the reasoning traces to obtain the distilled-R1 models.
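Here is the promised sketch of the rejection-sampling step, under stated assumptions: `generate` stands in for sampling from the RL checkpoint, and `is_correct` / `is_well_formatted` stand in for the rule-based checks; none of these names come from DeepSeek's code.

```python
def build_sft_data(prompts, generate, is_correct, is_well_formatted, k=16):
    """Rejection sampling: sample k completions per prompt from the RL checkpoint
    and keep only those that pass the correctness and formatting checks.
    The survivors become supervised fine-tuning examples."""
    sft_examples = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        kept = [c for c in candidates if is_correct(prompt, c) and is_well_formatted(c)]
        sft_examples.extend({"prompt": prompt, "completion": c} for c in kept)
    return sft_examples
```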
+
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model.
The teacher is typically a larger model than the student.
+
## Group Relative Policy Optimization (GRPO)
+
The fundamental idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers.
They used a reward system that checks not only for accuracy but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.
+
In this paper, they incentivize the R1 model to produce chain-of-thought reasoning through RL training with GRPO.
Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.
+
What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions.
Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it may give a higher reward if the answer is correct, if it follows the expected `<think>`/`<answer>` formatting, and if the language of the answer matches that of the prompt.
Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
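To make that concrete, here is a toy sketch of such a rule-based reward. The exact rules and weights DeepSeek used are not published as code, so the checks and numbers below are illustrative assumptions:

```python
import re

def rule_based_reward(prompt: str, completion: str, reference_answer: str) -> float:
    """Toy reward combining correctness, format, and language-consistency checks."""
    reward = 0.0
    # Correctness: the final answer (after the reasoning block) contains the reference answer.
    final_part = completion.split("</think>")[-1]
    if reference_answer.strip() and reference_answer.strip() in final_part:
        reward += 1.0
    # Format: the reasoning is wrapped in <think>...</think> tags.
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.5
    # Language consistency (crude proxy): an ASCII-only English prompt
    # should not get an answer full of non-ASCII characters.
    if prompt.isascii() and not final_part.isascii():
        reward -= 0.5
    return reward
```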
+
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:
+
1. For each input prompt, the model generates a group of different responses.
2. Each response gets a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others.
4. The model updates its policy slightly to favor responses with higher relative rewards. It only makes small adjustments, using techniques like clipping and a KL penalty, to make sure the policy doesn't drift too far from its original behavior.
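Step 3 is the heart of the "group relative" idea. A minimal sketch of how the per-response advantages can be computed from a group of rewards, following the normalization described in the DeepSeekMath paper (the helper name is mine):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Advantage of each response = its reward relative to the group:
    (r_i - mean(r)) / (std(r) + eps), computed over the G samples for one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: rule-based rewards for 4 sampled answers to the same prompt.
rewards = torch.tensor([1.5, 0.0, 1.0, 0.5])
print(group_relative_advantages(rewards))  # better-than-average answers get positive advantages
```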
+
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance, awarding a bonus when the model correctly uses the `<think>` syntax, to guide the training.
+
While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).
+
For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
Finally, Yannic Kilcher has an excellent video explaining GRPO by going through the DeepSeekMath paper.
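As a minimal sketch of what GRPO training looks like with TRL (API names reflect recent TRL versions and may change; the base model, dataset, and reward function here are illustrative placeholders, not DeepSeek's setup):

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def format_reward(completions, **kwargs):
    # Toy rule-based reward: 1.0 if the completion wraps its reasoning
    # in <think>...</think> tags, 0.0 otherwise.
    return [1.0 if "<think>" in c and "</think>" in c else 0.0 for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # any dataset with a "prompt" column

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",   # small placeholder base model
    reward_funcs=format_reward,
    args=GRPOConfig(output_dir="grpo-demo", num_generations=8),
    train_dataset=dataset,
)
trainer.train()
```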
+
## Is RL on LLMs the path to AGI?
+
As a final note on explaining DeepSeek-R1 and the approaches they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.
+
> These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.
+
In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the number of correct answers) is largely already present in the pretrained model.
+
This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.
Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.
+
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!
+
## Running DeepSeek-R1
+
I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.
+
Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.
+
I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.
The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.
+
### 671B via llama.cpp
+
DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:
+
29 layers seemed to be the sweet spot given this configuration.
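For reference, a rough sketch of the same partial-offload idea through the llama-cpp-python bindings rather than the llama.cpp CLI. The GGUF path is a placeholder for the Unsloth 1.58-bit shards, 29 offloaded layers mirrors the configuration above, and the prompt markers follow DeepSeek-R1's chat template:

```python
from llama_cpp import Llama

# n_gpu_layers=29 offloads 29 transformer layers to the GPU; the rest stay in CPU RAM.
llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",  # placeholder path
    n_gpu_layers=29,
    n_ctx=4096,
)

out = llm(
    "<|User|>How many r's are in 'strawberry'?<|Assistant|>",
    max_tokens=512,
)
print(out["choices"][0]["text"])
```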
+
Performance:
+
A r/localllama user described being able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup.
Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get roughly 4.25 to 3.5 tokens per second.
+
As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these large models on accessible hardware.
+
What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than for other models, but their usefulness is also usually higher.
We need to both maximize usefulness and minimize time-to-usefulness.
+
### 70B via Ollama
+
70.6B params, 4-bit KM quantized DeepSeek-R1 running via Ollama:
+
GPU utilization shoots up here, as expected, compared to the mostly CPU-powered run of 671B that I showcased above.
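A minimal sketch of querying that 70B distill through the Ollama Python client, assuming the model has already been pulled locally (the `deepseek-r1:70b` tag and the question are just examples):

```python
import ollama  # pip install ollama; requires a running Ollama server

stream = ollama.chat(
    model="deepseek-r1:70b",
    messages=[{"role": "user", "content": "Is 9.11 larger than 9.9?"}],
    stream=True,
)
for chunk in stream:
    # The reasoning inside <think>...</think> streams first, then the final answer.
    print(chunk["message"]["content"], end="", flush=True)
```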
+
## Resources
+
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- [2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube)
- DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs
- The Illustrated DeepSeek-R1 - by Jay Alammar
- Explainer: What's R1 & Everything Else? - Tim Kellogg
- DeepSeek R1 Explained to your grandma - YouTube
+
## DeepSeek
+
- Try R1 at chat.deepseek.com.
- GitHub - deepseek-ai/DeepSeek-R1.
- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
- DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025): This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.
- DeepSeek-V3 Technical Report (December 2024): This report discusses the implementation of an FP8 mixed-precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024): This paper explores scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024): This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024): This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
+
## Interesting events
+
- Hong Kong University reproduces R1 results (Jan 25, '25).
- Hugging Face announces huggingface/open-r1: a fully open reproduction of DeepSeek-R1, fully open source (Jan 25, '25).
- An OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1.
+
Liked this post? Join the newsletter.
\ No newline at end of file