The Llama-2-Chat models are fine-tuned for dialogue-focused use cases, similar to the specific GPT model versions used in ChatGPT.
Supervised fine-tuning (SFT) was used to prime the pretrained Llama 2 base model to generate responses in the format users expect from a chatbot or virtual agent. In a series of supervised learning tasks, labeled pairs of dialogue-style exchanges, annotated as (prompt, response), are used to train the model to minimize the divergence between its own response to a given prompt and the example response provided by the labeled data. The model thus learns, for example, that the appropriate response to a prompt of “teach me how to make cookies” is to provide actual instructions for baking cookies, rather than simply completing the sentence.
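As a rough illustration of that objective, the sketch below computes a next-token cross-entropy loss only over the response tokens of a single (prompt, response) pair, using a Hugging Face-style causal LM. The checkpoint name and the example pair are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of the SFT objective on one labeled (prompt, response) pair.
# Checkpoint name and example text are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Teach me how to make cookies."
response = "Preheat the oven to 180 C, cream the butter and sugar, then ..."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + " " + response, return_tensors="pt").input_ids

# Labels: mask the prompt tokens with -100 so the loss only measures how well
# the model reproduces the annotated response, i.e. the divergence described above.
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

outputs = model(input_ids=full_ids, labels=labels)
loss = outputs.loss   # cross-entropy over the response tokens only
loss.backward()       # one gradient step of supervised fine-tuning (optimizer omitted)
```

In practice this loss would be averaged over batches of annotated dialogue exchanges rather than a single pair.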
Rather than using millions of labeled examples, the paper reports that results were improved by using “fewer, but higher-quality, examples,” noting that Meta AI collected 27,540 annotated samples.
After SFT, Meta used reinforcement learning from human feedback (RLHF) to further align the behavior of the chat models with human preferences and instructions. In RLHF, direct human feedback is used to train a “reward model” to learn the patterns of the kinds of responses humans prefer. By translating the reward model’s predictions (regarding whether a given response would be preferred by humans) into a scalar reward signal, the reward model is then used to further train Llama-2-Chat via reinforcement learning.
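The sketch below shows, in simplified form, how such a reward model produces a scalar score: a language-model backbone summarizes the prompt-plus-response sequence and a small linear head maps it to a single number. The module names, dimensions and dummy backbone are assumptions for illustration, not Meta's implementation.

```python
# Sketch of a reward model that maps a (prompt + response) sequence to a scalar reward.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                       # pretrained LM trunk (assumed)
        self.reward_head = nn.Linear(hidden_size, 1)   # maps final hidden state to a scalar

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids)              # (batch, seq_len, hidden_size)
        last_token = hidden[:, -1, :]                  # summary of prompt + response
        return self.reward_head(last_token).squeeze(-1)  # one scalar reward per sequence

# Stand-in backbone so the sketch runs end to end; a real setup would reuse the LLM body.
class DummyBackbone(nn.Module):
    def __init__(self, vocab: int = 100, hidden: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        return self.embed(ids)

rm = RewardModel(DummyBackbone(), hidden_size=32)
reward = rm(torch.randint(0, 100, (1, 16)))  # score one tokenized (prompt + response)
print(reward.item())
```

During RLHF, this scalar scores each response the chat model generates, and the RL algorithm updates the model's weights to maximize it.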
There are many different methods and formats in which this human feedback can be collected. Meta AI used a simple binary comparison method: human annotators were asked to write a prompt, then choose between two model responses, generated by two different variants of Llama 2, based on criteria provided by Meta. To help the reward model properly weight these choices, annotators were also asked to rate the degree to which they preferred their chosen response over the other: “much better,” “a little better” or “negligibly better/unsure.”
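One common way to fold both the binary choice and the degree of preference into reward-model training is a pairwise ranking loss with a margin: the chosen response should score higher than the rejected one, and by a larger amount when the annotator's preference was stronger. The margin values below are assumed for illustration, not taken from the paper.

```python
# Sketch of a margin-based binary-comparison loss for reward-model training.
# Assumes reward_chosen and reward_rejected are scalar scores (e.g. from RewardModel above).
import torch
import torch.nn.functional as F

# Degree-of-preference labels mapped to margins (values are assumptions).
MARGINS = {"much better": 3.0, "a little better": 1.0, "negligibly better/unsure": 0.0}

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor,
                    degree: str) -> torch.Tensor:
    m = MARGINS[degree]
    # Pairwise ranking loss: push the chosen response's score above the rejected
    # one by at least the margin implied by the annotator's judgment.
    return -F.logsigmoid(reward_chosen - reward_rejected - m).mean()

loss = preference_loss(torch.tensor([2.1]), torch.tensor([0.4]), "a little better")
print(loss.item())
```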
Human preference data was used to train two separate reward models: one optimized for helpfulness, the other optimized for safety (that is, avoiding toxic, hateful responses or responses that could be used to aid violence or criminal activity). In addition to proximal policy optimization (PPO), the algorithm typically used to update LLM model weights in RLHF, Meta also used rejection sampling (link resides outside ibm.com) to update Llama-2-Chat-70B.
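A rough sketch of the rejection-sampling idea follows: sample several candidate responses per prompt from the current chat model, keep the one the reward model scores highest, and fine-tune on that best sample. The helper callables `generate`, `reward_model` and `sft_step` stand in for the real components and are assumptions, not Meta's code.

```python
# Sketch of one rejection-sampling update step (helper functions are assumed stand-ins).
import torch

def rejection_sampling_step(prompt, policy, reward_model, generate, sft_step, k: int = 4):
    # 1. Draw k candidate responses from the current chat model.
    candidates = [generate(policy, prompt) for _ in range(k)]
    # 2. Score each (prompt, response) pair with the reward model.
    scores = torch.tensor([reward_model(prompt, c) for c in candidates])
    # 3. Keep only the highest-reward candidate and discard ("reject") the rest.
    best = candidates[int(scores.argmax())]
    # 4. Treat the best sample as a new supervised target and fine-tune on it.
    sft_step(policy, prompt, best)
    return best
```

Compared with PPO, which nudges the model toward higher reward at every generation step, this approach reuses the supervised fine-tuning machinery on only the best-scoring samples.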