[TOC]

StarCoder

StarCoder, developed by BigCode, is a 16B-parameter model trained on one trillion tokens spanning more than 80 programming languages. Its training data comes largely from GitHub: issues, Git commits, Jupyter notebooks, and more. With an enterprise-friendly license, an 8192-token context window, and fast large-batch inference via multi-query attention, StarCoder is arguably the best open-source option for code-related applications today (the quick config check after the links below verifies the context window and attention setup).

  1. Code: https://github.com/bigcode-project/starcoder
  2. Dataset: https://huggingface.co/datasets/HuggingFaceH4/oasst1_en
  3. Model: https://huggingface.co/HuggingFaceH4/starchat-alpha
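
As a quick sanity check, both architectural claims are visible on the model config. This is a minimal sketch assuming only the transformers library and the public bigcode/starcoder checkpoint (the repo is gated, so you may need to accept the license and log in to the Hub first):

from transformers import AutoConfig

# Fetch only the config (a few KB), not the tens of GB of weights.
config = AutoConfig.from_pretrained("bigcode/starcoder")
print(config.n_positions)  # 8192-token context window
print(config.multi_query)  # True: multi-query attention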

Fine-tuning

$ git clone https://github.com/bigcode-project/starcoder.git
$ cd starcoder/chat

Create the environment

$ conda create -n starchat python=3.10 && conda activate starchat
$ pip install -r requirements.txt
$ sudo apt-get install git-lfs

Then launch fine-tuning across 8 GPUs, with DeepSpeed ZeRO-3 in bf16 as the config filename indicates:

$ torchrun --nproc_per_node=8 train.py config.yaml --deepspeed=deepspeed_z3_config_bf16.json
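
For reference, the chat fine-tuning renders each conversation into a flat prompt with special dialogue tokens before tokenization. The sketch below is a hypothetical illustration assuming the <|system|>, <|user|>, <|assistant|>, and <|end|> tokens used by starchat-alpha; the canonical template lives in the chat/ directory of the repo:

# Hypothetical illustration of the dialogue template; token names follow
# the starchat-alpha tokenizer, but check chat/ for the canonical layout.
SYSTEM, USER, ASSISTANT, END = "<|system|>", "<|user|>", "<|assistant|>", "<|end|>"

def format_dialogue(system_msg, turns):
    """Flatten a list of (user, assistant) turns into one training prompt."""
    text = f"{SYSTEM}\n{system_msg}{END}\n"
    for user_msg, assistant_msg in turns:
        text += f"{USER}\n{user_msg}{END}\n{ASSISTANT}\n{assistant_msg}{END}\n"
    return text

print(format_dialogue("Below is a dialogue with a helpful coding assistant.",
                      [("Write hello world in Python.", 'print("Hello, world!")')]))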

Testing

# pip install -q transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# The full-precision weights are tens of GB; pass torch_dtype=torch.float16
# (and import torch) if your GPU is short on memory.
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# Greedily complete a Python function signature.
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
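
Note that generate defaults to a total length of only 20 tokens, so the completion above is cut short. A variant with explicit generation settings (the values are illustrative, not tuned recommendations):

# Longer, sampled completion; max_new_tokens/temperature/top_p are
# illustrative values, not settings from the StarCoder authors.
outputs = model.generate(
    inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad warning
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))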