2023年9月

近期在尝试大数据在企业内的应用,两个需求:

  • 用户输入自然语言后返回代码模版,最好能够进一步推理
  • 用户输入自然语言返回给定答案,不要扩展

两种方向:向量库+大模型、模型微调。
方向选择.jpg

以下给出openai模型微调的详细过程,目前官方推荐gpt-3.5-turbo,gpt4的微调将在年底推出

  • 数据预处理:准备至少10条数据,质量越高且数量越多,效果越好。如果没有就人工老老实实的标记几十条高质量数据,比大量低质数据更好。格式如下
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}

我的数据源是csv,第一列问题,第二列答案,用以下脚本处理

import pandas as pd
import json


def convert_csv_to_jsonl(input_csv, output_jsonl):
    # Read the CSV file
    df = pd.read_csv(input_csv)

    with open(output_jsonl, 'w', encoding='utf-8') as f:
        for _, row in df.iterrows():
            jsonl_data = {
                "messages": [
                    {"role": "system", "content": "SunSun is an internal knowledge base communication robot."},
                    {"role": "user", "content": row['Generated Questions']},
                    {"role": "assistant", "content": row['source']}
                ]
            }
            f.write(json.dumps(jsonl_data, ensure_ascii=False) + '\n')


# Usage
# convert_csv_to_jsonl('path_to_your_csv_file.csv', 'desired_output_file.jsonl')
if __name__ == "__main__":
    convert_csv_to_jsonl('/Users/jixing/Downloads/export_result0925.csv',
                         '/Users/jixing/Downloads/export_result0925.jsonl')
  • 上传文件至openai
import openai

# 替换你的key
openai.api_key = "sk-40LIdYxxxxxxx"
training_file = openai.File.create(
    file=open("export_result0925.jsonl", "rb"),
    purpose='fine-tune'
)
# 记录文件id,下一步需要使用
print(training_file.id)
  • 开始微调
import openai

# 你的key
openai.api_key = "sk-40LIdYIwxxxxx"

# 刚才的文件id
openai.FineTuningJob.create(training_file="file-0ACDKAM7xxxxxx", model="gpt-3.5-turbo")