Study on the Impact of Sampling Temperature on Chinese Text Proofreading Performance of Large Language Models


    Abstract: The performance of large language models (LLMs) in Chinese text proofreading is influenced by the sampling temperature (Ts). This study systematically evaluates the impact of sampling temperature on the proofreading capabilities of 21 LLMs. Automatic proofreading experiments were conducted at three temperature levels (low, 0.3; medium, 0.7; high, 1.0) on 330 Chinese sentences containing eight common types of publishing-text errors, including character/word, grammar, logic, and factual errors. An AI-judged labeling scheme (Completely Correct, Partially Correct, Error Unidentified, New Error Introduced, False Positive) was used to evaluate model outputs consistently, and Cochran's Q test and McNemar's test were used to assess differences in proofreading results across temperature settings. The results indicate that proofreading accuracy varies across models and temperature levels, with some models showing statistically significant differences while most do not. Sampling temperature shifts the trade-off between the error detection rate and the false positive rate: low temperatures yield conservative edits, with slightly higher missed-error rates but fewer newly introduced errors; high temperatures raise error detection rates, with some models achieving a completely correct rate of about 68%, but also increase false positives and over-correction. The optimal temperature varies by model: approximately eight models perform best at the low temperature (0.3), six at the high temperature (1.0), and the remainder at the medium temperature (0.7), indicating that no universally optimal setting exists. Per-category analysis reveals that temperature effects differ by error type: correction rates for numerical, symbol, and logical errors improve slightly at high temperatures, while misjudgments of sensitive-content and factual errors decrease at low temperatures.
The engineering implications of temperature tuning are discussed: a low temperature keeps the model conservative and reduces newly introduced errors, while moderately raising the temperature improves recall at the cost of a higher false positive risk. A medium temperature is recommended as the default for automated proofreading in publishing workflows, with adjustments as precision or recall requirements dictate, and multi-sample voting is suggested as a way to mitigate over-correction. This study provides empirical evidence for parameter tuning of LLMs in intelligent proofreading for publishing.
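The sampling temperature Ts studied above divides the model's logits before the softmax, which is why low settings behave conservatively and high settings more diversely. A minimal sketch of this mechanism, using made-up logits rather than any values from the study:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature before applying the softmax.

    Lower temperatures sharpen the distribution (more deterministic,
    conservative edits); higher temperatures flatten it (more diverse,
    riskier edits).
    """
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical token scores, for illustration only
p_low = softmax_with_temperature(logits, 0.3)   # sharply peaked on the top token
p_high = softmax_with_temperature(logits, 1.0)  # flatter, more sampling variety
```

At Ts = 0.3 the probability mass concentrates on the highest-scoring token, which matches the conservative, low-new-error behavior reported for low temperatures; at Ts = 1.0 lower-ranked tokens keep meaningful probability, consistent with higher recall but more false positives.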
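McNemar's test, used above to compare paired per-sentence outcomes between two temperature settings, reduces to a simple statistic over the discordant pairs. A sketch with hypothetical discordant counts (the abstract does not report per-model counts):

```python
def mcnemar_chi2(b, c):
    """Continuity-corrected McNemar statistic for paired binary outcomes.

    b: sentences proofread correctly at temperature A but not at temperature B
    c: sentences proofread correctly at temperature B but not at temperature A
    Concordant pairs (same outcome at both temperatures) do not enter the statistic.
    """
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)

# Illustrative counts only, not results from the study.
stat = mcnemar_chi2(25, 10)
# Compare against the chi-square critical value (1 df) of 3.84 at alpha = 0.05.
significant = stat > 3.84
```

Cochran's Q test generalizes the same paired comparison from two conditions to all three temperature levels at once; McNemar's test then serves as the pairwise follow-up.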
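The multi-sample voting idea suggested above for mitigating over-correction can be sketched as follows; this is an assumed implementation for illustration, not the procedure used in the study:

```python
from collections import Counter

def vote_correction(candidates, original, min_agreement=2):
    """Accept a correction only if enough independent samples agree on it.

    candidates: proofread outputs from repeatedly sampling the same model
                on one sentence.
    Falls back to the original sentence when no candidate reaches the
    agreement threshold, which damps one-off over-corrections produced
    at higher temperatures.
    """
    best, count = Counter(candidates).most_common(1)[0]
    return best if count >= min_agreement else original
```

For example, if two of three samples return the same corrected sentence, that correction is kept; if all three disagree, the sentence is left unchanged for a human proofreader.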
