Abstract:
The performance of large language models (LLMs) in Chinese text proofreading is influenced by the sampling temperature (Ts). This study systematically evaluates how sampling temperature affects the proofreading capabilities of 21 LLMs. Automatic proofreading experiments were conducted at three temperature levels (low, 0.3; medium, 0.7; high, 1.0) on 330 Chinese sentences containing eight common error types, including character/word, grammar, logic, and factual errors. An AI-judged labeling scheme (Completely Correct, Partially Correct, Error Unidentified, New Error Introduced, False Positive) was used to evaluate model outputs consistently. Cochran's Q test and McNemar's test were used to assess differences in proofreading results across temperature settings. Results indicate that proofreading accuracy varies across models and temperature levels; some models show statistically significant differences, while most do not. Sampling temperature governs the trade-off between the error detection rate and the false positive rate: low temperatures lead to conservative modifications, with slightly higher missed-error rates but fewer newly introduced errors; high temperatures increase error detection rates, with some models reaching a completely-correct rate of about 68%, but also raise false positives and over-correction. The optimal temperature varies by model: approximately eight models perform best at low temperature (0.3), six at high temperature (1.0), and the remainder at medium temperature (0.7), indicating that no universal optimal setting exists. Categorical analysis reveals that temperature effects differ by error type: correction rates for numerical, symbol, and logical errors improve slightly at high temperatures, while misjudgments of sensitive and factual errors decrease at low temperatures. The engineering implications of temperature tuning are discussed: low temperature maintains conservatism and reduces newly introduced errors, while moderate temperature improves recall but requires balancing false-positive risk.
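The two significance tests named above can be sketched in pure Python. This is a minimal illustration, not the authors' analysis code: the data layout (an n-by-k binary matrix with one row per sentence and one column per temperature, 1 meaning "completely correct") and the function names are assumptions.

```python
import math

def cochrans_q(results):
    """Cochran's Q test on an n x k binary outcome matrix.

    results[i][j] = 1 if sentence i was fully corrected at temperature
    setting j, else 0. Returns (Q, p); the closed-form p-value uses the
    fact that for df = 2 (k = 3 settings) the chi-square survival
    function is exp(-Q / 2)."""
    n, k = len(results), len(results[0])
    col_totals = [sum(row[j] for row in results) for j in range(k)]
    row_totals = [sum(row) for row in results]
    grand = sum(col_totals)
    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - grand * grand)
    denominator = k * grand - sum(r * r for r in row_totals)
    q = numerator / denominator
    p = math.exp(-q / 2) if k == 3 else None  # exact only for df = 2
    return q, p

def mcnemar(b, c):
    """Continuity-corrected McNemar's test on discordant pair counts
    (assumes b + c > 0): b = correct at temperature A only,
    c = correct at temperature B only. For df = 1, the chi-square
    survival function equals erfc(sqrt(x / 2))."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p
```

In practice one would run Cochran's Q as the omnibus test across the three temperatures and follow up with pairwise McNemar tests only where Q is significant.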
Medium temperature is recommended as the default for publishing workflows, with adjustments toward lower or higher values depending on precision or recall needs. Multiple-sampling and voting methods are suggested to mitigate over-correction. This study provides empirical evidence for parameter tuning of LLMs in intelligent proofreading for publishing.
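The multiple-sampling-and-voting idea can be sketched as follows. The `proofread_fn` callable is a hypothetical stand-in for an LLM proofreading call; the sample count and vote threshold are illustrative assumptions, not values from the study:

```python
from collections import Counter

def proofread_with_voting(sentence, proofread_fn, n_samples=5, min_votes=3):
    """Mitigate over-correction by sampling the model several times and
    accepting a correction only when a majority of samples agree.

    proofread_fn: hypothetical LLM call, sentence -> corrected sentence.
    Falls back to the original sentence when no candidate reaches the
    vote threshold, keeping the pipeline conservative by default."""
    outputs = [proofread_fn(sentence) for _ in range(n_samples)]
    candidate, votes = Counter(outputs).most_common(1)[0]
    return candidate if votes >= min_votes else sentence
```

This conservative fallback mirrors the abstract's trade-off: at higher temperatures, where individual samples over-correct more often, agreement across samples filters out spurious edits while retaining corrections the model makes consistently.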