GPT vs. Llama2: Which Comes Closer to Human Writing?

Bibliographic Details
Title: GPT vs. Llama2: Which Comes Closer to Human Writing?
Language: English
Authors: Fernando Martinez, Gary M. Weiss, Miguel Palma, Haoran Xue, Alexander Borelli, Yijun Zhao
Source: International Educational Data Mining Society. 2024.
Availability: International Educational Data Mining Society. e-mail: admin@educationaldatamining.org; Web site: https://educationaldatamining.org/conferences/
Peer Reviewed: Y
Page Count: 10
Publication Date: 2024
Document Type: Speeches/Meeting Papers; Reports - Research
Education Level: Higher Education; Postsecondary Education
Descriptors: Artificial Intelligence, Technology Uses in Education, Higher Education, Natural Language Processing, Intelligent Tutoring Systems, Writing Evaluation, Accuracy, Vocabulary, Syntax, Authors, Language Usage
Abstract: Large Language Models (LLMs) have prompted widespread application across diverse domains. In some applications, human-like quality in output is essential for optimal user experience and credibility. This is particularly evident in applications such as chatbots. Conversely, concerns arise regarding LLM use in contexts where human authenticity is crucial, notably in higher education with materials like Letters of Recommendation (LOR) and Statements of Intent (SOI). Despite extensive research in this area, accurately distinguishing between human and LLM-generated content remains challenging. This study conducts a comparative analysis between two leading LLMs, GPT-3.5 and Llama2-7B, evaluating their output's resemblance to human writing through vocabulary and structure analysis. Additionally, we apply classification models to detect human vs. LLM-generated content, with higher accuracy signaling deviations from human-like writing. Our findings suggest that both LLMs significantly deviate from human writing in terms of vocabulary and paragraph structure, with GPT-3.5 appearing closer to human. Furthermore, our classification models demonstrated near-perfect performance in identifying LORs and SOIs crafted by LLMs during our evaluation, and we have made these models accessible as online, open-access tools. However, it's important to acknowledge that these models are trained specifically for our tasks. Generalizing their application to other domains requires further research and validation. [For the complete proceedings, see ED675485.]
Abstractor: As Provided
Entry Date: 2025
Accession Number: ED675637
Database: ERIC