Assessing Domain-Specific LLMs for CWE Detection
DOI: https://doi.org/10.56394/aris2.v5i1.53
Keywords: Cybersecurity, Artificial Intelligence, Vulnerabilities, LLM, Software Security, CWE
Abstract
In recent years, Large Language Models (LLMs) have evolved rapidly and branched into many fields of life; from science and engineering to arts and literature, the range of applications seems limitless. Their ability to assimilate and comprehend contextual writing is remarkable, and it extends to software source code. Accordingly, several novel studies have demonstrated cutting-edge experiments applying LLMs to software testing and security. These contributions have planted the seed for promising future research on using LLMs to detect weaknesses, vulnerabilities, and malicious pieces of code in even the largest repositories. However, this line of work remains underexplored, especially for domain-specific LLMs: models trained specifically for software security have scarcely been examined, and their behavior is still undocumented in the literature. This paper explores this new area of LLMs for software security by testing and comparing the accuracy of these AI models against general-domain models, assessing their ability to recognize the exact vulnerability, and conducting an observational study of their behavior when responding to precisely crafted prompts. In our experiments, we considered GPT-3.5 from OpenAI and Gemini Pro from Google. We find that, in terms of recall, Gemini Pro outperformed GPT-3.5 by a large margin, with a recall of 63.13% versus GPT-3.5's 43.56%, showing that Gemini Pro is better at identifying truly vulnerable code, with fewer type II errors (false negatives). Gemini Pro is also better at pinpointing the correct CWE number among the correctly identified vulnerable cases, with an accuracy of 13.13% versus GPT-3.5's 10.61%. However, GPT-3.5 is superior to Gemini Pro in terms of precision and accuracy. The precision of GPT-3.5 is 88.89%, while Gemini Pro's is 54.35%, indicating that Gemini Pro tends to over-report cases as vulnerable. The two models' overall accuracy is closer: 68.75% for GPT-3.5 and 55.50% for Gemini Pro.
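For readers comparing the figures above, these are presumably the standard binary-classification metrics. As a minimal reference, assuming TP, FP, FN, and TN respectively count code samples correctly flagged as vulnerable, wrongly flagged as vulnerable, missed vulnerabilities (type II errors), and correctly cleared non-vulnerable samples (the paper's exact counting protocol is not restated here):

\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]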
License
Copyright (c) 2025 Mohamed Elatoubi, Xiao Tan

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.