Detecting Machine-Generated Code in Multiple Programming Languages and Domains

dc.contributor.author: Khamitov, Rakhat
dc.date.accessioned: 2026-05-26T12:34:44Z
dc.date.issued: 2026-04-30
dc.description.abstract: The widespread adoption of large language models for software development has created an urgent need for reliable detection of machine-generated code. This thesis studies machine-generated code detection under realistic conditions where code varies across programming languages, application domains, model families, and generation strategies. The experiments are grounded in SemEval-2026 Task 13, Subtask A, an externally organized benchmark for machine-generated code detection. The benchmark used in this thesis contains training and validation data in three programming languages and evaluation data spanning eight programming languages, multiple domains, unseen generator families, adversarial examples, and human–AI co-authored settings. This thesis contributes a systematic comparison of lexical, structural, neural-embedding, metric-learning, comment-embedding, and stylometric approaches under in-distribution and out-of-distribution evaluation. The results show that high in-distribution validation performance does not predict robust detection: direct classifiers reach validation Macro-F1 above 0.94 but fall to 0.24–0.41 OOD Macro-F1. The strongest configuration, a comment-embedding SVM, achieves 0.671 OOD Macro-F1 on the labeled diagnostic test sample and a 0.638 Kaggle submission score, suggesting that comment style is a more stable cross-language signal than code-surface patterns alone.
dc.identifier.citation: Khamitov, R. (2026). Detecting machine-generated code in multiple programming languages and domains. Nazarbayev University School of Engineering and Digital Sciences.
dc.identifier.uri: https://nur.nu.edu.kz/handle/123456789/18747
dc.language.iso: en
dc.publisher: Nazarbayev University School of Engineering and Digital Sciences
dc.rights: Attribution-ShareAlike 3.0 United States (en)
dc.rights.uri: http://creativecommons.org/licenses/by-sa/3.0/us/
dc.subject: AI-generated code detection
dc.subject: Large Language Models
dc.subject: SemEval
dc.title: Detecting Machine-Generated Code in Multiple Programming Languages and Domains
dc.type: Master's thesis

Files

Original bundle

Name: RakhatKhamitovMastersThesis.pdf
Size: 458.71 KB
Format: Adobe Portable Document Format
Description: Master's thesis