Detecting Machine-Generated Code in Multiple Programming Languages and Domains
| dc.contributor.author | Khamitov, Rakhat | |
| dc.date.accessioned | 2026-05-26T12:34:44Z | |
| dc.date.issued | 2026-04-30 | |
| dc.description.abstract | The widespread adoption of large language models for software development has created an urgent need for reliable detection of machine-generated code. This thesis studies machine-generated code detection under realistic conditions where code varies across programming languages, application domains, model families, and generation strategies. The experiments are grounded in SemEval-2026 Task 13, Subtask A, an externally organized benchmark for machine-generated code detection. The benchmark used in this thesis contains training and validation data in three programming languages and evaluation data spanning eight programming languages, multiple domains, unseen generator families, adversarial examples, and human–AI co-authored settings. This thesis contributes a systematic comparison of lexical, structural, neural-embedding, metric-learning, comment-embedding, and stylometric approaches under in-distribution and out-of-distribution (OOD) evaluation. The results show that high in-distribution validation performance does not predict robust detection: direct classifiers reach validation Macro-F1 above 0.94 but fall to 0.24–0.41 Macro-F1 on OOD data. The strongest configuration, a comment-embedding SVM, achieves 0.671 OOD Macro-F1 on the labeled diagnostic test sample and a 0.638 Kaggle submission score, suggesting that comment style is a more stable cross-language signal than code-surface patterns alone. | |
| dc.identifier.citation | Khamitov, R. (2026). Detecting machine-generated code in multiple programming languages and domains. Nazarbayev University School of Engineering and Digital Sciences | |
| dc.identifier.uri | https://nur.nu.edu.kz/handle/123456789/18747 | |
| dc.language.iso | en | |
| dc.publisher | Nazarbayev University School of Engineering and Digital Sciences | |
| dc.rights | Attribution-ShareAlike 3.0 United States | en |
| dc.rights.uri | http://creativecommons.org/licenses/by-sa/3.0/us/ | |
| dc.subject | AI-generated code detection | |
| dc.subject | Large Language Models | |
| dc.subject | SemEval | |
| dc.title | Detecting Machine-Generated Code in Multiple Programming Languages and Domains | |
| dc.type | Master's thesis |
Files
Original bundle
- Name: RakhatKhamitovMastersThesis.pdf
- Size: 458.71 KB
- Format: Adobe Portable Document Format
- Description: Master's thesis