Evaluating IndoGPT for Legal Queries: A Benchmark Against GPT-4 Models

Ade Cahyaning Palupi ⁽¹⁾, Ade Irawan ⁽²⁾

(1) Department of Computer Science, Universitas Pertamina

(2) Center for Data Science and Automation (CDSCAN), Universitas Pertamina, Jakarta

How to cite (JITCE):

Palupi, A. C., & Irawan, A. (2025). Evaluating IndoGPT for Legal Queries: A Benchmark Against GPT-4 Models. JITCE (Journal of Information Technology and Computer Engineering), 9(2), 22–27. Retrieved from https://jitce.fti.unand.ac.id/index.php/JITCE/article/view/321

This study evaluates a chatbot built on the Large Language Model (LLM) IndoGPT, focusing on its use of Retrieval-Augmented Generation (RAG) to answer questions about university regulations drawn from legal PDF documents written in Indonesian. IndoGPT's performance is benchmarked against two more advanced models, GPT-4 and GPT-4o. The chatbot combines RAG techniques with the LangChain framework, and its effectiveness is assessed using the Retrieval-Augmented Generation Assessment (RAGAS) framework. The dataset consists of publicly available legal documents from Universitas Pertamina, with test data created by the authors. IndoGPT consistently underperforms both GPT-4 and GPT-4o. GPT-4 achieves Context Precision of 0.9027, Context Recall of 0.8693, Faithfulness of 0.8486, and Answer Relevancy of 0.8142. GPT-4o delivers similarly strong results, with Context Precision of 0.8896, Context Recall of 0.8594, Faithfulness of 0.8804, and Answer Relevancy of 0.8773. In contrast, IndoGPT scores much lower on every metric: Context Precision of 0.6687, Context Recall of 0.5711, Faithfulness of 0.0738, and Answer Relevancy of 0.1628. These findings highlight IndoGPT's substantial limitations relative to GPT-4 and GPT-4o, both of which provide accurate, contextually relevant answers. The GPT-4-based chatbot demonstrates a strong ability to understand user queries and deliver accurate responses, while the RAG technique effectively reduces hallucinations.
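The reported scores can be compared side by side with a few lines of Python. The sketch below is only an illustration: the numbers are copied from the abstract, while the dictionary layout, the helper names `mean_score` and `best_model`, and the use of an unweighted mean across the four metrics are assumptions introduced here, not part of the study's method.

```python
# RAGAS scores as reported in the abstract. The data layout and helper
# names are illustrative assumptions, not the authors' evaluation code.
REPORTED_SCORES = {
    "GPT-4": {
        "context_precision": 0.9027,
        "context_recall": 0.8693,
        "faithfulness": 0.8486,
        "answer_relevancy": 0.8142,
    },
    "GPT-4o": {
        "context_precision": 0.8896,
        "context_recall": 0.8594,
        "faithfulness": 0.8804,
        "answer_relevancy": 0.8773,
    },
    "IndoGPT": {
        "context_precision": 0.6687,
        "context_recall": 0.5711,
        "faithfulness": 0.0738,
        "answer_relevancy": 0.1628,
    },
}


def mean_score(model: str) -> float:
    """Unweighted mean of the four metrics (an assumption; the paper
    reports the metrics individually, not as an aggregate)."""
    scores = REPORTED_SCORES[model]
    return sum(scores.values()) / len(scores)


def best_model(metric: str) -> str:
    """Name of the model with the highest reported score on `metric`."""
    return max(REPORTED_SCORES, key=lambda m: REPORTED_SCORES[m][metric])


if __name__ == "__main__":
    for model in REPORTED_SCORES:
        print(f"{model:8s} mean = {mean_score(model):.4f}")
    for metric in REPORTED_SCORES["GPT-4"]:
        print(f"{metric:18s} best = {best_model(metric)}")
```

Read this way, the figures show GPT-4 leading on the two retrieval-side metrics (Context Precision and Context Recall) and GPT-4o leading on the two generation-side metrics (Faithfulness and Answer Relevancy), with IndoGPT trailing on all four.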


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

