Automate Large Language Models Evaluation

How to Evaluate LLMs

Large language models (LLMs) play a crucial role in natural language processing, powering applications such as conversational AI, text generation, and biomedical sequence modeling. However, assessing the quality of language models is challenging, given the large design space of model architectures and the variety of application contexts.

LLMs are designed to understand and generate human-like text, which makes their evaluation multidimensional. Metrics that measure distance and similarity give quantifiable scores on a standard scale. Nonetheless, subject matter experts are still needed to conduct qualitative assessments for a comprehensive evaluation.

This article covers language model evaluation, focusing on three metrics: cosine similarity, Euclidean distance, and BLEU score.
We will use the intellinode open-source module to accelerate the evaluation of multiple language models — LLaMA, Cohere, and OpenAI’s GPT-4.

1 Introduction to Evaluating Language Models

The efficacy of language models hinges on robust evaluation. Language model evaluation combines quantitative and qualitative measures that summarize a model’s performance, its ability to deliver high-quality output, and its standing against competing models.

Among the many evaluation techniques, each with its own strengths and weaknesses, this article spotlights three that have gained traction in computational linguistics research: cosine similarity, Euclidean distance, and BLEU score.

1. Cosine Similarity: measures the angle between two vectors. The cosine of this angle tells us how similar the two vectors are, that is, how much they “point” in the same direction. When we apply this to text, each document is treated as a vector in which each word or phrase contributes to the direction. A high cosine similarity score suggests a strong similarity between the predicted and the desired responses.

cosine similarity visual

2. Euclidean Distance: the straight-line distance between two points in space. In language model evaluation, it measures the gap between predicted outputs and target responses. A low Euclidean distance typically indicates that the predicted output is closer to the target.

euclidean distance visual

3. BLEU Score: BLEU stands for BiLingual Evaluation Understudy, an evaluation metric predominantly used to assess machine translation. It measures the agreement between machine-generated text and one or more human-written references by counting overlapping word sequences, known as ‘n-grams’. A high BLEU score indicates considerable similarity with the reference text, suggesting a higher-quality machine translation.

BLEU score visual
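
The three metrics above can be sketched in a few lines of plain JavaScript. This is a minimal illustration, not the code used by any particular library: the function names are hypothetical, the zero-vector convention is an assumption, and `modifiedPrecision` covers only the clipped n-gram precision against a single reference, which is the building block of BLEU. Full BLEU also combines several n-gram orders and applies a brevity penalty, both omitted here.

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|). Values near 1 mean the
// vectors point in nearly the same direction.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Assumption: treat a zero vector as having no similarity to anything.
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Euclidean distance: square root of the summed squared differences.
function euclideanDistance(a, b) {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return Math.sqrt(sum);
}

// Split a token list into overlapping n-grams, each joined into one string.
function ngrams(tokens, n) {
  const grams = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    grams.push(tokens.slice(i, i + n).join(' '));
  }
  return grams;
}

// Clipped n-gram precision: the share of candidate n-grams that also occur
// in the reference, where each reference n-gram can be matched only once.
function modifiedPrecision(candidate, reference, n) {
  const tokenize = (text) => text.toLowerCase().split(/\s+/).filter(Boolean);
  const candidateGrams = ngrams(tokenize(candidate), n);
  if (candidateGrams.length === 0) return 0;
  const remaining = new Map();
  for (const gram of ngrams(tokenize(reference), n)) {
    remaining.set(gram, (remaining.get(gram) || 0) + 1);
  }
  let matched = 0;
  for (const gram of candidateGrams) {
    const left = remaining.get(gram) || 0;
    if (left > 0) {
      matched++;
      remaining.set(gram, left - 1);
    }
  }
  return matched / candidateGrams.length;
}
```

The clipping step is what keeps a candidate such as "the the the" from scoring perfectly against a reference that contains "the" only once.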

2 Language Model Overviews

LLaMA: Advancing Transformers with Open and Efficient Foundation Language Models

LLaMA, Meta’s openly released suite of foundation language models, ranges from 7 billion to 65 billion parameters. It shows that state-of-the-art results can be attained by relying exclusively on public datasets, and the models can be fine-tuned on your specific tasks or private data. The 13B LLaMA model notably outperforms the much larger GPT-3 on most benchmarks, while the 65B variant is competitive with top-tier models such as Chinchilla-70B and PaLM-540B. LLaMA demonstrates that better performance is not necessarily tied to a larger model size; superior results can be achieved by efficiently training smaller models on more data.

Cohere: Command Completion

Cohere’s models, particularly the flagship text generation model known as “Command,” excel at following command-like instructions. They have found use in areas such as software development, where tasks like code autocompletion and command-line assistance benefit from them. Cohere’s Command model scores notably high on the Holistic Evaluation of Language Models (HELM) leaderboard (as of the March ’23 results); HELM is a benchmark maintained by Stanford University for comparing large language models. Command’s appeal rests on its continuous improvement, its training on practically relevant use cases, and its optimization for business priorities such as security, privacy, and responsible AI.

GPT-4 by OpenAI: A Leap for Multimodal Language Models

OpenAI’s GPT-4 sets a new standard for language models, significantly outperforming its predecessors. It accepts both text and image inputs and generates text outputs, and it has proven to be a transformative force in natural language processing. The advance of GPT-4 over GPT-3.5 is evident in its performance on a simulated bar exam: GPT-4 scored in the top 10% of test-takers, a striking improvement over GPT-3.5, which scored in the bottom 10%. GPT-4 similarly outperforms GPT-3.5 and other state-of-the-art models on traditional NLP benchmarks. It demonstrates strong performance not only in English but also across multiple languages on the MMLU benchmark, a suite of multiple-choice questions spanning 57 subjects.

However, while GPT-4 shows robust capabilities, it shares the limitations of earlier GPT models. It is not fully reliable (it can “hallucinate” facts), has a limited context window, and does not learn from experience.

3 Implementation: Evaluating Language Models Using IntelliNode

To illustrate the process of language model evaluation, we will use IntelliNode’s LLMEvaluation module in a Node.js environment. The models — LLaMA, Cohere, and GPT-4 — are evaluated and compared based on cosine similarity and Euclidean distance scores. IntelliNode provides an infrastructure for multifaceted language model evaluation, offering quick comparisons between different models with minimal coding.

Step 1: Establishing the Node.js Project and Installing the Dependencies

Firstly, we need to create the project and install the necessary libraries. Open your terminal, navigate to the desired directory and initialize the node project using npm:

npm init -y

This creates a new `package.json` file, which establishes the base for your project. Next, install the intellinode dependency, which we will use for the model evaluation:

npm i intellinode

Step 2: Implementation

Now, we switch to the actual evaluation process, which we will implement in a new Node.js file. Let’s assume that we name this file `evaluate.js`.

Begin by importing the necessary modules from `intellinode`. `LLMEvaluation` includes the logic to compare the models using a simple configuration. `SupportedChatModels` and `SupportedLangModels` are used to reference the supported model names.

Next, we will manage the API keys and prepare the configurations of the language models to evaluate. We are going to use the GPT-4 model from OpenAI, the Command model from Cohere, and the Llama 70B model hosted on Replicate.
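
One common way to manage the keys is to read them from environment variables rather than hard-coding them in `evaluate.js`. The sketch below is an assumption about how you might do this, not part of IntelliNode: the helper name `loadApiKeys` and the environment variable names are hypothetical, so adjust them to your own setup.

```javascript
// Hypothetical environment variable names, one per provider used in this article.
const REQUIRED_KEYS = {
  openai: 'OPENAI_API_KEY',
  cohere: 'COHERE_API_KEY',
  replicate: 'REPLICATE_API_KEY',
};

// Collect the keys from the given environment (defaults to process.env) and
// fail early with one clear message that lists every missing variable.
function loadApiKeys(env = process.env) {
  const keys = {};
  const missing = [];
  for (const [provider, variableName] of Object.entries(REQUIRED_KEYS)) {
    const value = env[variableName];
    if (typeof value === 'string' && value.trim() !== '') {
      keys[provider] = value.trim();
    } else {
      missing.push(variableName);
    }
  }
  if (missing.length > 0) {
    throw new Error(`Missing API keys: ${missing.join(', ')}`);
  }
  return keys;
}
```

Calling `loadApiKeys()` once at the top of the script gives you an object such as `keys.openai` to pass into each model configuration, and it stops the run before any request is sent if a key is absent.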

Let’s start the evaluation by passing the list of providers to the evaluator. The `LLMEvaluation` object requires a common embedding function that converts the input text into a numerical representation that preserves its semantic meaning, for use in the distance calculations. For this, we will use the OpenAI embedding function.
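
To make the role of the shared embedding function concrete, here is a standalone sketch of the comparison idea. It is not IntelliNode’s implementation and does not call any provider: `compareAnswers` and `makeBagOfWordsEmbedder` are hypothetical names, the model answers are passed in as plain strings, and a toy word-count embedding stands in for the OpenAI embedding function. Averaging the scores over the target answers is also an assumption made for this illustration.

```javascript
// Toy embedding (assumption): one dimension per vocabulary term, holding the
// number of times that term appears in the text. Async to mimic a remote API.
function makeBagOfWordsEmbedder(vocabulary) {
  return async (text) => {
    const words = text.toLowerCase().match(/[a-z']+/g) || [];
    return vocabulary.map((term) => words.filter((word) => word === term).length);
  };
}

function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

function euclideanDistance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += (a[i] - b[i]) ** 2;
  }
  return Math.sqrt(sum);
}

// Embed every answer with the same function, then average each model's
// similarity and distance over all target answers.
async function compareAnswers(answersByModel, targetAnswers, embed) {
  if (targetAnswers.length === 0) throw new Error('At least one target answer is required');
  const targetVectors = await Promise.all(targetAnswers.map((answer) => embed(answer)));
  const results = {};
  for (const [model, answer] of Object.entries(answersByModel)) {
    const vector = await embed(answer);
    let similaritySum = 0;
    let distanceSum = 0;
    for (const target of targetVectors) {
      similaritySum += cosineSimilarity(vector, target);
      distanceSum += euclideanDistance(vector, target);
    }
    results[model] = {
      avgCosineSimilarity: similaritySum / targetVectors.length,
      avgEuclideanDistance: distanceSum / targetVectors.length,
    };
  }
  return results;
}
```

Using one embedding function for every model matters because similarity and distance are only comparable when all vectors live in the same space.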

Now, you can easily run your `evaluate.js` script from the terminal using the following command:

node evaluate.js

The script compares your selected language models’ outputs against the predefined target answers and reports their respective cosine similarity and Euclidean distance scores.

In conclusion, we have explored the process of evaluating language models, highlighting key metrics such as cosine similarity, Euclidean distance, and BLEU score. We have also compared the performance of three leading models — LLaMA, Cohere, and OpenAI’s GPT-4.

These quantitative evaluations are critical for maintaining and enhancing the quality of language models. They provide a structured approach to understanding the performance of each model, allowing us to identify areas of strength and potential improvement. However, human evaluation is also crucial in text generation. Testers can evaluate the text fluency, coherence, and relevance, elements that quantitative metrics may not fully capture.


