The AI toolbox everyone should understand

0:00

Remember 2022 when ChatGPT was first released?

Just a few years ago, our understanding of building on top of Large Language Models was largely confined by calling the model through an API.

Fast forward to 2025 and that familiar foundational building block has exploded into an ecosystem of possibilities.

Developers now have a much wider set of building blocks at their disposal. This article aims to provide an overview of all these available components.

By understanding and strategically combining these building blocks, you gain a clear picture of what's achievable and can identify gaps in the existing ecosystem.

Note: This is not meant as an exhaustive list, the field of GenAI is progressing rapidly.

If you enjoy this kind of breakdown, subscribe for new posts.

What are AI building blocks?

"If I have seen further, it is by standing on the shoulders of giants" — Isaac Newton

Software engineering, at its core, embodies this principle. Every piece of software is built upon the contributions of countless others. Thanks to those who share their code, we can develop applications much faster, avoiding the constant need to reinvent the wheel.

Over the past few years, there's been significant investment in creating fundamental building blocks for Generative Artificial Intelligence (GenAI). This has empowered developers to embed advanced intelligence into applications and even deploy agents that perform tasks (semi) autonomously.

While these components currently reside within a distinct "generative AI stack," I believe they will soon become a part of any "software engineering stack." In the near future, the clear distinction between the two will diminish. What we now call "AI engineering" will simply be "software engineering."

Examples of these AI building blocks include:

Evals: Tools for evaluating the performance and quality of AI models.
Guardrails: Mechanisms to ensure AI models operate within predefined boundaries and prevent undesirable outputs.
Reasoning models: Components that enable AI to perform logical deductions and complex problem-solving.
Embeddings: Numerical representations of data that capture semantic meaning, crucial for understanding relationships.
Vector databases: Specialized databases optimized for efficient storage and retrieval of these embeddings.

Understanding when and how to leverage these AI building blocks is crucial for developing an "AI product sense." This enables you to identify which components can solve which problems and create unique experiences for your users, rather than simply deploying GenAI for its own sake.

Developing AI Product Sense

While Generative AI (GenAI) is hot and happening, its true value lies in the business benefits it can provide. Developing AI product sense helps you balance what's technically possible with practical application.

For instance, good product sense will allow you to make trade offs and talk with other engineers when to use a graph database instead of a vector database, or whether to fine-tune a model, improve your prompts, or use Retrieval Augmented Generation (RAG).

Good product sense means you have a technical understanding of the building blocks, but it's far more important to understand the business value and then pick and choose the blocks that fit your use case. Don't start with the building blocks, start with the problem and solution space.

Because large language models are based on probability, it's crucial to design products that can adapt over time while staying reliable and trustworthy. You can do this by using guardrails and continuous evaluation.

When building GenAI applications, you need a data-centric approach. This means making sure your models have the right, high-quality information to respond to user requests, which allows for personalized experiences. It also helps you avoid unintended problems and deploy these models ethically.

The different AI building blocks

If you're already familiar with some of these building blocks, feel free to skip ahead to the ones you're not yet acquainted with.

General Purpose Models

General-purpose models, such as GPT-4.1, Claude 4, and Gemini 2.5, are large language models (LLMs) trained on vast amounts of data.

They are crucial because they form the foundation for most AI applications seen today, from chatbots and copilots to creative writing and coding assistants, and they power tools like ChatGPT.

These models have significantly lowered the barrier to entry for building AI systems. Developers can now quickly create prototypes without needing to train their own models from scratch.

Traditionally, you would train a specific model for each task, like classifying the sentiment of customer reviews. Now, thanks to LLMs, developers have a versatile platform. A single model can perform tasks like sentiment analysis right away, without requiring custom training.

Companies like OpenAI, Anthropic, and Google are leading the development of these general-purpose models.

Large language models are non-deterministic, meaning they don't always produce the exact same answer consistently. Simply put, they predict the next word in a sentence. As these models are continuously updated, their behavior can change. Therefore, it's essential to constantly evaluate their non-deterministic outputs, as they may sometimes "hallucinate" or provide incorrect information.

I can't stress this often enough, when working with large language models, always remain skeptical. Don't trust, but verify.

Reasoning Models

The companies that developed general-purpose AI models have also introduced reasoning models, the first one, the O1 model being unveiled by OpenAI in September 2024.

Reasoning models enhance a large language model's ability to engage in multi-step thinking, allowing for deeper planning and reasoning. These models build upon the "chain-of-thought" prompting technique, where users instruct a model to reason through several steps before providing an answer.

With reasoning models, AI can now tackle complex tasks such as scientific research, in-depth analysis, and various "deep research" functionalities have been unveiled by the major labs. These capabilities are powered by the enhanced reasoning abilities of these "thinking" models.

A general principle in developing generative AI applications is to begin by using a more powerful model to automate a workflow. You can then experiment with scaling down to less powerful models, all while continuously evaluating its performance on the task at hand.

Reasoning models are particularly valuable in various workflows, including agentic workflows. In these setups, a reasoning model can for example plan tasks for general-purpose models, awaits their responses, and then proceeds to the next step. Typically, the reasoning model first outlines the entire task execution plan before instructing general-purpose models to start the work.

This approach mirrors human problem-solving: when we are given complex tasks and are allowed to systematically plan our approach, we tend to perform better compared to jumping in immediately. The same holds true for reasoning models, which benefit from this structured thinking process.

Prompting techniques

One of the most crucial techniques for maximizing the performance of large language models (LLMs) is crafting effective prompts. While prompt engineering is a field in itself, various prompting techniques have emerged, including chain-of-thought prompting, zero- or few-shot prompting, meta-prompting, and prompt chaining.

These techniques subtly guide an LLM's behavior without requiring modifications to its core training. Research indicates that "priming" a model; prompting it to adopt a specific role or behave in a certain way significantly enhances its responses.

Optimizing prompts is often the first step in aligning a model's behavior with your desired outcome. It offers the best effort-to-reward ratio compared to fine-tuning, which involves altering the model's internal workings by feeding it case-specific data.

Different models from various companies have their own distinct preferences, primarily due to how they were trained. Some models prefer task specifications at the beginning of a prompt, while others prefer them at the end. The same applies to using XML or other types of tags. Therefore, it's essential to tailor your prompts to the model's recommended specifications.

While prompting offers a quick way to improve a model's behavior, this behavior must be systematically evaluated against your specific use cases. This is where evals and tracing come in.

Evals (Evaluations)

There's a significant difference between a demo application and one ready for production, and a large part of this distinction depends on the quality of the evaluations (evals). Evals are systematic methods used to measure the quality, effectiveness, and behavior of large language models (LLMs).

These tests define what is acceptable behavior for an AI application. They go beyond typical software testing, which often focuses on simple pass/fail outcomes or latency checks.

Evals are crucial because they ensure AI systems behave reliably and safely before deployment. They help detect errors, misinformation, and incorrect responses. Without proper evals, it's highly likely an AI application will deliver a poor user experience.

Unlike more deterministic software tests, evals are designed to handle non-deterministic outputs from LLMs. They generate more qualitative metrics like relevance, coherence, and correctness, which better reflect the variable nature of LLM behavior.

There are typically different types of evals:

Human-based: Users or experts provide feedback or label the responses generated by LLMs.
Code-based: These automatically check if the output is valid code. However, their scope is limited to verifying code validity only.
LLM-based evals: These use another large language model as a judge to score outputs, offering a scalable evaluation method.

Ultimately, crafting effective evals can make the difference between a great AI product and a mediocre one. This topic alone is a field of study.

Want more notes like this? Join the newsletter.

Tracing

The second way to check how Large Language Models (LLMs) behave is through tracing. This involves systematically following and analyzing every step an LLM takes from the moment it receives an input until it produces an output.

Tracing tools record each intermediate step a model makes. For example, they show which tools the model uses or what information it retrieves. Every step is logged and includes useful details such as how long it took (latency), how many tokens were used, and other relevant information.

Tracing helps in troubleshooting LLM behavior and provides real-time monitoring. It also aids in later analysis, which can help identify potential biases and unwanted actions. This is especially important when LLMs are part of complex workflows that involve multi-step reasoning and connections with outside tools and APIs.

Relying on a "looks good to me" (LGTM) approach to testing often leads to problems for many teams. It can keep them stuck in the proof-of-concept stage, preventing them from confidently releasing applications for production use.

Guardrails

Without the right guardrails, large language models are open to misuse. Guardrails set rules, establish filters, and put monitoring systems in place. Their purpose is to make sure that a model's outputs are safe, ethical, appropriate, and accurate before they reach the user.

Guardrails aim to stop models from creating harmful or offensive content. They also work to prevent the model from revealing sensitive or personal information.

Large language models can face various attacks, including prompt injection, data leakage, hallucination exploitation, and jailbreaks. Having guardrails helps you actively filter and monitor against these types of attacks.

Another crucial step in making LLMs robust and secure is red teaming. This involves intentionally and systematically trying to find weaknesses in a large language model. A "red team" acts like an attacker, using clever prompts and strategies to push the model's boundaries.

By actively searching for vulnerabilities, you can strengthen the model and improve their defenses against various types of attacks, ensuring the AI application is safe to use.

RAG (Retrieval Augmented Generation)

Large language models (LLMs) are trained on massive amounts of data. However, by the time a model is released, its training data is often no longer current. This means LLMs usually draw information from a knowledge base that contains mostly older data.

Additionally, LLMs do not have direct access to the specific data found in your own systems or documents. Retrieval Augmented Generation (RAG) offers a solution by giving models the most up-to-date and relevant information. It allows them to pull the exact data needed to provide users with accurate answers.

RAG systems let models get information from the internet, internal company databases, or specialized documents before creating a response. This helps them provide answers that are more accurate and relevant to the situation, reducing the chance of giving general, made-up, or old information.

Businesses are already using RAG systems to "talk" to their own data. For example, customer support agents use them to find the latest company policies or internal user guides to help solve customer issues.

After prompt optimization, RAG offers the best return on investment for improving models. Only if RAG does not perform well enough should fine-tuning be considered as an option, but we will discuss that more later.

Vector databases

These new types of databases have become popular for machine learning and generative AI. This is because they are built to store, organize, and search information from sources that don't have a fixed structure, like text, images, or audio.

These databases allow you to search unstructured data by turning the data into embeddings. Embeddings are numerical representations of data, shown as vectors which you can think of as points in a multi-dimensional space. These embeddings capture the key features and connections within the original data. For example, words with similar meanings or relationships are placed closer together in this space like "happy" and "joyful".

This setup lets large language models (LLMs) find information using semantic search within the vector database. This allows them to effectively find information in documents or other unstructured data.

These databases are sometimes also used as a long term memory storage for LLMs. This is because information from past interactions and knowledge bases can be added to their working memory, giving them a broader context for their responses in future interactions.

Graph databases

Another type of database, the graph database, is now becoming an important part of Generative AI (GenAI) applications. These databases are good at modeling, storing, and finding complex connections between different pieces of information.

Graph databases show data as nodes (which are like individual items or entities) and edges (which show the relationships between those items). This setup helps large language models not only get information but also understand how different pieces of information relate to each other.

To illustrate, consider this example: if Jill and James buy similar things, the system can understand that Jill might also like products James bought. The model can even explain why it recommends a particular product to Jill. This method is called GraphRAG. It allows AI systems not only to find information but also to show how they reached their conclusion.

In contrast, typical RAG applications would only find similar meanings. So, if a user searches for "shoes," they might also suggest "sport shoes" or "heels" because those terms are semantically linked.

The ability of GraphRAG to explain its reasoning will be especially important for enterprise AI applications in industries where strict rules and regulations are in place and where there is a high need for transparency and very little room for mistakes.

Model fine tuning

Fine-tuning is often the last step taken after trying prompt engineering and using Retrieval Augmented Generation (RAG). Fine tuning involves creating a specific dataset for a particular task, filled with question and answer pairs, so the model can learn from these patterns.

There is a difference between full fine-tuning and instruction fine-tuning. For full fine-tuning on complex or general tasks, you typically need more than 50,000 labeled examples.

When we discuss fine-tuning a model, we usually mean instruction fine-tuning, not full fine-tuning. With instruction fine-tuning, you can often achieve some results with just a few hundred examples, and better results with even as few as 1,000 high-quality examples if the task is narrow and the data is well-labeled.

Since creating high-quality labeled data can be expensive and take a lot of time, fine-tuning is usually only done when you truly need very specialized, consistent, or domain-specific behavior that general models, RAG, or prompt engineering cannot effectively provide.

Voice Interfaces

Comparing typing speed to speaking speed shows a clear difference: the average person types 40 to 45 words per minute, but speaks 125 to 150 words per minute. This simple comparison highlights how much more efficient it is to use voice as an input instead of typing.

We're also much more accustomed to using our voices in everyday life, whether talking with colleagues or partners. So, it's no surprise that voice interfaces are becoming another core part of AI interactions.

We've already seen examples with devices like Amazon Alexa or Google Home, and even with ChatGPT's voice mode, allowing for spoken conversations. Some programmers are even using voice to give instructions to AI models that write code.

Voice interfaces will also form the basis for automated call centers, enabling AI agents to handle calls and speak directly with people. Voice interaction is a natural progression after text-based chat windows and will probably become one of the core components of AI applications.

Computer Usage

Currently, Large Language Models (LLMs) are primarily limited to processing and generating text. They don't inherently "see" or understand user interfaces visually, the way humans do. To overcome this, new AI agent systems are being developed. These systems often integrate an LLM with a Large Vision Model (LVM), or use a multimodal AI model (which handles both text and vision simultaneously), to give the AI access to an operating system. This allows the AI to perceive and interact with web browsers or any other application, much like a human would.

For these AI models to work with applications, they need a form of "sight." Since they don't have human-like vision, it's more accurate to say the AI system analyzes digital representations of the screen, such as screenshots. It's like the model "takes a picture" every few seconds, processes that image to understand the interface's current state (e.g., identifying buttons, text fields), and then uses that information to decide its next action.

There's a lot of exciting experimentation happening with these AI agent systems. For example, developers are building AI agents that can autonomously book hotel rooms or flights by navigating complex websites and interacting with their interfaces on your behalf.

However, real production ready use-cases still lack behind as a lot of VLMs have a large margin of error when interacting with user interfaces, leading to a bad and error prone user experience.

(Cloud) Sandboxes

When large language models work with code, they need a safe place to run it without impacting your development, testing, or production environments. You should keep these areas separate from where people work. This is where sandboxes come in.

A sandbox lets a model safely and securely run code without affecting the rest of the application. If the sandbox environment crashes, it's not a major issue; it can simply delete the old environment and start a new one.

Because of this, code sandboxes are an important part of the AI toolkit if your models interact with complex applications or code without disrupting other development environments.

Model Context Protocol

The Model Context Protocol (MCP) is an open standard introduced by Anthropic in late 2024. It standardized how large language models (LLMs) connect with and access tools, data sources, and other services.

The MCP acts as a connecting piece among many of the building blocks discussed earlier. It helps teams avoid repeatedly creating the same integration code. Before, developers would write code to connect with a specific application through its API. Now, the company owning the API can offer an MCP server. This allows developers to focus on writing code for the AI agent itself, instead of also figuring out how the agent should connect with the API.

The plan is to make these servers easily discoverable in the future. For example, if you tell your personal assistant AI to book a flight, it could directly connect with a flight company's MCP server, without needing to browse the company's website.

Get the next bite sized AI insight inside your inbox.

Daniel van der WoudeFounder of N8X