Context engineering is the way

We’ve…been working on this for over a year…and…he just…he tweeted it out.
I really like the term “context engineering” over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM.
— tobi lutke (@tobi) June 19, 2025
Context engineering is the new hotness, and we’re so excited there’s a term for
this now!
Threading this needle, giving enough context to make the task solvable without
drowning the model, is the key to building a reliable, successful LLM
application. And it happens to be exactly what we’ve been working on at Bolt
Foundry.
What is context?
Context is everything your model sees before it sends a response.
It’s your prompt, user messages, tool calls, prior conversation turns, samples,
and graders.
With too much context (like prompt stuffing) you divert the LLM’s attention and
reliability plummets. With too little context, LLMs are left guessing and
filling in gaps, which is similarly unreliable.
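Concretely, for a chat-style API, all of that context arrives as one array of
messages that share a single attention budget. Here’s a minimal sketch (the
message shape follows the common chat-completion format; the content strings
are illustrative):

```typescript
// Everything the model "sees" is one array of messages: system prompt,
// prior turns, and few-shot samples all share the same context window.
type Role = "system" | "user" | "assistant";

interface Message {
  role: Role;
  content: string;
}

const context: Message[] = [
  { role: "system", content: "You are a sports newsletter editor." },
  // A few-shot sample: one worked example of a good summary.
  { role: "user", content: "Summarize: Jets beat Bills 24-17..." },
  { role: "assistant", content: "Jets edge Bills in a defensive slugfest..." },
  // The live request comes last.
  { role: "user", content: "Summarize: Celtics take game 7..." },
];
```

Every element in that array competes for the model’s attention, which is why
both overstuffing it and starving it hurt reliability.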
Why does this matter?
In every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step.
— Andrej Karpathy (@karpathy) June 25, 2025
In human-to-human communication, context is key. We get context through dozens
of verbal and nonverbal cues, like body language, tone of voice, eye contact,
pitch, and more. Remote work sucks in part because we lose most of these
contextual cues on Zoom.
LLMs also need context to perform reliably.
We’ve found the best way to provide this context is:
- Create data samples from examples of success and failure
- Build graders from those samples that reinforce what you want the LLM to do
- Structure your prompt with clear information hierarchy
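As a rough sketch of what the first two steps can look like in code (the type
names, score scale, and buildGraderPrompt helper here are hypothetical
illustrations, not a specific Bolt Foundry API):

```typescript
// A data sample pairs a real model output with a human judgment.
interface Sample {
  input: string;   // what the model was asked
  output: string;  // what the model produced
  score: -3 | -2 | -1 | 1 | 2 | 3; // negative = failure, positive = success
  reason: string;  // why the human scored it that way
}

// A grader is itself a prompt, built from those samples, that asks an
// LLM to score new outputs on the same scale.
function buildGraderPrompt(samples: Sample[]): string {
  const examples = samples
    .map((s) =>
      `Input: ${s.input}\nOutput: ${s.output}\nScore: ${s.score} (${s.reason})`
    )
    .join("\n\n");
  return `Score the output from -3 (clear failure) to +3 (clear success).\n\nExamples:\n${examples}`;
}
```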
The difference between ai slop and magical experiences is the context you give to the model
— boris tane (@boristane) June 23, 2025
Our work with samples, graders, and evals is context engineering at its core.
We’re structuring feedback and examples to optimize LLM performance, which is
exactly what Karpathy is describing.
What does this look like in practice?
Merely crafting prompts does not seem like a real fulltime role, but figuring out how to compress context, chain prompts, recover from errors, and measure improvements is super challenging.
— Amjad Masad (@amasad) January 21, 2023
We recently implemented this approach with Fastpitch, an AI-generated sports
newsletter. We wrote about it here, but the
highlights are:
- We started by creating Ground Truth samples of stories collected by the LLM,
scored by a human
- We created an additional collection of synthetic samples to reinforce the
learning
- We used those data samples to build a Grader that evaluates story data
- We iterated on that Grader until it agreed with the Ground Truth samples (see
the calibration sketch below)
- We then used that Grader to adjust our prompt
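The calibration step deserves emphasis: treat the Grader like any model under
test by running it over the human-scored samples and measuring agreement. A
sketch under assumed types (the GradedSample shape, the sign-based agreement
rule, and the 0.9 bar are illustrative assumptions):

```typescript
// Score the same samples a human already scored, and measure how often
// the Grader and the human agree.
interface GradedSample {
  humanScore: number;  // Ground Truth from a human reviewer
  graderScore: number; // what the Grader assigned
}

function agreementRate(results: GradedSample[]): number {
  const agreed = results.filter(
    // "Agreement" here means same sign: both call it a success,
    // or both call it a failure.
    (r) => Math.sign(r.humanScore) === Math.sign(r.graderScore),
  ).length;
  return agreed / results.length;
}

// Iterate on the grader prompt until agreementRate clears your bar
// (say 0.9) before trusting the Grader to judge prompt changes.
```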
Proper information hierarchy also helps LLMs perform better. We've seen this
over and over with customers.
We took one customer from 86% reliability on XML output to 100% in less than
an hour with some basic prompt tweaks.
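For a sense of what those tweaks can look like, here is a hypothetical
before/after showing the kind of information hierarchy that helps (the wording
is illustrative, not the customer’s actual prompt):

```typescript
// Before: constraints buried mid-sentence, easy for the model to drop.
const before = `Summarize the article and make sure you use XML and keep
it short and remember the <summary> tag and don't add commentary.`;

// After: clear hierarchy, with the output format isolated and explicit.
const after = `# Task
Summarize the article.

# Constraints
- Keep it under 50 words.
- No commentary.

# Output format
Respond with exactly:
<summary>...</summary>`;
```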
This approach of giving the model "just the right information" through
human-graded samples, a calibrated Grader, and a well-structured prompt is the
heart of context engineering.
We're thrilled to see more people talking about this.
If you're interested in learning more about context engineering, and making LLM
development more science than art,
join our community on Discord.