{"id":"2056032201259831398","url":"https://x.com/lotte_verheyden/status/2056032201259831398","text":"","author":{"name":"Lotte","username":"lotte_verheyden","avatarUrl":"https://pbs.twimg.com/profile_images/1994707418790432768/NvTWenyx_200x200.jpg"},"createdAt":"Sun May 17 15:20:50 +0000 2026","engagement":{"replies":4,"retweets":34,"likes":320,"views":39222},"article":{"title":"Designing eval datasets for LLM applications","previewText":"This is one piece of a series we’re publishing as part of the Langfuse Academy, where we walk through the full AI engineering lifecycle. If you’re new to the series,The AI Engineering Loop is the best","coverImageUrl":"https://pbs.twimg.com/media/HIf8lCLa0AAgR6C.jpg","content":"This is one piece of a series we’re publishing as part of the [Langfuse Academy](https://langfuse.com/academy), where we walk through the full AI engineering lifecycle. If you’re new to the series, [The AI Engineering Loop](https://langfuse.com/academy/ai-engineering-loop) is the best place to start.\n\n## A short recap of the AI Engineering Loop\n\nThe AI Engineering Loop is how teams continuously improve AI systems. It connects what’s happening in production (tracing, monitoring) to structured iteration during development (datasets, experiments, evaluation). Each shipped improvement produces new data, and teams loop through this process continuously.\n\n![](https://pbs.twimg.com/media/HIf6DJlasAAudWe.jpg)\n\nYou can read more on this [here](https://langfuse.com/academy/ai-engineering-loop).\n\n# How datasets fit into the loop\n\nSo far, we've covered the first two steps of the [AI engineering loop](https://langfuse.com/academy/ai-engineering-loop): [tracing](https://langfuse.com/academy/tracing) your application and [monitoring](https://langfuse.com/academy/monitoring) its behavior live. 
Those give you visibility into what your system is actually doing and give you inspiration for improvement.\n\nNow the question becomes: when you spot something worth improving, how do you test a change before deploying it to production? The next three steps of the loop cover exactly this, starting with datasets.\n\nA dataset is a collection of test cases that you run your application against each time you make a change (\"an experiment\"). Instead of deploying and hoping for the best, you get a repeatable, consistent check across a set of inputs that represent real-world usage.\n\n# The dataset item\n\nA dataset is made up of items. Each item represents one test case: a situation your application should be able to handle. Generally, an item has three fields:\n\n- Input (required)\n\n- Expected output (optional)\n\n- Metadata (optional)\n\n## The three fields of a dataset item\n\n![](https://pbs.twimg.com/media/HINlehfaMAAOPP-.png)\n\nA good mental model:\n\n![](https://pbs.twimg.com/media/HINlidcbAAAsVew.png)\n\n## Common expected output patterns\n\nWhether you need an expected output, and what it looks like, depends on which type of evaluator you use.\n\nReference-based versus reference-free evaluators\n\nSome evaluators check the output against a predefined expected output (reference-based). Others assess the output without needing a ground truth to compare against (reference-free).\n\nExact match\n\nThe expected output is the literal correct answer. For example:\n\n- A classification task where the correct label is \"billing_inquiry\"\n\n- An extraction task where the expected entities are [\"Paris\", \"Thursday\"]\n\nReference answer\n\nThe expected output is a gold-standard response that shows what a good output looks like. 
The evaluator can compare the test's output against this example, for instance by checking semantic similarity or whether the key points match.\n\nEvaluation criteria\n\nThe expected output is a list of checks or requirements the output should satisfy. For example:\n\n- \"must mention the refund policy\"\n\n- \"must include a link to the help center\"\n\nThe evaluator checks whether the output meets these criteria.\n\nNothing\n\nSometimes no expected output is required at all. If you're just checking whether:\n\n- the tone is professional\n\n- the response is safe\n\n- the output follows a required format\n\nthen your dataset items need nothing other than an input, because you will use a reference-free evaluator.\n\nCombination of the above\n\nBecause you can run several different evaluators on a single dataset item, its expected output field can hold more than one type of reference data. The expected output is a JSON field, so these can be stored side by side.\n\n# What makes a good dataset\n\nA good dataset mirrors what your system will encounter in production. If passing the dataset gives you confidence before deploying, it's doing its job.\n\nClear in scope. Each dataset should have a well-defined scope. That can be end-to-end if you treat internal steps as implementation details, or it can target an individual step like retrieval or summarization if that's the part you're trying to improve. You'll likely end up with multiple datasets, each with a clear purpose.\n\n![](https://pbs.twimg.com/media/HIf6xeoaAAABhsj.jpg)\n\nThe right size for the workflow. Some datasets are small and fast enough to run on every push as part of your CI/CD pipeline. 
Others are larger and more comprehensive, and are useful to run periodically but too slow for every minor change.\n\n![](https://pbs.twimg.com/media/HINl6XWbMAA78El.png)\n\n# Where to start\n\nStart with the most concrete examples you have, then expand coverage once you know what you are trying to test.\n\n1. Pull examples from production traces where you spotted behavior you would like to improve, either as-is, anonymized, or transformed by AI.\n\n2. Add hand-written cases based on predefined requirements, edge cases, or behaviors your agent must handle reliably.\n\n3. Generate synthetic examples with AI once you know which dimensions you want to cover more broadly.\n\n# What comes next\n\nOnce you have a dataset, the next step is running your system against it to see how changes affect output quality. This is what [experiments](https://langfuse.com/academy/experiments) are for."}}