From d4173ba0e09c56984a269263c796a6e6f24b016b Mon Sep 17 00:00:00 2001
From: Agent
Date: Tue, 3 Feb 2026 22:28:35 +0000
Subject: [PATCH] Clean up test files

---
 memory/ai-news-test-hybrid.md   | 104 --------------------------------
 memory/ai-news-test-snippets.md |  66 --------------------------------
 2 files changed, 170 deletions(-)
 delete mode 100644 memory/ai-news-test-hybrid.md
 delete mode 100644 memory/ai-news-test-snippets.md

diff --git a/memory/ai-news-test-hybrid.md b/memory/ai-news-test-hybrid.md
deleted file mode 100644
index 990f0dd..0000000
--- a/memory/ai-news-test-hybrid.md
+++ /dev/null
@@ -1,104 +0,0 @@
-# AI News Test Dataset - HYBRID (full content for OpenAI)
-
-Write a brief AI news summary covering these 5 articles. 2-3 sentences per topic. German, casual tone.
-
----
-
-## 1. OpenAI: Introducing the Codex App
-**Source:** openai.com | **Date:** Feb 2, 2026
-
-Today, we're introducing the Codex app for macOS—a powerful new interface designed to effortlessly manage multiple agents at once, run work in parallel, and collaborate with agents over long-running tasks.
-
-We're also excited to show more people what's now possible with Codex. For a limited time we're including Codex with ChatGPT Free and Go, and we're doubling the rate limits on Plus, Pro, Business, Enterprise, and Edu plans. Those higher limits apply everywhere you use Codex—in the app, from the CLI, in your IDE, and in the cloud.
-
-The Codex app changes how software gets built and who can build it—from pairing with a single coding agent on targeted edits to supervising coordinated teams of agents across the full lifecycle of designing, building, shipping, and maintaining software.
-
-**The Codex app: A command center for agents**
-
-Since we launched Codex in April 2025, the way developers work with agents has fundamentally changed.
-Models are now capable of handling complex, long-running tasks end to end, and developers are now orchestrating multiple agents across projects: delegating work, running tasks in parallel, and trusting agents to take on substantial projects that can span hours, days, or weeks. The core challenge has shifted from what agents can do to how people can direct, supervise, and collaborate with them at scale—existing IDEs and terminal-based tools are not built to support this way of working.
-
-The Codex app provides a focused space for multi-tasking with agents. Agents run in separate threads organized by projects, so you can seamlessly switch between tasks without losing context. The app lets you review the agent's changes in the thread, comment on the diff, and even open it in your editor to make manual changes.
-
-It also includes built-in support for worktrees, so multiple agents can work on the same repo without conflicts. Each agent works on an isolated copy of your code, allowing you to explore different paths without needing to track how they impact your codebase.
-
-Codex is evolving from an agent that writes code into one that uses code to get work done on your computer. With skills, you can easily extend Codex beyond code generation to tasks that require gathering and synthesizing information, problem-solving, writing, and more.
-
-Skills bundle instructions, resources, and scripts so Codex can reliably connect to tools, run workflows, and complete tasks according to your team's preferences. The Codex app includes a dedicated interface to create and manage skills.
-
-We asked Codex to make a racing game, complete with different racers, eight maps, and even items players could use with the space bar. Using an image generation skill and a web game development skill, Codex built the game by working independently, using more than 7 million tokens from just one initial user prompt.
-It took on the roles of designer, game developer, and QA tester, validating its work by actually playing the game.
-
----
-
-## 2. OpenAI: Inside OpenAI's In-House Data Agent
-**Source:** openai.com | **Date:** Jan 29, 2026
-
-Data powers how systems learn, how products evolve, and how companies make choices. But getting answers quickly, correctly, and with the right context is often harder than it should be. To make this easier as OpenAI scales, we built a bespoke in-house AI data agent that explores and reasons over our own platform.
-
-Our agent is a custom internal-only tool (not an external offering), built specifically around OpenAI's data, permissions, and workflows. The OpenAI tools we used to build it (Codex, GPT-5, the Evals API, and the Embeddings API) are the same tools we make available to developers everywhere.
-
-Our data agent lets employees go from question to insight in minutes, not days. This lowers the bar for pulling data and running nuanced analysis across all functions. Today, teams across Engineering, Data Science, Go-To-Market, Finance, and Research at OpenAI lean on the agent to answer high-impact data questions. It helps teams evaluate launches and understand business health, all through natural language.
-
-**Why they needed a custom tool:**
-
-OpenAI's data platform serves more than 3.5k internal users working across Engineering, Product, and Research, spanning over 600 petabytes of data across 70k datasets. At that size, simply finding the right table can be one of the most time-consuming parts of doing analysis.
-
-As one internal user put it: "We have a lot of tables that are fairly similar, and I spend tons of time trying to figure out how they're different and which to use."
-
-**How it works:**
-
-The agent handles analysis end-to-end, from understanding the question to exploring the data, running queries, and synthesizing findings. Rather than following a fixed script, the agent evaluates its own progress.
-If an intermediate result looks wrong, the agent investigates what went wrong, adjusts its approach, and tries again. This closed-loop, self-correcting process shifts iteration from the user into the agent itself.
-
-The agent covers the full analytics workflow: discovering data, running SQL, and publishing notebooks and reports. It understands internal company knowledge, can search the web for external information, and improves over time through usage and memory.
-
-High-quality answers depend on rich, accurate context. The agent uses:
-- Metadata grounding: schema metadata informs SQL writing, and table lineage captures relationships between tables
-- Query inference: ingesting historical queries shows the agent how tables are actually queried in practice
-
----
-
-## 3. Simon Willison: Moltbook is the most interesting place on the internet right now
-**Source:** simonwillison.net | **Date:** Jan 30, 2026
-
-Moltbook is Facebook for your Molt (one of the previous names for OpenClaw assistants). It's a social network where digital assistants can talk to each other.
-
-The first neat thing about Moltbook is the way you install it: you show the skill to your agent by sending them a message with a link to https://www.moltbook.com/skill.md. Embedded in that Markdown file are installation instructions.
-
-The hottest project in AI right now is OpenClaw, an open-source implementation of the digital personal assistant pattern, built by Peter Steinberger. It's two months old, has over 114,000 stars on GitHub, and is seeing incredible adoption.
-
-OpenClaw is built around skills, and the community around it is sharing thousands of them on clawhub.ai. A skill is a zip file containing Markdown instructions and optional extra scripts, which makes skills a powerful plugin system.
-
-Given the inherent risk of prompt injection against this class of software, it's Simon's current pick for "most likely to result in a Challenger disaster."
-
----
-
-## 4. Simon Willison: The Five Levels - from Spicy Autocomplete to the Dark Factory
-**Source:** simonwillison.net | **Date:** Jan 28, 2026
-
-Dan Shapiro proposes a five-level model of AI-assisted programming, inspired by the levels of driving automation:
-
-0. **Spicy autocomplete** - original GitHub Copilot or copying snippets from ChatGPT
-1. **Coding intern** - writing unimportant snippets and boilerplate, with full human review
-2. **Junior developer** - pair programming with the model, but still reviewing every line
-3. **Developer** - most code is generated by AI; you take on the role of full-time code reviewer
-4. **Engineering team** - you're more of an engineering manager; you collaborate on specs and plans, and the agents do the work
-5. **Dark software factory** - like a factory run by robots where the lights are out, because robots don't need to see
-
-About level 5: "At level 5, it's not really a car any more. Your software process isn't really a software process any more. It's a black box that turns specs into software."
-
-Simon talked to one team doing the "dark factory" pattern. Key characteristics:
-- Nobody reviews AI-produced code, ever. They don't even look at it.
-- The goal of the system is to prove that the system works. A huge amount of the coding-agent work goes into testing and tooling.
-- The role of the humans is to design that system - to find new patterns that help the agents work more effectively.
-
----
-
-## 5. Sebastian Raschka: Categories of Inference-Time Scaling for Improved LLM Reasoning
-**Source:** magazine.sebastianraschka.com | **Date:** Recent
-
-Inference scaling has become one of the most effective ways to improve answer quality and accuracy in deployed LLMs. The idea: if we are willing to spend a bit more compute at inference time, we can get the model to produce better answers.
-
-Every major LLM provider relies on some flavor of inference-time scaling today.
-Back in March, Sebastian wrote an overview of inference scaling and summarized early techniques. This article groups the different approaches into clearer categories and highlights the newest work.
-
-As part of drafting a full book chapter on inference scaling for "Build a Reasoning Model (From Scratch)", Sebastian experimented with many of the fundamental flavors of these methods. With hyperparameter tuning, this quickly turned into thousands of runs. The chapter takes the base model from about 15 percent to around 52 percent accuracy.
-
-Categories covered: Chain-of-Thought Prompting, Self-Consistency, Best-of-N, and more.
diff --git a/memory/ai-news-test-snippets.md b/memory/ai-news-test-snippets.md
deleted file mode 100644
index 6d479b0..0000000
--- a/memory/ai-news-test-snippets.md
+++ /dev/null
@@ -1,66 +0,0 @@
-# AI News Test Dataset - SNIPPETS ONLY (for OpenAI)
-
-Write a brief AI news summary covering these 5 articles. 2-3 sentences per topic. German, casual tone.
-
----
-
-## 1. OpenAI: Introducing the Codex App
-**Source:** openai.com | **Date:** Feb 2, 2026
-
-> Introducing the Codex app for macOS—a command center for AI coding and software development with multiple agents, parallel workflows, and long-running tasks.
-
----
-
-## 2. OpenAI: Inside OpenAI's In-House Data Agent
-**Source:** openai.com | **Date:** Jan 29, 2026
-
-> How OpenAI built an in-house AI data agent that uses GPT-5, Codex, and memory to reason over massive datasets and deliver reliable insights in minutes.
-
----
-
-## 3. Simon Willison: Moltbook is the most interesting place on the internet right now
-**Source:** simonwillison.net | **Date:** Jan 30, 2026
-
-Moltbook is Facebook for your Molt (one of the previous names for OpenClaw assistants). It's a social network where digital assistants can talk to each other.
-
-The first neat thing about Moltbook is the way you install it: you show the skill to your agent by sending them a message with a link to https://www.moltbook.com/skill.md. Embedded in that Markdown file are installation instructions.
-
-The hottest project in AI right now is OpenClaw, an open-source implementation of the digital personal assistant pattern, built by Peter Steinberger. It's two months old, has over 114,000 stars on GitHub, and is seeing incredible adoption.
-
-OpenClaw is built around skills, and the community around it is sharing thousands of them on clawhub.ai. A skill is a zip file containing Markdown instructions and optional extra scripts, which makes skills a powerful plugin system.
-
-Given the inherent risk of prompt injection against this class of software, it's Simon's current pick for "most likely to result in a Challenger disaster."
-
----
-
-## 4. Simon Willison: The Five Levels - from Spicy Autocomplete to the Dark Factory
-**Source:** simonwillison.net | **Date:** Jan 28, 2026
-
-Dan Shapiro proposes a five-level model of AI-assisted programming, inspired by the levels of driving automation:
-
-0. **Spicy autocomplete** - original GitHub Copilot or copying snippets from ChatGPT
-1. **Coding intern** - writing unimportant snippets and boilerplate, with full human review
-2. **Junior developer** - pair programming with the model, but still reviewing every line
-3. **Developer** - most code is generated by AI; you take on the role of full-time code reviewer
-4. **Engineering team** - you're more of an engineering manager; you collaborate on specs and plans, and the agents do the work
-5. **Dark software factory** - like a factory run by robots where the lights are out, because robots don't need to see
-
-About level 5: "At level 5, it's not really a car any more. Your software process isn't really a software process any more. It's a black box that turns specs into software."
-
-Simon talked to one team doing the "dark factory" pattern. Key characteristics:
-- Nobody reviews AI-produced code, ever. They don't even look at it.
-- The goal of the system is to prove that the system works. A huge amount of the coding-agent work goes into testing and tooling.
-- The role of the humans is to design that system - to find new patterns that help the agents work more effectively.
-
----
-
-## 5. Sebastian Raschka: Categories of Inference-Time Scaling for Improved LLM Reasoning
-**Source:** magazine.sebastianraschka.com | **Date:** Recent
-
-Inference scaling has become one of the most effective ways to improve answer quality and accuracy in deployed LLMs. The idea: if we are willing to spend a bit more compute at inference time, we can get the model to produce better answers.
-
-Every major LLM provider relies on some flavor of inference-time scaling today. Back in March, Sebastian wrote an overview of inference scaling and summarized early techniques. This article groups the different approaches into clearer categories and highlights the newest work.
-
-As part of drafting a full book chapter on inference scaling for "Build a Reasoning Model (From Scratch)", Sebastian experimented with many of the fundamental flavors of these methods. With hyperparameter tuning, this quickly turned into thousands of runs. The chapter takes the base model from about 15 percent to around 52 percent accuracy.
-
-Categories covered: Chain-of-Thought Prompting, Self-Consistency, Best-of-N, and more.