Converted skeptic

The latest LLMs show increasing capability, and more and more people are convinced they can do anything. I believe this is because "agents" can solve most of the tasks people imagine. I was a bit conceited and kept a suite of tasks that I thought were out-of-domain for previous models (thinking of GPT-4o and the first Opus), and as I suspected, the models failed miserably. I recall one was writing a study guide using jupyterbook, an obscure and rather finicky codebase that generates HTML sites from markdown and code. The LLM knew exactly what I wanted, but still failed to make it work: it would hallucinate nonsense and try to inject HTML and JavaScript into the markdown document. I also had a vibe-coded note-taking app completely collapse and not even start after consecutive coding sessions, which left a sour aftertaste.

Fast forward six months, and "agents" could solve most of the tasks I imagined. I was blown away by the subjective success rate of my prompts turning into usable, time-saving code. I too became a believer that AI had solved coding. I worked on a web app over four months of weekends, and after a while I was very confident a PR would not break it and started vibe merging. I had a lot of trust in the robustness of the software.

Finally, limits of vibe coding

I mainly use Opus 4.6. I find it reliable if I put one PR's worth of tasks in a prompt with some file pointers; I usually open a PR before going to bed after letting it code autonomously. This week, after a coworker showed me agent swarm mode, I tried a rather massive change in one go: I asked Claude to rewrite my 30k-line React frontend in Svelte. I am vaguely familiar with what a basic Svelte app looks like, and I was curious whether Claude could translate already well-defined software with clear patterns into another framework.

It was kinda cool. Opus performed analysis and planning, then spawned 20 agents to work on subparts of the new Svelte frontend, maintaining 1-to-1 parity of files. It died three times and ran out of tokens twice, but it managed to push out 28 thousand new lines of code. Unfortunately, running it did not work at all: lifecycle_outside_component unhandled rejections, in a way that was not possible to debug. I threw 20 bucks at Codex with GPT5.3, but after some noodling and copy-pasting console output I had to throw in the towel. It is code that neither I nor the AI can fix, and I burned a couple million tokens on garbage. I hit the limit. 10k-line codebases won't be one-shot just yet!
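For the curious: this class of error is easy to trip in generated code, because Svelte's lifecycle hooks must be called synchronously while a component is being set up, and they throw when invoked from any other context. Here is a minimal sketch of the mechanism in plain JavaScript (an assumption-laden toy model, not Svelte's actual source; the names `initComponent` and `mountHandlers` are made up for illustration):

```javascript
// Toy model of why lifecycle_outside_component fires: hooks check a
// module-level "current component" that is only set during setup.
let currentComponent = null;

function onMount(fn) {
  if (currentComponent === null) {
    // Svelte reports lifecycle_outside_component in this situation
    throw new Error("lifecycle_outside_component");
  }
  currentComponent.mountHandlers.push(fn);
}

// Hypothetical framework internals: the context exists only while
// the component's setup function runs synchronously.
function initComponent(setup) {
  const component = { mountHandlers: [] };
  currentComponent = component;
  try {
    setup();
  } finally {
    currentComponent = null;
  }
  return component;
}

// Fine: onMount is called synchronously during setup
const ok = initComponent(() => onMount(() => {}));

// Broken: onMount is called after setup has finished, e.g. from a
// shared helper or an async callback that a code-gen agent extracted
let threw = null;
try {
  onMount(() => {});
} catch (e) {
  threw = e.message;
}
```

When agents scatter hook calls into promise callbacks or shared utility modules during a big rewrite, every affected component hits this at runtime, which would explain why the errors surfaced everywhere at once.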

I wondered what could have made this a success. Better prompting? "Do this step by step"? More end-to-end testing? Some weird openmoltclaw setup? Multiple LLMs with RAG?

Furthermore, this made me reflect on the maintainability of a project. Right now, given some time and a book, I can just about debug the React frontend code (luckily, JS is quite forgiving), because I spent a lot of time reviewing it. But if I merged the Svelte rewrite, I would be completely lost. If AI can't maintain this, and I can't maintain this, who will?

You