I Ran 60 Autoresearch Experiments on a Production Search Algorithm. Here's What Actually Happened.

Everyone's writing about Karpathy's autoresearch. Most of it is "here's how the loop works" or "imagine the possibilities." I wanted to see what happens when you point it at a real codebase with a real metric, not a training script. So I tried it: two rounds, 60 total iterations. The first round improved things. The second round found nothing - and that turned out to be even more interesting.

The System

I work on a hybrid search system: Cohere embeddings in pgvector for semantic similarity, then a keyword re-ranking layer on top. Django, PostgreSQL, Bedrock. The kind of search stack a lot of teams are probably running right now.

The ranking logic lives in one file: utils.py. It takes the top 100 vector search candidates, scores them on keyword and tag matches across location, activity, and general terms, normalizes everything with z-scores, applies adaptive correlation-based weighting to avoid double-counting, and combines it all into a final score: similarity * (1 + keyword_score).
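To make that pipeline concrete, here's a minimal sketch of what a re-ranker like that could look like. The real utils.py isn't shown in this post, so everything here is assumed: the Candidate fields, the rank_candidates name, and especially the correlation-based weighting scheme, which is just one plausible reading of "avoid double-counting."

```python
# Sketch of the re-ranking step: per-category keyword scores, z-score
# normalization, correlation-based down-weighting, and
# final = similarity * (1 + keyword_score).
# All names are hypothetical; requires Python 3.10+ for statistics.correlation.
from dataclasses import dataclass
import statistics


@dataclass
class Candidate:
    doc_id: int
    similarity: float    # cosine similarity from the pgvector query
    location_hits: int   # keyword/tag match counts per category (assumed fields)
    activity_hits: int
    general_hits: int


def zscores(values):
    """Z-score normalize a column; zeros if the column is (near-)constant."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev < 1e-9:
        return [0.0] * len(values)
    return [(v - mean) / stdev for v in values]


def pearson(xs, ys):
    """Pearson correlation; 0.0 when either column is constant."""
    try:
        return statistics.correlation(xs, ys)
    except statistics.StatisticsError:
        return 0.0


def rank_candidates(candidates):
    """Re-rank the top-100 vector candidates by a combined score."""
    cols = {
        "location": [c.location_hits for c in candidates],
        "activity": [c.activity_hits for c in candidates],
        "general":  [c.general_hits for c in candidates],
    }
    normed = {name: zscores(vals) for name, vals in cols.items()}

    # Adaptive weights: shrink each signal by its mean correlation with the
    # other signals, so two overlapping signals don't count twice.
    # (An assumption about how the "correlation-based weighting" works.)
    names = list(normed)
    weights = {}
    for name in names:
        others = [pearson(normed[name], normed[o]) for o in names if o != name]
        weights[name] = max(0.0, 1.0 - abs(statistics.fmean(others)))

    total_w = sum(weights.values()) or 1.0
    scored = []
    for i, c in enumerate(candidates):
        keyword_score = sum(weights[n] * normed[n][i] for n in names) / total_w
        scored.append((c.similarity * (1 + keyword_score), c))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]
```

The multiplicative form matters: a candidate with weak semantic similarity can't keyword-match its way to the top, since the keyword score only scales the similarity rather than adding to it.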