22/48 scenarios carry a flag (mostly a single discerning-judge point or one eatery leak). No tier collapses; the soft spot is dense multi-city MEDIUM trips - see Analysis.
assemblePool from REAL prod-D1 saves + seed top-up; formatV2TripRequest renders the exact shipping prompt (anchors/IF-TIME split + [FULL/HALF/QUICK] intensity tags + soft-priority RULES + seasonality); a Claude Sonnet subagent generates; then the production post-passes (applyScopeFilter -> repairCityCoverage -> filterCrossDayDuplicateItems) are applied. The scored object is the SHIPPED itinerary, not raw model output.

destinations.name with verified seed corpus. Pace mixed chill/balanced/packed; trips 3-12 days; intl + India; single-city, multi-city, and day-trip cases.

applyScopeFilter + dedup but NOT repairCityCoverage (a V9-era addition), isolating the prompt + repair contribution.

| Cohort | n | Doability | Rhythm | Clustering | Narrative | Selection | Overall |
|---|---|---|---|---|---|---|---|
| V9 — all | 48 | 7.3 | 7.58 | 6.99 | 8.47 | 7.56 | 7.29 |
| Heavy saves | 12 | 7.25 | 7.67 | 6.96 | 8.58 | 7.79 | 7.42 |
| Medium saves | 12 | 6.5 | 7.13 | 6.21 | 8.25 | 7.13 | 6.58 |
| Light saves | 12 | 7.88 | 7.79 | 7.25 | 8.38 | 7.58 | 7.63 |
| Cold-start / seed | 12 | 7.58 | 7.75 | 7.54 | 8.67 | 7.75 | 7.54 |
Doability is weighted highest in the rubric (realistic daily load + feasible inter-stop travel + single-zone clustering). Overall V9 doability 7.3/10.
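
As a sanity check on how the cohort rows relate to the per-scenario scores, the sketch below averages the twelve HEAVY doability scores listed in the per-scenario table. The `ScenarioScore` type and the `cohortMean` helper are illustrative names, not part of the bench tooling, and rounding to two decimals is an assumption chosen to match the table.

```typescript
type Tier = "HEAVY" | "MEDIUM" | "LIGHT" | "SEED";

interface ScenarioScore {
  tier: Tier;
  doability: number;
}

// Mean doability of one tier, rounded to two decimals (assumed rounding).
function cohortMean(rows: ScenarioScore[], tier: Tier): number {
  const scores = rows.filter((r) => r.tier === tier).map((r) => r.doability);
  if (scores.length === 0) return NaN;
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return Math.round(mean * 100) / 100;
}

// Doability of the 12 HEAVY scenarios, copied from the per-scenario table.
const heavyDoability: number[] = [6, 8, 8, 8.5, 7, 9, 8, 5.5, 4, 9, 6.5, 7.5];
const heavyRows: ScenarioScore[] = heavyDoability.map((d) => {
  const row: ScenarioScore = { tier: "HEAVY", doability: d };
  return row;
});

console.log(cohortMean(heavyRows, "HEAVY")); // 7.25
```

The result matches the 7.25 reported for the Heavy saves cohort, which suggests the cohort rows are plain means of the per-scenario scores.
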
Coverage % higher = better; all the avg-per-trip counts (dup / over-cap / empty / gap / wrong-city / eatery) lower = better.
| Cohort | n | Anchor cov% | Saves util% | City-lock% | Over-cap/trip | Empty/trip | Cross-day dup | Wrong-city | Gap-fill | Eatery-as-act |
|---|---|---|---|---|---|---|---|---|---|---|
| All 48 | 48 | 99% | 87% | 100% | 0.02 | 0 | 0.02 | 0.08 | 0.58 | 0.17 |
| HEAVY | 12 | 97% | 79% | 100% | 0 | 0 | 0.08 | 0.08 | 1.5 | 0.42 |
| MEDIUM | 12 | 100% | 91% | 100% | 0 | 0 | 0 | 0.25 | 0.33 | 0.17 |
| LIGHT | 12 | 100% | 83% | 100% | 0 | 0 | 0 | 0 | 0.08 | 0 |
| SEED | 12 | 100% | 93% | 100% | 0.08 | 0 | 0 | 0 | 0.42 | 0.08 |
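
Two of the deterministic checks in the table above can be sketched as follows. This is a minimal illustration under assumed shapes: `DayPlan`, `anchorCoveragePct`, and `crossDayDupCount` are hypothetical names, and the real bench scripts may define coverage and duplicates differently.

```typescript
interface DayPlan {
  city: string;
  items: string[]; // place identifiers scheduled on this day
}

// Percentage of anchor items that appear on at least one day.
function anchorCoveragePct(days: DayPlan[], anchors: string[]): number {
  if (anchors.length === 0) return 100; // nothing to cover
  const placed = new Set<string>();
  for (const day of days) {
    for (const item of day.items) placed.add(item);
  }
  const covered = anchors.filter((a) => placed.has(a)).length;
  return Math.round((covered / anchors.length) * 100);
}

// Counts items that reappear on a later day after already being scheduled.
function crossDayDupCount(days: DayPlan[]): number {
  const seenOnEarlierDays = new Set<string>();
  let dups = 0;
  for (const day of days) {
    const today: string[] = [];
    for (const item of day.items) {
      if (today.indexOf(item) === -1) today.push(item); // ignore same-day repeats
    }
    for (const item of today) {
      if (seenOnEarlierDays.has(item)) dups += 1;
      seenOnEarlierDays.add(item);
    }
  }
  return dups;
}

// Made-up two-day trip; identifiers are placeholders, not benchmark data.
const sampleTrip: DayPlan[] = [
  { city: "A", items: ["a1", "a2"] },
  { city: "B", items: ["b1", "a2"] }, // "a2" repeats from day 1
];

console.log(anchorCoveragePct(sampleTrip, ["a1", "b1", "b2"])); // 67
console.log(crossDayDupCount(sampleTrip)); // 1
```
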
Pairwise (which itinerary is better, judges blind to source):
| Dimension | V9 wins | OLD wins | Tie |
|---|---|---|---|
| Doability | 15 | 9 | 0 |
| Rhythm | 16 | 7 | 1 |
| Clustering | 16 | 8 | 0 |
| Narrative | 8 | 8 | 8 |
| Selection | 4 | 15 | 5 |
| Overall | 15 | 9 | 0 |
| Overall winner | 15 | 9 | 0 |
n = 24 pairwise comparisons (12 scenarios x 2 blind judges).
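
For readers who want to reproduce a row of the pairwise table, a tally over blind verdicts might look like the sketch below. The `Verdict` and `Tally` types and both helper functions are assumptions for illustration; the verdict list simply rebuilds the Rhythm row (16 / 7 / 1) from the table.

```typescript
type Verdict = "V9" | "OLD" | "TIE";

interface Tally {
  v9: number;
  old: number;
  tie: number;
}

// Counts how many verdicts went to each side.
function tallyVerdicts(verdicts: Verdict[]): Tally {
  const t: Tally = { v9: 0, old: 0, tie: 0 };
  for (const v of verdicts) {
    if (v === "V9") t.v9 += 1;
    else if (v === "OLD") t.old += 1;
    else t.tie += 1;
  }
  return t;
}

// Helper that builds n copies of one verdict.
function repeatVerdict(v: Verdict, n: number): Verdict[] {
  const out: Verdict[] = [];
  for (let i = 0; i < n; i += 1) out.push(v);
  return out;
}

// 12 scenarios x 2 judges = 24 verdicts for the Rhythm dimension.
const rhythmVerdicts: Verdict[] = repeatVerdict("V9", 16)
  .concat(repeatVerdict("OLD", 7))
  .concat(repeatVerdict("TIE", 1));

const rhythmTally = tallyVerdicts(rhythmVerdicts);
console.log(rhythmTally); // { v9: 16, old: 7, tie: 1 }
```

Each row of the table should sum to 24, which is a cheap consistency check on the judge output.
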
Deterministic, same subset (V9 post-repair vs OLD):
| Metric | V9 | OLD | Delta |
|---|---|---|---|
| Anchor coverage % | 99% | 80% | +19 ✓ |
| Saves utilisation % | 79% | 72% | +7 ✓ |
| City-lock % | 100% | 100% | 0 ✓ |
| Over-cap days/trip | 0 | 0.33 | -0.33 ✓ |
| Cross-day dups/trip | 0 | 0 | 0 ✓ |
| Wrong-city items/trip | 0 | 0.17 | -0.17 ✓ |
| Gap-fill (invented)/trip | 0.5 | 5.25 | -4.75 ✓ |
| Eatery-as-activity/trip | 0.25 | 1.75 | -1.5 ✓ |
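
The Delta column and its tick can be derived mechanically once each metric is tagged with a direction. The sketch below assumes the tick means "V9 is no worse than OLD"; `MetricRow`, `delta`, and `v9NotWorse` are illustrative names, and the values are copied from the table above.

```typescript
interface MetricRow {
  name: string;
  v9: number;
  old: number;
  higherIsBetter: boolean;
}

// V9 minus OLD, rounded to two decimals to avoid floating-point noise.
function delta(row: MetricRow): number {
  return Math.round((row.v9 - row.old) * 100) / 100;
}

// True when V9 is at least as good as OLD in the metric's own direction.
function v9NotWorse(row: MetricRow): boolean {
  const d = row.v9 - row.old;
  return row.higherIsBetter ? d >= 0 : d <= 0;
}

const subsetMetrics: MetricRow[] = [
  { name: "Anchor coverage %", v9: 99, old: 80, higherIsBetter: true },
  { name: "Saves utilisation %", v9: 79, old: 72, higherIsBetter: true },
  { name: "Over-cap days/trip", v9: 0, old: 0.33, higherIsBetter: false },
  { name: "Gap-fill (invented)/trip", v9: 0.5, old: 5.25, higherIsBetter: false },
  { name: "Eatery-as-activity/trip", v9: 0.25, old: 1.75, higherIsBetter: false },
];

for (const row of subsetMetrics) {
  const verdict = v9NotWorse(row) ? "ok" : "regression";
  console.log(`${row.name}: ${delta(row)} (${verdict})`);
}
```
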
1. The shipped itinerary is safe across every tier. The deterministic safety net (repairCityCoverage + dedup) holds: 100% city-lock, 0 empty days, ~0 over-cap days, ~0 cross-day dups in HEAVY / MEDIUM / LIGHT / SEED alike. Cold-start (0-saves) trips are fully seed-driven and score as well as saves-rich ones (overall 7.54), so the seed corpus carries a new user.
2. V9 beats OLD where it matters, and the one loss is by design. Blind pairwise, V9 wins doability (15-9), clustering (16-8), rhythm (16-7) and overall (15-9); narrative ties. OLD wins only selection (15-4) - because OLD invents famous landmarks freely (5.25 invented places/trip vs V9 0.5), so its plans are studded with recognisable sights the traveller never actually saved. That is exactly the trade V9 makes: collection-fidelity over landmark-stuffing. The cost shows up as a lower "selection" vote; the benefit is that V9 does not over-pack or hallucinate, which is why it wins doability + overall. (OLD also schedules 1.75 saved eateries/trip into the day list as activities - a schema bug V9 does not have.)
3. The soft spot is the MEDIUM tier and dense multi-city routing. The weakest cohort is Medium saves (overall 6.58, doability 6.5). The low-doability outliers (doability of 5.5 or below, drawn from every tier) are: Japan 8d (Tokyo+Hakone day-trip+Kyoto, dense) (3.5); Singapore 4d (Singapore City+Sentosa, family+kids) (3); Meghalaya 5d (Shillong+Cherrapunji+Dawki, THIN, nature) (5); Italy 12d grand (Venice+Florence+Rome+Naples, packed) (5.5); Thailand 9d (Bangkok+Chiang Mai+Phuket+Krabi, dense) (4); Japan 7d (Tokyo+Kyoto+Osaka) (3); Sri Lanka 7d (Colombo+Kandy+Ella+Galle) (5); Singapore 4d (Singapore City+Sentosa Island) (4). Three causes recur: (a) over-allocated single bases, e.g. 2 full days on Sentosa Island, which judges read as padded (partly a trip-shape input, not the prompt); (b) quirky real collections, e.g. a user whose "Tokyo" saves are actually a Mount Nokogiri / Boso Peninsula day-trip cluster, which V9 faithfully sequences into a far, time-heavy day; (c) over-filled transit days, where the day a traveller changes cities sometimes carries a full sightseeing load on top of the move.
4. Two small correctness nits worth a follow-up (not blockers). First, 8/48 trips slipped a single saved eatery into the day list as an activity: the pool excludes food, but a borderline-categorised place leaked through categoryToBucket. Second, 2/48 trips placed items in the wrong city's day (4 items in total: 1 in Italy 12d grand, 3 in Sri Lanka 7d Colombo+Kandy+Ella+Galle); these are caught deterministically as out-of-city, and city-lock is still 100% because every city is present. Separately, the very-dense Thailand case (67-item pool) is the one where anchor coverage dips furthest (79%): the MAX_ITEMS_PER_DAY cap deliberately favours a doable day over surfacing every anchor.
Recommendation: ship V9. It is materially better than the OLD prompt on the dimensions that determine whether a trip is usable (doability, clustering, no over-pack, no invention), and the deterministic floor guarantees a safe itinerary in every tier including cold-start. Optional hardening, in priority order: (i) damp the load on travel_between_cities days; (ii) tighten the food-category exclusion so eateries never reach the day list; (iii) consider a light "is this save geographically central or a day-trip cluster?" signal so quirky collections route to a labelled day-trip rather than a scattered city day.
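
Hardening item (i) could take the form of a lower item cap on transit days. The sketch below is purely illustrative: the cap value, the `TRAVEL_DAY_FACTOR` constant, the `PlannedDay` shape, and the `"full_day"` day type are all assumptions, since this report does not state the real `MAX_ITEMS_PER_DAY` value or the itinerary schema.

```typescript
// Assumed values for illustration only; the real cap is not stated in this report.
const MAX_ITEMS_PER_DAY = 5;
const TRAVEL_DAY_FACTOR = 0.5;

interface PlannedDay {
  dayType: "full_day" | "travel_between_cities";
  items: string[]; // assumed to be ordered by priority, anchors first
}

// Transit days get a reduced cap, but never less than one item.
function capForDay(day: PlannedDay): number {
  if (day.dayType === "travel_between_cities") {
    return Math.max(1, Math.floor(MAX_ITEMS_PER_DAY * TRAVEL_DAY_FACTOR));
  }
  return MAX_ITEMS_PER_DAY;
}

// Returns new day objects with the lowest-priority items trimmed.
function dampTravelDays(days: PlannedDay[]): PlannedDay[] {
  return days.map((day) => ({
    dayType: day.dayType,
    items: day.items.slice(0, capForDay(day)),
  }));
}

const sampleDays: PlannedDay[] = [
  { dayType: "full_day", items: ["p1", "p2", "p3", "p4", "p5", "p6"] },
  { dayType: "travel_between_cities", items: ["q1", "q2", "q3", "q4", "q5"] },
];

const dampedDays = dampTravelDays(sampleDays);
console.log(dampedDays.map((d) => d.items.length)); // [ 5, 2 ]
```

Trimming from the tail only makes sense if items are already priority-ordered; if they are not, the pass would need an explicit ranking step first.
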
A V9 scenario is flagged if it has a sub-7 doability or overall score, a judge-counted hard problem, or a deterministic blemish (over-cap / wrong-city / dup / eatery-as-activity / anchor-coverage gap). Most flags are a single discerning-judge point or one eatery leak, not a failure.
| Scenario | Tier | Flags |
|---|---|---|
| Thailand 7d (dense+hearted, 3 cities) | HEAVY | anchor cov 93%; doability 6; judge: 1 overpacked |
| Japan 5d (dense+thin topup) | MEDIUM | 1 eatery-as-activity |
| India/Rajasthan 7d (all-iconic, no hearts) | SEED | doability 6.5; overall 6.5 |
| Japan 8d (Tokyo+Hakone day-trip+Kyoto, dense) | MEDIUM | doability 3.5; overall 3.5; judge: 2 overpacked |
| Japan 10d (Tokyo+Kyoto+Osaka, very dense, rhythm) | HEAVY | 1 eatery-as-activity |
| Italy 9d (Rome+Lake Como+Dolomites+Positano) | HEAVY | anchor cov 96% |
| Sri Lanka 8d (Kandy+Ella+Nuwara Eliya+Mirissa) | MEDIUM | doability 6; overall 6.5 |
| Bali 6d (Ubud+Canggu+Uluwatu, chill, temple day-trip) | HEAVY | 1 eatery-as-activity |
| Dubai 4d (single-city, family+kids, luxury) | HEAVY | 1 eatery-as-activity |
| Singapore 4d (Singapore City+Sentosa, family+kids) | MEDIUM | doability 3; overall 3 |
| Kerala 6d (Kochi+Munnar, domestic, chill, family) | SEED | 1 eatery-as-activity |
| Meghalaya 5d (Shillong+Cherrapunji+Dawki, THIN, nature) | SEED | doability 5; overall 6; judge: 1 overpacked |
| Italy 12d grand (Venice+Florence+Rome+Naples, packed) | HEAVY | 1 wrong-city item(s); 1 eatery-as-activity; doability 5.5; overall 6; judge: 1.5 overpacked |
| Thailand 9d (Bangkok+Chiang Mai+Phuket+Krabi, dense) | HEAVY | anchor cov 79%; 1 eatery-as-activity; doability 4; overall 5; judge: 2.5 overpacked |
| Indonesia 7d Bali (hearted-heavy, chill) | HEAVY | doability 6.5; judge: 1 overpacked |
| Thailand 8d (Bangkok+Koh Samui+Krabi, beach-heavy) | HEAVY | 1 cross-day dup(s) |
| Japan 7d (Tokyo+Kyoto+Osaka) | MEDIUM | 1 eatery-as-activity; doability 3; overall 4; judge: 1 overpacked |
| Sri Lanka 7d (Colombo+Kandy+Ella+Galle) | MEDIUM | 3 wrong-city item(s); doability 5; overall 5; judge: 1.5 overpacked |
| Singapore 4d (Singapore City+Sentosa Island) | LIGHT | doability 4; overall 3.5 |
| France 8d (Paris+Loire Valley+Provence+Marseille, 0 saves) | SEED | judge: 1.5 overpacked |
| Greece 7d (Athens+Santorini+Meteora, 0 saves) | SEED | 1 over-cap day(s); doability 6; overall 6.5; judge: 1 overpacked |
| Vietnam 8d (Hanoi+Ninh Binh+Hoi An+Ho Chi Minh City, 0 saves) | SEED | overall 6.5 |
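
The flagging rule behind the table above can be expressed as a small pure function. This is a sketch under assumptions: the `ScenarioResult` shape and `flagsFor` are hypothetical names, the order of the flags is arbitrary, and the judge threshold of 1 is a guess based on the smallest value that appears in the table.

```typescript
interface ScenarioResult {
  doability: number;
  overall: number;
  anchorCoveragePct: number;
  overCapDays: number;
  wrongCityItems: number;
  crossDayDups: number;
  eateryAsActivity: number;
  judgeOverpacked: number; // mean "overpacked" count across judges
}

// Returns human-readable flags; an empty list means the scenario is clean.
function flagsFor(s: ScenarioResult, judgeThreshold: number = 1): string[] {
  const flags: string[] = [];
  if (s.wrongCityItems > 0) flags.push(`${s.wrongCityItems} wrong-city item(s)`);
  if (s.crossDayDups > 0) flags.push(`${s.crossDayDups} cross-day dup(s)`);
  if (s.overCapDays > 0) flags.push(`${s.overCapDays} over-cap day(s)`);
  if (s.anchorCoveragePct < 100) flags.push(`anchor cov ${s.anchorCoveragePct}%`);
  if (s.eateryAsActivity > 0) flags.push(`${s.eateryAsActivity} eatery-as-activity`);
  if (s.doability < 7) flags.push(`doability ${s.doability}`);
  if (s.overall < 7) flags.push(`overall ${s.overall}`);
  if (s.judgeOverpacked >= judgeThreshold) {
    flags.push(`judge: ${s.judgeOverpacked} overpacked`);
  }
  return flags;
}

// Values for Thailand 9d, taken from the flag and per-scenario tables.
const thailand9d: ScenarioResult = {
  doability: 4,
  overall: 5,
  anchorCoveragePct: 79,
  overCapDays: 0,
  wrongCityItems: 0,
  crossDayDups: 0,
  eateryAsActivity: 1,
  judgeOverpacked: 2.5,
};

console.log(flagsFor(thailand9d).join("; "));
// anchor cov 79%; 1 eatery-as-activity; doability 4; overall 5; judge: 2.5 overpacked
```
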
| Scenario | Tier | Pool | Doab | Overall | AnchCov | Lock | Over | Dup | OOC | Empty |
|---|---|---|---|---|---|---|---|---|---|---|
| Thailand 7d (dense+hearted, 3 cities) | HEAVY | 45 | 6 | 7 | 93% | 100% | 0 | 0 | 0 | 0 |
| Italy 6d (thin->topup, hearted spread) | HEAVY | 16 | 8 | 7.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| Japan 5d (dense+thin topup) | MEDIUM | 22 | 8 | 8 | 100% | 100% | 0 | 0 | 0 | 0 |
| India/Rajasthan 7d (all-iconic, no hearts) | SEED | 20 | 6.5 | 6.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| Japan 8d (Tokyo+Hakone day-trip+Kyoto, dense) | MEDIUM | 18 | 3.5 | 3.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| Japan 10d (Tokyo+Kyoto+Osaka, very dense, rhythm) | HEAVY | 47 | 8 | 8 | 100% | 100% | 0 | 0 | 0 | 0 |
| UK 5d (London single-city dense, solo, packed) | HEAVY | 35 | 8.5 | 8.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| Italy 9d (Rome+Lake Como+Dolomites+Positano) | HEAVY | 44 | 7 | 7 | 96% | 100% | 0 | 0 | 0 | 0 |
| Sri Lanka 8d (Kandy+Ella+Nuwara Eliya+Mirissa) | MEDIUM | 24 | 6 | 6.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| Bali 6d (Ubud+Canggu+Uluwatu, chill, temple day-trip) | HEAVY | 15 | 9 | 8.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| Vietnam 9d (Hanoi+Hoi An+HCMC, THIN->topup) | MEDIUM | 36 | 9 | 9 | 100% | 100% | 0 | 0 | 0 | 0 |
| Dubai 4d (single-city, family+kids, luxury) | HEAVY | 11 | 8 | 8 | 100% | 100% | 0 | 0 | 0 | 0 |
| Singapore 4d (Singapore City+Sentosa, family+kids) | MEDIUM | 16 | 3 | 3 | 100% | 100% | 0 | 0 | 0 | 0 |
| Egypt 7d (Cairo+Luxor, heritage) | SEED | 22 | 9 | 9 | 100% | 100% | 0 | 0 | 0 | 0 |
| Peru 8d (Lima+Cusco, altitude rest, MP day-trips) | SEED | 16 | 8.5 | 8 | 100% | 100% | 0 | 0 | 0 | 0 |
| Kerala 6d (Kochi+Munnar, domestic, chill, family) | SEED | 24 | 9 | 9 | 100% | 100% | 0 | 0 | 0 | 0 |
| Meghalaya 5d (Shillong+Cherrapunji+Dawki, THIN, nature) | SEED | 18 | 5 | 6 | 100% | 100% | 0 | 0 | 0 | 0 |
| Ladakh 7d (Leh single-base, altitude, day-trips) | SEED | 28 | 8.5 | 7.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| South Korea 6d (Seoul+Busan, urban, packed) | SEED | 12 | 8 | 8 | 100% | 100% | 0 | 0 | 0 | 0 |
| Goa 3d (Panaji+Arpora, short weekend, chill) | SEED | 12 | 8 | 7.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| Italy 12d grand (Venice+Florence+Rome+Naples, packed) | HEAVY | 38 | 5.5 | 6 | 100% | 100% | 0 | 0 | 1 | 0 |
| Thailand 9d (Bangkok+Chiang Mai+Phuket+Krabi, dense) | HEAVY | 67 | 4 | 5 | 79% | 100% | 0 | 0 | 0 | 0 |
| Italy 7d (Rome+Florence+Venice, 20-ctry saver) | HEAVY | 25 | 9 | 9 | 100% | 100% | 0 | 0 | 0 | 0 |
| Indonesia 7d Bali (hearted-heavy, chill) | HEAVY | 28 | 6.5 | 7 | 100% | 100% | 0 | 0 | 0 | 0 |
| Thailand 8d (Bangkok+Koh Samui+Krabi, beach-heavy) | HEAVY | 35 | 7.5 | 7.5 | 100% | 100% | 0 | 1 | 0 | 0 |
| Japan 7d (Tokyo+Kyoto+Osaka) | MEDIUM | 24 | 3 | 4 | 100% | 100% | 0 | 0 | 0 | 0 |
| Thailand 6d (Bangkok+Chiang Mai) | MEDIUM | 24 | 8.5 | 8.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| Indonesia 7d Bali (14-ctry saver) | MEDIUM | 25 | 8 | 8 | 100% | 100% | 0 | 0 | 0 | 0 |
| Japan 6d (Tokyo+Osaka, packed) | MEDIUM | 23 | 7.5 | 7.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| Sri Lanka 7d (Colombo+Kandy+Ella+Galle) | MEDIUM | 29 | 5 | 5 | 100% | 100% | 0 | 0 | 3 | 0 |
| Japan 6d (Tokyo+Kyoto, single-country saver, packed) | MEDIUM | 16 | 8.5 | 8 | 100% | 100% | 0 | 0 | 0 | 0 |
| Sri Lanka 7d (Sigiriya+Kandy+Ella+Mirissa) | MEDIUM | 24 | 8 | 8 | 100% | 100% | 0 | 0 | 0 | 0 |
| Thailand 5d (Bangkok+Phuket) | LIGHT | 20 | 7 | 7 | 100% | 100% | 0 | 0 | 0 | 0 |
| India/Rajasthan 5d (Jaipur+Udaipur) | LIGHT | 20 | 7.5 | 7.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| India/Kerala 6d (Kochi+Munnar+Alleppey, family) | LIGHT | 24 | 7.5 | 7.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| India/Goa 3d (Panaji+Arpora, short) | LIGHT | 12 | 8 | 7 | 100% | 100% | 0 | 0 | 0 | 0 |
| Maldives 5d (Male+Maafushi+Ari Atoll, chill) | LIGHT | 20 | 9 | 9 | 100% | 100% | 0 | 0 | 0 | 0 |
| Thailand 5d (Bangkok+Koh Samui) | LIGHT | 21 | 9 | 9 | 100% | 100% | 0 | 0 | 0 | 0 |
| Singapore 4d (Singapore City+Sentosa Island) | LIGHT | 16 | 4 | 3.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| Japan 5d (Tokyo+Osaka) | LIGHT | 20 | 8.5 | 8 | 100% | 100% | 0 | 0 | 0 | 0 |
| Thailand 4d (Bangkok+Krabi, short) | LIGHT | 16 | 8 | 8 | 100% | 100% | 0 | 0 | 0 | 0 |
| Maldives 4d (Male+Maafushi, honeymoon) | LIGHT | 16 | 9 | 8 | 100% | 100% | 0 | 0 | 0 | 0 |
| India/Himachal 5d (Manali single-base, day-trips) | LIGHT | 20 | 9 | 9 | 100% | 100% | 0 | 0 | 0 | 0 |
| India/Rajasthan 4d (Jaipur+Udaipur, very thin) | LIGHT | 16 | 8 | 8 | 100% | 100% | 0 | 0 | 0 | 0 |
| France 8d (Paris+Loire Valley+Provence+Marseille, 0 saves) | SEED | 32 | 7 | 7.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| Greece 7d (Athens+Santorini+Meteora, 0 saves) | SEED | 28 | 6 | 6.5 | 100% | 100% | 1 | 0 | 0 | 0 |
| Turkey 5d (Istanbul+Cappadocia, 0 saves, packed) | SEED | 20 | 8.5 | 8.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| Vietnam 8d (Hanoi+Ninh Binh+Hoi An+Ho Chi Minh City, 0 saves) | SEED | 32 | 7 | 6.5 | 100% | 100% | 0 | 0 | 0 | 0 |
*Generated 2026-06-25. Tooling: workers/api/scripts/bench/bench48-*. Prompt path validated byte-identical to the benchmarked V9 (parity-check.ts).*