V9 itinerary-generation quality benchmark (2026-06-25)

Verdict

22/48 scenarios carry a flag (mostly a single discerning-judge point or one eatery leak). No tier collapses; the soft spot is dense multi-city MEDIUM trips - see Analysis.

Method

Opus panel — subjective quality (1-10, avg of 2 blind judges)

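| Cohort | n | Doability | Rhythm | Clustering | Narrative | Selection | Overall |
|---|---|---|---|---|---|---|---|
| V9 - all | 48 | 7.3 | 7.58 | 6.99 | 8.47 | 7.56 | 7.29 |
| Heavy saves | 12 | 7.25 | 7.67 | 6.96 | 8.58 | 7.79 | 7.42 |
| Medium saves | 12 | 6.5 | 7.13 | 6.21 | 8.25 | 7.13 | 6.58 |
| Light saves | 12 | 7.88 | 7.79 | 7.25 | 8.38 | 7.58 | 7.63 |
| Cold-start / seed | 12 | 7.58 | 7.75 | 7.54 | 8.67 | 7.75 | 7.54 |
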
Doability is weighted highest in the rubric (realistic daily load + feasible inter-stop travel + single-zone clustering). Overall V9 doability 7.3/10.

Deterministic objective scores — the SHIPPED itinerary (9 dims)

For the percentage columns, higher is better; for the average-per-trip counts (over-cap / empty / dup / wrong-city / gap-fill / eatery), lower is better.

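| Cohort | n | Anchor cov% | Saves util% | City-lock% | Over-cap/trip | Empty/trip | Cross-day dup | Wrong-city | Gap-fill | Eatery-as-act |
|---|---|---|---|---|---|---|---|---|---|---|
| All 48 | 48 | 99% | 87% | 100% | 0.02 | 0 | 0.02 | 0.08 | 0.58 | 0.17 |
| HEAVY | 12 | 97% | 79% | 100% | 0 | 0 | 0.08 | 0.08 | 1.5 | 0.42 |
| MEDIUM | 12 | 100% | 91% | 100% | 0 | 0 | 0 | 0.25 | 0.33 | 0.17 |
| LIGHT | 12 | 100% | 83% | 100% | 0 | 0 | 0 | 0 | 0.08 | 0 |
| SEED | 12 | 100% | 93% | 100% | 0.08 | 0 | 0 | 0 | 0.42 | 0.08 |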
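
The per-trip columns read as plain averages: a cohort's total count of a blemish divided by its number of trips. A minimal sketch that reproduces three cells from the counts in the flagged-scenario list; the helper name `perTrip` is illustrative and is not taken from the bench tooling, and the 2-decimal rounding is an assumption based on how the table is printed.

```typescript
// Hypothetical helper: average count per trip, rounded to 2 decimals as in the table.
function perTrip(totalCount: number, trips: number): number {
  if (trips <= 0) throw new Error("trips must be positive");
  return Math.round((totalCount / trips) * 100) / 100;
}

// Counts taken from the flagged-scenario list in this report.
const heavyEatery = perTrip(5, 12);  // 5 HEAVY trips with one eatery leak each
const allEatery = perTrip(8, 48);    // 8 eatery leaks across all 48 trips
const allWrongCity = perTrip(4, 48); // 1 + 3 wrong-city items across all 48 trips
```

These evaluate to 0.42, 0.17 and 0.08, matching the HEAVY and All 48 rows.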

V9 vs OLD names-only prompt (12-scenario subset, blind)

Pairwise (which itinerary is better, judges blind to source):

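| Dimension | V9 wins | OLD wins | Tie |
|---|---|---|---|
| Doability | 15 | 9 | 0 |
| Rhythm | 16 | 7 | 1 |
| Clustering | 16 | 8 | 0 |
| Narrative | 8 | 8 | 8 |
| Selection | 4 | 15 | 5 |
| Overall | 15 | 9 | 0 |
| Overall winner | 15 | 9 | 0 |
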
n = 24 pairwise comparisons (12 scenarios x 2 blind judges).
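
As a sanity check on the table, each row should sum to the 24 comparisons. A minimal sketch of the tally, assuming each blind judgement is recorded as one of three labels; the `Vote` type and the `tally` and `repeat` helpers are hypothetical names, not the benchmark's own code.

```typescript
type Vote = "V9" | "OLD" | "TIE";

interface Tally { v9: number; old: number; tie: number; }

// Count blind pairwise votes for one dimension.
function tally(votes: Vote[]): Tally {
  const result: Tally = { v9: 0, old: 0, tie: 0 };
  for (const vote of votes) {
    if (vote === "V9") result.v9 += 1;
    else if (vote === "OLD") result.old += 1;
    else result.tie += 1;
  }
  return result;
}

// Small helper to build a run of identical votes.
function repeat(vote: Vote, times: number): Vote[] {
  const out: Vote[] = [];
  for (let i = 0; i < times; i++) out.push(vote);
  return out;
}

// Rebuild the Rhythm row (16 / 7 / 1) from its counts; the vote order is arbitrary here.
const rhythmVotes: Vote[] = repeat("V9", 16).concat(repeat("OLD", 7), repeat("TIE", 1));
const rhythm = tally(rhythmVotes);
const comparisons = 12 * 2; // 12 scenarios x 2 blind judges
```

Here `rhythm.v9 + rhythm.old + rhythm.tie` equals `comparisons`, which is the invariant every row of the table satisfies.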

Deterministic, same subset (V9 post-repair vs OLD):

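| Metric | V9 | OLD | Delta |
|---|---|---|---|
| Anchor coverage % | 99% | 80% | +19 |
| Saves utilisation % | 79% | 72% | +7 |
| City-lock % | 100% | 100% | 0 |
| Over-cap days/trip | 0 | 0.33 | -0.33 |
| Cross-day dups/trip | 0 | 0 | 0 |
| Wrong-city items/trip | 0 | 0.17 | -0.17 |
| Gap-fill (invented)/trip | 0.5 | 5.25 | -4.75 |
| Eatery-as-activity/trip | 0.25 | 1.75 | -1.5 |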

Analysis & recommendations

1. The shipped itinerary is safe across every tier. The deterministic safety net (repairCityCoverage + dedup) holds: 100% city-lock, 0 empty days, ~0 over-cap days, ~0 cross-day dups in HEAVY / MEDIUM / LIGHT / SEED alike. Cold-start (0-saves) trips are fully seed-driven and score as well as saves-rich ones (overall 7.54), so the seed corpus carries a new user.

2. V9 beats OLD where it matters, and the one loss is by design. In the blind pairwise comparison, V9 wins doability (15-9), clustering (16-8), rhythm (16-7, 1 tie) and overall (15-9); narrative is an even split (8-8, 8 ties). OLD wins only selection (15-4, 5 ties), because OLD invents famous landmarks freely (5.25 invented places/trip vs 0.5 for V9), so its plans are studded with recognisable sights the traveller never actually saved. That is exactly the trade V9 makes: collection-fidelity over landmark-stuffing. The cost shows up as a lower "selection" vote; the benefit is that V9 does not over-pack (0 over-cap days/trip on this subset) and rarely invents places, which is why it wins doability and overall. (OLD also schedules 1.75 saved eateries/trip into the day list as activities, a schema bug that V9 largely avoids but has not eliminated: 0.25/trip on the same subset.)

3. The soft spot is the MEDIUM tier and dense multi-city routing. The weakest cohort is Medium saves (overall 6.58, doability 6.5). The low-doability outliers (doability below 6) span all four tiers, with half of them in MEDIUM: Japan 8d (Tokyo+Hakone day-trip+Kyoto, dense) (MEDIUM, 3.5); Singapore 4d (Singapore City+Sentosa, family+kids) (MEDIUM, 3); Meghalaya 5d (Shillong+Cherrapunji+Dawki, THIN, nature) (SEED, 5); Italy 12d grand (Venice+Florence+Rome+Naples, packed) (HEAVY, 5.5); Thailand 9d (Bangkok+Chiang Mai+Phuket+Krabi, dense) (HEAVY, 4); Japan 7d (Tokyo+Kyoto+Osaka) (MEDIUM, 3); Sri Lanka 7d (Colombo+Kandy+Ella+Galle) (MEDIUM, 5); Singapore 4d (Singapore City+Sentosa Island) (LIGHT, 4). Three causes recur: (a) over-allocated single bases, e.g. 2 full days on Sentosa Island, which judges read as padded; this is partly a trip-shape input rather than a prompt problem; (b) quirky real collections, e.g. a user whose "Tokyo" saves are actually a Mount Nokogiri / Boso Peninsula day-trip cluster, which V9 faithfully sequences into a far, time-heavy day; (c) over-filled transit days, where the day a traveller changes cities sometimes carries a full sightseeing load on top of the move.

4. Two small correctness nits worth a follow-up (not blockers): 8/48 trips slipped a single saved eatery into the day list as an activity (the pool excludes food, but a borderline-categorised place leaked through categoryToBucket); and 2/48 placed items in the wrong city's day (1 item in Italy 12d, 3 in Sri Lanka 7d; caught deterministically as out-of-city, while city-lock stays at 100% because every city is still present). The very-dense Thailand 9d case (67-item pool) is where anchor coverage dips furthest (79%): the MAX_ITEMS_PER_DAY cap deliberately favours a doable day over surfacing every anchor.
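
Point 4 relies on city-lock and wrong-city being separate measurements, which is why a trip can score 100% on one and still trip the other. The sketch below shows one way the two could be computed. It assumes city-lock means "every requested city appears as some day's base"; the `PlannedDay` and `PlannedItem` shapes and both function names are hypothetical, since the report does not show the real itinerary schema.

```typescript
// Hypothetical shapes; the real itinerary schema is not shown in this report.
interface PlannedItem { name: string; city: string; }
interface PlannedDay { city: string; items: PlannedItem[]; }

// Assumed city-lock definition: share of requested cities that appear as a day's base.
function cityLockPct(days: PlannedDay[], requestedCities: string[]): number {
  if (requestedCities.length === 0) return 100;
  const present = requestedCities.filter(c => days.some(d => d.city === c));
  return Math.round((present.length / requestedCities.length) * 100);
}

// Wrong-city count: items scheduled on a day whose base city differs from their own.
function wrongCityCount(days: PlannedDay[]): number {
  let count = 0;
  for (const day of days) {
    for (const item of day.items) {
      if (item.city !== day.city) count += 1;
    }
  }
  return count;
}

// Illustrative data only (not one of the benchmark scenarios).
const sampleDays: PlannedDay[] = [
  { city: "Kandy", items: [{ name: "Temple of the Tooth", city: "Kandy" }] },
  { city: "Ella", items: [
    { name: "Nine Arch Bridge", city: "Ella" },
    { name: "Galle Fort", city: "Galle" }, // out-of-city item on an Ella day
  ] },
  { city: "Galle", items: [{ name: "Galle Lighthouse", city: "Galle" }] },
];

const sampleLock = cityLockPct(sampleDays, ["Kandy", "Ella", "Galle"]);
const sampleWrongCity = wrongCityCount(sampleDays);
```

With this data `sampleLock` is 100 while `sampleWrongCity` is 1: all three cities are present, yet one item sits on the wrong city's day.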

Recommendation: ship V9. It is materially better than the OLD prompt on the dimensions that determine whether a trip is usable (doability, clustering, no over-pack, no invention), and the deterministic floor guarantees a safe itinerary in every tier including cold-start. Optional hardening, in priority order: (i) damp the load on travel_between_cities days; (ii) tighten the food-category exclusion so eateries never reach the day list; (iii) consider a light "is this save geographically central or a day-trip cluster?" signal so quirky collections route to a labelled day-trip rather than a scattered city day.
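
Hardening item (i) could take the form of a reduced item cap on transit days. The sketch below is illustrative only: the report does not state the real value of MAX_ITEMS_PER_DAY or any damping factor, so both constants are assumed, `travel_between_cities` is modelled as a boolean on a hypothetical `DayPlan` shape, and `capForDay` and `trimDay` are invented names.

```typescript
// Assumed values for illustration; the real cap and any damping factor are not given in the report.
const MAX_ITEMS_PER_DAY = 5;
const TRANSIT_LOAD_FACTOR = 0.5;

// Hypothetical day shape; `travel_between_cities` mirrors the name used above.
interface DayPlan { travel_between_cities: boolean; items: string[]; }

// Full cap on ordinary days, a damped cap (never below 1) on transit days.
function capForDay(day: DayPlan): number {
  if (!day.travel_between_cities) return MAX_ITEMS_PER_DAY;
  return Math.max(1, Math.floor(MAX_ITEMS_PER_DAY * TRANSIT_LOAD_FACTOR));
}

// Keep the first `cap` items (assumed to be ordered by priority); return the rest as overflow.
function trimDay(day: DayPlan): { kept: string[]; overflow: string[] } {
  const cap = capForDay(day);
  return { kept: day.items.slice(0, cap), overflow: day.items.slice(cap) };
}
```

Returning the overflow rather than dropping it leaves room for a later pass to move those items to a neighbouring non-transit day, which would protect anchor coverage.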

Flagged scenarios

Any V9 scenario with a sub-7 doability/overall, a judge-counted hard problem, or a deterministic blemish (over-cap / wrong-city / dup / eatery-as-activity / anchor-coverage gap). Most flags are a single discerning-judge point or one eatery leak, not a failure.
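
The flag rule above can be sketched as a single function. The record shape and field names are hypothetical, and the order of the emitted flags is an assumption chosen to match how the flags read in the table; only the thresholds (sub-7 scores, any judge-counted problem, any deterministic blemish) come from the description.

```typescript
// Hypothetical per-scenario record; field names are illustrative.
interface ScenarioResult {
  doability: number;
  overall: number;
  judgeOverpacked: number; // judge-counted "overpacked" problems, as reported per scenario
  anchorCovPct: number;
  overCapDays: number;
  wrongCityItems: number;
  crossDayDups: number;
  eateryAsActivity: number;
}

function flagsFor(r: ScenarioResult): string[] {
  const flags: string[] = [];
  // Deterministic blemishes first.
  if (r.anchorCovPct < 100) flags.push(`anchor cov ${r.anchorCovPct}%`);
  if (r.overCapDays > 0) flags.push(`${r.overCapDays} over-cap day(s)`);
  if (r.wrongCityItems > 0) flags.push(`${r.wrongCityItems} wrong-city item(s)`);
  if (r.crossDayDups > 0) flags.push(`${r.crossDayDups} cross-day dup(s)`);
  if (r.eateryAsActivity > 0) flags.push(`${r.eateryAsActivity} eatery-as-activity`);
  // Then sub-7 judge scores and judge-counted hard problems.
  if (r.doability < 7) flags.push(`doability ${r.doability}`);
  if (r.overall < 7) flags.push(`overall ${r.overall}`);
  if (r.judgeOverpacked > 0) flags.push(`judge: ${r.judgeOverpacked} overpacked`);
  return flags;
}

// Thailand 9d (Bangkok+Chiang Mai+Phuket+Krabi, dense), using the values from this report.
const thailand9dFlags = flagsFor({
  doability: 4, overall: 5, judgeOverpacked: 2.5, anchorCovPct: 79,
  overCapDays: 0, wrongCityItems: 0, crossDayDups: 0, eateryAsActivity: 1,
}).join("; ");
```

For that scenario the joined string is `anchor cov 79%; 1 eatery-as-activity; doability 4; overall 5; judge: 2.5 overpacked`, and a scenario with no blemishes and scores of 7 or above yields an empty list.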

| Scenario | Tier | Flags |
|---|---|---|
| Thailand 7d (dense+hearted, 3 cities) | HEAVY | anchor cov 93%; doability 6; judge: 1 overpacked |
| Japan 5d (dense+thin topup) | MEDIUM | 1 eatery-as-activity |
| India/Rajasthan 7d (all-iconic, no hearts) | SEED | doability 6.5; overall 6.5 |
| Japan 8d (Tokyo+Hakone day-trip+Kyoto, dense) | MEDIUM | doability 3.5; overall 3.5; judge: 2 overpacked |
| Japan 10d (Tokyo+Kyoto+Osaka, very dense, rhythm) | HEAVY | 1 eatery-as-activity |
| Italy 9d (Rome+Lake Como+Dolomites+Positano) | HEAVY | anchor cov 96% |
| Sri Lanka 8d (Kandy+Ella+Nuwara Eliya+Mirissa) | MEDIUM | doability 6; overall 6.5 |
| Bali 6d (Ubud+Canggu+Uluwatu, chill, temple day-trip) | HEAVY | 1 eatery-as-activity |
| Dubai 4d (single-city, family+kids, luxury) | HEAVY | 1 eatery-as-activity |
| Singapore 4d (Singapore City+Sentosa, family+kids) | MEDIUM | doability 3; overall 3 |
| Kerala 6d (Kochi+Munnar, domestic, chill, family) | SEED | 1 eatery-as-activity |
| Meghalaya 5d (Shillong+Cherrapunji+Dawki, THIN, nature) | SEED | doability 5; overall 6; judge: 1 overpacked |
| Italy 12d grand (Venice+Florence+Rome+Naples, packed) | HEAVY | 1 wrong-city item(s); 1 eatery-as-activity; doability 5.5; overall 6; judge: 1.5 overpacked |
| Thailand 9d (Bangkok+Chiang Mai+Phuket+Krabi, dense) | HEAVY | anchor cov 79%; 1 eatery-as-activity; doability 4; overall 5; judge: 2.5 overpacked |
| Indonesia 7d Bali (hearted-heavy, chill) | HEAVY | doability 6.5; judge: 1 overpacked |
| Thailand 8d (Bangkok+Koh Samui+Krabi, beach-heavy) | HEAVY | 1 cross-day dup(s) |
| Japan 7d (Tokyo+Kyoto+Osaka) | MEDIUM | 1 eatery-as-activity; doability 3; overall 4; judge: 1 overpacked |
| Sri Lanka 7d (Colombo+Kandy+Ella+Galle) | MEDIUM | 3 wrong-city item(s); doability 5; overall 5; judge: 1.5 overpacked |
| Singapore 4d (Singapore City+Sentosa Island) | LIGHT | doability 4; overall 3.5 |
| France 8d (Paris+Loire Valley+Provence+Marseille, 0 saves) | SEED | judge: 1.5 overpacked |
| Greece 7d (Athens+Santorini+Meteora, 0 saves) | SEED | 1 over-cap day(s); doability 6; overall 6.5; judge: 1 overpacked |
| Vietnam 8d (Hanoi+Ninh Binh+Hoi An+Ho Chi Minh City, 0 saves) | SEED | overall 6.5 |

Per-scenario detail (all 48, V9)

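| Scenario | Tier | Pool | Doability | Overall | Anchor cov% | City-lock% | Over-cap days | Cross-day dups | Out-of-city items | Empty days |
|---|---|---|---|---|---|---|---|---|---|---|
| Thailand 7d (dense+hearted, 3 cities) | HEAVY | 45 | 6 | 7 | 93% | 100% | 0 | 0 | 0 | 0 |
| Italy 6d (thin->topup, hearted spread) | HEAVY | 16 | 8 | 7.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| Japan 5d (dense+thin topup) | MEDIUM | 22 | 8 | 8 | 100% | 100% | 0 | 0 | 0 | 0 |
| India/Rajasthan 7d (all-iconic, no hearts) | SEED | 20 | 6.5 | 6.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| Japan 8d (Tokyo+Hakone day-trip+Kyoto, dense) | MEDIUM | 18 | 3.5 | 3.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| Japan 10d (Tokyo+Kyoto+Osaka, very dense, rhythm) | HEAVY | 47 | 8 | 8 | 100% | 100% | 0 | 0 | 0 | 0 |
| UK 5d (London single-city dense, solo, packed) | HEAVY | 35 | 8.5 | 8.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| Italy 9d (Rome+Lake Como+Dolomites+Positano) | HEAVY | 44 | 7 | 7 | 96% | 100% | 0 | 0 | 0 | 0 |
| Sri Lanka 8d (Kandy+Ella+Nuwara Eliya+Mirissa) | MEDIUM | 24 | 6 | 6.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| Bali 6d (Ubud+Canggu+Uluwatu, chill, temple day-trip) | HEAVY | 15 | 9 | 8.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| Vietnam 9d (Hanoi+Hoi An+HCMC, THIN->topup) | MEDIUM | 36 | 9 | 9 | 100% | 100% | 0 | 0 | 0 | 0 |
| Dubai 4d (single-city, family+kids, luxury) | HEAVY | 11 | 8 | 8 | 100% | 100% | 0 | 0 | 0 | 0 |
| Singapore 4d (Singapore City+Sentosa, family+kids) | MEDIUM | 16 | 3 | 3 | 100% | 100% | 0 | 0 | 0 | 0 |
| Egypt 7d (Cairo+Luxor, heritage) | SEED | 22 | 9 | 9 | 100% | 100% | 0 | 0 | 0 | 0 |
| Peru 8d (Lima+Cusco, altitude rest, MP day-trips) | SEED | 16 | 8.5 | 8 | 100% | 100% | 0 | 0 | 0 | 0 |
| Kerala 6d (Kochi+Munnar, domestic, chill, family) | SEED | 24 | 9 | 9 | 100% | 100% | 0 | 0 | 0 | 0 |
| Meghalaya 5d (Shillong+Cherrapunji+Dawki, THIN, nature) | SEED | 18 | 5 | 6 | 100% | 100% | 0 | 0 | 0 | 0 |
| Ladakh 7d (Leh single-base, altitude, day-trips) | SEED | 28 | 8.5 | 7.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| South Korea 6d (Seoul+Busan, urban, packed) | SEED | 12 | 8 | 8 | 100% | 100% | 0 | 0 | 0 | 0 |
| Goa 3d (Panaji+Arpora, short weekend, chill) | SEED | 12 | 8 | 7.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| Italy 12d grand (Venice+Florence+Rome+Naples, packed) | HEAVY | 38 | 5.5 | 6 | 100% | 100% | 0 | 0 | 1 | 0 |
| Thailand 9d (Bangkok+Chiang Mai+Phuket+Krabi, dense) | HEAVY | 67 | 4 | 5 | 79% | 100% | 0 | 0 | 0 | 0 |
| Italy 7d (Rome+Florence+Venice, 20-ctry saver) | HEAVY | 25 | 9 | 9 | 100% | 100% | 0 | 0 | 0 | 0 |
| Indonesia 7d Bali (hearted-heavy, chill) | HEAVY | 28 | 6.5 | 7 | 100% | 100% | 0 | 0 | 0 | 0 |
| Thailand 8d (Bangkok+Koh Samui+Krabi, beach-heavy) | HEAVY | 35 | 7.5 | 7.5 | 100% | 100% | 0 | 1 | 0 | 0 |
| Japan 7d (Tokyo+Kyoto+Osaka) | MEDIUM | 24 | 3 | 4 | 100% | 100% | 0 | 0 | 0 | 0 |
| Thailand 6d (Bangkok+Chiang Mai) | MEDIUM | 24 | 8.5 | 8.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| Indonesia 7d Bali (14-ctry saver) | MEDIUM | 25 | 8 | 8 | 100% | 100% | 0 | 0 | 0 | 0 |
| Japan 6d (Tokyo+Osaka, packed) | MEDIUM | 23 | 7.5 | 7.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| Sri Lanka 7d (Colombo+Kandy+Ella+Galle) | MEDIUM | 29 | 5 | 5 | 100% | 100% | 0 | 0 | 3 | 0 |
| Japan 6d (Tokyo+Kyoto, single-country saver, packed) | MEDIUM | 16 | 8.5 | 8 | 100% | 100% | 0 | 0 | 0 | 0 |
| Sri Lanka 7d (Sigiriya+Kandy+Ella+Mirissa) | MEDIUM | 24 | 8 | 8 | 100% | 100% | 0 | 0 | 0 | 0 |
| Thailand 5d (Bangkok+Phuket) | LIGHT | 20 | 7 | 7 | 100% | 100% | 0 | 0 | 0 | 0 |
| India/Rajasthan 5d (Jaipur+Udaipur) | LIGHT | 20 | 7.5 | 7.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| India/Kerala 6d (Kochi+Munnar+Alleppey, family) | LIGHT | 24 | 7.5 | 7.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| India/Goa 3d (Panaji+Arpora, short) | LIGHT | 12 | 8 | 7 | 100% | 100% | 0 | 0 | 0 | 0 |
| Maldives 5d (Male+Maafushi+Ari Atoll, chill) | LIGHT | 20 | 9 | 9 | 100% | 100% | 0 | 0 | 0 | 0 |
| Thailand 5d (Bangkok+Koh Samui) | LIGHT | 21 | 9 | 9 | 100% | 100% | 0 | 0 | 0 | 0 |
| Singapore 4d (Singapore City+Sentosa Island) | LIGHT | 16 | 4 | 3.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| Japan 5d (Tokyo+Osaka) | LIGHT | 20 | 8.5 | 8 | 100% | 100% | 0 | 0 | 0 | 0 |
| Thailand 4d (Bangkok+Krabi, short) | LIGHT | 16 | 8 | 8 | 100% | 100% | 0 | 0 | 0 | 0 |
| Maldives 4d (Male+Maafushi, honeymoon) | LIGHT | 16 | 9 | 8 | 100% | 100% | 0 | 0 | 0 | 0 |
| India/Himachal 5d (Manali single-base, day-trips) | LIGHT | 20 | 9 | 9 | 100% | 100% | 0 | 0 | 0 | 0 |
| India/Rajasthan 4d (Jaipur+Udaipur, very thin) | LIGHT | 16 | 8 | 8 | 100% | 100% | 0 | 0 | 0 | 0 |
| France 8d (Paris+Loire Valley+Provence+Marseille, 0 saves) | SEED | 32 | 7 | 7.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| Greece 7d (Athens+Santorini+Meteora, 0 saves) | SEED | 28 | 6 | 6.5 | 100% | 100% | 1 | 0 | 0 | 0 |
| Turkey 5d (Istanbul+Cappadocia, 0 saves, packed) | SEED | 20 | 8.5 | 8.5 | 100% | 100% | 0 | 0 | 0 | 0 |
| Vietnam 8d (Hanoi+Ninh Binh+Hoi An+Ho Chi Minh City, 0 saves) | SEED | 32 | 7 | 6.5 | 100% | 100% | 0 | 0 | 0 | 0 |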
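
The cohort averages in the Opus panel table can be recomputed from these rows. A minimal sketch for the MEDIUM tier, with the twelve (doability, overall) pairs copied from the table above in row order; `mean2dp` is an illustrative helper name and the 2-decimal rounding is an assumption based on how the summary is printed.

```typescript
// Mean of a list of scores, rounded to 2 decimals.
function mean2dp(values: number[]): number {
  if (values.length === 0) throw new Error("no values to average");
  const sum = values.reduce((acc, v) => acc + v, 0);
  return Math.round((sum / values.length) * 100) / 100;
}

// [doability, overall] for the 12 MEDIUM scenarios, in table order.
const mediumScores: Array<[number, number]> = [
  [8, 8], [3.5, 3.5], [6, 6.5], [9, 9], [3, 3], [3, 4],
  [8.5, 8.5], [8, 8], [7.5, 7.5], [5, 5], [8.5, 8], [8, 8],
];

const mediumDoability = mean2dp(mediumScores.map(pair => pair[0]));
const mediumOverall = mean2dp(mediumScores.map(pair => pair[1]));
```

This gives 6.5 for doability and 6.58 for overall, matching the Medium saves row of the summary.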

*Generated 2026-06-25. Tooling: workers/api/scripts/bench/bench48-*. Prompt path validated byte-identical to the benchmarked V9 (parity-check.ts).*