
Solving the Favela Tour Reputation Analysis Challenge

AI-Powered Automation for Academic Tourism Research
Caio Ferreira · April 2026 · Built with Claude Code (Opus 4.6)
web-scraping nlp-analysis tourism-research tcc-support proof-of-concept

Context

The Challenge

Aline Jamas, a dual-degree student in Business/Commerce and Tourism at Universidad Complutense de Madrid, is writing her TCC (final thesis) on the reputation of favela tours in Rio de Janeiro. She needed to analyze over 1,000 reviews across TripAdvisor, GetYourGuide, and Google Maps to identify patterns, recurring themes, and sentiment trends.

The original request (audio transcription): "I need to do a reputation assessment of several favela tours from GetYourGuide, TripAdvisor, those platforms. There are only about a thousand reviews. I wanted to get a pattern, a repetition of comments, of words. I saw some extensions that could help me grab these things more easily, because I feel like I'm not being efficient at all reading them one by one."

The core problem was clear: manual review analysis does not scale. Reading 1,000+ reviews one by one is not only time-consuming but also prone to confirmation bias and missed patterns. This is a textbook case for computational text analysis.

Method

The Approach

Caio received Aline's audio message, identified the problem as a web scraping + NLP pipeline challenge, and used Claude Code (Opus 4.6) as an AI pair-programmer to architect and build the solution in a single session.

Problem decomposition:

  1. Research: Which scrapers work on TripAdvisor in 2026? What are the anti-bot protections?
  2. Define criteria: Which URLs to scrape? What filters? English-only reviews?
  3. Build: Use Claude Code to write the scraping and analysis pipeline
  4. Analyze: Run NLP to find the patterns Aline needs
  5. Visualize: Generate academic-quality HTML charts for her TCC

Architecture

Solution Architecture

The system follows a three-phase pipeline: Scrape → Process → Visualize. Each phase is modular and can be run independently.

Phase 1: Web Scraping
  - TripAdvisor: Playwright + stealth
  - Google Maps: Playwright + scroll
  - GetYourGuide: Playwright / requests JSON

Phase 2: NLP Analysis
  - Data Cleaning: dedup, language filter, normalize
  - Sentiment Analysis: VADER compound scores
  - Topic Modeling (LDA): 6-8 latent topic clusters
  - Aspect-Based Analysis: 7 aspects (safety, guide, ethics...)
  - Output: CSV

Phase 3: Output
  - Master Table: all reviews, sortable
  - 9 HTML Visualizations: Chart.js, academic format
  - Statistical Annotations: n, mean, SD, CI
  - APA-Ready Format: print-friendly, cited

Data flow: Reviews (HTML) → Raw JSON → Cleaned CSV → Analysis JSON → HTML Charts
Tools: Playwright, pandas, scikit-learn, VADER, Chart.js
Orchestrated by Claude Code · Opus 4.6 · 1M context
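The data cleaning step in Phase 2 (dedup and normalize) could look roughly like the sketch below. It is an illustration rather than the project's actual code: the `platform` and `text` keys and both function names are assumptions, the sample reviews are invented, and the language filter is left out because it depends on a third-party detector.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Normalize Unicode, collapse whitespace, and lowercase for comparison."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(reviews: list[dict]) -> list[dict]:
    """Keep the first occurrence of each review text, ignoring case and spacing."""
    seen: set[str] = set()
    unique = []
    for review in reviews:
        key = normalize_text(review["text"])
        if key and key not in seen:  # an empty key means a blank review: drop it
            seen.add(key)
            unique.append(review)
    return unique


# Invented sample data, for illustration only.
raw = [
    {"platform": "google_maps", "text": "Great  tour, felt safe!"},
    {"platform": "tripadvisor", "text": "great tour, felt safe!"},
    {"platform": "google_maps", "text": "   "},
    {"platform": "google_maps", "text": "Eye opening experience."},
]
cleaned = deduplicate(raw)
```

Comparing on the normalized key while keeping the original record means the stored review text stays untouched for quoting in the thesis.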
Research

Platform Research Findings

Before building, we conducted extensive research on the scraping landscape for each platform in 2026. Anti-bot protections vary significantly.

| Platform | Approach | Anti-Bot | Est. Reviews |
| --- | --- | --- | --- |
| TripAdvisor | Playwright + stealth, or Apify API | High (Cloudflare 403) | 1,500–2,500 |
| Google Maps | Playwright + scroll extraction | Medium | 500–1,000 |
| GetYourGuide | Playwright or requests + BS4 | Low | 300–500 |

Key finding: TripAdvisor returns HTTP 403 to headless browsers in 2026 because of Cloudflare protection. Production scraping requires one of three options: residential proxies, the Apify API ($0.003/review, with a free tier), or omkarcloud/tripadvisor-scraper (340+ GitHub stars, uses Botasaurus for anti-detection).
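As a rough budget check, the quoted Apify price can be turned into a cost estimate. The sketch below assumes the $0.003/review figure above and the $5 free-tier credit mentioned elsewhere in this report; the function names are illustrative, and actual Apify billing may differ.

```python
from decimal import Decimal

PRICE_PER_REVIEW = Decimal("0.003")  # USD per review, as quoted in this report
FREE_CREDIT = Decimal("5")           # USD of free-tier credit, as quoted in this report


def scrape_cost(n_reviews: int) -> Decimal:
    """Estimated cost in USD of scraping n_reviews at the quoted price."""
    return PRICE_PER_REVIEW * n_reviews


def reviews_covered(credit: Decimal) -> int:
    """Whole number of reviews a given credit pays for."""
    return int(credit // PRICE_PER_REVIEW)


# Estimated TripAdvisor volume from the table above: 1,500 to 2,500 reviews.
low, high = scrape_cost(1500), scrape_cost(2500)
free_reviews = reviews_covered(FREE_CREDIT)
```

Under these assumptions the full TripAdvisor estimate costs between $4.50 and $7.50, and the free credit covers 1,666 reviews, in line with the roughly 1,600 quoted in this report.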

Target Favela Tours Identified

| Tour Name | TripAdvisor ID | Est. Reviews |
| --- | --- | --- |
| Favela Tour – Marcelo Armstrong | d1637149 | ~592 |
| Favela Walking Tour | d4713859 | ~200+ |
| Favela Top Tour | d9697057 | ~150+ |
| Favela Santa Marta Tour | d3546152 | ~200+ |
| Favela Adventures | d2072887 | ~300+ |

Validation

Proof of Concept Results

We ran the scraping pipeline in test mode against all three platforms. Results validate the architecture and surface the real-world constraints.

test-run

Google Maps: 18 Reviews Extracted

The Playwright-based scraper successfully navigated Google Maps, discovered 8 favela tour businesses via search, opened the first 3, scrolled their review panels, and extracted 18 reviews with full text content. The data includes tour names, review text, dates, and reviewer names.

Validated:

Browser automation approach works. Google Maps is accessible without residential proxies. Scroll-based extraction captures reviews loaded dynamically.

blocked

TripAdvisor: HTTP 403 (Cloudflare)

All 7 target URLs returned HTTP 403 with a Cloudflare challenge page. The headless Chromium browser with stealth scripts was detected. This confirms that TripAdvisor requires either residential proxies, the Apify API, or the Botasaurus framework for successful scraping in 2026.

Next step:

Use Apify free tier (~1,600 reviews/month for $5 credit) or omkarcloud/tripadvisor-scraper with Botasaurus anti-detection.

partial

GetYourGuide: 0 Reviews (Dynamic Load)

The scraper successfully discovered 21 tour links from search results but extracted 0 reviews from individual pages. Reviews on GetYourGuide are loaded via JavaScript after user interaction, requiring more aggressive scroll-and-wait strategies or API interception.

Next step:

Implement deeper scroll loops with explicit wait-for-selector on review containers, or intercept the XHR/fetch calls that load review data.
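The "deeper scroll loop" idea can be sketched independently of any browser library: keep scrolling until the number of loaded review elements stops growing. The function and parameter names below are hypothetical. With Playwright's sync API, `count_items` might wrap something like `page.locator(selector).count()` and `scroll_once` a call such as `page.mouse.wheel(0, 2000)`; the actual review selector on GetYourGuide is not known here and would need to be found by inspecting the page.

```python
import time
from typing import Callable


def scroll_until_stable(
    count_items: Callable[[], int],
    scroll_once: Callable[[], None],
    max_rounds: int = 30,
    patience: int = 3,
    pause_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Scroll until the item count has not grown for `patience` rounds in a row."""
    last_count = count_items()
    stalled = 0
    for _ in range(max_rounds):
        scroll_once()
        sleep(pause_s)  # give the page time to fetch and render new reviews
        current = count_items()
        if current > last_count:
            last_count, stalled = current, 0
        else:
            stalled += 1
            if stalled >= patience:
                break
    return last_count


class FakePage:
    """Stand-in for a browser page: loads 5 reviews per scroll, 12 in total."""

    def __init__(self) -> None:
        self.loaded = 0

    def scroll(self) -> None:
        self.loaded = min(self.loaded + 5, 12)

    def count(self) -> int:
        return self.loaded


page = FakePage()
total = scroll_until_stable(page.count, page.scroll, sleep=lambda s: None)
```

The `patience` counter is the design choice that matters: a single round without new items may just be a slow network response, so the loop only stops after several consecutive stalls.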


Analysis

The NLP Analysis Pipeline

The analysis module is fully built and ready to process reviews at scale. It implements four complementary text analysis methods, following academic best practices from published tourism research.

sentiment

Sentiment Analysis (VADER)

The VADER (Valence Aware Dictionary and sEntiment Reasoner) analyzer computes a compound sentiment score for each review and classifies it as positive (≥ 0.05), negative (≤ −0.05), or neutral (anything in between). VADER was designed for short, informal social media text, which makes it a reasonable fit for online reviews.

Output:

Compound score per review, sentiment distribution charts, sentiment-by-tour comparisons, temporal sentiment trends.
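The threshold rule described above is simple enough to show directly. This is a minimal sketch: the compound scores below are made-up values, and in a real run each one would come from VADER (for example `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]`). The name `label_sentiment` is illustrative.

```python
from collections import Counter


def label_sentiment(compound: float) -> str:
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"


# Made-up compound scores, including both boundary values.
scores = [0.84, -0.42, 0.0, 0.05, -0.05, 0.03]
distribution = Counter(label_sentiment(s) for s in scores)
```

Both boundary values are inclusive, so a score of exactly 0.05 counts as positive and exactly −0.05 as negative.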

topics

Topic Modeling (LDA)

Latent Dirichlet Allocation discovers 6–8 hidden topic clusters across the review corpus. Each topic is represented as a weighted distribution of words. This reveals what reviewers talk about most (safety, cultural experience, guide quality, logistics) without any predefined categories.

Output:

Topic-word distributions, dominant topic per review, topic prevalence heatmap, topic-by-tour breakdown.

aspects

Aspect-Based Sentiment Analysis

This method goes beyond overall sentiment to measure how reviewers feel about 7 specific aspects of favela tours: safety, guide quality, authenticity, value for money, educational impact, ethical concerns, and logistics. It uses keyword matching at the sentence level combined with VADER sentiment scoring.

Output:

Radar/spider charts comparing tours across aspects, aspect mention frequency, positive/negative percentage per aspect, representative quotes.
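A minimal sketch of sentence-level keyword matching combined with a sentiment score follows. Several things here are assumptions: the keyword lists are hypothetical (the report names the aspects but not their keywords), only three of the seven aspects are shown, and `toy_score` is a tiny placeholder standing in for VADER's compound score so that the example runs without extra dependencies.

```python
import re
from collections import defaultdict
from statistics import mean
from typing import Callable

# Hypothetical keyword lists for three of the seven aspects.
ASPECT_KEYWORDS = {
    "safety": {"safe", "safety", "dangerous", "danger"},
    "guide quality": {"guide", "guides"},
    "value for money": {"price", "expensive", "cheap", "worth"},
}


def toy_score(sentence: str) -> float:
    """Placeholder scorer in [-1, 1]; a real run would use VADER's compound score."""
    positive = {"safe", "great", "friendly", "worth"}
    negative = {"dangerous", "rude", "expensive"}
    words = re.findall(r"[a-z']+", sentence.lower())
    hits = sum(w in positive for w in words) - sum(w in negative for w in words)
    return max(-1.0, min(1.0, hits / 2))


def aspect_sentiment(
    review: str, score: Callable[[str], float] = toy_score
) -> dict[str, float]:
    """Average the sentence scores of every sentence that mentions an aspect."""
    per_aspect = defaultdict(list)
    for sentence in re.split(r"(?<=[.!?])\s+", review.strip()):
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        for aspect, keywords in ASPECT_KEYWORDS.items():
            if words & keywords:  # sentence mentions at least one aspect keyword
                per_aspect[aspect].append(score(sentence))
    return {aspect: mean(values) for aspect, values in per_aspect.items()}


result = aspect_sentiment(
    "We felt safe the whole time. The guide was rude. Lunch was fine."
)
```

Scoring each sentence separately is what lets a single review be positive about safety and negative about the guide at the same time.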

patterns

N-gram Pattern Detection

This step extracts the most frequent bigrams (2-word phrases) and trigrams (3-word phrases) from the corpus, which directly answers Aline's request for "patterns and repetitions." Combined with TF-IDF scoring, this reveals both common and distinctively important terms per tour.
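Counting n-grams needs nothing beyond the standard library. The sketch below is illustrative: the stopword list is a small hypothetical subset, the sample reviews are invented, and TF-IDF weighting is not shown.

```python
import re
from collections import Counter

# Small hypothetical stopword list; a real run would use a fuller one (e.g. NLTK's).
STOPWORDS = {"the", "a", "an", "and", "was", "is", "it", "to", "of", "we", "our", "i", "in"}


def tokenize(text: str) -> list[str]:
    """Lowercase, keep alphabetic tokens, and drop stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def ngram_counts(reviews: list[str], n: int) -> Counter:
    """Count n-grams per review so that phrases never span two reviews."""
    counts: Counter = Counter()
    for review in reviews:
        tokens = tokenize(review)
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts


# Invented sample reviews, for illustration only.
reviews = [
    "Our local guide was great and we felt safe.",
    "The local guide made it an eye opening tour.",
    "Eye opening, and I felt safe.",
]
bigrams = ngram_counts(reviews, 2)
top_bigrams = bigrams.most_common(3)
```

Because stopwords and punctuation are removed before the n-grams are formed, some counted phrases join words that were not strictly adjacent in the original sentence. That is a common simplification, but it is worth stating in the methodology.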

Expected patterns:

"eye opening", "local guide", "feel safe", "poverty tourism", "community project", "walking tour", "must do", "real life"


Deliverables

Academic Output Format

The planned final output consists of 9 standalone HTML documents with embedded Chart.js visualizations, formatted for use in a TCC and in academic research presentations.

| # | Document | Visualization Types |
| --- | --- | --- |
| 1 | Executive Summary | KPI cards, donut charts |
| 2 | Rating Distribution | Histograms, box plots |
| 3 | Temporal Trends | Line charts, area charts |
| 4 | Sentiment Analysis | Distribution bars, word bars |
| 5 | Topic Modeling | Heatmap, stacked bars |
| 6 | Aspect-Based Analysis | Radar charts, grouped bars |
| 7 | Word Frequency & N-grams | Horizontal bars, SVG word cloud |
| 8 | Comparative Tour Analysis | Multi-metric tables, parallel coords |
| 9 | Methodology & Data Summary | Pipeline diagram, sample stats |

Design principles: Clean, minimal aesthetic. Muted academic color palette. Proper axis labels, legends, and titles. Statistical annotations (n, mean, SD, CI). APA/Chicago citation-ready formatting. Responsive and print-friendly.
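The statistical annotations (n, mean, SD, CI) can be computed with the standard library. This sketch assumes a 95% confidence interval based on the normal approximation (z = 1.96); the report does not say which interval method is used, and for small samples a t-based interval would be wider. The ratings are made-up values and `describe` is an illustrative name.

```python
import math
from statistics import mean, stdev


def describe(values: list[float], z: float = 1.96) -> dict:
    """Return n, mean, sample SD, and a normal-approximation confidence interval."""
    n = len(values)
    m = mean(values)
    sd = stdev(values)  # sample standard deviation (n - 1 denominator)
    half_width = z * sd / math.sqrt(n)
    return {"n": n, "mean": m, "sd": sd, "ci": (m - half_width, m + half_width)}


# Made-up star ratings, for illustration only.
ratings = [5, 4, 5, 3, 5, 4, 5, 2]
summary = describe(ratings)
```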


Stack

Technology Stack

| Component | Technology | Purpose |
| --- | --- | --- |
| Browser Automation | Playwright | Dynamic content rendering, JS execution |
| NLP / Text Mining | NLTK, scikit-learn, VADER | Tokenization, TF-IDF, LDA, sentiment |
| Data Processing | pandas, langdetect | Cleaning, filtering, aggregation |
| Visualization | Chart.js, D3.js | Academic-quality interactive charts |
| AI Assistant | Claude Code (Opus 4.6) | Architecture, code gen, research |

Project Structure

aline-favela-reviews/


Researcher

About Aline Jamas

researcher-profile

Aline Jamas: Tourism & Business Researcher

Education: Dual degree in Business/Commerce and Tourism at Universidad Complutense de Madrid (2021–2026).

Current role: Treasury Intern at Schneider Electric, Madrid. Assists in cash flow analysis and trade finance across international projects.

Previous experience:

Research focus:

Reputation analysis of favela tours in Rio de Janeiro, investigating how tourism review patterns reflect perceptions of safety, authenticity, and ethical tourism in informal urban communities.


Timeline

How It Was Built

The entire system was designed, researched, coded, and tested in a single Claude Code session.

  1. Problem Analysis: Read Aline's audio transcription, identified the scraping + NLP challenge
  2. Platform Research: Web search on TripAdvisor scrapers 2026, anti-bot protections, academic NLP methods
  3. Architecture Design: Three-phase pipeline (Scrape → Process → Visualize), file structure, tech stack
  4. Code Generation: 3 scrapers + NLP pipeline + analysis orchestrator, all Python, ~800 lines
  5. Proof of Concept: Test run with 18 reviews from Google Maps. TripAdvisor blocked (403). GetYourGuide partial
  6. Documentation: This document, covering process, findings, architecture, and next steps

Next Steps

Path to Full Execution

To complete the project, Aline needs to:

  1. TripAdvisor access: Create free Apify account ($5/month credit ≈ 1,600 reviews) or use omkarcloud/tripadvisor-scraper with Botasaurus
  2. Refine GetYourGuide scraper: Add explicit wait-for-selector + deeper scroll loops, or intercept API calls
  3. Scale Google Maps: Expand to all 8+ discovered businesses, extract full review history
  4. Run NLP pipeline: Process all reviews through sentiment, topics, aspects, and n-gram analysis
  5. Generate visualizations: Build the 9 academic HTML documents from analysis results

"The most powerful tool isn't the scraper or the NLP model; it's the ability to decompose a vague request into a precise, executable pipeline."