Solving the Favela Tour Reputation Analysis Challenge

AI-Powered Automation for Academic Tourism Research

Caio Ferreira · April 2026 · Built with Claude Code (Opus 4.6)

web-scraping nlp-analysis tourism-research tcc-support proof-of-concept

Context

The Challenge

Aline Jamas, a dual-degree student in Business/Commerce and Tourism at Universidad Complutense de Madrid, is writing her TCC (final thesis) on the reputation of favela tours in Rio de Janeiro. She needed to analyze over 1,000 reviews across TripAdvisor, GetYourGuide, and Google Maps to identify patterns, recurring themes, and sentiment trends.

The original request (audio transcription): "I need to do a reputation assessment of several favela tours from GetYourGuide, TripAdvisor, those platforms. There are only about a thousand reviews. I wanted to get a pattern, a repetition of comments, of words. I saw some extensions that could help me grab these things more easily, because I feel like I'm not being efficient at all reading them one by one."

The core problem was clear: manual review analysis does not scale. Reading 1,000+ reviews one by one is not only time-consuming but also prone to confirmation bias and missed patterns. This is a textbook case for computational text analysis.

Method

The Approach

Caio received Aline's audio message, identified the problem as a web scraping + NLP pipeline challenge, and used Claude Code (Opus 4.6) as an AI pair-programmer to architect and build the solution in a single session.

Problem decomposition:

Research — Which scrapers work on TripAdvisor in 2026? What are the anti-bot protections?
Define criteria — Which URLs to scrape? What filters? English-only reviews?
Build — Use Claude Code to write the scraping and analysis pipeline
Analyze — Run NLP to find the patterns Aline needs
Visualize — Generate academic-quality HTML charts for her TCC

Architecture

Solution Architecture

The system follows a three-phase pipeline: Scrape → Process → Visualize. Each phase is modular and can be run independently.
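As a rough illustration of what "modular and independently runnable" can look like, the sketch below wires three phases together through files on disk, so any phase can be rerun without repeating the ones before it. The function names, file names, and placeholder review data are hypothetical and are not taken from the project's code; only the directory names follow the project structure.

```python
import json
from pathlib import Path


def scrape(base: Path) -> Path:
    """Phase 1: write raw reviews as JSON (placeholder data, no network access)."""
    raw = base / "data" / "raw" / "reviews.json"  # hypothetical file name
    raw.parent.mkdir(parents=True, exist_ok=True)
    reviews = [
        {"tour": "Example Tour", "text": "  Eye opening walk with a local guide.  "},
        {"tour": "Example Tour", "text": ""},  # empty review, dropped in phase 2
    ]
    raw.write_text(json.dumps(reviews), encoding="utf-8")
    return raw


def process(base: Path) -> Path:
    """Phase 2: clean the raw file. Depends only on the file written by phase 1."""
    raw_path = base / "data" / "raw" / "reviews.json"
    raw = json.loads(raw_path.read_text(encoding="utf-8"))
    cleaned = [{**r, "text": r["text"].strip()} for r in raw if r["text"].strip()]
    out = base / "data" / "processed" / "reviews_clean.json"  # hypothetical file name
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(cleaned), encoding="utf-8")
    return out


def visualize(base: Path) -> Path:
    """Phase 3: render a trivial HTML summary from the processed file."""
    clean_path = base / "data" / "processed" / "reviews_clean.json"
    cleaned = json.loads(clean_path.read_text(encoding="utf-8"))
    out = base / "output" / "summary.html"  # hypothetical file name
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(f"<p>{len(cleaned)} reviews processed</p>", encoding="utf-8")
    return out
```

Because each phase reads only the previous phase's output file, a failed or blocked scraper does not prevent the analysis and visualization code from being tested on whatever data is already on disk.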

Research

Platform Research Findings

Before building, we conducted extensive research on the scraping landscape for each platform in 2026. Anti-bot protections vary significantly.

Platform	Approach	Anti-Bot	Est. Reviews
TripAdvisor	Playwright + stealth, or Apify API	High (Cloudflare, HTTP 403)	1,500 – 2,500
Google Maps	Playwright + scroll extraction	Medium	500 – 1,000
GetYourGuide	Playwright or requests + BS4	Low	300 – 500

Key finding: TripAdvisor returns HTTP 403 to headless browsers in 2026 due to Cloudflare protection. Production scraping requires one of three options: residential proxies, the Apify API ($0.003/review, with a free tier), or the omkarcloud/tripadvisor-scraper project (340+ GitHub stars; uses Botasaurus for anti-detection).

Target Favela Tours Identified

Tour Name	TripAdvisor ID	Est. Reviews
Favela Tour – Marcelo Armstrong	`d1637149`	~592
Favela Walking Tour	`d4713859`	~200+
Favela Top Tour	`d9697057`	~150+
Favela Santa Marta Tour	`d3546152`	~200+
Favela Adventures	`d2072887`	~300+

Validation

Proof of Concept Results

We ran the scraping pipeline in test mode against all three platforms. Results validate the architecture and surface the real-world constraints.

test-run

Google Maps — 18 Reviews Extracted

The Playwright-based scraper successfully navigated Google Maps, discovered 8 favela tour businesses via search, opened the first 3, scrolled their review panels, and extracted 18 reviews with full text content. The data includes tour names, review text, dates, and reviewer names.

Validated:

The browser automation approach works. Google Maps is accessible without residential proxies. Scroll-based extraction captures dynamically loaded reviews.
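The scroll-based extraction can be sketched as a loop that keeps scrolling until no new reviews appear for a few rounds. This is a minimal sketch, not the project's scraper: the helper name, its parameters, and the simulated feed are hypothetical, and the two callables stand in for browser actions so the logic can run without a browser.

```python
from typing import Callable


def scroll_until_stable(
    count_items: Callable[[], int],
    scroll_once: Callable[[], None],
    max_rounds: int = 50,
    patience: int = 3,
) -> int:
    """Scroll until the item count stops growing for `patience` consecutive rounds."""
    seen = count_items()
    stalled = 0
    for _ in range(max_rounds):
        if stalled >= patience:
            break  # nothing new loaded recently, assume the panel is exhausted
        scroll_once()
        current = count_items()
        if current > seen:
            seen, stalled = current, 0
        else:
            stalled += 1
    return seen


# Simulated review panel (hypothetical): 18 reviews, 5 more loaded per scroll.
feed = {"loaded": 5, "total": 18}


def fake_scroll() -> None:
    feed["loaded"] = min(feed["total"], feed["loaded"] + 5)


loaded = scroll_until_stable(lambda: feed["loaded"], fake_scroll)
```

With Playwright's sync API, `count_items` could wrap `page.locator(selector).count()` and `scroll_once` could call `page.mouse.wheel(0, 2000)` followed by `page.wait_for_timeout(800)`; the selector and the timing values would need to be found by inspecting the live page.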

blocked

TripAdvisor — HTTP 403 (Cloudflare)

All 7 target URLs returned HTTP 403 with a Cloudflare challenge page: the headless Chromium browser was detected despite its stealth scripts. This confirms that scraping TripAdvisor in 2026 requires residential proxies, the Apify API, or the Botasaurus framework.

Next step:

Use the Apify free tier (a $5 monthly credit, roughly 1,600 reviews at $0.003/review) or omkarcloud/tripadvisor-scraper with Botasaurus anti-detection.
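The free-tier figure follows from simple arithmetic on the numbers quoted in the research findings. The sketch below assumes flat per-review pricing with no other platform fees, which may not match the actual billing; the function names are hypothetical.

```python
from decimal import Decimal

PRICE_PER_REVIEW = Decimal("0.003")  # USD per review, as quoted in the research findings
MONTHLY_CREDIT = Decimal("5")        # USD of free-tier credit per month


def reviews_covered(credit: Decimal, price: Decimal) -> int:
    """Whole number of reviews a credit balance pays for (flat pricing assumed)."""
    return int(credit // price)


def cost_for(n_reviews: int, price: Decimal) -> Decimal:
    """Total cost in USD for a given number of reviews (flat pricing assumed)."""
    return n_reviews * price


free_reviews = reviews_covered(MONTHLY_CREDIT, PRICE_PER_REVIEW)
upper_estimate_cost = cost_for(2500, PRICE_PER_REVIEW)
```

Under these assumptions the credit covers 1,666 reviews, in line with the rounded figure of about 1,600. The upper TripAdvisor estimate of 2,500 reviews would cost $7.50, so it would not fit in a single month of free credit.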

partial

GetYourGuide — 0 Reviews (Dynamic Load)

The scraper successfully discovered 21 tour links from search results but extracted 0 reviews from individual pages. Reviews on GetYourGuide are loaded via JavaScript after user interaction, requiring more aggressive scroll-and-wait strategies or API interception.

Next step:

Implement deeper scroll loops with explicit wait-for-selector on review containers, or intercept the XHR/fetch calls that load review data.

Analysis

The NLP Analysis Pipeline

The analysis module is built and ready to run once the full review set has been collected. It implements four complementary text analysis methods that follow practices from published tourism research.

sentiment

Sentiment Analysis (VADER)

The VADER (Valence Aware Dictionary and sEntiment Reasoner) analyzer computes a compound sentiment score between −1 and +1 for each review and classifies the review as positive (≥ 0.05), negative (≤ −0.05), or neutral (anything in between). VADER is tuned for social media and review text, which makes it a good fit for this use case.

Output:

Compound score per review, sentiment distribution charts, sentiment-by-tour comparisons, temporal sentiment trends.
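The threshold rule and the sentiment distribution can be sketched as follows. The function names are hypothetical, and the scores are passed in directly so the example runs without the VADER lexicon; in a real run each score would be the `"compound"` value returned by `SentimentIntensityAnalyzer().polarity_scores(text)`.

```python
from collections import Counter
from typing import Dict, Iterable


def label_from_compound(compound: float) -> str:
    """Map a VADER compound score to a label using the 0.05 thresholds."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"


def sentiment_distribution(scores: Iterable[float]) -> Dict[str, float]:
    """Share of reviews per label; all zeros if there are no scores."""
    labels = [label_from_compound(s) for s in scores]
    counts = Counter(labels)
    total = len(labels)
    return {
        label: (counts[label] / total if total else 0.0)
        for label in ("positive", "neutral", "negative")
    }


# Illustrative scores only, not real review data.
example = sentiment_distribution([0.8, 0.05, -0.3, 0.0])
```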

topics

Topic Modeling (LDA)

Latent Dirichlet Allocation is configured to extract 6–8 latent topics from the review corpus, with each topic represented as a weighted distribution over words. This reveals what reviewers talk about most (for example safety, cultural experience, guide quality, and logistics) without any predefined categories.

Output:

Topic-word distributions, dominant topic per review, topic prevalence heatmap, topic-by-tour breakdown.
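Two of these outputs, the dominant topic per review and overall topic prevalence, are simple reductions over the document-topic matrix. The sketch below assumes that matrix is already available as rows of topic weights, for example from scikit-learn's `LatentDirichletAllocation.fit_transform`; the function names and the three-topic toy matrix are hypothetical.

```python
from typing import List, Sequence


def dominant_topics(doc_topic: Sequence[Sequence[float]]) -> List[int]:
    """Index of the highest-weight topic for each review."""
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topic]


def topic_prevalence(doc_topic: Sequence[Sequence[float]]) -> List[float]:
    """Mean weight of each topic across all reviews."""
    n_docs = len(doc_topic)
    n_topics = len(doc_topic[0])
    return [sum(row[k] for row in doc_topic) / n_docs for k in range(n_topics)]


# Toy document-topic matrix: 3 reviews x 3 topics, each row sums to 1.
toy = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
]
dominant = dominant_topics(toy)
prevalence = topic_prevalence(toy)
```

Grouping the same rows by tour before averaging would give the topic-by-tour breakdown.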

aspects

Aspect-Based Sentiment Analysis

This method goes beyond overall sentiment to measure how reviewers feel about 7 specific aspects of favela tours: safety, guide quality, authenticity, value for money, educational impact, ethical concerns, and logistics. It matches aspect keywords at the sentence level and scores each matching sentence with VADER.

Output:

Radar/spider charts comparing tours across aspects, aspect mention frequency, positive/negative percentage per aspect, representative quotes.
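A minimal sketch of sentence-level keyword matching combined with a sentiment scorer is shown below. The keyword lists cover only three of the seven aspects and are invented for illustration, the sentence splitter is a simple regex, and `toy_scorer` is a stand-in for VADER's compound score so the example runs without external packages.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical keyword lists; the project's actual lists are not shown in this write-up.
ASPECT_KEYWORDS = {
    "safety": {"safe", "safety", "unsafe", "dangerous"},
    "guide quality": {"guide", "guides"},
    "value for money": {"price", "expensive", "cheap", "worth"},
}


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def aspect_scores(text: str, score_sentence: Callable[[str], float]) -> Dict[str, float]:
    """Average sentence score per aspect, over sentences that mention the aspect."""
    hits = defaultdict(list)
    for sentence in split_sentences(text):
        tokens = set(re.findall(r"[a-z']+", sentence.lower()))
        for aspect, keywords in ASPECT_KEYWORDS.items():
            if tokens & keywords:
                hits[aspect].append(score_sentence(sentence))
    return {aspect: sum(vals) / len(vals) for aspect, vals in hits.items()}


def toy_scorer(sentence: str) -> float:
    """Stand-in for a VADER compound score, for demonstration only."""
    lowered = sentence.lower()
    if "great" in lowered:
        return 0.5
    if "expensive" in lowered:
        return -0.5
    return 0.0


review = "Our guide was great. We felt safe the whole time! It was a bit expensive."
scores = aspect_scores(review, toy_scorer)
```

Scoring at the sentence level keeps a complaint about price from dragging down the score for the guide in the same review, which is the main reason to prefer it over one score per review.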

patterns

N-gram Pattern Detection

Extracts the most frequent bigrams (2-word phrases) and trigrams (3-word phrases) from the corpus — this directly answers Aline's request for "patterns and repetitions." Combined with TF-IDF scoring, this reveals both common and distinctively important terms per tour.
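Counting n-grams needs little more than a tokenizer and a counter. The sketch below is an assumption about how this step could look, not the project's code: the stopword list is a tiny illustrative one, and the three sample reviews are invented.

```python
import re
from collections import Counter
from typing import Iterable, List

# Tiny illustrative stopword list; a real run would use a fuller one (e.g. NLTK's).
STOPWORDS = {"a", "an", "the", "and", "was", "is", "it", "to", "of", "in", "we", "i", "our", "with"}


def tokenize(text: str) -> List[str]:
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]


def ngram_counts(reviews: Iterable[str], n: int) -> Counter:
    """Count n-grams within each review (never across review boundaries)."""
    counts: Counter = Counter()
    for review in reviews:
        tokens = tokenize(review)
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts


# Invented sample reviews, for demonstration only.
sample = [
    "Eye opening tour with a local guide",
    "The local guide made it eye opening",
    "Local guide was great",
]
top_bigrams = ngram_counts(sample, 2).most_common(2)
trigrams = ngram_counts(sample, 3)
```

Note that removing stopwords before forming n-grams joins words that were not strictly adjacent in the original text. That usually surfaces more meaningful phrases, but it is a design choice worth stating in the methodology.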

Expected patterns:

"eye opening", "local guide", "feel safe", "poverty tourism", "community project", "walking tour", "must do", "real life"

Deliverables

Academic Output Format

The final output consists of 9 standalone HTML documents with embedded Chart.js visualizations, designed for direct use in a TCC and in academic research presentations.

#	Document	Visualization Types
1	Executive Summary	KPI cards, donut charts
2	Rating Distribution	Histograms, box plots
3	Temporal Trends	Line charts, area charts
4	Sentiment Analysis	Distribution bars, word bars
5	Topic Modeling	Heatmap, stacked bars
6	Aspect-Based Analysis	Radar charts, grouped bars
7	Word Frequency & N-grams	Horizontal bars, SVG word cloud
8	Comparative Tour Analysis	Multi-metric tables, parallel coords
9	Methodology & Data Summary	Pipeline diagram, sample stats

Design principles: Clean, minimal aesthetic. Muted academic color palette. Proper axis labels, legends, and titles. Statistical annotations (n, mean, SD, CI). APA/Chicago citation-ready formatting. Responsive and print-friendly.
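The statistical annotations (n, mean, SD, CI) can be computed with the standard library alone. This sketch assumes a 95% confidence interval based on the normal approximation (z = 1.96); for small samples a t-distribution would be more appropriate. The function name and the sample ratings are hypothetical.

```python
import math
import statistics
from typing import Dict, Sequence


def summarize(values: Sequence[float], z: float = 1.96) -> Dict[str, float]:
    """Return n, mean, sample SD, and a normal-approximation confidence interval."""
    n = len(values)
    if n == 0:
        raise ValueError("summarize() needs at least one value")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values) if n > 1 else 0.0  # sample SD (n - 1 denominator)
    half_width = z * sd / math.sqrt(n)
    return {
        "n": n,
        "mean": mean,
        "sd": sd,
        "ci_low": mean - half_width,
        "ci_high": mean + half_width,
    }


# Invented star ratings, for demonstration only.
stats = summarize([5, 4, 5, 3, 4, 5, 2, 5])
```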

Stack

Technology Stack

Component	Technology	Purpose
Browser Automation	`Playwright`	Dynamic content rendering, JS execution
NLP / Text Mining	`NLTK`, `scikit-learn`, `VADER`	Tokenization, TF-IDF, LDA, sentiment
Data Processing	`pandas`, `langdetect`	Cleaning, filtering, aggregation
Visualization	`Chart.js`, `D3.js`	Academic-quality interactive charts
AI Assistant	`Claude Code` (Opus 4.6)	Architecture, code gen, research

Project Structure

aline-favela-reviews/

scrapers/ — Platform-specific scrapers + config
analysis/ — NLP pipeline + text mining functions
data/raw/ — Raw JSON from each scraper
data/processed/ — Cleaned CSV + analysis JSON
output/ — Final HTML visualizations

Researcher

About Aline Jamas

researcher-profile

Aline Jamas — Tourism & Business Researcher

Education: Dual degree in Business/Commerce and Tourism at Universidad Complutense de Madrid (2021–2026).

Current role: Treasury Intern at Schneider Electric, Madrid. Assists in cash flow analysis and trade finance across international projects.

Previous experience:

Revenue Management Intern at XOTELS (Madrid) — data-driven pricing strategies for hotels
Global Markets Analyst at UCM Finance Society — macroeconomic research and financial analysis
Project Analyst at Instituto Social Jejé de Oyá — strategic proposals for government funding

Research focus:

Reputation analysis of favela tours in Rio de Janeiro — investigating how tourism review patterns reflect perceptions of safety, authenticity, and ethical tourism in informal urban communities.

Timeline

How It Was Built

The entire system was designed, researched, coded, and tested in a single Claude Code session.

Next Steps

Path to Full Execution

To complete the project, Aline needs to:

TripAdvisor access — Create free Apify account ($5/month credit ≈ 1,600 reviews) or use omkarcloud/tripadvisor-scraper with Botasaurus
Refine GetYourGuide scraper — Add explicit wait-for-selector + deeper scroll loops, or intercept API calls
Scale Google Maps — Expand to all 8+ discovered businesses, extract full review history
Run NLP pipeline — Process all reviews through sentiment, topics, aspects, and n-gram analysis
Generate visualizations — Build the 9 academic HTML documents from analysis results

"The most powerful tool isn't the scraper or the NLP model — it's the ability to decompose a vague request into a precise, executable pipeline."