Aline Jamas, a dual-degree student in Business/Commerce and Tourism at Universidad Complutense de Madrid, is writing her TCC (final thesis) on the reputation of favela tours in Rio de Janeiro. She needed to analyze over 1,000 reviews across TripAdvisor, GetYourGuide, and Google Maps to identify patterns, recurring themes, and sentiment trends.
The original request (audio transcription): "I need to do a reputation assessment of several favela tours from GetYourGuide, TripAdvisor, those platforms. There are only about a thousand reviews. I wanted to get a pattern, a repetition of comments, of words. I saw some extensions that could help me grab these things more easily, because I feel like I'm not being efficient at all reading them one by one."
The core problem was clear: manual review analysis does not scale. Reading 1,000+ reviews one by one is not only time-consuming but also prone to confirmation bias and missed patterns. This is a textbook case for computational text analysis.
Caio received Aline's audio message, identified the problem as a web scraping + NLP pipeline challenge, and used Claude Code (Opus 4.6) as an AI pair-programmer to architect and build the solution in a single session.
Problem decomposition:
The system follows a three-phase pipeline: Scrape → Process → Visualize. Each phase is modular and can be run independently.
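As a rough illustration of what "modular and independently runnable" can look like, the sketch below registers the three phases in a dictionary and runs any subset in pipeline order. The function names and return strings are hypothetical placeholders, not the project's actual code; in a real run each phase would read the previous phase's files from disk.

```python
from typing import Callable, Dict, List


# Hypothetical phase functions: placeholders that only report what a real
# phase would write. They are not the project's actual implementation.
def scrape() -> str:
    return "scrape: wrote raw JSON"


def process() -> str:
    return "process: wrote cleaned CSV + analysis JSON"


def visualize() -> str:
    return "visualize: wrote HTML reports"


# Insertion order of this dict defines the canonical pipeline order.
PHASES: Dict[str, Callable[[], str]] = {
    "scrape": scrape,
    "process": process,
    "visualize": visualize,
}


def run_pipeline(selected: List[str]) -> List[str]:
    """Run only the selected phases, always in canonical pipeline order."""
    unknown = [name for name in selected if name not in PHASES]
    if unknown:
        raise ValueError(f"unknown phase(s): {unknown}")
    return [PHASES[name]() for name in PHASES if name in selected]
```

For example, `run_pipeline(["process", "visualize"])` would re-run the analysis and charts without scraping again, which matters when a platform blocks repeated requests.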
Before building, we conducted extensive research on the scraping landscape for each platform in 2026. Anti-bot protections vary significantly.
| Platform | Approach | Anti-Bot | Est. Reviews |
|---|---|---|---|
| TripAdvisor | Playwright + stealth, or Apify API | High (Cloudflare, HTTP 403) | 1,500–2,500 |
| Google Maps | Playwright + scroll extraction | Medium | 500–1,000 |
| GetYourGuide | Playwright or requests + BS4 | Low | 300–500 |
Key finding: TripAdvisor returns HTTP 403 from headless browsers in 2026 due to Cloudflare protection. Production scraping requires either residential proxies, the Apify API ($0.003/review with free tier), or the omkarcloud/tripadvisor-scraper (340+ GitHub stars, uses Botasaurus for anti-detection).
| Tour Name | TripAdvisor ID | Est. Reviews |
|---|---|---|
| Favela Tour – Marcelo Armstrong | d1637149 | ~592 |
| Favela Walking Tour | d4713859 | ~200+ |
| Favela Top Tour | d9697057 | ~150+ |
| Favela Santa Marta Tour | d3546152 | ~200+ |
| Favela Adventures | d2072887 | ~300+ |
We ran the scraping pipeline in test mode against all three platforms. Results validate the architecture and surface the real-world constraints.
The Playwright-based scraper successfully navigated Google Maps, discovered 8 favela tour businesses via search, opened the first 3, scrolled their review panels, and extracted 18 reviews with full text content. The data includes tour names, review text, dates, and reviewer names.
The browser automation approach works: Google Maps is accessible without residential proxies, and scroll-based extraction captures the reviews that load dynamically.
All 7 target URLs returned HTTP 403 with a Cloudflare challenge page. The headless Chromium browser with stealth scripts was detected. This confirms that TripAdvisor requires either residential proxies, the Apify API, or the Botasaurus framework for successful scraping in 2026.
Use the Apify free tier (roughly 1,600 reviews per month on the $5 credit) or omkarcloud/tripadvisor-scraper with Botasaurus anti-detection.
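The ~1,600 figure follows from the two prices quoted in this write-up ($0.003 per review, $5 of free credit). The short sketch below reproduces that arithmetic and estimates the cost of the upper TripAdvisor estimate of 2,500 reviews. It assumes flat per-review pricing, which may not match Apify's actual billing.

```python
import math

PRICE_PER_REVIEW_USD = 0.003  # per-review price quoted in this write-up
FREE_CREDIT_USD = 5.00        # free-tier credit quoted in this write-up


def reviews_within_budget(budget_usd: float,
                          price_per_review: float = PRICE_PER_REVIEW_USD) -> int:
    """Number of whole reviews a budget covers under flat per-review pricing."""
    return math.floor(budget_usd / price_per_review)


def cost_for_reviews(n_reviews: int,
                     price_per_review: float = PRICE_PER_REVIEW_USD) -> float:
    """Total cost in USD for scraping n_reviews, rounded to cents."""
    return round(n_reviews * price_per_review, 2)


free_reviews = reviews_within_budget(FREE_CREDIT_USD)  # 1666 reviews
full_corpus_cost = cost_for_reviews(2500)              # 7.5 USD
```

Under these assumptions the free credit covers about two thirds of the largest TripAdvisor estimate, and the full corpus would cost well under $10.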
The scraper successfully discovered 21 tour links from search results but extracted 0 reviews from individual pages. Reviews on GetYourGuide are loaded via JavaScript after user interaction, requiring more aggressive scroll-and-wait strategies or API interception.
Implement deeper scroll loops with explicit wait-for-selector on review containers, or intercept the XHR/fetch calls that load review data.
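One way to structure the "deeper scroll loop" is to keep scrolling until the number of loaded review elements stops growing. The sketch below keeps that logic independent of the browser by taking two callables. With Playwright's sync API, `scroll_once` could wrap `page.mouse.wheel(0, 2000)` plus a short `page.wait_for_timeout(...)`, and `count_items` could wrap `page.locator(selector).count()`. The selector, the default limits, and the function name are assumptions, not values taken from the project.

```python
from typing import Callable


def scroll_until_stable(
    scroll_once: Callable[[], None],
    count_items: Callable[[], int],
    max_rounds: int = 30,
    patience: int = 3,
) -> int:
    """Scroll until the item count stops growing for `patience` rounds.

    Returns the highest item count observed. `max_rounds` is a hard cap so
    an endlessly loading page cannot trap the scraper.
    """
    last_count = count_items()
    stale_rounds = 0
    for _ in range(max_rounds):
        scroll_once()
        current = count_items()
        if current > last_count:
            # New reviews appeared: record progress and reset the counter.
            last_count = current
            stale_rounds = 0
        else:
            stale_rounds += 1
            if stale_rounds >= patience:
                break
    return last_count
```

Separating the stopping rule from the browser calls also makes the loop testable with a fake page object, without launching Chromium.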
The analysis module is fully built and ready to process reviews at scale. It implements four complementary text analysis methods, following academic best practices from published tourism research.
The VADER (Valence Aware Dictionary and sEntiment Reasoner) analyzer computes a compound sentiment score for each review and classifies it as positive (≥ 0.05), negative (≤ -0.05), or neutral (anything in between). VADER was designed for short, informal social media text, which makes it a good fit for tourist reviews.
Outputs: compound score per review, sentiment distribution charts, sentiment-by-tour comparisons, temporal sentiment trends.
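The thresholding step can be sketched in a few lines. The compound score itself would come from a VADER implementation, for example `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]` in `nltk.sentiment.vader` (which needs the `vader_lexicon` resource downloaded). The code below assumes those scores are already available and only shows the classification and the distribution count; the function names are illustrative.

```python
from collections import Counter
from typing import Iterable


def classify_compound(compound: float) -> str:
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"


def sentiment_distribution(compounds: Iterable[float]) -> Counter:
    """Count how many reviews fall into each sentiment class."""
    return Counter(classify_compound(c) for c in compounds)
```

The resulting counts are what a sentiment distribution chart would plot, either for the whole corpus or per tour.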
Latent Dirichlet Allocation (LDA) discovers 6–8 hidden topic clusters across the review corpus. Each topic is represented as a weighted distribution of words. This reveals what reviewers talk about most (safety, cultural experience, guide quality, logistics) without any predefined categories.
Outputs: topic-word distributions, dominant topic per review, topic prevalence heatmap, topic-by-tour breakdown.
Aspect-based analysis goes beyond overall sentiment to measure how reviewers feel about 7 specific aspects of favela tours: safety, guide quality, authenticity, value for money, educational impact, ethical concerns, and logistics. It uses keyword matching at the sentence level combined with VADER sentiment scoring.
Outputs: radar/spider charts comparing tours across aspects, aspect mention frequency, positive/negative percentage per aspect, representative quotes.
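A minimal sketch of sentence-level keyword matching is shown below. The keyword lists are hypothetical (the project's real lexicon is not given here) and cover only three of the seven aspects. The sentiment scorer is passed in as a function, so a VADER compound scorer can be plugged in without changing the matching logic.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical keyword prefixes per aspect; illustrative only.
ASPECT_KEYWORDS: Dict[str, List[str]] = {
    "safety": ["safe", "danger", "secure"],
    "guide quality": ["guide"],
    "value for money": ["price", "worth", "expensive"],
}


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def aspect_scores(review: str,
                  score_fn: Callable[[str], float]) -> Dict[str, float]:
    """Average sentence score for every aspect mentioned in the review."""
    collected = defaultdict(list)
    for sentence in split_sentences(review):
        lowered = sentence.lower()
        for aspect, keywords in ASPECT_KEYWORDS.items():
            # Prefix match at a word boundary, so "safe" also hits "safety".
            if any(re.search(rf"\b{re.escape(k)}", lowered) for k in keywords):
                collected[aspect].append(score_fn(sentence))
    return {aspect: sum(v) / len(v) for aspect, v in collected.items()}
```

Scoring per sentence rather than per review keeps a complaint about logistics from dragging down the safety score of an otherwise positive review.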
N-gram extraction pulls the most frequent bigrams (2-word phrases) and trigrams (3-word phrases) from the corpus. This directly answers Aline's request for "patterns and repetitions." Combined with TF-IDF scoring, it reveals both the common terms and the terms that are distinctive for each tour.
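Both ideas can be sketched with the standard library alone. The code below counts n-grams across reviews and computes a plain TF-IDF with tf = count / document length and idf = ln(N / df). It is an illustration under simplifying assumptions: there is no stop-word removal, and scikit-learn's `TfidfVectorizer` uses a smoothed idf by default, so its numbers would differ.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase and keep alphabetic tokens (apostrophes allowed)."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(reviews: List[str], n: int) -> Counter:
    """Count n-grams within each review (never across review boundaries)."""
    counts: Counter = Counter()
    for review in reviews:
        tokens = tokenize(review)
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts


def tfidf(docs: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Unsmoothed TF-IDF per document; here one document = one tour's reviews."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    doc_freq: Counter = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    n_docs = len(docs)
    scores: Dict[str, Dict[str, float]] = {}
    for name, tokens in tokenized.items():
        term_freq = Counter(tokens)
        scores[name] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
    return scores
```

A term that appears in every tour's reviews gets a score of zero, which is why TF-IDF surfaces what is distinctive about a tour rather than what is merely frequent.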
"eye opening", "local guide", "feel safe", "poverty tourism", "community project", "walking tour", "must do", "real life"
The final output consists of 9 standalone HTML documents with embedded Chart.js visualizations, formatted for TCC and academic research presentations.
| # | Document | Visualization Types |
|---|---|---|
| 1 | Executive Summary | KPI cards, donut charts |
| 2 | Rating Distribution | Histograms, box plots |
| 3 | Temporal Trends | Line charts, area charts |
| 4 | Sentiment Analysis | Distribution bars, word bars |
| 5 | Topic Modeling | Heatmap, stacked bars |
| 6 | Aspect-Based Analysis | Radar charts, grouped bars |
| 7 | Word Frequency & N-grams | Horizontal bars, SVG word cloud |
| 8 | Comparative Tour Analysis | Multi-metric tables, parallel coords |
| 9 | Methodology & Data Summary | Pipeline diagram, sample stats |
Design principles: Clean, minimal aesthetic. Muted academic color palette. Proper axis labels, legends, and titles. Statistical annotations (n, mean, SD, CI). APA/Chicago citation-ready formatting. Responsive and print-friendly.
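The statistical annotations (n, mean, SD, CI) can be computed with the standard library before they are passed to the charts. The sketch below uses the sample standard deviation and a normal-approximation 95% confidence interval (z = 1.96). That choice is an assumption: for small per-tour samples a t-based interval would be more appropriate.

```python
import math
import statistics
from typing import Dict, Sequence


def summarize(values: Sequence[float], z: float = 1.96) -> Dict[str, float]:
    """Return n, mean, sample SD and a normal-approximation CI for the mean."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values for SD and CI")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)          # sample SD (n - 1 denominator)
    half_width = z * sd / math.sqrt(n)     # z * standard error of the mean
    return {
        "n": n,
        "mean": mean,
        "sd": sd,
        "ci_low": mean - half_width,
        "ci_high": mean + half_width,
    }
```

Applied to star ratings or compound sentiment scores per tour, this yields the figures a chart caption would report, for example "n = 8, M = 4.0, SD = 1.07".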
| Component | Technology | Purpose |
|---|---|---|
| Browser Automation | Playwright | Dynamic content rendering, JS execution |
| NLP / Text Mining | NLTK, scikit-learn, VADER | Tokenization, TF-IDF, LDA, sentiment |
| Data Processing | pandas, langdetect | Cleaning, filtering, aggregation |
| Visualization | Chart.js, D3.js | Academic-quality interactive charts |
| AI Assistant | Claude Code (Opus 4.6) | Architecture, code gen, research |
aline-favela-reviews/
  scrapers/ - Platform-specific scrapers + config
  analysis/ - NLP pipeline + text mining functions
  data/raw/ - Raw JSON from each scraper
  data/processed/ - Cleaned CSV + analysis JSON
  output/ - Final HTML visualizations

Education: Dual degree in Business/Commerce and Tourism at Universidad Complutense de Madrid (2021–2026).
Current role: Treasury Intern at Schneider Electric, Madrid. Assists in cash flow analysis and trade finance across international projects.
Previous experience:
Reputation analysis of favela tours in Rio de Janeiro, investigating how tourism review patterns reflect perceptions of safety, authenticity, and ethical tourism in informal urban communities.
The entire system was designed, researched, coded, and tested in a single Claude Code session.
To complete the project, Aline needs to:
"The most powerful tool isn't the scraper or the NLP model; it's the ability to decompose a vague request into a precise, executable pipeline."