VietNews Advanced URL Decoder Documentation

What is the Advanced URL Decoder?

The VietNews Advanced URL Decoder converts Google News proxy URLs into their original publisher URLs. Google News search results often contain proxy URLs that redirect through Google's servers instead of linking to the publisher directly. The decoder resolves each proxy URL to the actual source URL.

Key Benefits:
  • Access original publisher content directly
  • Bypass Google News proxy redirects
  • Multiple fallback strategies ensure high success rates
  • Smart caching reduces processing time
  • Rate limit protection prevents blocking

How It Works

When you enable URL decoding in your search:

  1. The system identifies Google News proxy URLs in search results
  2. Multiple decoding strategies are applied in sequence
  3. The first successful strategy returns the original URL
  4. Results are cached for faster future lookups
  5. Rate limiting ensures sustainable operation
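
The steps above can be sketched as a simple strategy chain with a cache in front of it. This is a minimal illustration, not the decoder's actual code: the name `decode_url`, the `Strategy` type, and the plain dictionary cache are assumptions, and rate limiting (step 5) is left out for brevity.

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical type: a strategy takes a proxy URL and returns the
# original URL, or None if it cannot decode it.
Strategy = Callable[[str], Optional[str]]


def decode_url(
    proxy_url: str,
    strategies: Iterable[Strategy],
    cache: Dict[str, str],
) -> Optional[str]:
    """Return the original URL from the cache or the first strategy that succeeds."""
    if proxy_url in cache:
        return cache[proxy_url]
    for strategy in strategies:
        try:
            original = strategy(proxy_url)
        except Exception:
            # A strategy that raises simply hands over to the next one.
            continue
        if original:
            cache[proxy_url] = original  # remember for future lookups
            return original
    return None
```

Passing the strategies as an ordered list keeps the fallback order explicit and makes each strategy easy to test on its own.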

Advanced Features

🔄 Multi-Strategy Decoding

Tries each decoding method in turn until one succeeds, so a single failing method does not fail the whole lookup.

💾 Smart Caching System

Three-layer caching: in-memory (fastest, current process only), file-based (persistent on the local machine), and cloud (Supabase, shared across all instances).

🛡️ Rate Limit Protection

Circuit breakers and exponential backoff prevent your requests from being blocked.
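
To make the two mechanisms concrete, here is a minimal sketch of exponential backoff and a circuit breaker. The class and function names, the failure threshold of 5, and the backoff factor are assumptions for illustration. Only the 1.5 second base delay and the 60 second cooldown echo values mentioned elsewhere in this documentation.

```python
import time
from typing import Callable, Iterator, Optional


def backoff_delays(
    base: float = 1.5, factor: float = 2.0, retries: int = 4, cap: float = 60.0
) -> Iterator[float]:
    """Yield the wait before each retry: base, base*factor, ... capped at `cap`."""
    delay = base
    for _ in range(retries):
        yield min(delay, cap)
        delay *= factor


class CircuitBreaker:
    """Stop sending requests after too many consecutive failures."""

    def __init__(
        self,
        max_failures: int = 5,  # assumed threshold
        cooldown: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if a request may be sent right now."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            # Cooldown is over: close the breaker and start counting again.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        """Record the outcome of a request and trip the breaker if needed."""
        if success:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()
```

The injectable `clock` is a design choice that lets the cooldown be tested without actually sleeping.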

📊 Real-time Statistics

Monitor decoder performance, success rates, and strategy effectiveness.

⚡ Parallel Processing

Batch operations use concurrent processing for faster results.
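
One common way to do this in Python is a thread pool. The sketch below is an assumption about how a batch could be processed, not the decoder's own implementation. `decode_batch` and its `decode` argument are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional


def decode_batch(
    urls: List[str],
    decode: Callable[[str], Optional[str]],
    max_workers: int = 4,
) -> Dict[str, Optional[str]]:
    """Decode several URLs concurrently and map each input URL to its result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        results = list(pool.map(decode, urls))
    return dict(zip(urls, results))
```

A small worker count is deliberate here: more concurrency means more requests per second, which works against the rate limit protection described above.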

🔍 Intelligent Detection

Automatically identifies and skips non-Google News URLs.
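
A check of this kind can be as small as a hostname comparison. The sketch below assumes that proxy URLs are served from the host `news.google.com`. The real detection rules may be stricter, and the function name is hypothetical.

```python
from urllib.parse import urlparse


def is_google_news_url(url: str) -> bool:
    """Return True if the URL points at the Google News host (assumed rule)."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and parsed.hostname == "news.google.com"
```

URLs that fail this check can be returned unchanged, which avoids spending decode requests on links that are already publisher URLs.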

Cache Layers

Cache Type   Speed      Persistence    Scope
In-Memory    Fastest    Session only   Current process
File Cache   Fast       Permanent      Local machine
Supabase     Moderate   Permanent      All instances
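
The lookup order implied by the table can be sketched as follows. This is an illustrative two-layer version (memory, then a JSON file). The `LayeredCache` class is hypothetical, and the shared Supabase layer is omitted because its schema is not described here.

```python
import json
from pathlib import Path
from typing import Dict, Optional


class LayeredCache:
    """Check the in-memory layer first, then fall back to a JSON file."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.memory: Dict[str, str] = {}

    def _read_file(self) -> Dict[str, str]:
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text(encoding="utf-8"))

    def get(self, key: str) -> Optional[str]:
        if key in self.memory:
            return self.memory[key]
        value = self._read_file().get(key)
        if value is not None:
            # Promote the entry so the next lookup skips the file read.
            self.memory[key] = value
        return value

    def set(self, key: str, value: str) -> None:
        self.memory[key] = value
        data = self._read_file()
        data[key] = value
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(data), encoding="utf-8")
```

Promoting file hits into memory is what makes repeated lookups within one session fast, while the file keeps entries available after a restart.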

Decoding Strategies

The decoder uses multiple strategies in sequence. If one fails, the next is automatically tried:

1. Google News Decoder (Primary)

How it works: Uses Google's internal API endpoints to resolve the proxy URL to the publisher URL.

Pros: Most reliable and accurate method

Cons: Subject to rate limiting

Success Rate: ~85-90%

2. Regex Extraction

How it works: Extracts URLs from query parameters using pattern matching.

Pros: Very fast, no network requests

Cons: Only works with specific URL formats

Success Rate: ~60-70%
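
As an illustration of this strategy, the sketch below pulls a percent-encoded target URL out of a query parameter. The parameter names `url` and `q` are assumptions. The formats the decoder actually recognises are not listed in this documentation.

```python
import re
from typing import Optional
from urllib.parse import unquote

# Assumed pattern: a "url" or "q" parameter whose value starts with http(s),
# either literally ("https:") or percent-encoded ("https%3A").
_URL_PARAM = re.compile(r"[?&](?:url|q)=(https?(?::|%3A)[^&#]+)", re.IGNORECASE)


def extract_url_param(proxy_url: str) -> Optional[str]:
    """Return the decoded target URL found in the query string, or None."""
    match = _URL_PARAM.search(proxy_url)
    return unquote(match.group(1)) if match else None
```

Because it needs no network request, a check like this is cheap to run, but it returns None for any proxy URL that does not carry the target in its query string.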

3. HTTP Redirect Following

How it works: Follows the HTTP redirect chain to find the final destination.

Pros: Works with redirect-based URLs

Cons: Slower, requires network requests

Success Rate: ~40-50%
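
The redirect-following logic can be sketched independently of any HTTP library. In the example below, `fetch_location` is a hypothetical callable that returns the `Location` header for a URL, or None when the response is not a redirect. In real use it would issue an HTTP request. The hop limit of 10 is also an assumption.

```python
from typing import Callable, Optional
from urllib.parse import urljoin


def follow_redirects(
    url: str,
    fetch_location: Callable[[str], Optional[str]],
    max_hops: int = 10,
) -> Optional[str]:
    """Follow a redirect chain and return the final URL, or None on a loop or too many hops."""
    seen = set()
    for _ in range(max_hops):
        if url in seen:
            return None  # redirect loop
        seen.add(url)
        location = fetch_location(url)
        if location is None:
            return url  # no further redirect: this is the destination
        # Location headers may be relative, so resolve against the current URL.
        url = urljoin(url, location)
    return None  # chain longer than max_hops
```

Bounding the number of hops and tracking visited URLs keeps a misbehaving redirect chain from consuming the request budget.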

4. HTML Parsing

How it works: Downloads the page and extracts canonical URLs from meta tags.

Pros: Can find URLs embedded in page content

Cons: Slowest method, bandwidth intensive

Success Rate: ~30-40%
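
The extraction step can be sketched with Python's standard HTML parser. The example assumes the canonical URL is exposed either through `<link rel="canonical">` or through the `og:url` meta tag. Which tags the decoder actually inspects is not specified here, and the function name is hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional


class _CanonicalFinder(HTMLParser):
    """Collect the first canonical link and the first og:url meta tag."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None
        self.og_url: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link" and (attributes.get("rel") or "").lower() == "canonical":
            self.canonical = self.canonical or attributes.get("href")
        elif tag == "meta" and attributes.get("property") == "og:url":
            self.og_url = self.og_url or attributes.get("content")


def find_canonical_url(html: str) -> Optional[str]:
    """Return the canonical URL declared in the page, preferring <link rel="canonical">."""
    finder = _CanonicalFinder()
    finder.feed(html)
    return finder.canonical or finder.og_url
```

Downloading the page is the expensive part. Parsing only the head section, where these tags normally live, is one way to limit the bandwidth cost.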

How to Use URL Decoding

In Web Search

  1. Navigate to the search page
  2. Enter your search query
  3. Enable the "Decode URLs" checkbox in Advanced Options
  4. Optionally adjust the decode delay (default: 1.5s)
  5. Click Search

Note: URL decoding will increase search time, especially for large result sets. The delay is necessary to avoid rate limiting.

Understanding the Results

When URL decoding is enabled, you'll see:

  • Decoded URLs: Original publisher links in search results
  • Resolution Summary: Statistics showing successful/failed decodings
  • Cache Performance: How many URLs were retrieved from cache
  • Circuit Breaker Status: Whether rate limiting is active

Optimization Tips

  • Start with smaller result sets (max: 50) to build up the cache
  • Use conservative mode for important searches
  • Check cache statistics to understand performance
  • If you see "sorry" URLs (a sign that Google has rate-limited your requests), wait 5-10 minutes before retrying

API Reference

Search API with URL Decoding

GET /api/search?q=your+query&decode_urls=true&decode_delay=1.5&conservative=false

Parameters:

Parameter      Type      Default   Description
decode_urls    boolean   false     Enable URL decoding
decode_delay   float     1.5       Delay between decode requests (seconds)
conservative   boolean   false     Use conservative rate limiting
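
A request URL with these parameters can be assembled as shown below. The helper name and the base URL used in the usage note are placeholders. Only the path and the parameter names come from this reference.

```python
from urllib.parse import urlencode


def build_search_url(
    base_url: str,
    query: str,
    decode_urls: bool = True,
    decode_delay: float = 1.5,
    conservative: bool = False,
) -> str:
    """Build a /api/search URL with the URL decoding parameters."""
    params = {
        "q": query,
        # Booleans are sent as lowercase "true"/"false", as in the example request.
        "decode_urls": str(decode_urls).lower(),
        "decode_delay": decode_delay,
        "conservative": str(conservative).lower(),
    }
    return f"{base_url.rstrip('/')}/api/search?{urlencode(params)}"
```

For example, `build_search_url("http://localhost:8000", "your query")` produces a URL ending in `/api/search?q=your+query&decode_urls=true&decode_delay=1.5&conservative=false`.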

Response Format

{
  "results": [
    {
      "title": "Article Title",
      "url": "https://originalpublisher.com/article",
      "url_decoded": true,
      "published": "2024-06-27T10:30:00Z",
      "source": "Publisher Name"
    }
  ],
  "metadata": {
    "total": 50,
    "url_resolution": {
      "resolved": 45,
      "failed": 5,
      "cache_hits": 30,
      "circuit_breaker": false
    }
  }
}
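
A client might summarise the `url_resolution` block as sketched here. The function name and the returned keys are illustrative. The field names read from the response are the ones shown in the format above.

```python
import json


def summarize_resolution(response_text: str) -> dict:
    """Compute the decode success rate and list results that were not decoded."""
    data = json.loads(response_text)
    stats = data["metadata"]["url_resolution"]
    attempted = stats["resolved"] + stats["failed"]
    return {
        "success_rate": stats["resolved"] / attempted if attempted else 0.0,
        "circuit_breaker": stats["circuit_breaker"],
        # Results whose url_decoded flag is false or missing still carry a proxy URL.
        "undecoded": [r["url"] for r in data["results"] if not r.get("url_decoded")],
    }
```

With the sample values above (45 resolved, 5 failed), the success rate works out to 0.9.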

Troubleshooting

Common Issues

🚫 Getting "Sorry" URLs

Cause: Google has rate-limited your requests

Solution:

  • Wait 5-10 minutes before retrying
  • Use conservative mode (conservative=true in the API, or the --conservative flag)
  • Increase decode delay to 3+ seconds
  • Reduce batch size
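
If you process results in your own code, a "sorry" URL can be detected before it is stored or retried. The sketch below assumes that these are URLs on a Google host whose path starts with `/sorry`, which is where Google's rate-limit page is served. The function name is hypothetical.

```python
from urllib.parse import urlparse


def is_sorry_url(url: str) -> bool:
    """Return True if the URL looks like Google's rate-limit ("sorry") page."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    on_google = host == "google.com" or host.endswith(".google.com")
    return on_google and parsed.path.startswith("/sorry")
```

Treating such a URL as a failed decode, instead of caching it, avoids polluting the cache with rate-limit pages.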

🐌 Slow Decoding Performance

Cause: No cache hits, all URLs being decoded fresh

Solution:

  • Start with smaller batches to build cache
  • Reuse similar queries to benefit from cache
  • Check cache statistics

❌ Circuit Breaker Tripped

Cause: Too many consecutive failures

Solution:

  • Wait for cooldown period (60-300 seconds)
  • Check problematic URLs list
  • Use different search queries

Cache Management

Cache files are stored in:

~/.cache/vietnews/url_resolution_cache.json    # Decoded URLs
~/.cache/vietnews/problematic_urls.txt         # Blacklisted URLs
~/.cache/vietnews/circuit_breaker_state.json   # Circuit breaker state

To reset the decoder: Delete these cache files and restart the application.
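
The reset can also be scripted. The helper below is a small sketch, not part of VietNews. It deletes only the three files listed above and leaves anything else in the directory alone.

```python
from pathlib import Path
from typing import List

CACHE_FILES = (
    "url_resolution_cache.json",
    "problematic_urls.txt",
    "circuit_breaker_state.json",
)


def reset_decoder_cache(
    cache_dir: Path = Path.home() / ".cache" / "vietnews",
) -> List[str]:
    """Delete the decoder's cache files and return the names that were removed."""
    removed = []
    for name in CACHE_FILES:
        path = cache_dir / name
        if path.exists():
            path.unlink()
            removed.append(name)
    return removed
```

Run it while the application is stopped, then restart the application as described above.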