# SERPER Web Search Implementation Plan
## Executive Summary
This plan details the implementation of SERPER-based web search by vendoring code from `folder/tools/web_search.py` into `src/tools/`, creating a protocol-compliant `SerperWebSearchTool`, fixing the existing `WebSearchTool`, and integrating both into the main search flow.
## Project Structure
### Project 1: Vendor and Refactor Core Web Search Components
**Goal**: Extract and vendor Serper/SearchXNG search logic from `folder/tools/web_search.py` into `src/tools/`
### Project 2: Create Protocol-Compliant SerperWebSearchTool
**Goal**: Implement `SerperWebSearchTool` class that fully complies with `SearchTool` protocol
### Project 3: Fix Existing WebSearchTool Protocol Compliance
**Goal**: Make existing `WebSearchTool` (DuckDuckGo) protocol-compliant
### Project 4: Integrate Web Search into SearchHandler
**Goal**: Add web search tools to main search flow in `src/app.py`
### Project 5: Update Callers and Dependencies
**Goal**: Update all code that uses web search to work with new implementation
### Project 6: Testing and Validation
**Goal**: Add comprehensive tests for all web search implementations
---
## Detailed Implementation Plan
### PROJECT 1: Vendor and Refactor Core Web Search Components
#### Activity 1.1: Create Vendor Module Structure
**File**: `src/tools/vendored/__init__.py`
- **Task 1.1.1**: Create `src/tools/vendored/` directory
- **Task 1.1.2**: Create `__init__.py` with exports
**File**: `src/tools/vendored/web_search_core.py`
- **Task 1.1.3**: Vendor `ScrapeResult`, `WebpageSnippet`, `SearchResults` models from `folder/tools/web_search.py` (lines 23-37)
- **Task 1.1.4**: Vendor `scrape_urls()` function (lines 274-299)
- **Task 1.1.5**: Vendor `fetch_and_process_url()` function (lines 302-348)
- **Task 1.1.6**: Vendor `html_to_text()` function (lines 351-368)
- **Task 1.1.7**: Vendor `is_valid_url()` function (lines 371-410)
- **Task 1.1.8**: Vendor `ssl_context` setup (lines 115-120)
- **Task 1.1.9**: Add imports: `aiohttp`, `asyncio`, `BeautifulSoup`, `ssl`
- **Task 1.1.10**: Add `CONTENT_LENGTH_LIMIT = 10000` constant
- **Task 1.1.11**: Add type hints following project standards
- **Task 1.1.12**: Add structlog logging
- **Task 1.1.13**: Replace `print()` statements with `logger` calls
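
A minimal sketch of what `web_search_core.py` could look like after these tasks; the model fields, the permissive SSL context, and the truncation behavior are assumptions inferred from this plan, not verified against `folder/tools/web_search.py`:

```python
import ssl

import structlog
from bs4 import BeautifulSoup
from pydantic import BaseModel

logger = structlog.get_logger()

CONTENT_LENGTH_LIMIT = 10000  # max characters of scraped page text to keep


class WebpageSnippet(BaseModel):
    """One search hit before scraping (assumed field names)."""
    url: str
    title: str
    description: str | None = None


class ScrapeResult(BaseModel):
    """Full-text result after fetching a snippet's URL."""
    url: str
    title: str
    text: str


# Permissive SSL context so scraping tolerates misconfigured certificates;
# check_hostname must be disabled before verify_mode is relaxed.
ssl_context = ssl.create_default_context()
ssl_context.check_hostname = False
ssl_context.verify_mode = ssl.CERT_NONE


def html_to_text(html: str) -> str:
    """Strip scripts/styles and return plain text, truncated to the limit."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)[:CONTENT_LENGTH_LIMIT]
```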
**File**: `src/tools/vendored/serper_client.py`
- **Task 1.1.14**: Vendor `SerperClient` class from `folder/tools/web_search.py` (lines 123-196)
- **Task 1.1.15**: Remove dependency on `ResearchAgent` and `ResearchRunner`
- **Task 1.1.16**: Replace filter agent with simple relevance filtering or remove it
- **Task 1.1.17**: Add `__init__` that takes `api_key: str | None` parameter
- **Task 1.1.18**: Update `search()` method to return `list[WebpageSnippet]` without filtering
- **Task 1.1.19**: Remove `_filter_results()` method (or make it optional)
- **Task 1.1.20**: Add error handling with `SearchError` and `RateLimitError`
- **Task 1.1.21**: Add structlog logging
- **Task 1.1.22**: Add type hints
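
A hedged sketch of the vendored client; the Serper endpoint and response shape (`organic` entries with `title`/`link`/`snippet`) reflect Serper's public API, and the `filter_for_relevance` flag is kept only for call-site compatibility since agent-based filtering is removed (Tasks 1.1.16 and 1.1.19):

```python
import aiohttp
import structlog

from src.tools.vendored.web_search_core import WebpageSnippet
from src.utils.exceptions import RateLimitError, SearchError

logger = structlog.get_logger()

SERPER_URL = "https://google.serper.dev/search"


class SerperClient:
    def __init__(self, api_key: str | None) -> None:
        self.api_key = api_key

    async def search(
        self, query: str, filter_for_relevance: bool = False, max_results: int = 10
    ) -> list[WebpageSnippet]:
        # filter_for_relevance is accepted but ignored: the agent-based
        # filtering from the original code is removed per Task 1.1.16.
        headers = {"X-API-KEY": self.api_key or "", "Content-Type": "application/json"}
        async with aiohttp.ClientSession() as session:
            async with session.post(SERPER_URL, json={"q": query}, headers=headers) as resp:
                if resp.status == 429:
                    raise RateLimitError("Serper rate limit exceeded")
                if resp.status != 200:
                    raise SearchError(f"Serper returned HTTP {resp.status}")
                data = await resp.json()
        logger.info("serper_search", query=query, hits=len(data.get("organic", [])))
        return [
            WebpageSnippet(
                url=r.get("link", ""),
                title=r.get("title", ""),
                description=r.get("snippet"),
            )
            for r in data.get("organic", [])[:max_results]
        ]
```

The `SearchXNGClient` in the next file can mirror this shape, taking `host` in `__init__` and querying the instance's JSON search endpoint instead.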
**File**: `src/tools/vendored/searchxng_client.py`
- **Task 1.1.23**: Vendor `SearchXNGClient` class from `folder/tools/web_search.py` (lines 199-271)
- **Task 1.1.24**: Remove dependency on `ResearchAgent` and `ResearchRunner`
- **Task 1.1.25**: Replace filter agent with simple relevance filtering or remove it
- **Task 1.1.26**: Add `__init__` that takes `host: str` parameter
- **Task 1.1.27**: Update `search()` method to return `list[WebpageSnippet]` without filtering
- **Task 1.1.28**: Remove `_filter_results()` method (or make it optional)
- **Task 1.1.29**: Add error handling with `SearchError` and `RateLimitError`
- **Task 1.1.30**: Add structlog logging
- **Task 1.1.31**: Add type hints
#### Activity 1.2: Create Rate Limiting for Web Search
**File**: `src/tools/rate_limiter.py`
- **Task 1.2.1**: Add `get_serper_limiter()` function (rate: "10/second" with API key)
- **Task 1.2.2**: Add `get_searchxng_limiter()` function (rate: "5/second")
- **Task 1.2.3**: Use `RateLimiterFactory.get()` pattern
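
A sketch of the two accessors, to be added inside `src/tools/rate_limiter.py`; the `RateLimiterFactory.get(key, rate)` signature is an assumption based on the "`RateLimiterFactory.get()` pattern" named above:

```python
def get_serper_limiter(api_key: str):
    # One bucket per API key so tools configured with different keys
    # don't throttle each other (assumed factory signature).
    return RateLimiterFactory.get(f"serper:{api_key}", rate="10/second")


def get_searchxng_limiter():
    return RateLimiterFactory.get("searchxng", rate="5/second")
```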
---
### PROJECT 2: Create Protocol-Compliant SerperWebSearchTool
#### Activity 2.1: Implement SerperWebSearchTool Class
**File**: `src/tools/serper_web_search.py`
- **Task 2.1.1**: Create new file `src/tools/serper_web_search.py`
- **Task 2.1.2**: Add imports:
- `from src.tools.base import SearchTool`
- `from src.tools.vendored.serper_client import SerperClient`
- `from src.tools.vendored.web_search_core import scrape_urls, WebpageSnippet`
- `from src.tools.rate_limiter import get_serper_limiter`
- `from src.tools.query_utils import preprocess_query`
- `from src.utils.config import settings`
- `from src.utils.exceptions import ConfigurationError, SearchError, RateLimitError` (Task 2.1.4.2 raises `ConfigurationError`)
- `from src.utils.models import Citation, Evidence`
- `import structlog`
- `from tenacity import retry, stop_after_attempt, wait_exponential`
- **Task 2.1.3**: Create `SerperWebSearchTool` class
- **Task 2.1.4**: Add `__init__(self, api_key: str | None = None)` method
- Line 2.1.4.1: Get API key from parameter or `settings.serper_api_key`
- Line 2.1.4.2: Validate API key is not None, raise `ConfigurationError` if missing
- Line 2.1.4.3: Initialize `SerperClient(api_key=self.api_key)`
- Line 2.1.4.4: Get rate limiter: `self._limiter = get_serper_limiter(self.api_key)`
- **Task 2.1.5**: Add `@property def name(self) -> str:` returning `"serper"`
- **Task 2.1.6**: Add `async def _rate_limit(self) -> None:` method
- Line 2.1.6.1: Call `await self._limiter.acquire()`
- **Task 2.1.7**: Add `@retry(...)` decorator with exponential backoff
- **Task 2.1.8**: Add `async def search(self, query: str, max_results: int = 10) -> list[Evidence]:` method
- Line 2.1.8.1: Call `await self._rate_limit()`
- Line 2.1.8.2: Preprocess query: `clean_query = preprocess_query(query)`
- Line 2.1.8.3: Fall back to the raw query if preprocessing yields nothing: `clean_query = clean_query or query`
- Line 2.1.8.4: Call `search_results = await self._client.search(clean_query, filter_for_relevance=False, max_results=max_results)`
- Line 2.1.8.5: Call `scraped = await scrape_urls(search_results)`
- Line 2.1.8.6: Convert `ScrapeResult` to `Evidence` objects:
- Line 2.1.8.6.1: Create `Citation` with `title`, `url`, `source="serper"`, `date="Unknown"`, `authors=[]`
- Line 2.1.8.6.2: Create `Evidence` with `content=scraped.text`, `citation`, `relevance=0.0`
- Line 2.1.8.7: Return `list[Evidence]`
- Line 2.1.8.8: Add try/except for HTTP errors; note the vendored client is `aiohttp`-based, so catch `aiohttp.ClientResponseError` rather than `httpx.HTTPStatusError`:
- Line 2.1.8.8.1: Check for 429 status, raise `RateLimitError`
- Line 2.1.8.8.2: Otherwise raise `SearchError`
- Line 2.1.8.9: Add try/except for timeouts (`asyncio.TimeoutError`), raise `SearchError`
- Line 2.1.8.10: Add a generic exception handler that logs and raises `SearchError`
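
Pulling the tasks above together, a condensed sketch of the tool; `Evidence`/`Citation` construction follows Lines 2.1.8.6.1-2.1.8.6.2, and the limiter/client interfaces follow the earlier sketches rather than verified code:

```python
import structlog
from tenacity import retry, stop_after_attempt, wait_exponential

from src.tools.query_utils import preprocess_query
from src.tools.rate_limiter import get_serper_limiter
from src.tools.vendored.serper_client import SerperClient
from src.tools.vendored.web_search_core import scrape_urls
from src.utils.config import settings
from src.utils.exceptions import ConfigurationError, RateLimitError, SearchError
from src.utils.models import Citation, Evidence

logger = structlog.get_logger()


class SerperWebSearchTool:
    def __init__(self, api_key: str | None = None) -> None:
        self.api_key = api_key or settings.serper_api_key
        if not self.api_key:
            raise ConfigurationError("SERPER_API_KEY is required for SerperWebSearchTool")
        self._client = SerperClient(api_key=self.api_key)
        self._limiter = get_serper_limiter(self.api_key)

    @property
    def name(self) -> str:
        return "serper"

    async def _rate_limit(self) -> None:
        await self._limiter.acquire()

    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, max=10))
    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        await self._rate_limit()
        clean_query = preprocess_query(query) or query
        try:
            snippets = await self._client.search(
                clean_query, filter_for_relevance=False, max_results=max_results
            )
            scraped = await scrape_urls(snippets)
        except (RateLimitError, SearchError):
            raise  # already classified by the vendored client
        except Exception as exc:
            logger.error("serper_search_failed", query=clean_query, error=str(exc))
            raise SearchError(f"Serper search failed: {exc}") from exc
        return [
            Evidence(
                content=result.text,
                citation=Citation(
                    title=result.title,
                    url=result.url,
                    source="serper",
                    date="Unknown",
                    authors=[],
                ),
                relevance=0.0,
            )
            for result in scraped
        ]
```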
#### Activity 2.2: Implement SearchXNGWebSearchTool Class
**File**: `src/tools/searchxng_web_search.py`
- **Task 2.2.1**: Create new file `src/tools/searchxng_web_search.py`
- **Task 2.2.2**: Add imports (similar to SerperWebSearchTool)
- **Task 2.2.3**: Create `SearchXNGWebSearchTool` class
- **Task 2.2.4**: Add `__init__(self, host: str | None = None)` method
- Line 2.2.4.1: Get host from parameter or `settings.searchxng_host`
- Line 2.2.4.2: Validate host is not None, raise `ConfigurationError` if missing
- Line 2.2.4.3: Initialize `SearchXNGClient(host=self.host)`
- Line 2.2.4.4: Get rate limiter: `self._limiter = get_searchxng_limiter()`
- **Task 2.2.5**: Add `@property def name(self) -> str:` returning `"searchxng"`
- **Task 2.2.6**: Add `async def _rate_limit(self) -> None:` method
- **Task 2.2.7**: Add `@retry(...)` decorator
- **Task 2.2.8**: Add `async def search(self, query: str, max_results: int = 10) -> list[Evidence]:` method
- Line 2.2.8.1-2.2.8.10: Similar structure to SerperWebSearchTool
---
### PROJECT 3: Fix Existing WebSearchTool Protocol Compliance
#### Activity 3.1: Update WebSearchTool Class
**File**: `src/tools/web_search.py`
- **Task 3.1.1**: Add `@property def name(self) -> str:` method returning `"duckduckgo"` (after line 17)
- **Task 3.1.2**: Change `search()` return type from `SearchResult` to `list[Evidence]` (line 19)
- **Task 3.1.3**: Update `search()` method body:
- Line 3.1.3.1: Keep existing search logic (lines 21-43)
- Line 3.1.3.2: Instead of returning `SearchResult`, return `evidence` list directly (line 44)
- Line 3.1.3.3: Update the exception handler so it no longer returns an error `SearchResult` (Task 3.1.4.2 below raises `SearchError` instead)
- **Task 3.1.4**: Add imports if needed:
- Line 3.1.4.1: `from src.utils.exceptions import SearchError`
- Line 3.1.4.2: Update exception handling to raise `SearchError` instead of returning error `SearchResult`
- **Task 3.1.5**: Add query preprocessing:
- Line 3.1.5.1: Import `from src.tools.query_utils import preprocess_query`
- Line 3.1.5.2: Add `clean_query = preprocess_query(query)` before search
- Line 3.1.5.3: Use `clean_query if clean_query else query`
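
A sketch of the updated tool under these assumptions: the `duckduckgo-search` package's `DDGS().text()` returns rows with `title`/`href`/`body` keys and is synchronous, so it is pushed off the event loop here; `Evidence`/`Citation` fields mirror the Serper sketch above:

```python
import asyncio

import structlog
from duckduckgo_search import DDGS

from src.tools.query_utils import preprocess_query
from src.utils.exceptions import SearchError
from src.utils.models import Citation, Evidence

logger = structlog.get_logger()


class WebSearchTool:
    @property
    def name(self) -> str:
        return "duckduckgo"

    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        clean_query = preprocess_query(query) or query
        try:
            # DDGS is synchronous; run it in a worker thread
            rows = await asyncio.to_thread(
                lambda: list(DDGS().text(clean_query, max_results=max_results))
            )
        except Exception as exc:
            logger.error("duckduckgo_search_failed", query=clean_query, error=str(exc))
            raise SearchError(f"DuckDuckGo search failed: {exc}") from exc
        return [
            Evidence(
                content=row.get("body", ""),
                citation=Citation(
                    title=row.get("title", ""),
                    url=row.get("href", ""),
                    source="duckduckgo",
                    date="Unknown",
                    authors=[],
                ),
                relevance=0.0,
            )
            for row in rows
        ]
```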
#### Activity 3.2: Update Retrieval Agent Caller
**File**: `src/agents/retrieval_agent.py`
- **Task 3.2.1**: Update `search_web()` function (line 31):
- Line 3.2.1.1: Change `results = await _web_search.search(query, max_results)`
- Line 3.2.1.2: Change to `evidence = await _web_search.search(query, max_results)`
- Line 3.2.1.3: Update check: `if not evidence:` instead of `if not results.evidence:`
- Line 3.2.1.4: Update state update: `new_count = state.add_evidence(evidence)` instead of `results.evidence`
- Line 3.2.1.5: Update logging: `results_found=len(evidence)` instead of `len(results.evidence)`
- Line 3.2.1.6: Update output formatting: `for i, r in enumerate(evidence[:max_results], 1):` instead of `results.evidence[:max_results]`
- Line 3.2.1.7: Update deduplication: `await state.embedding_service.deduplicate(evidence)` instead of `results.evidence`
- Line 3.2.1.8: Update output message: `Found {len(evidence)} web results` instead of `len(results.evidence)`
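
The net effect of these renames, as a sketch; `_web_search`, `state`, and the dedup/`add_evidence` helpers are assumed to exist in `src/agents/retrieval_agent.py` with the names the task list uses:

```python
async def search_web(query: str, max_results: int = 10) -> str:
    evidence = await _web_search.search(query, max_results)  # now list[Evidence]
    if not evidence:
        return "No web results found."
    evidence = await state.embedding_service.deduplicate(evidence)
    new_count = state.add_evidence(evidence)
    logger.info("web_search_complete", results_found=len(evidence), new_count=new_count)
    lines = [
        f"{i}. {e.citation.title} ({e.citation.url})"
        for i, e in enumerate(evidence[:max_results], 1)
    ]
    return f"Found {len(evidence)} web results:\n" + "\n".join(lines)
```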
---
### PROJECT 4: Integrate Web Search into SearchHandler
#### Activity 4.1: Create Web Search Tool Factory
**File**: `src/tools/web_search_factory.py`
- **Task 4.1.1**: Create new file `src/tools/web_search_factory.py`
- **Task 4.1.2**: Add imports:
- `from src.tools.base import SearchTool` (needed for the return annotation in Task 4.1.4)
- `from src.tools.web_search import WebSearchTool`
- `from src.tools.serper_web_search import SerperWebSearchTool`
- `from src.tools.searchxng_web_search import SearchXNGWebSearchTool`
- `from src.utils.config import settings`
- `from src.utils.exceptions import ConfigurationError`
- `import structlog`
- **Task 4.1.3**: Add `logger = structlog.get_logger()`
- **Task 4.1.4**: Create `def create_web_search_tool() -> SearchTool | None:` function
- Line 4.1.4.1: Check `settings.web_search_provider`
- Line 4.1.4.2: If `"serper"`:
- Line 4.1.4.2.1: Check `settings.serper_api_key` or `settings.web_search_available()`
- Line 4.1.4.2.2: If available, return `SerperWebSearchTool()`
- Line 4.1.4.2.3: Else log warning and return `None`
- Line 4.1.4.3: If `"searchxng"`:
- Line 4.1.4.3.1: Check `settings.searchxng_host` or `settings.web_search_available()`
- Line 4.1.4.3.2: If available, return `SearchXNGWebSearchTool()`
- Line 4.1.4.3.3: Else log warning and return `None`
- Line 4.1.4.4: If `"duckduckgo"`:
- Line 4.1.4.4.1: Return `WebSearchTool()` (always available)
- Line 4.1.4.5: If `"brave"` or `"tavily"`:
- Line 4.1.4.5.1: Log warning "Not yet implemented"
- Line 4.1.4.5.2: Return `None`
- Line 4.1.4.6: Default: return `WebSearchTool()` (fallback to DuckDuckGo)
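
The branching above, as a sketch; the settings field names are the ones this plan assumes in `src/utils/config.py`:

```python
import structlog

from src.tools.base import SearchTool
from src.tools.searchxng_web_search import SearchXNGWebSearchTool
from src.tools.serper_web_search import SerperWebSearchTool
from src.tools.web_search import WebSearchTool
from src.utils.config import settings

logger = structlog.get_logger()


def create_web_search_tool() -> SearchTool | None:
    provider = settings.web_search_provider
    if provider == "serper":
        if settings.serper_api_key:
            return SerperWebSearchTool()
        logger.warning("serper_not_configured", reason="SERPER_API_KEY missing")
        return None
    if provider == "searchxng":
        if settings.searchxng_host:
            return SearchXNGWebSearchTool()
        logger.warning("searchxng_not_configured", reason="SEARCHXNG_HOST missing")
        return None
    if provider in ("brave", "tavily"):
        logger.warning("web_search_provider_not_implemented", provider=provider)
        return None
    # "duckduckgo" and any unrecognized value fall back to the keyless tool
    return WebSearchTool()
```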
#### Activity 4.2: Update SearchHandler Initialization
**File**: `src/app.py`
- **Task 4.2.1**: Add import: `from src.tools.web_search_factory import create_web_search_tool`
- **Task 4.2.2**: Update `configure_orchestrator()` function (around line 73):
- Line 4.2.2.1: Before creating `SearchHandler`, call `web_search_tool = create_web_search_tool()`
- Line 4.2.2.2: Create tools list: `tools = [PubMedTool(), ClinicalTrialsTool(), EuropePMCTool()]`
- Line 4.2.2.3: If `web_search_tool is not None`:
- Line 4.2.2.3.1: Append `web_search_tool` to tools list
- Line 4.2.2.3.2: Log info: "Web search tool added to search handler"
- Line 4.2.2.4: Update `SearchHandler` initialization to use `tools` list
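
A sketch of the wiring inside `configure_orchestrator()`; the `SearchHandler(tools=...)` constructor signature is assumed:

```python
from src.tools.web_search_factory import create_web_search_tool

tools = [PubMedTool(), ClinicalTrialsTool(), EuropePMCTool()]
web_search_tool = create_web_search_tool()
if web_search_tool is not None:
    tools.append(web_search_tool)
    logger.info("web_search_tool_added", tool=web_search_tool.name)
search_handler = SearchHandler(tools=tools)
```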
---
### PROJECT 5: Update Callers and Dependencies
#### Activity 5.1: Update web_search_adapter
**File**: `src/tools/web_search_adapter.py`
- **Task 5.1.1**: Update `web_search()` function to use new implementation:
- Line 5.1.1.1: Import `from src.tools.web_search_factory import create_web_search_tool`
- Line 5.1.1.2: Remove dependency on `folder.tools.web_search`
- Line 5.1.1.3: Get tool: `tool = create_web_search_tool()`
- Line 5.1.1.4: If `tool is None`, return error message
- Line 5.1.1.5: Call `evidence = await tool.search(query, max_results=5)`
- Line 5.1.1.6: Convert `Evidence` objects to formatted string:
- Line 5.1.1.6.1: Format each evidence with title, URL, content preview
- Line 5.1.1.7: Return formatted string
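
A sketch of the rewritten adapter; the string-in/string-out contract is kept so the Tool Executor and Planner Agent callers (Activities 5.2-5.3) need no changes:

```python
from src.tools.web_search_factory import create_web_search_tool


async def web_search(query: str) -> str:
    tool = create_web_search_tool()
    if tool is None:
        return "Web search is not configured (missing API key or host)."
    evidence = await tool.search(query, max_results=5)
    if not evidence:
        return "No web results found."
    return "\n\n".join(
        f"{e.citation.title}\n{e.citation.url}\n{e.content[:300]}"
        for e in evidence
    )
```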
#### Activity 5.2: Update Tool Executor
**File**: `src/tools/tool_executor.py`
- **Task 5.2.1**: Verify `web_search_adapter.web_search()` usage (line 86) still works
- **Task 5.2.2**: No changes needed if adapter is updated correctly
#### Activity 5.3: Update Planner Agent
**File**: `src/orchestrator/planner_agent.py`
- **Task 5.3.1**: Verify `web_search_adapter.web_search()` usage (line 14) still works
- **Task 5.3.2**: No changes needed if adapter is updated correctly
#### Activity 5.4: Remove Legacy Dependencies
**File**: `src/tools/web_search_adapter.py`
- **Task 5.4.1**: Remove import of `folder.llm_config` and `folder.tools.web_search`
- **Task 5.4.2**: Update error messages to reflect new implementation
---
### PROJECT 6: Testing and Validation
#### Activity 6.1: Unit Tests for Vendored Components
**File**: `tests/unit/tools/test_vendored_web_search_core.py`
- **Task 6.1.1**: Test `scrape_urls()` function
- **Task 6.1.2**: Test `fetch_and_process_url()` function
- **Task 6.1.3**: Test `html_to_text()` function
- **Task 6.1.4**: Test `is_valid_url()` function
**File**: `tests/unit/tools/test_vendored_serper_client.py`
- **Task 6.1.5**: Mock SerperClient API calls
- **Task 6.1.6**: Test successful search
- **Task 6.1.7**: Test error handling
- **Task 6.1.8**: Test rate limiting
**File**: `tests/unit/tools/test_vendored_searchxng_client.py`
- **Task 6.1.9**: Mock SearchXNGClient API calls
- **Task 6.1.10**: Test successful search
- **Task 6.1.11**: Test error handling
- **Task 6.1.12**: Test rate limiting
#### Activity 6.2: Unit Tests for Web Search Tools
**File**: `tests/unit/tools/test_serper_web_search.py`
- **Task 6.2.1**: Test `SerperWebSearchTool.__init__()` with valid API key
- **Task 6.2.2**: Test `SerperWebSearchTool.__init__()` without API key (should raise)
- **Task 6.2.3**: Test `name` property returns `"serper"`
- **Task 6.2.4**: Test `search()` returns `list[Evidence]`
- **Task 6.2.5**: Test `search()` with mocked SerperClient
- **Task 6.2.6**: Test error handling (SearchError, RateLimitError)
- **Task 6.2.7**: Test query preprocessing
- **Task 6.2.8**: Test rate limiting
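
A sketch of one such test (pytest with pytest-asyncio assumed); the patched attribute names and model fields come from the earlier sketches, not verified code:

```python
from unittest.mock import AsyncMock

import pytest

from src.tools.serper_web_search import SerperWebSearchTool
from src.tools.vendored.web_search_core import ScrapeResult, WebpageSnippet
from src.utils.models import Evidence


@pytest.mark.asyncio
async def test_search_returns_evidence(monkeypatch):
    tool = SerperWebSearchTool(api_key="test-key")
    snippet = WebpageSnippet(url="https://example.com", title="Example")
    # Stub out the network-facing pieces
    tool._client.search = AsyncMock(return_value=[snippet])
    monkeypatch.setattr(
        "src.tools.serper_web_search.scrape_urls",
        AsyncMock(return_value=[ScrapeResult(url=snippet.url, title=snippet.title, text="body")]),
    )
    results = await tool.search("heart disease biomarkers")
    assert results and isinstance(results[0], Evidence)
    assert results[0].citation.source == "serper"
```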
**File**: `tests/unit/tools/test_searchxng_web_search.py`
- **Task 6.2.9**: Similar tests for SearchXNGWebSearchTool
**File**: `tests/unit/tools/test_web_search.py`
- **Task 6.2.10**: Test `WebSearchTool.name` property returns `"duckduckgo"`
- **Task 6.2.11**: Test `WebSearchTool.search()` returns `list[Evidence]`
- **Task 6.2.12**: Test `WebSearchTool.search()` with mocked DDGS
- **Task 6.2.13**: Test error handling
- **Task 6.2.14**: Test query preprocessing
#### Activity 6.3: Integration Tests
**File**: `tests/integration/test_web_search_integration.py`
- **Task 6.3.1**: Test `SerperWebSearchTool` with real API (marked `@pytest.mark.integration`)
- **Task 6.3.2**: Test `SearchXNGWebSearchTool` with real API (marked `@pytest.mark.integration`)
- **Task 6.3.3**: Test `WebSearchTool` with real DuckDuckGo (marked `@pytest.mark.integration`)
- **Task 6.3.4**: Test `create_web_search_tool()` factory function
- **Task 6.3.5**: Test SearchHandler with web search tool
#### Activity 6.4: Update Existing Tests
**File**: `tests/unit/agents/test_retrieval_agent.py`
- **Task 6.4.1**: Update tests to expect `list[Evidence]` instead of `SearchResult`
- **Task 6.4.2**: Mock `WebSearchTool.search()` to return `list[Evidence]`
**File**: `tests/unit/tools/test_tool_executor.py`
- **Task 6.4.3**: Verify tests still pass with updated `web_search_adapter`
---
## Implementation Order
1. **PROJECT 1**: Vendor core components (foundation)
2. **PROJECT 3**: Fix existing WebSearchTool (quick win, unblocks retrieval agent)
3. **PROJECT 2**: Create SerperWebSearchTool (new functionality)
4. **PROJECT 4**: Integrate into SearchHandler (main integration)
5. **PROJECT 5**: Update callers (cleanup dependencies)
6. **PROJECT 6**: Testing (validation)
---
## Dependencies and Prerequisites
### External Dependencies
- `aiohttp` - Already in requirements
- `beautifulsoup4` - Already in requirements
- `duckduckgo-search` - Already in requirements
- `tenacity` - Already in requirements
- `structlog` - Already in requirements
### Internal Dependencies
- `src/tools/base.py` - SearchTool protocol
- `src/tools/rate_limiter.py` - Rate limiting utilities
- `src/tools/query_utils.py` - Query preprocessing
- `src/utils/config.py` - Settings and configuration
- `src/utils/exceptions.py` - Custom exceptions
- `src/utils/models.py` - Evidence, Citation models
### Configuration Requirements
- `SERPER_API_KEY` - For Serper provider
- `SEARCHXNG_HOST` - For SearchXNG provider
- `WEB_SEARCH_PROVIDER` - Environment variable (default: "duckduckgo")
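
For reference, the settings fields this plan assumes in `src/utils/config.py`, sketched in pydantic-settings style (the real module may differ):

```python
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    serper_api_key: str | None = None        # SERPER_API_KEY
    searchxng_host: str | None = None        # SEARCHXNG_HOST
    web_search_provider: str = "duckduckgo"  # WEB_SEARCH_PROVIDER
```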
---
## Risk Assessment
### High Risk
- **Breaking changes to retrieval_agent.py**: Must update carefully to handle `list[Evidence]` instead of `SearchResult`
- **Legacy folder dependencies**: Need to ensure all code is properly vendored
### Medium Risk
- **Rate limiting**: Serper API may have different limits than expected
- **Error handling**: Need to handle API failures gracefully
### Low Risk
- **Query preprocessing**: May need adjustment for web search vs PubMed
- **Testing**: Integration tests require API keys
---
## Success Criteria
1. ✅ `SerperWebSearchTool` implements `SearchTool` protocol correctly
2. ✅ `WebSearchTool` implements `SearchTool` protocol correctly
3. ✅ Both tools can be added to `SearchHandler`
4. ✅ `web_search_adapter` works with new implementation
5. ✅ `retrieval_agent` works with updated `WebSearchTool`
6. ✅ All unit tests pass
7. ✅ Integration tests pass (with API keys)
8. ✅ No dependencies on `folder/tools/web_search.py` in `src/` code
9. ✅ Configuration supports multiple providers
10. ✅ Error handling is robust
---
## Notes
- The vendored code should be self-contained and not depend on `folder/` modules
- Filter agent functionality from original code is removed (can be added later if needed)
- Rate limiting follows the same pattern as PubMed tool
- Query preprocessing may need web-specific adjustments (less aggressive than PubMed)
- Consider adding relevance scoring in the future