# SERPER Web Search Implementation Plan
## Executive Summary
This plan details the implementation of SERPER-based web search by vendoring code from `folder/tools/web_search.py` into `src/tools/`, creating a protocol-compliant `SerperWebSearchTool`, fixing the existing `WebSearchTool`, and integrating both into the main search flow.
## Project Structure
### Project 1: Vendor and Refactor Core Web Search Components
**Goal**: Extract and vendor Serper/SearchXNG search logic from `folder/tools/web_search.py` into `src/tools/`
### Project 2: Create Protocol-Compliant SerperWebSearchTool
**Goal**: Implement `SerperWebSearchTool` class that fully complies with `SearchTool` protocol
### Project 3: Fix Existing WebSearchTool Protocol Compliance
**Goal**: Make existing `WebSearchTool` (DuckDuckGo) protocol-compliant
### Project 4: Integrate Web Search into SearchHandler
**Goal**: Add web search tools to main search flow in `src/app.py`
### Project 5: Update Callers and Dependencies
**Goal**: Update all code that uses web search to work with new implementation
### Project 6: Testing and Validation
**Goal**: Add comprehensive tests for all web search implementations
---
## Detailed Implementation Plan
### PROJECT 1: Vendor and Refactor Core Web Search Components
#### Activity 1.1: Create Vendor Module Structure
**File**: `src/tools/vendored/__init__.py`
- **Task 1.1.1**: Create `src/tools/vendored/` directory
- **Task 1.1.2**: Create `__init__.py` with exports
**File**: `src/tools/vendored/web_search_core.py`
- **Task 1.1.3**: Vendor `ScrapeResult`, `WebpageSnippet`, `SearchResults` models from `folder/tools/web_search.py` (lines 23-37)
- **Task 1.1.4**: Vendor `scrape_urls()` function (lines 274-299)
- **Task 1.1.5**: Vendor `fetch_and_process_url()` function (lines 302-348)
- **Task 1.1.6**: Vendor `html_to_text()` function (lines 351-368)
- **Task 1.1.7**: Vendor `is_valid_url()` function (lines 371-410)
- **Task 1.1.8**: Vendor `ssl_context` setup (lines 115-120)
- **Task 1.1.9**: Add imports: `aiohttp`, `asyncio`, `BeautifulSoup`, `ssl`
- **Task 1.1.10**: Add `CONTENT_LENGTH_LIMIT = 10000` constant
- **Task 1.1.11**: Add type hints following project standards
- **Task 1.1.12**: Add structlog logging
- **Task 1.1.13**: Replace `print()` statements with `logger` calls
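
A minimal sketch of what `web_search_core.py` could look like after these tasks; the model fields, the permissive SSL context, and the truncation behavior are assumptions inferred from this plan, not verified against `folder/tools/web_search.py`:

```python
import ssl

import structlog
from bs4 import BeautifulSoup
from pydantic import BaseModel

logger = structlog.get_logger()

CONTENT_LENGTH_LIMIT = 10000  # max characters of scraped page text to keep


class WebpageSnippet(BaseModel):
    """One search hit before scraping (assumed field names)."""
    url: str
    title: str
    description: str | None = None


class ScrapeResult(BaseModel):
    """Full-text result after fetching a snippet's URL."""
    url: str
    title: str
    text: str


# Permissive SSL context so scraping tolerates misconfigured certificates;
# check_hostname must be disabled before verify_mode is relaxed.
ssl_context = ssl.create_default_context()
ssl_context.check_hostname = False
ssl_context.verify_mode = ssl.CERT_NONE


def html_to_text(html: str) -> str:
    """Strip scripts/styles and return plain text, truncated to the limit."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)[:CONTENT_LENGTH_LIMIT]
```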
**File**: `src/tools/vendored/serper_client.py`
- **Task 1.1.14**: Vendor `SerperClient` class from `folder/tools/web_search.py` (lines 123-196)
- **Task 1.1.15**: Remove dependency on `ResearchAgent` and `ResearchRunner`
- **Task 1.1.16**: Replace filter agent with simple relevance filtering or remove it
- **Task 1.1.17**: Add `__init__` that takes `api_key: str | None` parameter
- **Task 1.1.18**: Update `search()` method to return `list[WebpageSnippet]` without filtering
- **Task 1.1.19**: Remove `_filter_results()` method (or make it optional)
- **Task 1.1.20**: Add error handling with `SearchError` and `RateLimitError`
- **Task 1.1.21**: Add structlog logging
- **Task 1.1.22**: Add type hints
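
A hedged sketch of the vendored client; the Serper endpoint and response shape (`organic` entries with `title`/`link`/`snippet`) reflect Serper's public API, and the `filter_for_relevance` flag is kept only for call-site compatibility since agent-based filtering is removed (Tasks 1.1.16 and 1.1.19):

```python
import aiohttp
import structlog

from src.tools.vendored.web_search_core import WebpageSnippet
from src.utils.exceptions import RateLimitError, SearchError

logger = structlog.get_logger()

SERPER_URL = "https://google.serper.dev/search"


class SerperClient:
    def __init__(self, api_key: str | None) -> None:
        self.api_key = api_key

    async def search(
        self, query: str, filter_for_relevance: bool = False, max_results: int = 10
    ) -> list[WebpageSnippet]:
        # filter_for_relevance is accepted but ignored: the agent-based
        # filtering from the original code is removed per Task 1.1.16.
        headers = {"X-API-KEY": self.api_key or "", "Content-Type": "application/json"}
        async with aiohttp.ClientSession() as session:
            async with session.post(SERPER_URL, json={"q": query}, headers=headers) as resp:
                if resp.status == 429:
                    raise RateLimitError("Serper rate limit exceeded")
                if resp.status != 200:
                    raise SearchError(f"Serper returned HTTP {resp.status}")
                data = await resp.json()
        logger.info("serper_search", query=query, hits=len(data.get("organic", [])))
        return [
            WebpageSnippet(
                url=r.get("link", ""),
                title=r.get("title", ""),
                description=r.get("snippet"),
            )
            for r in data.get("organic", [])[:max_results]
        ]
```

The `SearchXNGClient` in the next file can mirror this shape, taking `host` in `__init__` and querying the instance's JSON search endpoint instead.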
**File**: `src/tools/vendored/searchxng_client.py`
- **Task 1.1.23**: Vendor `SearchXNGClient` class from `folder/tools/web_search.py` (lines 199-271)
- **Task 1.1.24**: Remove dependency on `ResearchAgent` and `ResearchRunner`
- **Task 1.1.25**: Replace filter agent with simple relevance filtering or remove it
- **Task 1.1.26**: Add `__init__` that takes `host: str` parameter
- **Task 1.1.27**: Update `search()` method to return `list[WebpageSnippet]` without filtering
- **Task 1.1.28**: Remove `_filter_results()` method (or make it optional)
- **Task 1.1.29**: Add error handling with `SearchError` and `RateLimitError`
- **Task 1.1.30**: Add structlog logging
- **Task 1.1.31**: Add type hints
#### Activity 1.2: Create Rate Limiting for Web Search
**File**: `src/tools/rate_limiter.py`
- **Task 1.2.1**: Add `get_serper_limiter()` function (rate: "10/second" with API key)
- **Task 1.2.2**: Add `get_searchxng_limiter()` function (rate: "5/second")
- **Task 1.2.3**: Use `RateLimiterFactory.get()` pattern
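
A sketch of the two accessors, to be added inside `src/tools/rate_limiter.py`; the `RateLimiterFactory.get(key, rate)` signature is an assumption based on the "`RateLimiterFactory.get()` pattern" named above:

```python
def get_serper_limiter(api_key: str):
    # One bucket per API key so tools configured with different keys
    # don't throttle each other (assumed factory signature).
    return RateLimiterFactory.get(f"serper:{api_key}", rate="10/second")


def get_searchxng_limiter():
    return RateLimiterFactory.get("searchxng", rate="5/second")
```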
---
### PROJECT 2: Create Protocol-Compliant SerperWebSearchTool
#### Activity 2.1: Implement SerperWebSearchTool Class
**File**: `src/tools/serper_web_search.py`
- **Task 2.1.1**: Create new file `src/tools/serper_web_search.py`
- **Task 2.1.2**: Add imports:
- `from src.tools.base import SearchTool`
- `from src.tools.vendored.serper_client import SerperClient`
- `from src.tools.vendored.web_search_core import scrape_urls, WebpageSnippet`
- `from src.tools.rate_limiter import get_serper_limiter`
- `from src.tools.query_utils import preprocess_query`
- `from src.utils.config import settings`
- `from src.utils.exceptions import ConfigurationError, SearchError, RateLimitError` (Task 2.1.4.2 raises `ConfigurationError`)
- `from src.utils.models import Citation, Evidence`
- `import structlog`
- `from tenacity import retry, stop_after_attempt, wait_exponential`
- **Task 2.1.3**: Create `SerperWebSearchTool` class
- **Task 2.1.4**: Add `__init__(self, api_key: str | None = None)` method
- Line 2.1.4.1: Get API key from parameter or `settings.serper_api_key`
- Line 2.1.4.2: Validate API key is not None, raise `ConfigurationError` if missing
- Line 2.1.4.3: Initialize `SerperClient(api_key=self.api_key)`
- Line 2.1.4.4: Get rate limiter: `self._limiter = get_serper_limiter(self.api_key)`
- **Task 2.1.5**: Add `@property def name(self) -> str:` returning `"serper"`
- **Task 2.1.6**: Add `async def _rate_limit(self) -> None:` method
- Line 2.1.6.1: Call `await self._limiter.acquire()`
- **Task 2.1.7**: Add `@retry(...)` decorator with exponential backoff
- **Task 2.1.8**: Add `async def search(self, query: str, max_results: int = 10) -> list[Evidence]:` method
- Line 2.1.8.1: Call `await self._rate_limit()`
- Line 2.1.8.2: Preprocess query: `clean_query = preprocess_query(query)`
- Line 2.1.8.3: Fall back to the raw query if preprocessing yields nothing: `clean_query = clean_query or query`
- Line 2.1.8.4: Call `search_results = await self._client.search(clean_query, filter_for_relevance=False, max_results=max_results)`
- Line 2.1.8.5: Call `scraped = await scrape_urls(search_results)`
- Line 2.1.8.6: Convert `ScrapeResult` to `Evidence` objects:
- Line 2.1.8.6.1: Create `Citation` with `title`, `url`, `source="serper"`, `date="Unknown"`, `authors=[]`
- Line 2.1.8.6.2: Create `Evidence` with `content=scraped.text`, `citation`, `relevance=0.0`
- Line 2.1.8.7: Return `list[Evidence]`
- Line 2.1.8.8: Add try/except for HTTP errors; note the vendored client is `aiohttp`-based, so catch `aiohttp.ClientResponseError` rather than `httpx.HTTPStatusError`:
- Line 2.1.8.8.1: Check for 429 status, raise `RateLimitError`
- Line 2.1.8.8.2: Otherwise raise `SearchError`
- Line 2.1.8.9: Add try/except for timeouts (`asyncio.TimeoutError`), raise `SearchError`
- Line 2.1.8.10: Add a generic exception handler that logs and raises `SearchError`
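
Pulling the tasks above together, a condensed sketch of the tool; `Evidence`/`Citation` construction follows Lines 2.1.8.6.1-2.1.8.6.2, and the limiter/client interfaces follow the earlier sketches rather than verified code:

```python
import structlog
from tenacity import retry, stop_after_attempt, wait_exponential

from src.tools.query_utils import preprocess_query
from src.tools.rate_limiter import get_serper_limiter
from src.tools.vendored.serper_client import SerperClient
from src.tools.vendored.web_search_core import scrape_urls
from src.utils.config import settings
from src.utils.exceptions import ConfigurationError, RateLimitError, SearchError
from src.utils.models import Citation, Evidence

logger = structlog.get_logger()


class SerperWebSearchTool:
    def __init__(self, api_key: str | None = None) -> None:
        self.api_key = api_key or settings.serper_api_key
        if not self.api_key:
            raise ConfigurationError("SERPER_API_KEY is required for SerperWebSearchTool")
        self._client = SerperClient(api_key=self.api_key)
        self._limiter = get_serper_limiter(self.api_key)

    @property
    def name(self) -> str:
        return "serper"

    async def _rate_limit(self) -> None:
        await self._limiter.acquire()

    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, max=10))
    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        await self._rate_limit()
        clean_query = preprocess_query(query) or query
        try:
            snippets = await self._client.search(
                clean_query, filter_for_relevance=False, max_results=max_results
            )
            scraped = await scrape_urls(snippets)
        except (RateLimitError, SearchError):
            raise  # already classified by the vendored client
        except Exception as exc:
            logger.error("serper_search_failed", query=clean_query, error=str(exc))
            raise SearchError(f"Serper search failed: {exc}") from exc
        return [
            Evidence(
                content=result.text,
                citation=Citation(
                    title=result.title,
                    url=result.url,
                    source="serper",
                    date="Unknown",
                    authors=[],
                ),
                relevance=0.0,
            )
            for result in scraped
        ]
```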
#### Activity 2.2: Implement SearchXNGWebSearchTool Class
**File**: `src/tools/searchxng_web_search.py`
- **Task 2.2.1**: Create new file `src/tools/searchxng_web_search.py`
- **Task 2.2.2**: Add imports (similar to SerperWebSearchTool)
- **Task 2.2.3**: Create `SearchXNGWebSearchTool` class
- **Task 2.2.4**: Add `__init__(self, host: str | None = None)` method
- Line 2.2.4.1: Get host from parameter or `settings.searchxng_host`
- Line 2.2.4.2: Validate host is not None, raise `ConfigurationError` if missing
- Line 2.2.4.3: Initialize `SearchXNGClient(host=self.host)`
- Line 2.2.4.4: Get rate limiter: `self._limiter = get_searchxng_limiter()`
- **Task 2.2.5**: Add `@property def name(self) -> str:` returning `"searchxng"`
- **Task 2.2.6**: Add `async def _rate_limit(self) -> None:` method
- **Task 2.2.7**: Add `@retry(...)` decorator
- **Task 2.2.8**: Add `async def search(self, query: str, max_results: int = 10) -> list[Evidence]:` method
- Line 2.2.8.1-2.2.8.10: Similar structure to SerperWebSearchTool
---
### PROJECT 3: Fix Existing WebSearchTool Protocol Compliance
#### Activity 3.1: Update WebSearchTool Class
**File**: `src/tools/web_search.py`
- **Task 3.1.1**: Add `@property def name(self) -> str:` method returning `"duckduckgo"` (after line 17)
- **Task 3.1.2**: Change `search()` return type from `SearchResult` to `list[Evidence]` (line 19)
- **Task 3.1.3**: Update `search()` method body:
- Line 3.1.3.1: Keep existing search logic (lines 21-43)
- Line 3.1.3.2: Instead of returning `SearchResult`, return `evidence` list directly (line 44)
- Line 3.1.3.3: Update the exception handler so it no longer returns an error `SearchResult` (Task 3.1.4.2 below raises `SearchError` instead)
- **Task 3.1.4**: Add imports if needed:
- Line 3.1.4.1: `from src.utils.exceptions import SearchError`
- Line 3.1.4.2: Update exception handling to raise `SearchError` instead of returning error `SearchResult`
- **Task 3.1.5**: Add query preprocessing:
- Line 3.1.5.1: Import `from src.tools.query_utils import preprocess_query`
- Line 3.1.5.2: Add `clean_query = preprocess_query(query)` before search
- Line 3.1.5.3: Use `clean_query if clean_query else query`
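
A sketch of the updated tool under these assumptions: the `duckduckgo-search` package's `DDGS().text()` returns rows with `title`/`href`/`body` keys and is synchronous, so it is pushed off the event loop here; `Evidence`/`Citation` fields mirror the Serper sketch above:

```python
import asyncio

import structlog
from duckduckgo_search import DDGS

from src.tools.query_utils import preprocess_query
from src.utils.exceptions import SearchError
from src.utils.models import Citation, Evidence

logger = structlog.get_logger()


class WebSearchTool:
    @property
    def name(self) -> str:
        return "duckduckgo"

    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        clean_query = preprocess_query(query) or query
        try:
            # DDGS is synchronous; run it in a worker thread
            rows = await asyncio.to_thread(
                lambda: list(DDGS().text(clean_query, max_results=max_results))
            )
        except Exception as exc:
            logger.error("duckduckgo_search_failed", query=clean_query, error=str(exc))
            raise SearchError(f"DuckDuckGo search failed: {exc}") from exc
        return [
            Evidence(
                content=row.get("body", ""),
                citation=Citation(
                    title=row.get("title", ""),
                    url=row.get("href", ""),
                    source="duckduckgo",
                    date="Unknown",
                    authors=[],
                ),
                relevance=0.0,
            )
            for row in rows
        ]
```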
#### Activity 3.2: Update Retrieval Agent Caller
**File**: `src/agents/retrieval_agent.py`
- **Task 3.2.1**: Update `search_web()` function (line 31):
- Line 3.2.1.1: Change `results = await _web_search.search(query, max_results)`
- Line 3.2.1.2: Change to `evidence = await _web_search.search(query, max_results)`
- Line 3.2.1.3: Update check: `if not evidence:` instead of `if not results.evidence:`
- Line 3.2.1.4: Update state update: `new_count = state.add_evidence(evidence)` instead of `results.evidence`
- Line 3.2.1.5: Update logging: `results_found=len(evidence)` instead of `len(results.evidence)`
- Line 3.2.1.6: Update output formatting: `for i, r in enumerate(evidence[:max_results], 1):` instead of `results.evidence[:max_results]`
- Line 3.2.1.7: Update deduplication: `await state.embedding_service.deduplicate(evidence)` instead of `results.evidence`
- Line 3.2.1.8: Update output message: `Found {len(evidence)} web results` instead of `len(results.evidence)`
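
The net effect of these renames, as a sketch; `_web_search`, `state`, and the dedup/`add_evidence` helpers are assumed to exist in `src/agents/retrieval_agent.py` with the names the task list uses:

```python
async def search_web(query: str, max_results: int = 10) -> str:
    evidence = await _web_search.search(query, max_results)  # now list[Evidence]
    if not evidence:
        return "No web results found."
    evidence = await state.embedding_service.deduplicate(evidence)
    new_count = state.add_evidence(evidence)
    logger.info("web_search_complete", results_found=len(evidence), new_count=new_count)
    lines = [
        f"{i}. {e.citation.title} ({e.citation.url})"
        for i, e in enumerate(evidence[:max_results], 1)
    ]
    return f"Found {len(evidence)} web results:\n" + "\n".join(lines)
```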
---
### PROJECT 4: Integrate Web Search into SearchHandler
#### Activity 4.1: Create Web Search Tool Factory
**File**: `src/tools/web_search_factory.py`
- **Task 4.1.1**: Create new file `src/tools/web_search_factory.py`
- **Task 4.1.2**: Add imports:
- `from src.tools.base import SearchTool` (needed for the return annotation in Task 4.1.4)
- `from src.tools.web_search import WebSearchTool`
- `from src.tools.serper_web_search import SerperWebSearchTool`
- `from src.tools.searchxng_web_search import SearchXNGWebSearchTool`
- `from src.utils.config import settings`
- `from src.utils.exceptions import ConfigurationError`
- `import structlog`
- **Task 4.1.3**: Add `logger = structlog.get_logger()`
- **Task 4.1.4**: Create `def create_web_search_tool() -> SearchTool | None:` function
- Line 4.1.4.1: Check `settings.web_search_provider`
- Line 4.1.4.2: If `"serper"`:
- Line 4.1.4.2.1: Check `settings.serper_api_key` or `settings.web_search_available()`
- Line 4.1.4.2.2: If available, return `SerperWebSearchTool()`
- Line 4.1.4.2.3: Else log warning and return `None`
- Line 4.1.4.3: If `"searchxng"`:
- Line 4.1.4.3.1: Check `settings.searchxng_host` or `settings.web_search_available()`
- Line 4.1.4.3.2: If available, return `SearchXNGWebSearchTool()`
- Line 4.1.4.3.3: Else log warning and return `None`
- Line 4.1.4.4: If `"duckduckgo"`:
- Line 4.1.4.4.1: Return `WebSearchTool()` (always available)
- Line 4.1.4.5: If `"brave"` or `"tavily"`:
- Line 4.1.4.5.1: Log warning "Not yet implemented"
- Line 4.1.4.5.2: Return `None`
- Line 4.1.4.6: Default: return `WebSearchTool()` (fallback to DuckDuckGo)
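
The branching above, as a sketch; the settings field names are the ones this plan assumes in `src/utils/config.py`:

```python
import structlog

from src.tools.base import SearchTool
from src.tools.searchxng_web_search import SearchXNGWebSearchTool
from src.tools.serper_web_search import SerperWebSearchTool
from src.tools.web_search import WebSearchTool
from src.utils.config import settings

logger = structlog.get_logger()


def create_web_search_tool() -> SearchTool | None:
    provider = settings.web_search_provider
    if provider == "serper":
        if settings.serper_api_key:
            return SerperWebSearchTool()
        logger.warning("serper_not_configured", reason="SERPER_API_KEY missing")
        return None
    if provider == "searchxng":
        if settings.searchxng_host:
            return SearchXNGWebSearchTool()
        logger.warning("searchxng_not_configured", reason="SEARCHXNG_HOST missing")
        return None
    if provider in ("brave", "tavily"):
        logger.warning("web_search_provider_not_implemented", provider=provider)
        return None
    # "duckduckgo" and any unrecognized value fall back to the keyless tool
    return WebSearchTool()
```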
#### Activity 4.2: Update SearchHandler Initialization
**File**: `src/app.py`
- **Task 4.2.1**: Add import: `from src.tools.web_search_factory import create_web_search_tool`
- **Task 4.2.2**: Update `configure_orchestrator()` function (around line 73):
- Line 4.2.2.1: Before creating `SearchHandler`, call `web_search_tool = create_web_search_tool()`
- Line 4.2.2.2: Create tools list: `tools = [PubMedTool(), ClinicalTrialsTool(), EuropePMCTool()]`
- Line 4.2.2.3: If `web_search_tool is not None`:
- Line 4.2.2.3.1: Append `web_search_tool` to tools list
- Line 4.2.2.3.2: Log info: "Web search tool added to search handler"
- Line 4.2.2.4: Update `SearchHandler` initialization to use `tools` list
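
A sketch of the wiring inside `configure_orchestrator()`; the `SearchHandler(tools=...)` constructor signature is assumed:

```python
from src.tools.web_search_factory import create_web_search_tool

tools = [PubMedTool(), ClinicalTrialsTool(), EuropePMCTool()]
web_search_tool = create_web_search_tool()
if web_search_tool is not None:
    tools.append(web_search_tool)
    logger.info("web_search_tool_added", tool=web_search_tool.name)
search_handler = SearchHandler(tools=tools)
```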
---
### PROJECT 5: Update Callers and Dependencies
#### Activity 5.1: Update web_search_adapter
**File**: `src/tools/web_search_adapter.py`
- **Task 5.1.1**: Update `web_search()` function to use new implementation:
- Line 5.1.1.1: Import `from src.tools.web_search_factory import create_web_search_tool`
- Line 5.1.1.2: Remove dependency on `folder.tools.web_search`
- Line 5.1.1.3: Get tool: `tool = create_web_search_tool()`
- Line 5.1.1.4: If `tool is None`, return error message
- Line 5.1.1.5: Call `evidence = await tool.search(query, max_results=5)`
- Line 5.1.1.6: Convert `Evidence` objects to formatted string:
- Line 5.1.1.6.1: Format each evidence with title, URL, content preview
- Line 5.1.1.7: Return formatted string
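
A sketch of the rewritten adapter; the string-in/string-out contract is kept so the Tool Executor and Planner Agent callers (Activities 5.2-5.3) need no changes:

```python
from src.tools.web_search_factory import create_web_search_tool


async def web_search(query: str) -> str:
    tool = create_web_search_tool()
    if tool is None:
        return "Web search is not configured (missing API key or host)."
    evidence = await tool.search(query, max_results=5)
    if not evidence:
        return "No web results found."
    return "\n\n".join(
        f"{e.citation.title}\n{e.citation.url}\n{e.content[:300]}"
        for e in evidence
    )
```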
#### Activity 5.2: Update Tool Executor
**File**: `src/tools/tool_executor.py`
- **Task 5.2.1**: Verify `web_search_adapter.web_search()` usage (line 86) still works
- **Task 5.2.2**: No changes needed if adapter is updated correctly
#### Activity 5.3: Update Planner Agent
**File**: `src/orchestrator/planner_agent.py`
- **Task 5.3.1**: Verify `web_search_adapter.web_search()` usage (line 14) still works
- **Task 5.3.2**: No changes needed if adapter is updated correctly
#### Activity 5.4: Remove Legacy Dependencies
**File**: `src/tools/web_search_adapter.py`
- **Task 5.4.1**: Remove import of `folder.llm_config` and `folder.tools.web_search`
- **Task 5.4.2**: Update error messages to reflect new implementation
---
### PROJECT 6: Testing and Validation
#### Activity 6.1: Unit Tests for Vendored Components
**File**: `tests/unit/tools/test_vendored_web_search_core.py`
- **Task 6.1.1**: Test `scrape_urls()` function
- **Task 6.1.2**: Test `fetch_and_process_url()` function
- **Task 6.1.3**: Test `html_to_text()` function
- **Task 6.1.4**: Test `is_valid_url()` function
**File**: `tests/unit/tools/test_vendored_serper_client.py`
- **Task 6.1.5**: Mock SerperClient API calls
- **Task 6.1.6**: Test successful search
- **Task 6.1.7**: Test error handling
- **Task 6.1.8**: Test rate limiting
**File**: `tests/unit/tools/test_vendored_searchxng_client.py`
- **Task 6.1.9**: Mock SearchXNGClient API calls
- **Task 6.1.10**: Test successful search
- **Task 6.1.11**: Test error handling
- **Task 6.1.12**: Test rate limiting
#### Activity 6.2: Unit Tests for Web Search Tools
**File**: `tests/unit/tools/test_serper_web_search.py`
- **Task 6.2.1**: Test `SerperWebSearchTool.__init__()` with valid API key
- **Task 6.2.2**: Test `SerperWebSearchTool.__init__()` without API key (should raise)
- **Task 6.2.3**: Test `name` property returns `"serper"`
- **Task 6.2.4**: Test `search()` returns `list[Evidence]`
- **Task 6.2.5**: Test `search()` with mocked SerperClient
- **Task 6.2.6**: Test error handling (SearchError, RateLimitError)
- **Task 6.2.7**: Test query preprocessing
- **Task 6.2.8**: Test rate limiting
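
A sketch of one such test (pytest with pytest-asyncio assumed); the patched attribute names and model fields come from the earlier sketches, not verified code:

```python
from unittest.mock import AsyncMock

import pytest

from src.tools.serper_web_search import SerperWebSearchTool
from src.tools.vendored.web_search_core import ScrapeResult, WebpageSnippet
from src.utils.models import Evidence


@pytest.mark.asyncio
async def test_search_returns_evidence(monkeypatch):
    tool = SerperWebSearchTool(api_key="test-key")
    snippet = WebpageSnippet(url="https://example.com", title="Example")
    # Stub out the network-facing pieces
    tool._client.search = AsyncMock(return_value=[snippet])
    monkeypatch.setattr(
        "src.tools.serper_web_search.scrape_urls",
        AsyncMock(return_value=[ScrapeResult(url=snippet.url, title=snippet.title, text="body")]),
    )
    results = await tool.search("heart disease biomarkers")
    assert results and isinstance(results[0], Evidence)
    assert results[0].citation.source == "serper"
```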
**File**: `tests/unit/tools/test_searchxng_web_search.py`
- **Task 6.2.9**: Similar tests for SearchXNGWebSearchTool
**File**: `tests/unit/tools/test_web_search.py`
- **Task 6.2.10**: Test `WebSearchTool.name` property returns `"duckduckgo"`
- **Task 6.2.11**: Test `WebSearchTool.search()` returns `list[Evidence]`
- **Task 6.2.12**: Test `WebSearchTool.search()` with mocked DDGS
- **Task 6.2.13**: Test error handling
- **Task 6.2.14**: Test query preprocessing
#### Activity 6.3: Integration Tests
**File**: `tests/integration/test_web_search_integration.py`
- **Task 6.3.1**: Test `SerperWebSearchTool` with real API (marked `@pytest.mark.integration`)
- **Task 6.3.2**: Test `SearchXNGWebSearchTool` with real API (marked `@pytest.mark.integration`)
- **Task 6.3.3**: Test `WebSearchTool` with real DuckDuckGo (marked `@pytest.mark.integration`)
- **Task 6.3.4**: Test `create_web_search_tool()` factory function
- **Task 6.3.5**: Test SearchHandler with web search tool
#### Activity 6.4: Update Existing Tests
**File**: `tests/unit/agents/test_retrieval_agent.py`
- **Task 6.4.1**: Update tests to expect `list[Evidence]` instead of `SearchResult`
- **Task 6.4.2**: Mock `WebSearchTool.search()` to return `list[Evidence]`
**File**: `tests/unit/tools/test_tool_executor.py`
- **Task 6.4.3**: Verify tests still pass with updated `web_search_adapter`
---
## Implementation Order
1. **PROJECT 1**: Vendor core components (foundation)
2. **PROJECT 3**: Fix existing WebSearchTool (quick win, unblocks retrieval agent)
3. **PROJECT 2**: Create SerperWebSearchTool (new functionality)
4. **PROJECT 4**: Integrate into SearchHandler (main integration)
5. **PROJECT 5**: Update callers (cleanup dependencies)
6. **PROJECT 6**: Testing (validation)
---
## Dependencies and Prerequisites
### External Dependencies
- `aiohttp` - Already in requirements
- `beautifulsoup4` - Already in requirements
- `duckduckgo-search` - Already in requirements
- `tenacity` - Already in requirements
- `structlog` - Already in requirements
### Internal Dependencies
- `src/tools/base.py` - SearchTool protocol
- `src/tools/rate_limiter.py` - Rate limiting utilities
- `src/tools/query_utils.py` - Query preprocessing
- `src/utils/config.py` - Settings and configuration
- `src/utils/exceptions.py` - Custom exceptions
- `src/utils/models.py` - Evidence, Citation models
### Configuration Requirements
- `SERPER_API_KEY` - For Serper provider
- `SEARCHXNG_HOST` - For SearchXNG provider
- `WEB_SEARCH_PROVIDER` - Environment variable (default: "duckduckgo")
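
For reference, the settings fields this plan assumes in `src/utils/config.py`, sketched in pydantic-settings style (the real module may differ):

```python
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    serper_api_key: str | None = None        # SERPER_API_KEY
    searchxng_host: str | None = None        # SEARCHXNG_HOST
    web_search_provider: str = "duckduckgo"  # WEB_SEARCH_PROVIDER
```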
---
## Risk Assessment
### High Risk
- **Breaking changes to retrieval_agent.py**: Must update carefully to handle `list[Evidence]` instead of `SearchResult`
- **Legacy folder dependencies**: Need to ensure all code is properly vendored
### Medium Risk
- **Rate limiting**: Serper API may have different limits than expected
- **Error handling**: Need to handle API failures gracefully
### Low Risk
- **Query preprocessing**: May need adjustment for web search vs PubMed
- **Testing**: Integration tests require API keys
---
## Success Criteria
1. ✅ `SerperWebSearchTool` implements `SearchTool` protocol correctly
2. ✅ `WebSearchTool` implements `SearchTool` protocol correctly
3. ✅ Both tools can be added to `SearchHandler`
4. ✅ `web_search_adapter` works with new implementation
5. ✅ `retrieval_agent` works with updated `WebSearchTool`
6. ✅ All unit tests pass
7. ✅ Integration tests pass (with API keys)
8. ✅ No dependencies on `folder/tools/web_search.py` in `src/` code
9. ✅ Configuration supports multiple providers
10. ✅ Error handling is robust
---
## Notes
- The vendored code should be self-contained and not depend on `folder/` modules
- Filter agent functionality from original code is removed (can be added later if needed)
- Rate limiting follows the same pattern as PubMed tool
- Query preprocessing may need web-specific adjustments (less aggressive than PubMed)
- Consider adding relevance scoring in the future