# SERPER Web Search Implementation Plan

## Executive Summary

This plan details the implementation of SERPER-based web search by vendoring code from `folder/tools/web_search.py` into `src/tools/`, creating a protocol-compliant `SerperWebSearchTool`, fixing the existing `WebSearchTool`, and integrating both into the main search flow.

## Project Structure

### Project 1: Vendor and Refactor Core Web Search Components

**Goal**: Extract and vendor Serper/SearchXNG search logic from `folder/tools/web_search.py` into `src/tools/`

### Project 2: Create Protocol-Compliant SerperWebSearchTool

**Goal**: Implement `SerperWebSearchTool` class that fully complies with `SearchTool` protocol

### Project 3: Fix Existing WebSearchTool Protocol Compliance

**Goal**: Make existing `WebSearchTool` (DuckDuckGo) protocol-compliant

### Project 4: Integrate Web Search into SearchHandler

**Goal**: Add web search tools to main search flow in `src/app.py`

### Project 5: Update Callers and Dependencies

**Goal**: Update all code that uses web search to work with new implementation

### Project 6: Testing and Validation

**Goal**: Add comprehensive tests for all web search implementations

---

## Detailed Implementation Plan

### PROJECT 1: Vendor and Refactor Core Web Search Components

#### Activity 1.1: Create Vendor Module Structure

**File**: `src/tools/vendored/__init__.py`

- **Task 1.1.1**: Create `src/tools/vendored/` directory
- **Task 1.1.2**: Create `__init__.py` with exports

**File**: `src/tools/vendored/web_search_core.py`

- **Task 1.1.3**: Vendor `ScrapeResult`, `WebpageSnippet`, `SearchResults` models from `folder/tools/web_search.py` (lines 23-37)
- **Task 1.1.4**: Vendor `scrape_urls()` function (lines 274-299)
- **Task 1.1.5**: Vendor `fetch_and_process_url()` function (lines 302-348)
- **Task 1.1.6**: Vendor `html_to_text()` function (lines 351-368)
- **Task 1.1.7**: Vendor `is_valid_url()` function (lines 371-410)
- **Task 1.1.8**: Vendor `ssl_context` setup (lines 115-120)
- **Task 1.1.9**: Add imports: `aiohttp`, `asyncio`, `BeautifulSoup`, `ssl`
- **Task 1.1.10**: Add `CONTENT_LENGTH_LIMIT = 10000` constant
- **Task 1.1.11**: Add type hints following project standards
- **Task 1.1.12**: Add structlog logging
- **Task 1.1.13**: Replace `print()` statements with `logger` calls
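The core pieces of `web_search_core.py` can be sketched as follows. This is a minimal, self-contained approximation: the dataclasses stand in for the original models, `truncate_content` is a hypothetical helper showing how `CONTENT_LENGTH_LIMIT` would be applied, and the BeautifulSoup-based `html_to_text()` and async scraping are omitted to keep the sketch dependency-free.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

CONTENT_LENGTH_LIMIT = 10000  # max characters of scraped text to keep (Task 1.1.10)


@dataclass
class WebpageSnippet:
    """A search hit before scraping: URL plus engine-provided metadata."""
    url: str
    title: str
    description: str = ""


@dataclass
class ScrapeResult:
    """A scraped page: the snippet fields plus extracted text."""
    url: str
    title: str
    text: str


def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs with a network location."""
    try:
        parsed = urlparse(url)
    except ValueError:
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def truncate_content(text: str, limit: int = CONTENT_LENGTH_LIMIT) -> str:
    """Clamp scraped page text to the configured length limit."""
    return text if len(text) <= limit else text[:limit]
```

The real vendored module replaces `truncate_content` with whatever clamping `fetch_and_process_url()` already does; the point is that the limit lives in one named constant.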
**File**: `src/tools/vendored/serper_client.py`

- **Task 1.1.14**: Vendor `SerperClient` class from `folder/tools/web_search.py` (lines 123-196)
- **Task 1.1.15**: Remove dependency on `ResearchAgent` and `ResearchRunner`
- **Task 1.1.16**: Replace filter agent with simple relevance filtering or remove it
- **Task 1.1.17**: Add `__init__` that takes `api_key: str | None` parameter
- **Task 1.1.18**: Update `search()` method to return `list[WebpageSnippet]` without filtering
- **Task 1.1.19**: Remove `_filter_results()` method (or make it optional)
- **Task 1.1.20**: Add error handling with `SearchError` and `RateLimitError`
- **Task 1.1.21**: Add structlog logging
- **Task 1.1.22**: Add type hints

**File**: `src/tools/vendored/searchxng_client.py`

- **Task 1.1.23**: Vendor `SearchXNGClient` class from `folder/tools/web_search.py` (lines 199-271)
- **Task 1.1.24**: Remove dependency on `ResearchAgent` and `ResearchRunner`
- **Task 1.1.25**: Replace filter agent with simple relevance filtering or remove it
- **Task 1.1.26**: Add `__init__` that takes `host: str` parameter
- **Task 1.1.27**: Update `search()` method to return `list[WebpageSnippet]` without filtering
- **Task 1.1.28**: Remove `_filter_results()` method (or make it optional)
- **Task 1.1.29**: Add error handling with `SearchError` and `RateLimitError`
- **Task 1.1.30**: Add structlog logging
- **Task 1.1.31**: Add type hints
#### Activity 1.2: Create Rate Limiting for Web Search

**File**: `src/tools/rate_limiter.py`

- **Task 1.2.1**: Add `get_serper_limiter()` function (rate: "10/second" with API key)
- **Task 1.2.2**: Add `get_searchxng_limiter()` function (rate: "5/second")
- **Task 1.2.3**: Use `RateLimiterFactory.get()` pattern
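The limiter factories can be sketched as below. `IntervalLimiter` is a hypothetical stand-in for whatever `RateLimiterFactory.get()` returns in the real `rate_limiter.py`; the sketch only illustrates the per-key caching and the 10/s vs 5/s rates.

```python
import asyncio
import time


class IntervalLimiter:
    """Minimal async limiter: at most `rate` acquisitions per second."""

    def __init__(self, rate: float) -> None:
        self._interval = 1.0 / rate
        self._next_free = 0.0
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        async with self._lock:
            now = time.monotonic()
            wait = self._next_free - now
            self._next_free = max(now, self._next_free) + self._interval
        if wait > 0:
            await asyncio.sleep(wait)


_limiters: dict[str, IntervalLimiter] = {}


def get_serper_limiter(api_key: str) -> IntervalLimiter:
    """One shared limiter per API key, 10 requests/second (Task 1.2.1)."""
    return _limiters.setdefault(f"serper:{api_key}", IntervalLimiter(10.0))


def get_searchxng_limiter() -> IntervalLimiter:
    """One shared limiter for the SearchXNG host, 5 requests/second (Task 1.2.2)."""
    return _limiters.setdefault("searchxng", IntervalLimiter(5.0))
```

Caching by key means every `SerperWebSearchTool` instance with the same API key shares one limiter, mirroring the PubMed tool's pattern noted in the Notes section.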
---

### PROJECT 2: Create Protocol-Compliant SerperWebSearchTool

#### Activity 2.1: Implement SerperWebSearchTool Class

**File**: `src/tools/serper_web_search.py`

- **Task 2.1.1**: Create new file `src/tools/serper_web_search.py`
- **Task 2.1.2**: Add imports:
  - `from src.tools.base import SearchTool`
  - `from src.tools.vendored.serper_client import SerperClient`
  - `from src.tools.vendored.web_search_core import scrape_urls, WebpageSnippet`
  - `from src.tools.rate_limiter import get_serper_limiter`
  - `from src.tools.query_utils import preprocess_query`
  - `from src.utils.config import settings`
  - `from src.utils.exceptions import ConfigurationError, SearchError, RateLimitError`
  - `from src.utils.models import Citation, Evidence`
  - `import httpx` (needed for the exception handling in Task 2.1.8)
  - `import structlog`
  - `from tenacity import retry, stop_after_attempt, wait_exponential`
- **Task 2.1.3**: Create `SerperWebSearchTool` class
- **Task 2.1.4**: Add `__init__(self, api_key: str | None = None)` method
  - Line 2.1.4.1: Get API key from parameter or `settings.serper_api_key`
  - Line 2.1.4.2: Validate API key is not None, raise `ConfigurationError` if missing
  - Line 2.1.4.3: Initialize `SerperClient(api_key=self.api_key)`
  - Line 2.1.4.4: Get rate limiter: `self._limiter = get_serper_limiter(self.api_key)`
- **Task 2.1.5**: Add `@property def name(self) -> str:` returning `"serper"`
- **Task 2.1.6**: Add `async def _rate_limit(self) -> None:` method
  - Line 2.1.6.1: Call `await self._limiter.acquire()`
- **Task 2.1.7**: Add `@retry(...)` decorator with exponential backoff
- **Task 2.1.8**: Add `async def search(self, query: str, max_results: int = 10) -> list[Evidence]:` method
  - Line 2.1.8.1: Call `await self._rate_limit()`
  - Line 2.1.8.2: Preprocess query: `clean_query = preprocess_query(query)`
  - Line 2.1.8.3: Select the query to send: `final_query = clean_query if clean_query else query`
  - Line 2.1.8.4: Call `search_results = await self._client.search(final_query, filter_for_relevance=False, max_results=max_results)`
  - Line 2.1.8.5: Call `scraped = await scrape_urls(search_results)`
  - Line 2.1.8.6: Convert `ScrapeResult` to `Evidence` objects:
    - Line 2.1.8.6.1: Create `Citation` with `title`, `url`, `source="serper"`, `date="Unknown"`, `authors=[]`
    - Line 2.1.8.6.2: Create `Evidence` with `content=scraped.text`, `citation`, `relevance=0.0`
  - Line 2.1.8.7: Return `list[Evidence]`
  - Line 2.1.8.8: Add try/except for `httpx.HTTPStatusError`:
    - Line 2.1.8.8.1: Check for 429 status, raise `RateLimitError`
    - Line 2.1.8.8.2: Otherwise raise `SearchError`
  - Line 2.1.8.9: Add try/except for `httpx.TimeoutException`, raise `SearchError`
  - Line 2.1.8.10: Add generic exception handler, log and raise `SearchError`
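The conversion in Lines 2.1.8.6.x can be sketched as a small helper. `Citation` and `Evidence` here are dataclass stand-ins for the real `src/utils/models` classes, and `to_evidence` is a hypothetical name for logic that may just live inline in `search()`.

```python
from dataclasses import dataclass, field


@dataclass
class Citation:
    title: str
    url: str
    source: str
    date: str = "Unknown"
    authors: list[str] = field(default_factory=list)


@dataclass
class Evidence:
    content: str
    citation: Citation
    relevance: float = 0.0


@dataclass
class ScrapeResult:
    url: str
    title: str
    text: str


def to_evidence(scraped: list[ScrapeResult], source: str = "serper") -> list[Evidence]:
    """Convert scraped pages into protocol-level Evidence objects (Line 2.1.8.6)."""
    return [
        Evidence(
            content=result.text,
            citation=Citation(title=result.title, url=result.url, source=source),
        )
        for result in scraped
    ]
```

Keeping `relevance=0.0` matches the plan's note that relevance scoring is deferred.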
#### Activity 2.2: Implement SearchXNGWebSearchTool Class

**File**: `src/tools/searchxng_web_search.py`

- **Task 2.2.1**: Create new file `src/tools/searchxng_web_search.py`
- **Task 2.2.2**: Add imports (similar to SerperWebSearchTool)
- **Task 2.2.3**: Create `SearchXNGWebSearchTool` class
- **Task 2.2.4**: Add `__init__(self, host: str | None = None)` method
  - Line 2.2.4.1: Get host from parameter or `settings.searchxng_host`
  - Line 2.2.4.2: Validate host is not None, raise `ConfigurationError` if missing
  - Line 2.2.4.3: Initialize `SearchXNGClient(host=self.host)`
  - Line 2.2.4.4: Get rate limiter: `self._limiter = get_searchxng_limiter()`
- **Task 2.2.5**: Add `@property def name(self) -> str:` returning `"searchxng"`
- **Task 2.2.6**: Add `async def _rate_limit(self) -> None:` method
- **Task 2.2.7**: Add `@retry(...)` decorator
- **Task 2.2.8**: Add `async def search(self, query: str, max_results: int = 10) -> list[Evidence]:` method
  - Lines 2.2.8.1-2.2.8.10: Same structure as SerperWebSearchTool
---

### PROJECT 3: Fix Existing WebSearchTool Protocol Compliance

#### Activity 3.1: Update WebSearchTool Class

**File**: `src/tools/web_search.py`

- **Task 3.1.1**: Add `@property def name(self) -> str:` method returning `"duckduckgo"` (after line 17)
- **Task 3.1.2**: Change `search()` return type from `SearchResult` to `list[Evidence]` (line 19)
- **Task 3.1.3**: Update `search()` method body:
  - Line 3.1.3.1: Keep existing search logic (lines 21-43)
  - Line 3.1.3.2: Instead of returning `SearchResult`, return the `evidence` list directly (line 44)
  - Line 3.1.3.3: Update the exception handler (line 51) to stop returning an error `SearchResult` (Task 3.1.4.2 covers raising `SearchError` instead)
- **Task 3.1.4**: Add imports if needed:
  - Line 3.1.4.1: `from src.utils.exceptions import SearchError`
  - Line 3.1.4.2: Update exception handling to raise `SearchError` instead of returning an error `SearchResult`
- **Task 3.1.5**: Add query preprocessing:
  - Line 3.1.5.1: Import `from src.tools.query_utils import preprocess_query`
  - Line 3.1.5.2: Add `clean_query = preprocess_query(query)` before search
  - Line 3.1.5.3: Use `clean_query if clean_query else query`
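The protocol-compliant shape of the updated tool can be sketched as below, with the DuckDuckGo call injected so the `Evidence` conversion runs without the network. The real tool would call `duckduckgo-search`'s `DDGS` client directly; the row keys assumed here (`title`/`href`/`body`) reflect that library's usual result shape and should be verified against the installed version.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Citation:
    title: str
    url: str
    source: str
    date: str = "Unknown"
    authors: list[str] = field(default_factory=list)


@dataclass
class Evidence:
    content: str
    citation: Citation
    relevance: float = 0.0


class WebSearchTool:
    """DuckDuckGo search tool returning list[Evidence] per the SearchTool protocol."""

    def __init__(self, backend: Callable[[str, int], list[dict]]) -> None:
        # In production: backend = lambda q, n: DDGS().text(q, max_results=n)
        self._backend = backend

    @property
    def name(self) -> str:
        return "duckduckgo"  # Task 3.1.1

    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        rows = self._backend(query, max_results)
        # Task 3.1.2: return the evidence list directly, no SearchResult wrapper
        return [
            Evidence(
                content=row.get("body", ""),
                citation=Citation(
                    title=row.get("title", ""),
                    url=row.get("href", ""),
                    source=self.name,
                ),
            )
            for row in rows
        ]
```

The injected backend also makes Tasks 6.2.12's "mocked DDGS" test trivial: pass a function returning canned rows.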
#### Activity 3.2: Update Retrieval Agent Caller

**File**: `src/agents/retrieval_agent.py`

- **Task 3.2.1**: Update `search_web()` function (line 31):
  - Line 3.2.1.1: Change `results = await _web_search.search(query, max_results)`
  - Line 3.2.1.2: Change to `evidence = await _web_search.search(query, max_results)`
  - Line 3.2.1.3: Update check: `if not evidence:` instead of `if not results.evidence:`
  - Line 3.2.1.4: Update state update: `new_count = state.add_evidence(evidence)` instead of `results.evidence`
  - Line 3.2.1.5: Update logging: `results_found=len(evidence)` instead of `len(results.evidence)`
  - Line 3.2.1.6: Update output formatting: `for i, r in enumerate(evidence[:max_results], 1):` instead of `results.evidence[:max_results]`
  - Line 3.2.1.7: Update deduplication: `await state.embedding_service.deduplicate(evidence)` instead of `results.evidence`
  - Line 3.2.1.8: Update output message: `Found {len(evidence)} web results` instead of `len(results.evidence)`
---

### PROJECT 4: Integrate Web Search into SearchHandler

#### Activity 4.1: Create Web Search Tool Factory

**File**: `src/tools/web_search_factory.py`

- **Task 4.1.1**: Create new file `src/tools/web_search_factory.py`
- **Task 4.1.2**: Add imports:
  - `from src.tools.base import SearchTool` (needed for the return annotation in Task 4.1.4)
  - `from src.tools.web_search import WebSearchTool`
  - `from src.tools.serper_web_search import SerperWebSearchTool`
  - `from src.tools.searchxng_web_search import SearchXNGWebSearchTool`
  - `from src.utils.config import settings`
  - `from src.utils.exceptions import ConfigurationError`
  - `import structlog`
- **Task 4.1.3**: Add `logger = structlog.get_logger()`
- **Task 4.1.4**: Create `def create_web_search_tool() -> SearchTool | None:` function
  - Line 4.1.4.1: Check `settings.web_search_provider`
  - Line 4.1.4.2: If `"serper"`:
    - Line 4.1.4.2.1: Check `settings.serper_api_key` or `settings.web_search_available()`
    - Line 4.1.4.2.2: If available, return `SerperWebSearchTool()`
    - Line 4.1.4.2.3: Else log warning and return `None`
  - Line 4.1.4.3: If `"searchxng"`:
    - Line 4.1.4.3.1: Check `settings.searchxng_host` or `settings.web_search_available()`
    - Line 4.1.4.3.2: If available, return `SearchXNGWebSearchTool()`
    - Line 4.1.4.3.3: Else log warning and return `None`
  - Line 4.1.4.4: If `"duckduckgo"`:
    - Line 4.1.4.4.1: Return `WebSearchTool()` (always available)
  - Line 4.1.4.5: If `"brave"` or `"tavily"`:
    - Line 4.1.4.5.1: Log warning "Not yet implemented"
    - Line 4.1.4.5.2: Return `None`
  - Line 4.1.4.6: Default: return `WebSearchTool()` (fallback to DuckDuckGo)
#### Activity 4.2: Update SearchHandler Initialization

**File**: `src/app.py`

- **Task 4.2.1**: Add import: `from src.tools.web_search_factory import create_web_search_tool`
- **Task 4.2.2**: Update `configure_orchestrator()` function (around line 73):
  - Line 4.2.2.1: Before creating `SearchHandler`, call `web_search_tool = create_web_search_tool()`
  - Line 4.2.2.2: Create tools list: `tools = [PubMedTool(), ClinicalTrialsTool(), EuropePMCTool()]`
  - Line 4.2.2.3: If `web_search_tool is not None`:
    - Line 4.2.2.3.1: Append `web_search_tool` to tools list
    - Line 4.2.2.3.2: Log info: "Web search tool added to search handler"
  - Line 4.2.2.4: Update `SearchHandler` initialization to use `tools` list
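The conditional wiring in Task 4.2.2 reduces to a few lines. Every class below is a placeholder for the real object in `src/app.py`, and the factory result is passed in as a parameter so the sketch stays testable.

```python
class PubMedTool: ...
class ClinicalTrialsTool: ...
class EuropePMCTool: ...


class SearchHandler:
    def __init__(self, tools: list) -> None:
        self.tools = tools


def build_search_handler(web_search_tool=None) -> SearchHandler:
    """Assemble the handler; web search is appended only when configured."""
    tools = [PubMedTool(), ClinicalTrialsTool(), EuropePMCTool()]
    if web_search_tool is not None:
        tools.append(web_search_tool)  # Line 4.2.2.3.1
    return SearchHandler(tools=tools)
```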
---

### PROJECT 5: Update Callers and Dependencies

#### Activity 5.1: Update web_search_adapter

**File**: `src/tools/web_search_adapter.py`

- **Task 5.1.1**: Update `web_search()` function to use new implementation:
  - Line 5.1.1.1: Import `from src.tools.web_search_factory import create_web_search_tool`
  - Line 5.1.1.2: Remove dependency on `folder.tools.web_search`
  - Line 5.1.1.3: Get tool: `tool = create_web_search_tool()`
  - Line 5.1.1.4: If `tool is None`, return error message
  - Line 5.1.1.5: Call `evidence = await tool.search(query, max_results=5)`
  - Line 5.1.1.6: Convert `Evidence` objects to formatted string:
    - Line 5.1.1.6.1: Format each evidence item with title, URL, content preview
  - Line 5.1.1.7: Return formatted string
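The Evidence-to-string step in Line 5.1.1.6 can be sketched as a helper; `Citation` and `Evidence` are again dataclass stand-ins for `src/utils/models`, and `format_evidence` plus the 200-character preview are illustrative choices, not names from the real adapter.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    title: str
    url: str


@dataclass
class Evidence:
    content: str
    citation: Citation


def format_evidence(evidence: list[Evidence], preview_chars: int = 200) -> str:
    """Render search evidence as the plain-text block the adapter returns."""
    if not evidence:
        return "No web search results found."
    lines = []
    for i, item in enumerate(evidence, 1):
        preview = item.content[:preview_chars]
        lines.append(f"{i}. {item.citation.title}\n   {item.citation.url}\n   {preview}")
    return "\n".join(lines)
```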
#### Activity 5.2: Update Tool Executor

**File**: `src/tools/tool_executor.py`

- **Task 5.2.1**: Verify `web_search_adapter.web_search()` usage (line 86) still works
- **Task 5.2.2**: No changes needed if adapter is updated correctly

#### Activity 5.3: Update Planner Agent

**File**: `src/orchestrator/planner_agent.py`

- **Task 5.3.1**: Verify `web_search_adapter.web_search()` usage (line 14) still works
- **Task 5.3.2**: No changes needed if adapter is updated correctly

#### Activity 5.4: Remove Legacy Dependencies

**File**: `src/tools/web_search_adapter.py`

- **Task 5.4.1**: Remove import of `folder.llm_config` and `folder.tools.web_search`
- **Task 5.4.2**: Update error messages to reflect new implementation

---

### PROJECT 6: Testing and Validation

#### Activity 6.1: Unit Tests for Vendored Components

**File**: `tests/unit/tools/test_vendored_web_search_core.py`

- **Task 6.1.1**: Test `scrape_urls()` function
- **Task 6.1.2**: Test `fetch_and_process_url()` function
- **Task 6.1.3**: Test `html_to_text()` function
- **Task 6.1.4**: Test `is_valid_url()` function

**File**: `tests/unit/tools/test_vendored_serper_client.py`

- **Task 6.1.5**: Mock SerperClient API calls
- **Task 6.1.6**: Test successful search
- **Task 6.1.7**: Test error handling
- **Task 6.1.8**: Test rate limiting

**File**: `tests/unit/tools/test_vendored_searchxng_client.py`

- **Task 6.1.9**: Mock SearchXNGClient API calls
- **Task 6.1.10**: Test successful search
- **Task 6.1.11**: Test error handling
- **Task 6.1.12**: Test rate limiting

#### Activity 6.2: Unit Tests for Web Search Tools

**File**: `tests/unit/tools/test_serper_web_search.py`

- **Task 6.2.1**: Test `SerperWebSearchTool.__init__()` with valid API key
- **Task 6.2.2**: Test `SerperWebSearchTool.__init__()` without API key (should raise)
- **Task 6.2.3**: Test `name` property returns `"serper"`
- **Task 6.2.4**: Test `search()` returns `list[Evidence]`
- **Task 6.2.5**: Test `search()` with mocked SerperClient
- **Task 6.2.6**: Test error handling (SearchError, RateLimitError)
- **Task 6.2.7**: Test query preprocessing
- **Task 6.2.8**: Test rate limiting
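The mocked-client pattern for Task 6.2.5 can be sketched with `unittest.mock.AsyncMock`. `SerperWebSearchToolStub` is a minimal stand-in so the example runs on its own; a real test would import `SerperWebSearchTool` from `src/tools/` and inject the mock in its place.

```python
import asyncio
from unittest.mock import AsyncMock


class SerperWebSearchToolStub:
    """Stand-in with the same client-delegation shape as the real tool."""

    def __init__(self, client) -> None:
        self._client = client

    async def search(self, query: str, max_results: int = 10) -> list:
        return await self._client.search(query, max_results=max_results)


async def test_search_uses_mocked_client() -> None:
    client = AsyncMock()
    client.search.return_value = [{"title": "t", "url": "https://x.example"}]
    tool = SerperWebSearchToolStub(client)
    results = await tool.search("metformin dosing", max_results=5)
    assert isinstance(results, list) and len(results) == 1
    client.search.assert_awaited_once_with("metformin dosing", max_results=5)


asyncio.run(test_search_uses_mocked_client())
```

With pytest, the same body would be an `async def test_...` collected via `pytest-asyncio` instead of the explicit `asyncio.run`.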
**File**: `tests/unit/tools/test_searchxng_web_search.py`

- **Task 6.2.9**: Similar tests for SearchXNGWebSearchTool

**File**: `tests/unit/tools/test_web_search.py`

- **Task 6.2.10**: Test `WebSearchTool.name` property returns `"duckduckgo"`
- **Task 6.2.11**: Test `WebSearchTool.search()` returns `list[Evidence]`
- **Task 6.2.12**: Test `WebSearchTool.search()` with mocked DDGS
- **Task 6.2.13**: Test error handling
- **Task 6.2.14**: Test query preprocessing

#### Activity 6.3: Integration Tests

**File**: `tests/integration/test_web_search_integration.py`

- **Task 6.3.1**: Test `SerperWebSearchTool` with real API (marked `@pytest.mark.integration`)
- **Task 6.3.2**: Test `SearchXNGWebSearchTool` with real API (marked `@pytest.mark.integration`)
- **Task 6.3.3**: Test `WebSearchTool` with real DuckDuckGo (marked `@pytest.mark.integration`)
- **Task 6.3.4**: Test `create_web_search_tool()` factory function
- **Task 6.3.5**: Test SearchHandler with web search tool

#### Activity 6.4: Update Existing Tests

**File**: `tests/unit/agents/test_retrieval_agent.py`

- **Task 6.4.1**: Update tests to expect `list[Evidence]` instead of `SearchResult`
- **Task 6.4.2**: Mock `WebSearchTool.search()` to return `list[Evidence]`

**File**: `tests/unit/tools/test_tool_executor.py`

- **Task 6.4.3**: Verify tests still pass with updated `web_search_adapter`

---
## Implementation Order

1. **PROJECT 1**: Vendor core components (foundation)
2. **PROJECT 3**: Fix existing WebSearchTool (quick win, unblocks retrieval agent)
3. **PROJECT 2**: Create SerperWebSearchTool (new functionality)
4. **PROJECT 4**: Integrate into SearchHandler (main integration)
5. **PROJECT 5**: Update callers (cleanup dependencies)
6. **PROJECT 6**: Testing (validation)

---

## Dependencies and Prerequisites

### External Dependencies

- `aiohttp` - Already in requirements
- `beautifulsoup4` - Already in requirements
- `duckduckgo-search` - Already in requirements
- `tenacity` - Already in requirements
- `structlog` - Already in requirements

### Internal Dependencies

- `src/tools/base.py` - SearchTool protocol
- `src/tools/rate_limiter.py` - Rate limiting utilities
- `src/tools/query_utils.py` - Query preprocessing
- `src/utils/config.py` - Settings and configuration
- `src/utils/exceptions.py` - Custom exceptions
- `src/utils/models.py` - Evidence, Citation models

### Configuration Requirements

- `SERPER_API_KEY` - For the Serper provider
- `SEARCHXNG_HOST` - For the SearchXNG provider
- `WEB_SEARCH_PROVIDER` - Environment variable selecting the provider (default: "duckduckgo")
---

## Risk Assessment

### High Risk

- **Breaking changes to retrieval_agent.py**: Must update carefully to handle `list[Evidence]` instead of `SearchResult`
- **Legacy folder dependencies**: Need to ensure all code is properly vendored

### Medium Risk

- **Rate limiting**: Serper API may have different limits than expected
- **Error handling**: Need to handle API failures gracefully

### Low Risk

- **Query preprocessing**: May need adjustment for web search vs PubMed
- **Testing**: Integration tests require API keys

---

## Success Criteria

1. ✅ `SerperWebSearchTool` implements `SearchTool` protocol correctly
2. ✅ `WebSearchTool` implements `SearchTool` protocol correctly
3. ✅ Both tools can be added to `SearchHandler`
4. ✅ `web_search_adapter` works with new implementation
5. ✅ `retrieval_agent` works with updated `WebSearchTool`
6. ✅ All unit tests pass
7. ✅ Integration tests pass (with API keys)
8. ✅ No dependencies on `folder/tools/web_search.py` in `src/` code
9. ✅ Configuration supports multiple providers
10. ✅ Error handling is robust

---

## Notes

- The vendored code should be self-contained and not depend on `folder/` modules
- Filter agent functionality from the original code is removed (it can be added later if needed)
- Rate limiting follows the same pattern as the PubMed tool
- Query preprocessing may need web-specific adjustments (less aggressive than PubMed)
- Consider adding relevance scoring in the future