Automating the Discovery of Unauthenticated API Endpoints with Python

Identifying unauthenticated API endpoints is a critical step in any web application penetration test. The goal is to programmatically discover API paths that respond with data or perform actions without requiring a valid authentication token, often exposing sensitive information or functionality. This write-up details a Python-centric approach to automating this discovery process.

Initial Reconnaissance: Beyond the Browser

Before any automated scanning begins, a robust reconnaissance phase is essential. This isn't just about clicking around; it involves deep inspection of client-side code, network traffic, and publicly available documentation.

Passive Information Gathering

Start by scrutinizing public-facing documentation such as Swagger UI or OpenAPI specifications. These often explicitly list endpoints, their methods, and expected parameters. While they may indicate authentication requirements, they are a prime source of potential paths. Additionally, examine the target application's JavaScript files. Many modern web applications make API calls directly from the browser, and the paths are often hardcoded within the JavaScript. Look for instances of fetch(), XMLHttpRequest, axios.get(), or similar constructs that reveal API paths.
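
If such a spec is exposed, enumerating its paths takes only a few lines of Python. Below is a minimal sketch, assuming a JSON OpenAPI document at a hypothetical /openapi.json location; adjust the URL to whatever your recon turns up:

import requests

def list_openapi_paths(spec_url):
    # OpenAPI documents enumerate every endpoint under the top-level "paths" key
    spec = requests.get(spec_url, timeout=5).json()
    for path, operations in spec.get("paths", {}).items():
        for method in operations:
            # Skip non-method keys such as "parameters" or "summary"
            if method.lower() in ("get", "post", "put", "patch", "delete"):
                print(f"{method.upper():7s} {path}")

# Hypothetical spec location; /swagger.json and /v3/api-docs are also common
# list_openapi_paths("https://api.example.com/openapi.json")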

Active Spidering and Proxying

Utilize a web proxy like Burp Suite Professional to spider the application comprehensively. Configure your browser to proxy traffic through Burp, then manually navigate through all accessible parts of the application. The Burp Proxy history and Site map will populate with all observed requests, including those to API endpoints. Pay close attention to calls made during page loads, form submissions, and AJAX interactions. Exporting the sitemap or the HTTP history can provide a rich list of potential API paths to feed into an automated script.
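
Once exported, reducing that history to candidate paths is easy to script. A minimal sketch, assuming the export is a plain text file with one URL per line (the filename burp_urls.txt is illustrative):

from urllib.parse import urlparse

def paths_from_url_list(filepath):
    # Assumes one full URL per line, e.g., copied out of Burp's Site map
    with open(filepath, 'r') as f:
        urls = [line.strip() for line in f if line.strip()]
    # Collapse full URLs down to a sorted set of unique path components
    return sorted({urlparse(u).path for u in urls})

# for path in paths_from_url_list("burp_urls.txt"):
#     print(path)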

For a quick manual check of a suspected endpoint, curl is indispensable. Testing without any authentication headers provides a baseline:

curl -s -i "https://api.example.com/v1/users"

Observe the HTTP status code and response body. A 200 OK with meaningful data without any prior authentication is a strong indicator of an unauthenticated endpoint. Conversely, a 401 Unauthorized, 403 Forbidden, or a redirect to a login page suggests protection.

Building the Python Endpoint Scanner

Python's requests library is the cornerstone of this automation. It provides an intuitive way to interact with HTTP services, making it ideal for crafting and sending requests to potential API endpoints.

The requests Library: Your HTTP Workhorse

The requests library simplifies HTTP requests. We'll use it to send GET, POST, and potentially other HTTP methods to a list of suspected API paths. It handles SSL verification and cookie management out of the box, and its Session objects add connection pooling, making our script robust.

import requests

def make_request(url, method='GET', headers=None, data=None):
    try:
        response = requests.request(method, url, headers=headers, data=data, timeout=5, allow_redirects=False)
        return response
    except requests.exceptions.RequestException as e:
        print(f"Error making request to {url}: {e}")
        return None

# Example usage:
# response = make_request("https://api.example.com/status")
# if response:
#     print(f"Status: {response.status_code}, Length: {len(response.text)}")
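
Because a fuzzing run sends thousands of requests to the same host, a requests.Session is worth considering: unlike module-level calls, a Session reuses TCP connections. A sketch of a pooled variant (make_request_pooled is an illustrative name):

session = requests.Session()

def make_request_pooled(url, method='GET', headers=None, data=None):
    # The shared Session reuses TCP connections across calls,
    # which noticeably speeds up large scans against a single host
    try:
        return session.request(method, url, headers=headers, data=data,
                               timeout=5, allow_redirects=False)
    except requests.exceptions.RequestException:
        return None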

Endpoint Enumeration: Wordlists and Fuzzing

Beyond endpoints discovered via passive and active reconnaissance, a significant portion of the discovery process involves brute-forcing common API path patterns. API endpoints often follow predictable structures like /api/v1/users, /v2/products, /admin/config.

Leverage comprehensive wordlists from resources like SecLists, specifically those targeting directories and common API endpoints. Good starting points include Discovery/Web-Content/raft-medium-directories.txt or Discovery/Web-Content/common.txt, and more API-specific lists if available.

The strategy involves combining a base API path (e.g., /api/v1/) with a list of common endpoint suffixes (e.g., users, items, status, health, config, admin, profile). It's also critical to consider different HTTP methods (GET, POST, PUT, DELETE) for each path, as an endpoint might be unauthenticated for one method but protected for another.

def load_wordlist(filepath):
    try:
        with open(filepath, 'r') as f:
            return [line.strip() for line in f if line.strip() and not line.startswith('#')]
    except FileNotFoundError:
        print(f"Error: Wordlist not found at {filepath}")
        return []

# Common API path suffixes
common_suffixes = load_wordlist("seclists/Discovery/Web-Content/raft-medium-directories.txt")
# Or a custom list
# common_suffixes = ["users", "products", "items", "status", "health", "metrics", "config", "admin", "profile", "data", "search"]
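
Combining base paths, suffixes, and HTTP methods is a straightforward Cartesian product. A sketch of one way to generate the candidates (the base paths shown are assumptions; substitute whatever your recon surfaced):

from itertools import product

def generate_candidates(api_bases, suffixes, methods=('GET', 'POST')):
    # Yields a (method, path) pair for every base/suffix/method combination
    for base, suffix, method in product(api_bases, suffixes, methods):
        path = base.rstrip('/') + '/' + suffix.lstrip('/')
        yield method, path

# for method, path in generate_candidates(["/api/v1/", "/v2/"], ["users", "health"]):
#     print(method, path)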

Parsing JavaScript for Endpoint Hints

JavaScript files are a treasure trove of API endpoint information. Extracting these hints requires fetching the JavaScript and then applying regular expressions or parsing libraries to identify potential paths. This can be integrated into the overall scanning process.

import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def extract_js_from_html(html_content, base_url):
    soup = BeautifulSoup(html_content, 'html.parser')
    js_urls = []
    for script in soup.find_all('script', src=True):
        src = script['src']
        if src.startswith('http'):
            js_urls.append(src)
        else:
            # urljoin resolves relative and protocol-relative (//) URLs
            # against the base URL's scheme and host
            js_urls.append(urljoin(base_url, src))
    return js_urls

def extract_api_paths_from_js(js_content):
    # Regex for common API path patterns within JS strings
    # Looks for /api/vX/..., /vX/..., or just /something...
    # Excludes common static file extensions
    api_path_patterns = [
        re.compile(r'(?:"|\')(/api/v\d+/[a-zA-Z0-9_\-./]+)(?:"|\')'),
        re.compile(r'(?:"|\')(/v\d+/[a-zA-Z0-9_\-./]+)(?:"|\')'),
        re.compile(r'(?:"|\')(/users/[a-zA-Z0-9_\-./]+)(?:"|\')'), # Specific patterns
        re.compile(r'(?:"|\')(/auth/[a-zA-Z0-9_\-./]+)(?:"|\')'),
        re.compile(r'(?:"|\')(/admin/[a-zA-Z0-9_\-./]+)(?:"|\')'),
    ]
    found_paths = set()
    for pattern in api_path_patterns:
        matches = pattern.findall(js_content)
        for match in matches:
            # Simple sanitization to remove common JS string artifacts
            cleaned_path = match.strip("'\"")
            # Further filtering: Exclude paths likely to be static assets
            if not any(cleaned_path.endswith(ext) for ext in ['.js', '.css', '.png', '.jpg', '.gif', '.svg', '.woff', '.ttf', '.map']):
                found_paths.add(cleaned_path)
    return list(found_paths)
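
A quick usage sketch chaining the two helpers (the target URL is a placeholder):

# Example usage:
# base_url = "https://app.example.com"
# page = requests.get(base_url, timeout=5)
# for js_url in extract_js_from_html(page.text, base_url):
#     js = requests.get(js_url, timeout=5)
#     if js.status_code == 200:
#         for path in extract_api_paths_from_js(js.text):
#             print(path)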

The Automation Script: Core Logic

The core automation script ties together the reconnaissance and enumeration techniques. It iterates through a list of potential API paths, sends requests, and analyzes the responses for indicators of unauthenticated access. We'll define "unauthenticated" as receiving a 200 OK status with meaningful content, or observing that the response to a request carrying a bogus Authorization header is identical to the response to the same request made *without* one, implying no authentication check occurred.

import requests
import re
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

# --- Helper functions (from above) ---
def make_request(url, method='GET', headers=None, data=None):
    try:
        # User-Agent header for better emulation and avoiding blocks
        default_headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
        }
        if headers:
            default_headers.update(headers)

        response = requests.request(method, url, headers=default_headers, data=data, timeout=7, allow_redirects=False)
        return response
    except requests.exceptions.RequestException as e:
        # print(f"Error making request to {url}: {e}")
        return None

def load_wordlist(filepath):
    try:
        with open(filepath, 'r') as f:
            return [line.strip() for line in f if line.strip() and not line.startswith('#')]
    except FileNotFoundError:
        # print(f"Error: Wordlist not found at {filepath}. Using empty list.")
        return []

def extract_js_from_html(html_content, base_url):
    soup = BeautifulSoup(html_content, 'html.parser')
    js_urls = []
    for script in soup.find_all('script', src=True):
        src = script['src']
        if src.startswith('http'):
            js_urls.append(src)
        else:
            # urljoin resolves relative and protocol-relative (//) URLs against the base
            js_urls.append(urljoin(base_url, src))
    return js_urls

def extract_api_paths_from_js(js_content):
    api_path_patterns = [
        re.compile(r'(?:"|\')(/api/v\d+/[a-zA-Z0-9_\-./]+)(?:"|\')'),
        re.compile(r'(?:"|\')(/v\d+/[a-zA-Z0-9_\-./]+)(?:"|\')'),
        re.compile(r'(?:"|\')(/users/[a-zA-Z0-9_\-./]+)(?:"|\')'),
        re.compile(r'(?:"|\')(/auth/[a-zA-Z0-9_\-./]+)(?:"|\')'),
        re.compile(r'(?:"|\')(/admin/[a-zA-Z0-9_\-./]+)(?:"|\')'),
        re.compile(r'(?:"|\')(/status)(?:"|\')'),
        re.compile(r'(?:"|\')(/health)(?:"|\')'),
        re.compile(r'(?:"|\')(/metrics)(?:"|\')'),
        re.compile(r'(?:"|\')(/config)(?:"|\')'),
    ]
    found_paths = set()
    static_extensions = ['.js', '.css', '.png', '.jpg', '.jpeg', '.gif', '.svg', '.ico', '.woff', '.ttf', '.map', '.json'] # Added .json to exclude common static config files
    for pattern in api_path_patterns:
        matches = pattern.findall(js_content)
        for match in matches:
            cleaned_path = match.strip("'\"")
            if not any(cleaned_path.lower().endswith(ext) for ext in static_extensions):
                found_paths.add(cleaned_path)
    return list(found_paths)

# --- Main Scanner Logic ---
def scan_unauthenticated_endpoints(base_url, wordlist_path="seclists/Discovery/Web-Content/common.txt"):
    print(f"[+] Starting scan for unauthenticated API endpoints on: {base_url}")
    parsed_base_url = urlparse(base_url)
    if not parsed_base_url.scheme or not parsed_base_url.netloc:
        print(f"[-] Invalid base URL: {base_url}")
        return

    potential_paths = set()

    # Step 1: Initial crawl to find base paths and JS files
    print("[*] Initial crawl to discover JS files and basic paths...")
    initial_response = make_request(base_url)
    if initial_response and initial_response.status_code == 200:
        # Extract paths from HTML content
        # This part could be expanded for more aggressive HTML parsing for hrefs/forms
        
        # Extract JS files
        js_urls = extract_js_from_html(initial_response.text, base_url)
        print(f"[+] Found {len(js_urls)} JavaScript files.")
        for js_url in js_urls:
            js_response = make_request(js_url)
            if js_response and js_response.status_code == 200:
                extracted_paths = extract_api_paths_from_js(js_response.text)
                for path in extracted_paths:
                    potential_paths.add(path)
                # print(f"    Extracted {len(extracted_paths)} paths from {js_url}")

    # Step 2: Load wordlist for brute-forcing
    wordlist_suffixes = load_wordlist(wordlist_path)
    print(f"[+] Loaded {len(wordlist_suffixes)} wordlist entries.")

    # Combine known base paths (e.g., from JS or manual recon) with wordlist suffixes
    # This example focuses on common API patterns relative to the base URL
    # In a real scenario, you might have explicit /api/v1/ as a base
    
    # Simple base path examples for fuzzing
    api_bases = ["/api/", "/api/v1/", "/v1/", "/api/v2/", "/v2/"]
    for base in api_bases:
        for suffix in wordlist_suffixes:
            # Join base and suffix explicitly: this avoids double slashes and
            # avoids urljoin discarding the base when a suffix starts with '/'
            full_path = base.rstrip('/') + '/' + suffix.lstrip('/')
            potential_paths.add(full_path)
    
    # Add potential paths discovered from JS
    # The `potential_paths` set already handles uniqueness

    print(f"[*] Total unique paths to test: {len(potential_paths)}")
    
    unauthenticated_endpoints = []
    
    http_methods = ['GET', 'POST'] # Most common methods for initial discovery; see "Expanding the Attack Surface"

    for path in sorted(list(potential_paths)):
        target_url = urljoin(base_url, path)
        for method in http_methods:
            # Test without authentication headers
            response_no_auth = make_request(target_url, method=method)
            
            # Test with an empty/invalid Authorization header for comparison
            # This helps differentiate truly public endpoints from those that simply ignore a bad token
            response_with_empty_auth = make_request(target_url, method=method, headers={"Authorization": "Bearer invalid_token"})

            if response_no_auth:
                status_code_no_auth = response_no_auth.status_code
                content_length_no_auth = len(response_no_auth.content)
                
                # Heuristics for unauthenticated access:
                # 1. 200 OK without any auth headers, and response content is non-empty and not a simple error message
                # 2. Response content/status is identical with and without a bogus auth header, suggesting auth isn't checked
                
                is_unauthenticated = False
                if status_code_no_auth == 200 and content_length_no_auth > 50: # Arbitrary content length to avoid empty responses
                    # Check for common "not found" or "bad request" phrases in 200 responses that aren't real data
                    if not (any(phrase in response_no_auth.text.lower() for phrase in ["not found", "bad request", "invalid parameters"])):
                        is_unauthenticated = True
                
                if response_with_empty_auth:
                    if status_code_no_auth == response_with_empty_auth.status_code and \
                       response_no_auth.content == response_with_empty_auth.content and \
                       status_code_no_auth not in (401, 403): # 401/403 means the endpoint enforces auth
                        is_unauthenticated = True

                if is_unauthenticated:
                    print(f"[!] Unauthenticated Endpoint Found: {method} {target_url} (Status: {status_code_no_auth}, Length: {content_length_no_auth})")
                    unauthenticated_endpoints.append(f"{method} {target_url}")
            
    if not unauthenticated_endpoints:
        print("[-] No unauthenticated API endpoints found with current heuristics.")
    else:
        print("\n[+] Summary of Unauthenticated Endpoints:")
        for ep in unauthenticated_endpoints:
            print(f"    - {ep}")

if __name__ == "__main__":
    target_app_url = "http://localhost:8000" # Replace with your target URL
    seclists_path = "path/to/SecLists/Discovery/Web-Content/common.txt" # Adjust path to your SecLists

    # To run, point the script at a reachable web server (e.g., a dummy Flask app),
    # adjust the seclists_path variable, and invoke the scanner:
    scan_unauthenticated_endpoints(target_app_url, wordlist_path=seclists_path)

    # Example output:
    # [+] Starting scan for unauthenticated API endpoints on: http://localhost:8000
    # [*] Initial crawl to discover JS files and basic paths...
    # [+] Found 1 JavaScript files.
    # [*] Total unique paths to test: 2500
    # [!] Unauthenticated Endpoint Found: GET http://localhost:8000/api/v1/health (Status: 200, Length: 20)
    # [!] Unauthenticated Endpoint Found: GET http://localhost:8000/status (Status: 200, Length: 15)
    # [!] Unauthenticated Endpoint Found: POST http://localhost:8000/data (Status: 200, Length: 100)
    # ...
    # [+] Summary of Unauthenticated Endpoints:
    #     - GET http://localhost:8000/api/v1/health
    #     - GET http://localhost:8000/status
    #     - POST http://localhost:8000/data

Analyzing Responses and False Positives

A 200 OK status code alone is insufficient to declare an endpoint "unauthenticated and vulnerable." Many applications return 200 OK for public endpoints that aren't particularly sensitive, or even for 404-like custom error pages. The key is to analyze the content and consistency of the response.

  • Compare the response to a request made without any authentication headers versus one made with an invalid or empty Authorization header. If the responses (status code, content length, body content) are identical, it's a strong indicator that the endpoint is not checking for authentication or is ignoring it (a standalone sketch of this check follows this list).
  • Look for JSON or XML data that appears to be application-specific rather than a generic error message.
  • Check for redirects. If an unauthenticated request redirects to a login page, the endpoint is protected. Our script sets allow_redirects=False so a redirect surfaces as a 3xx status code instead of being silently followed to a 200 login page.
  • The content length heuristic (e.g., content_length_no_auth > 50) helps filter out trivial responses, but it's not foolproof. A small, but sensitive, piece of data could still be exposed. Manual verification of identified endpoints is always required.
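
A standalone sketch of that differential check, reusing the make_request helper from the scanner above:

def looks_unauthenticated(url, method='GET'):
    # Baseline request carrying no Authorization header at all
    baseline = make_request(url, method=method)
    # Same request with a deliberately bogus bearer token
    bogus = make_request(url, method=method, headers={"Authorization": "Bearer invalid_token"})
    if baseline is None or bogus is None:
        return False
    # A 401/403 on either request means some auth check fired
    if baseline.status_code in (401, 403) or bogus.status_code in (401, 403):
        return False
    # Identical status and body with and without the bogus token suggests
    # the endpoint never inspects the credential
    return (baseline.status_code == bogus.status_code and
            baseline.content == bogus.content)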

Expanding the Attack Surface

The provided script is a starting point. Further enhancements include:

  • **Discovering Hidden Parameters**: Use tools like ffuf or extend the Python script to fuzz for common parameter names (e.g., id, user_id, name, query) on discovered paths.
  • **Different HTTP Methods**: While GET and POST are common for initial discovery, PUT, DELETE, and PATCH methods should also be tested, as misconfigurations can expose dangerous functionality.
  • **Content Types**: Experiment with different Content-Type headers (e.g., application/json, application/x-www-form-urlencoded) if the application expects specific input formats.
  • **Dynamic Path Segments**: Many APIs use dynamic segments like /users/{id}/profile. Fuzzing these segments with common IDs (1, admin, self) can reveal additional endpoints, as sketched below.
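
As a starting point for the last idea, a minimal sketch that substitutes common values into a templated path (the {id} placeholder and candidate values are illustrative; make_request and urljoin come from the scanner above):

def fuzz_dynamic_segment(base_url, path_template, candidates=("1", "admin", "self")):
    # path_template uses {id} as its placeholder, e.g., "/users/{id}/profile"
    for candidate in candidates:
        url = urljoin(base_url, path_template.format(id=candidate))
        response = make_request(url)
        if response and response.status_code == 200:
            print(f"[!] {url} -> {response.status_code}, {len(response.content)} bytes")

# fuzz_dynamic_segment("https://api.example.com", "/users/{id}/profile")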

This automated approach provides a systematic way to uncover API attack surface, significantly reducing manual effort and increasing the likelihood of identifying exploitable unauthenticated endpoints.