Beyond the Docs: Scripting Python for Shadow API Discovery

Shadow API discovery is a critical phase in modern penetration testing, moving beyond officially documented endpoints to uncover hidden or unadvertised API surfaces. This isn't about parsing an OpenAPI spec; it's about active reconnaissance and traffic analysis to expose APIs not meant for public consumption or those residing on overlooked subdomains. Python, with its robust HTTP libraries, regular expression capabilities, and automation potential, serves as an invaluable asset in this hunt, transforming tedious manual analysis into repeatable, scalable scripts.

Initial Foothold: Passive Recon for API Leads

Before hitting targets with active requests, gather intelligence passively. Identifying subdomains often reveals internal-facing applications or forgotten API gateways. Similarly, static assets like JavaScript files frequently contain hardcoded API paths that aren't immediately obvious from network traffic.

Subdomain Enumeration for API Gateways

Tools like subfinder or assetfinder efficiently enumerate subdomains. The output from these tools can then be piped into a Python script for further analysis, particularly looking for keywords like api, dev, test, or specific product names. We're sifting for domains that might host API instances rather than user-facing web applications.


subfinder -d example.com -silent | tee subdomains.txt
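The keyword sifting described above is a natural first Python step. A minimal sketch, assuming the subfinder output has been saved to a list (the keyword set is illustrative and should be tuned to the target):

```python
# Keywords that often indicate API-hosting subdomains (illustrative list)
API_KEYWORDS = ["api", "dev", "test", "staging", "internal", "gateway"]

def filter_api_candidates(subdomains, keywords=API_KEYWORDS):
    """Return subdomains whose name contains any API-related keyword."""
    return [s for s in subdomains if any(k in s.lower() for k in keywords)]

# Example against a few hypothetical subfinder results
sample = ["api.example.com", "www.example.com", "dev-gateway.example.com"]
print(filter_api_candidates(sample))  # www.example.com is filtered out
```

Candidates that survive this filter are the ones worth feeding into the status checker below.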

Once you have a list of subdomains, a quick Python script can perform HTTP status checks and categorize them. This helps prune non-responsive or irrelevant hosts before deeper dives.


import requests
import sys
from concurrent.futures import ThreadPoolExecutor

def check_subdomain(subdomain):
    url = f"http://{subdomain}"
    try:
        response = requests.get(url, timeout=5, allow_redirects=True)
        print(f"[+] {url} - Status: {response.status_code}")
        return url, response.status_code
    except requests.exceptions.RequestException as e:
        # print(f"[-] {url} - Error: {e}") # Uncomment for verbose error
        return url, None

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python subdomain_checker.py <subdomains_file.txt>")
        sys.exit(1)

    subdomains_file = sys.argv[1]
    with open(subdomains_file, 'r') as f:
        subdomains = [line.strip() for line in f if line.strip()]

    print(f"Checking {len(subdomains)} subdomains...")
    
    with ThreadPoolExecutor(max_workers=20) as executor:
        results = list(executor.map(check_subdomain, subdomains))
    
    # Optional: Further processing of results, e.g., filtering 200s

JavaScript Scrutiny for Hidden Endpoints

JavaScript files are a goldmine for undocumented API endpoints. Developers often hardcode paths, sensitive strings, or even full API schemas within client-side code. After identifying JavaScript files from crawled pages or identified subdomains, fetch them and apply regular expressions to extract potential API paths.

Tools like waybackurls or gau (short for getallurls; install with go install github.com/lc/gau@latest) can quickly pull archived URLs, including JS files, associated with a domain. We'll then use Python to automate fetching and scanning these scripts.


gau example.com | grep '\.js$' | tee js_files.txt

import requests
import re
import sys
from concurrent.futures import ThreadPoolExecutor

# Regex patterns for common API endpoint structures
# Examples: /api/v1/users, /users/id, /endpoint
API_PATTERNS = [
    re.compile(r'/(?:api|v\d+)/[a-zA-Z0-9_\-/]+'),
    re.compile(r'/[a-zA-Z0-9_\-]+/v\d+/[a-zA-Z0-9_\-/]+'), # e.g., /service/v1/resource
    re.compile(r'/[a-zA-Z0-9_\-]+/(?:create|read|update|delete|get|post|put|patch|info|data)'), # common actions
    re.compile(r'"(?:/api|/v\d+|/[a-zA-Z0-9_\-]+/(?:api|v\d+))/[a-zA-Z0-9_\-/]+\b"') # paths in quotes
]

def fetch_and_scan_js(js_url):
    try:
        response = requests.get(js_url, timeout=10)
        if response.status_code == 200 and 'javascript' in response.headers.get('Content-Type', ''):
            print(f"[+] Processing JS: {js_url}")
            found_endpoints = set()
            for pattern in API_PATTERNS:
                matches = pattern.finditer(response.text)
                for match in matches:
                    endpoint = match.group(0).strip('"') # Remove quotes if matched with quote pattern
                    if endpoint.startswith('/') and len(endpoint) > 2: # Basic sanity check
                        found_endpoints.add(endpoint)
            return list(found_endpoints)
        else:
            # print(f"[-] Not a valid JS file or non-200 status for {js_url}") # Verbose error
            pass
    except requests.exceptions.RequestException as e:
        # print(f"[-] Error fetching {js_url}: {e}") # Verbose error
        pass
    return []

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python js_scanner.py <js_urls_file.txt>")
        sys.exit(1)

    js_urls_file = sys.argv[1]
    with open(js_urls_file, 'r') as f:
        js_urls = [line.strip() for line in f if line.strip()]

    print(f"Scanning {len(js_urls)} JavaScript files for API endpoints...")

    all_discovered_endpoints = set()
    with ThreadPoolExecutor(max_workers=10) as executor:
        for endpoints_list in executor.map(fetch_and_scan_js, js_urls):
            all_discovered_endpoints.update(endpoints_list)
    
    print("\n--- Discovered API Endpoints ---")
    for ep in sorted(list(all_discovered_endpoints)):
        print(ep)

Active Analysis: Intercepting and Deciphering Traffic

Passive reconnaissance provides leads; active traffic analysis confirms and expands them. Proxy tools are indispensable here, but raw proxy logs quickly become unmanageable. Python can process these logs efficiently.

Proxying with a Purpose (Burp Suite)

Route all browser traffic, mobile app traffic, or even command-line tool traffic through Burp Suite. This captures every HTTP request and response. The crucial part is not just capturing, but observing patterns. Look for:

  • Requests to non-standard ports or subdomains.
  • JSON or XML request/response bodies.
  • Authorization headers (Bearer tokens, API keys).
  • Unusual HTTP methods (PATCH, PROPFIND).
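Most of these signals can be flagged programmatically. The sketch below assumes each captured request has already been reduced to a dict with method and headers keys (a format you would produce yourself; it is not a Burp export format), and covers the method, auth-header, and body-format checks:

```python
# Methods considered routine; anything else (PATCH, PROPFIND, ...) gets flagged.
# Both this set and the header names below are illustrative heuristics.
COMMON_METHODS = {"GET", "POST", "HEAD", "OPTIONS"}

def flag_interesting(request):
    """Return a list of reasons a captured request looks API-like."""
    reasons = []
    if request["method"].upper() not in COMMON_METHODS:
        reasons.append(f"unusual method {request['method']}")
    headers = {k.lower(): v for k, v in request.get("headers", {}).items()}
    if "authorization" in headers or "x-api-key" in headers:
        reasons.append("carries an auth header")
    content_type = headers.get("content-type", "")
    if "json" in content_type or "xml" in content_type:
        reasons.append("structured request body")
    return reasons

req = {"method": "PATCH",
       "headers": {"Authorization": "Bearer abc",
                   "Content-Type": "application/json"}}
print(flag_interesting(req))
```

Requests that trigger multiple flags are strong shadow-API candidates and deserve manual follow-up in the proxy history.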

Exporting Burp's HTTP history, typically in XML format, allows for programmatic analysis. While Burp offers extensions for deeper analysis, external Python scripting gives more flexibility and integration with other custom tools.


# Manual steps in Burp Suite:
# 1. Proxy -> HTTP history
# 2. Select the entries of interest (Ctrl+A for all) -> right-click -> "Save items"
# 3. Choose a file name (e.g., burp_history.xml) and keep the
#    "Base64-encode requests and responses" option checked

Programmatic Traffic Analysis with Python

Once Burp history is exported, Python can parse the XML and extract interesting data. We're looking for unique endpoints, parameters, and headers that signal API interaction. The xml.etree.ElementTree module is standard for XML parsing in Python.


import xml.etree.ElementTree as ET
import re
import sys
from urllib.parse import urlparse, parse_qs

def analyze_burp_history(xml_file):
    tree = ET.parse(xml_file)
    root = tree.getroot()

    discovered_endpoints = set()
    discovered_parameters = set()
    
    print(f"Analyzing Burp Suite history from {xml_file}...")

    for item in root.findall('item'):
        host = item.find('host').text
        port = item.find('port').text
        protocol = item.find('protocol').text
        url_path = item.find('url').text # Full URL from Burp

        parsed_url = urlparse(url_path)
        path = parsed_url.path
        query_params = parse_qs(parsed_url.query)

        full_endpoint = f"{protocol}://{host}:{port}{path}"
        discovered_endpoints.add(full_endpoint)

        for param_name in query_params:
            discovered_parameters.add(param_name)

        # Look for parameters in POST bodies if available
        request_base64 = item.find('request').text
        if request_base64:
            # Burp's request/response elements are often Base64 encoded
            import base64
            try:
                decoded_request = base64.b64decode(request_base64).decode('utf-8', errors='ignore')
                # Simple regex to find common parameter patterns in POST body (e.g., key=value or "key":"value")
                body_params = re.findall(r'(?:[a-zA-Z0-9_\-]+)=(?:[^&]+)|"(?:[a-zA-Z0-9_\-]+)":"(?:[^"]+)"', decoded_request)
                for param in body_params:
                    # Extract just the key part
                    if '=' in param:
                        discovered_parameters.add(param.split('=')[0])
                    elif ':' in param:
                        discovered_parameters.add(param.split(':')[0].strip('"'))
            except Exception as e:
                # print(f"Error decoding request: {e}") # Verbose error
                pass


    print("\n--- Unique Discovered Endpoints ---")
    for ep in sorted(list(discovered_endpoints)):
        print(ep)

    print("\n--- Unique Discovered Parameters ---")
    for param in sorted(list(discovered_parameters)):
        print(param)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python burp_parser.py <burp_history.xml>")
        sys.exit(1)

    burp_file = sys.argv[1]
    analyze_burp_history(burp_file)

Probing the Unknown: Fuzzing for Shadow Endpoints

Even after extensive passive and active analysis, many API endpoints remain hidden. These often follow predictable patterns but are simply not linked or referenced. Fuzzing with intelligent wordlists becomes crucial here.

Intelligent Wordlist Generation

Generic wordlists are a start, but custom, target-specific wordlists are far more effective. The previously extracted parameters and common API keywords (user, admin, v1, auth, dashboard, report) are excellent seeds. Combine these with common HTTP verbs and known API conventions to generate targeted lists.


import itertools

def generate_api_wordlist(base_paths, common_terms, versions):
    wordlist = set()
    
    # Base paths with common terms
    for path in base_paths:
        for term in common_terms:
            wordlist.add(f"{path}/{term}")
            wordlist.add(f"{path}/v1/{term}") # Common versioning
            for ver in versions:
                 wordlist.add(f"{path}/{ver}/{term}")

    # Combinations of terms and versions
    for term_a, term_b in itertools.permutations(common_terms, 2):
        wordlist.add(f"/{term_a}/{term_b}")
        for ver in versions:
            wordlist.add(f"/{ver}/{term_a}/{term_b}")

    # Add common API route segments directly
    for term in common_terms:
        wordlist.add(f"/api/{term}")
        if not term.endswith('s'):
            wordlist.add(f"/{term}s") # Plural form (skip terms already plural)
        for ver in versions:
            wordlist.add(f"/api/{ver}/{term}")
            wordlist.add(f"/{ver}/{term}")

    return sorted(list(wordlist))

if __name__ == "__main__":
    base_paths_discovered = ["/api", "/admin", "/service"] # From prior recon
    common_api_terms = ["users", "products", "orders", "auth", "login", "register", "status", "data", "report", "config"]
    api_versions = ["v1", "v2", "v3", "alpha", "beta"]

    custom_wordlist = generate_api_wordlist(base_paths_discovered, common_api_terms, api_versions)
    
    with open("custom_api_wordlist.txt", "w") as f:
        for item in custom_wordlist:
            f.write(item + "\n")
    print(f"Generated custom API wordlist with {len(custom_wordlist)} entries: custom_api_wordlist.txt")

    # Example entries from the generated list
    print("\nExample entries:")
    for i in range(min(10, len(custom_wordlist))):
        print(custom_wordlist[i])

Targeted Endpoint Fuzzing with ffuf and Python

ffuf is an incredibly fast fuzzer ideal for discovering hidden endpoints. Pair it with the custom wordlists generated above. Focus on common API prefixes like /api/, /v1/, or subdomains identified as potential API gateways.


# Example ffuf command targeting a base API path
# -w: wordlist
# -u: URL with FUZZ keyword
# -H: Add header (e.g., Content-Type for JSON APIs)
# -mc: Match status codes (200, 401, 403, 500 can all indicate API presence)
# -fs: Filter response size (e.g., ignore known 404 page sizes)
# -recursion / -recursion-depth: recursively fuzz discovered paths
#   (FUZZ must be at the end of the URL; useful but can be noisy)
# -o / -of: Output file and format

ffuf -w custom_api_wordlist.txt -u https://api.example.com/FUZZ -H "Content-Type: application/json" -mc 200,401,403,500 -fs 1234,4567 -o api_fuzz_results.json -of json

The -of json flag for ffuf makes output parsing straightforward. A Python script can then ingest these results, filter out noise, and identify unique, responsive endpoints. Focus on non-404/non-302 responses that don't match typical static asset sizes or known error page sizes.


import json
import sys

def parse_ffuf_results(json_file):
    discovered_endpoints = set()
    
    with open(json_file, 'r') as f:
        data = json.load(f)
    
    print(f"Parsing ffuf results from {json_file}...")

    for result in data['results']:
        url = result['url']
        status_code = result['status']
        length = result['length']
        
        # Heuristic: Filter out common 404-like responses by status and/or size
        # Adjust 'known_404_sizes' based on your target's typical error page sizes
        known_404_sizes = {1234, 4567, 8910} # Example sizes to filter

        if status_code not in (404, 302) and length not in known_404_sizes:
            discovered_endpoints.add(url)
            # Optionally print more details
            # print(f"Found: {url} (Status: {status_code}, Size: {length})")

    print("\n--- Fuzzed Discovered API Endpoints ---")
    for ep in sorted(list(discovered_endpoints)):
        print(ep)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python ffuf_parser.py <ffuf_output.json>")
        sys.exit(1)

    ffuf_output_file = sys.argv[1]
    parse_ffuf_results(ffuf_output_file)

Beyond Discovery: Parameter and Authentication Nuances

Uncovering Hidden Parameters

Discovering an endpoint is only half the battle. Many APIs hide parameters. Once an endpoint is identified, probe it with common API parameter names (e.g., id, user_id, token, callback, format, debug) using various HTTP methods. Python's requests library makes this trivial, allowing rapid iteration through potential parameters and values.


import requests

def test_parameters(endpoint, common_params, headers=None):
    print(f"Testing parameters for: {endpoint}")
    found_params = []
    
    for param in common_params:
        # Test with a GET request
        try:
            params = {param: "testvalue"}
            response = requests.get(endpoint, params=params, headers=headers, timeout=5)
            if response.status_code not in (400, 404, 405): # Anything but an explicit error is interesting
                print(f"  [+] GET with '{param}' (Status: {response.status_code})")
                found_params.append(param)
        except requests.exceptions.RequestException:
            pass
        
        # Test with a POST request (if the endpoint might accept POST)
        try:
            data = {param: "testvalue"}
            response = requests.post(endpoint, json=data, headers=headers, timeout=5)
            if response.status_code not in (400, 404, 405):
                print(f"  [+] POST with '{param}' (Status: {response.status_code})")
                if param not in found_params:
                    found_params.append(param)
        except requests.exceptions.RequestException:
            pass
            
    return found_params

if __name__ == "__main__":
    target_endpoint = "https://api.example.com/v1/users" # Example from earlier discovery
    common_api_parameters = ["id", "user_id", "name", "email", "query", "limit", "offset", 
                             "callback", "debug", "version", "sort_by", "filter"]
    
    # Example headers (e.g., if you have a known API key or JWT)
    # auth_headers = {"Authorization": "Bearer YOUR_TOKEN_HERE"}
    auth_headers = None # Start without auth if unknown

    discovered = test_parameters(target_endpoint, common_api_parameters, auth_headers)
    if discovered:
        print(f"\nDiscovered parameters for {target_endpoint}: {', '.join(discovered)}")
    else:
        print(f"\nNo new parameters discovered for {target_endpoint}.")

Scripted Auth Testing

Shadow APIs often have weaker or misconfigured authentication. Once an API is discovered, Python scripts can be used to test various authentication bypasses: no authentication, invalid tokens, default credentials, or even iterating through common weak JWTs. This requires a targeted approach for each discovered API, potentially chaining with previously found authorization headers or cookies.


import requests

def test_auth_bypass(endpoint, method="GET", json_data=None, headers=None):
    print(f"Testing auth bypass for {method} {endpoint}")
    try:
        if method.upper() == "GET":
            response = requests.get(endpoint, headers=headers, timeout=7)
        elif method.upper() == "POST":
            response = requests.post(endpoint, json=json_data, headers=headers, timeout=7)
        else:
            print(f"Unsupported method: {method}")
            return
        
        print(f"  [+] Status: {response.status_code}")
        print(f"  [+] Response snippet: {response.text[:200]}...") # Show first 200 chars
        
        if response.status_code == 200:
            print("  [!!!] Potential bypass: 200 OK without expected authentication.")
        elif response.status_code == 401:
            print("  [-] Authentication required (401 Unauthorized).")
        elif response.status_code == 403:
            print("  [-] Forbidden (403 Forbidden).")
        
    except requests.exceptions.RequestException as e:
        print(f"  [-] Request error: {e}")

if __name__ == "__main__":
    sensitive_endpoint = "https://api.example.com/v1/admin/users" # Found admin endpoint
    
    print("--- Testing without any authentication ---")
    test_auth_bypass(sensitive_endpoint, method="GET")

    print("\n--- Testing with a potentially invalid/expired token ---")
    invalid_headers = {"Authorization": "Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.invalid.signature"}
    test_auth_bypass(sensitive_endpoint, method="GET", headers=invalid_headers)

    # Example of testing a POST endpoint if relevant
    # post_data = {"action": "list_all_users"}
    # print("\n--- Testing POST with a potentially invalid token ---")
    # test_auth_bypass(sensitive_endpoint, method="POST", json_data=post_data, headers=invalid_headers)
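The "common weak JWTs" idea above can be sketched with nothing but the standard library: re-sign the token's header and payload with each candidate secret and compare against the original HS256 signature. The wordlist here is a tiny placeholder; in practice you would feed in a published weak-secret list, and this only applies to HMAC-signed (HS256) tokens:

```python
import base64
import hashlib
import hmac

WEAK_SECRETS = ["secret", "password", "changeme", "admin", "jwt"]  # placeholder list

def b64url_decode(segment):
    """Base64url-decode a JWT segment, restoring the stripped padding."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))

def crack_hs256(token, candidate_secrets):
    """Return the secret that reproduces the token's HS256 signature, if any."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    signing_input = f"{header_b64}.{payload_b64}".encode()
    signature = b64url_decode(sig_b64)
    for secret in candidate_secrets:
        candidate = hmac.new(secret.encode(), signing_input, hashlib.sha256).digest()
        if hmac.compare_digest(candidate, signature):
            return secret
    return None

# Usage against a token captured via the proxy:
# found = crack_hs256(captured_token, WEAK_SECRETS)
```

If a secret is recovered, you can forge arbitrary tokens for the discovered API, which is worth reporting on its own even before exploring the endpoints further.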