Identifying unauthenticated API endpoints is a critical step in any web application penetration test. The goal is to programmatically discover API paths that respond with data or perform actions without requiring a valid authentication token, often exposing sensitive information or functionality. This write-up details a Python-centric approach to automating this discovery process.
Initial Reconnaissance: Beyond the Browser
Before any automated scanning begins, a robust reconnaissance phase is essential. This isn't just about clicking around; it involves deep inspection of client-side code, network traffic, and publicly available documentation.
Passive Information Gathering
Start by scrutinizing public-facing documentation like Swagger UI or OpenAPI specifications. These often explicitly list endpoints, their methods, and expected parameters. Even when they flag authentication requirements, they are a prime source of potential paths. Additionally, examine the target application's JavaScript files. Many modern web applications make API calls directly from the browser, and these calls are often hardcoded within the JavaScript. Look for instances of fetch(), XMLHttpRequest, axios.get(), or similar constructs that reveal API paths.
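When an OpenAPI document is available, the listed paths can be pulled out programmatically and fed straight into a candidate list. A minimal sketch, assuming an OpenAPI 3.x structure ("paths" mapping each path to its methods); the sample spec and helper name below are invented for illustration:

```python
import json

# Hypothetical spec, as it might be downloaded from /openapi.json
sample_spec = json.dumps({
    "openapi": "3.0.0",
    "paths": {
        "/api/v1/users": {"get": {}, "post": {}},
        "/api/v1/health": {"get": {}},
    },
})

def endpoints_from_openapi(spec_json):
    """Return sorted (METHOD, path) pairs listed in an OpenAPI document."""
    spec = json.loads(spec_json)
    pairs = []
    for path, operations in spec.get("paths", {}).items():
        for method in operations:
            if method.lower() in ("get", "post", "put", "delete", "patch"):
                pairs.append((method.upper(), path))
    return sorted(pairs)

print(endpoints_from_openapi(sample_spec))
# [('GET', '/api/v1/health'), ('GET', '/api/v1/users'), ('POST', '/api/v1/users')]
```

Each pair can then be probed directly, since the spec already tells you which methods the server expects.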
Active Spidering and Proxying
Utilize a web proxy like Burp Suite Professional to spider the application comprehensively. Configure your browser to proxy traffic through Burp, then manually navigate through all accessible parts of the application. The Burp Proxy history and Site map will populate with all observed requests, including those to API endpoints. Pay close attention to calls made during page loads, form submissions, and AJAX interactions. Exporting the sitemap or the HTTP history can provide a rich list of potential API paths to feed into an automated script.
For a quick manual check of a suspected endpoint, curl is indispensable. Testing without any authentication headers provides a baseline:
curl -s -i "https://api.example.com/v1/users"
Observe the HTTP status code and response body. A 200 OK with meaningful data without any prior authentication is a strong indicator of an unauthenticated endpoint. Conversely, a 401 Unauthorized, 403 Forbidden, or a redirect to a login page suggests protection.
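Once you move beyond one-off curl checks, the same status-code triage can be scripted. A small sketch of the interpretation rules above; the function name and category labels are my own:

```python
def classify_response(status_code, location=None):
    """Rough triage of an unauthenticated probe, per the rules above."""
    if status_code in (401, 403):
        return "protected"
    if status_code in (301, 302, 303, 307, 308):
        # A redirect to a login page suggests protection
        if location and "login" in location.lower():
            return "protected (login redirect)"
        return "inconclusive (redirect)"
    if status_code == 200:
        return "possibly unauthenticated"
    return "inconclusive"

print(classify_response(200))                     # possibly unauthenticated
print(classify_response(302, "/accounts/login"))  # protected (login redirect)
```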
Building the Python Endpoint Scanner
Python's requests library is the cornerstone of this automation. It provides an intuitive way to interact with HTTP services, making it ideal for crafting and sending requests to potential API endpoints.
The requests Library: Your HTTP Workhorse
The requests library simplifies HTTP requests. We'll use it to send GET, POST, and potentially other HTTP methods to a list of suspected API paths. It handles connection pooling, SSL verification, and cookie management, making our script robust.
import requests

def make_request(url, method='GET', headers=None, data=None):
    try:
        response = requests.request(method, url, headers=headers, data=data, timeout=5, allow_redirects=False)
        return response
    except requests.exceptions.RequestException as e:
        print(f"Error making request to {url}: {e}")
        return None

# Example usage:
# response = make_request("https://api.example.com/status")
# if response:
#     print(f"Status: {response.status_code}, Length: {len(response.text)}")
Endpoint Enumeration: Wordlists and Fuzzing
Beyond endpoints discovered via passive and active reconnaissance, a significant portion of the discovery process involves brute-forcing common API path patterns. API endpoints often follow predictable structures like /api/v1/users, /v2/products, /admin/config.
Leverage comprehensive wordlists from resources like SecLists, specifically those targeting directories and common API endpoints. Good starting points include Discovery/Web-Content/raft-medium-directories.txt or Discovery/Web-Content/common.txt, and more API-specific lists if available.
The strategy involves combining a base API path (e.g., /api/v1/) with a list of common endpoint suffixes (e.g., users, items, status, health, config, admin, profile). It's also critical to consider different HTTP methods (GET, POST, PUT, DELETE) for each path, as an endpoint might be unauthenticated for one method but protected for another.
def load_wordlist(filepath):
    try:
        with open(filepath, 'r') as f:
            return [line.strip() for line in f if line.strip() and not line.startswith('#')]
    except FileNotFoundError:
        print(f"Error: Wordlist not found at {filepath}")
        return []

# Common API path suffixes
common_suffixes = load_wordlist("seclists/Discovery/Web-Content/raft-medium-directories.txt")
# Or a custom list
# common_suffixes = ["users", "products", "items", "status", "health", "metrics", "config", "admin", "profile", "data", "search"]
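To make the base-plus-suffix-plus-method strategy concrete, here is a small generator sketch; the helper name and sample inputs are illustrative, not part of the scanner script:

```python
from itertools import product

def build_candidates(bases, suffixes, methods=("GET", "POST")):
    """Yield a (method, path) pair for every base/suffix/method combination."""
    for base, suffix, method in product(bases, suffixes, methods):
        # Normalize slashes so "/api/v1/" + "users" -> "/api/v1/users"
        path = base.rstrip("/") + "/" + suffix.lstrip("/")
        yield method, path

candidates = list(build_candidates(["/api/v1/", "/v2/"], ["users", "health"]))
print(len(candidates))  # 8: 2 bases x 2 suffixes x 2 methods
```

The cross-product grows quickly, which is why the wordlist choice matters more than raw request speed.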
Parsing JavaScript for Endpoint Hints
JavaScript files are a treasure trove of API endpoint information. Extracting these hints requires fetching the JavaScript and then applying regular expressions or parsing libraries to identify potential paths. This can be integrated into the overall scanning process.
import re
import requests
from bs4 import BeautifulSoup

def extract_js_from_html(html_content, base_url):
    soup = BeautifulSoup(html_content, 'html.parser')
    js_urls = []
    for script in soup.find_all('script', src=True):
        src = script['src']
        if src.startswith('http'):
            js_urls.append(src)
        elif src.startswith('//'):  # Protocol-relative URL
            js_urls.append("https:" + src)  # Assume HTTPS
        else:  # Relative URL
            js_urls.append(requests.compat.urljoin(base_url, src))
    return js_urls

def extract_api_paths_from_js(js_content):
    # Regex for common API path patterns within JS strings.
    # Looks for /api/vX/..., /vX/..., and a few specific prefixes;
    # static file extensions are filtered out below.
    api_path_patterns = [
        re.compile(r'(?:"|\')(/api/v\d+/[a-zA-Z0-9_\-./]+)(?:"|\')'),
        re.compile(r'(?:"|\')(/v\d+/[a-zA-Z0-9_\-./]+)(?:"|\')'),
        re.compile(r'(?:"|\')(/users/[a-zA-Z0-9_\-./]+)(?:"|\')'),  # Specific patterns
        re.compile(r'(?:"|\')(/auth/[a-zA-Z0-9_\-./]+)(?:"|\')'),
        re.compile(r'(?:"|\')(/admin/[a-zA-Z0-9_\-./]+)(?:"|\')'),
    ]
    found_paths = set()
    for pattern in api_path_patterns:
        for match in pattern.findall(js_content):
            # Simple sanitization to remove common JS string artifacts
            cleaned_path = match.strip("'\"")
            # Further filtering: exclude paths likely to be static assets
            if not any(cleaned_path.endswith(ext) for ext in ['.js', '.css', '.png', '.jpg', '.gif', '.svg', '.woff', '.ttf', '.map']):
                found_paths.add(cleaned_path)
    return list(found_paths)
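As a quick sanity check, the core regex idea can be exercised against an inline snippet; the JavaScript below is invented for illustration:

```python
import re

# Hypothetical JS fragment, as might be found in a bundled frontend file
js_snippet = """
fetch('/api/v1/users/profile');
axios.get("/api/v1/orders/recent");
var logo = "/static/img/logo.png";
"""

# Same pattern shape as the first entry in api_path_patterns above
pattern = re.compile(r'(?:"|\')(/api/v\d+/[a-zA-Z0-9_\-./]+)(?:"|\')')
paths = sorted(set(pattern.findall(js_snippet)))
print(paths)
# ['/api/v1/orders/recent', '/api/v1/users/profile']
```

Note that the static image path is never matched, since it lacks the /api/vX prefix the pattern anchors on.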
The Automation Script: Core Logic
The core automation script ties together the reconnaissance and enumeration techniques. It iterates through a list of potential API paths, sends requests, and analyzes the responses for indicators of unauthenticated access. We'll define "unauthenticated" as receiving a 200 OK status with meaningful content, or a response that is identical to a request made *without* an authentication header, implying no authentication check occurred.
import requests
import re
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

# --- Helper functions (from above) ---
def make_request(url, method='GET', headers=None, data=None):
    try:
        # User-Agent header for better emulation and to avoid trivial blocks
        default_headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
        }
        if headers:
            default_headers.update(headers)
        response = requests.request(method, url, headers=default_headers, data=data, timeout=7, allow_redirects=False)
        return response
    except requests.exceptions.RequestException as e:
        # print(f"Error making request to {url}: {e}")
        return None

def load_wordlist(filepath):
    try:
        with open(filepath, 'r') as f:
            return [line.strip() for line in f if line.strip() and not line.startswith('#')]
    except FileNotFoundError:
        # print(f"Error: Wordlist not found at {filepath}. Using empty list.")
        return []

def extract_js_from_html(html_content, base_url):
    soup = BeautifulSoup(html_content, 'html.parser')
    js_urls = []
    for script in soup.find_all('script', src=True):
        src = script['src']
        if src.startswith('http') or src.startswith('//'):
            js_urls.append(src if src.startswith('http') else "https:" + src)
        else:  # Relative URL
            js_urls.append(urljoin(base_url, src))
    return js_urls

def extract_api_paths_from_js(js_content):
    api_path_patterns = [
        re.compile(r'(?:"|\')(/api/v\d+/[a-zA-Z0-9_\-./]+)(?:"|\')'),
        re.compile(r'(?:"|\')(/v\d+/[a-zA-Z0-9_\-./]+)(?:"|\')'),
        re.compile(r'(?:"|\')(/users/[a-zA-Z0-9_\-./]+)(?:"|\')'),
        re.compile(r'(?:"|\')(/auth/[a-zA-Z0-9_\-./]+)(?:"|\')'),
        re.compile(r'(?:"|\')(/admin/[a-zA-Z0-9_\-./]+)(?:"|\')'),
        re.compile(r'(?:"|\')(/status)(?:"|\')'),
        re.compile(r'(?:"|\')(/health)(?:"|\')'),
        re.compile(r'(?:"|\')(/metrics)(?:"|\')'),
        re.compile(r'(?:"|\')(/config)(?:"|\')'),
    ]
    found_paths = set()
    static_extensions = ['.js', '.css', '.png', '.jpg', '.jpeg', '.gif', '.svg', '.ico', '.woff', '.ttf', '.map', '.json']  # .json excluded to skip common static config files
    for pattern in api_path_patterns:
        for match in pattern.findall(js_content):
            cleaned_path = match.strip("'\"")
            if not any(cleaned_path.lower().endswith(ext) for ext in static_extensions):
                found_paths.add(cleaned_path)
    return list(found_paths)
# --- Main Scanner Logic ---
def scan_unauthenticated_endpoints(base_url, wordlist_path="seclists/Discovery/Web-Content/common.txt"):
    print(f"[+] Starting scan for unauthenticated API endpoints on: {base_url}")
    parsed_base_url = urlparse(base_url)
    if not parsed_base_url.scheme or not parsed_base_url.netloc:
        print(f"[-] Invalid base URL: {base_url}")
        return

    potential_paths = set()

    # Step 1: Initial crawl to find base paths and JS files
    print("[*] Initial crawl to discover JS files and basic paths...")
    initial_response = make_request(base_url)
    if initial_response and initial_response.status_code == 200:
        # This could be expanded with more aggressive HTML parsing of hrefs/forms
        js_urls = extract_js_from_html(initial_response.text, base_url)
        print(f"[+] Found {len(js_urls)} JavaScript files.")
        for js_url in js_urls:
            js_response = make_request(js_url)
            if js_response and js_response.status_code == 200:
                for path in extract_api_paths_from_js(js_response.text):
                    potential_paths.add(path)

    # Step 2: Load wordlist for brute-forcing
    wordlist_suffixes = load_wordlist(wordlist_path)
    print(f"[+] Loaded {len(wordlist_suffixes)} wordlist entries.")

    # Combine known base paths with wordlist suffixes. This example fuzzes
    # common API patterns relative to the base URL; in a real scenario you
    # might have an explicit /api/v1/ base from recon.
    api_bases = ["/api/", "/api/v1/", "/v1/", "/api/v2/", "/v2/"]
    for base in api_bases:
        for suffix in wordlist_suffixes:
            # Join cleanly so no double slashes appear in the path
            full_path = base.rstrip('/') + '/' + suffix.lstrip('/')
            potential_paths.add(full_path)

    # Paths discovered from JS are already included; the set handles uniqueness.
    print(f"[*] Total unique paths to test: {len(potential_paths)}")

    unauthenticated_endpoints = []
    http_methods = ['GET', 'POST']  # Commonly unauthenticated for discovery

    for path in sorted(potential_paths):
        target_url = urljoin(base_url, path)
        for method in http_methods:
            # Test without authentication headers
            response_no_auth = make_request(target_url, method=method)
            # Test with a bogus Authorization header for comparison. This helps
            # differentiate truly public endpoints from those that simply ignore a bad token.
            response_bogus_auth = make_request(target_url, method=method, headers={"Authorization": "Bearer invalid_token"})

            if not response_no_auth:
                continue

            status_code_no_auth = response_no_auth.status_code
            content_length_no_auth = len(response_no_auth.content)

            # Heuristics for unauthenticated access:
            # 1. 200 OK without any auth headers, and response content is non-empty and not a simple error message
            # 2. A 200 response identical with and without a bogus auth header, suggesting auth isn't checked
            is_unauthenticated = False
            if status_code_no_auth == 200 and content_length_no_auth > 50:  # Arbitrary length floor to skip near-empty responses
                # Skip 200 responses that are really error pages in disguise
                if not any(phrase in response_no_auth.text.lower() for phrase in ["not found", "bad request", "invalid parameters"]):
                    is_unauthenticated = True

            if response_bogus_auth:
                if status_code_no_auth == 200 and \
                        status_code_no_auth == response_bogus_auth.status_code and \
                        response_no_auth.content == response_bogus_auth.content:
                    # Identical 200s regardless of token: auth is not enforced.
                    # (A 401/403 anywhere means the endpoint is authenticated.)
                    is_unauthenticated = True

            if is_unauthenticated:
                print(f"[!] Unauthenticated Endpoint Found: {method} {target_url} (Status: {status_code_no_auth}, Length: {content_length_no_auth})")
                unauthenticated_endpoints.append(f"{method} {target_url}")

    if not unauthenticated_endpoints:
        print("[-] No unauthenticated API endpoints found with current heuristics.")
    else:
        print("\n[+] Summary of Unauthenticated Endpoints:")
        for ep in unauthenticated_endpoints:
            print(f"  - {ep}")
if __name__ == "__main__":
    target_app_url = "http://localhost:8000"  # Replace with your target URL
    seclists_path = "path/to/SecLists/Discovery/Web-Content/common.txt"  # Adjust path to your SecLists
    # To run, ensure you have a local web server (e.g., python -m http.server or a dummy Flask app)
    # and adjust the `seclists_path` variable.
    scan_unauthenticated_endpoints(target_app_url, seclists_path)

# Example output:
# [+] Starting scan for unauthenticated API endpoints on: http://localhost:8000
# [*] Initial crawl to discover JS files and basic paths...
# [+] Found 1 JavaScript files.
# [*] Total unique paths to test: 2500
# [!] Unauthenticated Endpoint Found: GET http://localhost:8000/api/v1/health (Status: 200, Length: 20)
# [!] Unauthenticated Endpoint Found: GET http://localhost:8000/status (Status: 200, Length: 15)
# [!] Unauthenticated Endpoint Found: POST http://localhost:8000/data (Status: 200, Length: 100)
# ...
# [+] Summary of Unauthenticated Endpoints:
#   - GET http://localhost:8000/api/v1/health
#   - GET http://localhost:8000/status
#   - POST http://localhost:8000/data
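To try the scanner end to end without a real target, a throwaway local API can be spun up with nothing but the standard library. This sketch serves a single unauthenticated /api/v1/health endpoint; it is a test harness of my own, not part of the scanner itself:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class DummyAPI(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/api/v1/health":
            # No auth check at all -- exactly what the scanner should flag
            body = json.dumps({"status": "ok", "uptime": 1234}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), DummyAPI)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/api/v1/health"
with urllib.request.urlopen(url) as resp:
    status, body = resp.status, resp.read().decode()
print(status, body)
server.shutdown()
```

Point target_app_url at the printed port and the scanner should report the health endpoint via its identical-response heuristic.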
Analyzing Responses and False Positives
A 200 OK status code alone is insufficient to declare an endpoint "unauthenticated and vulnerable." Many applications return 200 OK for public endpoints that aren't particularly sensitive, or even for 404-like custom error pages. The key is to analyze the content and consistency of the response.
- Compare the response to a request made without any authentication headers against one made with an invalid or empty Authorization header. If the responses are identical in status code, content length, and body, it's a strong indicator that the endpoint is not checking for authentication, or is ignoring it.
- Look for JSON or XML data that appears to be application-specific rather than a generic error message.
- Check for redirects. If an unauthenticated request redirects to a login page, the endpoint is protected. Our script disables allow_redirects to prevent false positives from redirects.
- The content-length heuristic (e.g., content_length_no_auth > 50) helps filter out trivial responses, but it's not foolproof: a small but sensitive piece of data could still be exposed. Manual verification of identified endpoints is always required.
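The comparison logic boils down to a small predicate. Shown here as a standalone sketch (the function name is mine) so it can be unit-tested apart from any network calls:

```python
def looks_unauthenticated(status_no_auth, body_no_auth,
                          status_bad_token, body_bad_token):
    """True when a bogus token changes nothing and the server never
    answers 401/403 -- i.e., auth is likely not being checked."""
    if status_no_auth in (401, 403) or status_bad_token in (401, 403):
        return False
    return (status_no_auth == status_bad_token
            and body_no_auth == body_bad_token)

print(looks_unauthenticated(200, b'{"users": []}', 200, b'{"users": []}'))  # True
print(looks_unauthenticated(200, b'{"users": []}', 401, b'Unauthorized'))   # False
```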
Expanding the Attack Surface
The provided script is a starting point. Further enhancements include:
- **Discovering Hidden Parameters**: Use tools like ffuf, or extend the Python script, to fuzz for common parameter names (e.g., id, user_id, name, query) on discovered paths.
- **Different HTTP Methods**: While GET and POST are common for initial discovery, PUT, DELETE, and PATCH should also be tested, as misconfigurations can expose dangerous functionality.
- **Content Types**: Experiment with different Content-Type headers (e.g., application/json, application/x-www-form-urlencoded) if the application expects specific input formats.
- **Dynamic Path Segments**: Many APIs use dynamic segments like /users/{id}/profile. Fuzzing these segments with common IDs (1, admin, self) can reveal additional endpoints.
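The dynamic-segment idea from the last bullet can be sketched as a small template expander; the helper name and default ID list are my own:

```python
import re

def expand_dynamic_segments(template, candidates=("1", "admin", "self")):
    """Substitute common IDs into templated paths like /users/{id}/profile."""
    if "{" not in template:
        return [template]
    # Replace every {...} placeholder with each candidate value
    return [re.sub(r"\{[^}]+\}", candidate, template) for candidate in candidates]

print(expand_dynamic_segments("/users/{id}/profile"))
# ['/users/1/profile', '/users/admin/profile', '/users/self/profile']
```

The expanded paths slot straight into the scanner's potential_paths set before the request loop runs.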
This automated approach provides a systematic way to uncover API attack surface, significantly reducing manual effort and increasing the likelihood of identifying exploitable unauthenticated endpoints.