Content Discovery - THM Walkthrough

Content Discovery Techniques and Their Importance in Web Application Security

What Is Content Discovery?

Content discovery in the context of web application security involves identifying hidden or non-obvious content on a website that was not intended for public access. This content can include files, videos, pictures, backup files, configuration files, administrative portals, older versions of the website, and more. Discovering such content is crucial for penetration testers to identify potential vulnerabilities and security risks.

Content discovery can be performed through three main methods:

  1. Manual Discovery
  2. Automated Discovery
  3. OSINT (Open-Source Intelligence)

Manual Discovery Techniques

Robots.txt

The robots.txt file tells search engine crawlers which pages they may and may not crawl. It is advisory only and does not restrict access, so it often gives penetration testers a ready-made list of directories and files that the website owner would rather keep out of search results.

Command:

curl http://MACHINE_IP/robots.txt

Example:

Disallow: /staff-portal

This asks crawlers to stay out of /staff-portal. The path itself may still be reachable in a browser, which makes it worth visiting.
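When a robots.txt file is long, it can help to pull out only the disallowed paths. The following is a minimal sketch, assuming a POSIX shell with tr and awk available; disallowed_paths is a hypothetical helper name, not part of any tool.

```shell
# Print every non-empty path named in a Disallow rule of a robots.txt
# document read from standard input.
# disallowed_paths is a hypothetical helper, not part of curl or any tool.
disallowed_paths() {
    tr -d '\r' | awk '
        {
            line = $0
            sub(/[#].*/, "", line)            # drop trailing comments
        }
        tolower(line) ~ /^[ \t]*disallow[ \t]*:/ {
            sub(/^[^:]*:[ \t]*/, "", line)    # strip the "Disallow:" prefix
            sub(/[ \t]+$/, "", line)          # strip trailing whitespace
            if (line != "") print line
        }'
}
```

A possible use is curl -s http://MACHINE_IP/robots.txt | disallowed_paths, which would print one path per line.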

Favicon

A favicon is the small icon displayed in the browser's address bar or tab. If the developer leaves a framework's default favicon in place, the icon can give clues about the technologies in use.

Command to get MD5 hash of the favicon:

curl https://static-labs.tryhackme.cloud/sites/favicon/images/favicon.ico | md5sum

Using the OWASP favicon database, the MD5 hash can be matched to identify the framework.
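The matching step can also be done against a local copy of known hashes. This is a minimal sketch, assuming md5sum is installed and assuming a hypothetical text file with one "digest name" pair per line; that file format and the function names are illustrative and are not the OWASP database's own format.

```shell
# Print only the MD5 digest (lowercase hex) of the data on standard input.
md5_hex() {
    md5sum | awk '{ print $1 }'
}

# Hash the favicon on standard input and print the name recorded for that
# digest in a local table. Returns non-zero when the digest is not listed.
# $1: path to a hypothetical table with lines of the form "<md5> <name>"
favicon_lookup() {
    digest=$(md5_hex)
    awk -v h="$digest" '
        ($1 "") == (h "") {
            sub(/^[^ \t]+[ \t]+/, "")   # drop the digest column, keep the name
            print
            found = 1
        }
        END { exit !found }' "$1"
}
```

For example, curl -s followed by the favicon URL and piped into favicon_lookup hashes.txt would print the framework name if the hash is listed in hashes.txt, which is a hypothetical file name.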

Sitemap.xml

The sitemap.xml file provides a list of URLs that the website owner wants to be indexed by search engines. It can reveal hidden or obscure pages.

Command:

curl http://MACHINE_IP/sitemap.xml

Example:

<url><loc>http://example.com/s3cr3t-area</loc></url>

This indicates a secret area /s3cr3t-area on the website.
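To list every URL in a larger sitemap, the loc entries can be extracted with standard text tools. This is a minimal sketch, assuming each loc element sits on a single line; sitemap_urls is a hypothetical helper name, and a real XML parser would be more robust.

```shell
# Print the URL inside every <loc>...</loc> element read from standard input.
# Assumes each <loc> element opens and closes on the same line.
sitemap_urls() {
    tr -d '\r' |
    grep -o '<loc>[^<]*</loc>' |
    sed -e 's/<[^>]*>//g' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}
```

A possible use is curl -s http://MACHINE_IP/sitemap.xml | sitemap_urls.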

HTTP Headers

HTTP response headers can reveal the web server software and sometimes the programming language or framework behind it, for example through the Server and X-Powered-By headers.

Command:

curl http://MACHINE_IP -v
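Verbose output is noisy, so it can be useful to keep only the headers that usually name software. The sketch below assumes that Server and X-Powered-By are the headers of interest; tech_headers is a hypothetical helper name, and it accepts both plain header dumps and the "< " prefixed lines that curl -v prints.

```shell
# Print only the Server and X-Powered-By headers from an HTTP response
# header dump read from standard input. A leading "< " (as printed by
# curl -v) is removed if present.
tech_headers() {
    tr -d '\r' |
    awk 'tolower($0) ~ /^(< )?(server|x-powered-by):/ {
             sub(/^< /, "")
             print
         }'
}
```

A possible use is curl -sI http://MACHINE_IP | tech_headers, where -I requests the headers only.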

Framework Stack

Examining page source comments or other elements can reveal the framework used by a website. This information can be used to find specific vulnerabilities associated with that framework.

Example:

<!-- Page generated by THM Web Framework 1.0 -->

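This comment names both the framework and its version (THM Web Framework 1.0), which can then be looked up for documentation and known vulnerabilities.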
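Comments like this are easy to miss in a long page, so they can be filtered out of the source. This is a minimal sketch, assuming each comment opens and closes on one line; html_comments is a hypothetical helper name.

```shell
# Print the text of every single-line HTML comment read from standard input.
html_comments() {
    tr -d '\r' | awk '
        {
            rest = $0
            while ((start = index(rest, "<!--")) > 0) {
                rest = substr(rest, start + 4)
                stop = index(rest, "-->")
                if (stop == 0) break      # comment continues on a later line; skipped
                text = substr(rest, 1, stop - 1)
                gsub(/^[ \t]+|[ \t]+$/, "", text)
                print text
                rest = substr(rest, stop + 3)
            }
        }'
}
```

A possible use is curl -s http://MACHINE_IP | html_comments.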

OSINT Techniques

Google Hacking / Dorking

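Google hacking, or dorking, uses Google's advanced search operators to pull specific information about a website out of the search index.

Examples:

site:example.com - returns results only from the specified website
inurl:admin - returns results that have the specified word in the URL
filetype:pdf - returns results that are a particular file type
intitle:admin - returns results that have the specified word in the page title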

Wappalyzer

Wappalyzer is an online tool and browser extension that identifies the technologies a website uses, such as frameworks, content management systems (CMS), and payment processors.

Website:

https://www.wappalyzer.com/

Wayback Machine

The Wayback Machine is an archive service that stores historical snapshots of web pages, which makes it useful for uncovering old pages that may still be live on the current site.

Website:

https://archive.org/web/

GitHub

Searching public repositories on GitHub can reveal source code, credentials, and other sensitive information related to a target website.

Example Search:

site:github.com "example.com"

S3 Buckets

Amazon S3 buckets can sometimes be misconfigured to allow public access. The URL format is typically http(s)://{name}.s3.amazonaws.com, where {name} is chosen by the bucket owner.
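Because the bucket name is free-form, one common approach is to guess names derived from the company name. The sketch below assumes the suffixes assets, www, public, and private as typical guesses; these suffixes and the s3_candidates helper name are assumptions for illustration and are not taken from the text above.

```shell
# Print candidate S3 bucket URLs for a base name.
# $1: base name, for example a company name
# The suffix list is an assumed set of common guesses, not a fixed standard.
s3_candidates() {
    name=$1
    printf 'https://%s.s3.amazonaws.com\n' "$name"
    for suffix in assets www public private; do
        printf 'https://%s-%s.s3.amazonaws.com\n' "$name" "$suffix"
    done
}
```

Each printed URL can then be requested with curl to see whether a bucket exists and whether it is publicly readable.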

Automated Discovery Techniques

Using Tools with Wordlists

Automated discovery involves using tools that make multiple requests to a web server to find hidden content. These tools often use wordlists to guess common file and directory names.

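Example Tools:

ffuf, dirb, and gobuster are widely used for this kind of wordlist-based discovery.

Common Discoveries:

Typical finds include administrative portals, backup files, configuration files, and older versions of the website.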
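The wordlist approach described above can be sketched in a few lines of shell. This is a simplified illustration of the idea, not a replacement for a dedicated tool; candidate_urls and probe_urls are hypothetical helper names, and treating every status other than 404 as interesting is an assumption.

```shell
# Print one candidate URL per wordlist entry.
# $1: base URL, for example http://MACHINE_IP
# $2: path to a wordlist file with one name per line
candidate_urls() {
    base=${1%/}                              # drop one trailing slash, if any
    tr -d '\r' < "$2" | awk -v base="$base" '
        /^[ \t]*$/ || /^#/ { next }          # skip blank lines and comments
        { sub(/^\//, ""); print base "/" $0 }'
}

# Request each URL read from standard input and print the status code and
# URL for every response that is not a 404 (an assumed filter).
probe_urls() {
    while read -r url; do
        code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
        if [ "$code" != "404" ]; then
            printf '%s %s\n' "$code" "$url"
        fi
    done
}
```

A possible use is candidate_urls http://MACHINE_IP wordlist.txt | probe_urls, where wordlist.txt is a hypothetical file name. Dedicated tools add concurrency, recursion, and response filtering on top of this basic loop.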

Importance of Content Discovery

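Content discovery is essential for identifying potential security vulnerabilities in a web application. By uncovering hidden or restricted content, penetration testers can locate exposed backup and configuration files, find administrative portals, and identify the frameworks and software versions in use.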

Effective content discovery strengthens the security posture of a web application, because hidden weaknesses can be found and fixed before malicious actors exploit them.

---- Satvik's Hacking Garden