Google Index Retriever
24/04/15. Version 0.1
Have you ever found a webpage that seems to cover exactly what you need, only to discover that it has been removed? Google cache is the usual answer, but what if the cached copy has been removed too? What if the only trace of the site is its entry in the Google index? You can't get the whole webpage back, but you know it was there. Google Index Retriever tries to retrieve that index text from Google, so you can recover part of the text, and maybe the removed content you need.
Google cache does not last forever. From time to time, cached copies are removed for good. Archive.org and its Wayback Machine do not take as many snapshots of less popular pages, so there are situations where the only part of a webpage that is left is in the Google index.
The Google index, as the term is used here, is the short piece of text that the Google search engine shows for each result on the results page, with the words matching the search in bold. It is the last part of a webpage to disappear, so there will be situations where it is the only part left. Google keeps different “indexes” for the same webpage so, if they could all be put together, the text could be reconstructed and the removed content might come up. But that is not the only situation where the tool may be useful. What if the index contains passwords, credit card numbers or any other sensitive information? In fact, that was one of the reasons for creating the tool: to demonstrate that removing a webpage and its cache is not enough to get rid of offensive or sensitive content. The content may still be reachable.
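The idea of putting several index fragments together can be illustrated with a small sketch. This is not the tool's actual code: the class name `SnippetMerger`, its methods and the `minOverlap` threshold are hypothetical, and the sketch assumes that two fragments belong together when the end of one repeats the beginning of the other.

```java
import java.util.Optional;

final class SnippetMerger {

    /** Length of the longest suffix of left that is also a prefix of right. */
    static int overlap(String left, String right) {
        int max = Math.min(left.length(), right.length());
        for (int len = max; len > 0; len--) {
            if (left.regionMatches(left.length() - len, right, 0, len)) {
                return len;
            }
        }
        return 0;
    }

    /**
     * Joins two fragments when they share at least minOverlap characters
     * (assumed threshold); returns an empty Optional otherwise.
     */
    static Optional<String> merge(String left, String right, int minOverlap) {
        int len = overlap(left, right);
        if (len < minOverlap) {
            return Optional.empty();
        }
        return Optional.of(left + right.substring(len));
    }
}
```

For example, merging "the admin password is" with "password is hunter2 for" joins them on the shared "password is". A minimum overlap of a few characters avoids gluing together fragments that only share a letter or two by chance.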
This is all explained in this presentation: «No me indexes que me cacheo».
How does the tool work?
It is very easy. The tool is fed with a Google search that produces an index result. It then tries to retrieve as much text as possible by brute forcing the Google search (“stimulating” it). It offers the following options:
- One Shot button: It searches just once with the information provided. Use it to make the search string as specific as you can before hitting the Start button.
- Start button: It starts searching in automatic mode. The result box displays the time elapsed since the search started, the word that made the information come up and, finally, the sentence if it differs from the last one, so the user can reconstruct the webpage.
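What the result box shows in automatic mode could be modelled roughly like this. It is only a sketch of one reading of the description above (time and word on every line, sentence only when it changes); `ResultLog`, `record` and the line layout are made-up names and choices, not the tool's real output format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class ResultLog {
    private final List<String> lines = new ArrayList<>();
    private String lastSentence = null;

    /**
     * Adds one line with the elapsed time and the word that triggered the
     * result; the sentence is appended only if it differs from the last one.
     */
    String record(long elapsedSeconds, String triggerWord, String sentence) {
        String line = String.format(Locale.ROOT, "[%02d:%02d] %s",
                elapsedSeconds / 60, elapsedSeconds % 60, triggerWord);
        if (!sentence.equals(lastSentence)) {
            line += " -> " + sentence;
            lastSentence = sentence;
        }
        lines.add(line);
        return line;
    }

    /** Returns a copy of everything logged so far. */
    List<String> lines() {
        return new ArrayList<>(lines);
    }
}
```

Skipping repeated sentences keeps the log short, since the same text tends to come back for many different words.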
The logic to try to “stimulate” the index and get back the information is:
- First, it tries to stimulate the index with the words already found in the first index result “around” the main searched word, so it retrieves the surrounding sentences again and again.
- If there are no more results or “words around” left, the search is repeated with keywords provided by the user, like a “dictionary attack”. When this occurs, the progress bar changes its color.
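The two steps above can be sketched as a function that builds the list of searches to try, in order. The names (`StimulationSketch`, `wordsAround`, `candidateQueries`) and the way an extra word is simply appended to the base search are assumptions made for illustration; the sketch does not contact Google and is not taken from the tool's source.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StimulationSketch {

    /** Words of the index text other than the main word, in order, without repeats. */
    static List<String> wordsAround(String mainWord, String indexText) {
        Set<String> words = new LinkedHashSet<>();
        for (String token : indexText.split("[^\\p{L}\\p{N}]+")) {
            String word = token.toLowerCase(Locale.ROOT);
            if (!word.isEmpty() && !word.equalsIgnoreCase(mainWord)) {
                words.add(word);
            }
        }
        return new ArrayList<>(words);
    }

    /**
     * Step 1: one search per "word around" the main word.
     * Step 2: one search per user keyword not already tried (the "dictionary attack").
     */
    static List<String> candidateQueries(String baseQuery, String mainWord,
                                         String indexText, List<String> userKeywords) {
        Set<String> extras = new LinkedHashSet<>(wordsAround(mainWord, indexText));
        for (String keyword : userKeywords) {
            extras.add(keyword.toLowerCase(Locale.ROOT));
        }
        List<String> queries = new ArrayList<>();
        for (String extra : extras) {
            queries.add(baseQuery + " " + extra);
        }
        return queries;
    }
}
```

In a real run, each new result would feed more "words around" into the first step before falling back to the user keywords; the sketch only shows the ordering for a single index result.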
Google will, of course, show a CAPTCHA from time to time because of the continuous use of its service. This is perfectly OK. Google Index Retriever will capture the CAPTCHA so that it is easy to solve and keep going.
This tool may also be used to check whether a site has probably been compromised and injected with spam and black hat SEO. Attackers often compromise webpages and inject spam words into them so that they “steal” their PageRank. This content is not visible to visitors but only to Google's robot and spider, so it usually shows up in the index. The tool has a separate tab for this, which works exactly like the other one but with different logic:
- It directly searches for a different set of keywords (related to spam) in a Google index result.
This way, it is easier to know whether a webpage has been compromised and injected with SEO spam.
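A rough sketch of this spam check follows. It assumes the search is restricted to the site with Google's `site:` operator and that a result counts as suspicious when its index text contains one of the spam keywords; `SpamCheckSketch`, its methods and the sample keywords are hypothetical and not part of the tool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SpamCheckSketch {

    /** One search per spam keyword, limited to the given site. */
    static List<String> spamQueries(String site, List<String> spamKeywords) {
        List<String> queries = new ArrayList<>();
        for (String keyword : spamKeywords) {
            queries.add("site:" + site + " \"" + keyword + "\"");
        }
        return queries;
    }

    /** Spam keywords found in the index text (case-insensitive substring match). */
    static List<String> spamHits(String indexText, List<String> spamKeywords) {
        String haystack = indexText.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String keyword : spamKeywords) {
            if (haystack.contains(keyword.toLowerCase(Locale.ROOT))) {
                hits.add(keyword);
            }
        }
        return hits;
    }
}
```

A plain substring match is crude and may flag legitimate pages, so a hit is better treated as a hint to look closer than as proof of compromise.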
The program is written in Java, so it should work on any operating system and version. The results may be exported to an HTML document on the local computer. Keywords are completely customizable. They may be added individually or edited directly in a TXT file.
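As an illustration of how a keyword TXT file might be read, here is a minimal sketch. The file layout (one keyword per line, UTF-8, blank lines ignored) is an assumption, since the format is not described here, and `KeywordFileSketch` is a made-up name.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class KeywordFileSketch {

    /** Trims each line, drops blank lines and repeated keywords, keeps the order. */
    static List<String> parse(List<String> lines) {
        Set<String> keywords = new LinkedHashSet<>();
        for (String line : lines) {
            String keyword = line.trim();
            if (!keyword.isEmpty()) {
                keywords.add(keyword);
            }
        }
        return new ArrayList<>(keywords);
    }

    /** Reads the keyword file (assumed: one keyword per line, UTF-8). */
    static List<String> load(Path file) throws IOException {
        return parse(Files.readAllLines(file, StandardCharsets.UTF_8));
    }
}
```

Keeping the parsing separate from the file access makes it easy to reuse the same clean-up for keywords that are added individually.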