I essentially want to spider my local site and create a list of all the titles and URLs as in:
http://localhost/mySite/Default.aspx      My Home Page
http://localhost/mySite/Preferences.aspx  My Preferences
http://localhost/mySite/Messages.aspx     Messages
I'm running Windows. I'm open to anything that works: a C# console app, PowerShell, some existing tool, etc. We can assume that the <title> tag does exist in the document.
Note: I need to actually spider the files since the title may be set in code rather than markup.
-
Ok, I'm not familiar with Windows, but to point you in the right direction: use an XSLT transformation with
<xsl:value-of select="/html/head/title" /> in it to get the title back, or, if you can, just use the XPath '/html/head/title' directly.
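For example, here's a minimal sketch of that XPath approach in Python, assuming lxml is installed; 'Default.aspx' here is just an illustrative name for a saved copy of a page:

from lxml import html

# parse the page and evaluate the XPath against it
doc = html.parse('Default.aspx')
titles = doc.xpath('/html/head/title/text()')
print(titles[0].strip() if titles else '')

As the question notes, though, any static approach like this only finds titles that are present in the markup, not ones set in code.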
-
A quick and dirty Cygwin Bash script which does the job:
#!/bin/bash
for file in $(find $WWWROOT -iname \*.aspx); do
    echo -en "$file" '\t'
    # join the file into one line so <title> and </title> can't be
    # split across lines, then keep only the text between the tags
    tr '\n' ' ' < "$file" | sed 's/.*<title>\([^<]*\)<\/title>.*/\1/'
    echo
done

Explanation: this finds every .aspx file under the root directory $WWWROOT, replaces all newlines with spaces so that there are no newlines between the <title> and </title>, and then grabs out the text between those tags.

Larsenal : This doesn't seem to quite work. What am I doing wrong?
I think a script similar to what Adam Rosenfield suggested is what you want, but if you want the actual URLs, try using wget. With the appropriate options, it will print out a list of all the pages on your site (plus download them, which you may be able to suppress with --spider). The wget program is available through the normal Cygwin installer.

Dustin : Yeah, that is what I was trying to get working to post here! Here's a snippet:

site=mysite.com
wget --recursive --accept \*.html http://$site
for file in $( find $site -name '*.html' ); do
    # adam's for-body
done
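To sketch what that for-body could do once the pages are on disk, here is a Python version that walks the mirror directory wget created and prints each file with its title (the 'mysite.com' directory name comes from the snippet above; the stdlib HTMLParser is assumed to be adequate for these pages):

import os
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text between <title> and </title>."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''
    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.title += data

# walk the mirror that wget produced and print file <tab> title
for root, dirs, files in os.walk('mysite.com'):
    for name in files:
        if not name.endswith('.html'):
            continue
        path = os.path.join(root, name)
        parser = TitleParser()
        with open(path, encoding='utf-8', errors='replace') as f:
            parser.feed(f.read())
        print(path + '\t' + parser.title.strip())

-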
I would use wget as detailed above. Be sure you don't have any spider traps on your site (pages whose links generate an endless chain of further pages, which will keep a recursive crawler going forever).
-
You should consider using the Scrapy shell.
Check out
http://doc.scrapy.org/intro/tutorial.html
In the console, put something like this:
hxs.x('/html/head/title/text()').extract()
If you want all the titles, you should write a spider... it's really easy.
Also consider moving to Linux :P
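If you do write a spider, a minimal sketch might look like the following. Note this uses the newer scrapy.Spider API rather than the older shell selector shown above, and the start URL is only illustrative:

import scrapy

class TitleSpider(scrapy.Spider):
    name = 'titles'
    allowed_domains = ['localhost']
    start_urls = ['http://localhost/mySite/Default.aspx']

    def parse(self, response):
        # emit this page's URL and title
        title = response.xpath('/html/head/title/text()').get(default='')
        yield {'url': response.url, 'title': title.strip()}
        # queue up every link on the page; Scrapy skips duplicates
        for href in response.xpath('//a/@href').getall():
            yield response.follow(href, callback=self.parse)

You could then run it with something like scrapy runspider title_spider.py -o titles.csv.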