Archive for April 2007
In powerpoint here
No tags
- We have a C# example of GeoIP working. It’s not implemented yet, but it’s simple.
- I fixed the exception for “queue is empty”.
if (queueURLS.Count != 0)
{ uri = (MyUri)queueURLS.Dequeue(); }
- I made the crawler stop saving files to the hard disk because it probably gave us some overhead we could do without.
- Added some more inappropriate types of URLs to the filter.
- New record of 12,897 items saved in DB.
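A note on the “queue is empty” fix above: with many crawler threads running at once, another thread could empty the queue between the Count check and the Dequeue call, so the exception might still show up occasionally. The sketch below is one possible way to make the check and the removal a single step. UrlQueue and TryDequeue are hypothetical names, and it stores plain strings rather than the crawler's MyUri objects; it is not the crawler's actual code.

```csharp
using System.Collections.Generic;

// Hypothetical wrapper around the crawler's URL queue (an assumption,
// not the project's real class).
public sealed class UrlQueue
{
    private readonly Queue<string> _urls = new Queue<string>();
    private readonly object _gate = new object();

    public void Enqueue(string url)
    {
        lock (_gate) { _urls.Enqueue(url); }
    }

    // Checks for emptiness and dequeues under one lock, so no other thread
    // can drain the queue in between. Returns false instead of throwing
    // InvalidOperationException when there is nothing to take.
    public bool TryDequeue(out string url)
    {
        lock (_gate)
        {
            if (_urls.Count == 0)
            {
                url = null;
                return false;
            }
            url = _urls.Dequeue();
            return true;
        }
    }
}
```

A worker thread would call TryDequeue in its loop and sleep or exit when it returns false.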
No tags
I added a check for certain URLs that were causing the exception where a port was expected but the URL was something absurd like http://javascript. While debugging I found several others that need to be checked for:
- http://https
- http://\
- http://ymsgr (Yahoo’s messenger, I’m guessing)
- http://javascript
However, once or twice a URL with “javascript” in it still came through. I’m confused about how that could happen, since I’m checking the whole string.
I’m sure there are many more out there that I haven’t checked yet, but for now I don’t seem to be getting that exception much at all.
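Instead of comparing strings one bad pattern at a time, the check could lean on the Uri class itself. This is only a sketch under assumptions: UrlFilter and LooksCrawlable are made-up names, and the host list contains just the bogus hosts listed above. Uri.TryCreate returns false rather than throwing, and the host comparison ignores case, which may or may not be why a “javascript” link slipped past the string check.

```csharp
using System;

// Hypothetical helper (not the crawler's real code).
public static class UrlFilter
{
    // Bogus "hosts" taken from the malformed links listed in this post.
    private static readonly string[] BogusHosts = { "javascript", "https", "ymsgr" };

    public static bool LooksCrawlable(string candidate)
    {
        if (string.IsNullOrEmpty(candidate)) return false;

        Uri uri;
        // TryCreate reports failure by returning false instead of
        // throwing UriFormatException.
        if (!Uri.TryCreate(candidate.Trim(), UriKind.Absolute, out uri)) return false;

        // Only http and https links are worth fetching.
        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps) return false;

        // Reject hosts that are really leftover schemes, in any letter case.
        foreach (string bogus in BogusHosts)
        {
            if (string.Equals(uri.Host, bogus, StringComparison.OrdinalIgnoreCase)) return false;
        }
        return true;
    }
}
```

The crawler would call LooksCrawlable on each extracted link before adding it to the queue.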
I did a quick trial run. There were about 2,130 claimed URLs when I clicked stop. There are 2,063 in the database. Thankfully, only a small number were lost (about 67, or roughly 3%). This was with 100 threads.
No tags
A few exceptions are the only things holding our crawler back from running at full speed. I hope we can solve these problems quickly.
In Visual Studio I went to the Debug menu -> Exceptions. I checked all of the exceptions so we would catch every single one. We then stepped through the errors and recorded the line numbers of each error.
Exceptions:
- InvalidOperationException – line 1603 – This exception occurs because Dequeue (remove element) is being called on a queue that is empty. There is no check to see if the queue is empty. I made a quick fix by inserting an item into the queue so it wouldn’t give me a “Queue is empty” error. This won’t work in the long run though.
- UriFormatException – lines 2649, 2136, 2753 – “Invalid URI: A port was expected because there is a colon (:) present but the port could not be parsed.” The line of code uses the … Another line reports, “The format of the URI could not be determined.”
- SocketException – lines 1864, 2720, 2733 – one of the lines reports, “Closed by remote host”.
- IOException – line 2121 – caused by writing to a text file. It’s easy to just comment out this line, since it’s for a feature that’s unnecessary.
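Until each of these is fixed at its source, one option is to wrap every crawl step so that a single bad URL or dropped connection gets logged rather than killing the worker thread. The sketch below shows the idea. CrawlGuard, Run and the returned labels are invented for illustration and are not part of the crawler.

```csharp
using System;
using System.IO;
using System.Net.Sockets;

// Hypothetical guard around one unit of crawler work (an assumption,
// not the project's real code).
public static class CrawlGuard
{
    // Runs the step and returns a short label for what happened, so the
    // caller can log it and move on to the next URL.
    public static string Run(Action step)
    {
        try
        {
            step();
            return "ok";
        }
        catch (UriFormatException) { return "bad-uri"; }                    // malformed link
        catch (SocketException) { return "socket"; }                        // e.g. closed by remote host
        catch (IOException) { return "io"; }                                // e.g. file write failed
        catch (InvalidOperationException) { return "invalid-operation"; }   // e.g. Dequeue on an empty queue
    }
}
```

Any exception type not listed still propagates, so unexpected errors stay visible in the debugger.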
No tags
The Internal Revenue Service has been trying for years to upgrade its antiquated mainframe computers, which process Americans’ tax returns by churning through millions of lines of assembly code written by hand in the early 1960s.
But after more than 20 years and over $5 billion, there’s still no end in sight. Not all computer systems can talk to each other, information isn’t available in real time, and tax returns filed on paper are often manually entered by typists.
No tags
We drove in on a dirt road created by a dozer. There were heavy tree fuels and a Haines Index of 5 (6 being the highest possible). We worked for a while, then walked into a large clearing. A dozer had created a very large dirt safety zone in the form of a cross (+). We waited here for a while and walked up to the edge of the fireline, not far to our west. Eventually the fire blew up much more than expected and started to become overwhelming. We were immediately told that we were going to create a fireline to help stop it. It became so intense that fire retardant planes were called in. We sat in the shade a little way off to eat some lunch. The planes came in over us and …
This was the day when some lunches had bad lunch meat and people got sick from eating them. At least one person ate their sandwich before lunch and it turned out badly for them.
Anyways, the retardant plane came in over us and we ducked to try to avoid the drop landing on us. The fire blew up quickly into a smoke cloud that covered most of the sky in one direction. We went back to the clearing with the safety zone.
We stayed in the safety zone for a while and were eventually told that the road we came in on wasn’t safe to exit on because the fire had actually crossed it. Because two bad things had happened at once, a sick person and being surrounded, a chopper was called in to airlift us to safety.
No tags
Kevin Garrad of the 3rd Infantry Division looks to have gotten a little assistance from an unexpected source while on a street patrol in Iraq recently, when the iPod in his pocket got in the path of a bullet fired at close range, slowing it down enough that it didn’t pierce his body armor. As if that wasn’t a rare enough occurrence, the iPod in question was an HP iPod — imagine the odds!
http://www.flickr.com/photos/tiki/445618364/in/pool-appleusers/
No tags
The crawler is now able to update the database while crawling web pages.
Sample command to insert entry in database:
MySqlCommand acmd = new MySqlCommand("INSERT INTO url_history (id, name, domain, js_annoyances, js_events, html_annoyances, unsolicited_popups) VALUES ('2', 'teststring', 'teststring', true, true, true, true)", conn);
An error occurs when the id or the name matches a row that is already in the table. In a real environment this would most likely not occur, because the id will be incremented and no two URLs will be the same.
No tags
Regular expression test page
The following page contains examples of all of the regular expressions our web crawler tests. This will be a valuable resource for testing how well our web-crawler hits the mark on matching privacy items.
http://oregonstate.edu/~lundquja/regExTest.html
No tags