Project Series: Tinder for your Craigslist apartment hunt

Part I: A twist on a Slack bot craigslist tutorial & Script + App


This is a project in collaboration with SmellsLikeML. Many thanks to Dataquest for the starter scripts that gave this project a head start.

The Bay Area housing market is a hot mess, and every year it is a scramble to find a spot that is affordable, close to campus, and not a closet. We took inspiration from Dataquest's blog post on scraping Craigslist posts. Instead of building a Slack bot at the end, we decided to make an application to quickly iterate through the posts and make the application process faster. By extracting and storing the data from the Craigslist ads, we can try building a recommender system (if we get enough data). For now, we want to focus on collecting and storing data on the posts that pique our interest.

The Script

We'll be making plenty of use of AWS services for this project. Go ahead and make a DynamoDB table with a partition key called id (string). Also, create an S3 bucket in your preferred region. I called both my bucket and my DynamoDB table 'apthunt'. If all of this sounds like Klingon, the AWS getting-started tutorials are a good place to get acquainted.
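
If you'd rather set this up from code than from the console, here's a minimal boto3 sketch. It assumes the name 'apthunt' and a region of your choosing; the console works just as well:

import boto3

# Create the DynamoDB table with 'id' (string) as the partition key
dynamodb = boto3.resource('dynamodb')
table = dynamodb.create_table(
    TableName='apthunt',
    KeySchema=[{'AttributeName': 'id', 'KeyType': 'HASH'}],
    AttributeDefinitions=[{'AttributeName': 'id', 'AttributeType': 'S'}],
    BillingMode='PAY_PER_REQUEST',
)
table.wait_until_exists()

# Create the S3 bucket (omit CreateBucketConfiguration if your region is us-east-1)
s3 = boto3.client('s3')
s3.create_bucket(Bucket='apthunt',
                 CreateBucketConfiguration={'LocationConstraint': 'us-west-1'})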

Let's start with a script that will scrape Craigslist throughout the day and store data. We used the python-craigslist API to comb through Craigslist posts with some basic filters.


# python-craigslist wrapper around the Craigslist search pages
from craigslist import CraigslistHousing

cl_h = CraigslistHousing(site='sfbay', area='eby', category='apa',
                         filters={'max_price': 2000, 'min_price': 1000, 'dogs_ok': True})

# Pull the 200 newest geotagged listings that match the filters
for result in cl_h.get_results(sort_by='newest', geotagged=True, limit=200):
    print(result)
Go ahead and play some more with the API; there are many other filters you can choose from. Like in Dataquest's post, after using the API we filtered the results down by area and neighborhood. If a post made it through one of these two filters, it would go on to have more data extracted, as sketched below.
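
Here's a minimal sketch of what such a filter might look like. The NEIGHBORHOODS list and BOXES bounding boxes are illustrative names (not from the repo), while 'where' and 'geotag' are keys on the python-craigslist result dict:

# Hypothetical filter config: neighborhood names to match and lat/lng bounding boxes
NEIGHBORHOODS = ['berkeley', 'rockridge', 'albany']
BOXES = {'north berkeley': [[37.87, -122.30], [37.90, -122.26]]}  # [[min_lat, min_lng], [max_lat, max_lng]]

def in_box(geotag, box):
    # True if the post's (lat, lng) falls inside the bounding box
    return box[0][0] < geotag[0] < box[1][0] and box[0][1] < geotag[1] < box[1][1]

def relevant(result):
    # Keep a post if its geotag lands in one of our boxes
    # or its 'where' field mentions a neighborhood we care about
    geotag = result.get('geotag')
    where = (result.get('where') or '').lower()
    area_hit = bool(geotag) and any(in_box(geotag, box) for box in BOXES.values())
    hood_hit = any(hood in where for hood in NEIGHBORHOODS)
    return area_hit or hood_hit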

We wanted to extract a few more things to make our decision easier. Since we were looking at multiple areas within the East Bay and SF, we wanted to see what our transportation options would be. We added proximity to a BART station (which we lovingly refer to as the BARF). With a simple for loop, we iterate through all the stations until we find the closest one and return its station name and distance from the apartment. Since we also have a car, we computed the drive times to the two locations we would commute to, UC Berkeley and Redwood City, from the location of the post if it has a geotag.


# Finding the closest BART (BARF) station
from geopy.distance import vincenty   # geodesic distance between two (lat, lng) points; newer geopy calls this geodesic

if geotag:
    min_dist = 100.
    # TRANSIT_STATIONS is a dict mapping station names to (lat, lng) coordinates,
    # loaded from a CSV of station locations
    for station, coords in TRANSIT_STATIONS.items():
        BART_dist = vincenty(coords, geotag).miles
        if BART_dist < min_dist:
            min_dist = BART_dist
            close_stn = station


# Calculating driving distance
import json
from urllib.request import urlopen

dest_coord_UCB = '37.8719034,-122.2607286'
dest_coord_redwood = '37.4997475,-122.2970704'

def drive_times(geotag, destination):
    # geotag is a (lat, lng) tuple of floats; destination is a "lat,lng" string
    origin = ','.join(str(coord) for coord in geotag)
    # Note: Google now requires an API key on this endpoint, so you may need to
    # append "&key=" + GOOGLE_MAPS_KEY (the same key used for Street View below)
    gmaps_url = ("http://maps.googleapis.com/maps/api/distancematrix/json"
                 "?origins={0}&destinations={1}&mode=driving&language=en-EN&sensor=false".format(origin, destination))
    res_dest = json.load(urlopen(gmaps_url))
    driving_time = res_dest['rows'][0]['elements'][0]['duration']['text']
    return driving_time
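
Calling it for both destinations is then just two lines (the variable names here are illustrative):

if geotag:
    ucb_time = drive_times(geotag, dest_coord_UCB)
    redwood_time = drive_times(geotag, dest_coord_redwood)
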
If the post has a geotag, we can also get a peek at what the neighborhood looks like by generating a Street View image of the coordinates. We use the Street View Static API like so:

import os
from urllib.request import urlretrieve

SaveLoc = "/path/to/save/location/"
# key for the Google Maps / Street View Static API, imported from a config file
key = "&key=" + GOOGLE_MAPS_KEY

def GetStreet(Add, SaveLoc, name):
    # Save a Street View image of the post's location to SaveLoc/name
    base = "https://maps.googleapis.com/maps/api/streetview?size=1200x800&location="
    Add = ','.join(str(c) for c in Add)   # pass the (lat, lng) geotag instead of a street address
    MyUrl = base + Add + key
    urlretrieve(MyUrl, os.path.join(SaveLoc, name))
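
For a given post, the call looks something like this (the filename is just illustrative):

if geotag:
    GetStreet(geotag, SaveLoc, "{}.jpg".format(result['id']))
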
The text from a post is also a great place to find more information that could help in finding that awesome spot. Some of the things we thought could be useful are the number of bedrooms, what kind of space it is (apartment, condo, studio, #-plex, house?), and whether it requires a pet deposit, among others. Simple regex and checking whether a string is present let us transform unstructured text into new features we can learn on later.

# Some examples of using regex to extract more information
import re
from decimal import Decimal

# How many bedrooms? (empties() is a small helper defined in the repo's helper_func.py)
bdrm_match = re.search(r'\d+ (bdrm|bed|room)', text)
bdrm_lst = empties(bdrm_match.group()[0] if bdrm_match else None)

# Pet rent/deposit?
pet_match = re.search(r'\d pet deposit|\d pet rent', text)
pet_rent = empties(pet_match.group() if pet_match else None)

# Easy boolean feature to tell whether it is an apartment or not
apt = Decimal(int('apartment' in text))
text_features['apt'] = apt   # add this value to a dictionary of text features

There is also plenty of data to be extracted from the images featured in the posts. We'll store all the photos in S3 for easy access and storage. We could use boto3 to upload the images; however, that requires us to open and close each image individually, which is really slow. Instead, we'll use the aws s3 sync command, which is much faster.


import subprocess

# Sync this post's locally downloaded photos up to the S3 bucket (requires the AWS CLI)
subprocess.call("aws s3 sync ./%s s3://mybucket/ --acl public-read" % result['id'], shell=True)
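
For comparison, the boto3 route we decided against would look roughly like this — a sketch assuming the post's photos were already downloaded into a local folder named after its id:

import os
import boto3

s3 = boto3.client('s3')
local_dir = './{}'.format(result['id'])
for fname in os.listdir(local_dir):
    # One request per image, each file opened and sent individually --
    # noticeably slower than aws s3 sync when a post has many photos
    s3.upload_file(os.path.join(local_dir, fname), 'mybucket', fname,
                   ExtraArgs={'ACL': 'public-read'})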

Finally, we need to store all of the data we have extracted from relevant posts. We'll use DynamoDB, a NoSQL database. DynamoDB expects a dictionary of attributes for each item we write, so we'll construct one for each post we want to save. Our table uses the post id as its partition key, so we make sure to include that.


import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('MyTable')   # the table created earlier (e.g. 'apthunt')
table.put_item(Item={'id': 'postid', 'your': 'constructed', 'dictionary': 'here'})
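
In practice, the item is just the features extracted above, with numeric values kept as Decimal since boto3's DynamoDB resource rejects Python floats. A sketch of how it might be assembled — the exact fields depend on what you choose to extract:

from decimal import Decimal

item = {
    'id': result['id'],                      # partition key
    'price': result['price'],
    'where': result['where'],
    'bart_station': close_stn,
    'bart_dist': Decimal(str(min_dist)),     # floats go in as Decimal
    'ucb_drive': drive_times(geotag, dest_coord_UCB),
}
item.update(text_features)                   # regex-derived features from above
table.put_item(Item=item)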

To put it all together

You'll find the entire code on the project's GitHub, inside the scraper folder.

Getting it running

Run git clone https://github.com/mayorquinmachines/AptHunt.git in the terminal, then go into the project's directory with cd AptHunt and cd scraper to get to the scraper scripts.

You'll want to configure the scraper to filter on the information you're interested in, so open the config file with vim config.py. Some of the filters are already filled in, but you can change any of them to your liking. Fill in the other variables as needed (i.e., your DynamoDB table name, S3 bucket name, etc.).
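
If you're unsure what belongs in there, it's roughly a handful of constants along these lines — the variable names below are illustrative, so match whatever the repo's config.py actually defines:

# config.py -- illustrative layout; keep the names the repo's scripts expect
FILTERS = {'max_price': 2000, 'min_price': 1000, 'dogs_ok': True}
SITE = 'sfbay'
AREA = 'eby'
DYNAMODB_TABLE = 'apthunt'
S3_BUCKET = 'apthunt'
GOOGLE_MAPS_KEY = 'your-google-maps-api-key'
SAVE_LOC = '/path/to/save/location/'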

The helper_func.py file in that directory contains all of the helper functions used to extract data from posts and filter relevant posts. You can add more functions to filter on other criteria and reference them in scraper.py — for example:
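
A new text-based feature is just another small function (a hypothetical one below) that you can then call from scraper.py:

import re
from decimal import Decimal

# Hypothetical extra helper: flag posts that mention laundry in the text
def has_laundry(text):
    return Decimal(int(bool(re.search(r'washer|dryer|laundry', text.lower()))))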

To run the scraper, run python scraper.py. You should start seeing posts trickle into your DynamoDB table. Check out Part 2 for the front end of the application.