Archive for Programming

3 Ways Artificial Intelligence Is Changing Medicine

By India Bottomley

From smart speakers in the operating room to virtual diagnosis and treatment plans, The AEDITION looks at a few of the ways artificial intelligence is shaking up the medical industry.

We may not be at the point where you overhear your surgeon saying, “Hey, Google, pass the scalpel,” but artificial intelligence (AI) is gradually making its way into the healthcare industry and, by extension, dermatology and plastic surgery practices, too. Even in its limited use, AI is already helping providers offer their patients better care — whether it’s pre-op, in the OR, or during the recovery process. Here are three ways artificial intelligence is shaking up medicine.

1. AI Patient Care

Your experience with a medical practice starts as soon as you look for information online. You might have questions for the practitioner or want to book an appointment. In the past, you would have emailed or called the practice, but you may now find yourself speaking to an AI assistant on the practice’s website. Going forward, the carefully programmed software could even help patients with aftercare following surgery.

New York City-based, board-certified plastic surgeon Philip Miller, MD, an expert in rhinoplasty, is at the forefront of the integration of AI in aesthetic practices. Dr. Miller developed an AI chatbot called Aestheti.Bot that can instantly answer common patient questions. While his interface works by having patients text it directly, other chatbots are available via Facebook Messenger and can be integrated with Amazon smart speakers. While the chatbot does not replace medical advice, it does offer patients instant answers to frequently asked questions – any time, anywhere.

There are already apps driven by artificial intelligence that help patients recover from orthopedic surgery, and the technology could be adapted for cosmetic procedures and plastic surgery. Such apps allow surgeons to tailor recovery programs for their patients and track their progress as they recover, while patients glean a better understanding of how best to take care of themselves.

But AI isn’t just tackling medical questions and care – it is also looking after patients’ and practitioners’ schedules. You may find that the next time you make an appointment for a procedure, an AI interface will remind you what aftercare protocol to follow when you get home and help you schedule your follow-up appointments accordingly.

2. A New Kind of Surgical Assistant

While artificial intelligence robots may not be ready to replace nurses and physician assistants, they are responsible for some basic tasks in the operating room. Dr. Miller, for one, uses a smart speaker to control the lighting and some machinery in his OR.

Surgeons are using smart speakers to set timers and keep track of information while operating. During treatment, doctors can ask the smart speaker to access medical information about a patient, which may allow them to make decisions without having to ‘break’ their sterile scrub. AI technology has also been developed to allow providers to ask for recommendations about things like implant usage based on patient data, and, while the software has not been created specifically for cosmetic procedures, it’s likely only a matter of time.

AI is also helping doctors provide more personalized care. Cleveland Clinic and IBM implemented a program that analyzes data from thousands of medical papers and other sources to create more efficient treatment plans for patients. There are also virtual nursing services that allow providers to digitally answer questions about minor concerns — without a phone call or in-office visits.

And let us not forget that as voice-to-text technology has improved, so too have the workflows in many doctors’ offices, with dictation simplifying everything from note taking to prescriptions.

3. Computer-Assisted Diagnosis

On the more mind-blowing end of the AI revolution, Google is developing a technology that it claims is able to diagnose skin concerns as accurately as a dermatologist.

Developers created a so-called “deep learning system” and taught it to recognize 26 common skin conditions that often lead to patients consulting a dermatologist. The system analyzes photos and basic data about the user, before suggesting a list of possible diagnoses — from most to least likely. In a similar vein, Stanford University tested an AI algorithm designed to detect skin cancers. When tested against practicing dermatologists, the tech performed as well as its human counterparts.

Additionally, there has been early testing to see if a deep learning system could help identify breast cancer metastasis. MIT, meanwhile, is leading research into AI-assisted medical imaging analysis, in which algorithms could help doctors in remote areas to analyze test results without telemedicine, and there is hope that the technology will reach the point that a simple smartphone photo of a rash or cut is enough to determine the type of care a patient needs.

The Takeaway

While we’re still a ways away from having robots greeting us at the doctor’s office and trusting computers with clinical decisions, AI is steadily making inroads in the healthcare industry. For now, look out for automated booking systems and AI-generated aftercare, but don’t be surprised in a few years’ time if there’s an AI application waiting to diagnose and prescribe treatment — without a trip to the doctor’s office.

This article was written by India Bottomley and originally posted elsewhere.


Web Scraping with Scrapy: Advanced Examples

By Zac Clancy

Table of Contents

  • Introduction to web scraping
  • Scrapy concepts
  • Reddit-less front page
  • Extracting Amazon price data
  • Considerations at scale

Introduction to web scraping

Web scraping is one of the tools at a developer’s disposal when looking to gather data from the internet. While consuming data via an API has become commonplace, most websites don’t have an API for delivering data to consumers. In order to access the data they’re looking for, web scrapers and crawlers read a website’s pages and feeds, analyzing the site’s structure and markup language for clues. Generally speaking, information collected from scraping is fed into other programs for validation, cleaning, and input into a datastore, or it’s fed into other processes such as natural language processing (NLP) toolchains or machine learning (ML) models. There are a few Python packages we could use to illustrate, but we’ll focus on Scrapy for these examples. Scrapy makes it very easy for us to quickly prototype and develop web scrapers with Python.

Scrapy concepts

Before we start looking at specific examples and use cases, let’s brush up a bit on Scrapy and how it works.

Spiders: Scrapy uses Spiders to define how a site (or a bunch of sites) should be scraped for information. Scrapy lets us determine how we want the spider to crawl, what information we want to extract, and how we can extract it. Specifically, Spiders are Python classes where we’ll put all of our custom logic and behavior.

import scrapy

class NewsSpider(scrapy.Spider):
	name = 'news'

Selectors: Selectors are Scrapy’s mechanism for finding data within a website’s pages. They’re called selectors because they provide an interface for “selecting” certain parts of the HTML page, and they can be written as either CSS or XPath expressions.

Items: Items are the data that is extracted from selectors in a common data model. Since our goal is a structured result from unstructured inputs, Scrapy provides an Item class which we can use to define how our scraped data should be structured and what fields it should have.

import scrapy

class Article(scrapy.Item):
	headline = scrapy.Field()

Reddit-less front page

Suppose we love the images posted to Reddit, but don’t want any of the comments or self posts. We can use Scrapy to make a Reddit Spider that will fetch all the photos from the front page and put them on our own HTML page which we can then browse instead of Reddit.

To start, we’ll create a RedditSpider which we can use to traverse the front page and handle custom behavior.

import scrapy

class RedditSpider(scrapy.Spider):
	name = 'reddit'
	start_urls = ['https://www.reddit.com/']

Above, we’ve defined a RedditSpider, inheriting Scrapy’s Spider. We’ve named it reddit and have populated the class’ start_urls attribute with a URL to Reddit from which we’ll extract the images.

At this point, we’ll need to begin defining our parsing logic. We need to figure out an expression that the RedditSpider can use to determine whether it’s found an image. If we look at Reddit’s robots.txt file, we can see that our spider can’t crawl any comment pages without being in violation of the robots.txt file, so we’ll need to grab our image URLs without following through to the comment pages.

By looking at Reddit, we can see that external links are included on the homepage directly next to the post’s title. We’ll update RedditSpider to include a parser to grab this URL. Reddit includes the external URL as a link on the page, so we should be able to just loop through the links on the page and find URLs that are for images.

class RedditSpider(scrapy.Spider):
    def parse(self, response):
        links = response.xpath('//a/@href')
        for link in links:

In a parse method on our RedditSpider class, I’ve started to define how we’ll be parsing our response for results. To start, we grab all of the href attributes from the page’s links using a basic XPath selector. Now that we’re enumerating the page’s links, we can start to analyze the links for images.

def parse(self, response):
    links = response.xpath('//a/@href')
    for link in links:
        # Extract the URL text from the element
        url = link.get()
        # Check if the URL contains an image extension
        if any(extension in url for extension in ['.jpg', '.gif', '.png']):

To actually access the text information from the link’s href attribute, we use Scrapy’s .get() function which will return the link destination as a string. Next, we check to see if the URL contains an image file extension. We use Python’s any() built-in function for this. This isn’t all-encompassing for all image file extensions, but it’s a start. From here we can push our images into a local HTML file for viewing.
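Since the extension check is plain Python, it can be tried outside of Scrapy entirely. Here's a minimal standalone sketch; the looks_like_image helper and the example URLs are ours, for illustration only:

```python
# Standalone sketch of the extension check used in the parse method
image_extensions = ['.jpg', '.gif', '.png']

def looks_like_image(url):
    """Return True if the URL appears to point at an image file."""
    return any(extension in url for extension in image_extensions)

print(looks_like_image('https://i.example.com/photo.jpg'))   # True
print(looks_like_image('https://example.com/comments/abc'))  # False
```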

def parse(self, response):
    links = response.xpath('//img/@src')
    html = ''

    for link in links:
        # Extract the URL text from the element
        url = link.get()
        # Check if the URL contains an image extension
        if any(extension in url for extension in ['.jpg', '.gif', '.png']):
            html += '''
            <a href="{url}" target="_blank">
                <img src="{url}" height="33%" width="33%" />
            </a>
            '''.format(url=url)

    # Open an HTML file and save the results; the `with` block
    # closes the file for us when it exits
    with open('frontpage.html', 'w') as page:
        page.write(html)

To start, we begin collecting the HTML file contents as a string which will be written to a file called frontpage.html at the end of the process. You’ll notice that instead of pulling the image location from '//a/@href', we’ve updated our links selector to use the image’s src attribute: '//img/@src'. This will give us more consistent results and select only images.

As our RedditSpider’s parser finds images, it builds a link with a preview image and dumps the string to our html variable. Once we’ve collected all of the images and generated the HTML, we open the local HTML file (or create it) and overwrite it with our new HTML content; the with block then closes the file for us. If we run scrapy runspider, we can see that this file is built properly and contains images from Reddit’s front page.

But, it looks like it contains all of the images from Reddit’s front page – not just user-posted content. Let’s update our parse method a bit to exclude certain domains from our results.

If we look at frontpage.html, we can see that most of Reddit’s own assets come from a couple of static-asset domains. We’ll just filter those results out and retain everything else. With these updates, our RedditSpider class now looks like the below:

import scrapy

class RedditSpider(scrapy.Spider):
    name = 'reddit'
    start_urls = ['https://www.reddit.com/']

    def parse(self, response):
        links = response.xpath('//img/@src')
        html = ''

        for link in links:
            # Extract the URL text from the element
            url = link.get()
            # Check if the URL contains an image extension and is
            # not served from one of Reddit's own asset domains
            if any(extension in url for extension in ['.jpg', '.gif', '.png'])\
               and not any(domain in url for domain in ['redditstatic.com', 'redditmedia.com']):
                html += '''
                <a href="{url}" target="_blank">
                    <img src="{url}" height="33%" width="33%" />
                </a>
                '''.format(url=url)

        # Open an HTML file and save the results; the `with` block
        # closes the file for us when it exits
        with open('frontpage.html', 'w') as page:
            page.write(html)

We’re simply adding our domain exclusion list to a second any() expression. These statements could be tweaked to read from a separate configuration file, local database, or cache, if need be.
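As a rough sketch of the configuration-file idea, the exclusion list could be loaded from a small JSON file at spider startup. Everything here (the file layout, key name, and domain names) is hypothetical:

```python
import json
import tempfile

# Hypothetical config: the domains to exclude live in a JSON file
# instead of being hard-coded in the spider
config = {'excluded_domains': ['assets.example.com', 'static.example.com']}

# Write a throwaway config file purely for demonstration
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as f:
    json.dump(config, f)
    config_path = f.name

# The spider would read the list once at startup
with open(config_path) as f:
    excluded = json.load(f)['excluded_domains']

url = 'https://assets.example.com/logo.png'
print(not any(domain in url for domain in excluded))  # False: url is excluded
```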

Extracting Amazon price data

If you’re running an ecommerce website, intelligence is key. With Scrapy we can easily automate the process of collecting information about our competitors, our market, or our listings.

For this task, we’ll extract pricing data from search listings on Amazon and use the results to provide some basic insights. If we visit Amazon’s search results page and inspect it, we notice that Amazon stores the price in a series of divs, most notably using a class called .a-offscreen. We can formulate a CSS selector that extracts the price off the page:

prices = response.css('.a-price .a-offscreen::text').getall()

With this CSS selector in mind, let’s build our AmazonSpider.

import scrapy

from re import sub
from decimal import Decimal

def convert_money(money):
    # Strip everything except digits and the decimal point
    return Decimal(sub(r'[^\d.]', '', money))

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    start_urls = [
        # an Amazon search results URL goes here
    ]

    def parse(self, response):
        # Find the Amazon price elements
        prices = response.css('.a-price .a-offscreen::text').getall()

        # Initialize some counters and stats objects
        stats = dict()
        values = []

        for price in prices:
            value = convert_money(price)
            values.append(value)

        # Sort our values before calculating
        values.sort()

        # Calculate price statistics
        stats['average_price'] = round(sum(values) / len(values), 2)
        stats['lowest_price'] = values[0]
        stats['highest_price'] = values[-1]
        stats['total_prices'] = len(values)

        yield stats

A few things to note about our AmazonSpider class:

convert_money(): This helper takes strings formatted like ‘$45.67’ and casts them to a Python Decimal type, which can be used for computations; it avoids locale issues by stripping everything except digits and the decimal point rather than looking for a ‘$’ in the regular expression.

getall(): The .getall() function is a Scrapy function that works similarly to the .get() function we used before, but returns all of the extracted values as a list, which we can work with.

Running the command scrapy runspider in the project folder will dump output resembling the following:

{'average_price': Decimal('38.23'), 'lowest_price': Decimal('3.63'), 'highest_price': Decimal('689.95'), 'total_prices': 58}
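The convert_money() helper can also be exercised on its own, without running the spider; the price strings below are made up:

```python
from decimal import Decimal
from re import sub

def convert_money(money):
    # Strip everything except digits and the decimal point,
    # then parse the remainder as a Decimal
    return Decimal(sub(r'[^\d.]', '', money))

print(convert_money('$45.67'))     # 45.67
print(convert_money('£1,234.50'))  # 1234.50 (thousands separator stripped too)
```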

It’s easy to imagine building a dashboard that allows you to store scraped values in a datastore and visualize data as you see fit.

Considerations at scale

As you build more web crawlers and continue to follow more advanced scraping workflows, you’ll likely notice a few things:

  1. Sites change, now more than ever.
  2. Getting consistent results across thousands of pages is tricky.
  3. Performance considerations can be crucial.

Sites change, now more than ever

On occasion, AliExpress, for example, will return a login page rather than search listings. Sometimes Amazon will decide to raise a captcha, or Twitter will return an error. While these errors can sometimes simply be transient flickers, others will require a complete re-architecture of your web scrapers. Nowadays, modern front-end frameworks are often pre-compiled for the browser, which can mangle class names and ID strings, and sometimes a designer or developer will change an HTML class name during a redesign. It’s important that our Scrapy crawlers are resilient, but keep in mind that changes will occur over time.

Getting consistent results across thousands of pages is tricky

Slight variations in user-inputted text can really add up. Think of all the different spellings and capitalizations you may encounter in usernames alone. Pre-processing, normalizing, and standardizing text before acting on it or storing the value is best practice ahead of most NLP or ML processes.
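A minimal sketch of such a pre-processing step might look like the following; the normalize_text helper is ours, not part of any particular library:

```python
import unicodedata

def normalize_text(value):
    """Normalize Unicode, trim whitespace, and fold case so that
    variant spellings collapse to one canonical form."""
    value = unicodedata.normalize('NFKC', value)
    return value.strip().casefold()

# Variations of the "same" username collapse to a single value
usernames = ['  JohnDoe ', 'johndoe', 'JOHNDOE']
canonical = {normalize_text(name) for name in usernames}
print(canonical)  # {'johndoe'}
```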

Performance considerations can be crucial

You’ll want to make sure you’re operating at least moderately efficiently before attempting to process 10,000 websites from your laptop one night. As your dataset grows, it becomes more and more costly to manipulate in terms of memory or processing power. In a similar regard, you may want to extract the text from one news article at a time, rather than downloading all 10,000 articles at once. As we’ve seen in this tutorial, performing advanced scraping operations is actually quite easy using Scrapy’s framework. Some advanced next steps might include loading selectors from a database and scraping using very generic Spider classes, or using proxies or modified user-agents to see if the HTML changes based on location or device type. Scraping in the real world becomes complicated because of all the edge cases; Scrapy provides an easy way to build this logic in Python.

This article originally appeared elsewhere.



The Pandas Library for Python

By Zachary Wilson

Table of Contents

  • Introduction to Pandas
  • About the Data
  • Setup
  • Loading Data
  • Basic Operations
  • The Dtype
  • Cleansing and Transforming Data
  • Performing Basic Analysis
    • Calculations
    • Booleans
    • Grouping
    • Plotting
  • Exporting Transformed Data
  • Final Notes

Introduction to Pandas

So, what is Pandas – practically speaking? In short, it’s the major data analysis library for Python. For scientists, students, and professional developers alike, Pandas represents a central reason for any learning or interaction with Python, as opposed to a statistics-specific language like R, or a proprietary academic package like SPSS or Matlab. (Fun fact – Pandas is named after the term Panel Data, and was originally created for the analysis of financial data tables). I like to think that the final “s” stands for Series or Statistics.

Although there are plenty of ways to explore numerical data with Python out of the box, these will universally involve fairly low-performance results and a ton of boilerplate. It may sound hard to believe, but Pandas is often recommended as the next stop for Excel users who are ready to take their data analysis to the next level. Nearly any problem that can be solved with a spreadsheet program can be solved in Pandas – without all the graphical cruft.

More importantly, because problems can be solved in Pandas via Python, solutions are already automated, or could be run as a service in the cloud. Further, Pandas makes heavy use of Numpy, relying on its low level calls to produce linear math results orders of magnitude more quickly than they would be handled by Python alone. These are just a few of the reasons Pandas is recommended as one of the first libraries to learn for all Pythonistas, and remains absolutely critical to Data Scientists.

About the Data

In this post, we’re going to be using a fascinating data set to demonstrate a useful slice of the Pandas library. This data set is particularly interesting as it’s part of a real world example, and we can all imagine people lined up at an airport (a place where things do occasionally go wrong). When looking at the data, I imagine people sitting in those uncomfortable airport seats having just found out that their luggage is missing – not just temporarily, but it’s nowhere to be found in the system! Or, better yet, imagine that a hardworking TSA employee accidentally broke a precious family heirloom.

So it’s time to fill out another form, of course. Now, getting data from forms is an interesting process as far as data gathering is concerned, as we have a set of data that happens at specific times. This actually means we can interpret the entries as a Time Series. Also, because people are submitting the information, we can learn things about a group of people, too.

Back to our example: let’s say we work for the TSA and we’ve been tasked with getting some insights about when these accidents are most likely to happen, and make some recommendations for improving the service.

Pandas, luckily, is a one-stop shop for exploring and analyzing this data set. Feel free to download the Excel file into your project folder to get started, or run the curl command below. Yes, Pandas can read .xls or .xlsx files with a single call to pd.read_excel()! In fact, it’s often helpful for beginners experienced with .csv or Excel files to think about how they would solve a problem in Excel, and then experience how much easier it can be in Pandas.

So, without further ado, open your terminal, a text editor, or your favorite IDE, and take a look for yourself with the guidance below.

Example data:

Take, for example, some claims made against the TSA during the screening of persons or a passenger’s property due to an injury, loss, or damage. The claims data includes claim number, incident date, claim type, claim amount, status, and disposition.

Directory: TSA Claims Data
Our Data Download: claims-2014.xls


Setup

To start off, let’s create a clean directory. You can put this wherever you’d like, or create a project folder in an IDE. Use your install method of choice to get Pandas: pip is probably the easiest.

$ mkdir -p ~/Desktop/pandas-tutorial/data && cd ~/Desktop/pandas-tutorial

Install pandas along with xlrd for loading Excel formatted files, matplotlib for plotting graphs, and Numpy for high-level mathematical functions.

$ pip3 install matplotlib numpy pandas xlrd

Optional: download the example data with curl:

$ curl -O

Launch Python:

$ python3
Python 3.7.1 (default, Nov  6 2018, 18:46:03)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

Import packages:

>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> import pandas as pd

Loading Data

Loading data with Pandas is easy. Pandas can accurately read data from almost any common format including JSON, CSV, and SQL. Data is loaded into Pandas’ “flagship” data structure, the DataFrame.

That’s a term you’ll want to remember. You’ll be hearing a lot about DataFrames. If that term seems confusing – think about a table in a database, or a sheet in Excel. The main point is that there is more than one column: each row or entry has multiple fields which are consistent from one row to the next.
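To see the idea concretely, a tiny DataFrame can be built by hand from a dict of columns (the values here are made up, not from the TSA data):

```python
import pandas as pd

# Each key becomes a column; each row has the same fields,
# just like a sheet in Excel or a table in a database
df = pd.DataFrame({
    'Claim Type': ['Property Damage', 'Personal Injury'],
    'Close Amount': [50.0, 0.0],
})
print(df.shape)          # (2, 2)
print(list(df.columns))  # ['Claim Type', 'Close Amount']
```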

You can load the example data straight from the web:

>>> df = pd.read_excel(io='', index_col='Claim Number')

Less excitingly, data can be loaded from a local file:

$ curl -O

>>> df = pd.read_excel(io='claims-2014.xls', index_col='Claim Number')

Basic Operations

Print information about a DataFrame (including the index dtype and column dtypes, non-null values, and memory usage) with df.info(), one of the more useful and versatile methods attached to DataFrames (there are nearly 150!):

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8855 entries, 2013081805991 to 2015012220083
Data columns (total 10 columns):
Date Received    8855 non-null datetime64[ns]
Incident Date    8855 non-null datetime64[ns]
Airport Code     8855 non-null object
Airport Name     8855 non-null object
Airline Name     8855 non-null object
Claim Type       8855 non-null object
Claim Site       8855 non-null object
Item Category    8855 non-null object
Close Amount     8855 non-null object
Disposition      8855 non-null object
dtypes: datetime64[ns](2), object(8)
memory usage: 761.0+ KB

View the first n rows:

>>> df.head(n=3)  # see also df.tail()
    Claim Number Date Received       Incident Date Airport Code       ...              Claim Site                   Item Category Close Amount      Disposition
0  2013081805991    2014-01-13 2012-12-21 00:00:00          HPN       ...         Checked Baggage  Audio/Video; Jewelry & Watches            0             Deny
1  2014080215586    2014-07-17 2014-06-30 18:38:00          MCO       ...         Checked Baggage                               -            0             Deny
2  2014010710583    2014-01-07 2013-12-27 22:00:00          SJU       ...         Checked Baggage                    Food & Drink           50  Approve in Full

[3 rows x 11 columns]

List all the columns in the DataFrame:

>>> df.columns
Index(['Claim Number', 'Date Received', 'Incident Date', 'Airport Code',
       'Airport Name', 'Airline Name', 'Claim Type', 'Claim Site',
       'Item Category', 'Close Amount', 'Disposition'],

Return a single column (important – also referred to as a Series):

>>> df['Claim Type'].head()
0    Personal Injury
1    Property Damage
2    Property Damage
3    Property Damage
4    Property Damage
Name: Claim Type, dtype: object

Hopefully, you’re starting to get an idea of what claims-2014.xls’s data is all about.

The Dtype

Data types are a fundamental concept that you’ll want to have a solid grasp of in order to avoid frustration later. Pandas adopts the nomenclature of Numpy, referring to a column’s data type as its dtype. Pandas also attempts to infer dtypes upon DataFrame construction (i.e. initialization).

To take advantage of the performance boosts intrinsic to Numpy, we need to become familiar with these types, and learn about how they roughly translate to native Python types.

Look again at df.info() and note the dtype assigned to each column of our DataFrame:

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8855 entries, 0 to 8854
Data columns (total 11 columns):
Date Received    8855 non-null datetime64[ns]
Incident Date    8855 non-null datetime64[ns]
Airport Code     8855 non-null object
Airport Name     8855 non-null object
Airline Name     8855 non-null object
Claim Type       8855 non-null object
Claim Site       8855 non-null object
Item Category    8855 non-null object
Close Amount     8855 non-null object
Disposition      8855 non-null object
dtypes: datetime64[ns](2), object(8)
memory usage: 761.1+ KB

dtypes are analogous to the text/number format settings typical of most spreadsheet applications, and Pandas uses dtypes to determine which kind(s) of operations may be performed on the data in a specific column. For example, mathematical operations can only be performed on numeric data types such as int64 or float64. Columns containing valid date and/or time values are assigned the datetime dtype, and text and/or binary data is assigned the catch-all object dtype.

In short, Pandas attempts to infer dtypes upon DataFrame construction. However, like many data analysis applications, the process isn’t always perfect.

It’s important to note that Pandas dtype inference errs on the side of caution: if a Series appears to contain more than one type of data, it’s assigned a catch-all dtype of ‘object’. This behavior is less flexible than a typical spreadsheet application and is intended to ensure dtypes are not inferred incorrectly but also requires the analyst to ensure the data is “clean” after it’s loaded.
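This caution is easy to observe on synthetic data: a single stray string is enough to demote a whole Series to the object dtype.

```python
import pandas as pd

# Homogeneous numeric data gets a numeric dtype...
numeric = pd.Series([1, 2, 3])
print(numeric.dtype)  # int64 on most platforms

# ...but one stray string demotes the whole Series to 'object'
mixed = pd.Series([1, 2, 'three'])
print(mixed.dtype)  # object
```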

Cleansing and Transforming Data

Data is almost always dirty: it almost always contains some datum with atypical formatting; some artifact unique to its medium of origin. Therefore, cleansing data is crucial to ensuring analysis derived therefrom is sound. The work of cleansing with Pandas primarily involves identifying and re-casting incorrectly inferred dtypes.

>>> df.dtypes
Date Received    datetime64[ns]
Incident Date    datetime64[ns]
Airport Code             object
Airport Name             object
Airline Name             object
Claim Type               object
Claim Site               object
Item Category            object
Close Amount             object
Disposition              object
dtype: object

Looking again at our DataFrame’s dtypes, we can see that Pandas correctly inferred the dtypes of Date Received and Incident Date as datetime64. Thus, datetime attributes of the column’s data are accessible during operations. For example, we can group our data by the hour element of a datetime64 column to determine which hours of the day certain types of incidents occur.

>>> grp = df.groupby(by=df['Incident Date'].dt.hour)
>>> grp['Item Category'].describe()
              count unique                   top freq
Incident Date
0              3421    146  Baggage/Cases/Purses  489
1                 6      5                 Other    2
2                11      9                     -    2
3                 5      5     Jewelry & Watches    1
4                49     18  Baggage/Cases/Purses    6
5               257     39                     -   33
6               357     54                     -   43
7               343     43              Clothing   41
8               299     47                     -   35
9               305     41                     -   31
10              349     45                 Other   43
11              343     41                     -   45
12              363     51                 Other   41
13              359     55                     -   45
14              386     60  Baggage/Cases/Purses   49
15              376     51                 Other   41
16              351     43  Personal Electronics   35
17              307     52                 Other   34
18              289     43  Baggage/Cases/Purses   37
19              241     46  Baggage/Cases/Purses   26
20              163     31  Baggage/Cases/Purses   23
21              104     32  Baggage/Cases/Purses   20
22              106     33  Baggage/Cases/Purses   19
23               65     25  Baggage/Cases/Purses   14

This works out quite nicely – however, note that Close Amount was loaded as an object dtype. Words like “Amount” are a good indicator that a column contains numeric values.

Let’s take a look at the values in Close Amount.

>>> df['Close Amount'].head()
0     0
1     0
2    50
3     0
4     0
Name: Close Amount, dtype: object

Those look like numeric values to me, so let’s take a look at the other end:

>>> df['Close Amount'].tail()
8850      0
8851    800
8852      0
8853    256
8854      -
Name: Close Amount, dtype: object

There’s the culprit: index # 8854 is a string value.

If Pandas can’t objectively determine that all of the values contained in a DataFrame column are the same numeric or date/time dtype, it defaults to an object.

Luckily, I know from experience that Excel’s “Accounting” number format typically formats 0.00 as a dash, -.

So how do we fix this? Pandas provides a general method, DataFrame.apply, which can be used to apply any single-argument function to each value of one or more of its columns.

In this case, we’ll use it to simultaneously convert the – to the value it represents in Excel (0.0) and re-cast the entire column from its initial object dtype to its correct dtype, float64.

First, we’ll define a new function to perform the conversion:

>>> def dash_to_zero(x):
...     if '-' in str(x):
...         return float()  # 0.0
...     else:
...         return x  # just return the input value as-is
...

Then, we’ll apply the function to each value of Close Amount:

>>> df['Close Amount'] = df['Close Amount'].apply(dash_to_zero)
>>> df['Close Amount'].dtype
dtype('float64')

These two steps can also be combined into a single-line operation using a Python lambda (assign the result back to the column to keep it):

>>> df['Close Amount'] = df['Close Amount'].apply(lambda x: 0. if '-' in str(x) else x)
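Another option, not used above but worth knowing, is Pandas’ built-in pd.to_numeric, which can coerce unparseable values like – to NaN so they can be filled with 0.0 (shown here on a toy Series):

```python
import pandas as pd

s = pd.Series(['0', '800', '-'])

# errors='coerce' turns the '-' into NaN, which we then fill with 0.0
cleaned = pd.to_numeric(s, errors='coerce').fillna(0.0)
print(cleaned.tolist())  # [0.0, 800.0, 0.0]
print(cleaned.dtype)     # float64
```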

Performing Basic Analysis

Once you’re confident that your dataset is “clean,” you’re ready for some data analysis! Aggregation is the process of getting summary data that may be more useful than the finer-grained values we started with.


>>> df.sum()
Close Amount    538739.51
dtype: float64

>>> df.min()
Date Received              2014-01-01 00:00:00
Incident Date              2011-08-24 08:30:00
Airport Code                                 -
Airport Name      Albert J Ellis, Jacksonville
Airline Name                                 -
Claim Type                                   -
Claim Site                                   -
Item Category                                -
Close Amount                                 0
Disposition                                  -
dtype: object

>>> df.max()
Date Received                       2014-12-31 00:00:00
Incident Date                       2014-12-31 00:00:00
Airport Code                                        ZZZ
Airport Name                 Yuma International Airport
Airline Name                                 XL Airways
Claim Type                              Property Damage
Claim Site                                        Other
Item Category    Travel Accessories; Travel Accessories
Close Amount                                    25483.4
Disposition                                      Settle
dtype: object


Next, let’s find all of the rows where Close Amount is greater than zero. This is helpful because we’d like to see patterns where the amount is actually positive, and it shows how conditional operators work.

>>> df[df['Close Amount'] > 0].describe()
       Close Amount
count   2360.000000
mean     228.279453
std      743.720179
min        1.250000
25%       44.470000
50%      100.000000
75%      240.942500
max    25483.440000
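Conditions like the one above can also be combined with & (and) and | (or), with each condition wrapped in parentheses. A small sketch on made-up data (not the real claims file):

```python
import pandas as pd

df = pd.DataFrame({
    'Close Amount': [0.0, 50.0, 800.0, 0.0],
    'Disposition': ['Deny', 'Approve in Full', 'Settle', 'Deny'],
})

# Single condition: rows with a positive Close Amount
positive = df[df['Close Amount'] > 0]
print(len(positive))  # 2

# Combined conditions: positive amounts that were settled
settled = df[(df['Close Amount'] > 0) & (df['Disposition'] == 'Settle')]
print(len(settled))  # 1
```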


In this example, we’ll walk through how to group by a single column’s values.

The GroupBy object is an intermediate step that allows us to aggregate several rows which share something in common, in this case the Disposition value. This is useful because we get a bird’s-eye view of different categories of data. Ultimately, we use describe() to see several aggregates at once.

>>> grp = df.groupby(by='Disposition')
>>> grp.describe()
                Close Amount
                       count        mean          std   min       25%      50%       75%       max
-                     3737.0    0.000000     0.000000  0.00    0.0000    0.000    0.0000      0.00
Approve in Full       1668.0  158.812116   314.532028  1.25   32.9625   79.675  159.3375   6183.36
Deny                  2758.0    0.000000     0.000000  0.00    0.0000    0.000    0.0000      0.00
Settle                 692.0  395.723844  1268.818458  6.00  100.0000  225.000  425.6100  25483.44
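Besides describe(), you can ask a GroupBy object for specific aggregates. A minimal sketch with invented values:

```python
import pandas as pd

df = pd.DataFrame({
    'Disposition': ['Deny', 'Settle', 'Settle', 'Approve in Full'],
    'Close Amount': [0.0, 100.0, 300.0, 50.0],
})

# count, mean, and sum of Close Amount for each Disposition
summary = df.groupby('Disposition')['Close Amount'].agg(['count', 'mean', 'sum'])
print(summary.loc['Settle', 'sum'])  # 400.0
```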

Group by multiple columns:

>>> grp = df.groupby(by=['Disposition', 'Claim Site'])
>>> grp.describe()
                                Close Amount
                                       count         mean          std     min       25%       50%        75%       max
Disposition     Claim Site
-               -                       34.0     0.000000     0.000000    0.00    0.0000     0.000     0.0000      0.00
                Bus Station              2.0     0.000000     0.000000    0.00    0.0000     0.000     0.0000      0.00
                Checked Baggage       2759.0     0.000000     0.000000    0.00    0.0000     0.000     0.0000      0.00
                Checkpoint             903.0     0.000000     0.000000    0.00    0.0000     0.000     0.0000      0.00
                Motor Vehicle           28.0     0.000000     0.000000    0.00    0.0000     0.000     0.0000      0.00
                Other                   11.0     0.000000     0.000000    0.00    0.0000     0.000     0.0000      0.00
Approve in Full Checked Baggage       1162.0   113.868072   192.166683    1.25   25.6600    60.075   125.9825   2200.00
                Checkpoint             493.0   236.643367   404.707047    8.95   60.0000   124.000   250.1400   6183.36
                Motor Vehicle            9.0  1591.428889  1459.368190  493.80  630.0000   930.180  1755.9800   5158.05
                Other                    4.0   398.967500   358.710134   61.11  207.2775   317.385   509.0750    899.99
Deny            -                        4.0     0.000000     0.000000    0.00    0.0000     0.000     0.0000      0.00
                Checked Baggage       2333.0     0.000000     0.000000    0.00    0.0000     0.000     0.0000      0.00
                Checkpoint             407.0     0.000000     0.000000    0.00    0.0000     0.000     0.0000      0.00
                Motor Vehicle            1.0     0.000000          NaN    0.00    0.0000     0.000     0.0000      0.00
                Other                   13.0     0.000000     0.000000    0.00    0.0000     0.000     0.0000      0.00
Settle          Checked Baggage        432.0   286.271968   339.487254    7.25   77.0700   179.995   361.5700   2500.00
                Checkpoint             254.0   487.173031  1620.156849    6.00  166.9250   281.000   496.3925  25483.44
                Motor Vehicle            6.0  4404.910000  7680.169379  244.00  841.8125  1581.780  2215.5025  20000.00


While aggregating groups of data is one of the best ways to get insights, visualizing data lets patterns jump out from the page and is straightforward for those who aren’t as familiar with aggregate values. Properly formatted visualizations are critical to communicating meaning in the data, and it’s nice to see that Pandas has some of these functions out of the box:

>>> df.plot(x='Incident Date', y='Close Amount')

Incident Date by Close Amount

Exporting Transformed Data

Finally, we may need to commit either our original data or the aggregates as a DataFrame to a file format different from the one we started with, since Pandas does not limit you to writing back out to the same format.

The most common flat file to write to from Pandas will be the .csv. From the visualization, it looks like the cost of TSA claims, while occasionally very high due to some outliers, is improving in 2015. We should probably recommend comparing staffing and procedural changes to continue in that direction, and explore in more detail why we have more incidents at certain times of day.

Like loading data, Pandas offers a number of methods for writing your data to file in various formats. Writing back to an Excel file is slightly more involved than the others, so let’s write to an even more portable format: CSV. To write your transformed dataset to a new CSV file:

>>> df.to_csv(path_or_buf='claims-2014.v1.csv')
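to_csv accepts a number of options; one commonly useful flag is index=False, which omits the DataFrame’s row index from the output. A small sketch with a hypothetical two-row frame and filename:

```python
import pandas as pd

df = pd.DataFrame({'Close Amount': [0.0, 50.0]})

# index=False leaves the row index out of the CSV
df.to_csv('claims-sample.csv', index=False)

with open('claims-sample.csv') as f:
    print(f.read())
```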

Final Notes

Here we’ve seen a workflow that is both interesting and powerful. We’ve taken a round trip all the way from a government Excel file, into Python, through some fairly powerful data visualization, and back to a .csv file which could be more universally accessed, all through the power of Pandas. Further, we’ve covered the three central objects in Pandas: DataFrames, Series, and dtypes. Best of all, we have a deeper understanding of an interesting, real-world data set.

These are the core concepts to understand when working with Pandas, and now you can ask intelligent questions (of yourself, or of Google) about these different objects. This TSA data use case has shown us exactly what Pandas is good for: the exploration, analysis, and aggregation of data to draw conclusions.

The analysis and exploration of data is important in practically any field, but it is especially useful to Data Scientists and AI professionals who may need to crunch and clean data in very specific, finely-grained ways, like getting moving averages on stock ticks. Additionally, certain tasks may need to be automated, and this could prove difficult or expensive in sprawling applications like Excel, or Google Sheets, which may not offer all the functionality of Pandas with the full power of Python.

Just imagine telling a business administrator that they may never have to run that broken spreadsheet macro ever again! Once analysis is automated, it can be deployed as a service or applied to hundreds of thousands of records streaming from a database. Alternatively, Pandas could be used to make critical decisions after establishing statistical associations between patterns, as indeed it is every day.

Next, be sure to check out Python’s extensive database libraries (e.g. SQLAlchemy) or API clients (like the Google Sheets/Slides Python Client or the Airtable API) to put your results in front of domain experts. The possibilities are endless, and are only enhanced by Python’s mature libraries and active community.

This article originally appeared on (Reprinted with permission)


TensorFlow or PyTorch? A Guide to Python Machine Learning Libraries (with examples!)

By The Kite Team


Python is the fastest-growing programming language out there. That isn’t surprising given that it’s simple, easy to use, free, and applicable for many computing tasks. Data scientists in particular have embraced Python’s efficient syntax, learnability, and easy integrations with other languages such as C and C++.

All these positive qualities, along with the recent spike of interest in machine learning and artificial intelligence, can help explain the plethora of powerful open-source libraries and frameworks for machine learning and data science applications. There are libraries that can be put to use in a multitude of applications, including:

  • natural language processing / NLP (Tensorflow)
  • visualization and analysis of complex data (Theano)
  • image recognition (Caffe)
  • prediction and recommendation

Open-source frameworks have popped up to address all of the above applications, and now it can be confusing to decide on which library to use for which project. Tensorflow or Sci-kit? Should I use Keras on top of Microsoft’s CNTK? What’s the best application to use MXNet?

Once you’ve determined the goals and overall priorities for your project, this article can help you select the framework that is the best fit. Some of the questions that you’ll need to consider include:

  • Your confidence level with machine learning fundamentals
  • If you will be using the framework for classic machine learning algorithms or for Deep Learning
  • What application you will be using the framework for: be it heavy numerical computations, complex data analysis, image analysis, or education and research
  • Whether or not you’ll be using any additional hardware (like GPUs and TPUs), software, or cloud services for scaling on to bigger data sets.

Each open-source framework available today has its own strengths and weaknesses when measured across these factors. And choosing the best framework for your needs will really depend on just what you want to accomplish.

For example, if you are new to machine learning or want to use classic machine learning algorithms, Sci-kit could be the best choice. On the other hand, if you need to do heavy numerical computations, Theano would work much better. In any case, no matter your specific situation – this guide will aim to help you figure out which framework is the perfect fit.

Library       | Best Application                   | Can Run on External Hardware | Machine Learning or Deep Learning? | ML Knowledge Required | Learning Curve
Sci-Kit Learn | Learning ML                        | No                           | ML only                            | Beginner              | Very Low
PyTorch       | Academic use and production        | Yes                          | Both                               | Beginner              | Low
Caffe         | Image processing                   | Yes                          | Both                               | Intermediate          | Low
TensorFlow    | Processing large data sets quickly | Yes                          | Both                               | Intermediate          | High
Theano        | High-speed computation             | Yes                          | Both                               | Advanced              | Very High

Among the myriad of options available for open-source Python frameworks, here is a compilation of our top 5 choices, in descending order. You can follow along with examples for each library, stored in Kite’s GitHub repository.

5. Sci-Kit Learn

Ideal for: ML beginners

Sci-kit Learn is a library that features a host of classical machine learning algorithms like Support Vector Machines (SVMs), K-Nearest Neighbors (KNN) classifiers, Random Forests, and regression algorithms. It includes options for both supervised and unsupervised learning, making it an effective tool for statistical modeling.

It is built on other Python libraries like SciPy, NumPy, and Matplotlib, and some of its core algorithms are also written in Cython. I created an example of a Sci-Kit operation here.


Pros:

  • Great for beginners and for those looking to explore machine learning algorithms
  • Good for data-mining and simple projects like predictions on small or labeled data sets


Cons:

  • Does not support ANNs
  • Does not support GPU computing

What sets this framework apart from others is an easy-to-use interface for developers and a high level of abstraction that allows especially beginners in machine learning to get easily acquainted with the platform, without having to deal with the nitty-gritty of actual algorithms.

It’s easy to run and debug, and there are some nice and easy tutorials available to help understand the algorithms when you do have to work with them. However, Sci-kit Learn does have a couple of limitations.

First, it does not support Artificial Neural Networks.

Second, it’s only suitable for small projects with small datasets, and for tasks that are not particularly computationally intensive. This is mainly due to the fact that the framework does not support GPU computing.

For more seasoned or hard-core developers, it can feel limiting to some extent, as the abstraction doesn’t allow for fine tuning the underlying algorithms.
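To give a flavor of that high level of abstraction, here is a minimal sketch (not the linked example) that trains a K-Nearest Neighbors classifier on scikit-learn’s bundled iris dataset in a handful of lines:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Load the bundled iris dataset and fit a KNN classifier
X, y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Accuracy on the training data (for a real project, score a held-out test set)
print(clf.score(X, y))
```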

4. Theano

Ideal for: Hardcore developers requiring high-speed computation over a single GPU

If you’re looking for a framework that can crunch numbers like a pro, then Theano will be your best bet.

Theano is a workhorse well-equipped for numerical computing and sits under a large number of other deep learning frameworks like Tensorflow and Keras. The framework lets you efficiently work with mathematical expressions that include multi-dimensional arrays.


Pros:

  • Efficiency in crunching large and multi-dimensional data sets
  • Provides the developer ample flexibility to fine-tune underlying algorithms and create novel models


Cons:

  • A very steep learning curve
  • Does not support scaling over multiple GPUs

While Sci-kit Learn is for beginners, Theano is only for advanced deep learning experts.

The API is low-level, so you really need to feel comfortable in your coding abilities if you’re looking to explore this framework. Theano’s syntax is quite tightly integrated with NumPy, and its code can run efficiently on both a CPU and a GPU. You can see an example of a script using Theano here.

In contrast to Sci-kit learn, Theano empowers any developer with a complete flexibility to fine-tune and control their models. It even allows the developer to implement and test completely unconventional models.

Although Theano works better than Tensorflow over a single GPU, it still doesn’t match up to Tensorflow when working with multiple GPUs.

However, since Theano has been around longer, it does have a lot more documentation. The biggest drawback is that MILA, the makers of Theano, have decided to stop maintaining the framework following its 1.0 release. Nonetheless, it continues to be a great choice for avid deep learning enthusiasts.

3. Caffe

Ideal for: Mid-level programmers and image processing

Caffe (Convolutional Architecture for Fast Feature Embedding) was mainly built to support Convolutional Neural Networks (CNNs) and is the framework of choice for those working on computer vision, image processing, and feedforward networks.

The framework is a Python-based API mainly written in C++. Models in Caffe are represented by Protobuf configuration files, and the framework is, in fact, the fastest CNN implementation among all deep learning frameworks.

It works well on image segmentation and classification tasks. With a single GPU, Caffe can process more than 60 million images in a day! Here is a simple example of using a pre-trained Caffe model to correctly identify an image as the digit, “5”.


Pros:

  • Has great ready-to-use models for image recognition
  • It is the fastest CNN implementation framework
  • Models and optimizations are configured rather than coded


Cons:

  • Not suitable for RNNs
  • Poor documentation
  • Creating new layers requires defining full forward, backward and gradient updates

The best thing about Caffe is that models and optimizations are not ‘coded’, but rather ‘configured’ – this reduces a lot of headaches for developers.

So, if you have a large set of images that you need to classify or run some regression algorithms on, you can quickly apply a DL Network without having to write even a line of code.

You can also train over multiple GPUs, but this has some limitations: for example, you can’t do model/data parallelism.

Another upside of Caffe is that it has a pretty good developer community – there’s a whole ‘Model Zoo’ available where you can find a number of CNN implementations and models, like AlexNet, GoogleNet, NIN, etc. This is something that other frameworks lack.

Although the framework is great for CNNs and image processing, it’s not suitable for Recurrent Neural Networks (RNNs) and applications involving text, sound, and time series data. Moreover, even though there are a lot of layers ready to be implemented, the creation of new layers can be tedious, as one would need to define full forward, backward and gradient updates for each new layer.

Finally, the framework offers a medium-level abstraction – it’s high-level enough to allow you to do quick experiments and flexible enough to allow you to fine-tune some of the aspects. This detail may be a positive for mid-level developers, but it feels somewhat limiting for hardcore developers.

2. Pytorch

Ideal for: Both academic use and production

Pytorch was developed using Python, C++, and CUDA backends. Created by Facebook’s Artificial Intelligence Research team (FAIR), Pytorch is fairly new but is already competing neck-and-neck with Tensorflow, and many predict it will soon become a go-to alternative to many other frameworks.


Pros:

  • Coding is easy, so it has a flatter learning curve
  • Supports dynamic graphs so you can adjust on-the-go
  • Supports GPU acceleration


Cons:

  • Quite new, so it has a smaller community and fewer resources available online

Pytorch is being lauded particularly by beginners, mostly due to its easy-to-write code – but the framework is basically a blend of both high and low-level APIs. In actuality, it’s suitable for both academic uses as well as hard-core deep learning.

It features a number of pre-trained models. When coding in Pytorch, you don’t need to categorize numbers into data types like ‘int’, ‘short’, or ‘double’, as you do in other coding languages. This makes performing operations and functions on this framework more intuitive compared to other options. You can see example code for PyTorch here.

The highlight of this framework, though, is that it offers developers the ability to use dynamic graphs. Tensorflow, Theano, and their derivatives allow you to create only static graphs, so you have to define the whole graph for the model before you can run it. However, in Pytorch, you can define or adjust your graph during runtime, so it’s more flexible and allows you to use variable length inputs, especially in your RNNs.

The framework also provides strong support for GPU acceleration, so you get both efficiency and speed.

The main drawback to Pytorch, though, is that the framework is still growing and you may encounter some bugs. Moreover, owing to its younger age, the resources to supplement its official documentation are still quite scant. But looking at overall trends, this will not be a problem for too long, as more and more developers are converting to Pytorch and the community is growing slowly but steadily.

1. Tensorflow

Ideal for: Intermediate-level developers and for developing production models that need to quickly process vast data sets

Tensorflow is currently hailed as the best ML framework out there. Within a very short time, it has become a favorite for many developers and is witnessing an ever-growing community and extraordinary development momentum.

The framework was developed by the Google Brain team and supports all platforms, from Linux to Android. It is a high-level framework that allows you to run low-level code with supporting libraries. Ultimately, it allows you to monitor the progress of the training process, while tracking a lot of metrics and not having to bother about most of the other details.


Pros:

  • Flexibility
  • Contains several ready-to-use ML models and ready-to-run application packages
  • Scalability with hardware and software
  • Large online community


Cons:

  • Supports only NVIDIA GPUs
  • A slightly steep learning curve

Tensorflow’s architecture and UX are different from other frameworks in that the nodes in a Tensorflow graph represent mathematical operations, while the edges of the graph represent multidimensional arrays (tensors). These tensors flow between the nodes, giving you a lot of flexibility when it comes to creating new nodes, unlike the Caffe architecture, for example.

The system also has a host of models to choose from: the framework is pre-loaded with packages that let you perform voice recognition and machine translation, and models that let you run regressions, classifications, neural networks and an assortment of other algorithms.

Tensorflow can be used for quite a few applications within machine learning. Check out a basic “Hello, World” program here and a more traditional matrix example here.

But the feature that really takes the cake is Tensorflow’s computing capabilities. To date, Tensorflow is the strongest contender in the distributed processing arena. It provides remarkable scalability and lets you deploy your computations to multiple CPUs, GPUs, other servers, mobile devices, and the Google Cloud Machine Learning Engine. You can do this without having to rewrite any code – that’s truly powerful.

The main downside, though, is that at the moment, it only supports NVIDIA GPUs. Also, when it comes to RNN support, it is ultimately weaker than some other frameworks and the learning curve can be a little steeper than Sci-kit and Pytorch.

Overall, with a strong Google backing and a huge online community, Tensorflow is here for the long haul.


To sum up, while Tensorflow has gained enormous popularity owing to its flexibility and distributed processing capabilities, Pytorch is also slowly gaining momentum owing to its flatter learning curve and ability to process dynamic graphs. Both of these frameworks are multi-purpose and can be applied to many types of projects. Other frameworks like Caffe, Theano, and Sci-Kit Learn are more specialized and aimed toward specific tasks.

Nonetheless, machine learning and artificial intelligence are the future, and these open source frameworks have brought ML within the grasp of any developer with a really keen interest. These libraries provide the tools for any Pythonista to practice machine learning principles. Pick a framework that meets your level of expertise and application domain and try out your first project!


Python Programming: How to make text lowercase & uppercase


A quick and simple way to change all text into lowercase or uppercase letters using Python.


text = "New York City"

lowertext = text.lower()

print(lowertext)

output: new york city

or you can simply apply .lower() to text directly:

print(text.lower())

output: new york city

Uppercase is the same process, using .upper() instead of .lower().
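For completeness, the matching uppercase example:

```python
text = "New York City"

uppertext = text.upper()

print(uppertext)  # NEW YORK CITY
```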




What is Web Scraping?

By Zac Clancy for

Table of Contents

  • Introducing web scraping
  • Some use cases of web scraping
  • How does it work?
  • Robots.txt
  • A simple example
  • Working with HTML
  • Data processing
  • Next steps

Introducing web scraping

Simply put, web scraping is one of the tools developers use to gather and analyze information from the Internet.

Some websites and platforms offer application programming interfaces (APIs) which we can use to access information in a structured way, but others might not. While APIs are certainly becoming the standard way of interacting with today’s popular platforms, we don’t always have this luxury when interacting with most of the websites on the internet.

Rather than reading data from standard API responses, we’ll need to find the data ourselves by reading the website’s pages and feeds.

Some use cases of web scraping

The World Wide Web was born in 1989 and web scraping and crawling entered the conversation not long after in 1993.

Before scraping, search engines were compiled from lists of links collected by website administrators and arranged into a long list of links somewhere on their website. The first web scraper and crawler, the World Wide Web Wanderer, was created to follow all these indexes and links to try to determine how big the internet was.

It wasn’t long after this that developers started using crawlers and scrapers to create crawler-based search engines that didn’t require human assistance. These crawlers would simply follow the links they came across on each page and save information about the page. Since the web is a collaborative effort, the crawler could easily and infinitely follow embedded links on websites to other platforms, and the process would continue forever.

Nowadays, web scraping has its place in nearly every industry. In newsrooms, web scrapers are used to pull in information and trends from thousands of different internet platforms in real time.

Spending a little too much on Amazon this month? Websites exist that will let you know, and, in most cases, will do so by using web scraping to access that specific information on your behalf.

Machine learning and artificial intelligence companies are scraping billions of social media posts to better learn how we communicate online.

So how does it work?

The process a developer builds for web scraping looks a lot like the process a user takes with a browser:

  1. A URL is given to the program.
  2. The program downloads the response from the URL.
  3. The program processes the downloaded file depending on data required.
  4. The program starts over with a new URL.

The nitty-gritty comes in steps 3 and 4, in which data is processed and the program determines how to continue (or if it should at all). For Google’s crawlers, step 3 likely includes collecting all URL links on the page so that the web scraper has a list of places to begin checking next. This is recursive by design and allows Google to efficiently follow paths and discover new content.

There are many heavily used, well built libraries for reading and working with the downloaded HTML response. In the Ruby ecosystem Nokogiri is the standard for parsing HTML. For Python, BeautifulSoup has been the standard for 15 years. These libraries provide simple ways for us to interact with the HTML from our own programs.

These code libraries will accept the page source as text, and a parser for handling the content of the text. They’ll return helper functions and attributes which we can use to navigate through our HTML structure in predictable ways and find the values we’re looking to extract.

Scraping projects involve a good amount of time spent analyzing a web site’s HTML for classes or identifiers, which we can use to find information on the page. Using the HTML below we can begin to imagine a strategy to extract product information from the table below using the HTML elements with the classes products and product.

<table class="products">
  <tr class="product">...</tr>
  <tr class="product">...</tr>
</table>

In the wild, HTML isn’t always as pretty and predictable. Part of the web scraping process is learning about your data and where it lives on the pages as you go along. Some websites go to great lengths to prevent web scraping, some aren’t built with scraping in mind, and others just have complicated user interfaces which our crawlers will need to navigate through.


Robots.txt

While not an enforced standard, it’s been common since the early days of web scraping to check for the existence and contents of a robots.txt file on each site before scraping its content. This file can be used to define inclusion and exclusion rules that web scrapers and crawlers should follow while crawling the site. You can check out Facebook’s robots.txt file for a robust example: this file is always located at /robots.txt so that scrapers and crawlers can always look for it in the same spot. Additionally, GitHub’s and Twitter’s robots.txt files are good examples.

An example robots.txt file that prohibits web scraping and crawling would look like this:
User-agent: *
Disallow: /

The User-agent: * section is for all web scrapers and crawlers. In Facebook’s, we see that they set User-agent to be more explicit and have sections for Googlebot, Applebot, and others.

The Disallow: / line informs web scrapers and crawlers who observe the robots.txt file that they aren’t permitted to visit any pages on this site. Conversely, if this line read Allow: /, web scrapers and crawlers would be allowed to visit any page on the website.

The robots.txt file can also be a good place to learn information about the website’s architecture and structure. Reading where our scraping tools are allowed to go – and not allowed to go – can help inform us on sections of the website we perhaps didn’t know existed, or may not have thought to look at.

If you’re running a website or platform it’s important to know that this file isn’t always respected by every web crawler and scraper. Larger properties like Google, Facebook, and Twitter respect these guidelines with their crawlers and information scrapers, but since robots.txt is considered a best practice rather than an enforceable standard, you may see different results from different parties. It’s also important not to disclose private information which you wouldn’t want to become public knowledge, like an admin panel on /admin or something like that.
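Python’s standard library can parse these rules for us via urllib.robotparser. A quick sketch using the “deny everything” rules shown above (example.com is just a placeholder host):

```python
from urllib.robotparser import RobotFileParser

# Parse the example robots.txt that disallows everything
rp = RobotFileParser()
rp.parse(['User-agent: *', 'Disallow: /'])

# Any URL on the site is off-limits to any crawler
print(rp.can_fetch('*', 'https://example.com/page'))  # False
```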

A simple example

To illustrate this, we’ll use Python plus the BeautifulSoup and Requests libraries.

import requests
from bs4 import BeautifulSoup

page = requests.get('')
soup = BeautifulSoup(page.text, 'html.parser')

We’ll go through this line-by-line:

page = requests.get('')

This uses the requests library to make a request to and return the response.

soup = BeautifulSoup(page.text, 'html.parser')

The requests library assigns the text of our response to an attribute called text which we use to give BeautifulSoup our HTML content. We also tell BeautifulSoup to use Python 3’s built-in HTML parser html.parser.

Now that BeautifulSoup has parsed our HTML text into an object that we can interact with, we can begin to see how information may be extracted.

paragraphs = soup.find_all('p')

Using find_all we can tell BeautifulSoup to only return HTML paragraphs <p> from the document.

If we were looking for a div with a specific ID (#content) in the HTML we could do that in a few different ways:

element ='#content')
# or
element = soup.find_all('div', id='content')
# or
element = soup.find(id='content')

In the Google scenario from above, we can imagine that they have a function that does something similar to grab all the links off of the page for further processing:

links = soup.find_all('a', href=True)

The above snippet will return all of the <a> elements from the HTML which are acting as links to other pages or websites. Most large-scale web scraping implementations will use a function like this to capture local links on the page, outbound links off the page, and then determine some priority for the links’ further processing.
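One detail worth noting: captured href values are often relative, so crawlers typically resolve them against the current page’s URL before queueing them. The standard library’s urllib.parse.urljoin handles this (example.com here is just a placeholder):

```python
from urllib.parse import urljoin

base = 'https://example.com/products/'

# Relative links resolve against the current path
print(urljoin(base, 'item?id=5'))  # https://example.com/products/item?id=5

# Root-relative links resolve against the site root
print(urljoin(base, '/about'))     # https://example.com/about
```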

Working with HTML

The most difficult aspect of web scraping is analyzing and learning the underlying HTML of the sites you’ll be scraping. If an HTML element has a consistent ID or set of classes, then we should be able to work with it fairly easily; we can just select it using our HTML parsing library (Nokogiri, BeautifulSoup, etc.). If the element on the page doesn’t have consistent classes or identifiers, we’ll need to access it using a different selector.

Imagine our HTML page contains the following table which we’d like to extract product information from:




<table>
  <tbody>
    <tr><td>Shirt</td><td>Athletic</td><td>$19.99</td></tr>
    <tr><td>Jacket</td><td>Outdoor</td><td>$124.99</td></tr>
  </tbody>
</table>

BeautifulSoup allows us to parse tables and other complex elements fairly simply. Let’s look at how we’d read the table’s rows in Python:

# Find all the HTML tables on the page
tables = soup.find_all('table')

# Loop through all of the tables
for table in tables:
    # Access the table's body
    table_body = table.find('tbody')
    # Grab the rows from the table body
    rows = table_body.find_all('tr')

    # Loop through the rows
    for row in rows:
        # Extract each HTML column from the row
        columns = row.find_all('td')

        # Loop through the columns
        for column in columns:
            # Print the column value
            print(column.text)
The above code snippet would print Shirt, followed by Athletic, and then $19.99 before continuing on to the next table row. While simple, this example illustrates one of the many strategies a developer might take for retrieving data from different HTML elements on a page.

Data processing

Researching and inspecting the websites you’ll be scraping for data is a crucial component of each project. We’ll generally have a model that we’re trying to fill with data for each page. If we were scraping restaurant websites, we’d probably want to make sure we’re collecting at least the name, address, and hours of operation, with other fields added as we’re able to find the information. You’ll begin to notice that some websites are much easier to scrape for data than others – some are even defensive against it!
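One way to make that target model explicit is a small record type. Here is a minimal sketch using a Python dataclass; the field names are illustrative, not from any real schema:

```python
from dataclasses import dataclass, field
from typing import Optional

# A hypothetical target model for a restaurant-scraping project.
@dataclass
class Restaurant:
    name: str
    address: str
    hours: Optional[str] = None                 # not every site publishes its hours
    extras: dict = field(default_factory=dict)  # anything else we manage to find

# Required fields must be present; optional ones default to None.
record = Restaurant(name='Blue Fern Cafe', address='12 Main St')
record.extras['cuisine'] = 'vegetarian'
```

Having the model pinned down up front makes it obvious which selectors still need writing for a given site.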

Once you’ve got your data in hand there are a number of different options for handling, presenting, and accessing that data. In many cases you’ll probably want to handle the data yourself, but there’s a slew of services offered for many use cases by various platforms and companies.

  • Search indexing: Looking to store the text contents of websites and easily search? Algolia and Elasticsearch are good for that.
  • Text analysis: Want to extract people, places, money and other entities from the text? Maybe spaCy or Google’s Natural Language API are for you.
  • Maps and location data: If you’ve collected some addresses or landmarks, you can use OpenStreetMap or MapBox to bring that location data to life.
  • Push notifications: If you want to get a text message when your web crawler finds a specific result, check out Twilio or Pusher.

Next steps

In this post, we learned about the basics of web scraping and looked at some simple crawling examples which helped demonstrate how we can interact with HTML pages from our own code. Ruby’s Nokogiri, Python’s BeautifulSoup, and JavaScript’s Nightmare are powerful tools to begin learning web scraping with. These libraries are relatively simple to start with, but offer powerful interfaces that you can extend in more advanced use cases.

Moving forward from this post, try to create a simple web scraper of your own! You could potentially write a simple script that reads a tweet from a URL and prints the tweet text into your terminal. With some practice, you’ll be analyzing the HTML of the websites you visit, learning their structure, and understanding how you’d navigate their elements with a web scraper.

This article originally appeared on (Reprinted with permission)


Image Segmentation with Python

By Pranathi.V.N. Vemuri for

Table of Contents

  • Introduction to image segmentation
  • Approach
    • Cleaning
    • Segmentation
    • Validation
  • Loading and visualizing data
  • Pre-processing
  • Segmentation
  • Validation
    • The confusion matrix
    • Accuracy
    • F1 score
    • MCC
  • Validation visualization
  • Final notes

Introduction to image segmentation

In this article we look at an interesting data problem – making decisions about the algorithms used for image segmentation, or separating one qualitatively different part of an image from another.

Example code for this article may be found at the Kite Github repository. We have provided tips on how to use the code throughout.

As our example, we work through the process of differentiating vascular tissue in images produced by Knife-Edge Scanning Microscopy (KESM). While this may seem like a specialized use case, there are far-reaching implications, especially regarding preparatory steps for statistical analysis and machine learning.

Data scientists and medical researchers alike could use this approach as a template for any complex, image-based data set (such as astronomical data), or even large sets of non-image data. After all, images are ultimately matrices of values, and we’re lucky to have an expert-sorted data set to use as ground truth. In this process, we’re going to expose and describe several tools available via image processing and scientific Python packages (opencv, scikit-image, and scikit-learn). We’ll also make heavy use of the numpy library to ensure consistent storage of values in memory.

The procedures we’ll explore could be used for any number of statistical or supervised machine learning problems, as there are a large number of ground truth data points. In order to choose our image segmentation algorithm and approach, we will demonstrate how to visualize the confusion matrix, using matplotlib to colorize where the algorithm was right and where it was wrong. In early stages, it’s more useful for a human to be able to clearly visualize the results than to aggregate them into a few abstract numerals.



To remove noise, we use a simple median filter to remove the outliers, but one can use a different noise or artifact removal approach. The artifacts vary across acquisition systems (microscopy techniques) and may require complicated algorithms to restore the missing data. Artifacts commonly fall into two categories:

  • blurry or out-of-focus areas
  • imbalanced foreground and background (correct with histogram modification)


For this article, we limit segmentation to Otsu’s approach, after smoothing an image using a median filter, followed by validation of results. You can use the same validation approach for any segmentation algorithm, as long as the segmentation result is binary. These algorithms include, but are not limited to, various circular thresholding approaches that consider different color spaces.

Some examples are:

  • Li Thresholding
  • An adaptive thresholding method that is dependent on local intensity
  • Deep learning algorithms like UNet used commonly in biomedical image segmentation
  • Deep learning approaches that semantically segment an image


We begin with a ground truth data set, which has already been manually segmented. To quantify the performance of a segmentation algorithm, we compare ground truth with the predicted binary segmentation, showing accuracy alongside more effective metrics. Accuracy can be abnormally high despite a low number of true positives (TP) or false negatives (FN). In such cases, the F1 score and MCC are better quantification metrics for binary classification. We’ll go into detail on the pros and cons of these metrics later.

For qualitative validation, we overlay the confusion matrix results (i.e. where exactly the true positive, true negative, false positive, and false negative pixels are) onto the grayscale image. This validation can also be applied to a color image on a binary image segmentation result, although the data we used in this article is a grayscale image. In the end, we will present the whole process so that you can see the results for yourself. Now, let’s look at the data, and the tools used to process that data.

Loading and visualizing data

We will use the below modules to load, visualize, and transform the data. These are useful for image processing and computer vision algorithms, with simple and complex array mathematics. The module names in parentheses will help if installing individually.

Module                   Reason
numpy                    Histogram calculation, array math, and equality testing
matplotlib               Graph plotting and image visualization
scipy                    Image reading and median filter
cv2 (opencv-python)      Alpha compositing to combine two images
skimage (scikit-image)   Image thresholding
sklearn (scikit-learn)   Binary classifier confusion matrix
nose                     Testing

Displaying Plots Sidebar: If you are running the example code in sections from the command line, or experience issues with the matplotlib backend, disable interactive mode by removing the plt.ion() call, and instead call plt.show() at the end of each section by uncommenting the suggested calls in the example code. Either the ‘Agg’ or ‘TkAgg’ backend will serve for image display. Plots will be displayed as they appear in the article.

Importing modules

import cv2
import matplotlib.pyplot as plt
import numpy as np
import scipy.misc
import scipy.ndimage
import skimage.filters
import sklearn.metrics

# Turn on interactive mode. Turn off with plt.ioff()
plt.ion()

In this section, we load and visualize the data. The data is an image of mouse brain tissue stained with India ink, generated by Knife-Edge Scanning Microscopy (KESM). This 512 x 512 image is a subset, referred to as a tile. The full data set is 17480 x 8026 pixels, 799 slices in depth, and 10 GB in size. So, we will write algorithms to process a tile of size 512 x 512, which is only 150 KB.

Individual tiles can be mapped to run on multiprocessing/multithreaded (i.e. distributed) infrastructure, and then stitched back together to obtain the full segmented image. The specific stitching method is not demonstrated here. Briefly, stitching involves indexing the full matrix and putting the tiles back together according to this index. For combining numerical values, you can use map-reduce, which yields metrics such as the sum of all the F1 scores across tiles, which you can then average. Simply append the results to a list, and then perform your own statistical summary.
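The reduce step just described can be sketched in plain Python: collect each tile’s F1 score into a list, then summarize. The scores below are hypothetical stand-ins for real per-tile results.

```python
from statistics import mean

# Hypothetical per-tile F1 scores from the map step; the reduce step
# simply appends each result to a list and summarizes.
tile_f1_scores = []
for f1 in (0.81, 0.86, 0.79, 0.90):  # stand-ins for real tile results
    tile_f1_scores.append(f1)

overall_f1 = mean(tile_f1_scores)
```

The same list can feed any other summary statistic (median, standard deviation) without changing the map step.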

The dark circular/elliptical disks on the left are vessels and the rest is the tissue. So, our two classes in this dataset are:

  • foreground (vessels) – labeled as 255
  • background (tissue) – labeled as 0

The last image on the right below is the ground truth image. The ground truth was produced by a board-certified pathologist, who manually traced the vessels by drawing contours and filling them. We can use several examples like these from experts to train supervised deep learning networks and validate them on a larger scale. We can also augment the data by giving these examples to crowdsourced platforms, training contributors to manually trace a different set of images on a larger scale for validation and training. The image in the middle is just an inverted grayscale image, which corresponds with the ground truth binary image.

Loading and visualizing images in figure above

grayscale = scipy.misc.imread('grayscale.png')
grayscale = 255 - grayscale
groundtruth = scipy.misc.imread('groundtruth.png')
plt.subplot(1, 3, 1)
plt.imshow(255 - grayscale, cmap='gray')
plt.subplot(1, 3, 2)
plt.imshow(grayscale, cmap='gray')
plt.title('inverted grayscale')
plt.subplot(1, 3, 3)
plt.imshow(groundtruth, cmap='gray')
plt.title('groundtruth binary')


Before segmenting the data, you should go through the dataset thoroughly to determine if there are any artifacts due to the imaging system. In this example, we only have one image in question. By looking at the image, we can see that there aren’t any noticeable artifacts that would interfere with the segmentation. However, you can remove outlier noise and smooth an image using a median filter. A median filter replaces the outliers with the median (within a kernel of a given size).

Median filter of kernel size 3

median_filtered = scipy.ndimage.median_filter(grayscale, size=3)
plt.imshow(median_filtered, cmap='gray')
plt.title('median filtered image')

To determine which thresholding technique is best for segmentation, you could start by thresholding to determine if there is a distinct pixel intensity that separates the two classes. In such cases, you can use that intensity obtained by the visual inspection to binarize the image. In our case, there seem to be a lot of pixels with intensities of less than 50 which correspond to the background class in the inverted grayscale image.

Although the distribution of the classes is not bimodal (having two distinct peaks), it still has a distinction between foreground and background, which is where the lower intensity pixels peak and then hit a valley. This exact value can be obtained by various thresholding techniques. The segmentation section examines one such method in detail.

Visualize histogram of the pixel intensities

counts, vals = np.histogram(grayscale, bins=range(2 ** 8))
plt.plot(range(0, (2 ** 8) - 1), counts)
plt.title('Grayscale image histogram')
plt.xlabel('Pixel intensity')


After removing noise, you can apply the skimage filters module to try all thresholds and explore which thresholding methods fare well. Sometimes the histogram of an image’s pixel intensities is not bimodal, and another thresholding method may fare better, such as an adaptive method that thresholds based on local pixel intensities within a kernel shape. It’s good to see what the results of the different thresholding methods are, and skimage.filters.thresholding.try_all_threshold() is handy for that.

Try all thresholding method

result = skimage.filters.thresholding.try_all_threshold(median_filtered)

The simplest thresholding approach uses a manually set threshold for an image. On the other hand, an automated threshold method computes an image’s threshold value more consistently than the human eye, and it can be easily replicated. For our image in this example, it seems like Otsu, Yen, and the Triangle method are performing well. The other results for this case are noticeably worse.

We’ll use the Otsu thresholding to segment our image into a binary image for this article. Otsu calculates thresholds by calculating a value that maximizes inter-class variance (variance between foreground and background) and minimizes intra-class variance (variance within foreground or variance within background).  It does well if there is either a bimodal histogram (with two distinct peaks) or a threshold value that separates classes better.

Otsu thresholding and visualization

threshold = skimage.filters.threshold_otsu(median_filtered)
print('Threshold value is {}'.format(threshold))
predicted = np.uint8(median_filtered > threshold) * 255
plt.imshow(predicted, cmap='gray')
plt.title('otsu predicted binary image')

If the above simple techniques don’t serve the purpose for binary segmentation of the image, then one can use UNet, ResNet with FCN, or various other supervised deep learning techniques to segment the images. To remove small objects due to segmented foreground noise, you may also consider trying skimage.morphology.remove_small_objects().


In any case, we need ground truth manually generated by a human with expertise in the image type, so that we can validate accuracy and other metrics and see how well the image is segmented.

The confusion matrix

We use sklearn.metrics.confusion_matrix() to get the confusion matrix elements as shown below. The scikit-learn confusion matrix function returns the 4 elements of the confusion matrix, given that the input is a list of elements with binary values. For edge cases where everything is one binary value (0) or the other (1), sklearn returns only one element. We wrap the sklearn confusion matrix function and write our own with these edge cases covered, as below:


def get_confusion_matrix_elements(groundtruth_list, predicted_list):
    """returns confusion matrix elements i.e TN, FP, FN, TP as floats

    See example code for helper function definitions
    """
    _assert_valid_lists(groundtruth_list, predicted_list)

    if _all_class_1_predicted_as_class_1(groundtruth_list, predicted_list) is True:
        tn, fp, fn, tp = 0, 0, 0, np.float64(len(groundtruth_list))

    elif _all_class_0_predicted_as_class_0(groundtruth_list, predicted_list) is True:
        tn, fp, fn, tp = np.float64(len(groundtruth_list)), 0, 0, 0

    else:
        tn, fp, fn, tp = sklearn.metrics.confusion_matrix(groundtruth_list, predicted_list).ravel()
        tn, fp, fn, tp = np.float64(tn), np.float64(fp), np.float64(fn), np.float64(tp)

    return tn, fp, fn, tp


Accuracy

Accuracy is a common validation metric in the case of binary classification. It is calculated as:

accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative.


def get_accuracy(groundtruth_list, predicted_list):

    tn, fp, fn, tp = get_confusion_matrix_elements(groundtruth_list, predicted_list)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    return accuracy

It varies between 0 and 1, with 0 being the worst and 1 being the best. If an algorithm detects everything as entirely background or entirely foreground, the accuracy can still be high. Hence we need a metric that considers the imbalance in class count, especially since the current image has far more background pixels (class 0) than foreground pixels (class 1).

F1 score

The F1 score varies from 0 to 1 and is calculated as:

F1 = (2 * TP) / ((2 * TP) + FP + FN)

with 0 being the worst and 1 being the best prediction. Now let’s handle the F1 score calculation, considering edge cases.


def get_f1_score(groundtruth_list, predicted_list):
    """Return f1 score covering edge cases"""

    tn, fp, fn, tp = get_confusion_matrix_elements(groundtruth_list, predicted_list)

    if _all_class_0_predicted_as_class_0(groundtruth_list, predicted_list) is True:
        f1_score = 1
    elif _all_class_1_predicted_as_class_1(groundtruth_list, predicted_list) is True:
        f1_score = 1
    else:
        f1_score = (2 * tp) / ((2 * tp) + fp + fn)

    return f1_score

An F1 score above 0.8 is generally considered good, indicating that the prediction is doing well.


MCC

MCC stands for Matthews Correlation Coefficient, and is calculated as:

MCC = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

It lies between -1 and +1: -1 is a completely inverse correlation between ground truth and prediction, 0 is a random result where some predictions match, and +1 is a perfect match between ground truth and prediction. Unlike accuracy, MCC accounts for all four cells of the confusion matrix, which makes it a better validation metric.

In the MCC calculation, the numerator consists of the four inner cells (the cross product of the elements), while the denominator consists of the four outer cells (the product of the marginal sums) of the confusion matrix. When the denominator is 0, MCC is mathematically undefined, and a strict implementation would flag this by setting it to an undefined value (i.e. numpy.nan). But, for the purpose of getting valid values, and being able to average the MCC over different images if necessary, we set the MCC to -1, the worst possible value within the range. The other edge cases: when all elements are correctly detected as foreground or as background, both MCC and the F1 score are set to 1; when everything is predicted as the wrong class, MCC is set to -1 and the F1 score to 0.

To learn more about MCC and the edge cases, this is a good article. To understand why MCC is better than accuracy or F1 score more in detail, Wikipedia does good work here.
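As a sanity check on the formula, here is a worked example on a toy confusion matrix; the counts are illustrative, not taken from the article’s image:

```python
from math import sqrt

# Toy confusion-matrix counts (illustrative numbers only)
tp, tn, fp, fn = 4, 5, 1, 2

numerator = (tp * tn) - (fp * fn)  # cross product of the inner cells
denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

# The article's fallback: report -1 instead of an undefined value
mcc = numerator / denominator if denominator != 0 else -1.0
```

Here the numerator is 4·5 − 1·2 = 18 and the denominator is √(5·6·6·7) ≈ 35.5, giving an MCC of roughly 0.51, a moderate positive correlation.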


def get_mcc(groundtruth_list, predicted_list):
    """Return mcc covering edge cases"""

    tn, fp, fn, tp = get_confusion_matrix_elements(groundtruth_list, predicted_list)

    if _all_class_0_predicted_as_class_0(groundtruth_list, predicted_list) is True:
        mcc = 1
    elif _all_class_1_predicted_as_class_1(groundtruth_list, predicted_list) is True:
        mcc = 1
    elif _all_class_1_predicted_as_class_0(groundtruth_list, predicted_list) is True:
        mcc = -1
    elif _all_class_0_predicted_as_class_1(groundtruth_list, predicted_list) is True:
        mcc = -1
    elif _mcc_denominator_zero(tn, fp, fn, tp) is True:
        mcc = -1
    else:
        # Finally calculate MCC
        mcc = ((tp * tn) - (fp * fn)) / (
            np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))

    return mcc

Finally, we can compare the validation metrics by result, side-by-side.

>>> validation_metrics = get_validation_metrics(groundtruth, predicted)

{'mcc': 0.8533910225863214, 'f1_score': 0.8493358633776091, 'tp': 5595.0, 'fn': 1863.0, 'fp': 122.0, 'accuracy': 0.9924278259277344, 'tn': 254564.0}

Accuracy is close to 1, as we have a lot of background pixels in our example image that are correctly detected as background (i.e. the true negatives are naturally higher). This shows why accuracy isn’t a good measure for binary classification.

F1 score is 0.84. So, in this case, we probably don’t need a more sophisticated thresholding algorithm for binary segmentation. If all the images in the stack had similar histogram distribution and noise, then we could use Otsu and have satisfactory prediction results.

The MCC of 0.85 is high, also indicating the ground truth and predicted image have a high correlation, clearly seen from the predicted image picture from the previous section.

Now, let’s visualize and see where the confusion matrix elements TP, FP, FN, TN are distributed along the image. It shows us where the threshold is picking up foreground (vessels) when they are not present (FP) and where true vessels are not detected (FN), and vice-versa.

Validation visualization

To visualize confusion matrix elements, we figure out exactly where in the image the confusion matrix elements fall. For example, we find the TP array (i.e. pixels correctly detected as foreground) by taking the logical “and” of the ground truth and the predicted array. Similarly, we use logical boolean operations (commonly called bit blit) to find the FP, FN, and TN arrays.


def get_confusion_matrix_intersection_mats(groundtruth, predicted):
    """Returns dict of 4 boolean numpy arrays with True at TP, FP, FN, TN"""

    confusion_matrix_arrs = {}

    groundtruth_inverse = np.logical_not(groundtruth)
    predicted_inverse = np.logical_not(predicted)

    confusion_matrix_arrs['tp'] = np.logical_and(groundtruth, predicted)
    confusion_matrix_arrs['tn'] = np.logical_and(groundtruth_inverse, predicted_inverse)
    confusion_matrix_arrs['fp'] = np.logical_and(groundtruth_inverse, predicted)
    confusion_matrix_arrs['fn'] = np.logical_and(groundtruth, predicted_inverse)

    return confusion_matrix_arrs

Then, we can map pixels in each of these arrays to different colors. For the figure below we mapped TP, FP, FN, TN to the CMYK (Cyan, Magenta, Yellow, Black) space. One could similarly map them to (Green, Red, Red, Green) colors, which would yield an image where everything in red signifies the incorrect predictions. The CMYK space, however, allows us to distinguish between TP and TN.


def get_confusion_matrix_overlaid_mask(image, groundtruth, predicted, alpha, colors):
    """Returns the 'image' overlaid with a color mask where TP, FP, FN, TN are
    each a color given by the 'colors' dictionary"""
    image = cv2.cvtColor(image, cv2.COLOR_GRAY2RGB)
    masks = get_confusion_matrix_intersection_mats(groundtruth, predicted)
    color_mask = np.zeros_like(image)
    for label, mask in masks.items():
        color = colors[label]
        mask_rgb = np.zeros_like(image)
        mask_rgb[mask != 0] = color
        color_mask += mask_rgb
    return cv2.addWeighted(image, alpha, color_mask, 1 - alpha, 0)

alpha = 0.5
confusion_matrix_colors = {
   'tp': (0, 255, 255),   # cyan
   'fp': (255, 0, 255),   # magenta
   'fn': (255, 255, 0),   # yellow
   'tn': (0, 0, 0)        # black
}
validation_mask = get_confusion_matrix_overlaid_mask(255 - grayscale, groundtruth, predicted, alpha, confusion_matrix_colors)
print('Cyan - TP')
print('Magenta - FP')
print('Yellow - FN')
print('Black - TN')
plt.imshow(validation_mask)
plt.title('confusion matrix overlay mask')

We use opencv here to overlay this color mask onto the original (non-inverted) grayscale image as a transparent layer. This is called Alpha compositing:

Final notes

The last two examples in the repository test the edge cases and a random-prediction scenario on a small array (fewer than 10 elements) by calling the test functions. It is important to test for edge cases and potential issues whether we are writing production-level code or simply testing the logic of an algorithm.

Travis CI is very useful for testing whether your code works with the module versions described in your requirements, and whether all the tests pass as new changes are merged into master. Keeping your code clean, well documented, and with all statements unit tested and covered is a best practice. These habits limit the need to chase down bugs when a complex algorithm is built on top of simple functional pieces that could have been unit tested. Generally, documentation and unit testing help others stay informed about your intentions for a function. Linting helps improve the readability of the code, and flake8 is a good Python package for that.

Here are the important takeaways from this article:

  1. Tiling and stitching approach for data that doesn’t fit in memory
  2. Trying different thresholding techniques
  3. Subtleties of Validation Metrics
  4. Validation visualization
  5. Best Practices

There are many directions you could go from here with your work or projects. Applying the same strategy to different data sets, or automating the validation selection approach, would be excellent places to start. Further, imagine you needed to analyze a database with many of these 10 GB files. How could you automate the process? How could you validate and justify the results to human beings? How does better analysis improve the outcomes of real-world scenarios (like the development of surgical procedures and medicine)? Asking questions like these will allow continued improvements in statistics, data science, and machine learning.

Finally, thanks to Navid Farahani for annotations, Katherine Scott for the guidance, Allen Teplitsky for the motivation, and all of the 3Scan team for the data.

This article originally appeared on (Reprinted with permission)


Creating 3D Printed WiFi Access QR Codes with Python

By Eric J. Ma for


Over the weekend, I embarked on a project to create a 3D printed QR code that guests at our house could scan to gain access to our guest wireless network. Why 3D you might ask? Well, that’s how geeks like myself like to impress their guests! Also, let’s be real, I have a 3D printer at home, and I was looking for a fun way to put it to practical use. It turns out that it makes for some nice wall artwork as well.

In this first blog post, I detail how I generate a QR code using Python, then how to build 3D printable blocks and, finally, how to convert that model into a file that 3D printers can read.

In a follow-up blog post, I will specify how I created a hybrid command line app and Flask app from the same code base, using click and Flask. It will take the code that we write here and turn it into an app that can be used from the command line and from a web interface — it’s a great exercise in showing the similarities between the CLI and Flask. Coming soon!

Why a 3D QR code for my WiFi password?

There are a ton of QR code generators out there on the web and more than a handful of WiFi QR code generators too – so why did I embark on this project? Mainly, it was me wanting to scratch my itch surrounding QR codes. The last time I went to China (Xi’an and Shanghai, specifically), I saw QR codes everywhere. There surely had to be something good we could use this for at home that didn’t involve just packing and storage. Now that I know how simple it is to create a QR code using Python, I’m sure I’ll find myriad uses for them!

Getting Set Up

Ok, let’s get started! To create QR codes, you need to install pyqrcode and pypng in your environment:
pip install pyqrcode
pip install pypng

If you want to do the 3D printing part, you’ll also need to install SolidPython and NumPy:
pip install SolidPython
pip install numpy

Finally, to build a command line app and a web app, you’ll need to install click and Flask:
pip install click
pip install Flask

If you are using the conda package manager, you should know that numpy, click and Flask are conda-installable if you prefer to do so.

I also used Kite in the Atom text editor: this allowed me to view documentation and common usage patterns for the packages I imported.

Step 1: Encoding WiFi credentials in a QR code

Let’s start by creating a QR code for our WiFi guest network.

Let’s say that these are the security credentials for the network:
SSID (a.k.a. Network Name): Family Guest Network
Password: vn8h2sncu093y3nd!
Security Type (one of WPA or WEP): WPA

QR codes are merely two-dimensional barcodes that encode a string that can be parsed by another program. In order to create a QR code that grants access to a WiFi network, we need a string that can be parsed by our devices. This string is structured as follows:

WIFI:S:<SSID>;T:<WPA|WEP>;P:<password>;;

So in our case, we would want a string that looks like:
WIFI:S:Family Guest Network;T:WPA;P:vn8h2sncu093y3nd!;;

Now, we can code up our Python program to encode the QR code for us. I’ll assume you’re running Python 3.6 or later.

import pyqrcode as pq
ssid = 'Family Guest Network'
security = 'WPA'
password = 'vn8h2sncu093y3nd!'
qr = pq.create(f'WIFI:S:{ssid};T:{security};P:{password};;')
print(qr.terminal())

With that block of code, you should get a QR code printed to your terminal, just like that!

Let’s say you wanted to do the simple thing, and just have a regular laser/inkjet printer make a printout of the QR code. To do so, you can save the QR code to disk as a PNG file (the filename here is arbitrary):

qr.png('wifi_qr.png', scale=8)
And just like that, you’ve used Python to create a WiFi QR code! Just scan this code using your phone’s camera and you (or your guests) will be able to join your WiFi network.

Now, if you remember that QR codes are just “ASCII strings encoded in a 2D barcode”, then you’ll know that you can pass any arbitrary string into the pyqrcode.create() function. That means you can come up with any creative use of a short string that would make sense to scan with your phone! For example, you can create business cards with your LinkedIn profile URL embedded in the QR code, use it to encode serial number information on your possessions, or more!
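The WiFi payload itself is just such a string. As a sketch, here is a small helper (a hypothetical name, not part of pyqrcode) that builds it, backslash-escaping the characters \ ; , : " that the de-facto WIFI: format reserves:

```python
# Hypothetical helper that builds the WiFi payload string for pq.create().
def build_wifi_payload(ssid: str, security: str, password: str) -> str:
    def escape(value: str) -> str:
        # Backslash-escape characters that are special in the WIFI: format
        for ch in ('\\', ';', ',', ':', '"'):
            value = value.replace(ch, '\\' + ch)
        return value
    return f'WIFI:S:{escape(ssid)};T:{security};P:{escape(password)};;'

payload = build_wifi_payload('Family Guest Network', 'WPA', 'vn8h2sncu093y3nd!')
```

The resulting payload can be passed straight to pyqrcode.create(), and the escaping keeps an SSID or password containing a semicolon from breaking the format.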

Next, we’ll turn the QR code into a 3D-printable model using our imported Python packages and free online CAD software.

Step 2: 3D Printing a QR Code

For this, we will need a package called SolidPython, and optionally numpy to help us with some array processing (it can also be done entirely using built-in lists if needed).

To start, I defined a convenience function that lets me create and return a QRCode object that can be passed around and manipulated.

def create_wifi_qr(ssid: str, security: str, password: str):
    qr = pq.create(f'WIFI:S:{ssid};T:{security};P:{password};;')
    return qr

Its use will become evident later. You’ll also notice that I’m using type hints inside the function.

Create Text Representation

Using this function, we can create a text representation of the QR code:

qr = create_wifi_qr(ssid, security, password)

Calling the text() method on the returned object essentially gives a series of 1s and 0s. This is a string, though, not a numpy array. Hence, we may have to convert it into a list of lists, or a numpy array (as a user of the scientific Python stack, I prefer using arrays where possible, but in this case there is no real practical advantage to doing so, because we are not doing linear algebra).

Create Array Representation

Let’s now define a function that takes in the QRCode object and returns an array version of the text rendering.

def qr2array(qr):
    arr = []
    for line in qr.text().split('\n'):
        if len(line) != 0:
            arr.append([int(bit) for bit in line])
    return np.vstack(arr)

With that, we can create an array version of our QR code above:

arr = qr2array(qr)

Create 3D Model

Now, we’re ready to play with SolidPython!

SolidPython is a Python package that provides an interface to the OpenSCAD language. OpenSCAD allows a programmer to programmatically define 3D models using the language of geometry. This includes the creation of cubes and other 3D objects, as well as object manipulations, such as translation, coloring, and union-ing.

Take a look at the code below for an example of how we create the 3D object.

from solid import color, cube, scad_render, translate, union

SCALE = 2   # output defaults to 1 mm per unit; this lets us scale objects up proportionally
HEIGHT = 4  # cube height in mm; any reasonable value works here
cubes = [translate([i * SCALE, j * SCALE, 0])(color('black')(cube(size=[SCALE, SCALE, HEIGHT])))
         for i, row in enumerate(arr)
         for j, col in enumerate(row)
         if arr[i, j] == 1]

base_plate = color('white')(cube(size=(arr.shape[0] * SCALE, arr.shape[1] * SCALE, HEIGHT / 2)))
qrobj = union()(*cubes, base_plate)


What we’re doing here is taking the 1s and creating cubes where they are supposed to be, while leaving the 0s empty. Then, we add a “base plate” so that everything stays nice and connected, and finally union all of the cubes with the base plate, so that we get one solid piece once 3D printed.

In short, the flow is:
SolidPython -> OpenSCAD -> STL -> .gcode
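To make the first arrow in that flow concrete, here is a dependency-free sketch that emits OpenSCAD source directly as strings; SolidPython automates exactly this kind of translation (the helper names here are my own, not part of any library):

```python
def cube_at(i: int, j: int, scale: int, height: int) -> str:
    # One black cube translated to its grid position.
    return (f'translate([{i * scale}, {j * scale}, 0]) '
            f'color("black") cube([{scale}, {scale}, {height}]);')

def qr_to_scad(arr, scale: int = 2, height: int = 4) -> str:
    # Emit one cube per 1-bit; OpenSCAD renders all top-level
    # objects together, so no explicit union is needed in this sketch.
    return '\n'.join(cube_at(i, j, scale, height)
                     for i, row in enumerate(arr)
                     for j, bit in enumerate(row) if bit)

print(qr_to_scad([[1, 0], [0, 1]]))
```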

That’s it! You have everything you need to 3D print a QR code of your WiFi credentials.


The key takeaways from this blog post are:

  1. How to create a QR code using Python.
  2. How to create a text representation of the QR code.
  3. How to convert the text representation into an array.
  4. How to create a 3D model of the QR code using the array.

Now that you have a rendered 3D model, you can either 3D print it at home, or send it to a friend to 3D print it for you. You’ll no longer have to give plain text WiFi passwords to your guests – they can just scan the aesthetically-pleasing 3D printed QR code instead!

With this example of how to create an OpenSCAD file from Python using SolidPython, I hope you’ll go forth and make a ton of fun stuff! Please share your experiences in the comment section below.

Kite added the associated code from this blog post to a public Github repository. You can find the source code from this and other posts in their Python series here.

In my next post, I’ll show how I took this code base to create a hybrid command line app and Flask app. Stay tuned for more details and make sure to subscribe to the Kite blog so you don’t miss it!

This article originally appeared on

(Reprinted with permission)


Turbocharging Python with Command Line Tools

By Noah Gift for

Table of Contents

  • Introduction
  • Using The Numba JIT (Just in time Compiler)
  • Using the GPU with CUDA Python
  • Running True Multi-Core Multithreaded Python using Numba
  • KMeans Clustering
  • Summary


It’s as good a time to be writing code as ever – these days, a little bit of code goes a long way. Just a single function is capable of performing incredible things. Thanks to GPUs, Machine Learning, the Cloud, and Python, it’s easy to create “turbocharged” command-line tools. Think of it as upgrading your code from using a basic internal combustion engine to a nuclear reactor. The basic recipe for the upgrade? One function, a sprinkle of powerful logic, and, finally, a decorator to route it to the command-line.

Writing and maintaining traditional GUI applications – web or desktop – is a Sisyphean task at best. It all starts with the best of intentions, but can quickly turn into a soul-crushing, time-consuming ordeal where you end up asking yourself why you thought becoming a programmer was a good idea in the first place. Why did you run that web framework setup utility that essentially automated a 1970s technology – the relational database – into a series of Python files? The old Ford Pinto with the exploding rear gas tank has newer technology than your web framework. There has got to be a better way to make a living.

The answer is simple: stop writing web applications and start writing nuclear powered command-line tools instead. The turbocharged command-line tools that I share below are focused on fast results vis a vis minimal lines of code. They can do things like learn from data (machine learning), make your code run 2,000 times faster, and best of all, generate colored terminal output.

Here are the raw ingredients that will be used to make several solutions:

  • The Click command-line framework
  • The Numba JIT (just-in-time compiler)
  • CUDA Python (via Numba)
  • scikit-learn’s KMeans

You can follow along with source code, examples, and resources in Kite’s github repository.

Using The Numba JIT (Just in time Compiler)

Python has a reputation for slow performance because it’s fundamentally a scripting language. One way to get around this problem is to use the Numba JIT. Here’s what that code looks like:

First, use a timing decorator to get a grasp on the runtime of your functions:

from time import time
from functools import wraps

def timing(f):
    @wraps(f)
    def wrap(*args, **kwargs):
        ts = time()
        result = f(*args, **kwargs)
        te = time()
        print(f'fun: {f.__name__}, args: [{args}, {kwargs}] took: {te-ts} sec')
        return result
    return wrap
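Stacking the decorator above a function definition is all it takes to instrument it. A quick self-contained demonstration (the decorator is repeated here so the snippet runs on its own):

```python
from time import time

def timing(f):
    # Same decorator as above: wrap f and report elapsed wall-clock time.
    def wrap(*args, **kwargs):
        ts = time()
        result = f(*args, **kwargs)
        te = time()
        print(f'fun: {f.__name__} took: {te - ts:.4f} sec')
        return result
    return wrap

@timing
def square_sum(n):
    # A small workload to time.
    return sum(i * i for i in range(n))

print(square_sum(1000))  # prints the timing line, then 332833500
```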

Next, add the numba.jit decorator with the nopython keyword argument set to True. This ensures that the code will be compiled by the JIT instead of run by the regular Python interpreter.

@numba.jit(nopython=True)
def expmean_jit(rea):
    """Perform multiple mean calculations"""

    val = rea.mean() ** 2
    return val

When you run it, you can see both a “jit” as well as a regular version being run via the command-line tool:

$ python jit-test

Running NO JIT
func:'expmean' args:[(array([[1.0000e+00, 4.2080e+05, 4.2350e+05, ..., 1.0543e+06, 1.0485e+06,
       [2.0000e+00, 5.4240e+05, 5.4670e+05, ..., 1.5158e+06, 1.5199e+06,
       [3.0000e+00, 7.0900e+04, 7.1200e+04, ..., 1.1380e+05, 1.1350e+05,
       [1.5277e+04, 9.8900e+04, 9.8100e+04, ..., 2.1980e+05, 2.2000e+05,
       [1.5280e+04, 8.6700e+04, 8.7500e+04, ..., 1.9070e+05, 1.9230e+05,
       [1.5281e+04, 2.5350e+05, 2.5400e+05, ..., 7.8360e+05, 7.7950e+05,
        7.7420e+05]], dtype=float32),), {}] took: 0.0007 sec

$ python jit-test --jit

Running with JIT
func:'expmean_jit' args:[(array([[1.0000e+00, 4.2080e+05, 4.2350e+05, ..., 1.0543e+06, 1.0485e+06,
       [2.0000e+00, 5.4240e+05, 5.4670e+05, ..., 1.5158e+06, 1.5199e+06,
       [3.0000e+00, 7.0900e+04, 7.1200e+04, ..., 1.1380e+05, 1.1350e+05,
       [1.5277e+04, 9.8900e+04, 9.8100e+04, ..., 2.1980e+05, 2.2000e+05,
       [1.5280e+04, 8.6700e+04, 8.7500e+04, ..., 1.9070e+05, 1.9230e+05,
       [1.5281e+04, 2.5350e+05, 2.5400e+05, ..., 7.8360e+05, 7.7950e+05,
        7.7420e+05]], dtype=float32),), {}] took: 0.2180 sec

How does that work? Just a few lines of code allow for this simple toggle:

@click.command()
@click.option('--jit/--no-jit', default=False)
def jit_test(jit):
    rea = real_estate_array()
    if jit:
        click.echo(click.style('Running with JIT', fg='green'))
        expmean_jit(rea)
    else:
        click.echo(click.style('Running NO JIT', fg='red'))
        expmean(rea)

In some cases a JIT version can make code run thousands of times faster, but benchmarking is key. Another item to point out is the line:

click.echo(click.style('Running with JIT', fg='green'))

This produces colored terminal output, which can be very helpful in creating sophisticated tools.
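Under the hood, colored terminal output is just ANSI escape sequences wrapped around the text; a minimal stdlib-only sketch of the kind of string click.style produces (the codes follow the ANSI SGR convention):

```python
# ANSI SGR color codes: 31 = red, 32 = green; '\033[0m' resets styling.
COLORS = {'red': '31', 'green': '32'}

def colorize(text: str, fg: str) -> str:
    # Wrap text in a set-color sequence and a reset sequence.
    return f'\033[{COLORS[fg]}m{text}\033[0m'

print(colorize('Running with JIT', 'green'))
print(colorize('Running NO JIT', 'red'))
```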

Using the GPU with CUDA Python

Another way to nuclear power your code is to run it straight on a GPU. This example requires you to run it on a machine with a CUDA-enabled GPU. Here’s what that code looks like:

import numpy as np
from numba import cuda, vectorize

@vectorize(['float32(float32, float32)'], target='cuda')
def add_ufunc(x, y):
    return x + y

def cuda_operation():
    """Performs vectorized operations on the GPU"""

    x = real_estate_array()
    y = real_estate_array()

    print('Moving calculations to GPU memory')
    x_device = cuda.to_device(x)
    y_device = cuda.to_device(y)
    out_device = cuda.device_array(
        shape=(x_device.shape[0], x_device.shape[1]), dtype=np.float32)

    print('Calculating on GPU')
    add_ufunc(x_device, y_device, out=out_device)

    out_host = out_device.copy_to_host()
    print(f'Calculations from GPU {out_host}')

It’s useful to point out that the numpy arrays are first moved to GPU memory, a vectorized function then does the work on the GPU, and after that work is completed, the data is moved back from the GPU to the host. Depending on what the code is running, using a GPU can yield a monumental improvement. The output from the command-line tool is shown below:

$ python cuda-operation
Moving calculations to GPU memory

(10015, 259)
Calculating on GPU
Calculations from GPU [[2.0000e+00 8.4160e+05 8.4700e+05 ... 2.1086e+06 2.0970e+06 2.0888e+06]
 [4.0000e+00 1.0848e+06 1.0934e+06 ... 3.0316e+06 3.0398e+06 3.0506e+06]
 [6.0000e+00 1.4180e+05 1.4240e+05 ... 2.2760e+05 2.2700e+05 2.2660e+05]
 [3.0554e+04 1.9780e+05 1.9620e+05 ... 4.3960e+05 4.4000e+05 4.4080e+05]
 [3.0560e+04 1.7340e+05 1.7500e+05 ... 3.8140e+05 3.8460e+05 3.8720e+05]
 [3.0562e+04 5.0700e+05 5.0800e+05 ... 1.5672e+06 1.5590e+06 1.5484e+06]]
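Stripped of the device transfers, the work add_ufunc does is an elementwise add across the two arrays. A pure-Python stand-in for just that arithmetic (no GPU or numba required):

```python
def add_elementwise(x, y):
    # Pair up rows from the two matrices and add matching elements,
    # mirroring what the vectorized GPU add computes.
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

print(add_elementwise([[1, 2], [3, 4]], [[10, 20], [30, 40]]))
# [[11, 22], [33, 44]]
```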

Running True Multi-Core Multithreaded Python using Numba

One common performance problem with Python is the lack of true, multi-threaded performance. This also can be fixed with Numba. Here’s an example of some basic operations:

@numba.jit(parallel=True)
def add_sum_threaded(rea):
    """Use all the cores"""

    x, _ = rea.shape
    total = 0
    for _ in numba.prange(x):
        total += rea.sum()
    return total

def add_sum(rea):
    """traditional for loop"""

    x, _ = rea.shape
    total = 0
    for _ in range(x):
        total += rea.sum()
    return total

@click.command()
@click.option('--threads/--no-threads', default=False)
def thread_test(threads):
    rea = real_estate_array()
    if threads:
        click.echo(click.style('Running with multicore threads', fg='green'))
        add_sum_threaded(rea)
    else:
        click.echo(click.style('Running NO THREADS', fg='red'))
        add_sum(rea)

Note that the key difference in the parallel version is that it uses @numba.jit(parallel=True) and numba.prange to spawn threads for iteration. When the parallel version runs, all of the CPUs on the machine are maxed out, but when almost exactly the same code is run without the parallelization, it uses only a single core.

$ python thread-test

$ python thread-test --threads
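Numba’s prange requires Numba itself, but the underlying idea, splitting an index range into chunks that run on separate workers, can be sketched with the standard library alone. Note that for pure-Python arithmetic, CPython’s GIL prevents real thread speedups; Numba gets them by releasing the GIL in compiled code. This sketch only demonstrates the chunking pattern:

```python
from concurrent.futures import ThreadPoolExecutor

def chunked_sum(values, workers=4):
    # Split the data into contiguous chunks, one per worker,
    # and sum each chunk on its own thread.
    n = len(values)
    step = (n + workers - 1) // workers
    chunks = [values[i:i + step] for i in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum, chunks))

print(chunked_sum(list(range(1_000_001))))  # 500000500000
```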

KMeans Clustering

One more powerful thing that can be accomplished in a command-line tool is machine learning. In the example below, a KMeans clustering function is created with just a few lines of code. This clusters a pandas DataFrame into a default of 3 clusters.

def kmeans_cluster_housing(clusters=3):
    """Kmeans cluster a dataframe"""
    url = ''  # (data URL elided in the original)
    val_housing_win_df = pd.read_csv(url)
    numerical_df = (
        val_housing_win_df.loc[:, ['TOTAL_ATTENDANCE_MILLIONS', 'ELO']]
    )  # (further columns elided in the original)
    # scale data
    scaler = MinMaxScaler()
    scaled_df = scaler.fit_transform(numerical_df)
    # cluster data
    k_means = KMeans(n_clusters=clusters)
    kmeans = k_means.fit(scaled_df)
    val_housing_win_df['cluster'] = kmeans.labels_
    return val_housing_win_df
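For intuition about what the KMeans call is doing, here is a hedged, scikit-learn-free sketch of Lloyd’s algorithm on 1-D data; real use should stick with sklearn’s KMeans, which handles initialization and convergence far more carefully:

```python
def lloyd_1d(points, centers, iters=10):
    # Alternate two steps: assign each point to its nearest center,
    # then move each center to the mean of its assigned points.
    for _ in range(iters):
        groups = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            groups[nearest].append(p)
        centers = [sum(g) / len(g) if g else c
                   for c, g in groups.items()]
    return sorted(centers)

# Two obvious clusters, around 2 and 100:
print(lloyd_1d([1, 2, 3, 99, 100, 101], centers=[0.0, 50.0]))
# [2.0, 100.0]
```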

The cluster number can be changed by passing in another number (as shown below) using click:

@click.command()
@click.option('--num', default=3, help='number of clusters')
def cluster(num):
    df = kmeans_cluster_housing(clusters=num)
    click.echo('Clustered DataFrame')
    click.echo(df.head())

Finally, the output of the pandas DataFrame with the cluster assignment is shown below. Note that it now has the cluster assignment as a column.

$ python -W cluster

Clustered DataFrame 0 1 2 3 4
TEAM Chicago Bulls Dallas Mavericks Sacramento Kings Miami Heat Toronto Raptors
GMS 41 41 41 41 41
PCT_ATTENDANCE 104 103 101 100 100
COUNTY Cook Dallas Sacremento Miami-Dade York-County
MEDIAN_HOME_PRICE_COUNTY_MILLIONS 269900.0 314990.0 343950.0 389000.0 390000.0
COUNTY_POPULATION_MILLIONS 5.20 2.57 1.51 2.71 1.10
cluster 0 0 1 0 0

$ python -W cluster --num 2

Clustered DataFrame 0 1 2 3 4
TEAM Chicago Bulls Dallas Mavericks Sacramento Kings Miami Heat Toronto Raptors
GMS 41 41 41 41 41
PCT_ATTENDANCE 104 103 101 100 100
COUNTY Cook Dallas Sacremento Miami-Dade York-County
MEDIAN_HOME_PRICE_COUNTY_MILLIONS 269900.0 314990.0 343950.0 389000.0 390000.0
COUNTY_POPULATION_MILLIONS 5.20 2.57 1.51 2.71 1.10
cluster 1 1 0 1 1


The goal of this article is to show how simple command-line tools can be a great alternative to heavy web frameworks. In under 200 lines of code, you’re now able to create a command-line tool that involves GPU parallelization, JIT, core saturation, as well as Machine Learning. The examples I shared above are just the beginning of upgrading your developer productivity to nuclear power, and I hope you’ll use these programming tools to help build the future.

Many of the most powerful things happening in the software industry are based on functions: distributed computing, machine learning, cloud computing (functions as a service), and GPU-based programming are all great examples. The natural way of controlling these functions is a decorator-based command-line tool – not a clunky 20th-century web framework. The Ford Pinto is now parked in a garage, and you’re driving a shiny new “turbocharged” command-line interface that maps powerful yet simple functions to logic using the Click framework.

Noah Gift is a lecturer and consultant at both the UC Davis Graduate School of Management MSBA program and the Graduate Data Science program, MSDS, at Northwestern. He teaches and designs graduate courses in machine learning, AI, and data science, and consults on machine learning and cloud architecture for students and faculty.

Noah’s new book, Pragmatic AI, will help you solve real-world problems with contemporary machine learning, artificial intelligence, and cloud computing tools. Noah Gift demystifies all the concepts and tools you need to get results—even if you don’t have a strong background in math or data science. Save 30% with the code, “KITE”.

About the Author:

This article originally appeared on

(Reprinted with permission)


Thoughts on Security

By Adam Smith for

Last week we launched Kite, a copilot for programmers. We’ve been excited about the Kite vision since 2014—we’re blown away by how many of you are excited about it too!

The response far exceeded our expectations. We had over a thousand upvotes on Hacker News; we were in the all-time top 1% of launches on Product Hunt; and we had over two thousand tweets about Kite, not counting retweets. We couldn’t be more grateful to those who believed in the vision and took the time to share Kite with a friend or join the discussion online.

That said, we have a lot of work to do. Kite is the first product of its kind, which means we’re pioneering some new terrain. We signed up for this, and are committed to getting it right.

Why Cloud? Garmin versus Waze.

The first question is: why keep the copilot logic in the cloud, instead of locally as part of the Kite install? The short answer is we can build a better experience if Kite is a cloud service.

The full answer is a long list of things that are better about cloud services. Editors today are Garmin GPS, and Kite is Waze. Some folks still use Garmin GPS due to privacy concerns, but most of the world uses internet-connected navigation for its many advantages: fresher maps, more coverage, better tuned navigation algorithms, better user experience because iteration is 10x cheaper, etc.

The same patterns apply to Kite. I’d like to give three quick examples, and then talk about the larger strategy.

  1. Data by the Terabyte. Kite uses lots of data to power the copilot experience. We index public documentation, maintain maps of the Python world (e.g. scipy.array is an alias for numpy.array), and surface patterns extracted from all of Github. We keep all of this in RAM, so you don’t have to. We run servers with 32 GB of RAM; while some of you may have that kind of rig (we’re jealous!), the typical Macbook Pro doesn’t. This data set will grow as we add support for more programming languages and more functionality. With a cloud-based architecture you don’t need to preselect which languages you’ll use, or sacrifice gigabytes of memory on your machine.
  2. Machine Learning. Kite is powered by a number of statistical models, and we’re adding more over time. For example, Kite’s search and “Did you mean” features both use machine learning. Of course we could ship these to your local client, but our models will get smarter over time if we know which result you clicked on (like Google Search) and whether you accepted a suggested change to your code (like Google Spellcheck).
  3. Rapid ship cycles. We ship multiple times per week. This means our bugs get fixed faster, data is fresher, and you get the newest features as soon as possible.

The cloud and its resulting feedback loops lead to better products, faster. We’ve seen the same evolution across a number of verticals. A few examples:

  • Outlook → Gmail
  • Colocation → AWS
  • Network File Share → Dropbox
  • MS Office → Google Docs

In each of these cases, security had to be addressed. At first it wasn’t clear the world would make the jump. It didn’t happen all at once, and there are still people using the legacy technologies. This evolution takes time, and overall is very healthy.

So what does Kite need to do as a company excited about the possibilities of cloud-connected programming?

Security: Core Principles

Let’s talk about the security concerns that naturally arise from a cloud-powered programming copilot. As software developers ourselves, we’ve naturally had security on our minds since the beginning. Frankly, many of us here at Kite would have left similar comments on the HN thread :). Many of you are rightfully concerned about security as well, so let’s jump in.

Our approach to security begins with a few core principles:

  1. Security is a journey, not a destination. We will never be done giving you the tools you need to control your data. We will also never be done earning your trust.
  2. Control. You should control what data gets sent to Kite’s backend and whether you want us to store it for your later use. We should offer as much control as we can.
  3. Transparency. You should understand what is happening. We need to communicate this repeatedly, and clearly.
  4. We’re building the future together with you. We don’t presume to have all of the answers. We want to work with all of you to find the best solutions.

We are committed to these principles. We want you and your employer to be excited about using Kite, and we think these principles are a good first step.

Let’s look at some examples of how we’ll put these principles into action.

You should be able to control

  • Which directories and files, if any, are indexed by Kite,
  • If Kite should remember your history of code changes,
  • If Kite should help with terminal commands,
  • If Kite should remember terminal commands you’ve previously written,
  • If Kite should remember the output of past terminal commands,
  • …and you should be able to easily turn these switches on and off.
  • If you change a setting, we should ask if you’d like to delete historical data, as applicable.

You should always be able to see

  • What files Kite has indexed (and permanently remove them as needed),
  • What terminal commands, or file edits, are being remembered by Kite (and permanently remove them as needed),
  • …and Kite should check in periodically to verify that your security settings match your preferences.

These are the first levels of control and transparency, which are based on files, directories, and the type of information (terminal versus editor).

Secrets, like passwords or keys, are a category of content that deserves special attention. We don’t want secrets on our servers, and we will be developing multiple mechanisms (automated and manual) to make sure they stay off our servers. We don’t have specifics to announce yet, but we believe we will set industry standards that will be adopted across multiple categories of tools, such as continuous integration and code review systems.

We know a lot of folks are also interested in on-premises deployment. We understand the use case and want to support it. We worry that it would delay a lot of seriously awesome stuff we have on the roadmap, e.g. support for JavaScript, so we are thinking through how to fit it in. It is something we want to facilitate, particularly in the long run.

An Example

Since last week’s launch we have begun adding some of these principles into the product. I’d like to show you one feature we shipped yesterday. It’s called the lockout screen.

Kite’s Security panel asks users to whitelist the directories that Kite should be turned on for. Code living outside of this whitelist never gets read by Kite. So what should the sidebar show if you open a Python file outside of the whitelist? As of yesterday’s addition, you’ll see something like this:

This interaction embodies the principles of transparency and control. It communicates what is happening, why, and gives you a one-click control mechanism to change what’s happening, if you so choose.

The Future Ahead

We are committed to incorporating the principles of control and transparency into the foundations of Kite. We will write more about security on our blog as we design and implement these features.

That said, we realize that everyone has different needs. We can’t promise that the options and functionality we choose on day 1 will be perfect for everyone, but we’re working day and night to expand the circle as widely as possible. We’ll do this tirelessly over the long term.

We’d love to hear your thoughts along the way. It’s only been a week, but all of you have been incredibly helpful as we learn how to get this right. As always, we encourage you to talk with us on Twitter at @kitehq.

Nothing makes us happier than knowing so many of you are equally excited about the Kite vision. The future of programming is awesome. Let’s build it together!

P.S. We are hiring! We are looking for frontend web devs, generalist systems engineers, programming language devs, and mac/windows/linux developers. You can reach us at [email protected].

About the Author:

This article originally appeared on

(Reprinted with permission)