Table of Contents


1 Introduction

2 Built-in Functions

2.1 Function

2.1.1 print

  1. output formatting:

    To use formatted string literals, begin a string with f or F before the opening quotation mark or triple quotation mark. Inside this string, you can write a Python expression between { and } characters that can refer to variables or literal values.

    >>> year = 2016
    >>> event = 'Referendum'
    >>> f'Results of the {year} {event}'
    'Results of the 2016 Referendum'
    
    
    • The str.format() method
    >>> yes_votes = 42_572_654
    >>> no_votes = 43_132_495
    >>> percentage = yes_votes / (yes_votes + no_votes)
    >>> '{:-9} YES votes  {:2.2%}'.format(yes_votes, percentage)
    ' 42572654 YES votes  49.67%'
    
    
    • Formatted String Literals
    >>> table = {'Sjoerd': 4127, 'Jack': 4098, 'Dcab': 7678}
    >>> for name, phone in table.items():
    ...     print(f'{name:10} ==> {phone:10d}')
    ...
    Sjoerd     ==>       4127
    Jack       ==>       4098
    Dcab       ==>       7678
    
    >>> animals = 'eels'
    >>> print(f'My hovercraft is full of {animals}.')
    My hovercraft is full of eels.
    >>> print(f'My hovercraft is full of {animals!r}.')
    My hovercraft is full of 'eels'.
    
    
    
    • String format() method:
    >>> print('We are the {} who say "{}!"'.format('knights', 'Ni'))
    We are the knights who say "Ni!"
    
    >>> print('{0} and {1}'.format('spam', 'eggs'))
    spam and eggs
    >>> print('{1} and {0}'.format('spam', 'eggs'))
    eggs and spam
    
    >>> print('This {food} is {adjective}.'.format(
    ...       food='spam', adjective='absolutely horrible'))
    This spam is absolutely horrible.
    
    >>> print('The story of {0}, {1}, and {other}.'.format('Bill', 'Manfred',
                                                           other='Georg'))
    The story of Bill, Manfred, and Georg.
    
    >>> table = {'Sjoerd': 4127, 'Jack': 4098, 'Dcab': 8637678}
    >>> print('Jack: {0[Jack]:d}; Sjoerd: {0[Sjoerd]:d}; '
    ...       'Dcab: {0[Dcab]:d}'.format(table))
    Jack: 4098; Sjoerd: 4127; Dcab: 8637678
    
    >>> table = {'Sjoerd': 4127, 'Jack': 4098, 'Dcab': 8637678}
    >>> print('Jack: {Jack:d}; Sjoerd: {Sjoerd:d}; Dcab: {Dcab:d}'.format(**table))
    Jack: 4098; Sjoerd: 4127; Dcab: 8637678
    
    
    • print to a file:
    # python 3.x syntax:
    print(args, file=f1)
    # python 2.x syntax:
    print >> f1, args
    
    
    • pythonic conditional return values:
    def foo(value):
        return 'a' if value is True else 'b'
    
    • function parameters
      • passing keyword arguments like boo(a=1, b=2) does not change the caller's variables; the order of positional parameters is fixed and cannot be changed.
    • if the input argument is immutable, rebinding the parameter inside the function does not change the original value.

    if the input is mutable, operating on it in place (e.g. append) will change the input argument (see the sketch after the swap example below).

    • a, b = b, a + b  # equivalent to:
    t = (b, a + b)  # t is a tuple
    a = t[0]
    b = t[1]
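
    A minimal sketch of the rebinding vs. mutation distinction above:

    def reassign(x):
        x = 99           # rebinds the local name only; caller unaffected

    def mutate(lst):
        lst.append(99)   # mutates the shared object; caller sees the change

    n = 1
    reassign(n)
    print(n)       # 1

    items = [1, 2]
    mutate(items)
    print(items)   # [1, 2, 99]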
    

2.1.2 normal arguments, *args, **kwargs

*args and **kwargs allow you to pass a variable number of arguments to a function.

  • *args:
def test_var_args(f_arg, *argv):
    print("first normal arg:", f_arg)
    for arg in argv:
        print("another arg through *argv:", arg)

test_var_args('yasoob', 'python', 'eggs', 'test')

  • **kwargs:
>>> def test_args_kwargs(arg1, arg2, arg3):
...     print("arg1:", arg1)
...     print("arg2:", arg2)
...     print("arg3:", arg3)
...
>>> kwargs = {"arg3": 3, "arg2": "two", "arg1": 5}
>>> test_args_kwargs(**kwargs)
arg1: 5
arg2: two
arg3: 3

2.2 troubleshooting

  • linux python FileNotFoundError: [Errno 2] No such file or directory:

try to use an absolute path instead of a relative path to read the file.

  • HDF5

pip install tables

2.3 Decorator

A decorator is a way to dynamically add new behavior to objects. We achieve this in Python by using closures.

In the following example, we create a simple decorator that prints a statement before and after the execution of a function.

>>> def my_decorator(func):
...     def wrapper(*args, **kwargs):
...         print("Before call")
...         result = func(*args, **kwargs)
...         print("After call")
...         return result
...     return wrapper
...
>>> @my_decorator
... def add(a, b):
...     "Our add function"
...     return a + b
...
>>> add(1, 3)
Before call
After call
4

Common examples for decorators are classmethod() and staticmethod().

2.3.1 classmethod(function)

Return a class method for function.

A class method receives the class as an implicit first argument, just like an instance method receives the instance. To declare a class method, use this idiom:

class C(object):
    @classmethod
    def f(cls, arg1, arg2, ...): ...

The @classmethod form is a function decorator – see the description of function definitions in Function definitions for details.

It can be called either on the class (such as C.f()) or on an instance (such as C().f()). The instance is ignored except for its class. If a class method is called for a derived class, the derived class object is passed as the implied first argument.

Class methods are different than C++ or Java static methods. If you want those, see staticmethod() in this section.

For more information on class methods, consult the documentation on the standard type hierarchy in The standard type hierarchy.
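
For example (a minimal sketch, not from the docs), a common use of a classmethod is an alternate constructor:

class Date(object):
    def __init__(self, year, month, day):
        self.year, self.month, self.day = year, month, day

    @classmethod
    def from_string(cls, s):
        # alternate constructor: parse 'YYYY-MM-DD'
        year, month, day = map(int, s.split('-'))
        return cls(year, month, day)

d = Date.from_string('2016-06-23')   # called on the class itself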

2.3.2 staticmethod(function)

Return a static method for function.

A static method does not receive an implicit first argument. To declare a static method, use this idiom:

class C(object):
    @staticmethod
    def f(arg1, arg2, ...): ...

The @staticmethod form is a function decorator – see the description of function definitions in Function definitions for details.

It can be called either on the class (such as C.f()) or on an instance (such as C().f()). The instance is ignored except for its class.

Static methods in Python are similar to those found in Java or C++. Also see classmethod() for a variant that is useful for creating alternate class constructors.

For more information on static methods, consult the documentation on the standard type hierarchy in The standard type hierarchy.
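
A minimal sketch of the difference: a static method gets no implicit argument at all, so it behaves like a plain function namespaced in the class:

class MathUtil(object):
    @staticmethod
    def add(a, b):
        # no implicit cls/self parameter
        return a + b

print(MathUtil.add(1, 2))    # called on the class
print(MathUtil().add(1, 2))  # or on an instance; the instance is ignored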

2.4 Closures

Closures are functions, returned by another function, that capture variables from the enclosing function's scope. We use closures to remove code duplication. In the following example we create a simple closure for adding numbers.

>>> def add_number(num):
...     def adder(number):
...         'adder is a closure'
...         return num + number
...     return adder
...
>>> a_10 = add_number(10)
>>> a_10(21)
31
>>> a_10(34)
44
>>> a_5 = add_number(5)
>>> a_5(3)
8

2.5 iterable

An object capable of returning its members one at a time. Examples of iterables include all sequence types (such as list, str, and tuple) and some non-sequence types like dict and file and objects of any classes you define with an __iter__() or __getitem__() method. Iterables can be used in a for loop and in many other places where a sequence is needed (zip(), map(), …). When an iterable object is passed as an argument to the built-in function iter(), it returns an iterator for the object. This iterator is good for one pass over the set of values. When using iterables, it is usually not necessary to call iter() or deal with iterator objects yourself. The for statement does that automatically for you, creating a temporary unnamed variable to hold the iterator for the duration of the loop. See also iterator, sequence, and generator.

  • check if an object is iterable
>>> from collections.abc import Iterable  # plain `collections` in Python 2
>>> l = [1, 2, 3, 4]
>>> isinstance(l, Iterable)
True

2.6 iterator

An object representing a stream of data. Repeated calls to the iterator’s __next__() method (or passing it to the built-in next() function) return successive items in the stream. When no more data are available a StopIteration exception is raised instead. At this point, the iterator object is exhausted and any further calls to its __next__() method just raise StopIteration again. Iterators are required to have an __iter__() method that returns the iterator object itself, so every iterator is also iterable and may be used in most places where other iterables are accepted. One notable exception is code which attempts multiple iteration passes. A container object (such as a list) produces a fresh new iterator each time you pass it to the iter() function or use it in a for loop. Attempting this with an iterator will just return the same exhausted iterator object used in the previous iteration pass, making it appear like an empty container.
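
For example, a container yields a fresh iterator each time, while an iterator exhausts:

>>> nums = [1, 2, 3]
>>> it = iter(nums)
>>> list(it)
[1, 2, 3]
>>> list(it)          # the same iterator is now exhausted
[]
>>> list(iter(nums))  # a fresh iterator from the container
[1, 2, 3]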

2.7 generator

A function which returns an iterator. It looks like a normal function except that it contains yield statements for producing a series of values usable in a for-loop or that can be retrieved one at a time with the next() function. Each yield temporarily suspends processing, remembering the location and execution state (including local variables and pending try-statements). When the generator resumes, it picks up where it left off (in contrast to functions which start fresh on every invocation).
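
A minimal example of such a function:

>>> def countdown(n):
...     while n > 0:
...         yield n   # suspends here; resumes on the next call
...         n -= 1
...
>>> for i in countdown(3):
...     print(i)
...
3
2
1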

2.8 generator expression

An expression that returns an iterator. It looks like a normal expression followed by a for expression defining a loop variable, range, and an optional if expression. The combined expression generates values for an enclosing function:

>>> sum(i*i for i in range(10))         # sum of squares 0, 1, 4, ... 81
285

3 Built-in Types

3.1 Truth Value Testing

3.2 Boolean Operations — and, or, not

  • The ^ symbol
    • The ^ symbol is for the bitwise ‘xor’ operation, but in Python, the exponent operator symbol is **.
  • the minimum of nan and infinity depends on argument order: every comparison with NaN is False, so the built-in min() keeps its first argument (see the check below); use np.nanmin() to ignore NaN.
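
A quick check (builtin min() keeps the first element when every comparison is False):

import numpy as np

print(min(np.nan, np.inf))          # nan: inf < nan is False
print(min(np.inf, np.nan))          # inf: nan < inf is False
print(np.nanmin([np.nan, np.inf]))  # inf: nanmin ignores NaN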

  • eval

evaluate a Python expression held in a string.

text = "{'a':1}"
eval(text)  # turns text from a string object into a dictionary object
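
When the string comes from an untrusted source, a safer sketch is ast.literal_eval, which only accepts Python literals:

import ast

text = "{'a': 1}"
d = ast.literal_eval(text)  # no arbitrary code execution
print(d['a'])               # 1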

3.3 Comparisons

3.4 Numeric Types — int, float, complex

  • ValueError: invalid literal for int() with base 10:
>>> int('5')
5
>>> float('5.0')
5.0
>>> float('5')
5.0
>>> int(5.0)
5
>>> float(5)
5.0
>>> int('5.0')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '5.0'
>>> int(float('5.0'))
5

3.5 Iterator Types

  • xrange vs range:

There is no xrange in Python 3; range() behaves like Python 2's xrange().

  • Finding the index of an item given a list:
>>> ["foo", "bar", "baz"].index("bar")
1
  • access index and value looping a list:
for idx, val in enumerate(my_list):
    print(idx, val)

  • iterable vs iterator vs generator:

The difference between iterables and generators: once you’ve burned through a generator once, you’re done, no more data.

generator = (word + '!' for word in 'baby let me iterate ya'.split())
# The generator object is now created, ready to be iterated over.
# No exclamation marks added yet at this point.

for val in generator: # real processing happens here, during iteration
    print(val, end=' ')
baby! let! me! iterate! ya!

for val in generator:
    print(val, end=' ')
# Nothing printed! No more data, generator stream already exhausted above.

an iterable creates a new iterator every time it’s looped over (technically, every time iterable.__iter__() is called, such as when Python hits a “for” loop):

class BeyonceIterable(object):
    def __iter__(self):
        """
        The iterable interface: return an iterator from __iter__().

        Every generator is an iterator implicitly (but not vice versa!),
        so implementing `__iter__` as a generator is the easiest way
        to create streamed iterables.

        """
        for word in 'baby let me iterate ya'.split():
            yield word + '!'  # uses yield => __iter__ is a generator

iterable = BeyonceIterable()

for val in iterable:  # iterator created here
    print(val, end=' ')
baby! let! me! iterate! ya!

for val in iterable:  # another iterator created here
    print(val, end=' ')
baby! let! me! iterate! ya!
  • magic method iter:

Iterators are everywhere in Python. They are elegantly implemented within for loops, comprehensions, generators etc. but hidden in plain sight.

An iterator in Python is simply an object that can be iterated upon: an object which returns data one element at a time.

Technically speaking, Python iterator object must implement two special methods, __iter__() and __next__(), collectively called the iterator protocol.

An object is called iterable if we can get an iterator from it. Most built-in containers in Python, like list, tuple, and string, are iterables.

The iter() function (which in turn calls the __iter__() method) returns an iterator from them.

  • Iterating Through an Iterator in Python

use the next() function to manually iterate through all the items of an iterator.

# define a list
my_list = [4, 7, 0, 3]

# get an iterator using iter()
my_iter = iter(my_list)

## iterate through it using next()

#prints 4
print(next(my_iter))

#prints 7
print(next(my_iter))

## next(obj) is same as obj.__next__()

#prints 0
print(my_iter.__next__())

#prints 3
print(my_iter.__next__())

## This will raise StopIteration, no items left
next(my_iter)

A more elegant way of automatically iterating is by using the for loop. Using this, we can iterate over any object that can return an iterator, for example list, string, file etc.

for element in my_list:
    print(element)
  • How the for loop actually works:
for element in iterable:
    # do something with element
# Is actually implemented as.

# create an iterator object from that iterable
iter_obj = iter(iterable)

# infinite loop
while True:
    try:
        # get the next item
        element = next(iter_obj)
        # do something with element
    except StopIteration:
        # if StopIteration is raised, break from loop
        break

  • example:
class PowTwo:
    """Class to implement an iterator
    of powers of two"""

    def __init__(self, max = 0):
        self.max = max

    def __iter__(self):
        self.n = 0
        return self

    def __next__(self):
        if self.n <= self.max:
            result = 2 ** self.n
            self.n += 1
            return result
        else:
            raise StopIteration

# create an iterator and iterate through it as follows.

>>> a = PowTwo(4)
>>> i = iter(a)
>>> next(i)
1
>>> next(i)
2
>>> next(i)
4
>>> next(i)
8
>>> next(i)
16
>>> next(i)
Traceback (most recent call last):
...
StopIteration

# use a for loop to iterate over our iterator class.

>>> for i in PowTwo(5):
...     print(i)
...
1
2
4
8
16
32

3.6 Sequence Types — list, tuple, range

  • create code lists A00–A99 and B00–B99:
import itertools
infectious_big_list = [
    ['A{}'.format(i) for i in range(10, 100)],
    ['A0{}'.format(i) for i in range(0, 10)],
    ['B{}'.format(i) for i in range(10, 100)],
    ['B0{}'.format(i) for i in range(0, 10)],
]
infectious_disease_cd = list(itertools.chain.from_iterable(infectious_big_list))
mapping_illustrated_disease = {x: True for x in infectious_disease_cd}
  • multiply a list with another list:
def multiply_strings(x, col=('med_code', 'med_usage')):
    # print(type(x))
    ls_x = x[col[0]].split(',')
    ls_y = x[col[1]].split(',')
    # assert len(ls_x) == len(ls_y)
    if len(ls_x) != len(ls_y):
        return x[col[0]]
    ls_new = []
    for i in range(len(ls_x)):
        # remove stop words
        if int(ls_y[i]) < 50:
            ls_new.append((ls_x[i]+',')*int(ls_y[i]))
        else:
            ls_new.append((ls_x[i]+','))
    return ''.join(ls_new)
df.apply(multiply_strings, axis=1)
  • find all occurrences of a substring
import re
[m.start() for m in re.finditer('test', 'test test test test')]

  • find position of sub list in a list
greeting = ['hello','my','name','is','bob','how','are','you','my','name','is']

def find_sub_list(sl,l):
    sll=len(sl)
    for ind in (i for i,e in enumerate(l) if e==sl[0]):
        if l[ind:ind+sll]==sl:
            return ind,ind+sll-1

print(find_sub_list(['my','name','is'], greeting))
  • is list equal:
a = [1,2,3]
b = [3,2,1]
a.sort()
b.sort()
a == b
  • Unique value from list of lists:
testdata = [list(x) for x in set(tuple(x) for x in testdata)]

  • nested list comprehension:
[x+y for x in [1,2,3] for y in [4,5,6]]
# equal to
res =[]
for x in [1,2,3]:
    for y in [4,5,6]:
        res.append(x+y)
[y+1 for x in [[1,2],[2,2],[3,2]] for y in x]

# equal to
res =[]
for x in [[1,2],[2,2],[3,2]]:
    for y in x:
        res.append(1+y)
  • remove a value from a list:
my_list.remove(value)
  • tuple vs set

A set is a slightly different concept from a list or a tuple. A set, in Python, is just like the mathematical set: it does not hold duplicate values, and is unordered. However, unlike a tuple, it is not immutable.
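
For example:

t = (1, 2, 2, 3)   # tuple: ordered, keeps duplicates, immutable
s = {1, 2, 2, 3}   # set: unordered, duplicates dropped -> {1, 2, 3}
s.add(4)           # sets are mutable
# t[0] = 9         # TypeError: 'tuple' object does not support item assignment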

  • combine a list of lists into one list (join list of lists):
import itertools
a = [["a", "b"], ["c"]]
print(list(itertools.chain.from_iterable(a)))
# or, with a nested comprehension
merged = [item for sub in a for item in sub]
  • find duplicated items in a list:
a = [1,2,3,2,1,5,6,5,5,5]
import collections
print([item for item, count in collections.Counter(a).items() if count > 1])
  • list comprehension (列表生成式):
x = ['Hello', 'World', 18, 'Apple', None]
[a.lower() for a in x if isinstance(a, str)]
  • read file to a list:
with open(r'y:\codes\data\smart_beta_etf_list.txt', 'rb') as f:
    etf_list = f.readlines()
# you may also want to remove whitespace characters like `\n` at the end of each line
etf_list = [x.strip() for x in etf_list]
  • save a list to a file (pickle):
import pickle

with open('outfile', 'wb') as fp:
    pickle.dump(itemlist, fp)
# To read it back:

with open ('outfile', 'rb') as fp:
    itemlist = pickle.load(fp)
  • save a list of Chinese strings to a file:
values = [u'股市']
import codecs
with codecs.open("file.txt", "w", encoding="utf-8") as d:
    d.write(str(values))
  • write each item of a list as a line in a text file:
thefile = open('test.txt', 'w')
for item in itemlist:
    thefile.write("%s\n" % item)
  • replace commas with newlines:

in the editor's extended search mode, replace ',' with '\r\n'

  • split a string at the last space:
text.rsplit(' ', 1)[0]
  • split a string at the first space:
text.split(' ', 1)[0]
>>> a = 'alvy.test.txt'
>>> a.split('.', 1)
['alvy', 'test.txt']
# with maxsplit=1 the string is split at the first '.', giving a two-element list
>>> a.rsplit('.', 1)
['alvy.test', 'txt']
# rsplit splits at the first '.' from the right, giving a two-element list
  • split by comma:
string.split(",")
  • split by multiple delimiter:
import re
re.split(r'; |, |\*|\n', s)

3.7 generator (生成器)

With a list comprehension we can create a list directly, but memory is limited, so list capacity is bounded. Creating a list with a million elements not only takes a lot of storage; if we only ever access the first few elements, the space occupied by the rest is wasted. There are many ways to create a generator. The simplest is to change a list comprehension's [] to (), which creates a generator. To print the values one at a time, call next(g) to get the generator's next value. The hardest part to understand is that a generator's execution flow differs from a function's: a function runs sequentially and returns when it hits a return statement or its last line, while a generator function runs each time next() is called, returns at a yield statement, and on the next call resumes from the yield where it last returned.
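
For example:

>>> g = (x * x for x in range(5))   # change [] to () to get a generator
>>> next(g)
0
>>> next(g)
1
>>> list(g)   # the remaining values
[4, 9, 16]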

def odd():
    print('step 1')
    yield 1
    print('step 2')
    yield(3)
    print('step 3')
    yield(5)
>>> o = odd()
>>> next(o)
step 1
1
>>> next(o)
step 2
3
>>> next(o)
step 3
5
>>> next(o)
Traceback (most recent call last):

  File "<stdin>", line 1, in <module>
StopIteration

A Generator is an Iterator

A function with yield in it is still a function, that, when called, returns an instance of a generator object:

def a_function():
    "when called, returns generator object"
    yield

A generator expression also returns a generator:

a_generator = (i for i in range(0))

A Generator is an Iterator

An Iterator is an Iterable

Iterators require a __next__ (Python 3) or next (Python 2) method

3.7.1 loop

  • loop with batches:
last_span = 0  # current search position within `category`
for i in tqdm(range(0, len(category), batch_size)):
    re_batch = {}
    for j in range(batch_size):
        re_batch[j] = wiki_category_re.search(category, last_span)
        if re_batch[j] is not None:
            last_span = re_batch[j].span()[1]
    upload_cat_node(re_batch)
  • don't care about the loop variable:
for _ in range(10):
    print(_)
  • fetch several pairs from a dictionary:
from itertools import islice

def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

n_items = take(3, dict_df_ret.items())
n_items

# or
list(islice(dictionary.items(), 3))
  • iterate key and value in a dictionary:
# python 2
for key, value in d.iteritems():
    print key, value
# python 3
for key, value in d.items():
    print(key, value)
  • iterate keys in a dictionary:
for k in d:
  • iterate a row in pandas dataframe:
DataFrame.iterrows() returns a generator yielding (index, Series) pairs:
>>> df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])
>>> row = next(df.iterrows())[1]
>>> row
int      1.0
float    1.5
Name: 0, dtype: float64
>>> print(row['int'].dtype)
float64
>>> print(df['int'].dtype)
int64
  • To preserve dtypes while iterating over the rows, it is better to use itertuples(),
    • which returns namedtuples of the values and is generally faster than iterrows().
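
Continuing the df above, a quick check:

>>> next(df.itertuples())
Pandas(Index=0, int=1, float=1.5)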

3.8 Text Sequence Type — str

3.9 Binary Sequence Types — bytes, bytearray, memoryview

  • convert bytes into string:
b"abcde".decode("utf-8")

3.10 Set Types — set, frozenset

  • access an element in a set:
a = set([1, 2, 3])
element = a.pop()  # set.pop() takes no arguments; removes and returns an arbitrary element

list(a)[0]
from random import sample

def ForLoop(s):
    for e in s:
        break
    return e

def IterNext(s):
    return next(iter(s))

def ListIndex(s):
    return list(s)[0]

def PopAdd(s):
    e = s.pop()
    s.add(e)
    return e

def RandomSample(s):
    return sample(s, 1)

def SetUnpacking(s):
    e, *_ = s
    return e

from simple_benchmark import benchmark

b = benchmark([ForLoop, IterNext, ListIndex, PopAdd, RandomSample, SetUnpacking],
              {2**i: set(range(2**i)) for i in range(1, 20)},
              argument_name='set size')

b.plot()

3.11 Mapping Types — dict

  • merge two dictionaries:
dict1.update(dict2)
  • dump a dictionary into a pickle file:
import pickle
with open('./data/disease.pkl', 'wb') as f:
    pickle.dump(dict_intravenous_thrombolysis, f)
  • dump dictionary into json:
import json
with open('multiple_paths.json', 'w', encoding='utf-8') as fp:
    js_obj = json.dumps(filtered_dict)
    fp.write(js_obj)
  • get key from value:
for name, age in word2id.items():    # use .iteritems() in Python 2
    if age == 16116:
        print(name)

# or
mydict = {'george':16,'amber':19}
print(list(mydict.keys())[list(mydict.values()).index(16)]) # Prints george

print(list(word2id.keys())[list(word2id.values()).index(16116)])  # prints the key whose value is 16116
  • get some keys value according to a list in a dictionary:
value = {}
for key in finance_vocab:
    value[key] = dict_vocab.get(key)

  • filter dictionary by value:
filtered_dict = {k: v for k, v in d.items() if v < 0}
  • set all values in a dict:
visited = dict.fromkeys(self.graph, False)
  • check if a value is in a dict:
'红鲱鱼招股书' in g.graph.values()
  • check if a value is in a defaultdict collection list:
any('波动性' in v for v in g.graph.values())
# or
def in_values(s, d):
    """Does `s` appear in any of the values in `d`?"""
    for v in d.values():
        if s in v:
            return True
    return False

in_values('cow', animals)
  • sort a dict by its value:
s = [(k, d[k]) for k in sorted(d, key=d.get, reverse=True)]
  • count the number of values under each key of a dict:
d = {}
for name in g.graph.keys():
    d[name] = len(g.graph[name])
  • convert a list of tuples with the same key into a dictionary:
from collections import defaultdict
d = defaultdict(list)
for k, v in list(graph.out_edges('财务管理')):
    d[k].append(v)
# or
d = {}
for k, v in list(graph.out_edges('财务管理')):
    d.setdefault(k,[]).append(v)
  • copy a dictionary and transform its values (e.g. multiply by 2):
from copy import copy
my_dict = copy(another_dict)
my_dict.update((x, y*2) for x, y in my_dict.items())
  • sum the values in a dictionary:
In [231]: d
Out[231]:
defaultdict(list,
            {'上海证券交易所上市公司': 383,
             '各证券交易所上市公司': 37,
             '深圳证券交易所上市公司': 511,
             '证券': 64,
             '证券交易所': 8})

sum(d.values())
  • write defaultdict to a json file:
import json
# writing
json.dump(yourdict, open(filename, 'w'))
# reading
yourdict = json.load(open(filename))

3.12 Context Manager Types

3.13 Other Built-in Types

3.14 Special Attributes – magic method:

  • __getitem__ in a class allows its instances to use the [] (indexer) operator.
  • __setitem__ is called to implement assignment to self[key].
  • the __call__ magic method in a class causes its instances to become callables – in other words, those instances now behave like functions.
  • __getattr__ overrides Python's default mechanism for member access.
  • the __getattr__ magic method only gets invoked for attributes that are not in the instance's __dict__. Implementing __getattr__ causes the hasattr built-in function to always return True, unless an exception is raised from within __getattr__.
  • __setattr__ allows you to override Python's default mechanism for member assignment.
  • The repr() function also converts an object to a string. In Python 2 it can also be invoked using reverse quotes (`), also called accent grave (underneath the tilde, ~, on most keyboards). It converts the object unambiguously; for example, repr(datetime.datetime.now()) gives 'datetime.datetime(2018, 1, 20, 13, 32, 51, 483232)'.
  • __str__

returns the informal, readable string of the object (see the sketch at the end of this section).

# Python 2: backquotes are equivalent to repr()
print `a`
print repr(a)
  • find out where module is installed
import os
import spacy
print(os.path.dirname(spacy.__file__))
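
A minimal sketch pulling these magic methods together (hypothetical Bag class):

class Bag:
    def __init__(self, items):
        self.items = list(items)

    def __getitem__(self, i):       # enables bag[i]
        return self.items[i]

    def __call__(self, item):       # makes instances callable
        self.items.append(item)

    def __repr__(self):             # unambiguous, for debugging
        return 'Bag(%r)' % self.items

    def __str__(self):              # informal, for print()
        return 'bag of %d items' % len(self.items)

b = Bag([1, 2])
b(3)             # __call__
print(b[0])      # __getitem__ -> 1
print(b)         # __str__ -> bag of 3 items
print(repr(b))   # Bag([1, 2, 3])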

4 Built-in Exceptions

4.1 Base classes

4.2 Concrete exceptions

4.3 Warnings

4.4 Exception hierarchy

4.5 exception

  • oracle cx error (catch and continue, inside a loop):
try:
    xx
except cx.DatabaseError:
    continue
  • retry:
retry = 0
response = None
error = None
while response is None:
  try:
    response = doing_something()
    if response is not None:
      if 'good' in response:
        print("successfully uploaded")
      else:
        exit("reason %s" % response)
  except HttpError as e:
    if e.code in RETRIABLE_STATUS_CODES:
      error = 'A retriable HTTP error %d occurred:\n%s' % (e.resp.status,
                                                           e.content)
    else:
      raise
  except RETRIABLE_EXCEPTIONS as e:
    error = 'A retriable error occurred: %s' % e

  if error is not None:
    print(error)
    retry += 1
    if retry > MAX_RETRIES:
      exit('No longer attempting to retry.')

    max_sleep = 2 ** retry
    sleep_seconds = random.random() * max_sleep
    print('Sleeping %f seconds and then retrying...' % sleep_seconds)
    time.sleep(sleep_seconds)
  • capture urllib error:
import urllib2

req = urllib2.Request('http://www.python.org/fish.html')
try:
    resp = urllib2.urlopen(req)
except urllib2.HTTPError as e:
    if e.code == 404:
        pass  # do something...
    else:
        pass  # ...
except urllib2.URLError as e:
    pass  # Not an HTTP-specific error (e.g. connection refused)
else:
    # 200
    body = resp.read()

  • create an exception:
class ConstraintError(Exception):
    def __init__(self, arg):
        self.args = (arg,)   # args should be a tuple


if error:
    raise ConstraintError("error")


class Networkerror(RuntimeError):
    def __init__(self, arg):
        self.args = (arg,)


try:
    raise Networkerror("Bad hostname")
except Networkerror as e:
    print(e.args)
  • clean-up actions
>>> def divide(x, y):
...     try:
...         result = x / y
...     except ZeroDivisionError:
...         print("division by zero!")
...     else:
...         print("result is", result)
...     finally:
...         print("executing finally clause")
...
>>> divide(2, 1)
result is 2.0
executing finally clause
>>> divide(2, 0)
division by zero!
executing finally clause
>>> divide("2", "1")
executing finally clause
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in divide
TypeError: unsupported operand type(s) for /: 'str' and 'str'

5 Text Processing Services

5.1 string — Common string operations

  • detect the language of a string (e.g. check whether it is Chinese):
from googletrans import Translator
translator = Translator(proxies ={
             'http': 'http://192.168.1.126:1080',
             'https': 'http://192.168.1.126:1080'
         }
)
input_text_language_0 = translator.detect(input_text_0).lang
  • find if strings are almost equal:
from difflib import SequenceMatcher
s_1 = 'Mohan Mehta'
s_2 = 'Mohan Mehte'
print(SequenceMatcher(a=s_1,b=s_2).ratio())
0.909090909091
  • check if string is empty:
if not text:
  print('text is empty')
  • check if a string is an English word:
from nltk.corpus import wordnet
if not wordnet.synsets(word) and not word.isdigit():
    print('not an English word')
  • jieba cut, remove signs.
punct = set(u''':!),.:;?]}¢'"、。〉》」』】〕〗〞︰︱︳﹐、﹒
﹔﹕﹖﹗﹚﹜﹞!),.:;?|}︴︶︸︺︼︾﹀﹂﹄﹏、~¢
々‖•·ˇˉ―--′’”([{£¥'"‵〈《「『【〔〖([{£¥〝︵︷︹︻
︽︿﹁﹃﹙﹛﹝({“‘-—_…''')

str_in = u"小明硕士毕业于中国科学院计算所,\
后在日本京都大学深造,凭借过人天赋,旁人若在另一方面爱他,他每即躲开。"

# for str/unicode
filterpunt = lambda s: ''.join(filter(lambda x: x not in punct, s))
# for list
filterpuntl = lambda l: list(filter(lambda x: x not in punct, l))
seg_list = jieba.cut(str_in, cut_all=False)
sent_list = filterpuntl(seg_list)
  • jieba cut on bash:
python -m jieba news.txt > cut_result.txt
  • create a list from jieba generator: sentence = [x for x in seg_list]
  • manually download nltk tokenizer:

unzip downloaded file nltk_data to /usr/local/share/nltk_data/tokenizers

  • tokenize unicode or string to sentence list.
from nltk import tokenize as n_tokenize
sent= n_tokenize.sent_tokenize(page)
# or
sent_list = page.split()  # splits on whitespace (words, not sentences)
  • list comprehension
[x for x in t if x not in s if x.isdigit()]
l = [22, 13, 45, 50, 98, 69, 43, 44, 1]
[True if x >= 45 else False for x in l]
  • check if a string consists of digits:
s.isdigit()
  • concatenate two strings
" ".join((str1, str2))
  • strip specified characters (whitespace by default) from both ends of a string:
#!/usr/bin/python

s = "0000000this is string example....wow!!!0000000"
print(s.strip('0'))
  • check whether a string consists of alphabetic characters only:
s.isalpha()

5.2 re — Regular expression operations

5.2.1 usage:

  • find strings
  • convert strings
  • convert syntax from python2 to python3
regular expression: find `print (\S*)`, replace with `print(\1)`

5.2.2 types:

In : rex_property.search?
Signature: rex_property.search(string=None, pos=0, endpos=9223372036854775807, *, pattern=None)
Docstring: Scan through string looking for a match, and return a corresponding match object instance (with start, end, group, groups, span). Return None if no position in the string matches.
Type: builtin_function_or_method

In : rex_property.findall?
Signature: rex_property.findall(string=None, pos=0, endpos=9223372036854775807, *, source=None)
Docstring: Return a list of all non-overlapping matches of pattern in string.
Type: builtin_function_or_method

In : rex_property.match?
Signature: rex_property.match(string=None, pos=0, endpos=9223372036854775807, *, pattern=None)
Docstring: Matches zero or more characters at the beginning of the string.
Type: builtin_function_or_method

5.2.3 string array

[Pp]ython: find Python or python

  1. parts

    re.search('[a-zA-Z0-9]', 'x')

  2. not

    re.search('[^0-9]', 'x')

  3. shortcut
    • word: \w
    • number: \d
    • space, tab, next line: \s
    • zero-length word boundary: \b

    re.search(r'\bcorn\b', 'corner')  # no match; use a raw string so \b is not a backspace

  4. start and end with strings
    re.search('^Python', 'Python 3')
    re.search('Python$', 'this is Python')
    
  5. any character

    "."

  6. all lines that contain a specific string
    ^.*Deeplearning4j.*$
    

5.2.4 optional words

'color' vs 'colour': re.search('colou?r', 'my favorite color')

5.2.5 repeat

{N}

# find a telephone number
re.search(r'[\d]{3}-[\d]{4}', '867-5309 /Jenny')

# find 32-character GIDs
[x for x in risk_model_merge.keys() if re.match("[A-Z0-9]{32}$", x)]
  1. boundary of repeated times

    [\d]{3,4}

  2. open-ended repetition

    [\d]{3,}

  3. shorthand quantifiers
    • +: {1,}
    • *: {0,}

5.2.6 search for a pattern within a text file

  • bulk read:
import re

textfile = open(filename, 'r')
filetext = textfile.read()
textfile.close()
matches = re.findall(r"(<(\d{4,5})>)?", filetext)

  • read line by line:
import re

textfile = open(filename, 'r')
matches = []
reg = re.compile(r"(<(\d{4,5})>)?")
for line in textfile:
    matches += reg.findall(line)
textfile.close()

5.3 difflib — Helpers for computing deltas

5.4 textwrap — Text wrapping and filling

5.5 unicodedata — Unicode Database

5.6 stringprep — Internet String Preparation

5.7 readline — GNU readline interface

  • read certain line from a file:
import linecache
linecache.getline('Sample.txt', Number_of_Line)

6 Data Types

6.1 datetime — Basic date and time types

  • Converting unix timestamp string to readable date in Python
import datetime
print(
    datetime.datetime.fromtimestamp(
        int("1284101485")
    ).strftime('%Y-%m-%d %H:%M:%S')
)

6.2 Time

  • create timestamp:
>>> pd.to_datetime('13000101', format='%Y%m%d', errors='ignore')
datetime.datetime(1300, 1, 1, 0, 0)
  • get the datetime in a specific timezone
from datetime import datetime
import pytz

tz = pytz.timezone('America/Los_Angeles')
#date = date.today()
now = datetime.now()
los_angeles_time = datetime.now(tz)
  • use tqdm as a status bar:
from tqdm import tqdm
from time import sleep
for i in tqdm(range(10)):
    sleep(0.1)

# enumerate
for i, item in enumerate(tqdm(my_list)):
    do_things()
# pandas
df = pd.DataFrame(np.random.randint(0, int(1e8), (10000, 1000)))

# Create and register a new `tqdm` instance with `pandas`
# (can use tqdm_gui, optional kwargs, etc.)
tqdm.pandas()

# Now you can use `progress_apply` instead of `apply`
df.groupby(0).progress_apply(lambda x: x**2)

  • string to datetime:
time.strptime(string[, format])
  • datetime, Timestamp, datetime64

pandas Timestamp has dtype np.dtype('<M8[ns]').

– a DatetimeIndex is composed of Timestamps.

# Timestamp to string:
str_timestamp = pd.to_datetime(Timestamp, format='%Y%m%d')
str_timestamp = str_timestamp.strftime('%Y-%m-%d')

Python's datetime is timezone-naive by default; numpy's datetime64 is UTC-based.

  • get the location of a date in datetimeindex:
pd.DatetimeIndex.get_loc(datetime)
  • datetime offset, subtract
TimeStamp +/- pd.DateOffset(years=1)
pd.Timedelta(days=365) #allowed keywords are [weeks, days, hours, minutes, seconds, milliseconds, microseconds, nanoseconds]
  • pandas date range selection:
start = ls_dates[d]
start = pd.to_datetime(start)
period = start + pd.DateOffset(30)
print(start, period)
df_range = df_simul[(df_simul.index < period) & (df_simul.index > start)]

6.3 calendar — General calendar-related functions

6.4 collections — Container datatypes

6.5 collections — High-performance container datatypes

module function
deque list-like container with fast appends and pops on either end
Counter dict subclass for counting hashable objects
defaultdict dict subclass that calls a factory function to supply missing values

6.6 collections.abc — Abstract Base Classes for Containers

6.7 heapq — Heap queue algorithm

6.8 bisect — Array bisection algorithm

6.9 array — Efficient arrays of numeric values

6.10 weakref — Weak references

6.11 types — Dynamic type creation and names for built-in types

6.12 copy — Shallow and deep copy operations

from copy import copy

6.13 pprint — Data pretty printer

Good for pretty-printing nested lists and data structures.

>>> import pprint
>>> tup = ('spam', ('eggs', ('lumberjack', ('knights', ('ni', ('dead',
... ('parrot', ('fresh fruit',))))))))
>>> stuff = ['a' * 10, tup, ['a' * 30, 'b' * 30], ['c' * 20, 'd' * 20]]
>>> pprint.pprint(stuff)
['aaaaaaaaaa',
 ('spam',
  ('eggs',
   ('lumberjack',
    ('knights', ('ni', ('dead', ('parrot', ('fresh fruit',)))))))),
 ['aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb'],
 ['cccccccccccccccccccc', 'dddddddddddddddddddd']]

6.14 reprlib — Alternate repr() implementation

6.15 enum — Support for enumerations

7 Numeric and Mathematical Modules

7.1 numbers — Numeric abstract base classes

7.2 math — Mathematical functions

7.3 cmath — Mathematical functions for complex numbers

7.4 decimal — Decimal fixed point and floating point arithmetic

Floating-point numbers are represented in computer hardware as base 2 (binary) fractions. For example, the binary fraction 0.001 has value 0/2 + 0/4 + 1/8. On a typical machine running Python, there are 53 bits of precision available for a Python float, so the value stored internally when you enter the decimal number 0.1 is the binary fraction.

0.00011001100110011001100110011001100110011001100110011010
>>> round(2.675, 2)
2.67

round(2.675, 2) gives 2.67 because 2.675 is again replaced with a binary approximation, whose exact value is

2.67499999999999982236431605997495353221893310546875
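
The decimal module avoids this; a quick illustration:

>>> from decimal import Decimal
>>> 0.1 + 0.2
0.30000000000000004
>>> Decimal('0.1') + Decimal('0.2')
Decimal('0.3')
>>> round(Decimal('2.675'), 2)   # half-even rounding on exact decimals
Decimal('2.68')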

  • precision, scientific number:
import numpy as np
np.set_printoptions(suppress=True)
%precision %.4g  # IPython magic

7.5 fractions — Rational numbers

7.6 random — Generate pseudo-random numbers

7.7 statistics — Mathematical statistics functions

8 Functional Programming Modules

8.1 itertools — Functions creating iterators for efficient looping

8.2 functools — Higher-order functions and operations on callable objects

8.3 operator — Standard operators as functions

9 File and Directory Access

9.1 pathlib — Object-oriented filesystem paths

9.2 os.path — Common pathname manipulations

  • delete a file:
if os.path.exists("demofile.txt"):
  os.remove("demofile.txt")
  • change file name
import os
def change_filename(dir_name, filename, suffix, extension=None):
    # name = filename.split('/')[-1]
    path, ext = os.path.splitext(filename)
    if extension is None:
        extension = ext
    return os.path.join(dir_name, path + suffix + extension)


  • move file
import shutil
shutil.move(source_file_path, destination)  # os has no move(); use shutil.move or os.rename
  • find current working dir:
import sys, os
# run python file.py
ROOTDIR = os.path.join(os.path.dirname(__file__), os.pardir)
sys.path.append(os.path.join(ROOTDIR, "lib"))
# run interactively: __file__ is undefined, so use the string literal (yields '')
ROOTDIR = os.path.join(os.path.dirname("__file__"), os.pardir)
  • temporary folder:
import os
import tempfile
TEMP_FOLDER = tempfile.gettempdir()
print('Folder "{}" will be used to save temporary dictionary and corpus.'.format(TEMP_FOLDER))
  • walk all file from a directory and its sub-directory

Directory tree generator.

For each directory in the directory tree rooted at top (including top itself, but excluding '.' and '..'), yields a 3-tuple

dirpath, dirnames, filenames

import os
from os.path import join, getsize
for root, dirs, files in os.walk('/home/weiwu/share/deep_learning/data/enwiki'):
    print(root, "consumes", end=" ")
    print(sum(getsize(join(root, name)) for name in files), end=" ")
    print("bytes in", len(files), "non-directory files")
  • check if file exist
os.path.isfile(os.path.join(path,name))
  • get current work directory
import os
cwd = os.getcwd()
  • get temporary work directory
from tempfile import gettempdir
tmp_dir = gettempdir()

9.3 fileinput — Iterate over lines from multiple input streams

  • open

open() returns a file object, and is most commonly used with two arguments: open(filename, mode).

>>> f = open('workfile', 'w')
>>> print(f)
<_io.TextIOWrapper name='workfile' mode='w' encoding='UTF-8'>

The first argument is a string containing the filename. The second argument is another string containing a few characters describing the way in which the file will be used. mode can be 'r' when the file will only be read, 'w' for only writing (an existing file with the same name will be erased), and 'a' opens the file for appending; any data written to the file is automatically added to the end. 'r+' opens the file for both reading and writing. The mode argument is optional; r will be assumed if it’s omitted.

On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'.

  • write text at the end of a file without overwrite that file:
f = open('filename.txt', 'a')
f.write("stuff")
f.close()
  • read specific lines
fp = open("file")
for i, line in enumerate(fp):
    if i == 25:
        pass  # 26th line
    elif i == 29:
        pass  # 30th line
    elif i > 29:
        break
fp.close()
# Note that i == n-1 for the nth line.

# In Python 2.6 or later:
with open("file") as fp:
    for i, line in enumerate(fp):
        if i == 25:
            pass  # 26th line
        elif i == 29:
            pass  # 30th line
        elif i > 29:
            break

9.4 stat — Interpreting stat() results

9.5 filecmp — File and Directory Comparisons

9.6 tempfile — Generate temporary files and directories

9.7 glob — Unix style pathname pattern expansion

9.8 fnmatch — Unix filename pattern matching

9.9 linecache — Random access to text lines

9.10 shutil — High-level file operations

9.11 macpath — Mac OS 9 path manipulation functions

10 Data Persistence

10.1 pickle — Python object serialization

10.1.1 dump:

import pickle

data1 = {'a': [1, 2.0, 3, 4+6j],
         'b': ('string', u'Unicode string'),
         'c': None}

selfref_list = [1, 2, 3]
selfref_list.append(selfref_list)

output = open('data.pkl', 'wb')

# Pickle dictionary using protocol 0.
pickle.dump(data1, output)

# Pickle the list using the highest protocol available.
pickle.dump(selfref_list, output, -1)

output.close()
pickle.dump(x0, open("x0.pkl", "wb"))

10.1.2 load:

  • read all the objects in the pickle dump file:
pickle_file = open('./data/city_20190228.pkl', 'rb')
dict_disease_seed_graph = []
while True:
    try:
        dict_disease_seed_graph.append(pickle.load(pickle_file))
    except EOFError:
        pickle_file.close()
        break
import pprint, pickle

pkl_file = open('data.pkl', 'rb')

data1 = pickle.load(pkl_file)
pprint.pprint(data1)

data2 = pickle.load(pkl_file)
pprint.pprint(data2)

pkl_file.close()

10.2 copyreg — Register pickle support functions

10.3 shelve — Python object persistence

10.4 marshal — Internal Python object serialization

10.5 dbm — Interfaces to Unix “databases”

10.6 sqlite3 — DB-API interface for SQLite databases

  • ModuleNotFoundError: No module named 'MySQLdb'
pip install mysqlclient
  • install mysql connector

python ModuleNotFoundError: No module named 'mysql'

pip search mysql-connector | grep --color mysql-connector-python
pip install mysql-connector-python-rf

10.7 protobuf

10.7.1 tutorial

  • need a .proto file for the structure:
syntax = "proto3"; // or proto2
package tutorial;

import "google/protobuf/timestamp.proto";
// [END declaration]

// [START java_declaration]
option java_package = "com.example.tutorial";
option java_outer_classname = "AddressBookProtos";
// [END java_declaration]

// [START csharp_declaration]
option csharp_namespace = "Google.Protobuf.Examples.AddressBook";
// [END csharp_declaration]

// [START messages]
message Person {
  string name = 1;
  int32 id = 2;  // Unique ID number for this person.
  string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    string number = 1;
    PhoneType type = 2;
  }

  repeated PhoneNumber phones = 4;

  google.protobuf.Timestamp last_updated = 5;
}

// Our address book file is just one of these.
message AddressBook {
  repeated Person people = 1;
}

  • Compiling Your Protocol Buffers in shell to generate a class:
protoc -I=$SRC_DIR --python_out=$DST_DIR $SRC_DIR/addressbook.proto
protoc --proto_path=src --python_out=build/gen src/foo.proto src/bar/baz.proto
# The compiler will read the files src/foo.proto and src/bar/baz.proto and produce two output files: build/gen/foo_pb2.py and build/gen/bar/baz_pb2.py. The compiler will automatically create the directory build/gen/bar if necessary, but it will not create build or build/gen; they must already exist.
  • add_person.py
#! /usr/bin/python

import addressbook_pb2
import sys

# This function fills in a Person message based on user input.
def PromptForAddress(person):
  person.id = int(input("Enter person ID number: "))
  person.name = input("Enter name: ")

  email = input("Enter email address (blank for none): ")
  if email != "":
    person.email = email

  while True:
    number = input("Enter a phone number (or leave blank to finish): ")
    if number == "":
      break

    phone_number = person.phones.add()
    phone_number.number = number

    type = input("Is this a mobile, home, or work phone? ")
    if type == "mobile":
      phone_number.type = addressbook_pb2.Person.MOBILE
    elif type == "home":
      phone_number.type = addressbook_pb2.Person.HOME
    elif type == "work":
      phone_number.type = addressbook_pb2.Person.WORK
    else:
      print "Unknown phone type; leaving as default value."

# Main procedure:  Reads the entire address book from a file,
#   adds one person based on user input, then writes it back out to the same
#   file.
if len(sys.argv) != 2:
  print "Usage:", sys.argv[0], "ADDRESS_BOOK_FILE"
  sys.exit(-1)

address_book = addressbook_pb2.AddressBook()

# Read the existing address book.
try:
  f = open(sys.argv[1], "rb")
  address_book.ParseFromString(f.read())
  f.close()
except IOError:
  print(sys.argv[1] + ": Could not open file.  Creating a new one.")

# Add an address.
PromptForAddress(address_book.people.add())

# Write the new address book back to disk.
f = open(sys.argv[1], "wb")
f.write(address_book.SerializeToString())
f.close()
  • try to run above python code in shell:
python add_person.py ADDRESS_BOOK_FILE
  • list_person.py
#! /usr/bin/python

import addressbook_pb2
import sys

# Iterates through all people in the AddressBook and prints info about them.
def ListPeople(address_book):
  for person in address_book.people:
    print "Person ID:", person.id
    print "  Name:", person.name
    if person.HasField('email'):
      print "  E-mail address:", person.email

    for phone_number in person.phones:
      if phone_number.type == addressbook_pb2.Person.MOBILE:
        print("  Mobile phone #: ", end="")
      elif phone_number.type == addressbook_pb2.Person.HOME:
        print("  Home phone #: ", end="")
      elif phone_number.type == addressbook_pb2.Person.WORK:
        print("  Work phone #: ", end="")
      print(phone_number.number)

# Main procedure:  Reads the entire address book from a file and prints all
#   the information inside.
if len(sys.argv) != 2:
  print "Usage:", sys.argv[0], "ADDRESS_BOOK_FILE"
  sys.exit(-1)

address_book = addressbook_pb2.AddressBook()

# Read the existing address book.
f = open(sys.argv[1], "rb")
address_book.ParseFromString(f.read())
f.close()

ListPeople(address_book)
  • read a message:
python list_person.py ADDRESS_BOOK_FILE

10.8 neo4j

10.8.1 install

pip install neo4j-driver py2neo

10.8.2 basic operations

  • create a connection:
from py2neo import Graph

g = Graph(host="192.168.4.36",  # IP of the server hosting neo4j (see ifconfig)
            http_port=7474,  # port the neo4j server listens on
            user="neo4j",  # database user name; defaults to neo4j if never changed
            password="neo4j123")

10.9 oracle (cx_Oracle)

10.9.1 create

  • create table
-- Create table
create table PATIENT_PROFILE
(
  PERSON_ID                VARCHAR2(50) not null,
  PERSON_AGE               INTEGER not null,
  PERSON_SEX               VARCHAR2(10) not null,
  AGE_GROUP                VARCHAR2(10) not null,
  RISK_SCORE               VARCHAR2(10) default 1,
  CHRONIC_DIS              VARCHAR2(10) default 'False',
  INFECTIOUS_DIS           VARCHAR2(10) default 'False',
  TUMOR                    VARCHAR2(10) default 'False',
  PSYCHIATRIC              VARCHAR2(10) default 'False',
  IMPLANTABLE_DEV          VARCHAR2(10) default 'False',
  TREATMENT_PERC_LV        VARCHAR2(10) default 'low',
  INSPECTION_PERC_LV       VARCHAR2(10) default 'low',
  DRUG_PERC_LV             VARCHAR2(10) default 'low',
  OUTPATIENT_PERC_LV       VARCHAR2(10) default 'low',
  DRUG_PURCH_AVG_LV        VARCHAR2(10) default 'low',
  SELF_SURPP_PERC_LV       VARCHAR2(10) default 'low',
  CUM_OUTPATIENT_LV        VARCHAR2(10) default 'low',
  CUM_HOSP_LV              VARCHAR2(10) default 'low',
  HOSP_AVG_LV              VARCHAR2(10) default 'low',
  OUTPATIENT_AVG_LV        VARCHAR2(10) default 'low',
  DRUG_PURCH_FREQ_LV       VARCHAR2(10) default 'low',
  OUTPATIENT_FREQ_LV       VARCHAR2(10) default 'low',
  HOSP_FREQ_LV             VARCHAR2(10) default 'low',
  HOSP_PREFERENCE          VARCHAR2(10),
  FIRST_VISIT_PREFERENCE   VARCHAR2(10),
  TRAILING_12_MONTHS_LEVEL VARCHAR2(10) default 'low',
  TRAILING_36_MONTHS_LEVEL VARCHAR2(10) default 'low',
  SHORT_TERM               VARCHAR2(10) default 'False',
  LONG_TERM                VARCHAR2(10) default 'False'
)
tablespace WUXI
  pctfree 10
  initrans 1
  maxtrans 255;
-- Add comments to the table
comment on table PATIENT_PROFILE
  is '个人用户画像';
-- Add comments to the columns
comment on column PATIENT_PROFILE.PERSON_SEX
  is '描述患者的生理性别
';
comment on column PATIENT_PROFILE.AGE_GROUP
  is '0-6 童年,7-17 少年, 18-40 青年, 41-65 中年, 66以上老年
';
comment on column PATIENT_PROFILE.RISK_SCORE
  is '健康风险等级
';
comment on column PATIENT_PROFILE.CHRONIC_DIS
  is '慢性病患者';
comment on column PATIENT_PROFILE.INFECTIOUS_DIS
  is '传染病病原携带者 ';
comment on column PATIENT_PROFILE.TUMOR
  is '恶性肿瘤患者 ';
comment on column PATIENT_PROFILE.PSYCHIATRIC
  is '精神疾病患者 ';
comment on column PATIENT_PROFILE.IMPLANTABLE_DEV
  is '植入性器材';
comment on column PATIENT_PROFILE.TREATMENT_PERC_LV
  is '治疗费用占比';
comment on column PATIENT_PROFILE.INSPECTION_PERC_LV
  is '检查费用占比';
comment on column PATIENT_PROFILE.DRUG_PERC_LV
  is '药物费用占比';
comment on column PATIENT_PROFILE.OUTPATIENT_PERC_LV
  is '门诊住院费用比例';
comment on column PATIENT_PROFILE.DRUG_PURCH_AVG_LV
  is '单次平均购药金额';
comment on column PATIENT_PROFILE.SELF_SURPP_PERC_LV
  is '自负费用占比';
comment on column PATIENT_PROFILE.CUM_OUTPATIENT_LV
  is '累计门诊金额
';
comment on column PATIENT_PROFILE.CUM_HOSP_LV
  is '累计住院金额';
comment on column PATIENT_PROFILE.HOSP_AVG_LV
  is '平均住院金额';
comment on column PATIENT_PROFILE.OUTPATIENT_AVG_LV
  is '平均门诊金额';
comment on column PATIENT_PROFILE.DRUG_PURCH_FREQ_LV
  is '药店购药频繁程度';
comment on column PATIENT_PROFILE.OUTPATIENT_FREQ_LV
  is '门诊就诊频繁程度';
comment on column PATIENT_PROFILE.HOSP_FREQ_LV
  is '住院就诊频繁程度
';
comment on column PATIENT_PROFILE.HOSP_PREFERENCE
  is '就诊医院偏好
1门诊、2药店购药、3住院
';
comment on column PATIENT_PROFILE.FIRST_VISIT_PREFERENCE
  is '首选就诊偏好,对一次健康问题的首次就医方式的选择偏好。1门诊、2药店购药、3住院
';
comment on column PATIENT_PROFILE.TRAILING_12_MONTHS_LEVEL
  is '近一年费用支出水平
';
comment on column PATIENT_PROFILE.TRAILING_36_MONTHS_LEVEL
  is '近三年费用支出水平
';
comment on column PATIENT_PROFILE.SHORT_TERM
  is '短期内再入院';
comment on column PATIENT_PROFILE.LONG_TERM
  is '长周期住院';

  • create from select:
create table temp_pairs_id as
select t1.GRID person_1, t2.GRID person_2, sysdate as crt_date
  from MMAP_SHB_SPECIAL.Ck10_Ghdj t1
  • create index
create index idx_patient_person on PATIENT_PROFILE (PERSON_ID);
  • create a composite (multi-column) index
create index idx_patient_person_age on PATIENT_PROFILE (PERSON_ID, AGE_GROUP);

10.9.2 select

  • select the first row in each group:
WITH summary AS (
select person_id, hosp_lev, clinic_type, substr(out_diag_dis_cd, 0, 3) as icd3, to_date(substr(out_hosp_date, 0, 8), 'yyyymmdd') as discharge_date,
ROW_NUMBER() OVER(PARTITION BY person_id, substr(out_diag_dis_cd, 0, 3)
                                 ORDER BY to_date(substr(out_hosp_date, 0, 8), 'yyyymmdd')) AS rk from t_kc21
where out_diag_dis_cd is not null and clinic_type is not null)
SELECT s.*
  FROM summary s
 WHERE s.rk = 1

10.9.3 insert

  • insert by batches:
from tqdm import tqdm
def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))
chunksize = 500
with tqdm(total=len(df_disease)) as pbar:
    for i, cdf in enumerate(chunker(df_disease, chunksize)):
        cdf.to_sql(con=engine, name="disease_profile", if_exists="append", index=False)
        pbar.update(chunksize)
  • insert data into sql:
cursor = CONN.cursor()
s_insert = """INSERT INTO t_jbhlx (TBBH, JB_CD, HLFYSX, HLFYXX, HLZYTSSX, HLZYTSXX, LOGIN_ID, MODIFY_ID, DELE_FLG) VALUES (:1, :2, :3, :4, :5, :6, :7, :8, :9)"""
counter = 0
for idx, row in df_sim.iterrows():
    if counter % 1000 == 0:
        print(idx)
        print(counter)
    counter += 1
try:
    cursor.execute(s_insert, (str(data.table_index.iloc[0]), #TBBH,
                              accident_id, #JB_CD,
                            str(cost_upper_bnd), #HLFYSX
                            str(cost_lower_bnd), #HLFYXX
                            str(period_upper_bnd), #HLZYTSSX
                            str(period_lower_bnd), #HLZYTSXX
                            'admin', #LOGIN_ID
                            'admin', #MODIFY_ID
                            '1'  #DELE_FLG
    ))
except cx_Oracle.IntegrityError:
    pass
CONN.commit()
  • True/False

You can't insert Python True or False directly into the table; convert True/False to strings first.

10.9.4 delete

  • clear a table

The TRUNCATE TABLE statement is used to remove all records from a table in Oracle. It performs the same function as a DELETE statement without a WHERE clause.

TRUNCATE TABLE [schema_name.]table_name
  • delete a table
drop table name

10.9.5 update

  • update with python:
s_insert = """update patient_profile set CHRONIC_DIS='{0}',INFECTIOUS_DIS='{1}', TUMOR='{2}', PSYCHIATRIC='{3}' where person_id={4}"""
counter = 0
for idx, row in df_disease.iterrows():
    if counter % 100000 == 0:
        print(counter)
    counter += 1
    cursor.execute(s_insert.format(str(row['慢性病患者']),
                                   str(row['传染病病原携带者']),
                                   str(row['恶性肿瘤患者']),
                                   str(row['精神疾病患者']),
                             str(idx)))

11 Data Compression and Archiving

11.1 zip

zip([iterable, …]) This function returns a list of tuples (in Python 3, an iterator of tuples), where the i-th tuple contains the i-th element from each of the argument sequences or iterables. The returned result is truncated in length to the length of the shortest argument sequence. When there are multiple arguments which are all of the same length, zip() is similar to map() with an initial argument of None. With a single sequence argument, it returns a list of 1-tuples. With no arguments, it returns an empty list.

The left-to-right evaluation order of the iterables is guaranteed. This makes possible an idiom for clustering a data series into n-length groups using zip(*[iter(s)]*n).
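
For example:

>>> s = [1, 2, 3, 4, 5, 6]
>>> list(zip(*[iter(s)] * 2))   # one iterator repeated twice -> pairs
[(1, 2), (3, 4), (5, 6)]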

zip() in conjunction with the * operator can be used to unzip a list:

>>> x = [1, 2, 3]
>>> y = [4, 5, 6]
>>> zipped = list(zip(x, y))   # in Python 3, wrap zip() in list() to see the pairs
>>> zipped
[(1, 4), (2, 5), (3, 6)]
>>> x2, y2 = zip(*zipped)
>>> x == list(x2) and y == list(y2)
True
  • create a dictionary with two iterables
>>> x = [1, 2, 3]
>>> y = [4, 5, 6]
>>> dict(zip(x, y))
{1: 4, 2: 5, 3: 6}

11.2 zlib — Compression compatible with gzip

11.3 gzip — Support for gzip files

11.4 bz2 — Support for bzip2 compression

11.5 lzma — Compression using the LZMA algorithm

11.6 zipfile — Work with ZIP archives

11.7 tarfile — Read and write tar archive files

12 File Formats

12.1 csv — CSV File Reading and Writing

12.2 *args, **kwargs

If we are not sure how many arguments will be passed to a function, or we want to pass a stored list or tuple of arguments to a function, we use *args:

>>> args = ("two", 3,5)
>>> test_args_kwargs(*args)
arg1: two
arg2: 3
arg3: 5

If we don't know how many keyword arguments will be passed to a function, or we want to pass the values of a dictionary as keyword arguments, we use **kwargs. The identifiers args and kwargs are just a convention; you could use *bob and **billy instead, but that would be poor style.

>>> kwargs = {"arg3": 3, "arg2": "two","arg1":5}
>>> test_args_kwargs(**kwargs)
arg1: 5
arg2: two
arg3: 3

  • construct argparse for test:
class argparse(dict):
    """
    Example:
    m = Map({'first_name': 'Eduardo'}, last_name='Pool', age=24, sports=['Soccer'])
    """
    def __init__(self, *args, **kwargs):
        super(argparse, self).__init__(*args, **kwargs)
        for arg in args:
            if isinstance(arg, dict):
                for k, v in arg.items():
                    self[k] = v

        if kwargs:
            for k, v in kwargs.items():
                self[k] = v
    def add_argument(self, *args, **kwargs):
        # super(Map, self).__init__(*args, **kwargs)
        for i in args:
            self[i.strip('-')] = kwargs.get('default', None)
            if 'action' in kwargs:
                if kwargs['action'] == 'store_true':
                    self[i.strip('-')] = True
                else:
                    self[i.strip('-')] = False
    def parse_args(self):
        return self

    def __getattr__(self, attr):
        return self.get(attr)

    def __setattr__(self, key, value):
        self.__setitem__(key, value)

    def __setitem__(self, key, value):
        super(argparse, self).__setitem__(key, value)
        self.__dict__.update({key: value})

    def __delattr__(self, item):
        self.__delitem__(item)

    def __delitem__(self, key):
        super(argparse, self).__delitem__(key)
        del self.__dict__[key]

12.3 configparser — Configuration file parser

  • use yaml and config file.
# config.yaml
engine:
  user:
    'jack'
  password:
    'password'
import yaml
with open(r'config.yaml', 'rb') as f:
    config = yaml.safe_load(f)  # safe_load avoids arbitrary object construction

  • ylib.yaml_config
from ylib.yaml_config import Configuraion
config = Configuraion()
config.load('../config.yaml')
print(config)

USER_AGENT = config.USER_AGENT
DOMAIN = config.DOMAIN
BLACK_DOMAIN = config.BLACK_DOMAIN
URL_SEARCH = config.URL_SEARCH

12.4 netrc — netrc file processing

12.5 xdrlib — Encode and decode XDR data

12.6 plistlib — Generate and parse Mac OS X .plist files

13 Cryptographic Services

13.1 hashlib — Secure hashes and message digests

13.2 hmac — Keyed-Hashing for Message Authentication

13.3 secrets — Generate secure random numbers for managing secrets

14 Generic Operating System Services

14.1 os — Miscellaneous operating system interfaces

14.2 io — Core tools for working with streams

14.3 time — Time access and conversions

14.4 argparse — Parser for command-line options, arguments and sub-commands

14.4.1 ArgumentParser objects

class argparse.ArgumentParser(prog=None, usage=None, description=None, epilog=None, parents=[], formatter_class=argparse.HelpFormatter, prefix_chars='-', fromfile_prefix_chars=None, argument_default=None, conflict_handler='error', add_help=True, allow_abbrev=True) Create a new ArgumentParser object. All parameters should be passed as keyword arguments. Each parameter has its own more detailed description below, but in short they are:

prog - The name of the program (default: sys.argv[0])
usage - The string describing the program usage (default: generated from arguments added to parser)
description - Text to display before the argument help (default: none)
epilog - Text to display after the argument help (default: none)
parents - A list of ArgumentParser objects whose arguments should also be included
formatter_class - A class for customizing the help output
prefix_chars - The set of characters that prefix optional arguments (default: '-')
fromfile_prefix_chars - The set of characters that prefix files from which additional arguments should be read (default: None)
argument_default - The global default value for arguments (default: None)
conflict_handler - The strategy for resolving conflicting optionals (usually unnecessary)
add_help - Add a -h/--help option to the parser (default: True)
allow_abbrev - Allows long options to be abbreviated if the abbreviation is unambiguous. (default: True)

14.4.2 argument_default

>>> parser = argparse.ArgumentParser(argument_default=argparse.SUPPRESS)
>>> parser.add_argument('--foo')
>>> parser.add_argument('bar', nargs='?')
>>> parser.parse_args(['--foo', '1', 'BAR'])
Namespace(bar='BAR', foo='1')
>>> parser.parse_args([])
Namespace()

14.4.3 example

import argparse
import logging

logger = logging.getLogger()
handler = logging.StreamHandler()
formatter = logging.Formatter(
    '%(asctime)s %(name)-12s %(levelname)-8s %(message)s')
handler.setFormatter(formatter)
if not logger.handlers:
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-i", "--input", required=False, help="Input word2vec model")
    parser.add_argument(
        "-o", "--output", required=False, help="Output tensor file name prefix")
    parser.add_argument(
        "-b",
        "--binary",
        required=False,
        help="If word2vec model in binary format, set True, else False")
    parser.add_argument(
        "-l",
        "--logdir",
        required=False,
        help="periodically save model variables in a checkpoint")
    parser.add_argument(
        "--host",
        required=False,
        help="host where holding the tensorboard projector service")
    parser.add_argument("-p", "--port", required=False, help="browser port")
    args = parser.parse_args()

    word2vec2tensor(args.input, args.output, args.binary)

14.4.4 another way to define and supply function parameters:

  • define function parameters inside a function:
def convert_pdf2txt(args=None):
    import argparse
    P = argparse.ArgumentParser(description=__doc__)
    P.add_argument(
        "-m", "--maxpages", type=int, default=0, help="Maximum pages to parse")

    A = P.parse_args(args=args)
    print(A.maxpages)
  • provide parameters:
# all parameters should be strings
convert_pdf2txt(['--maxpages', '123'])

14.5 getopt — C-style parser for command line options

14.6 logging — Logging facility for Python

import logging
logger = logging.getLogger()
handler = logging.StreamHandler()
formatter = logging.Formatter('%(asctime)s %(name)-12s %(levelname)-8s %(message)s')
handler.setFormatter(formatter)
if not logger.handlers:
    logger.addHandler(handler)
logger.setLevel(logging.DEBUG)
# or
logging.basicConfig(
    format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)

# at the end of the program
handler.close()
logger.removeHandler(handler)
  • ylog
from ylib import ylog
import logging

ylog.set_level(logging.DEBUG)
ylog.console_on()
ylog.filelog_on("app")

14.7 logging.config — Logging configuration

14.8 logging.handlers — Logging handlers

14.9 getpass — Portable password input

14.10 curses — Terminal handling for character-cell displays

14.11 curses.textpad — Text input widget for curses programs

14.12 curses.ascii — Utilities for ASCII characters

14.13 curses.panel — A panel stack extension for curses

14.14 platform — Access to underlying platform’s identifying data

14.15 errno — Standard errno system symbols

14.16 ctypes — A foreign function library for Python

15 Concurrent Execution

15.1 threading — Thread-based parallelism

15.2 threading & queue

15.2.1 install

pip install queuelib  # third-party queue library; the stdlib Queue (queue in Python 3) needs no install

15.2.2 example

from Queue import Queue
import threading
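
The example above stops at the imports; below is a minimal producer/consumer sketch to complete it (Python 3 spelling of the import; the worker count and items are illustrative):

from queue import Queue
import threading

q = Queue()

def worker():
    while True:
        item = q.get()
        if item is None:   # sentinel: stop this worker
            q.task_done()
            break
        print('processing', item)
        q.task_done()

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for item in range(5):
    q.put(item)
for _ in threads:
    q.put(None)            # one sentinel per worker
q.join()                   # block until all items are processed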

15.3 multiprocessing — Process-based parallelism

15.4 The concurrent package

15.5 concurrent.futures — Launching parallel tasks

15.6 subprocess — Subprocess management

15.7 sched — Event scheduler

15.8 queue — A synchronized queue class

15.9 dummy_threading — Drop-in replacement for the threading module

15.10 _thread — Low-level threading API

15.11 _dummy_thread — Drop-in replacement for the _thread module

16 Internet Data Handling

16.1 Jupyter notebook

16.1.1 Using a virtualenv in an IPython notebook

  1. Install the ipython kernel module into your virtualenv
workon my-virtualenv-name  # activate your virtualenv, if you haven't already
pip install ipykernel
  2. Now run the kernel "self-install" script:
python -m ipykernel install --user --name=my-virtualenv-name
  3. list all the kernels
jupyter kernelspec list
  4. remove an installed kernel
jupyter kernelspec uninstall anaconda2.7

16.1.2 Extension & Configuration

  • install extension:
conda install -c conda-forge jupyter_contrib_nbextensions
  • Enable line number by default
touch ~/.jupyter/custom/custom.js

add below text in the file:

define([
    'base/js/namespace',
    'base/js/events'
],
       function(IPython, events) {
           events.on("app_initialized.NotebookApp",
                     function () {
                         IPython.Cell.options_default.cm_config.lineNumbers = true;
                     }
                    );
       }
      );
  • enable auto complete
jupyter nbextension enable hinterland/hinterland

16.2 typical structure of the ipython notebook ipynb:

  1. imports
  2. get data
  3. transform data
  4. modeling
  5. visualization
  6. making sense of the data

summary:

  • notebook should have one hypothesis data interpretation loop
  • make a multi-project utils library
  • each cell should have one and only one output
  • try to keep code inside notebooks.

16.3 fetch data from yahoo

install pandas-datareader first.

conda install pandas-datareader
import pandas as pd
import datetime as dt
import numpy as np
from pandas_datareader import data as web

data = pd.DataFrame()
symbols = ['GLD', 'GDX']
for sym in symbols:
    data[sym] = web.DataReader(sym, data_source='yahoo', start='20100510')['Adj Close']
data = data.dropna()

16.4 email — An email and MIME handling package

16.5 json — JSON encoder and decoder

16.6 Graph

16.6.1 networkx

  • add node to a graph
  • add edges to a graph
  • find a loop/cycle in a graph
nx.find_cycle(G)
list(nx.simple_cycles(G))

17 Internet Protocols and Support

17.1 webbrowser — Convenient Web-browser controller

17.2 cgi — Common Gateway Interface support

17.3 cgitb — Traceback manager for CGI scripts

17.4 wsgiref — WSGI Utilities and Reference Implementation

17.5 urllib — URL handling modules

17.6 urllib.request — Extensible library for opening URLs

17.7 urllib.response — Response classes used by urllib

17.8 urllib.parse — Parse URLs into components

17.9 urllib.error — Exception classes raised by urllib.request

17.10 urllib.robotparser — Parser for robots.txt

17.11 http — HTTP modules

17.12 http.client — HTTP protocol client

17.13 ftplib — FTP protocol client

17.14 poplib — POP3 protocol client

17.15 imaplib — IMAP4 protocol client

17.16 nntplib — NNTP protocol client

17.17 smtplib — SMTP protocol client

17.18 smtpd — SMTP Server

17.19 telnetlib — Telnet client

17.20 uuid — UUID objects according to RFC 4122

17.21 socketserver — A framework for network servers

17.22 http.server — HTTP servers

17.23 http.cookies — HTTP state management

17.24 http.cookiejar — Cookie handling for HTTP clients

17.25 xmlrpc — XMLRPC server and client modules

17.26 xmlrpc.client — XML-RPC client access

17.27 xmlrpc.server — Basic XML-RPC servers

17.28 ipaddress — IPv4/IPv6 manipulation library

18 Development Tools

18.1 typing — Support for type hints

18.2 pydoc — Documentation generator and online help system

18.3 doctest — Test interactive Python examples

18.4 unittest — Unit testing framework

  • check data operation:
    • create, select, update, delete.
  • purpose of unit test
    • checking parameter types, classes, or values.
    • checking data structure invariants.
    • checking “can’t happen” situations (duplicates in a list, contradictory state variables.)
    • after calling a function, to make sure that its return is reasonable.

18.5 unittest.mock — mock object library

18.6 unittest.mock — getting started

18.7 test — Regression tests package for Python

18.8 test.support — Utilities for the Python test suite

19 Debugging and Profiling

19.1 bdb — Debugger framework

19.2 faulthandler — Dump the Python traceback

19.3 pdb — The Python Debugger

  • s(tep):

Execute the current line, stop at the first possible occasion (either in a function that is called or in the current function).

  • n(ext):

Continue execution until the next line in the current function is reached or it returns.

  • unt(il):

Continue execution until the line with a number greater than the current one is reached or until the current frame returns.

  • r(eturn):

Continue execution until the current function returns.

  • c(ont(inue)):

Continue execution, only stop when a breakpoint is encountered.

  • l(ist): [first [,last]]

List source code for the current file. Without arguments, list 11 lines around the current line or continue the previous listing. With one argument, list 11 lines starting at that line. With two arguments, list the given range; if the second argument is less than the first, it is a count.

  • a(rgs):

Print the argument list of the current function.

  • p expression:

Print the value of the expression.
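
Two common ways to enter pdb (standard usage; myscript.py is a placeholder):

# from the shell: run a script under the debugger
$ python -m pdb myscript.py

# or drop into the debugger from code
import pdb; pdb.set_trace()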

19.4 The Python Profilers

STEPS: 1). install snakeviz using pip from cmd.

pip install snakeviz

2). profile the test python file using below command.

$ python -m cProfile -o profile.stats test.py
# test.py
from random import randint
# insertion_sort is assumed to be defined elsewhere in test.py
max_size = 10**4
data = [randint(0, max_size) for _ in range(max_size)]
test = lambda: insertion_sort(data)
test()  # the profiler only sees code that actually runs

3). check the efficiency result from profile.stats file.

$ snakeviz profile.stats

19.5 timeit — Measure execution time of small code snippets

19.6 trace — Trace or track Python statement execution

19.7 tracemalloc — Trace memory allocations

20 Software Packaging and Distribution

20.1 pip

  • install opencv, pytorch:
pip install opencv-python torch
  • install with a wheel .whl file:
pip install *.whl
  • Upgrading pip
pip install -U pip
  • add below setup to ~/.pip/pip.conf
[global]
#index-url=https://pypi.mirrors.ustc.edu.cn/simple/
#index-url=https://pypi.python.org/simple/
index-url=http://mirrors.aliyun.com/pypi/simple/
#index-url=https://pypi.gocept.com/pypi/simple/
#index-url=https://mirror.picosecond.org/pypi/simple/
[install]
trusted-host=mirrors.aliyun.com
#trusted-host=mirrors.ustc.edu.cn
  • generate a requirements file:
pip freeze > requirements.txt
  • pip install -r requirements.txt directly; an index URL can be embedded in the file:

requirements.txt

--index-url http://mirrors.aliyun.com/pypi/simple/
pandas
pylint
pep8
sphinx
ipython
numpy
ipdb
mock
nose

20.2 distutils — Building and installing Python modules

20.3 ensurepip — Bootstrapping the pip installer

20.4 venv — Creation of virtual environments

20.5 zipapp — Manage executable python zip archives

20.6 pyenv — Simple Python version management

  • check installed versions
pyenv versions
 system
 2.7.13
 3.6.0
 3.6.0/envs/general
 3.6.0/envs/simulate
 3.6.0/envs/venv3.6.0
 3.6.0/envs/venv3.6.0.1
* anaconda3-4.4.0 (set by /home/weiwu/projects/simulate/.python-version)
 general
 simulate
 venv3.6.0
 venv3.6.0.1

21 Python Runtime Services

21.1 sysconfig — Provide access to Python’s configuration information

21.2 os, sys — System-specific parameters and functions

  • get environment variables
import os

env_dist = os.environ  # environ is a dict defined in os.py: environ = {}

print(env_dist.get('JAVA_HOME'))
print(env_dist['JAVA_HOME'])

  • check if file or directory exists, if not then make directory:
import os
os.path.exists("test_file.txt")
os.path.isfile("test-data")
export_dir = "export/"
if not os.path.exists(export_dir):
    os.mkdir(export_dir)
  • read a file:

import os
folder = '/file/path'
file = os.path.join(folder, 'file_name')

  • list all the files under a directory:
# os.listdir() returns a list of the names of the entries in the given directory, in alphabetical order. It does not include '.' and '..' even though they exist in the directory.
path = os.getcwd()
dirs = os.listdir(path)
  • check if the file readable:
import os
if os.access("/file/path/foo.txt", os.F_OK):
    print "Given file path is exist."

if os.access("/file/path/foo.txt", os.R_OK):
    print "File is accessible to read"

if os.access("/file/path/foo.txt", os.W_OK):
    print "File is accessible to write"

if os.access("/file/path/foo.txt", os.X_OK):
    print "File is accessible to execute"

  • use sys to get command arguments:
#!/usr/bin/python3

import sys

print('Number of arguments:', len(sys.argv))
print('Argument list:', str(sys.argv))

$ python3 test.py arg1 arg2 arg3
Number of arguments: 4
Argument list: ['test.py', 'arg1', 'arg2', 'arg3']

21.3 builtins — Built-in objects

21.4 __main__ — Top-level script environment

21.5 warnings — Warning control

  • SettingWithCopyWarning in Pandas, ignore pandas warning
pd.options.mode.chained_assignment = None  # default='warn'
  • RuntimeWarning in numpy:

numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88

import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

21.6 contextlib — Utilities for with-statement contexts

21.7 abc — Abstract Base Classes

21.8 atexit — Exit handlers

21.9 traceback — Print or retrieve a stack traceback

21.10 __future__ — Future statement definitions

There are incompatible changes from Python 2.7 to Python 3.x. For example, in 2.x 'xxx' denotes a str and u'xxx' denotes a unicode string, while in 3.x all strings are unicode, so u'xxx' and 'xxx' are identical, and what 2.x wrote as str 'xxx' must be written b'xxx' to denote a binary string.

Upgrading code straight to 3.x is risky, because a large amount of changed behavior needs testing. Instead, you can first try some 3.x features in part of your code under 2.7, and port to 3.x once they prove fine.

Python provides the __future__ module, which imports features of the next version into the current one, so you can test new-version features in the current version.

from __future__ import print_function
from __future__ import division
from __future__ import unicode_literals
from __future__ import absolute_import
  • unicode vs utf-8 vs binary strings vs strings

Unicode assigns each character a unique code point; for example, every Chinese character is encoded as one code point (not directly machine-storable).

A chinese character: 汉

it's unicode value: U+6C49

convert 6C49 to binary: 01101100 01001001

UTF-8 is the standard for converting characters to binary code, and vice versa; it is convenient for storage.

UTF-8 byte layout:

1st Byte   2nd Byte   3rd Byte   4th Byte   Number of Free Bits   Maximum Expressible Unicode Value
0xxxxxxx                                    7                     007F hex (127)
110xxxxx   10xxxxxx                         (5+6)=11              07FF hex (2047)
1110xxxx   10xxxxxx   10xxxxxx              (4+6+6)=16            FFFF hex (65535)
11110xxx   10xxxxxx   10xxxxxx   10xxxxxx   (3+6+6+6)=21          10FFFF hex (1,114,111)

The Unicode code point of "严" is 4E25 (100111000100101 in binary). From the table above, 4E25 falls in the third row's range (0000 0800 - 0000 FFFF), so the UTF-8 encoding of "严" needs three bytes in the form "1110xxxx 10xxxxxx 10xxxxxx". Starting from the last binary digit of "严", fill the x positions from back to front, padding the remaining positions with 0. The result is that the UTF-8 encoding of "严" is "11100100 10111000 10100101", which is E4B8A5 in hexadecimal.
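
This can be checked directly in Python 3:

>>> '严'.encode('utf-8')
b'\xe4\xb8\xa5'
>>> hex(ord('严'))
'0x4e25'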

You can use a different encoding from UTF-8 by putting a specially-formatted comment as the first or second line of the source code:

This declaration tells the interpreter which encoding the source file uses when executing it, so it can read the instructions correctly.
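
The standard form of that comment (PEP 263) is:

# -*- coding: utf-8 -*-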

  • division

The new division behavior: previously `/` truncated when both numerator and denominator were integers; with the new feature `/` no longer truncates in that case, and `//` is used for floor division. Results differ only when both operands are integers, as the sketch below shows.
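
from __future__ import division
print(3 / 2)   # 1.5: true division, even on Python 2
print(3 // 2)  # 1: floor division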

  • print_function

The new print is a function; once this feature is imported, the old print statement can no longer be used.

  • unicode_literals

This makes all string literals unicode strings.

21.11 gc — Garbage Collector interface

21.12 inspect — Inspect live objects

  • find the folder of a module:
import inspect
inspect.getfile(Module)

21.13 site — Site-specific configuration hook

21.14 fpectl — Floating point exception control

22 Custom Python Interpreters

22.1 code — Interpreter base classes

22.2 codeop — Compile Python code

23 Importing Modules

23.1 Module

  • reload a module/lib in ipython without killing it.
# For Python 2 use built-in function reload():

reload(module)
# For Python 2 and 3.2–3.3 use reload from module imp:

import imp
imp.reload(module)
# or
import importlib
importlib.reload(module)
  • When you run a Python module with python fibo.py <arguments>. Remember, everything in Python is an object.
  • The reference Python interpreter is CPython.
  • If a string should literally contain '\', prefix the literal with r for a raw string, e.g. r'Y:\codes'.
  • If the main directory has a subdirectory of packages, remember to add __init__.py in the subdirectory; it is best to create a main.py in the main directory.
  • When the interpreter runs a file as the main program, e.g. python program1.py, it sets the special variable __name__ in program1.py to '__main__'.
  • One of the reasons for doing this is that sometimes you write a module (a .py file) where it can be executed directly. Alternatively, it can also be imported and used in another module. By doing the main check, you can have that code only execute when you want to run the module as a program and not have it execute when someone just wants to import your module and call your functions themselves.
  • The module's code will be executed just as if you had imported it, but with __name__ set to "__main__". This means you can add this check at the end of your module.
  • can’t import module from upper directory.
    • need to add the working directory to .bashrc PYTHONPATH
    • using ipython.
  • Add a custom folder path to the Windows environment: append %PYTHONEXE%; to the System Variable PATH; add a system variable named PYTHONEXE with value C:\Users\Wei Wu\Anaconda2;C:\Users\Wei Wu\Python\ylib\src\py\; add PYTHONPATH with the same value; or add the module path in Spyder directly.
  • import a module temporarily from the parent directory without adding the path to the system.
# folder1
#    \__init__.py
#    \State.py
#    \StateMachine.py
#    \mouse_folder
#        \MouseAction.py
import os,sys,inspect
currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(currentdir)
sys.path.insert(0,parentdir+'\\mouse')
sys.path.insert(0,parentdir)

from State import State
from StateMachine import StateMachine
from MouseAction import MouseAction
  • check module path:
import os
print(os.path.abspath(ylib.__file__))
  • make a python 3 virtual environment:
mkvirtualenv -p python3 ENVNAME
  • install setup.py:

python setup.py install into a virtual environment: ~/.virtualenvs/data_analysis/bin/python2 setup.py install

  • install from github:

pip install git+https://github.com/quantopian/zipline.git

conda uninstall tqdm
easy_install git+https://github.com/quantopian/zipline.git
  • change conda source/configuration:
vim ~/.condarc
channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/msys2/
show_channel_urls: true

# or
conda config --add channels 'https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/'
conda config --set show_channel_urls yes
  • easy_install multiple versions, remove version:
import pkg_resources
pkg_resources.require("gensim")  # latest installed version
pkg_resources.require("gensim==3.7.2")  # this exact version
pkg_resources.require("gensim>=3.7.2")  # this version or higher

  • Removing an environment

To remove an environment, in your terminal window or an Anaconda Prompt, run:

conda remove --name myenv --all
# You may instead use conda env remove --name myenv.
# To verify that the environment was removed, in your terminal window or an Anaconda Prompt, run:
conda info --envs
  • create conda virtualenv:
# create a conda virtual environment (env_name is the name you want)
$ conda create --name env_name python=3.5

# e.g. to create a virtual environment named rqalpha
$ conda create --name rqalpha python=3.5

# activate the conda virtual environment
$ source activate env_name
# on Windows, just run activate
$ activate env_name

# deactivate the conda virtual environment
$ source deactivate env_name
# on Windows, just run deactivate
$ deactivate env_name

# remove the conda virtual environment
$ conda-env remove --name env_name

# add conda for all users
sudo ln -s /share/anaconda3/etc/profile.d/conda.sh /etc/profile.d/conda.sh
# Previous to conda 4.4, the recommended way to activate conda was to modify PATH in
# your ~/.bashrc file.  You should manually remove the line that looks like

    export PATH="/share/anaconda3/bin:$PATH"

# ^^^ The above line should NO LONGER be in your ~/.bashrc file! ^^^

  • percentage output format:
from __future__ import division
print("%s %.4f%%" % (sid, len(not_close) / len(ctp)))

23.2 zipimport — Import modules from Zip archives

23.3 pkgutil — Package extension utility

23.4 modulefinder — Find modules used by a script

23.5 runpy — Locating and executing Python modules

23.6 importlib — The implementation of import

24 Python Language Services

24.1 parser — Access Python parse trees

24.2 ast — Abstract Syntax Trees

24.3 symtable — Access to the compiler’s symbol tables

24.4 symbol — Constants used with Python parse trees

24.5 token — Constants used with Python parse trees

24.6 keyword — Testing for Python keywords

24.7 tokenize — Tokenizer for Python source

24.8 tabnanny — Detection of ambiguous indentation

24.9 pyclbr — Python class browser support

24.10 py_compile — Compile Python source files

24.11 compileall — Byte-compile Python libraries

24.12 dis — Disassembler for Python bytecode

24.13 pickletools — Tools for pickle developers

25 Miscellaneous Services

25.1 formatter — Generic output formatting

26 MS Windows Specific Services

26.1 msilib — Read and write Microsoft Installer files

26.2 msvcrt — Useful routines from the MS VC++ runtime

26.3 winreg — Windows registry access

26.4 winsound — Sound-playing interface for Windows

27 Unix Specific Services

27.1 posix — The most common POSIX system calls

27.2 pwd — The password database

27.3 spwd — The shadow password database

27.4 grp — The group database

27.5 crypt — Function to check Unix passwords

27.6 termios — POSIX style tty control

27.7 tty — Terminal control functions

27.8 pty — Pseudo-terminal utilities

27.9 fcntl — The fcntl and ioctl system calls

27.10 pipes — Interface to shell pipelines

27.11 resource — Resource usage information

27.12 nis — Interface to Sun’s NIS (Yellow Pages)

27.13 syslog — Unix syslog library routines

28 Superseded Modules

28.1 optparse — Parser for command line options

28.2 imp — Access the import internals

29 Undocumented Modules

29.1 Platform specific modules

29.2 call java service

import subprocess
try:
    subprocess.call(["java",
                     "-jar", grobid_jar,
                     # Avoid OutOfMemoryException
                     "-Xmx1024m",
                     "-gH", grobid_home,
                     "-gP", os.path.join(grobid_home,
                                         "config/grobid.properties"),
                     "-dIn", pdf_folder,
                     "-exe", "processReferences"])
    return True
except subprocess.CalledProcessError:
    return False

30 Data Analysis:

30.1 pandas:

advanced pandas

  • suppress scientific notation when displaying floats:
pd.set_option('display.float_format', lambda x: '%.4f' % x)
  • Find the column name which has the maximum value for each row
df.idxmax(axis=1)
  • change jupyter view:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
  • add new columns to a dataframe:
>>> df = pd.DataFrame([[i] for i in range(10)], columns=['num'])
>>> df
    num
0    0
1    1
2    2
3    3

>>> def powers(x):
>>>     return x, x**2, x**3, x**4, x**5, x**6

>>> df['p1'], df['p2'], df['p3'], df['p4'], df['p5'], df['p6'] = \
>>>     zip(*df['num'].map(powers))

  • first day of the previous month:
today = datetime.datetime.today()
the_last_day_of_previous_month = today.replace(day=1) - datetime.timedelta(days=1)
the_first_day_of_previous_month = the_last_day_of_previous_month.replace(day=1)
  • N month before:
today - pd.Timedelta(1, unit='M')
  • pandas columns into default dictionary:
df_patient[['pid','med_clinic_id']].groupby('pid').apply(lambda x:x['med_clinic_id'].tolist()).to_dict()
  • groupby order:
# select the first row in each group, group have multindex.
df_item.groupby(level=0, group_keys=False).apply(lambda x: x.sort_values('abs_diff', ascending=False)).groupby(level=0).head(3)
df_most_visited_hospitals.sort_values(['PERSON_ID','DISCHARGE_DATE'],ascending=True).groupby(['PERSON_ID', 'ICD3']).first()
  • filter by groupby count size:
df_patient = df_patient.groupby(['hos_id', 'disease']).filter(lambda x:x['person_id'].unique().size>=3)
  • apply to each group in groupby:
for group_id, data in df2.groupby('type'):
    print(group_id)
    if group_id == 'A':
        print(data.min())
  • groupby qcut:
df = pd.DataFrame({'A':'foo foo foo bar bar bar'.split(),
                   'B':[0.1, 0.5, 1.0]*2})

df['C'] = df.groupby(['A'])['B'].transform(
                     lambda x: pd.qcut(x, 3, labels=range(1,4)))
print(df)
# if error ValueError: Length mismatch: Expected axis has 5564 elements, new values have 78421 elements
# Groupby does not group the NaNs:
df_patient['disease_code'].fillna('unk', inplace=True)
  • average times groupby:
df_patient.reset_index().groupby('disease_code').apply(lambda x: x['med_clinic_id'].count()/x['person_id'].nunique())
  • ValueError: Bin edges must be unique:
pd.qcut(df['column'].rank(method='first'), nbins)
  • pd.to_datetimeindex error:

  • apply to a column for each row:
df = pd.DataFrame([[1,2,3],[4,5,6]], columns=['a','b','c'])
def rowFunc(row):
    return row['a'] + row['b'] * row['c']

def rowIndex(row):
    return row.name
df['d'] = df.apply(rowFunc, axis=1)
df['rowIndex'] = df.apply(rowIndex, axis=1)
df
Out[182]:
   a  b  c   d  rowIndex
0  1  2  3   7         0
1  4  5  6  34         1
  • permutation of an array:
person_id = df_disease_sample['person_id'].unique()
from itertools import permutations
perm = permutations(person_id, 2)
df_person_similarity_adj_matrix = pd.DataFrame(index=perm, columns=['similarity'])
df_person_similarity_adj_matrix
  • remove value in pandas index:
index.drop('value')
  • select rows with value in multiple columns:
df_suspicious_pairs[df_suspicious_pairs[['person_id_1', 'person_id_2']].isin(['11076976']).any(axis=1)]
def add_edges(G, row, df_suspicious_pairs):
    person_id = row.index
    df_targets = df_suspicious_pairs[df_suspicious_pairs[['person_id_1', 'person_id_2']].isin(person_id).any(axis=1)]
    G.add_edges_from([tuple(x) for x in df_targets[['person_id_1', 'person_id_2']].values])


df_suspicious_person.progress_apply(lambda x: add_edges(G, x, df_suspicious_pairs))
  • delete rows that contain string value in a column:
df_disease_sample[~df_disease_sample['treatment_code'].str.contains('DE')]
  • pandas values to dict:
med_hos_id_mapping.set_index('med_clinic_id')['hos_id'].to_dict()
  • create dataframe from a list of tuples:
pd.DataFrame.from_records(tuples)  # tuples is a list of tuples
  • create multiple index for index:
similarity_rank.index = pd.MultiIndex.from_tuples(similarity_rank.index)
  • groupby, transform, agg:
df = pd.DataFrame(dict(A=list('aabb'), B=[1, 2, 3, 4], C=[0, 9, 0, 9]))
# groupby is the standard use aggregater
df.groupby('A').mean()
# maybe you want these values broadcast across the whole group and return something with the same index as what you started with.
# use transform

df.groupby('A').transform('mean')
# the values are the per-group means broadcast back to the original index
df.set_index('A').groupby(level='A').transform('mean')
# set column equal to groupby mean
df_item_dis_lv['mean'] = df_item_dis_lv.groupby(['name', 'dis', 'JGDJ'])['MODEL_JE'].transform('mean')
# agg is used when you have specific things you want to run for different columns or more than one thing run on the same column.

df.groupby('A').agg(['mean', 'std'])


  • groupby aggregate to list:
df.groupby('a')['b'].apply(list)
  • read oracle database UnicodeDecodeError 'gb' :

set the language of the editor from Chinese to English.

import cx_Oracle as cx
import pandas as pd
import numpy as np
import os
from tqdm import *
from sqlalchemy import create_engine


os.environ['NLS_LANG'] = 'SIMPLIFIED CHINESE_CHINA.UTF8'

engine = create_engine('oracle://MMAPV41:MMAPV411556@192.168.4.32:1521/orcl?charset=utf8')

conn=cx.connect('MMAPV41/MMAPV411556@192.168.4.32/orcl')
sql_regist = """
select med_clinic_id, person_id, person_nm, person_sex,
person_age, in_hosp_date, out_hosp_date,
med_ser_org_no, clinic_type, in_diag_dis_nm, out_diag_doc_cd,
med_amout, hosp_lev from t_kc21
"""
df_regist = pd.read_sql_query(sql_regist, engine)

s_med_clinic_id = pd.read_pickle('med_clinic_id.pkl')
n = 100
sql_regist = """
select med_clinic_id, person_id, person_nm, person_sex,
person_age, in_hosp_date, out_hosp_date,
med_ser_org_no, clinic_type, in_diag_dis_nm, out_diag_doc_cd,
med_amout, hosp_lev from t_kc21 where med_clinic_id in (%s)
"""
df_regist = pd.DataFrame()
for i in tqdm(range(0, int(len(s_med_clinic_id)/100), n)):
    s = "'"+','.join(s_med_clinic_id.ix[i:(i+n)].values.flatten()).replace(',',"','")+"'"
    sql = sql_regist%(s)
    try:
        df_regist_iter = pd.read_sql(sql, conn)
        df_regist = df_regist.append(df_regist_iter)
    except UnicodeDecodeError:
        continue
df_regist.to_pickle("registration_data.pkl")
  • filter rows by number, groupby filter:
df_item = t_order.groupby(['name']).filter(lambda x:x['hos_id'].unique().size>=10)
  • resample groupby aggregate:
df_simul_sample = df_simul_sample.resample('1H')['PERSON_ID'].agg(list)
  • convert columns from capital to lower:
df_patient = df_patient.rename(columns = lambda x: x.lower())
  • find rows with nearest dates/value:
df_result = pd.DataFrame()
for idx, value in enumerate(groups):
    df_target_group = df_patient.loc[value]
    df_target_group.sort_values('入院日期', inplace=True)
    df_target_group['checkin_diff'] = df_target_group['入院日期'].diff()/np.timedelta64(1, 'D')
    df_target_group.reset_index(inplace=True)
    index = df_target_group[df_target_group['checkin_diff']<=3].index
    result = pd.concat([df_target_group.loc[index], df_target_group.loc[index-1]]).sort_values('入院日期')
    result.drop_duplicates(['个人ID','入院日期'], inplace=True)
    result['group'] = idx
    result['hospitals'] = result['机构'].unique().shape[0]
    df_result = df_result.append(result.drop('checkin_diff',axis=1))
  • save to excel sheet:
writer = pd.ExcelWriter('nanjing_result.xlsx')
df_nanjing.to_excel(writer,'全部分组')
big_groups.to_excel(writer,'超大组')
writer.save()
  • concate multiple columns into one column:
diff_checkout_next_checkin['date'] = dataframe.loc[index, columns].stack().sort_values()
diff_checkout_next_checkin['diff'] = diff_checkout_next_checkin['date'].diff()/np.timedelta64(1, 'M')

  • sort values:
df_nanjing.sort_values(['group_sn', '个人ID', '入院日期'], inplace=True)
  • read excel:
df_nanjing = pd.read_excel('result.xlsx',dtype={'证件号':str,
                                                '个人ID':str},
                         parse_dates=['入院日期','出院日期'])
  • groupby ratio:
df_items_sum = df_items.groupby(['disease_code', 'soc_srt_dire_nm']).agg({'amount': 'sum'})
# Change: groupby state_office and divide by sum
df_items_ratio = df_items_sum.groupby(level=0).apply(lambda x:
                                                 100 * x / float(x.sum()))
  • groupby group size:
group_size = df_nanjing.groupby(['group_sn'])['个人ID'].unique().apply(len)
big_groups = df_nanjing[df_nanjing['group_sn'].isin(group_size[group_size >= 8].index)]
small_groups = df_nanjing[df_nanjing['group_sn'].isin(group_size[group_size < 8].index)]
  • create a dataframe from a dictionary:
df = pd.DataFrame.from_dict({}, orient='index')

  • filter by value counts:

df_person_count = df_simul['PERSON_ID'].value_counts()
# filter person id count is less than 3 times
df_simul = df_simul[df_simul['PERSON_ID'].isin(df_person_count[df_person_count>3].index)]

  • get max value counts column name in each group:
df_sample1 = df_sample.sort_values(['person_id','discharge_date'],ascending=True).groupby(['person_id', 'icd3']).first().agg({'hosp_lev':pd.Series.value_counts})
df_sample1.groupby(['person_id']).sum().idxmax(axis=1)
  • groupby value counts max count name:
s_count = df_patient.groupby('disease_code')['mon'].value_counts()
s_count.name = 'cnt_mon'
df_disease['cnt_mon'] = s_count.reset_index().pivot(index='disease_code', columns='mon', values='cnt_mon').idxmax(axis=1)
  • get unique value counts within each group:
import pandas as pd
df = pd.DataFrame({'date': ['2013-04-01','2013-04-01','2013-04-01','2013-04-02', '2013-04-02'],
    'user_id': ['0001', '0001', '0002', '0002', '0002'],
    'duration': [30, 15, 20, 15, 30]})
df.groupby('date').agg({'user_id':pd.Series.nunique})
  • get unique columns value groups:
rels = ['疾病名称','诊疗大类2']
rels_cure = df_cure.groupby(rels).size().reset_index()[rels].values.tolist()
  • fill value must be in categories:

  • fill na with previous column:
df_tree.fillna(method='pad', axis=1, inplace=True)
  • delete a column in a dataframe:
del df['column']
  • create node edges
def create_node(label, nodes):
    count = 0
    for node_name in tqdm(nodes):
        node = Node(label, name=node_name)
        # g.schema.create_uniqueness_constraint(label, node_name)
        try:
            g.create(node)
            count += 1
        except ClientError:
            continue
        # debug(count)
    return


'''create relationship edges between entities'''
def create_relationship(start_node, end_node, edges, rel_type, rel_name):
    count = 0
    # de-duplicate the edges
    set_edges = []
    for edge in edges:
        try:
            set_edges.append('###'.join(edge))
        except TypeError:
            continue
    all = len(set(set_edges))
    for edge in tqdm(set(set_edges)):
        edge = edge.split('###')
        p = edge[0]
        q = edge[1]
        if p==q:
            continue
        query = "match(p:%s),(q:%s) where p.name='%s'and q.name='%s' create (p)-[rel:%s{name:'%s'}]->(q)" % (
            start_node, end_node, p, q, rel_type, rel_name)
        try:
            g.run(query)
            count += 1
            # debug(rel_type)
        except Exception as e:
            info(e)
    return

'''create the disease nodes at the center of the knowledge graph'''
def create_diseases_nodes(disease_infos):
    count = 0
    for disease_dict in tqdm(disease_infos):
        node = Node("Disease", name=disease_dict['name'], desc=disease_dict['desc'],
                    prevent=disease_dict['prevent'] ,cause=disease_dict['cause'],
                    easy_get=disease_dict['easy_get'],cure_lasttime=disease_dict['cure_lasttime'],
                    cure_department=disease_dict['cure_department']
                    ,cure_way=disease_dict['cure_way'] , cured_prob=disease_dict['cured_prob'])
        g.create(node)
        # count += 1
        # debug(count)
    return
  • select rows by column values:
df[(df['a'].isin(condition1))&(df['a'].isin(condition2))]
df[df['a']==a]
df_insurance_disease = df_insurance[(df_insurance['type']=='医疗费用-疾病') & (df_insurance['year'].isin(years[4:]))]

  • create tuples or dictionary from two columns:
subset = hos_ids[['GHDJID', 'REAL_JE', 'hos_id']].reset_index()
tuples = [tuple(x) for x in subset.values]
# another method
dict(df[['a', 'b']].values.tolist())
[tuple(x) for x in df[['a', 'b', 'c']].values]
  • mapping:
dictionary = df_mapping.set_index('a')['b'].to_dict()
df['a'].map(dictionary)
  • filter not null, filter not NaT rows:

since strings data types have variable length, it is by default stored as object dtype. If you want to store them as string type, you can do something like this.

df_text.ix[df_text.Conclusion.values.nonzero()[0]]
df['column'] = df['column'].astype('|S80') #where the max length is set at 80 bytes,
# or alternatively

df['column'] = df['column'].astype('|S') # which will by default set the length to the max len it encounters
  • change datatypes
df.a.astype(float)
drinks['beer_servings'] = drinks.beer_servings.astype(float)
  • change date type to string:
drinks['beer_servings'] = drinks.beer_servings.astype(str)

  • read csv without header, delimiter is space:
vocab = pd.read_csv("/home/weiwu/share/deep_learning/data/model/phrase/zhwiki/categories/三板.vocab",delim_whitespace=True,header=None)
  • parse text in csv
reddit_news = pd.read_csv('/home/weiwu/share/deep_learning/data/RedditNews.csv')
DJA_news = pd.read_csv(
    '/home/weiwu/share/deep_learning/data/Combined_News_DJIA.csv')
na_str_DJA_news = DJA_news.iloc[:, 2:].values
na_str_DJA_news = na_str_DJA_news.flatten()
na_str_reddit_news = reddit_news.News.values
sentences_reddit = [s.encode('utf-8').split() for s in na_str_reddit_news]
sentences_DJA = [s.encode('utf-8').split() for s in na_str_DJA_news]
  • rank

DataFrame.rank(axis=0, method='average', numeric_only=None, na_option='keep', ascending=True, pct=False)[source] Compute numerical data ranks (1 through n) along axis. Equal values are assigned a rank that is the average of the ranks of those values
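
A minimal sketch (the column name is illustrative); note how the tied 2s share the average of ranks 2 and 3:

>>> import pandas as pd
>>> df = pd.DataFrame({'a': [3, 1, 2, 2]})
>>> df['a'].rank()
0    4.0
1    1.0
2    2.5
3    2.5
Name: a, dtype: float64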

  • n largest value

DataFrame.nlargest(n, columns, keep='first') Get the rows of a DataFrame sorted by the n largest values of columns.

>>> df = DataFrame({'a': [1, 10, 8, 11, -1],
...                 'b': list('abdce'),
...                 'c': [1.0, 2.0, np.nan, 3.0, 4.0]})
>>> df.nlargest(3, 'a')
    a  b   c
3  11  c   3
1  10  b   2
2   8  d NaN

  • quantile

DataFrame.quantile(q=0.5, axis=0, numeric_only=True, interpolation='linear')[source] Return values at the given quantile over requested axis, a la numpy.percentile.

>>> df = DataFrame(np.array([[1, 1], [2, 10], [3, 100], [4, 100]]),
                   columns=['a', 'b'])
>>> df.quantile(.1)
a    1.3
b    3.7
dtype: float64
>>> df.quantile([.1, .5])
       a     b
0.1  1.3   3.7
0.5  2.5  55.0
  • generate a dataframe:
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
# or
df = pd.DataFrame(data={'a':[1,2],'b':[3,3]})
  • create diagonal matrix/dataframe using a series:
df = pd.DataFrame(np.diag(s), columns=Q.index)
  • connection with mysql:
pandas.read_sql_query(sql, con=engine):
pandas.read_sql_table(table_name, con=engine):
pandas.read_sql(sql, con=engine)
sql = 'DROP TABLE IF EXISTS etf_daily_price;'
result = engine.execute(sql)
  • dropna:
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

  • melt.

pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)[source]

This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.

"""
Parameters:
frame : DataFrame
id_vars : tuple, list, or ndarray, optional
Column(s) to use as identifier variables.
value_vars : tuple, list, or ndarray, optional
Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
var_name : scalar
Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.
value_name : scalar, default ‘value’
Name to use for the ‘value’ column.
col_level : int or string, optional
If columns are a MultiIndex then use this level to melt.
"""
DataFrame['idname'] = DataFrame.index
pd.melt(DataFrame, id_vars=['idname'])
>>> import pandas as pd
>>> df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
...                    'B': {0: 1, 1: 3, 2: 5},
...                    'C': {0: 2, 1: 4, 2: 6}})
>>> df
   A  B  C
0  a  1  2
1  b  3  4
2  c  5  6
>>> pd.melt(df, id_vars=['A'], value_vars=['B'])
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5
  • fill nan:
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
# method : {‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None
  • select non zero rows from series:
s[s.nonzero()]
  • create value by cretics
df[df.col1.map(lambda x: x != 0)] = 1
  • dataframe to series:
s = df[df.columns[0]]
  • replace value:
DataFrame.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad', axis=None)
df = df[df.line_race != 0]
  • pandas has value:
value in df['column_name'].values  # note: `value in df['column_name']` checks the index, not the values
set(a).issubset(df['a'])
  • calculate each value as a percentage of its column sum (axis=0):
df.apply(lambda x: x / x.sum() * 100, axis=0)
  • pandas has null value:
df.isnull().values.any()

  • find the rows where two dataframes differ:
z = (a != b)
rows = z.any(axis=1)
pd.concat([a.loc[rows], b.loc[rows]], axis=1)
  • reduce
from functools import reduce
reduce(lambda x, y: x+y, range(1,101))
  • if array a is a subset of another array b:
set(B).issubset(set(A))
  • remove negative value from a column:
filtered_1 = b['TRADE_size'].apply(lambda x: 0 if x < 0 else x)
b.loc[b['TRADE_size'] < 0, 'TRADE_size'] = 0
  • drop all rows value equal to 0:
df.loc[~(df==0).all(axis=1)]
  • drop columns/lable:
DataFrame.drop(labels, axis=1, level=None, inplace=False, errors='raise')
  • check if any value is NaN in DataFrame
df.isnull().values.any()
df.isnull().any().any()
  • maximum & minimum value of a dataframe:
df.values.max()
df.values.min()
  • select value by creteria:
logger.debug("all weight are bigger than 0? %s", (df_opts_weight>0).all().all())
logger.debug("all weight are smaller than 1? %s", (df_opts_weight<=1).all().all())
logger.debug("weight sum smaller than 0: %s", df_opts_weight[df_opts_weight<0].sum(1))
  • count all duplicates:
import pandas as pd
In [15]: a=pd.DataFrame({'a':['KBE.US','KBE.US','KBE.US','KBE.US','KBE.US','KBE.US','O.US','O.US','O.US','O.US','O.US'],'b':['KBE','KBE','KBE','KBE','KBE','KBE','O','O','O','O','O']})

In [16]: count = a.groupby('a').count()

In [20]: (count>5).all().all()
Out[20]: False

In [21]: (count>4).all().all()
Out[21]: True

  • datetime64[ns] missing data, null:
For datetime64[ns] types, NaT represents missing values. This is a pseudo-native sentinel value that can be represented by numpy in a singular dtype (datetime64[ns]). pandas objects provide intercompatibility between NaT and NaN.

In [16]: df2
Out[16]:
        one       two     three four   five  timestamp
a -0.166778  0.501113 -0.355322  bar  False 2012-01-01
c -0.337890  0.580967  0.983801  bar  False 2012-01-01
e  0.057802  0.761948 -0.712964  bar   True 2012-01-01
f -0.443160 -0.974602  1.047704  bar  False 2012-01-01
h -0.717852 -1.053898 -0.019369  bar  False 2012-01-01

In [17]: df2.loc[['a','c','h'],['one','timestamp']] = np.nan

In [18]: df2
Out[18]:
        one       two     three four   five  timestamp
a       NaN  0.501113 -0.355322  bar  False        NaT
c       NaN  0.580967  0.983801  bar  False        NaT
e  0.057802  0.761948 -0.712964  bar   True 2012-01-01
f -0.443160 -0.974602  1.047704  bar  False 2012-01-01
h       NaN -1.053898 -0.019369  bar  False        NaT
  • rename column names:
df_bbg = df_bbg.rename(columns = lambda x: x[:4].replace(' ',''))
df = df.rename(columns={'a':'A'})
  • rename according to column value type:

    name = {2:'idname', 23:'value', 4:'variable'}
    df.rename(columns=lambda x: name[(gftIO.get_column_type(df,x))], inplace=True)
    
  • rename column according to value:
name = {'INNERCODE': 'contract_code', 'OPTIONCODE': 'contract_name',
        'SETTLEMENTDATE': 'settlement_date', 'ENDDATE': 'date',
        'CLOSEPRICE': 'close_price'}
data.rename(columns=lambda x: name[x], inplace=True)

  • remove characters after the first space in column names:
df_bbg = df_bbg.rename(columns=lambda x: x.split(' ')[0])
  • apply by group:
df_long_term = small_groups.groupby('个人ID').progress_apply(lambda x: long_term_hospitalization(x[['入院日期', '出院日期']], days=30))
  • pandas long format to pivot:
pivoted = df.pivot('name1','name2','name3')
specific_risk = self.risk_model['specificRisk'].pivot(
    index='date', columns='symbol', values='specificrisk')
df_pivot_industries_asset_weights = pd.pivot_table(
        df_industries_asset_weight, values='value', index=['date'],
        columns=['industry', 'symbol'])
  • pivot time series to hourly
df['hour'] = df['date'].dt.hour
df['day'] = df['date'].dt.date
pd.pivot_table(df,index='hour',columns='day',values='pb',aggfunc=np.sum)
  • change the time or date or a datetime:
end = end.replace(hour=23, minute=59, second=59)
  • Wind (万德) Python API data into pandas
df = pd.DataFrame(data=w.wsd().Data[0], index=w.wsd().Times)
  • check DatetimeIndex difference:
# to check the frequency of the strategy, DAILY or MONTHLY
dt_diff = df_single_period_return.index.to_series().diff().mean()
if dt_diff < pd.Timedelta('3 days'):
    frequency = 'DAILY'  # otherwise treat the strategy as MONTHLY
  • time delta
import datetime
s + datetime.timedelta(minutes=5)
  • resample by a column:

Set the index to a DatetimeIndex first, then call the resample function (see the sketch below).
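
A minimal sketch, assuming columns named date and value:

import pandas as pd

df = df.set_index(pd.DatetimeIndex(df['date']))
hourly = df['value'].resample('1H').sum()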

  • resample at a fraction but keep at least one:
med_id_sample = df_patient.groupby(['JGID','CYZDDM3']).apply(lambda x :x.iloc[random.choice(range(0,len(x)))])['GHDJID'].values
med_id_sample1 = df_patient.groupby(['JGID','CYZDDM3']).apply(lambda x: x.sample(frac=0.1))['GHDJID'].values
med_id_samples = np.unique(np.concatenate((med_id_sample, med_id_sample1)))
df_patient[df_patient['GHDJID'].isin(med_id_samples)]
  • resample by month and keep the last valid row
benchmark_weight.index.name = 'Date'
m = benchmark_weight.index.to_period('m')
benchmark_weight = benchmark_weight.reset_index().groupby(m).last().set_index('Date')
benchmark_weight.index.name = ''
  • groupby item ratio:
df_items_sum = df_items.groupby(['disease_code', 'soc_srt_dire_nm']).agg({'amount': 'sum'})
# Change: groupby state_office and divide by sum
df_items_ratio = df_items_sum.groupby(level=0).apply(lambda x:
                                                 100 * x / float(x.sum()))
  • groupby and sort by another column:
df_input_text_entity.sort_values(['score'],ascending=False).groupby('mention').head(1) # only take the largest value of score
  • filter two dataframe by columns' value
pd.merge(df_input_text_entity_0, df_input_text_entity_1, on=['mention', 'entity'])

30.1.1 multiplying

  • the multiplying calculation is not about the sequence of the index or column.

pandas will calculate on a sorted index and column value.

In [87]: a=pd.DataFrame({'dog':[1,2],'fox':[3,4]},index=['a','b'])

In [88]: a
Out[88]:
   dog  fox
a    1    3
b    2    4

In [89]: b=pd.DataFrame({'fox':[1,2],'dog':[3,4]},index=['b','a'])

In [94]: b
Out[94]:
   dog  fox
b    3    1
a    4    2

In [95]: a*b
Out[95]:
   dog  fox
a    4    6
b    6    4
  • dot multiplying

dot multiplying will sort the value.

In [99]: a.dot(b.T)
Out[99]:
    b   a
a   6  10
b  10  16

In [100]: b.T
Out[100]:
     b  a
dog  3  4
fox  1  2

In [105]: a
Out[105]:
   dog  fox
a    1    3
b    2    4

30.1.2 Index

  1. Index manuplication
    • set column as datetime index
    index = index.set_index(pd.DatetimeIndex(index['tradeDate'])).drop('tradeDate', axis=1)
    # df = df.set_index(pd.DatetimeIndex(df['Date']))
    
    • concaterate:
    pd.concat([df1, df2], axis=0).sort_index()
    pd.concat([df1, df2], axis=1)
    result = df1.join(df2, how='outer')
    
    • check if the index is datetimeindex:
    if isinstance(df_otv.index, pd.DatetimeIndex):
        df_otv.reset_index(inplace=True)
    
    
    • pandas are two dataframe identical
    pandas.DataFrame.equals()
    
    
    • change index name:
    df.index.names = ['Date']
    
    • for loop in pandas dataframe:
    for index, row in df.iterrows():
    
    • compare two time series:
    s1[s1.isin(s2)]
    ax = df1.plot()
    df2.plot(ax=ax)
    
    • datetime to string:
    df.index.strftime("%Y-%m-%d %H:%M:%S")
    
    • concaterate index
    pd.concat([df1, df2], axis=1)
    

    concat combines two dataframes by index, preserving the columns.
    A: index | variable, value    B: index | variable, value

    pd.concat([A, B]) -> index | variable, value (rows of A stacked on rows of B)
    pd.concat([A, B], axis=1) -> index | variable, value, variable, value (side by side)

  2. merge

    merge joins two dataframes on columns.
    A: index, variable, value    B: index, variable, value

    pd.merge(A, B, how='left', on=['index', 'variable']) -> index, variable, value_x, value_y
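
    A minimal sketch matching the diagram above (illustrative frames; the shared value column gets _x/_y suffixes):

    import pandas as pd
    A = pd.DataFrame({'index': [0, 1], 'variable': ['x', 'y'], 'value': [1, 2]})
    B = pd.DataFrame({'index': [0, 1], 'variable': ['x', 'y'], 'value': [10, 20]})
    pd.merge(A, B, how='left', on=['index', 'variable'])
    #    index variable  value_x  value_y
    # 0      0        x        1       10
    # 1      1        y        2       20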

  3. update

    update dataframe1 in place with non-NA values from dataframe2, aligning on index and columns (see the sketch below)
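
    A minimal sketch (illustrative frames):

    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame({'a': [1, 2], 'b': [np.nan, 4]})
    df2 = pd.DataFrame({'b': [10, 40]})
    df1.update(df2)  # df1['b'] is now [10.0, 40.0]; column 'a' is untouched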

  4. access hierarchical index.
    • A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays), an array of tuples (using MultiIndex.from_tuples), or a crossed set of iterables (using MultiIndex.from_product).
    df.loc['date', 'col'], df['date'], df.ix[['date1', 'date2']]
    
    • slicing:
    df.loc['start':'end',], df['start': 'end']
    
    • slice with a ‘range’ of values, by providing a slice of tuples:
    df.loc[('2006-11-02','USO.US'):('2006-11-06','USO.US')]
    df.loc(axis=0)[:,['SPY.US']]
    
    • select certain columns:
    df.loc(axis=0)[:,['SPY.US']]['updatedTime']
    
    • select rows with certain column value:
    df.loc[df['column_name'].isin(some_values)]
    
    • select date range using pd series.
    date_not_inserted = whole_index[~whole_index.isin(date_in_database['date'])]
    df_need_to_be_updated = whole_df_stack.ix[days_not_in_db]
    
  5. remove pandas duplicated index
    1. #1
      grouped = sym.groupby(level=0)
      sym = grouped.last()
      
    2. #2
      df2[~df2.index.duplicated()]
      
    3. remove duplicated rows
      pandas.DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
      # subset : column label or sequence of labels, optional
      
  6. convert a dataframe to an array:
    df.values  # or df.to_numpy() in newer pandas
    
  7. panel:
    • create from dictionary:
    datetime_index = pd.DatetimeIndex(assets_group['date'].unique())
    panel_model = pd.Panel({date: pd.DataFrame(0, index=assets.loc[date,'variable'],
                                               columns=assets.loc[date,'variable']) for date in datetime_index})
    

    pandas panel item axis should be datetime64, this should not be an array.

  8. unpivot multindex, multindex into colum:
    df_med_similarity_adj_matrix['similarity'] = df_med_similarity_adj_matrix.apply(
        lambda x: 1 - jaccard(docs[x.name[0]], docs[x.name[1]]), axis=1)
    
    df_med_similarity_adj_matrix.index = pd.MultiIndex.from_tuples(df_med_similarity_adj_matrix.index)
    df_med_similarity_adj_matrix.reset_index().pivot(index='level_0', columns='level_1', values='similarity')
    

30.2 numpy

  • numpy unique without sort:
>>> import numpy as np
>>> a = [4,2,1,3,1,2,3,4]
>>> np.unique(a)
array([1, 2, 3, 4])
>>> indexes = np.unique(a, return_index=True)[1]
>>> [a[index] for index in sorted(indexes)]
[4, 2, 1, 3]

  • plot histogram:
>>> import matplotlib.pyplot as plt
>>> rng = np.random.RandomState(10)  # deterministic random data
>>> a = np.hstack((rng.normal(size=1000),
...                rng.normal(loc=5, scale=2, size=1000)))
>>> plt.hist(a, bins='auto')  # arguments are passed to np.histogram
>>> plt.title("Histogram with 'auto' bins")
>>> plt.show()
  • which quantile:
df_average_cost['COST_LEVEL'] = pd.qcut(df_average_cost['MED_AMOUNT'], 3, labels=["high", "medium", "low"])
  • quantile:
numpy.quantile(a, q, axis=None, out=None, overwrite_input=False, interpolation='linear', keepdims=False)
  • upper triangle matrix:
import numpy as np

a = np.array([[1,2,3],[4,5,6],[7,8,9]])

#array([[1, 2, 3],
#       [4, 5, 6],
#       [7, 8, 9]])

a[np.triu_indices(3, k = 1)]

# this returns the following
array([2, 3, 6])
  • sort an array by descending:
In [25]: temp = np.random.randint(1,10, 10)

In [26]: temp
Out[26]: array([5, 2, 7, 4, 4, 2, 8, 6, 4, 4])

In [27]: id(temp)
Out[27]: 139962713524944

In [28]: temp[::-1].sort()

In [29]: temp
Out[29]: array([8, 7, 6, 5, 4, 4, 4, 4, 2, 2])

In [30]: id(temp)
Out[30]: 139962713524944
  • save an array:
import numpy as np
np.save(filename, array)
  • maximum value in each row
np.amax(ar, axis=1)
  • from 2-D array to 1-D array with one column
import numpy as np
a = np.array([[1], [2], [3]])
a.flatten()
  • Take a sequence of 1-D arrays and stack them as columns to make a single 2-D:
numpy.column_stack(tup)
Parameters:
tup : sequence of 1-D or 2-D arrays.
Arrays to stack. All of them must have the same first dimension.
>>> a = np.array((1,2,3))
>>> b = np.array((2,3,4))
>>> np.column_stack((a,b))

  • expand 1-D numpy array to 2-D:

  • expand the shape of an array:
numpy.expand_dims(a, axis)
# Expand the shape of an array.
# Insert a new axis that will appear at the axis position in the expanded array shape.
>>> x = np.array([1,2])
>>> x.shape
(2,)
>>> y = np.expand_dims(x, axis=0)
>>> y
array([[1, 2]])
>>> y.shape
(1, 2)
  • count non-NaN values:
np.count_nonzero(~np.isnan(df['series']))
  • count number of negative value:
np.sum((df < 0).values.ravel())
  • check the difference of two arrays:

numpy.setdiff1d: Return the sorted, unique values in ar1 that are not in ar2

np.setdiff1d(ar1, ar2)
  • turn a list of tuples into a list:
[item for t in lt for item in t]
  • sorted a list of tuples
sorted(enumerate(sims), key=lambda item: -item[1])
  • reshape:

arr.reshape(1, -1): -1 means that dimension's size is inferred automatically (see the sketch below).
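
A quick sketch:

import numpy as np

x = np.arange(6)
print(x.reshape(1, -1).shape)  # (1, 6): -1 infers the dimension
print(x.reshape(2, -1).shape)  # (2, 3)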

  • select random symbols from a listdir:
# get random symbols at the target position limit
position_limit = 8
arr = list(range(len(target_symbols)))
np.random.shuffle(arr)
target_symbols = target_symbols[arr[:position_limit]]

30.3 plot:

  • %matplotlib inline

To set this up, before any plotting or import of matplotlib is performed you must execute the %matplotlib magic command. This performs the necessary behind-the-scenes setup for IPython to work correctly hand in hand with matplotlib; it does not, however, actually execute any Python import commands, that is, no names are added to the namespace.
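
A typical first cell of a notebook (sketch):

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

plt.plot(np.arange(10))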

30.3.1 subplot with the same axis:

pandas plot. using matplotlib:

  • plot different series on the same chart.
cl_active_contract_pricing.plot()
cl_pricing.plot(style='k--')
  • plot in ipython or jupyter notebook:
ax = contract_data.plot(legend=True)
continuous_price.plot(legend=True, style='k--', ax=ax)
plt.show()

30.3.2 multiple figures: draw several figures at once

  • same kind of plot, in multiple figure windows:
# figure.py

import matplotlib.pyplot as plt
import numpy as np

data = np.arange(100, 201)
plt.plot(data)

data2 = np.arange(200, 301)
plt.figure()
plt.plot(data2)

plt.show()
  • multiple subplots: show several plots in one window
import matplotlib.pyplot as plt
fig = plt.figure()
ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 2)
ax3 = fig.add_subplot(2, 2, 3)
fig, axes = plt.subplots(2, 3)
# plt.subplots returns a (fig, axes) tuple
# or
data = np.arange(100, 201)
plt.subplot(2, 1, 1)
plt.plot(data)

data2 = np.arange(200, 301)
plt.subplot(2, 1, 2)
plt.plot(data2)

plt.show()

30.3.3 subplot with different axis

plt.subplot(2, 1, 1)
plt.boxplot(x1)
plt.plot(1, x1.ix[-1], 'r*', markersize=15.0)

plt.subplot(2, 1, 2)
x1.plot()
# or
fig, axes = plt.subplots(2, 1, figsize=(10, 14))
axes[0].boxplot(pe000001)
axes[0].plot(1, pe000001.ix[-1], 'r*', markersize=15.0)

pe000001.plot()

30.3.4 plot a secondary y scale

df.price.plot(legend=True)
(100-df.pct_long).plot(secondary_y=True, style='g', legend=True)
  • highlight a certain value in the plot:
a['DGAZ.US'].hist(bins=50)
plt.axvline(a['DGAZ.US'][-1], color='b', linestyle='dashed', linewidth=2)

30.3.5 plot seaborn:

  • plot heatmap:
figure = plt.figure(figsize=(12,12))
ax = sns.heatmap(temp, vmin=0, vmax=10, fmt="d", cmap="YlGnBu", annot=True)
# save plot
ax.get_figure()
figure.savefig('./images/test.png')
  • save seaborn heatmap:
plt.subplots(figsize=(12,12))

# fig, ax = plt.subplots(figsize=(12,12))
ax = sns.heatmap(temp, vmin=0, vmax=10, fmt="d", cmap="YlGnBu", annot=True)
# !!! can't save figure directly, need to get figure first.
figure = ax.get_figure()
figure.savefig('./images/%s.png'%(str(person_ids)))

30.3.6 plot a 3d figure:

from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

strike = np.linspace(50, 150, 5)
ttm = np.linspace(0.5, 2.5, 8)

strike, ttm = np.meshgrid(strike, ttm)
iv = (strike - 100) ** 2 / (100 * strike) / ttm
fig = plt.figure(figsize=(9,6))
ax = fig.gca(projection='3d')
surf = ax.plot_surface(strike, ttm, iv, rstride=2, cstride=2,
                       cmap=plt.cm.coolwarm, linewidth=0.5,
                       antialiased=True)
fig.colorbar(surf, shrink=0.5, aspect=5)

fig is the matplotlib.figure.Figure object.

  • ax can be either a single axes object or an array of axes objects if more than one subplot was created.
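
A small sketch of what plt.subplots returns (toy data):

import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(2, 3)   # axes is a 2x3 array of Axes objects
axes[0, 0].plot(np.arange(10))
fig, ax = plt.subplots()         # a single subplot -> a single Axes object
ax.plot(np.arange(10) ** 2)
plt.show()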


30.3.7 display Chinese:

➜  log git:(master) ✗ fc-list :lang=zh
/usr/share/fonts/truetype/wqy/wqy-microhei.ttc: 文泉驿微米黑,文泉驛微米黑,WenQuanYi Micro Hei:style=Regular
/usr/share/fonts/truetype/droid/DroidSansFallbackFull.ttf: Droid Sans Fallback:style=Regular
/usr/share/fonts/truetype/wqy/wqy-microhei.ttc: 文泉驿等宽微米黑,文泉驛等寬微米黑,WenQuanYi Micro Hei Mono:style=Regular
import matplotlib.font_manager as mfm
import matplotlib.pyplot as plt
font_path = "/usr/share/fonts/truetype/wqy/wqy-microhei.ttc"
prop = mfm.FontProperties(fname=font_path)
plt.text(0.5, 0.5, s=u'测试', fontproperties=prop)
plt.show()

30.3.8 stacked barplot, portfolio change

pivot_df_insurance_disease = df_insurance_disease.pivot(
    index='year', columns='cat_age_sex', values='basic_insurance_fee')
# title: "medical insurance - disease pure premium"
pivot_df_insurance_disease.plot.bar(title='医疗保险-疾病纯保费', stacked=True, figsize=(10, 7))
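
A self-contained sketch of the same pivot-then-stack pattern (toy data; all names are illustrative):

import pandas as pd

df = pd.DataFrame({'year': [2019, 2019, 2020, 2020],
                   'cat_age_sex': ['A', 'B', 'A', 'B'],
                   'fee': [10, 20, 15, 25]})
pivot_df = df.pivot(index='year', columns='cat_age_sex', values='fee')
pivot_df.plot.bar(stacked=True, figsize=(10, 7))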

30.4 scipy

  • combination k from n.

\[ \binom{n}{k} = \frac{n(n-1) \dotsb (n-k+1)}{k(k-1) \dotsb 1} \]

which can be written using factorials as \[ \binom{n}{k} = \frac{n!}{k!\,(n-k)!} \]

>>> import numpy as np
>>> from scipy.special import comb
>>> k = np.array([3, 4])
>>> n = np.array([10, 10])
>>> comb(n, k, exact=False)
array([ 120.,  210.])
>>> comb(10, 3, exact=True)
120
>>> comb(10, 3, exact=True, repetition=True)
220

[https://docs.scipy.org/doc/scipy/reference/generated/scipy.misc.comb.html]

30.5 networkx:

  • list the neighbors of a node:
G.neighbors('node')
  • count the in/out edges of a node (directed graphs):
G.in_degree('node')
G.out_degree('node')
  • create an edge:
G.add_edges_from([(a,b)])
G.add_weighted_edges_from([(a,b,weight)])
  • create a graph:
G = nx.Graph()
# directed graph
G = nx.DiGraph()
  • dump a graph:
nx.write_gexf(G, 'file/path.gexf')
  • draw a graph:
nx.draw(G, with_labels=True)
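
Putting the pieces together, a minimal end-to-end sketch (node names and the output path are illustrative):

import networkx as nx

G = nx.DiGraph()
G.add_edges_from([('a', 'b')])
G.add_weighted_edges_from([('b', 'c', 0.5)])
print(list(G.neighbors('b')))               # ['c']
print(G.in_degree('b'), G.out_degree('b'))  # 1 1
nx.write_gexf(G, '/tmp/graph.gexf')
nx.draw(G, with_labels=True)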

31 Machine learning:

31.1 data processing:

  • encoding: map class labels onto a range of integers:
from sklearn.preprocessing import LabelEncoder
class_label = LabelEncoder()
data["label"] = class_label.fit_transform(data["label"].values)
# or build the mapping by hand and apply it
label_mapping = {label: idx for idx, label in enumerate(np.unique(data["label"]))}
data["label"] = data["label"].map(label_mapping)

  • one-hot encoding:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
X = data[["color", "price"]].values
# label-encode the color column into integers
color_label = LabelEncoder()
X[:, 0] = color_label.fit_transform(X[:, 0])
# one-hot encode the color column
# (categorical_features was removed in newer sklearn; see the sketch below)
one_hot = OneHotEncoder(categorical_features=[0])
print(one_hot.fit_transform(X).toarray())
# or
pd.get_dummies(data[["color", "price"]])
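
In newer scikit-learn the categorical_features argument is gone; a hedged sketch of the equivalent using ColumnTransformer:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one-hot encode column 0 (color), pass the remaining columns through unchanged
ct = ColumnTransformer(
    [('onehot', OneHotEncoder(), [0])],
    remainder='passthrough')
print(ct.fit_transform(X))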

32 Deep Learning

32.1 Tensorflow

  • convert strings to integer indices (TF 1.x API; a TF 2.x sketch follows below):
with tf.Session() as sess:
    mapping_strings = tf.constant(["emerson", "lake", "palmer"])
    feats = tf.constant(["emerson", "lake", "and", "palmer"])
    ids = tf.contrib.lookup.string_to_index(
        feats, mapping=mapping_strings, default_value=-1)
    tf.compat.v1.tables_initializer().run()
    idx = ids.eval()#   ==> [0, 1, -1, 2]
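In TF 2.x, tf.contrib is gone; a hedged sketch of the same lookup using tf.lookup.StaticHashTable:

import tensorflow as tf

mapping_strings = tf.constant(["emerson", "lake", "palmer"])
table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(
        mapping_strings, tf.range(3, dtype=tf.int64)),
    default_value=-1)
ids = table.lookup(tf.constant(["emerson", "lake", "and", "palmer"]))
print(ids.numpy())  # [ 0  1 -1  2]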
  • disable tensorflow warnings:
os.environ["TF_CPP_MIN_LOG_LEVEL"]="2"
  • install tensorflow:
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --set show_channel_urls yes
conda install tensorflow-gpu==1.3
  • test drive:
python -m tensorflow.models.image.mnist.convolutional

32.1.1 GPU test:

TensorFlow 1.x session API:

# 1) basic matmul, logging which device each op is placed on
import tensorflow as tf

a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print(sess.run(c))

# 2) pin the ops to the first GPU explicitly
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print(sess.run(c))

# the '/device:GPU:1' form addresses the second GPU
with tf.device('/device:GPU:1'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print(sess.run(c))

# 3) allow_soft_placement=True falls back to the CPU when the device is unavailable
with tf.device('/device:GPU:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

sess = tf.Session(config=tf.ConfigProto(
    allow_soft_placement=True, log_device_placement=True))
print(sess.run(c))

# 4) list the devices TensorFlow can see
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

# 5) time a small computation on the GPU
import numpy as np
import tensorflow as tf
from datetime import datetime

device_name = "/GPU:0"

with tf.device(device_name):
    random_matrix = tf.random_uniform(shape=(1, 1), minval=0, maxval=1)
    dot_operation = tf.matmul(random_matrix, tf.transpose(random_matrix))
    sum_operation = tf.reduce_sum(dot_operation)

startTime = datetime.now()
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as session:
    result = session.run(sum_operation)
    print(result)

print("\n" * 5)
print("Time taken:", datetime.now() - startTime)
print("\n" * 5)

32.2 Computer Vision

32.2.1 style transfer

32.2.2 basic operation

  • convert an image into grey:
from PIL import Image

image = Image.open('/tmp/capcha.png')
image = image.convert('L')  # convert to greyscale

# binarize: threshold each pixel at 125
data = image.load()
w, h = image.size
for i in range(w):
    for j in range(h):
        if data[i, j] > 125:
            data[i, j] = 255  # pure white
        else:
            data[i, j] = 0  # pure black

image.save('clean_captcha.png')

img = cv2.imread(r'D:/UNI/Y3/DIA/2K18/lab.jpg')
RGB_img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
gray = cv2.cvtColor(RGB_img, cv2.COLOR_RGB2GRAY)
equ = cv2.equalizeHist(gray)
res = np.hstack((gray, equ))  # stack the two single-channel images side by side
plt.imshow(gray, cmap='gray', vmin=0, vmax=255)
  • show image in jupyter notebook:
from IPython.display import Image
Image(filename=content_seg_path)
  • read an image into a numpy array:
import cv2
import scipy.misc  # scipy.misc.imread was removed in scipy >= 1.2; use imageio.imread there
init_img = cv2.imread(initImg)  # [y, x, 3] BGR
content = scipy.misc.imread(contentImg, mode='RGB')
  • convert a numpy array to an image:
import scipy.misc
rgb = scipy.misc.toimage(np_array)  # removed in scipy >= 1.2; a PIL sketch follows
cv2.imwrite('color_img.jpg', np_array)
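On newer scipy, toimage is gone; a hedged PIL equivalent:

import numpy as np
from PIL import Image

# assumes np_array is an HxWx3 array with values in 0-255
Image.fromarray(np_array.astype(np.uint8)).save('color_img.jpg')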
  • read image into torch tensor:
import numpy as np
from PIL import Image
from torchvision import transforms
from torchvision.utils import save_image
def open_image(image_path, image_size=None):
    """Open an image, optionally resize it, crop to a multiple of 16,
    and return a 1x3xHxW tensor."""
    image = Image.open(image_path)
    _transforms = []
    if image_size is not None:
        image = transforms.Resize(image_size)(image)
        # _transforms.append(transforms.Resize(image_size))
    w, h = image.size
    _transforms.append(transforms.CenterCrop((h // 16 * 16, w // 16 * 16)))
    _transforms.append(transforms.ToTensor())
    transform = transforms.Compose(_transforms)
    result = transform(image)[:3].unsqueeze(0)
    return result
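
A usage sketch (the path is illustrative); save_image, imported above, writes the tensor back out:

tensor = open_image('/tmp/input.jpg', image_size=512)
save_image(tensor, '/tmp/out.png')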

  • concatenate images into one:
def cat_images(fname, ls_images_path):
    images = []
    max_width = 0 # find the max width of all the images
    total_height = 0 # the total height of the images (vertical stacking)
    for name in ls_images_path:
        # open all images and find their sizes
        images.append(cv2.imread(name))
        if images[-1].shape[1] > max_width:
            max_width = images[-1].shape[1]
        total_height += images[-1].shape[0]

    def cat_arrays(total_height,max_width,arrays):
        # create a new array with a size large enough to contain all the images
        final_array = np.zeros((total_height,max_width,3),dtype=np.uint8)

        current_y = 0 # keep track of where your current image was last placed in the y coordinate
        for array in arrays:
            # add an image to the final array and increment the y coordinate
            final_array[current_y:array.shape[0]+current_y,:array.shape[1],:] = array
            current_y += array.shape[0]
        return final_array
    final_image = cat_arrays(total_height, max_width, images)
    cv2.imwrite(fname,final_image)
cat_images(fname, image_names)
  • overlay visualize:
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
from PIL import Image

def vis_overlay(original_im, seg_map):
  """Visualizes input image, segmentation map and overlay view."""
  original_im = Image.open(original_im)
  seg_map = Image.open(seg_map)
  plt.figure(figsize=(15, 5))
  grid_spec = gridspec.GridSpec(1, 4, width_ratios=[6, 6, 6, 1])

  plt.subplot(grid_spec[0])
  plt.imshow(original_im)
  plt.axis('off')
  plt.title('input image')

  plt.subplot(grid_spec[1])
  # seg_image = label_to_color_image(seg_map).astype(np.uint8)
  plt.imshow(seg_map)
  plt.axis('off')
  plt.title('result image')

  plt.subplot(grid_spec[2])
  plt.imshow(original_im)
  plt.imshow(seg_map, alpha=0.7)
  plt.axis('off')
  plt.title('overlay image')
  plt.show()

33 NLP

33.1 Keywords

  • corpus
input_text_translations = """
The Chinese government’s top management obviously also hopes to avoid the deterioration of the Sino-US conflict. The Sino-US trade war has started. After the Sino-US trade war began, it emphasized that China has "five advantages" in the trade war. He stressed: "We must especially prevent Sino-US cooperation. Trade conflict spreads to the ideological field
"""
from gensim import corpora, models, similarities
from nltk.tokenize import word_tokenize, sent_tokenize
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

# split into sentences, then tokenize each sentence
wordstest_model = sent_tokenize(input_text_translations)
test_model = [word_tokenize(_d.lower()) for _d in wordstest_model]
dictionary = corpora.Dictionary(test_model, prune_at=2000000)
# for key in dictionary.keys():
#     print(key, dictionary.get(key), dictionary.dfs[key])
corpus_model = [dictionary.doc2bow(test) for test in test_model]
tfidf_model = models.TfidfModel(corpus_model)
# generate tf-idf weights for the corpus
corpus_tfidf = tfidf_model[corpus_model]
d = {dictionary.get(id): value for doc in corpus_tfidf for id, value in doc}

# get the topic-word distribution
tokenized_data = test_model
dictionary1 = corpora.Dictionary(tokenized_data)
dictionary1.filter_n_most_frequent(10)
dictionary1.filter_extremes(no_above=0.9)
# build the bag-of-words corpus with the filtered dictionary so the ids line up
corpus = [dictionary1.doc2bow(text) for text in tokenized_data]
filtered_words = [dictionary[y] for y in [x for x in dictionary.keys() if x not in dictionary1.keys()]]
NUM_TOPICS = 5
# Build the LDA model
lda_model = models.LdaModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary1, per_word_topics=True, alpha='asymmetric', minimum_probability=0.0)
topic_distribution = lda_model.show_topics(num_words=50)
df_topic_word_dis = pd.DataFrame([x[1].split(' + ') for x in topic_distribution]).T
# document frequency of a token (the key below is a Chinese medical term:
# "serum alpha-hydroxybutyrate dehydrogenase test")
dictionary.dfs[dictionary.token2id["血清α羟基丁酸脱氢酶测定"]] / dictionary.num_docs