Table of Contents
- 1. Introduction
- 2. Built-in Functions
- 3. Built-in Types
- 3.1. Truth Value Testing
- 3.2. Boolean Operations — and, or, not
- 3.3. Comparisons
- 3.4. Numeric Types — int, float, complex
- 3.5. Iterator Types
- 3.6. Sequence Types — list, tuple, range
- 3.7. Generators
- 3.8. Text Sequence Type — str
- 3.9. Binary Sequence Types — bytes, bytearray, memoryview
- 3.10. Set Types — set, frozenset
- 3.11. Mapping Types — dict
- 3.12. Context Manager Types
- 3.13. Other Built-in Types
- 3.14. Special Attributes – magic methods
- 4. Built-in Exceptions
- 5. Text Processing Services
- 6. Data Types
- 6.1. datetime — Basic date and time types
- 6.2. Time
- 6.3. calendar — General calendar-related functions
- 6.4. collections — Container datatypes
- 6.5. collections — High-performance container datatypes
- 6.6. collections.abc — Abstract Base Classes for Containers
- 6.7. heapq — Heap queue algorithm
- 6.8. bisect — Array bisection algorithm
- 6.9. array — Efficient arrays of numeric values
- 6.10. weakref — Weak references
- 6.11. types — Dynamic type creation and names for built-in types
- 6.12. copy — Shallow and deep copy operations
- 6.13. pprint — Data pretty printer
- 6.14. reprlib — Alternate repr() implementation
- 6.15. enum — Support for enumerations
- 7. Numeric and Mathematical Modules
- 7.1. numbers — Numeric abstract base classes
- 7.2. math — Mathematical functions
- 7.3. cmath — Mathematical functions for complex numbers
- 7.4. decimal — Decimal fixed point and floating point arithmetic
- 7.5. fractions — Rational numbers
- 7.6. random — Generate pseudo-random numbers
- 7.7. statistics — Mathematical statistics functions
- 8. Functional Programming Modules
- 9. File and Directory Access
- 9.1. pathlib — Object-oriented filesystem paths
- 9.2. os.path — Common pathname manipulations
- 9.3. fileinput — Iterate over lines from multiple input streams
- 9.4. stat — Interpreting stat() results
- 9.5. filecmp — File and Directory Comparisons
- 9.6. tempfile — Generate temporary files and directories
- 9.7. glob — Unix style pathname pattern expansion
- 9.8. fnmatch — Unix filename pattern matching
- 9.9. linecache — Random access to text lines
- 9.10. shutil — High-level file operations
- 9.11. macpath — Mac OS 9 path manipulation functions
- 10. Data Persistence
- 10.1. pickle — Python object serialization
- 10.2. copyreg — Register pickle support functions
- 10.3. shelve — Python object persistence
- 10.4. marshal — Internal Python object serialization
- 10.5. dbm — Interfaces to Unix “databases”
- 10.6. sqlite3 — DB-API interface for SQLite databases
- 10.7. protobuf
- 10.8. neo4j
- 10.9. mysql
- 11. Data Compression and Archiving
- 12. File Formats
- 13. Cryptographic Services
- 14. Generic Operating System Services
- 14.1. os — Miscellaneous operating system interfaces
- 14.2. io — Core tools for working with streams
- 14.3. time — Time access and conversions
- 14.4. argparse — Parser for command-line options, arguments and sub-commands
- 14.5. getopt — C-style parser for command line options
- 14.6. logging — Logging facility for Python
- 14.7. logging.config — Logging configuration
- 14.8. logging.handlers — Logging handlers
- 14.9. getpass — Portable password input
- 14.10. curses — Terminal handling for character-cell displays
- 14.11. curses.textpad — Text input widget for curses programs
- 14.12. curses.ascii — Utilities for ASCII characters
- 14.13. curses.panel — A panel stack extension for curses
- 14.14. platform — Access to underlying platform’s identifying data
- 14.15. errno — Standard errno system symbols
- 14.16. ctypes — A foreign function library for Python
- 15. Concurrent Execution
- 15.1. threading — Thread-based parallelism
- 15.2. threading & queue
- 15.3. multiprocessing — Process-based parallelism
- 15.4. The concurrent package
- 15.5. concurrent.futures — Launching parallel tasks
- 15.6. subprocess — Subprocess management
- 15.7. sched — Event scheduler
- 15.8. queue — A synchronized queue class
- 15.9. dummy_threading — Drop-in replacement for the threading module
- 15.10. _thread — Low-level threading API
- 15.11. _dummy_thread — Drop-in replacement for the _thread module
- 16. Internet Data Handling
- 17. Internet Protocols and Support
- 17.1. webbrowser — Convenient Web-browser controller
- 17.2. cgi — Common Gateway Interface support
- 17.3. cgitb — Traceback manager for CGI scripts
- 17.4. wsgiref — WSGI Utilities and Reference Implementation
- 17.5. urllib — URL handling modules
- 17.6. urllib.request — Extensible library for opening URLs
- 17.7. urllib.response — Response classes used by urllib
- 17.8. urllib.parse — Parse URLs into components
- 17.9. urllib.error — Exception classes raised by urllib.request
- 17.10. urllib.robotparser — Parser for robots.txt
- 17.11. http — HTTP modules
- 17.12. http.client — HTTP protocol client
- 17.13. ftplib — FTP protocol client
- 17.14. poplib — POP3 protocol client
- 17.15. imaplib — IMAP4 protocol client
- 17.16. nntplib — NNTP protocol client
- 17.17. smtplib — SMTP protocol client
- 17.18. smtpd — SMTP Server
- 17.19. telnetlib — Telnet client
- 17.20. uuid — UUID objects according to RFC 4122
- 17.21. socketserver — A framework for network servers
- 17.22. http.server — HTTP servers
- 17.23. http.cookies — HTTP state management
- 17.24. http.cookiejar — Cookie handling for HTTP clients
- 17.25. xmlrpc — XMLRPC server and client modules
- 17.26. xmlrpc.client — XML-RPC client access
- 17.27. xmlrpc.server — Basic XML-RPC servers
- 17.28. ipaddress — IPv4/IPv6 manipulation library
- 18. Development Tools
- 18.1. typing — Support for type hints
- 18.2. pydoc — Documentation generator and online help system
- 18.3. doctest — Test interactive Python examples
- 18.4. unittest — Unit testing framework
- 18.5. unittest.mock — mock object library
- 18.6. unittest.mock — getting started
- 18.7. test — Regression tests package for Python
- 18.8. test.support — Utilities for the Python test suite
- 19. Debugging and Profiling
- 20. Software Packaging and Distribution
- 21. Python Runtime Services
- 21.1. sysconfig — Provide access to Python’s configuration information
- 21.2. os, sys — System-specific parameters and functions
- 21.3. builtins — Built-in objects
- 21.4. __main__ — Top-level script environment
- 21.5. warnings — Warning control
- 21.6. contextlib — Utilities for with-statement contexts
- 21.7. abc — Abstract Base Classes
- 21.8. atexit — Exit handlers
- 21.9. traceback — Print or retrieve a stack traceback
- 21.10. __future__ — Future statement definitions
- 21.11. gc — Garbage Collector interface
- 21.12. inspect — Inspect live objects
- 21.13. site — Site-specific configuration hook
- 21.14. fpectl — Floating point exception control
- 22. Custom Python Interpreters
- 23. Importing Modules
- 24. Python Language Services
- 24.1. parser — Access Python parse trees
- 24.2. ast — Abstract Syntax Trees
- 24.3. symtable — Access to the compiler’s symbol tables
- 24.4. symbol — Constants used with Python parse trees
- 24.5. token — Constants used with Python parse trees
- 24.6. keyword — Testing for Python keywords
- 24.7. tokenize — Tokenizer for Python source
- 24.8. tabnanny — Detection of ambiguous indentation
- 24.9. pyclbr — Python class browser support
- 24.10. py_compile — Compile Python source files
- 24.11. compileall — Byte-compile Python libraries
- 24.12. dis — Disassembler for Python bytecode
- 24.13. pickletools — Tools for pickle developers
- 25. Miscellaneous Services
- 26. MS Windows Specific Services
- 27. Unix Specific Services
- 27.1. posix — The most common POSIX system calls
- 27.2. pwd — The password database
- 27.3. spwd — The shadow password database
- 27.4. grp — The group database
- 27.5. crypt — Function to check Unix passwords
- 27.6. termios — POSIX style tty control
- 27.7. tty — Terminal control functions
- 27.8. pty — Pseudo-terminal utilities
- 27.9. fcntl — The fcntl and ioctl system calls
- 27.10. pipes — Interface to shell pipelines
- 27.11. resource — Resource usage information
- 27.12. nis — Interface to Sun’s NIS (Yellow Pages)
- 27.13. syslog — Unix syslog library routines
- 28. Superseded Modules
- 29. Undocumented Modules
- 30. Data Analysis
- 31. Machine Learning
- 32. Deep Learning
- 33. NLP
1 Introduction
2 Built-in Functions
2.1 Function
2.1.1 print
- output formatting:
To use formatted string literals, begin a string with f or F before the opening quotation mark or triple quotation mark. Inside this string, you can write a Python expression between { and } characters that can refer to variables or literal values.
>>> year = 2016
>>> event = 'Referendum'
>>> f'Results of the {year} {event}'
'Results of the 2016 Referendum'
- The str.format() method
>>> yes_votes = 42_572_654
>>> no_votes = 43_132_495
>>> percentage = yes_votes / (yes_votes + no_votes)
>>> '{:-9} YES votes {:2.2%}'.format(yes_votes, percentage)
' 42572654 YES votes 49.67%'
- Formatted String Literals
>>> table = {'Sjoerd': 4127, 'Jack': 4098, 'Dcab': 7678}
>>> for name, phone in table.items():
...     print(f'{name:10} ==> {phone:10d}')
...
Sjoerd     ==>       4127
Jack       ==>       4098
Dcab       ==>       7678
>>> animals = 'eels'
>>> print(f'My hovercraft is full of {animals}.')
My hovercraft is full of eels.
>>> print(f'My hovercraft is full of {animals!r}.')
My hovercraft is full of 'eels'.
- String format() method:
>>> print('We are the {} who say "{}!"'.format('knights', 'Ni'))
We are the knights who say "Ni!"
>>> print('{0} and {1}'.format('spam', 'eggs'))
spam and eggs
>>> print('{1} and {0}'.format('spam', 'eggs'))
eggs and spam
>>> print('This {food} is {adjective}.'.format(
...       food='spam', adjective='absolutely horrible'))
This spam is absolutely horrible.
>>> print('The story of {0}, {1}, and {other}.'.format('Bill', 'Manfred', other='Georg'))
The story of Bill, Manfred, and Georg.
>>> table = {'Sjoerd': 4127, 'Jack': 4098, 'Dcab': 8637678}
>>> print('Jack: {0[Jack]:d}; Sjoerd: {0[Sjoerd]:d}; '
...       'Dcab: {0[Dcab]:d}'.format(table))
Jack: 4098; Sjoerd: 4127; Dcab: 8637678
>>> table = {'Sjoerd': 4127, 'Jack': 4098, 'Dcab': 8637678}
>>> print('Jack: {Jack:d}; Sjoerd: {Sjoerd:d}; Dcab: {Dcab:d}'.format(**table))
Jack: 4098; Sjoerd: 4127; Dcab: 8637678
- print into file:
# Python 3.x syntax:
print(args, file=f1)
# Python 2.x:
# print >> f1, args
- pythonic returning values:
def foo(value):
    return 'a' if value is True else 'b'
- function parameter
- passing keyword arguments, e.g. boo(a=1, b=2), does not change the caller's variables themselves; positional parameters must be supplied in their declared order, while keyword arguments may be given in any order.
- if the argument is immutable, rebinding the parameter inside the function does not change the original value;
if the argument is mutable, in-place operations on it, such as append(), do change the caller's object.
a, b = b, a + b
# is equivalent to:
t = (b, a + b)  # t is a tuple
a = t[0]
b = t[1]
2.1.2 normal argument, args, kwargs
*args and **kwargs allow you to pass a variable number of arguments to a function.
- *args:
def test_var_args(f_arg, *argv):
    print("first normal arg:", f_arg)
    for arg in argv:
        print("another arg through *argv:", arg)

test_var_args('yasoob', 'python', 'eggs', 'test')
- **kwargs:
>>> kwargs = {"arg3": 3, "arg2": "two", "arg1": 5}
>>> test_args_kwargs(**kwargs)
arg1: 5
arg2: two
arg3: 3
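The snippets above assume a test_args_kwargs helper that is never defined in this section; a minimal sketch of what it presumably looks like:

def test_args_kwargs(arg1, arg2, arg3):
    # hypothetical helper assumed by the *args/**kwargs examples above
    print("arg1:", arg1)
    print("arg2:", arg2)
    print("arg3:", arg3)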
2.2 Troubleshooting
- linux python FileNotFoundError: [Errno 2] No such file or directory:
try to use an absolute path instead of a relative path to read the file.
- HDF5
pip install tables
2.3 Decorator
A decorator is a way to dynamically add new behavior to objects; in Python we achieve the same with closures.
In the example below we create a simple decorator that prints a statement before and after the execution of a function.
>>> def my_decorator(func):
...     def wrapper(*args, **kwargs):
...         print("Before call")
...         result = func(*args, **kwargs)
...         print("After call")
...         return result
...     return wrapper
...
>>> @my_decorator
... def add(a, b):
...     "Our add function"
...     return a + b
...
>>> add(1, 3)
Before call
After call
4
Common examples for decorators are classmethod() and staticmethod().
2.3.1 classmethod(function)
Return a class method for function.
A class method receives the class as implicit first argument, just like an instance method receives the instance. To declare a class method, use this idiom:
class C(object):
    @classmethod
    def f(cls, arg1, arg2, ...): ...

The @classmethod form is a function decorator – see the description of function definitions in Function definitions for details.
It can be called either on the class (such as C.f()) or on an instance (such as C().f()). The instance is ignored except for its class. If a class method is called for a derived class, the derived class object is passed as the implied first argument.
Class methods are different from C++ or Java static methods. If you want those, see staticmethod() in this section.
For more information on class methods, consult the documentation on the standard type hierarchy in The standard type hierarchy.
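As a quick illustration (the Date class below is hypothetical, not from the library docs), a class method is often used as an alternate constructor that also works for subclasses:

class Date(object):
    def __init__(self, year, month, day):
        self.year, self.month, self.day = year, month, day

    @classmethod
    def from_string(cls, text):
        # cls is Date here, or whichever subclass the method is called on
        year, month, day = map(int, text.split('-'))
        return cls(year, month, day)

d = Date.from_string('2018-01-20')
print(d.year, d.month, d.day)  # 2018 1 20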
2.3.2 staticmethod(function)
Return a static method for function.
A static method does not receive an implicit first argument. To declare a static method, use this idiom:
class C(object):
    @staticmethod
    def f(arg1, arg2, ...): ...

The @staticmethod form is a function decorator – see the description of function definitions in Function definitions for details.
It can be called either on the class (such as C.f()) or on an instance (such as C().f()). The instance is ignored except for its class.
Static methods in Python are similar to those found in Java or C++. Also see classmethod() for a variant that is useful for creating alternate class constructors.
For more information on static methods, consult the documentation on the standard type hierarchy in The standard type hierarchy.
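A small sketch (hypothetical MathUtil class) showing that a static method receives neither the instance nor the class:

class MathUtil(object):
    @staticmethod
    def add(a, b):
        # no implicit self/cls argument
        return a + b

print(MathUtil.add(1, 2))    # 3, called on the class
print(MathUtil().add(1, 2))  # 3, called on an instance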
2.4 Closures
Closures are nothing but functions that are returned by another function. We use closures to remove code duplication. In the following example we create a simple closure for adding numbers.
>>> def add_number(num):
...     def adder(number):
...         'adder is a closure'
...         return num + number
...     return adder
...
>>> a_10 = add_number(10)
>>> a_10(21)
31
>>> a_10(34)
44
>>> a_5 = add_number(5)
>>> a_5(3)
8
2.5 iterable
An object capable of returning its members one at a time. Examples of iterables include all sequence types (such as list, str, and tuple) and some non-sequence types like dict and file and objects of any classes you define with an __iter__() or __getitem__() method. Iterables can be used in a for loop and in many other places where a sequence is needed (zip(), map(), …). When an iterable object is passed as an argument to the built-in function iter(), it returns an iterator for the object. This iterator is good for one pass over the set of values. When using iterables, it is usually not necessary to call iter() or deal with iterator objects yourself. The for statement does that automatically for you, creating a temporary unnamed variable to hold the iterator for the duration of the loop. See also iterator, sequence, and generator.
- check if an object is iterable
>>> from collections.abc import Iterable  # plain `collections` in Python < 3.3
>>> l = [1, 2, 3, 4]
>>> isinstance(l, Iterable)
True
2.6 iterator
An object representing a stream of data. Repeated calls to the iterator’s next() method return successive items in the stream. When no more data are available a StopIteration exception is raised instead. At this point, the iterator object is exhausted and any further calls to its next() method just raise StopIteration again. Iterators are required to have an __iter__() method that returns the iterator object itself so every iterator is also iterable and may be used in most places where other iterables are accepted. One notable exception is code which attempts multiple iteration passes. A container object (such as a list) produces a fresh new iterator each time you pass it to the iter() function or use it in a for loop. Attempting this with an iterator will just return the same exhausted iterator object used in the previous iteration pass, making it appear like an empty container.
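A short sketch of the one-pass behaviour described above:

>>> it = iter([1, 2])
>>> next(it)
1
>>> next(it)
2
>>> next(it)
Traceback (most recent call last):
  ...
StopIteration
>>> list(it)  # exhausted: the iterator now looks like an empty container
[]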
2.7 generator
A function which returns an iterator. It looks like a normal function except that it contains yield statements for producing a series of values usable in a for-loop or that can be retrieved one at a time with the next() function. Each yield temporarily suspends processing, remembering the location execution state (including local variables and pending try-statements). When the generator resumes, it picks-up where it left-off (in contrast to functions which start fresh on every invocation).
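For example, a minimal generator function (count_up_to is an illustrative name, not a library function):

def count_up_to(n):
    i = 1
    while i <= n:
        yield i  # execution pauses here until the next next() call
        i += 1

for x in count_up_to(3):
    print(x)  # prints 1, 2, 3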
2.8 generator expression
An expression that returns an iterator. It looks like a normal expression followed by a for expression defining a loop variable, range, and an optional if expression. The combined expression generates values for an enclosing function:
>>> sum(i*i for i in range(10))  # sum of squares 0, 1, 4, ... 81
285
3 Built-in Types
3.1 Truth Value Testing
3.2 Boolean Operations — and, or, not
- The ^ symbol
- The ^ symbol is for the bitwise ‘xor’ operation, but in Python, the exponent operator symbol is **.
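A quick check of the two operators:

>>> 2 ^ 3   # bitwise xor: 0b10 ^ 0b11
1
>>> 2 ** 3  # exponentiation
8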
- min()/max() with nan: any comparison involving np.nan returns False, so the result depends on argument order rather than on value:
min(np.nan, np.inf)  # nan, because np.inf < np.nan is False, so the first argument is kept
min(np.inf, np.nan)  # inf
- eval
evaluate a Python expression held in a string, e.g. turn a string into the object it describes.
text = "{'a': 1}"
eval(text)  # turns text from a string into a dictionary object
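For untrusted input, ast.literal_eval is a safer alternative, since it only accepts Python literals; a minimal sketch:

import ast

text = "{'a': 1}"
d = ast.literal_eval(text)  # {'a': 1}; raises ValueError on non-literal input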
3.3 Comparisons
3.4 Numeric Types — int, float, complex
- ValueError: ("invalid literal for int() with base 10:
>>> int('5')
5
>>> float('5.0')
5.0
>>> float('5')
5.0
>>> int(5.0)
5
>>> float(5)
5.0
>>> int('5.0')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '5.0'
>>> int(float('5.0'))
5
3.5 Iterator Types
- xrange vs range:
there is no xrange in Python 3; range() there behaves like Python 2's xrange().
- Finding the index of an item given a list:
>>> ["foo", "bar", "baz"].index("bar") 1
- access index and value looping a list:
for idx, val in enumerate(my_list):
    print(idx, val)
- iterable vs iterator vs generator:
The difference between iterables and generators: once you’ve burned through a generator once, you’re done, no more data.
generator = (word + '!' for word in 'baby let me iterate ya'.split())
# The generator object is now created, ready to be iterated over.
# No exclamation marks added yet at this point.

for val in generator:  # real processing happens here, during iteration
    print(val, end=' ')
# baby! let! me! iterate! ya!

for val in generator:
    print(val, end=' ')
# Nothing printed! No more data, generator stream already exhausted above.
an iterable creates a new iterator every time it’s looped over (technically, every time iterable.__iter__() is called, such as when Python hits a “for” loop):
class BeyonceIterable(object):
    def __iter__(self):
        """The iterable interface: return an iterator from __iter__().

        Every generator is an iterator implicitly (but not vice versa!),
        so implementing `__iter__` as a generator is the easiest way to
        create streamed iterables.
        """
        for word in 'baby let me iterate ya'.split():
            yield word + '!'  # uses yield => __iter__ is a generator

iterable = BeyonceIterable()
for val in iterable:  # iterator created here
    print(val, end=' ')
# baby! let! me! iterate! ya!

for val in iterable:  # another iterator created here
    print(val, end=' ')
# baby! let! me! iterate! ya!
- the __iter__ magic method:
Iterators are everywhere in Python. They are elegantly implemented within for loops, comprehensions, generators etc. but hidden in plain sight.
An iterator in Python is simply an object that can be iterated upon: an object which returns data one element at a time.
Technically speaking, Python iterator object must implement two special methods, __iter__() and __next__(), collectively called the iterator protocol.
An object is called iterable if we can get an iterator from it. Most built-in containers in Python, such as list, tuple and string, are iterables.
The iter() function (which in turn calls the __iter__() method) returns an iterator from them.
- Iterating Through an Iterator in Python
use the next() function to manually iterate through all the items of an iterator.
# define a list
my_list = [4, 7, 0, 3]

# get an iterator using iter()
my_iter = iter(my_list)

# iterate through it using next()
print(next(my_iter))  # prints 4
print(next(my_iter))  # prints 7

# next(obj) is the same as obj.__next__()
print(my_iter.__next__())  # prints 0
print(my_iter.__next__())  # prints 3

# this will raise StopIteration, no items left
next(my_iter)
A more elegant way of automatically iterating is by using the for loop. Using this, we can iterate over any object that can return an iterator, for example list, string, file etc.
for element in my_list:
    print(element)
- How the for loop actually works:
for element in iterable:
    # do something with element
    ...

# is actually implemented as:

# create an iterator object from that iterable
iter_obj = iter(iterable)

# infinite loop
while True:
    try:
        # get the next item
        element = next(iter_obj)
        # do something with element
    except StopIteration:
        # if StopIteration is raised, break from loop
        break
- example:
class PowTwo:
    """Class to implement an iterator of powers of two"""

    def __init__(self, max=0):
        self.max = max

    def __iter__(self):
        self.n = 0
        return self

    def __next__(self):
        if self.n <= self.max:
            result = 2 ** self.n
            self.n += 1
            return result
        else:
            raise StopIteration

# create an iterator and iterate through it as follows:
>>> a = PowTwo(4)
>>> i = iter(a)
>>> next(i)
1
>>> next(i)
2
>>> next(i)
4
>>> next(i)
8
>>> next(i)
16
>>> next(i)
Traceback (most recent call last):
  ...
StopIteration

# use a for loop to iterate over our iterator class:
>>> for i in PowTwo(5):
...     print(i)
3.6 Sequence Types — list, tuple, range
- create code lists A00–A99 and B00–B99:
import itertools
infectious_big_list = [['A{}'.format(i) for i in range(10, 100)],
                       ['A0{}'.format(i) for i in range(0, 10)],
                       ['B{}'.format(i) for i in range(10, 100)],
                       ['B0{}'.format(i) for i in range(0, 10)]]
infectious_disease_cd = list(itertools.chain.from_iterable(infectious_big_list))
mapping_illustrated_disease = {x: True for x in infectious_disease_cd}
- multiply a list with another list:
def multiply_strings(x, col=('med_code', 'med_usage')):
    ls_x = x[col[0]].split(',')
    ls_y = x[col[1]].split(',')
    # assert len(ls_x) == len(ls_y)
    if len(ls_x) != len(ls_y):
        return x[col[0]]
    ls_new = []
    for i in range(len(ls_x)):
        # remove stop words
        if int(ls_y[i]) < 50:
            ls_new.append((ls_x[i] + ',') * int(ls_y[i]))
        else:
            ls_new.append(ls_x[i] + ',')
    return ''.join(ls_new)

df.apply(lambda x: multiply_strings(x), axis=1)
- find all occurrences of a substring
import re
[m.start() for m in re.finditer('test', 'test test test test')]
- find position of sub list in a list
greeting = ['hello','my','name','is','bob','how','are','you','my','name','is']

def find_sub_list(sl, l):
    sll = len(sl)
    for ind in (i for i, e in enumerate(l) if e == sl[0]):
        if l[ind:ind+sll] == sl:
            return ind, ind+sll-1

print(find_sub_list(['my','name','is'], greeting))
- is list equal:
a = [1, 2, 3]
b = [3, 2, 1]
a.sort()
b.sort()
a == b
- Unique value from list of lists:
testdata = [list(x) for x in set(tuple(x) for x in testdata)]
- nested list comprehension:
[x+y for x in [1,2,3] for y in [4,5,6]]
# equal to
res = []
for x in [1,2,3]:
    for y in [4,5,6]:
        res.append(x+y)

[y+1 for x in [[1,2],[2,2],[3,2]] for y in x]
# equal to
res = []
for x in [[1,2],[2,2],[3,2]]:
    for y in x:
        res.append(1+y)
- remove value in a list:
my_list.remove('value')  # removes the first occurrence of 'value'
- tuple vs set
A set is a slightly different concept from a list or a tuple. A set in Python is just like a mathematical set: it does not hold duplicate values and is unordered. However, unlike a tuple, it is not immutable.
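A small sketch of those differences:

s = {1, 2, 2, 3}
print(s)      # {1, 2, 3}: duplicates removed, order not guaranteed
s.add(4)      # sets are mutable

t = (1, 2, 2, 3)
print(t)      # (1, 2, 2, 3): duplicates kept, order preserved
# t[0] = 9    # would raise TypeError: tuples are immutable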
- combine list of lists into one list, join list of lists
import itertools
a = [["a", "b"], ["c"]]
print(list(itertools.chain.from_iterable(a)))
# or, with a plain loop:
merged = []
for sub in a:
    merged.extend(sub)
- find duplicate items in a list:
a = [1, 2, 3, 2, 1, 5, 6, 5, 5, 5]
import collections
print([item for item, count in collections.Counter(a).items() if count > 1])
# [1, 2, 5]
- list comprehension
x = ['Hello', 'World', 18, 'Apple', None]
[a.lower() for a in x if isinstance(a, str)]
- read file to a list:
with open(r'y:\codes\data\smart_beta_etf_list.txt', 'rb') as f:
    etf_list = f.readlines()
# remove whitespace characters like `\n` at the end of each line
etf_list = [x.strip() for x in etf_list]
- save a list to a file:
# as plain text, line by line:
thefile = open('test.txt', 'w')
for item in itemlist:
    thefile.write("%s\n" % item)
thefile.close()

# or with pickle:
import pickle
with open('outfile', 'wb') as fp:
    pickle.dump(itemlist, fp)

# to read it back:
with open('outfile', 'rb') as fp:
    itemlist = pickle.load(fp)
- save a list of Chinese string to a file:
values = [u'股市']
import codecs
with codecs.open("file.txt", "w", encoding="utf-8") as d:
    d.write(str(values))
- replace comma with a newline (e.g. in Notepad++):
choose extended search mode and replace ',' with \r\n
- split strings by space delimiter from reverse:
text.rsplit(' ', 1)[0]
- split strings by space delimiter from beginning:
text.split(' ', 1)[0]

>>> a.split('.', 1)
['alvy', 'test.txt']
# the extra argument 1 (maxsplit) splits at the first '.', giving two strings in a list
>>> a.rsplit('.', 1)
['alvy.test', 'txt']
# rsplit splits from the right: the split happens at the last '.', giving two strings in a list
- split by comma:
string.split(",")
- split by multiple delimiter:
import re
re.split(r'; |, |\*|\n', text)
3.7 Generators
With a list comprehension we can create a list directly. But memory is limited, so list capacity is bounded; creating a list with a million elements not only takes a lot of storage, but if we only ever access the first few elements, the space occupied by the rest is wasted. There are many ways to create a generator. The simplest is to change the [] of a list comprehension to (), which creates a generator. To print its values one by one, call next() to get the next value of the generator: next(g). The hardest part to understand is that a generator's execution flow differs from a function's: a function executes sequentially and returns at a return statement or at its last line, whereas a function turned generator executes on each call to next(), returns at a yield statement, and resumes from that yield on the next call.
def odd():
    print('step 1')
    yield 1
    print('step 2')
    yield 3
    print('step 3')
    yield 5

>>> o = odd()
>>> next(o)
step 1
1
>>> next(o)
step 2
3
>>> next(o)
step 3
5
>>> next(o)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
A Generator is an Iterator
A function with yield in it is still a function, that, when called, returns an instance of a generator object:
def a_function():
    "when called, returns a generator object"
    yield
A generator expression also returns a generator:
a_generator = (i for i in range(0))
A Generator is an Iterator
An Iterator is an Iterable
Iterators require a __next__ method (next in Python 2)
3.7.1 loop
- loop with batches:
for i in tqdm(range(0, len(category), batch_size)):
    re_batch = {}
    for j in range(batch_size):
        re_batch[j] = wiki_category_re.search(category, last_span)
        if re_batch[j] is not None:
            last_span = re_batch[j].span()[1]
    upload_cat_node(re_batch)
- when the loop index itself is not needed:
for _ in range(10):
    print(_)
- fetch several pairs from a dictionary:
from itertools import islice

def take(n, iterable):
    "Return first n items of the iterable as a list"
    return list(islice(iterable, n))

n_items = take(3, dict_df_ret.items())
# or
list(islice(dictionary.items(), 3))
- iterate key and value in a dictionary:
# python 2
for index, value in d.iteritems():
    print index, value
# python 3
for index, value in d.items():
    print(index, value)
- iterate keys in a dictionary:
for k in dict:
- iterate a row in pandas dataframe:
DataFrame.iterrows() returns a generator.
>>> df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])
>>> row = next(df.iterrows())[1]
>>> row
int      1.0
float    1.5
Name: 0, dtype: float64
>>> print(row['int'].dtype)
float64
>>> print(df['int'].dtype)
int64
- To preserve dtypes while iterating over the rows, it is better to use itertuples(), which returns namedtuples of the values and is generally faster than iterrows().
3.8 Text Sequence Type — str
3.9 Binary Sequence Types — bytes, bytearray, memoryview
- convert bytes into string:
b"abcde".decode("utf-8")
3.10 Set Types — set, frozenset
- access an element in a set.
a = set([1, 2, 3])
element = a.pop()   # set.pop() takes no argument and removes an arbitrary element
element = list(a)[0]  # or, without removing
from random import sample

def ForLoop(s):
    for e in s:
        break
    return e

def IterNext(s):
    return next(iter(s))

def ListIndex(s):
    return list(s)[0]

def PopAdd(s):
    e = s.pop()
    s.add(e)
    return e

def RandomSample(s):
    return sample(s, 1)

def SetUnpacking(s):
    e, *_ = s
    return e

from simple_benchmark import benchmark

b = benchmark([ForLoop, IterNext, ListIndex, PopAdd, RandomSample, SetUnpacking],
              {2**i: set(range(2**i)) for i in range(1, 20)},
              argument_name='set size')
b.plot()
3.11 Mapping Types — dict
- merge two dictionaries:
dict1.update(dict2)
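Note that update() modifies dict1 in place. To build a merged copy instead, Python 3.5+ supports dict unpacking (and 3.9+ adds the | operator); a quick sketch:

d1 = {'a': 1}
d2 = {'b': 2}
merged = {**d1, **d2}  # Python 3.5+: new dict, d1 and d2 unchanged
# merged = d1 | d2     # Python 3.9+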
- dump dictionary into pickle:
with open('./data/disease.pkl', 'wb') as f:
    pickle.dump(dict_intravenous_thrombolysis, f)
- dump dictionary into json:
import json
with open('multiple_paths.json', 'w', encoding='utf-8') as fp:
    js_obj = json.dumps(filtered_dict)
    fp.write(js_obj)
- get key from value:
for name, age in word2id.items():  # use iteritems() in Python 2
    if age == 16116:
        print(name)
# or
mydict = {'george': 16, 'amber': 19}
print(list(mydict.keys())[list(mydict.values()).index(16)])       # prints george
print(list(word2id.keys())[list(word2id.values()).index(16116)])
- get some keys value according to a list in a dictionary:
value = {}
for key in finance_vocab:
    value[key] = dict_vocab.get(key)
- filter dictionary by value:
filtered_dict = {k:v for k,v in dict.items() if v<0}
- set all values in a dict:
visited = dict.fromkeys(self.graph, False)
- check if a value is in a dict:
'红鲱鱼招股书' in g.graph.values()
- check if a value is in a defaultdict collection list:
any('波动性' in v for v in g.graph.values())
# or
def in_values(s, d):
    """Does `s` appear in any of the values in `d`?"""
    for v in d.values():
        if s in v:
            return True
    return False

in_values('cow', animals)
- sort a dict by its value:
s = [(k, d[k]) for k in sorted(d, key=d.get, reverse=True)]
- count the number of values for each key in a dict:
d = defaultdict(list)
for name in g.graph.keys():
    d[name] = len(g.graph[name])
- convert a list of tuples with the same key into a dictionary:
from collections import defaultdict
d = defaultdict(list)
for k, v in list(graph.out_edges('财务管理')):
    d[k].append(v)
# or
d = {}
for k, v in list(graph.out_edges('财务管理')):
    d.setdefault(k, []).append(v)
- copy a dictionary and transform its values:
from copy import copy
my_dict = copy(another_dict)
my_dict.update((x, y*2) for x, y in my_dict.items())
- sum the values in a dictionary:
In [231]: d
Out[231]:
defaultdict(list,
            {'上海证券交易所上市公司': 383,
             '各证券交易所上市公司': 37,
             '深圳证券交易所上市公司': 511,
             '证券': 64,
             '证券交易所': 8})

sum(d.values())
- write defaultdict to a json file:
import json
# writing
json.dump(yourdict, open(filename, 'w'))
# reading
yourdict = json.load(open(filename))
3.12 Context Manager Types
3.13 Other Built-in Types
3.14 Special Attributes – magic method:
- __getitem__ in a class allows its instances to use the [ ] (indexer) operator
- __setitem__ is called to implement assignment to self[key]
- the __call__ magic method in a class causes its instances to become callables – in other words, those instances now behave like functions.
- __getattr__ overrides Python's default mechanism for member access.
- the __getattr__ magic method only gets invoked for attributes that are not found in the instance's __dict__. Implementing __getattr__ causes the hasattr built-in function to always return True, unless an exception is raised from within __getattr__.
- __setattr__ allows you to override Python's default mechanism for member assignment.
- The repr() function also converts an object to a string. In Python 2 it can also be invoked with reverse quotes (`), also called accent grave (underneath the tilde, ~, on most keyboards). Unlike str(), it converts the object unambiguously. For example, repr(datetime.datetime.now()) returns 'datetime.datetime(2018, 1, 20, 13, 32, 51, 483232)'.
- __str__
returns the readable string representation of an object, e.g. the elements inside a container.
print(str(a))
print(repr(a))
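A hedged sketch contrasting __str__ and __repr__ on a hypothetical Point class:

class Point(object):
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __repr__(self):
        # unambiguous, ideally valid Python to recreate the object
        return 'Point(%r, %r)' % (self.x, self.y)

    def __str__(self):
        # readable, for print()
        return '(%s, %s)' % (self.x, self.y)

p = Point(1, 2)
print(str(p))   # (1, 2)
print(repr(p))  # Point(1, 2)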
- find out where module is installed
import os
import spacy
print(os.path.dirname(spacy.__file__))
4 Built-in Exceptions
4.1 Base classes
4.2 Concrete exceptions
4.3 Warnings
4.4 Exception hierarchy
4.5 exception
- oracle cx error:
try:
    xx
except cx.DatabaseError:
    continue
- retry:
response = None
error = None
retry = 0
while response is None:
    try:
        response = doing_something()
        if response is not None:
            if 'good' in response:
                print("successfully uploaded")
            else:
                exit("reason %s" % response)
    except HttpError as e:
        if e.code in RETRIABLE_STATUS_CODES:
            error = 'A retriable HTTP error %d occurred:\n%s' % (e.resp.status, e.content)
        else:
            raise
    except RETRIABLE_EXCEPTIONS as e:
        error = 'A retriable error occurred: %s' % e
    if error is not None:
        print(error)
        retry += 1
        if retry > MAX_RETRIES:
            exit('No longer attempting to retry.')
        max_sleep = 2 ** retry
        sleep_seconds = random.random() * max_sleep
        print('Sleeping %f seconds and then retrying...' % sleep_seconds)
        time.sleep(sleep_seconds)
- capture urllib error:
# Python 2; in Python 3 use urllib.request / urllib.error instead
import urllib2
req = urllib2.Request('http://www.python.org/fish.html')
try:
    resp = urllib2.urlopen(req)
except urllib2.HTTPError as e:
    if e.code == 404:
        pass  # do something...
    else:
        pass  # ...
except urllib2.URLError as e:
    pass  # Not an HTTP-specific error (e.g. connection refused)
else:
    # 200
    body = resp.read()
- create an exception:
class ConstraintError(Exception):
    def __init__(self, arg):
        self.args = (arg,)

if error:
    raise ConstraintError("error")

class Networkerror(RuntimeError):
    def __init__(self, arg):
        self.args = (arg,)

try:
    raise Networkerror("Bad hostname")
except Networkerror as e:  # Python 2 also allowed: except Networkerror, e
    print(e.args)
- clean-up actions
>>> def divide(x, y):
...     try:
...         result = x / y
...     except ZeroDivisionError:
...         print("division by zero!")
...     else:
...         print("result is", result)
...     finally:
...         print("executing finally clause")
...
>>> divide(2, 1)
result is 2.0
executing finally clause
>>> divide(2, 0)
division by zero!
executing finally clause
>>> divide("2", "1")
executing finally clause
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in divide
TypeError: unsupported operand type(s) for /: 'str' and 'str'
5 Text Processing Services
5.1 string — Common string operations
- check if string is Chinese:
from googletrans import Translator
translator = Translator(proxies={
    'http': 'http://192.168.1.126:1080',
    'https': 'http://192.168.1.126:1080'
})
input_text_language_0 = translator.detect(input_text_0).lang
- find if strings are almost equal:
from difflib import SequenceMatcher
s_1 = 'Mohan Mehta'
s_2 = 'Mohan Mehte'
print(SequenceMatcher(a=s_1, b=s_2).ratio())
# 0.909090909091
- check if string is empty:
if not text: print('text is empty')
- if the string is an English word.
from nltk.corpus import wordnet
if not wordnet.synsets(word) and not word.isdigit():
    ...  # word is not a recognised English word
- jieba cut, remove signs.
punct = set(u''':!),.:;?]}¢'"、。〉》」』】〕〗〞︰︱︳﹐、﹒
﹔﹕﹖﹗﹚﹜﹞!),.:;?|}︴︶︸︺︼︾﹀﹂﹄﹏、~¢
々‖•·ˇˉ―--′’”([{£¥'"‵〈《「『【〔〖([{£¥〝︵︷︹︻
︽︿﹁﹃﹙﹛﹝({“‘-—_…''')
str_in = u"小明硕士毕业于中国科学院计算所,\
后在日本京都大学深造,凭借过人天赋,旁人若在另一方面爱他,他每即躲开。"
# for a str/unicode
filterpunt = lambda s: ''.join(filter(lambda x: x not in punct, s))
# for a list
filterpuntl = lambda l: list(filter(lambda x: x not in punct, l))

seg_list = jieba.cut(str_in, cut_all=False)
sent_list = filterpuntl(seg_list)
- jieba cut on bash:
python -m jieba news.txt > cut_result.txt
- create a list from jieba generator: sentence = [x for x in seg_list]
- manually download nltk tokenizer:
unzip downloaded file nltk_data to /usr/local/share/nltk_data/tokenizers
- tokenize unicode or string to sentence list.
from nltk import tokenize as n_tokenize
sent = n_tokenize.sent_tokenize(page)
# or
sent_list = page.split()
- list comprehension
[x for x in t if x not in s if x.isdigit()]

l = [22, 13, 45, 50, 98, 69, 43, 44, 1]
[True if x >= 45 else False for x in l]
- if string are digits.
str.isdigit()
- concatenate two strings
" ".join((str1, str2))
- strip the specified characters (whitespace by default) from both ends of a string
#!/usr/bin/python
s = "0000000this is string example....wow!!!0000000"
print(s.strip('0'))
- checks whether the string consists of alphabetic characters only.
str.isalpha()
5.2 re — Regular expression operations
5.2.1 usage:
- find strings
- convert strings
- convert syntax from python2 to python3
regular expression: find print (\S*), replace with print(\1)
5.2.2 types:
In : rex_property.search?
Signature: rex_property.search(string=None, pos=0, endpos=9223372036854775807, *, pattern=None)
Docstring:
Scan through string looking for a match, and return a corresponding match object instance.
This object has start, end, group, groups, span.
Return None if no position in the string matches.
Type: builtin_function_or_method

In : rex_property.findall?
Signature: rex_property.findall(string=None, pos=0, endpos=9223372036854775807, *, source=None)
Docstring:
Return a list of all non-overlapping matches of pattern in string.
Type: builtin_function_or_method

In : rex_property.match?
Signature: rex_property.match(string=None, pos=0, endpos=9223372036854775807, *, pattern=None)
Docstring:
Matches zero or more characters at the beginning of the string.
Type: builtin_function_or_method
5.2.3 string array
[Pp]ython: find Python or python
- parts
re.search('[a-zA-Z0-9]', 'x')
- not
re.search('[^0-9]', 'x')
- shortcut
- word: \w
- number: \d
- space, tab, next line: \s
- zero-width word boundary: \b
re.search(r'\bcorn\b', 'corner')  # returns None: 'corn' inside 'corner' is not a whole word
- start and end with strings
re.search('^Python', 'Python 3')
re.search('Python$', 'this is Python')
- any character
"."
- all lines contain a specific string
^.*Deeplearning4j.*$
5.2.4 optional words
'color' vs 'colour'
re.search('colou?r', 'my favorite color')
5.2.5 repeat
{N}
# find a telephone number
re.search(r'[\d]{3}-[\d]{4}', '867-5309 /Jenny')
# find 32-character GIDs
[x for x in risk_model_merge.keys() if re.match("[A-Z0-9]{32}$", x)]
5.2.6 search for a pattern within a text file
- bulk read:
import re
textfile = open(filename, 'r')
filetext = textfile.read()
textfile.close()
matches = re.findall(r"(<(\d{4,5})>)?", filetext)
- read line by line:
import re
textfile = open(filename, 'r')
matches = []
reg = re.compile(r"(<(\d{4,5})>)?")
for line in textfile:
    matches += reg.findall(line)
textfile.close()
5.3 difflib — Helpers for computing deltas
5.4 textwrap — Text wrapping and filling
5.5 unicodedata — Unicode Database
5.6 stringprep — Internet String Preparation
5.7 readline — GNU readline interface
- read certain line from a file:
import linecache
linecache.getline('Sample.txt', Number_of_Line)
6 Data Types
6.1 datetime — Basic date and time types
- Converting unix timestamp string to readable date in Python
import datetime
print(datetime.datetime.fromtimestamp(int("1284101485")).strftime('%Y-%m-%d %H:%M:%S'))
6.2 Time
- create timestamp:
>>> pd.to_datetime('13000101', format='%Y%m%d', errors='ignore')
datetime.datetime(1300, 1, 1, 0, 0)
- get specific timezone datetime
from datetime import datetime
import pytz

tz = pytz.timezone('America/Los_Angeles')
now = datetime.now()
los_angeles_time = datetime.now(tz)
- use tqdm as a status bar:
from tqdm import tqdm
from time import sleep

for i in tqdm(range(10)):
    sleep(0.1)

# enumerate
for i, item in enumerate(tqdm(my_list)):
    do_things()

# pandas
df = pd.DataFrame(np.random.randint(0, int(1e8), (10000, 1000)))
# Create and register a new `tqdm` instance with `pandas`
# (can use tqdm_gui, optional kwargs, etc.)
tqdm.pandas()
# Now you can use `progress_apply` instead of `apply`
df.groupby(0).progress_apply(lambda x: x**2)
- string to datetime:
time.strptime(string[, format])
- datetime, Timestamp, datetime64
a pandas Timestamp is stored as np.dtype('<M8[ns]'); a DatetimeIndex is composed of Timestamps.
# Timestamp to string:
str_timestamp = pd.to_datetime(Timestamp, format='%Y%m%d')
str_timestamp = str_timestamp.strftime('%Y-%m-%d')
datetime is the standard-library type; datetime64 is NumPy's (UTC-based) equivalent.
- get the location of a date in datetimeindex:
pd.DatetimeIndex.get_loc(datetime)
- datetime off set, subtract
TimeStamp +/- pd.DateOffset(years=1)
pd.Timedelta(days=365)
# allowed keywords: weeks, days, hours, minutes, seconds, milliseconds, microseconds, nanoseconds
- pandas date range selection:
start = ls_dates[d]
start = pd.to_datetime(start)
period = start + pd.DateOffset(30)
print(start, period)
df_range = df_simul[(df_simul.index < period) & (df_simul.index > start)]
6.3 calendar — General calendar-related functions
6.4 collections — Container datatypes
6.5 collections — High-performance container datatypes
| module      | function                                                              |
|-------------|-----------------------------------------------------------------------|
| deque       | list-like container with fast appends and pops on either end          |
| Counter     | dict subclass for counting hashable objects                            |
| defaultdict | dict subclass that calls a factory function to supply missing values  |
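A quick sketch of the three containers from the table:

from collections import deque, Counter, defaultdict

dq = deque([1, 2, 3])
dq.appendleft(0)           # O(1) append/pop at both ends
dq.pop()                   # 3

cnt = Counter('abracadabra')
print(cnt.most_common(2))  # [('a', 5), ('b', 2)]

dd = defaultdict(list)
dd['missing'].append(1)    # no KeyError: the list factory supplies []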
6.6 collections.abc — Abstract Base Classes for Containers
6.7 heapq — Heap queue algorithm
6.8 bisect — Array bisection algorithm
6.9 array — Efficient arrays of numeric values
6.10 weakref — Weak references
6.11 types — Dynamic type creation and names for built-in types
6.12 copy — Shallow and deep copy operations
from copy import copy
6.13 pprint — Data pretty printer
Useful for pretty-printing lists.
>>> import pprint
>>> tup = ('spam', ('eggs', ('lumberjack', ('knights', ('ni', ('dead',
... ('parrot', ('fresh fruit',))))))))
>>> stuff = ['a' * 10, tup, ['a' * 30, 'b' * 30], ['c' * 20, 'd' * 20]]
>>> pprint.pprint(stuff)
['aaaaaaaaaa',
 ('spam',
  ('eggs',
   ('lumberjack',
    ('knights', ('ni', ('dead', ('parrot', ('fresh fruit',)))))))),
 ['aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb'],
 ['cccccccccccccccccccc', 'dddddddddddddddddddd']]
6.14 reprlib — Alternate repr() implementation
6.15 enum — Support for enumerations
7 Numeric and Mathematical Modules
7.1 numbers — Numeric abstract base classes
7.2 math — Mathematical functions
7.3 cmath — Mathematical functions for complex numbers
7.4 decimal — Decimal fixed point and floating point arithmetic
Floating-point numbers are represented in computer hardware as base 2 (binary) fractions. For example, the binary fraction 0.001 has value 0/2 + 0/4 + 1/8. On a typical machine running Python, there are 53 bits of precision available for a Python float, so the value stored internally when you enter the decimal number 0.1 is the binary fraction.
0.00011001100110011001100110011001100110011001100110011010
>>> round(2.675, 2)
2.67
it’s again replaced with a binary approximation, whose exact value is
2.67499999999999982236431605997495353221893310546875
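This is exactly what the decimal module avoids; a short sketch of the contrast (the default decimal context rounds half to even):

from decimal import Decimal

print(round(2.675, 2))                             # 2.67: the binary float is slightly below 2.675
print(Decimal('2.675').quantize(Decimal('0.01')))  # 2.68: exact decimal, ROUND_HALF_EVEN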
- display precision, suppress scientific notation:
import numpy as np
np.set_printoptions(suppress=True)  # suppress scientific notation when printing arrays
%precision %.4g                     # IPython magic: set float display precision
7.5 fractions — Rational numbers
7.6 random — Generate pseudo-random numbers
7.7 statistics — Mathematical statistics functions
8 Functional Programming Modules
8.1 itertools — Functions creating iterators for efficient looping
8.2 functools — Higher-order functions and operations on callable objects
8.3 operator — Standard operators as functions
9 File and Directory Access
9.1 pathlib — Object-oriented filesystem paths
9.2 os.path — Common pathname manipulations
- delete a file:
if os.path.exists("demofile.txt"): os.remove("demofile.txt")
- change file name
import os

def change_filename(dir_name, filename, suffix, extension=None):
    # name = filename.split('/')[-1]
    path, ext = os.path.splitext(filename)
    if extension is None:
        extension = ext
    return os.path.join(dir_name, path + suffix + extension)
- move file
import shutil
shutil.move(source_file_path, destination)  # os has no move(); use shutil.move() or os.rename()
- find current working dir:
import sys, os
# run as: python file.py
ROOTDIR = os.path.join(os.path.dirname(__file__), os.pardir)
sys.path.append(os.path.join(ROOTDIR, "lib"))
# run inside an interactive python session
ROOTDIR = os.path.join(os.path.dirname("__file__"), os.pardir)
- temperary folder:
import os
import tempfile
TEMP_FOLDER = tempfile.gettempdir()
print('Folder "{}" will be used to save temporary dictionary and corpus.'.format(TEMP_FOLDER))
- walk all file from a directory and its sub-directory
Directory tree generator.
For each directory in the directory tree rooted at top (including top itself, but excluding '.' and '..'), yields a 3-tuple
dirpath, dirnames, filenames
import os
from os.path import join, getsize
for root, dirs, files in os.walk('/home/weiwu/share/deep_learning/data/enwiki'):
    print(root, "consumes", end=" ")
    print(sum(getsize(join(root, name)) for name in files), end=" ")
    print("bytes in", len(files), "non-directory files")
- check if file exist
os.path.isfile(os.path.join(path,name))
- get current work directory
import os
cwd = os.getcwd()
- get temporary work directory
from tempfile import gettempdir
tmp_dir = gettempdir()
9.3 fileinput — Iterate over lines from multiple input streams
- open
open() returns a file object, and is most commonly used with two arguments: open(filename, mode).
>>> f = open('workfile', 'w')
>>> print(f)
<_io.TextIOWrapper name='workfile' mode='w' encoding='UTF-8'>
The first argument is a string containing the filename. The second argument is another string containing a few characters describing the way in which the file will be used. mode can be 'r' when the file will only be read, 'w' for only writing (an existing file with the same name will be erased), and 'a' opens the file for appending; any data written to the file is automatically added to the end. 'r+' opens the file for both reading and writing. The mode argument is optional; r will be assumed if it’s omitted.
On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'.
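It is good practice to open files in a with statement so they are closed automatically, even on error; a minimal sketch:

with open('workfile') as f:
    data = f.read()
# f is closed here, whether or not an exception occurred in the block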
- write text at the end of a file without overwrite that file:
f = open('filename.txt', 'a')
f.write("stuff")
f.close()
- read specific lines
fp = open("file") for i, line in enumerate(fp): if i == 25: # 26th line elif i == 29: # 30th line elif i > 29: break fp.close() # Note that i == n-1 for the nth line. # In Python 2.6 or later: with open("file") as fp: for i, line in enumerate(fp): if i == 25: # 26th line elif i == 29: # 30th line elif i > 29: break
9.4 stat — Interpreting stat() results
9.5 filecmp — File and Directory Comparisons
9.6 tempfile — Generate temporary files and directories
9.7 glob — Unix style pathname pattern expansion
9.8 fnmatch — Unix filename pattern matching
9.9 linecache — Random access to text lines
9.10 shutil — High-level file operations
9.11 macpath — Mac OS 9 path manipulation functions
10 Data Persistence
10.1 pickle — Python object serialization
10.1.1 dump:
import pickle

data1 = {'a': [1, 2.0, 3, 4+6j],
         'b': ('string', u'Unicode string'),
         'c': None}

selfref_list = [1, 2, 3]
selfref_list.append(selfref_list)

output = open('data.pkl', 'wb')

# Pickle dictionary using protocol 0.
pickle.dump(data1, output)

# Pickle the list using the highest protocol available.
pickle.dump(selfref_list, output, -1)

output.close()

pickle.dump(x0, open("x0.pkl", "wb"))
10.1.2 load:
- read all the objects in the pickle dump file:
pickle_file = open('./data/city_20190228.pkl', 'rb')
dict_disease_seed_graph = []
while True:
    try:
        dict_disease_seed_graph.append(pickle.load(pickle_file))
    except EOFError:
        pickle_file.close()
        break
import pprint, pickle

pkl_file = open('data.pkl', 'rb')

data1 = pickle.load(pkl_file)
pprint.pprint(data1)

data2 = pickle.load(pkl_file)
pprint.pprint(data2)

pkl_file.close()
10.2 copyreg — Register pickle support functions
10.3 shelve — Python object persistence
10.4 marshal — Internal Python object serialization
10.5 dbm — Interfaces to Unix “databases”
10.6 sqlite3 — DB-API interface for SQLite databases
- ModuleNotFoundError: No module named 'MySQLdb'
pip install mysqlclient
- install mysql connector
python ModuleNotFoundError: No module named 'mysql'
pip search mysql-connector | grep --color mysql-connector-python
pip install mysql-connector-python-rf
10.7 protobuf
10.7.1 tutorial
- need a .proto file for the structure:
syntax = "proto3"; // or proto2 package tutorial; import "google/protobuf/timestamp.proto"; // [END declaration] // [START java_declaration] option java_package = "com.example.tutorial"; option java_outer_classname = "AddressBookProtos"; // [END java_declaration] // [START csharp_declaration] option csharp_namespace = "Google.Protobuf.Examples.AddressBook"; // [END csharp_declaration] // [START messages] message Person { string name = 1; int32 id = 2; // Unique ID number for this person. string email = 3; enum PhoneType { MOBILE = 0; HOME = 1; WORK = 2; } message PhoneNumber { string number = 1; PhoneType type = 2; } repeated PhoneNumber phones = 4; google.protobuf.Timestamp last_updated = 5; } // Our address book file is just one of these. message AddressBook { repeated Person people = 1; }
- Compiling Your Protocol Buffers in shell to generate a class:
protoc -I=$SRC_DIR --python_out=$DST_DIR $SRC_DIR/addressbook.proto

protoc --proto_path=src --python_out=build/gen src/foo.proto src/bar/baz.proto
# The compiler will read the files src/foo.proto and src/bar/baz.proto and produce
# two output files: build/gen/foo_pb2.py and build/gen/bar/baz_pb2.py. The compiler
# will automatically create the directory build/gen/bar if necessary, but it will
# not create build or build/gen; they must already exist.
- add_person.py
#! /usr/bin/python
# Python 2 example from the protobuf tutorial

import addressbook_pb2
import sys

# This function fills in a Person message based on user input.
def PromptForAddress(person):
    person.id = int(raw_input("Enter person ID number: "))
    person.name = raw_input("Enter name: ")

    email = raw_input("Enter email address (blank for none): ")
    if email != "":
        person.email = email

    while True:
        number = raw_input("Enter a phone number (or leave blank to finish): ")
        if number == "":
            break

        phone_number = person.phones.add()
        phone_number.number = number

        type = raw_input("Is this a mobile, home, or work phone? ")
        if type == "mobile":
            phone_number.type = addressbook_pb2.Person.MOBILE
        elif type == "home":
            phone_number.type = addressbook_pb2.Person.HOME
        elif type == "work":
            phone_number.type = addressbook_pb2.Person.WORK
        else:
            print "Unknown phone type; leaving as default value."

# Main procedure: Reads the entire address book from a file,
# adds one person based on user input, then writes it back out to the same file.
if len(sys.argv) != 2:
    print "Usage:", sys.argv[0], "ADDRESS_BOOK_FILE"
    sys.exit(-1)

address_book = addressbook_pb2.AddressBook()

# Read the existing address book.
try:
    f = open(sys.argv[1], "rb")
    address_book.ParseFromString(f.read())
    f.close()
except IOError:
    print sys.argv[1] + ": Could not open file. Creating a new one."

# Add an address.
PromptForAddress(address_book.people.add())

# Write the new address book back to disk.
f = open(sys.argv[1], "wb")
f.write(address_book.SerializeToString())
f.close()
- try to run above python code in shell:
python add_person.py ADDRESS_BOOK_FILE
- list_person.py
#! /usr/bin/python
# Python 2 example from the protobuf tutorial

import addressbook_pb2
import sys

# Iterates though all people in the AddressBook and prints info about them.
def ListPeople(address_book):
    for person in address_book.people:
        print "Person ID:", person.id
        print "  Name:", person.name
        if person.HasField('email'):
            print "  E-mail address:", person.email

        for phone_number in person.phones:
            if phone_number.type == addressbook_pb2.Person.MOBILE:
                print "  Mobile phone #: ",
            elif phone_number.type == addressbook_pb2.Person.HOME:
                print "  Home phone #: ",
            elif phone_number.type == addressbook_pb2.Person.WORK:
                print "  Work phone #: ",
            print phone_number.number

# Main procedure: Reads the entire address book from a file and prints all
# the information inside.
if len(sys.argv) != 2:
    print "Usage:", sys.argv[0], "ADDRESS_BOOK_FILE"
    sys.exit(-1)

address_book = addressbook_pb2.AddressBook()

# Read the existing address book.
f = open(sys.argv[1], "rb")
address_book.ParseFromString(f.read())
f.close()

ListPeople(address_book)
- read a message:
python list_person.py ADDRESS_BOOK_FILE
10.8 neo4j
10.8.1 install
pip install neo4j-driver py2neo
10.8.2 basic operations
- create cursor:
g = Graph(host="192.168.4.36",  # IP address of the neo4j server (see ifconfig)
          http_port=7474,       # port the neo4j server listens on
          user="neo4j",         # database user name; the default is neo4j
          password="neo4j123")
10.9 mysql
10.9.1 create
- create table
-- Create table
create table PATIENT_PROFILE
(
  PERSON_ID                VARCHAR2(50) not null,
  PERSON_AGE               INTEGER not null,
  PERSON_SEX               VARCHAR2(10) not null,
  AGE_GROUP                VARCHAR2(10) not null,
  RISK_SCORE               VARCHAR2(10) default 1,
  CHRONIC_DIS              VARCHAR2(10) default 'False',
  INFECTIOUS_DIS           VARCHAR2(10) default 'False',
  TUMOR                    VARCHAR2(10) default 'False',
  PSYCHIATRIC              VARCHAR2(10) default 'False',
  IMPLANTABLE_DEV          VARCHAR2(10) default 'False',
  TREATMENT_PERC_LV        VARCHAR2(10) default 'low',
  INSPECTION_PERC_LV       VARCHAR2(10) default 'low',
  DRUG_PERC_LV             VARCHAR2(10) default 'low',
  OUTPATIENT_PERC_LV       VARCHAR2(10) default 'low',
  DRUG_PURCH_AVG_LV        VARCHAR2(10) default 'low',
  SELF_SURPP_PERC_LV       VARCHAR2(10) default 'low',
  CUM_OUTPATIENT_LV        VARCHAR2(10) default 'low',
  CUM_HOSP_LV              VARCHAR2(10) default 'low',
  HOSP_AVG_LV              VARCHAR2(10) default 'low',
  OUTPATIENT_AVG_LV        VARCHAR2(10) default 'low',
  DRUG_PURCH_FREQ_LV       VARCHAR2(10) default 'low',
  OUTPATIENT_FREQ_LV       VARCHAR2(10) default 'low',
  HOSP_FREQ_LV             VARCHAR2(10) default 'low',
  HOSP_PREFERENCE          VARCHAR2(10),
  FIRST_VISIT_PREFERENCE   VARCHAR2(10),
  TRAILING_12_MONTHS_LEVEL VARCHAR2(10) default 'low',
  TRAILING_36_MONTHS_LEVEL VARCHAR2(10) default 'low',
  SHORT_TERM               VARCHAR2(10) default 'False',
  LONG_TERM                VARCHAR2(10) default 'False'
)
tablespace WUXI
  pctfree 10
  initrans 1
  maxtrans 255;
-- Add comments to the table
comment on table PATIENT_PROFILE is '个人用户画像';
-- Add comments to the columns
comment on column PATIENT_PROFILE.PERSON_SEX is '描述患者的生理性别';
comment on column PATIENT_PROFILE.AGE_GROUP is '0-6 童年,7-17 少年, 18-40 青年, 41-65 中年, 66以上老年';
comment on column PATIENT_PROFILE.RISK_SCORE is '健康风险等级';
comment on column PATIENT_PROFILE.CHRONIC_DIS is '慢性病患者';
comment on column PATIENT_PROFILE.INFECTIOUS_DIS is '传染病病原携带者';
comment on column PATIENT_PROFILE.TUMOR is '恶性肿瘤患者';
comment on column PATIENT_PROFILE.PSYCHIATRIC is '精神疾病患者';
comment on column PATIENT_PROFILE.IMPLANTABLE_DEV is '植入性器材';
comment on column PATIENT_PROFILE.TREATMENT_PERC_LV is '治疗费用占比';
comment on column PATIENT_PROFILE.INSPECTION_PERC_LV is '检查费用占比';
comment on column PATIENT_PROFILE.DRUG_PERC_LV is '药物费用占比';
comment on column PATIENT_PROFILE.OUTPATIENT_PERC_LV is '门诊住院费用比例';
comment on column PATIENT_PROFILE.DRUG_PURCH_AVG_LV is '单次平均购药金额';
comment on column PATIENT_PROFILE.SELF_SURPP_PERC_LV is '自负费用占比';
comment on column PATIENT_PROFILE.CUM_OUTPATIENT_LV is '累计门诊金额';
comment on column PATIENT_PROFILE.CUM_HOSP_LV is '累计住院金额';
comment on column PATIENT_PROFILE.HOSP_AVG_LV is '平均住院金额';
comment on column PATIENT_PROFILE.OUTPATIENT_AVG_LV is '平均门诊金额';
comment on column PATIENT_PROFILE.DRUG_PURCH_FREQ_LV is '药店购药频繁程度';
comment on column PATIENT_PROFILE.OUTPATIENT_FREQ_LV is '门诊就诊频繁程度';
comment on column PATIENT_PROFILE.HOSP_FREQ_LV is '住院就诊频繁程度';
comment on column PATIENT_PROFILE.HOSP_PREFERENCE is '就诊医院偏好 1门诊、2药店购药、3住院';
comment on column PATIENT_PROFILE.FIRST_VISIT_PREFERENCE is '首选就诊偏好,对一次健康问题的首次就医方式的选择偏好。1门诊、2药店购药、3住院';
comment on column PATIENT_PROFILE.TRAILING_12_MONTHS_LEVEL is '近一年费用支出水平';
comment on column PATIENT_PROFILE.TRAILING_36_MONTHS_LEVEL is '近三年费用支出水平';
comment on column PATIENT_PROFILE.SHORT_TERM is '短期内再入院';
comment on column PATIENT_PROFILE.LONG_TERM is '长周期住院';
- create from select:
create table temp_pairs_id as
select t1.GRID person_1, t2.GRID person_2, sysdate as crt_date
from MMAP_SHB_SPECIAL.Ck10_Ghdj t1
- create index
- create multiple index
10.9.2 select
- select the first row in each group:
WITH summary AS (
    select person_id,
           hosp_lev,
           clinic_type,
           substr(out_diag_dis_cd, 0, 3) as icd3,
           to_date(substr(out_hosp_date, 0, 8), 'yyyymmdd') as discharge_date,
           ROW_NUMBER() OVER(PARTITION BY person_id, substr(out_diag_dis_cd, 0, 3)
                             ORDER BY to_date(substr(out_hosp_date, 0, 8), 'yyyymmdd')) AS rk
    from t_kc21
    where out_diag_dis_cd is not null
      and clinic_type is not null)
SELECT s.*
FROM summary s
WHERE s.rk = 1
10.9.3 insert
- insert by batches:
from tqdm import tqdm

def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

chunksize = 500
with tqdm(total=len(df_disease)) as pbar:
    for i, cdf in enumerate(chunker(df_disease, chunksize)):
        cdf.to_sql(con=engine, name="disease_profile", if_exists="append", index=False)
        pbar.update(chunksize)
- insert data into sql:
cursor = CONN.cursor()
s_insert = """INSERT INTO t_jbhlx (TBBH, JB_CD, HLFYSX, HLFYXX, HLZYTSSX, HLZYTSXX,
LOGIN_ID, MODIFY_ID, DELE_FLG)
VALUES (:1, :2, :3, :4, :5, :6, :7, :8, :9)"""
for idx, row in df_sim.iterrows():
    if counter % 1000 == 0:
        print(idx)
        print(counter)
    counter += 1
    try:
        cursor.execute(s_insert,
                       (str(data.table_index.iloc[0]),  # TBBH
                        accident_id,                    # JB_CD
                        str(cost_upper_bnd),            # HLFYSX
                        str(cost_lower_bnd),            # HLFYXX
                        str(period_upper_bnd),          # HLZYTSSX
                        str(period_lower_bnd),          # HLZYTSXX
                        'admin',                        # LOGIN_ID
                        'admin',                        # MODIFY_ID
                        '1'))                           # DELE_FLG
    except cx_Oracle.IntegrityError:
        pass
CONN.commit()
- True\False
True/False cannot be inserted into the table directly; convert them to strings first.
10.9.4 delete
- clear a table
The TRUNCATE TABLE statement is used to remove all records from a table in Oracle. It performs the same function as a DELETE statement without a WHERE clause.
TRUNCATE TABLE [schema_name.]table_name
- delete a table
drop table name
10.9.5 update
- update with python:
s_insert = """update patient_profile set CHRONIC_DIS='{0}',INFECTIOUS_DIS='{1}', TUMOR='{2}', PSYCHIATRIC='{3}' where person_id={4}""" counter = 0 for idx, row in df_disease.iterrows(): if counter % 100000 == 0: print(counter) counter += 1 cursor.execute(s_insert.format(str(row['慢性病患者']), str(row['传染病病原携带者']), str(row['恶性肿瘤患者']), str(row['精神疾病患者']), str(idx)))
11 Data Compression and Archiving
11.1 zip
zip(*iterables) returns tuples in which the i-th tuple contains the i-th element from each of the argument iterables. The result is truncated to the length of the shortest argument. With a single iterable argument it yields 1-tuples; with no arguments it yields nothing. Note that zip() returns a list in Python 2 but a lazy iterator in Python 3, so the transcripts below show Python 2 behaviour; in Python 3 wrap the result in list() to display it.
The left-to-right evaluation order of the iterables is guaranteed. This makes possible an idiom for clustering a data series into n-length groups using zip(*[iter(s)]*n).
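For example, grouping a list into pairs with this idiom:
>>> s = [1, 2, 3, 4, 5, 6]
>>> list(zip(*[iter(s)] * 2))
[(1, 2), (3, 4), (5, 6)]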
zip() in conjunction with the * operator can be used to unzip a list:
>>> x = [1, 2, 3]
>>> y = [4, 5, 6]
>>> zipped = zip(x, y)
>>> zipped
[(1, 4), (2, 5), (3, 6)]
>>> x2, y2 = zip(*zipped)
>>> x == list(x2) and y == list(y2)
True
- create a dictionary with two iterables
>>> x = [1, 2, 3]
>>> y = [4, 5, 6]
>>> dict(zip(x, y))
{1: 4, 2: 5, 3: 6}
11.2 zlib — Compression compatible with gzip
11.3 gzip — Support for gzip files
11.4 bz2 — Support for bzip2 compression
11.5 lzma — Compression using the LZMA algorithm
11.6 zipfile — Work with ZIP archives
11.7 tarfile — Read and write tar archive files
12 File Formats
12.1 csv — CSV File Reading and Writing
12.2 *args, **kwargs
If we are not sure how many positional arguments will be passed to a function, or we want to pass a list or tuple of arguments into it, we use *args;
>>> args = ("two", 3,5) >>> test_args_kwargs(*args) arg1: two arg2: 3 arg3: 5
If we do not know how many keyword arguments will be passed, or we want to pass the items of a dictionary as keyword arguments, we use **kwargs. The identifiers args and kwargs are only conventions; *bob and **billy would also work, but are better avoided.
>>> kwargs = {"arg3": 3, "arg2": "two","arg1":5} >>> test_args_kwargs(**kwargs) arg1: 5 arg2: two arg3: 3
- construct argparse for test:
class argparse(dict):
    """Dict-based stand-in for argparse, handy for tests.

    Example:
        m = Map({'first_name': 'Eduardo'}, last_name='Pool', age=24, sports=['Soccer'])
    """
    def __init__(self, *args, **kwargs):
        super(argparse, self).__init__(*args, **kwargs)
        for arg in args:
            if isinstance(arg, dict):
                for k, v in arg.items():
                    self[k] = v
        if kwargs:
            for k, v in kwargs.items():
                self[k] = v

    def add_argument(self, *args, **kwargs):
        for i in args:
            self[i.strip('-')] = kwargs.get('default', None)
            if 'action' in kwargs:
                if kwargs['action'] == 'store_true':
                    self[i.strip('-')] = True
                else:
                    self[i.strip('-')] = False

    def parse_args(self):
        return self

    def __getattr__(self, attr):
        return self.get(attr)

    def __setattr__(self, key, value):
        self.__setitem__(key, value)

    def __setitem__(self, key, value):
        super(argparse, self).__setitem__(key, value)
        self.__dict__.update({key: value})

    def __delattr__(self, item):
        self.__delitem__(item)

    def __delitem__(self, key):
        super(argparse, self).__delitem__(key)
        del self.__dict__[key]
12.3 configparser — Configuration file parser
- use yaml and config file.
# config.yaml
engine:
  user: 'jack'
  password: 'password'
import yaml

with open(r'config.yaml', 'rb') as f:
    config = yaml.safe_load(f)  # yaml.load without an explicit Loader is unsafe/deprecated
- ylib.yaml_config
from ylib.yaml_config import Configuraion

config = Configuraion()
config.load('../config.yaml')
print(config.__str__)

USER_AGENT = config.USER_AGENT
DOMAIN = config.DOMAIN
BLACK_DOMAIN = config.BLACK_DOMAIN
URL_SEARCH = config.URL_SEARCH
12.4 netrc — netrc file processing
12.5 xdrlib — Encode and decode XDR data
12.6 plistlib — Generate and parse Mac OS X .plist files
13 Cryptographic Services
13.1 hashlib — Secure hashes and message digests
13.2 hmac — Keyed-Hashing for Message Authentication
13.3 secrets — Generate secure random numbers for managing secrets
14 Generic Operating System Services
14.1 os — Miscellaneous operating system interfaces
14.2 io — Core tools for working with streams
14.3 time — Time access and conversions
14.4 argparse — Parser for command-line options, arguments and sub-commands
14.4.1 ArgumentParser objects
class argparse.ArgumentParser(prog=None, usage=None, description=None, epilog=None,
                              parents=[], formatter_class=argparse.HelpFormatter,
                              prefix_chars='-', fromfile_prefix_chars=None,
                              argument_default=None, conflict_handler='error',
                              add_help=True, allow_abbrev=True)
Create a new ArgumentParser object. All parameters should be passed as keyword arguments. Each parameter has its own more detailed description below, but in short they are:
- prog - The name of the program (default: sys.argv[0])
- usage - The string describing the program usage (default: generated from arguments added to parser)
- description - Text to display before the argument help (default: none)
- epilog - Text to display after the argument help (default: none)
- parents - A list of ArgumentParser objects whose arguments should also be included
- formatter_class - A class for customizing the help output
- prefix_chars - The set of characters that prefix optional arguments (default: '-')
- fromfile_prefix_chars - The set of characters that prefix files from which additional arguments should be read (default: None)
- argument_default - The global default value for arguments (default: None)
- conflict_handler - The strategy for resolving conflicting optionals (usually unnecessary)
- add_help - Add a -h/--help option to the parser (default: True)
- allow_abbrev - Allows long options to be abbreviated if the abbreviation is unambiguous (default: True)
14.4.2 argument_default
>>> parser = argparse.ArgumentParser(argument_default=argparse.SUPPRESS)
>>> parser.add_argument('--foo')
>>> parser.add_argument('bar', nargs='?')
>>> parser.parse_args(['--foo', '1', 'BAR'])
Namespace(bar='BAR', foo='1')
>>> parser.parse_args([])
Namespace()
14.4.3 example
import argparse
import logging

logger = logging.getLogger()
handler = logging.StreamHandler()
formatter = logging.Formatter(
    '%(asctime)s %(name)-12s %(levelname)-8s %(message)s')
handler.setFormatter(formatter)
if not logger.handlers:
    logger.addHandler(handler)
logger.setLevel(logging.DEBUG)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--input", required=False,
                        help="Input word2vec model")
    parser.add_argument("-o", "--output", required=False,
                        help="Output tensor file name prefix")
    parser.add_argument("-b", "--binary", required=False,
                        help="If word2vec model in binary format, set True, else False")
    parser.add_argument("-l", "--logdir", required=False,
                        help="periodically save model variables in a checkpoint")
    parser.add_argument("--host", required=False,
                        help="host where holding the tensorboard projector service")
    parser.add_argument("-p", "--port", required=False, help="browser port")
    args = parser.parse_args()
    word2vec2tensor(args.input, args.output, args.binary)
14.4.4 another way to define function parameters and provide parameters:
- define function parameters inside a function:
def convert_pdf2txt(args=None):
    import argparse
    P = argparse.ArgumentParser(description=__doc__)
    P.add_argument("-m", "--maxpages", type=int, default=0,
                   help="Maximum pages to parse")
    A = P.parse_args(args=args)
    print(A.maxpages)
- provide parameters:
# all parameters should be strings
convert_pdf2txt(['--maxpages', '123'])
14.5 getopt — C-style parser for command line options
14.6 logging — Logging facility for Python
import logging

logger = logging.getLogger()
handler = logging.StreamHandler()
formatter = logging.Formatter('%(asctime)s %(name)-12s %(levelname)-8s %(message)s')
handler.setFormatter(formatter)
if not logger.handlers:
    logger.addHandler(handler)
logger.setLevel(logging.DEBUG)

# or
logging.basicConfig(
    format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)

# at the end of the program
handler.close()
logger.removeHandler(handler)
- ylog
from ylib import ylog
import logging

ylog.set_level(logging.DEBUG)
ylog.console_on()
ylog.filelog_on("app")
14.7 logging.config — Logging configuration
14.8 logging.handlers — Logging handlers
14.9 getpass — Portable password input
14.10 curses — Terminal handling for character-cell displays
14.11 curses.textpad — Text input widget for curses programs
14.12 curses.ascii — Utilities for ASCII characters
14.13 curses.panel — A panel stack extension for curses
14.14 platform — Access to underlying platform’s identifying data
14.15 errno — Standard errno system symbols
14.16 ctypes — A foreign function library for Python
15 Concurrent Execution
15.1 threading — Thread-based parallelism
15.2 threading & queue
15.2.1 install
pip install queuelib
15.2.2 example
from queue import Queue  # Python 2: from Queue import Queue
import threading
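A minimal producer/consumer sketch built on these two modules (Python 3 naming; the worker stops on a None sentinel):
import threading
from queue import Queue

q = Queue()

def worker():
    while True:
        item = q.get()
        if item is None:  # sentinel: stop the worker
            break
        print('processing', item)
        q.task_done()

t = threading.Thread(target=worker)
t.start()
for i in range(5):
    q.put(i)
q.join()      # block until all queued items have been processed
q.put(None)   # shut the worker down
t.join()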
15.3 multiprocessing — Process-based parallelism
15.4 The concurrent package
15.5 concurrent.futures — Launching parallel tasks
15.6 subprocess — Subprocess management
15.7 sched — Event scheduler
15.8 queue — A synchronized queue class
15.9 dummy_threading — Drop-in replacement for the threading module
15.10 _thread — Low-level threading API
15.11 _dummy_thread — Drop-in replacement for the _thread module
16 Internet Data Handling
16.1 Jupyter notebook
16.1.1 Using a virtualenv in an IPython notebook
- Install the ipython kernel module into your virtualenv
workon my-virtualenv-name  # activate your virtualenv, if you haven't already
pip install ipykernel
- Now run the kernel "self-install" script:
python -m ipykernel install --user --name=my-virtualenv-name
- list all the kernels
jupyter kernelspec list
- remove (uninstall) a kernel
jupyter kernelspec uninstall anaconda2.7
16.1.2 Extension & Configuration
- install extension:
conda install -c conda-forge jupyter_contrib_nbextensions
- Enable line numbers by default
touch ~/.jupyter/custom/custom.js
add the text below to the file:
define([
    'base/js/namespace',
    'base/js/events'
], function(IPython, events) {
    events.on("app_initialized.NotebookApp", function () {
        IPython.Cell.options_default.cm_config.lineNumbers = true;
    });
});
- enable auto complete
jupyter nbextension enable hinterland/hinterland
16.2 typical structure of an IPython notebook (.ipynb):
- imports
- get data
- transform data
- modeling
- visualization
- making sense of the data
summary:
- notebook should have one hypothesis data interpretation loop
- make a multi-project utils library
- each cell should have one and only one output
- try to keep code inside notebooks.
16.3 fetch data from yahoo
install pandas-datareader first.
conda install pandas-datareader
import pandas as pd
import datetime as dt
import numpy as np
from pandas_datareader import data as web

data = pd.DataFrame()
symbols = ['GLD', 'GDX']
for sym in symbols:
    data[sym] = web.DataReader(sym, data_source='yahoo',
                               start='20100510')['Adj Close']
data = data.dropna()
16.4 email — An email and MIME handling package
16.5 json — JSON encoder and decoder
16.6 Graph
16.6.1 networkx
- add node to a graph
- add edges to a graph
- find a loop/cycle in a graph
nx.find_cycle(G) list(nx.simple_cycles(G))
17 Internet Protocols and Support
17.1 webbrowser — Convenient Web-browser controller
17.2 cgi — Common Gateway Interface support
17.3 cgitb — Traceback manager for CGI scripts
17.4 wsgiref — WSGI Utilities and Reference Implementation
17.5 urllib — URL handling modules
17.6 urllib.request — Extensible library for opening URLs
17.7 urllib.response — Response classes used by urllib
17.8 urllib.parse — Parse URLs into components
17.9 urllib.error — Exception classes raised by urllib.request
17.10 urllib.robotparser — Parser for robots.txt
17.11 http — HTTP modules
17.12 http.client — HTTP protocol client
17.13 ftplib — FTP protocol client
17.14 poplib — POP3 protocol client
17.15 imaplib — IMAP4 protocol client
17.16 nntplib — NNTP protocol client
17.17 smtplib — SMTP protocol client
17.18 smtpd — SMTP Server
17.19 telnetlib — Telnet client
17.20 uuid — UUID objects according to RFC 4122
17.21 socketserver — A framework for network servers
17.22 http.server — HTTP servers
17.23 http.cookies — HTTP state management
17.24 http.cookiejar — Cookie handling for HTTP clients
17.25 xmlrpc — XMLRPC server and client modules
17.26 xmlrpc.client — XML-RPC client access
17.27 xmlrpc.server — Basic XML-RPC servers
17.28 ipaddress — IPv4/IPv6 manipulation library
18 Development Tools
18.1 typing — Support for type hints
18.2 pydoc — Documentation generator and online help system
18.3 doctest — Test interactive Python examples
18.4 unittest — Unit testing framework
- check data operation:
- create, select, update, delete.
- purpose of unit test
- checking parameter types, classes, or values.
- checking data structure invariants.
- checking “can’t happen” situations (duplicates in a list, contradictory state variables.)
- after calling a function, to make sure that its return is reasonable.
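A minimal sketch of such checks with unittest (mean() is an illustrative function, not from the text):
import unittest

def mean(values):
    return sum(values) / len(values)

class TestMean(unittest.TestCase):
    def test_return_is_reasonable(self):
        self.assertEqual(mean([1, 2, 3]), 2)

    def test_cant_happen_situation(self):
        with self.assertRaises(ZeroDivisionError):
            mean([])

if __name__ == '__main__':
    unittest.main()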
18.5 unittest.mock — mock object library
18.6 unittest.mock — getting started
18.7 test — Regression tests package for Python
18.8 test.support — Utilities for the Python test suite
19 Debugging and Profiling
19.1 bdb — Debugger framework
19.2 faulthandler — Dump the Python traceback
19.3 pdb — The Python Debugger
- s(tep):
Execute the current line, stop at the first possible occasion (either in a function that is called or in the current function).
- n(ext):
Continue execution until the next line in the current function is reached or it returns.
- unt(il):
Continue execution until the line with a number greater than the current one is reached or until the current frame returns.
- r(eturn):
Continue execution until the current function returns.
- c(ont(inue)):
Continue execution, only stop when a breakpoint is encountered.
- l(ist): [first [,last]]
List source code for the current file. Without arguments, list 11 lines around the current line or continue the previous listing. With one argument, list 11 lines starting at that line. With two arguments, list the given range; if the second argument is less than the first, it is a count.
- a(rgs):
Print the argument list of the current function.
- p expression:
Print the value of the expression.
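To drop into the debugger at a chosen line, the standard pattern is pdb.set_trace() (buggy() here is an illustrative function); the commands above then apply at the prompt:
import pdb

def buggy(x):
    pdb.set_trace()  # execution pauses here; then use s, n, p expression, etc.
    return x + 1

buggy(1)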
19.4 The Python Profilers
STEPS: 1). install snakeviz using pip from cmd.
pip install snakeviz
2). profile the test python file using below command.
$ python -m cProfile -o profile.stats test.py
# test.py
from random import randint

max_size = 10**4
data = [randint(0, max_size) for _ in range(max_size)]
test = lambda: insertion_sort(data)  # insertion_sort is assumed to be defined elsewhere
3). check the efficiency result from profile.stats file.
$ snakeviz profile.stats
19.5 timeit — Measure execution time of small code snippets
19.6 trace — Trace or track Python statement execution
19.7 tracemalloc — Trace memory allocations
20 Software Packaging and Distribution
20.1 pip
- install opencv, pytorch:
pip install opencv-python torch
- install with a wheel .whl file:
pip install *.whl
- Upgrading pip
pip install -U pip
- add below setup to ~/.pip/pip.conf
[global]
#index-url=https://pypi.mirrors.ustc.edu.cn/simple/
#index-url=https://pypi.python.org/simple/
index-url=http://mirrors.aliyun.com/pypi/simple/
#index-url=https://pypi.gocept.com/pypi/simple/
#index-url=https://mirror.picosecond.org/pypi/simple/

[install]
trusted-host=mirrors.aliyun.com
#trusted-host=mirrors.ustc.edu.cn
- generate a requirements file:
pip freeze > requirements.txt
- pip install directly from a requirements file (contents of requirements.txt):
--index-url http://mirrors.aliyun.com/pypi/simple/
pandas
pylint
pep8
sphinx
ipython
numpy
ipdb
mock
nose
20.2 distutils — Building and installing Python modules
20.3 ensurepip — Bootstrapping the pip installer
20.4 venv — Creation of virtual environments
20.5 zipapp — Manage executable python zip archives
20.6 pyenv — Simple Python version management
- check installed versions
pyenv versions
system
2.7.13
3.6.0
3.6.0/envs/general
3.6.0/envs/simulate
3.6.0/envs/venv3.6.0
3.6.0/envs/venv3.6.0.1
* anaconda3-4.4.0 (set by /home/weiwu/projects/simulate/.python-version)
general
simulate
venv3.6.0
venv3.6.0.1
21 Python Runtime Services
21.1 sysconfig — Provide access to Python’s configuration information
21.2 os, sys — System-specific parameters and functions
- get environment variables
import os

env_dist = os.environ  # environ is a dict defined in os.py
print(env_dist.get('JAVA_HOME'))
print(env_dist['JAVA_HOME'])
- check if file or directory exists, if not then make directory:
import os

os.path.exists("test_file.txt")
os.path.isfile("test-data")

export_dir = "export/"
if not os.path.exists(export_dir):
    os.mkdir(export_dir)
- read a file:
import os

folder = '/file/path'
file = os.path.join(folder, 'file_name')
- list all the files under a directory:
# os.listdir() returns a list of the names of the files and folders in the
# given directory, in alphabetical order; it does not include '.' and '..'
# even though they are in the folder.
path = os.getcwd()
dirs = os.listdir(path)
- check if the file readable:
import os

if os.access("/file/path/foo.txt", os.F_OK):
    print("Given file path exists.")
if os.access("/file/path/foo.txt", os.R_OK):
    print("File is accessible to read")
if os.access("/file/path/foo.txt", os.W_OK):
    print("File is accessible to write")
if os.access("/file/path/foo.txt", os.X_OK):
    print("File is accessible to execute")
- use sys to get command arguments:
#!/usr/bin/python3
import sys

print('Number of arguments:', len(sys.argv))
print('Argument list:', str(sys.argv))

$ python3 test.py arg1 arg2 arg3
Number of arguments: 4
Argument list: ['test.py', 'arg1', 'arg2', 'arg3']
21.3 builtins — Built-in objects
21.4 __main__ — Top-level script environment
21.5 warnings — Warning control
- SettingWithCopyWarning in Pandas, ignore pandas warning
pd.options.mode.chained_assignment = None # default='warn'
- RuntimeWarning in numpy:
numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
import warnings

warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
21.6 contextlib — Utilities for with-statement contexts
21.7 abc — Abstract Base Classes
21.8 atexit — Exit handlers
21.9 traceback — Print or retrieve a stack traceback
21.10 __future__ — Future statement definitions
There are incompatible changes between Python 2.7 and Python 3.x. For example, in 2.x a str is written 'xxx' and a unicode string u'xxx', while in 3.x all strings are unicode, so u'xxx' and 'xxx' are identical; the 2.x str written as 'xxx' must instead be written b'xxx' to denote a binary string.
Upgrading code straight to 3.x is risky because a large amount of change needs testing. Instead, you can first exercise some 3.x features in part of your 2.7 code, and port to 3.x once everything works.
Python provides the __future__ module, which imports features of the next version into the current one, so you can test new-version features in the current version.
from __future__ import print_function
from __future__ import division
from __future__ import unicode_literals
from __future__ import absolute_import
- unicode vs utf-8 vs binary strings vs strings
Unicode assigns each character a unique code point, e.g. each Chinese character maps to one code (not directly machine-readable bytes).
A Chinese character: 汉
Its Unicode value: U+6C49
Convert 6C49 to binary: 01101100 01001001
UTF-8 is the scheme that converts characters (code points) to binary code and vice versa, convenient for storage.
| 1st Byte | 2nd Byte | 3rd Byte | 4th Byte | Number of Free Bits | Maximum Expressible Unicode Value |
| 0xxxxxxx |          |          |          | 7                   | 007F hex (127)                    |
| 110xxxxx | 10xxxxxx |          |          | (5+6)=11            | 07FF hex (2047)                   |
| 1110xxxx | 10xxxxxx | 10xxxxxx |          | (4+6+6)=16          | FFFF hex (65535)                  |
| 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | (3+6+6+6)=21        | 10FFFF hex (1,114,111)            |
Given that the Unicode code point of "严" is 4E25 (100111000100101 in binary), the table shows it falls in the third row's range (0000 0800-0000 FFFF), so its UTF-8 encoding needs three bytes in the format 1110xxxx 10xxxxxx 10xxxxxx. Fill the x's from the last binary digit of 4E25 backwards, padding the remaining leading positions with 0. The result is 11100100 10111000 10100101, i.e. E4B8A5 in hex.
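This can be verified directly in Python 3:
>>> '严'.encode('utf-8')
b'\xe4\xb8\xa5'
>>> hex(ord('严'))
'0x4e25'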
You can use a different encoding from UTF-8 by putting a specially-formatted comment as the first or second line of the source code:
This tells the interpreter which encoding the file uses, so it can read and execute the instructions correctly.
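For example, the standard declaration for a UTF-8 source file:
#!/usr/bin/python
# -*- coding: utf-8 -*-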
- division
New division semantics: the `/` operator used to truncate when both operands were integers; with the new feature it no longer truncates, and `//` is used for floor division. As seen below, results differ only when both operands are integers.
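For example:
>>> from __future__ import division
>>> 3 / 2
1.5
>>> 3 // 2
1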
- print_function
The new print is a function; once this feature is imported, the old print statement can no longer be used.
- unicode_literals
This makes string literals unicode by default.
21.11 gc — Garbage Collector interface
21.12 inspect — Inspect live objects
- find the folder of a module:
import inspect inspect.getfile(Module)
21.13 site — Site-specific configuration hook
21.14 fpectl — Floating point exception control
22 Custom Python Interpreters
22.1 code — Interpreter base classes
22.2 codeop — Compile Python code
23 Importing Modules
23.1 Module
- reload a module/lib in ipython without killing it.
# For Python 2 use the built-in function reload():
reload(module)
# For Python 2 and 3.2–3.3 use reload from module imp:
import imp
imp.reload(module)
# or
import importlib
importlib.reload(module)
- When you run a Python module directly: python fibo.py <arguments>. Remember, everything in Python is an object.
- The reference Python interpreter is CPython.
- If a string contains '\' and you want a literal backslash in it, prefix the string with r (raw), e.g. r'Y:\codes'.
- If the main directory has a package subdirectory, remember to add __init__.py in the subdirectory; preferably also create a main.py in the main directory.
- When the interpreter executes a file as the main program, e.g. python program1.py, it sets the special variable __name__ in program1.py to '__main__'.
- One of the reasons for doing this is that sometimes you write a module (a .py file) which can be executed directly. Alternatively, it can also be imported and used in another module. By doing the main check, you can have that code only execute when you want to run the module as a program, and not have it execute when someone just wants to import your module and call your functions themselves.
- The module's code is executed just as if you had imported it, but with __name__ set to '__main__'; this is why you add the check at the end of your module.
- you can't import a module from an upper directory directly;
- you need to add the working directory to PYTHONPATH in .bashrc,
- or use ipython.
- Add a custom folder path to the Windows environment: append %PYTHONEXE%; to the PATH system variable; add a system variable named PYTHONEXE with value C:\Users\Wei Wu\Anaconda2;C:\Users\Wei Wu\Python\ylib\src\py\; add the same to PYTHONPATH; or add the module path in Spyder directly.
- import a module temporarily from the parent directory without adding the path to the system:
# folder1
# \__init__.py
# \State.py
# \StateMachine.py
# \mouse_folder
#     \MouseAction.py
import os, sys, inspect

currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(currentdir)
sys.path.insert(0, parentdir + '\\mouse')
sys.path.insert(0, parentdir)
from State import State
from StateMachine import StateMachine
from MouseAction import MouseAction
- check module path:
import os
import ylib
print(os.path.abspath(ylib.__file__))
- make a python 3 virtual environment:
mkvirtualenv -p python3 ENVNAME
- install setup.py:
# install into the current interpreter
python setup.py install
# install into a specific virtual environment
/home/weiwu/.virtualenvs/data_analysis/bin/python2 setup.py install
- install from github:
pip install git+https://github.com/quantopian/zipline.git
conda uninstall tqdm
easy_install git+https://github.com/quantopian/zipline.git
- change conda source/configuration:
vim ~/.condarc

channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/msys2/
show_channel_urls: true

# or
conda config --add channels 'https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/'
conda config --set show_channel_urls yes
- easy_install multiple versions, remove version:
import pkg_resources

pkg_resources.require("gensim")         # latest installed version
pkg_resources.require("gensim==3.7.2")  # this exact version
pkg_resources.require("gensim>=3.7.2")  # this version or higher
- Removing an environment
To remove an environment, in your terminal window or an Anaconda Prompt, run:
conda remove --name myenv --all
# You may instead use: conda env remove --name myenv
# To verify that the environment was removed, run:
conda info --envs
- create conda virtualenv:
# create a conda virtualenv (env_name is the name you want)
$ conda create --name env_name python=3.5
# e.g. to create a virtualenv named rqalpha
$ conda create --name rqalpha python=3.5

# use the conda virtualenv
$ source activate env_name
# on Windows, just run activate
$ activate env_name

# leave the conda virtualenv
$ source deactivate env_name
# on Windows, just run deactivate
$ deactivate env_name

# delete the conda virtualenv
$ conda-env remove --name env_name

# add conda for all users
sudo ln -s /share/anaconda3/etc/profile.d/conda.sh /etc/profile.d/conda.sh
# Previous to conda 4.4, the recommended way to activate conda was to modify PATH in
# your ~/.bashrc file. You should manually remove the line that looks like
export PATH="/share/anaconda3/bin:$PATH"
# ^^^ The above line should NO LONGER be in your ~/.bashrc file! ^^^
- percentage output format:
from __future__ import division
print("%s %.4f%%" % (sid, len(not_close) / len(ctp)))
23.2 zipimport — Import modules from Zip archives
23.3 pkgutil — Package extension utility
23.4 modulefinder — Find modules used by a script
23.5 runpy — Locating and executing Python modules
23.6 importlib — The implementation of import
24 Python Language Services
24.1 parser — Access Python parse trees
24.2 ast — Abstract Syntax Trees
24.3 symtable — Access to the compiler’s symbol tables
24.4 symbol — Constants used with Python parse trees
24.5 token — Constants used with Python parse trees
24.6 keyword — Testing for Python keywords
24.7 tokenize — Tokenizer for Python source
24.8 tabnanny — Detection of ambiguous indentation
24.9 pyclbr — Python class browser support
24.10 py_compile — Compile Python source files
24.11 compileall — Byte-compile Python libraries
24.12 dis — Disassembler for Python bytecode
24.13 pickletools — Tools for pickle developers
25 Miscellaneous Services
25.1 formatter — Generic output formatting
26 MS Windows Specific Services
26.1 msilib — Read and write Microsoft Installer files
26.2 msvcrt — Useful routines from the MS VC++ runtime
26.3 winreg — Windows registry access
26.4 winsound — Sound-playing interface for Windows
27 Unix Specific Services
27.1 posix — The most common POSIX system calls
27.2 pwd — The password database
27.3 spwd — The shadow password database
27.4 grp — The group database
27.5 crypt — Function to check Unix passwords
27.6 termios — POSIX style tty control
27.7 tty — Terminal control functions
27.8 pty — Pseudo-terminal utilities
27.9 fcntl — The fcntl and ioctl system calls
27.10 pipes — Interface to shell pipelines
27.11 resource — Resource usage information
27.12 nis — Interface to Sun’s NIS (Yellow Pages)
27.13 syslog — Unix syslog library routines
28 Superseded Modules
28.1 optparse — Parser for command line options
28.2 imp — Access the import internals
29 Undocumented Modules
29.1 Platform specific modules
29.2 call java service
import subprocess

try:
    subprocess.call(["java", "-jar", grobid_jar,
                     # Avoid OutOfMemoryException
                     "-Xmx1024m",
                     "-gH", grobid_home,
                     "-gP", os.path.join(grobid_home, "config/grobid.properties"),
                     "-dIn", pdf_folder,
                     "-exe", "processReferences"])
    return True
except subprocess.CalledProcessError:
    return False
30 Data Analysis:
30.1 pandas:
- suppress scientific notation for floats:
pd.set_option('display.float_format', lambda x: '%.4f' % x)
- Find the column name which has the maximum value for each row
df.idxmax(axis=1)
- change jupyter view:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
- add new columns to a dataframe:
>>> df = pd.DataFrame([[i] for i in range(10)], columns=['num'])
>>> def powers(x):
...     return x, x**2, x**3, x**4, x**5, x**6
>>> df['p1'], df['p2'], df['p3'], df['p4'], df['p5'], df['p6'] = \
...     zip(*df['num'].map(powers))
- first day of the previous month:
import datetime
today = datetime.datetime.today()
the_last_day_of_previous_month = today.replace(day=1) - datetime.timedelta(days=1)
the_first_day_of_previous_month = the_last_day_of_previous_month.replace(day=1)
- N month before:
today - pd.Timedelta(1, unit='M')
- pandas columns into default dictionary:
df_patient[['pid','med_clinic_id']].groupby('pid').apply(lambda x:x['med_clinic_id'].tolist()).to_dict()
- groupby order:
# select the first row in each group; the groups have a MultiIndex.
df_item.groupby(level=0, group_keys=False).apply(
    lambda x: x.sort_values('abs_diff', ascending=False)).groupby(level=0).head(3)
df_most_visited_hospitals.sort_values(['PERSON_ID', 'DISCHARGE_DATE'],
                                      ascending=True).groupby(['PERSON_ID', 'ICD3']).first()
- filter by groupby count size:
df_patient = df_patient.groupby(['hos_id', 'disease']).filter(lambda x:x['person_id'].unique().size>=3)
- apply to each group in groupby:
for name, group in df2.groupby('type'):
    print(name)
    data = group
    if name == 'A':
        print(group.min())
- groupby qcut:
df = pd.DataFrame({'A': 'foo foo foo bar bar bar'.split(),
                   'B': [0.1, 0.5, 1.0] * 2})
df['C'] = df.groupby(['A'])['B'].transform(
    lambda x: pd.qcut(x, 3, labels=range(1, 4)))
print(df)
# if error: ValueError: Length mismatch: Expected axis has 5564 elements,
# new values have 78421 elements
# groupby does not group the NaNs:
df_patient['disease_code'].fillna('unk', inplace=True)
- average times groupby:
df_patient.reset_index().groupby('disease_code').apply(
    lambda x: x['med_clinic_id'].count() / x['person_id'].nunique())
- ValueError: Bin edges must be unique:
pd.qcut(df['column'].rank(method='first'), nbins)
- pd.to_datetimeindex error:
- apply to a column for each row:
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])

def rowFunc(row):
    return row['a'] + row['b'] * row['c']

def rowIndex(row):
    return row.name

df['d'] = df.apply(rowFunc, axis=1)
df['rowIndex'] = df.apply(rowIndex, axis=1)
df
#    a  b  c   d  rowIndex
# 0  1  2  3   7         0
# 1  4  5  6  34         1
- permutation of an array:
from itertools import permutations

person_id = df_disease_sample['person_id'].unique()
perm = permutations(person_id, 2)
df_person_similarity_adj_matrix = pd.DataFrame(index=perm, columns=['similarity'])
- remove value in pandas index:
index.drop('value')
- select rows with value in multiple columns:
df_suspicious_pairs[df_suspicious_pairs[['person_id_1', 'person_id_2']]
                    .isin(['11076976']).any(axis=1)]

def add_edges(G, row, df_suspicious_pairs):
    person_id = row.index
    df_targets = df_suspicious_pairs[df_suspicious_pairs[
        ['person_id_1', 'person_id_2']].isin(person_id).any(axis=1)]
    G.add_edges_from([tuple(x) for x in
                      df_targets[['person_id_1', 'person_id_2']].values])

df_suspicious_person.progress_apply(lambda x: add_edges(G, x, df_suspicious_pairs))
- delete rows that contain string value in a column:
df_disease_sample[~df_disease_sample['treatment_code'].str.contains('DE')]
- pandas values to dict:
med_hos_id_mapping.set_index('med_clinic_id')['hos_id'].to_dict()
- create dataframe from a list of tuples:
pd.DataFrame.from_records([tuples])
- create multiple index for index:
similarity_rank.index = pd.MultiIndex.from_tuples(similarity_rank.index)
- groupby, transform, agg:
df = pd.DataFrame(dict(A=list('aabb'), B=[1, 2, 3, 4], C=[0, 9, 0, 9]))
# groupby is the standard aggregator
df.groupby('A').mean()
# maybe you want these values broadcast across the whole group and return
# something with the same index as what you started with: use transform
df.groupby('A').transform('mean')
# is equivalent to the groupby('A').mean() result broadcast back:
df.set_index('A').groupby(level='A').transform('mean')
# set a column equal to the groupby mean
df_item_dis_lv['mean'] = df_item_dis_lv.groupby(
    ['name', 'dis', 'JGDJ'])['MODEL_JE'].transform('mean')
# agg is used when you want specific functions per column, or more than one
# function on the same column
df.groupby('A').agg(['mean', 'std'])
- groupby aggregate to list:
df.groupby('a')['b'].apply(list)
- read oracle database UnicodeDecodeError 'gb' :
set the language of the editor from Chinese to English.
import os
import cx_Oracle as cx
import pandas as pd
import numpy as np
from tqdm import *
from sqlalchemy import create_engine

os.environ['NLS_LANG'] = 'SIMPLIFIED CHINESE_CHINA.UTF8'
engine = create_engine('oracle://MMAPV41:MMAPV411556@192.168.4.32:1521/orcl?charset=utf8')
conn = cx.connect('MMAPV41/MMAPV411556@192.168.4.32/orcl')
sql_regist = """
select med_clinic_id, person_id, person_nm, person_sex, person_age,
       in_hosp_date, out_hosp_date, med_ser_org_no, clinic_type,
       in_diag_dis_nm, out_diag_doc_cd, med_amout, hosp_lev
from t_kc21
"""
df_regist = pd.read_sql_query(sql_regist, engine)

s_med_clinic_id = pd.read_pickle('med_clinic_id.pkl')
n = 100
sql_regist = """
select med_clinic_id, person_id, person_nm, person_sex, person_age,
       in_hosp_date, out_hosp_date, med_ser_org_no, clinic_type,
       in_diag_dis_nm, out_diag_doc_cd, med_amout, hosp_lev
from t_kc21 where med_clinic_id in (%s)
"""
df_regist = pd.DataFrame()
for i in tqdm(range(0, int(len(s_med_clinic_id) / 100), n)):
    s = "'" + ','.join(s_med_clinic_id.ix[i:(i + n)].values.flatten()).replace(',', "','") + "'"
    sql = sql_regist % (s)
    try:
        df_regist_iter = pd.read_sql(sql, conn)
        df_regist = df_regist.append(df_regist_iter)
    except UnicodeDecodeError:
        continue
df_regist.to_pickle("registration_data.pkl")
- filter rows by number, groupby filter:
df_item = t_order.groupby(['name']).filter(lambda x:x['hos_id'].unique().size>=10)
- resample groupby aggregate:
df_simul_sample = df_simul_sample.resample('1H')['PERSON_ID'].agg(list)
- convert columns from capital to lower:
df_patient = df_patient.rename(columns = lambda x: x.lower())
- find rows with nearest dates/value:
df_result = pd.DataFrame()
for idx, value in enumerate(groups):
    df_target_group = df_patient.loc[value]
    df_target_group.sort_values('入院日期', inplace=True)
    df_target_group['checkin_diff'] = df_target_group['入院日期'].diff() / np.timedelta64(1, 'D')
    df_target_group.reset_index(inplace=True)
    index = df_target_group[df_target_group['checkin_diff'] <= 3].index
    result = pd.concat([df_target_group.loc[index],
                        df_target_group.loc[index - 1]]).sort_values('入院日期')
    result.drop_duplicates(['个人ID', '入院日期'], inplace=True)
    result['group'] = idx
    result['hospitals'] = result['机构'].unique().shape[0]
    df_result = df_result.append(result.drop('checkin_diff', axis=1))
- save to excel sheet:
writer = pd.ExcelWriter('nanjing_result.xlsx')
df_nanjing.to_excel(writer, '全部分组')
big_groups.to_excel(writer, '超大组')
writer.save()
- concatenate multiple columns into one column:
diff_checkout_next_checkin['date'] = dataframe.loc[index, columns].stack().sort_values()
diff_checkout_next_checkin['diff'] = diff_checkout_next_checkin['date'].diff() / np.timedelta64(1, 'M')
- sort values:
df_nanjing.sort_values(['group_sn', '个人ID', '入院日期'], inplace=True)
- read excel:
df_nanjing = pd.read_excel('result.xlsx',dtype={'证件号':str, '个人ID':str}, parse_dates=['入院日期','出院日期'])
- groupby ratio:
df_items_sum = df_items.groupby(['disease_code', 'soc_srt_dire_nm']).agg({'amount': 'sum'})
# Change: groupby state_office and divide by sum
df_items_ratio = df_items_sum.groupby(level=0).apply(lambda x: 100 * x / float(x.sum()))
- groupby group size:
group_size = df_nanjing.groupby(['group_sn'])['个人ID'].unique().apply(len)
big_groups = df_nanjing[df_nanjing['group_sn'].isin(group_size[group_size >= 8].index)]
small_groups = df_nanjing[df_nanjing['group_sn'].isin(group_size[group_size < 8].index)]
- create a dataframe from a dictionary:
df = pd.DataFrame.from_dict({}, orient='index')
- filter by value counts:
df_person_count = df_simul['PERSON_ID'].value_counts()
# keep only person ids that appear more than 3 times
df_simul = df_simul[df_simul['PERSON_ID'].isin(df_person_count[df_person_count > 3].index)]
- get max value counts column name in each group:
df_sample1 = df_sample.sort_values(['person_id','discharge_date'],ascending=True).groupby(['person_id', 'icd3']).first().agg({'hosp_lev':pd.Series.value_counts}) df_sample1.groupby(['person_id']).sum().idxmax(axis=1)
- groupby value counts max count name:
s_count = df_patient.groupby('disease_code')['mon'].value_counts()
s_count.name = 'cnt_mon'
df_disease['cnt_mon'] = s_count.reset_index().pivot(
    index='disease_code', columns='mon', values='cnt_mon').idxmax(axis=1)
- get unique value counts within each group:
import pandas as pd

df = pd.DataFrame({'date': ['2013-04-01', '2013-04-01', '2013-04-01',
                            '2013-04-02', '2013-04-02'],
                   'user_id': ['0001', '0001', '0002', '0002', '0002'],
                   'duration': [30, 15, 20, 15, 30]})
df.groupby('date').agg({'user_id': pd.Series.nunique})
- get unique columns value groups:
rels = ['疾病名称','诊疗大类2'] rels_cure = df_cure.groupby(rels).size().reset_index()[rels].values.tolist()
- fill value must be in categories:
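A sketch, assuming the column is categorical and 'unk' is the desired fill value; the fill value must first be added as a category:
df['col'] = df['col'].cat.add_categories('unk').fillna('unk')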
- fill na with previous column:
df_tree.fillna(method='pad', axis=1, inplace=True)
- delete a column in a dataframe:
del df['column']
- create node edges
def create_node(label, nodes):
    count = 0
    for node_name in tqdm(nodes):
        node = Node(label, name=node_name)
        # g.schema.create_uniqueness_constraint(label, node_name)
        try:
            g.create(node)
            count += 1
        except ClientError:
            continue
    # debug(count)
    return

# create the relationship edges between entities
def create_relationship(start_node, end_node, edges, rel_type, rel_name):
    count = 0
    # deduplicate the edges
    set_edges = []
    for edge in edges:
        try:
            set_edges.append('###'.join(edge))
        except TypeError:
            continue
    all = len(set(set_edges))
    for edge in tqdm(set(set_edges)):
        edge = edge.split('###')
        p = edge[0]
        q = edge[1]
        if p == q:
            continue
        query = ("match(p:%s),(q:%s) where p.name='%s' and q.name='%s' "
                 "create (p)-[rel:%s{name:'%s'}]->(q)" %
                 (start_node, end_node, p, q, rel_type, rel_name))
        try:
            g.run(query)
            count += 1
            # debug(rel_type)
        except Exception as e:
            info(e)
    return

# create the nodes for the knowledge graph's central diseases
def create_diseases_nodes(disease_infos):
    count = 0
    for disease_dict in tqdm(disease_infos):
        node = Node("Disease", name=disease_dict['name'],
                    desc=disease_dict['desc'],
                    prevent=disease_dict['prevent'],
                    cause=disease_dict['cause'],
                    easy_get=disease_dict['easy_get'],
                    cure_lasttime=disease_dict['cure_lasttime'],
                    cure_department=disease_dict['cure_department'],
                    cure_way=disease_dict['cure_way'],
                    cured_prob=disease_dict['cured_prob'])
        g.create(node)
        # count += 1
        # debug(count)
    return
- select rows by column values:
df[(df['a'].isin(condition1)) & (df['a'].isin(condition2))]
df[df['a'] == a]
df_insurance_disease = df_insurance[(df_insurance['type'] == '医疗费用-疾病') &
                                    (df_insurance['year'].isin(years[4:]))]
- create tuples or dictionary from two columns:
subset = hos_ids[['GHDJID', 'REAL_JE', 'hos_id']].reset_index()
tuples = [tuple(x) for x in subset.values]
# another method
dict(df[['a', 'b']].values.tolist())
[tuple(x) for x in df[['a', 'b', 'c']].values]
- mapping:
dictionary = df_mapping.set_index('a')['b'].to_dict()
df['a'].map(dictionary)
- filter not null, filter not NaT rows:
Since strings have variable length, they are stored as object dtype by default. To store them as a fixed-width string type instead, you can do something like this:
df_text.ix[df_text.Conclusion.values.nonzero()[0]]
df['column'] = df['column'].astype('|S80')  # the max length is set at 80 bytes
# or alternatively
df['column'] = df['column'].astype('|S')  # by default sets the length to the max it encounters
- change datatypes
df.a.astype(float) drinks['beer_servings'] = drinks.beer_servings.astype(float)
- change date type to string:
drinks['beer_servings'] = drinks.beer_servings.astype(str)
- read csv without header, delimiter is space:
vocab = pd.read_csv("/home/weiwu/share/deep_learning/data/model/phrase/zhwiki/categories/三板.vocab",delim_whitespace=True,header=None)
- parse text in csv
reddit_news = pd.read_csv('/home/weiwu/share/deep_learning/data/RedditNews.csv')
DJA_news = pd.read_csv('/home/weiwu/share/deep_learning/data/Combined_News_DJIA.csv')
na_str_DJA_news = DJA_news.iloc[:, 2:].values
na_str_DJA_news = na_str_DJA_news.flatten()
na_str_reddit_news = reddit_news.News.values
sentences_reddit = [s.encode('utf-8').split() for s in na_str_reddit_news]
sentences_DJA = [s.encode('utf-8').split() for s in na_str_DJA_news]
- rank
DataFrame.rank(axis=0, method='average', numeric_only=None, na_option='keep',
               ascending=True, pct=False)
Compute numerical data ranks (1 through n) along axis. Equal values are assigned a rank that is the average of the ranks of those values.
- n largest value
DataFrame.nlargest(n, columns, keep='first') Get the rows of a DataFrame sorted by the n largest values of columns.
>>> df = DataFrame({'a': [1, 10, 8, 11, -1],
...                 'b': list('abdce'),
...                 'c': [1.0, 2.0, np.nan, 3.0, 4.0]})
>>> df.nlargest(3, 'a')
    a  b    c
3  11  c  3.0
1  10  b  2.0
2   8  d  NaN
- quantile
DataFrame.quantile(q=0.5, axis=0, numeric_only=True, interpolation='linear')
Return values at the given quantile over requested axis, a la numpy.percentile.
>>> df = DataFrame(np.array([[1, 1], [2, 10], [3, 100], [4, 100]]),
...                columns=['a', 'b'])
>>> df.quantile(.1)
a    1.3
b    3.7
dtype: float64
>>> df.quantile([.1, .5])
       a     b
0.1  1.3   3.7
0.5  2.5  55.0
- generate a dataframe:
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
# or
df = pd.DataFrame(data={'a': [1, 2], 'b': [3, 3]})
- create diagonal matrix/dataframe using a series:
df = pd.DataFrame(np.diag(s), columns=Q.index)
- connection with mysql:
pandas.read_sql_query(sql, con=engine)
pandas.read_sql_table(table_name, con=engine)
pandas.read_sql(sql, con=engine)

sql = 'DROP TABLE IF EXISTS etf_daily_price;'
result = engine.execute(sql)
- dropna:
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
- melt.
pandas.melt(frame, id_vars=None, value_vars=None, var_name=None,
            value_name='value', col_level=None)
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.
""" Parameters: frame : DataFrame id_vars : tuple, list, or ndarray, optional Column(s) to use as identifier variables. value_vars : tuple, list, or ndarray, optional Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars. var_name : scalar Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’. value_name : scalar, default ‘value’ Name to use for the ‘value’ column. col_level : int or string, optional If columns are a MultiIndex then use this level to melt. """ DataFrame['idname'] = DataFrame.index pd.melt(DataFrame, id_vars=['idname'])
>>> import pandas as pd
>>> df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
...                    'B': {0: 1, 1: 3, 2: 5},
...                    'C': {0: 2, 1: 4, 2: 6}})
>>> df
   A  B  C
0  a  1  2
1  b  3  4
2  c  5  6
>>> pd.melt(df, id_vars=['A'], value_vars=['B'])
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5
- fill nan:
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None,
                 downcast=None, **kwargs)
# method: {'backfill', 'bfill', 'pad', 'ffill', None}, default None
- select non zero rows from series:
s[s.nonzero()]
- create value by criteria
df[df.col1.map(lambda x: x != 0)] = 1
- dataframe to series:
s = df[df.columns[0]]
- replace value:
DataFrame.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad', axis=None) df = df[df.line_race != 0]
- pandas has value:
value in df['column_name'].values  # note: `value in df['column_name']` checks the index, not the values
set(a).issubset(df['a'])
- calculate each value as a percentage of its column sum:
df.apply(lambda x: x / x.sum() * 100, axis=0)  # use axis=1 for row sums
- pandas has null value:
df.isnull().values.any()
- find all the values of TRUE in a dataframe:
z = (a != b)
mask = z.any(axis=1)  # rows where any column differs
pd.concat([a.loc[mask], b.loc[mask]], axis=1)
- reduce
from functools import reduce reduce(lambda x, y: x+y, range(1,101))
- if array a is a subset of another array b:
set(B).issubset(set(A))
- remove negative value from a column:
filtered_1 = b['TRADE_size'].apply(lambda x: 0 if x < 0 else x)
# or, in place:
b.loc[b['TRADE_size'] < 0, 'TRADE_size'] = 0
- drop all rows value equal to 0:
df.loc[~(df==0).all(axis=1)]
- drop columns/lable:
DataFrame.drop(labels, axis=1, level=None, inplace=False, errors='raise')
- check if any value is NaN in DataFrame
df.isnull().values.any() df.isnull().any().any()
- maximum & minimum value of a dataframe:
df.values.max() df.values.min()
- select value by creteria:
logger.debug("all weight are bigger than 0? %s", (df_opts_weight>0).all().all()) logger.debug("all weight are smaller than 1? %s", (df_opts_weight<=1).all().all()) logger.debug("weight sum smaller than 0: %s", df_opts_weight[df_opts_weight<0].sum(1))
- count all duplicates:
import pandas as pd

In [15]: a = pd.DataFrame({'a': ['KBE.US','KBE.US','KBE.US','KBE.US','KBE.US','KBE.US','O.US','O.US','O.US','O.US','O.US'],
                           'b': ['KBE','KBE','KBE','KBE','KBE','KBE','O','O','O','O','O']})
In [16]: count = a.groupby('a').count()
In [20]: (count > 5).all().all()
Out[20]: False
In [21]: (count > 4).all().all()
Out[21]: True

- datetime64[ns] missing data, null:
For datetime64[ns] types, NaT represents missing values. This is a pseudo-native sentinel value that can be represented by numpy in a singular dtype (datetime64[ns]). pandas objects provide intercompatibility between NaT and NaN.

In [16]: df2
Out[16]:
        one       two     three four   five  timestamp
a -0.166778  0.501113 -0.355322  bar  False 2012-01-01
c -0.337890  0.580967  0.983801  bar  False 2012-01-01
e  0.057802  0.761948 -0.712964  bar   True 2012-01-01
f -0.443160 -0.974602  1.047704  bar  False 2012-01-01
h -0.717852 -1.053898 -0.019369  bar  False 2012-01-01

In [17]: df2.loc[['a','c','h'], ['one','timestamp']] = np.nan

In [18]: df2
Out[18]:
        one       two     three four   five  timestamp
a       NaN  0.501113 -0.355322  bar  False        NaT
c       NaN  0.580967  0.983801  bar  False        NaT
e  0.057802  0.761948 -0.712964  bar   True 2012-01-01
f -0.443160 -0.974602  1.047704  bar  False 2012-01-01
h       NaN -1.053898 -0.019369  bar  False        NaT
- rename column names:
df_bbg = df_bbg.rename(columns = lambda x: x[:4].replace(' ','')) df = df.rename(columns={'a':'A'})
rename according to column value type:
name = {2:'idname', 23:'value', 4:'variable'} df.rename(columns=lambda x: name[(gftIO.get_column_type(df,x))], inplace=True)
- rename column according to value:
name = {'INNERCODE': 'contract_code', 'OPTIONCODE': 'contract_name',
        'SETTLEMENTDATE': 'settlement_date', 'ENDDATE': 'date',
        'CLOSEPRICE': 'close_price'}
data.rename(columns=lambda x: name[x], inplace=True)
- remove characters after space:
# the lambda body was cut off in the original note; assumed intent: keep the text before the first space
df_bbg = df_bbg.rename(columns=lambda x: x.split(' ')[0])
- apply by group:
df_long_term = small_groups.groupby('个人ID').progress_apply(lambda x: long_term_hospitalization(x[['入院日期', '出院日期']], days=30))
- pandas long format to pivot:
pivoted = df.pivot('name1', 'name2', 'name3')
specific_risk = self.risk_model['specificRisk'].pivot(
    index='date', columns='symbol', values='specificrisk')
df_pivot_industries_asset_weights = pd.pivot_table(
    df_industries_asset_weight, values='value', index=['date'],
    columns=['industry', 'symbol'])
- pivot time series to hourly
df['hour'] = df['date'].dt.hour
df['day'] = df['date'].dt.date
pd.pivot_table(df, index='hour', columns='day', values='pb', aggfunc=np.sum)
- change the time or date or a datetime:
end = end.replace(hour=23, minute=59, second=59)
- Wind (万德) data into pandas:
df = pd.DataFrame(data=w.wsd().Data[0], index=w.wsd().Times)
- check DatetimeIndex difference:
# to check the frequency of the strategy, DAILY or MONTHLY
dt_diff = df_single_period_return.index.to_series().diff().mean()
if dt_diff < pd.Timedelta('3 days'):
- time delta
import datetime s + datetime.timedelta(minutes=5)
- resample by a column:
need to set the index as a datetime index, then use the resample function.
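A sketch, assuming 'date' and 'amount' are existing column names:
df = df.set_index(pd.DatetimeIndex(df['date']))
monthly = df['amount'].resample('1M').sum()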
- resample at a fraction but keep at least one:
med_id_sample = df_patient.groupby(['JGID', 'CYZDDM3']).apply(
    lambda x: x.iloc[random.choice(range(0, len(x)))])['GHDJID'].values
med_id_sample1 = df_patient.groupby(['JGID', 'CYZDDM3']).apply(
    lambda x: x.sample(frac=0.1))['GHDJID'].values
med_id_samples = np.unique(np.concatenate((med_id_sample, med_id_sample1)))
df_patient[df_patient['GHDJID'].isin(med_id_samples)]
- resample by month and keep the last valid row
benchmark_weight.index.name = 'Date'
m = benchmark_weight.index.to_period('m')
benchmark_weight = benchmark_weight.reset_index().groupby(m).last().set_index('Date')
benchmark_weight.index.name = ''
- groupby and sort by another column:
# only take the largest value of score per group
df_input_text_entity.sort_values(['score'], ascending=False).groupby('mention').head(1)
- filter two dataframe by columns' value
pd.merge(df_input_text_entity_0, df_input_text_entity_1, on=['mention', 'entity'])
30.1.1 multiplying
- multiplication does not depend on the order of the index or columns:
pandas aligns both operands on their (sorted) index and column labels before calculating.
In [87]: a = pd.DataFrame({'dog': [1, 2], 'fox': [3, 4]}, index=['a', 'b'])
In [88]: a
Out[88]:
   dog  fox
a    1    3
b    2    4
In [89]: b = pd.DataFrame({'fox': [1, 2], 'dog': [3, 4]}, index=['b', 'a'])
In [94]: b
Out[94]:
   dog  fox
b    3    1
a    4    2
In [95]: a * b
Out[95]:
   dog  fox
a    4    6
b    6    4
- dot product
dot also aligns on labels: the columns of the left operand are matched against the index of the right operand.
In [99]: a.dot(b.T)
Out[99]:
    b   a
a   6  10
b  10  16
In [100]: b.T
Out[104]:
     b  a
dog  3  4
fox  1  2
In [105]: a
Out[105]:
   dog  fox
a    1    3
b    2    4
30.1.2 Index
- Index manuplication
- set column as datetime index
index = index.set_index(pd.DatetimeIndex(index['tradeDate'])).drop('tradeDate', axis=1)
# df = df.set_index(pd.DatetimeIndex(df['Date']))
- concatenate:
pd.concat([df1, df2], axis=0).sort_index()
pd.concat([df1, df2], axis=1)
result = df1.join(df2, how='outer')
- check if the index is datetimeindex:
if isinstance(df_otv.index, pd.DatetimeIndex):
    df_otv.reset_index(inplace=True)
- pandas are two dataframe identical
pandas.DataFrame.equals()
- change index name:
df.index.names = ['Date']
- for loop in pandas dataframe:
for index, row in df.iterrows():
- compare two time series:
s1[s1.isin(s2)]
ax = df1.plot()
df2.plot(ax=ax)
- datetime to string:
df.index.strftime("%Y-%m-%d %H:%M:%S")
- concatenate by index
pd.concat([df1, df2], axis=1)
concat joins two dataframes into a new one aligned on the index, preserving the columns of both:
A: index | variable | value
B: index | variable | value
pd.concat([A, B], axis=1): index | variable | value | variable | value
- merge
merge joins two dataframes into a new one on the given columns:
A: index | variable | value
B: index | variable | value
pd.merge(A, B, how='left', on=['index', 'variable']): index | variable | value | value
- update
update dataframe1 with dataframe2
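A sketch of DataFrame.update(): non-NA values of df2 overwrite df1 in place, aligned on index and columns.
df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'a': [9]}, index=[1])
df1.update(df2)  # df1['a'] is now [1, 9, 3]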
- access hierarchical index.
- A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays), an array of tuples (using MultiIndex.from_tuples), or a crossed set of iterables (using MultiIndex.from_product).
df.loc['date', 'col'], df['date'], df.ix[['date1', 'date2']]
- slicing:
df.loc['start':'end',], df['start': 'end']
- slice with a ‘range’ of values, by providing a slice of tuples:
df.loc[('2006-11-02','USO.US'):('2006-11-06','USO.US')] df.loc(axis=0)[:,['SPY.US']]
- select certain columns:
df.loc(axis=0)[:,['SPY.US']]['updatedTime']
- select rows with certain column value:
df.loc[df['column_name'].isin(some_values)]
- select date range using pd series.
date_not_inserted = whole_index[~whole_index.isin(date_in_database['date'])] df_need_to_be_updated = whole_df_stack.ix[days_not_in_db]
- remove pandas duplicated index
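A common idiom, keeping the first occurrence:
df = df[~df.index.duplicated(keep='first')]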
- convert a dataframe to an array:
df.values  # (DataFrame.as_matrix() in older pandas)
- panel:
- create from dictionary:
datetime_index = pd.DatetimeIndex(assets_group['date'].unique()) panel_model = pd.Panel({date: pd.DataFrame(0, index=assets.loc[date,'variable'], columns=assets.loc[date,'variable']) for date in datetime_index})
the items axis of a pandas Panel should hold datetime64 values; it should not be an array.
- unpivot multindex, multindex into colum:
df_med_similarity_adj_matrix['similarity'] = df_med_similarity_adj_matrix.apply(
    lambda x: 1 - jaccard(docs[x.name[0]], docs[x.name[1]]), axis=1)
df_med_similarity_adj_matrix.index = pd.MultiIndex.from_tuples(
    df_med_similarity_adj_matrix.index)
df_med_similarity_adj_matrix.reset_index().pivot(
    index='level_0', columns='level_1', values='similarity')
30.2 numpy
- numpy unique without sort:
>>> import numpy as np
>>> a = [4, 2, 1, 3, 1, 2, 3, 4]
>>> np.unique(a)
array([1, 2, 3, 4])
>>> indexes = np.unique(a, return_index=True)[1]
>>> [a[index] for index in sorted(indexes)]
[4, 2, 1, 3]
- plot histogram:
>>> import matplotlib.pyplot as plt
>>> rng = np.random.RandomState(10)  # deterministic random data
>>> a = np.hstack((rng.normal(size=1000),
...                rng.normal(loc=5, scale=2, size=1000)))
>>> plt.hist(a, bins='auto')  # arguments are passed to np.histogram
>>> plt.title("Histogram with 'auto' bins")
>>> plt.show()
- which quantile:
df_average_cost['COST_LEVEL'] = pd.qcut(df_average_cost['MED_AMOUNT'], 3, labels=["high", "medium", "low"])
- quantile:
numpy.quantile(a, q, axis=None, out=None, overwrite_input=False, interpolation='linear', keepdims=False)
- upper triangle matrix:
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# array([[1, 2, 3],
#        [4, 5, 6],
#        [7, 8, 9]])
a[np.triu_indices(3, k=1)]
# this returns array([2, 3, 6])
- sort an array by descending:
In [25]: temp = np.random.randint(1, 10, 10)
In [26]: temp
Out[26]: array([5, 2, 7, 4, 4, 2, 8, 6, 4, 4])
In [27]: id(temp)
Out[27]: 139962713524944
In [28]: temp[::-1].sort()   # sorts in place, descending
In [29]: temp
Out[29]: array([8, 7, 6, 5, 4, 4, 4, 4, 2, 2])
In [30]: id(temp)
Out[30]: 139962713524944
- save an array:
import numpy as np np.save(filename, array)
- maximum value in each row
np.amax(ar, axis=1)
- from 2-D array to 1-D array with one column
import numpy as np
a = np.array([[1], [2], [3]])
a.flatten()
- Take a sequence of 1-D arrays and stack them as columns to make a single 2-D:
numpy.column_stack(tup)
# Parameters: tup : sequence of 1-D or 2-D arrays to stack.
# All of them must have the same first dimension.
>>> a = np.array((1, 2, 3))
>>> b = np.array((2, 3, 4))
>>> np.column_stack((a, b))
- expand 1-D numpy array to 2-D:
- expand the shape of an array:
numpy.expand_dims(a, axis)
# Expand the shape of an array: insert a new axis that will appear at the
# axis position in the expanded array shape.
>>> x = np.array([1, 2])
>>> x.shape
(2,)
>>> y = np.expand_dims(x, axis=0)
>>> y
array([[1, 2]])
>>> y.shape
(1, 2)
- count nan:
np.count_nonzero(~np.isnan(df['series']))
- count number of negative value:
np.sum((df < 0).values.ravel())
- check the difference of two arrays:
numpy.setdiff1d: Return the sorted, unique values in ar1 that are not in ar2
np.setdiff1d(ar1, ar2)
- turn a list of tuples into a list:
[item for t in lt for item in t]
- sorted a list of tuples
sorted(enumerate(sims), key=lambda item: -item[1])
- reshape:
a.reshape(1, -1); -1 means the size of that dimension is inferred automatically.
- select random symbols from a listdir:
# get random symbols up to the target position limit
import numpy as np
position_limit = 8
arr = np.arange(len(target_symbols))
np.random.shuffle(arr)
target_symbols = np.asarray(target_symbols)[arr[:position_limit]]
30.3 plot:
- %matplotlib inline
To set this up, before any plotting or import of matplotlib is performed you must execute the %matplotlib magic command. This performs the necessary behind-the-scenes setup for IPython to work correctly hand in hand with matplotlib; it does not, however, actually execute any Python import commands, that is, no names are added to the namespace.
30.3.1 subplot with the same axis:
pandas plots use matplotlib:
- plot different series on the same chart.
cl_active_contract_pricing.plot() cl_pricing.plot(style='k--')
- plot in ipython or jupyter notebook:
ax = contract_data.plot(legend=True) continuous_price.plot(legend=True, style='k--', ax=ax) plt.show()
30.3.2 drawing multiple figures at once
- same plot, multiple figures:
# figure.py
import matplotlib.pyplot as plt
import numpy as np

data = np.arange(100, 201)
plt.plot(data)

data2 = np.arange(200, 301)
plt.figure()
plt.plot(data2)

plt.show()
- multiple subplots in the same window
import matplotlib.pyplot as plt

fig = plt.figure()
ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 2)
ax3 = fig.add_subplot(2, 2, 3)
fig, axes = plt.subplots(2, 3)  # returns a (fig, axes) tuple
# or
data = np.arange(100, 201)
plt.subplot(2, 1, 1)
plt.plot(data)
data2 = np.arange(200, 301)
plt.subplot(2, 1, 2)
plt.plot(data2)
plt.show()
30.3.3 subplot with different axis
plt.subplot(2, 1, 1)
plt.boxplot(x1)
plt.plot(1, x1.ix[-1], 'r*', markersize=15.0)
plt.subplot(2, 1, 2)
x1.plot()
# or
fig, axes = plt.subplots(2, 1, figsize=(10, 14))
axes[0].boxplot(pe000001)
axes[0].plot(1, pe000001.ix[-1], 'r*', markersize=15.0)
pe000001.plot()
30.3.4 plot a secondary y scale
df.price.plot(legend=True) (100-df.pct_long).plot(secondary_y=True, style='g', legend=True)
- highlight a certain value in the plot:
a['DGAZ.US'].hist(bins=50) plt.axvline(a['DGAZ.US'][-1], color='b', linestyle='dashed', linewidth=2)
30.3.5 plot seaborn:
- plot heatmap:
figure = plt.figure(figsize=(12, 12))
ax = sns.heatmap(temp, vmin=0, vmax=10, fmt="d", cmap="YlGnBu", annot=True)
# save plot
figure.savefig('./images/test.png')
- save seaborn heatmap:
plt.subplots(figsize=(12, 12))  # fig, ax = plt.subplots(figsize=(12, 12))
ax = sns.heatmap(temp, vmin=0, vmax=10, fmt="d", cmap="YlGnBu", annot=True)
# !!! can't save the figure directly; need to get the figure first.
figure = ax.get_figure()
figure.savefig('./images/%s.png' % (str(person_ids)))
30.3.6 plot a 3d figure:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import numpy as np

strike = np.linspace(50, 150, 5)
ttm = np.linspace(0.5, 2.5, 8)
strike, ttm = np.meshgrid(strike, ttm)
iv = (strike - 100) ** 2 / (100 * strike) / ttm   # a toy implied-volatility surface
fig = plt.figure(figsize=(9, 6))
ax = fig.gca(projection='3d')
surf = ax.plot_surface(strike, ttm, iv, rstride=2, cstride=2,
                       cmap=plt.cm.coolwarm, linewidth=0.5, antialiased=True)
fig.colorbar(surf, shrink=0.5, aspect=5)
- fig is the matplotlib.figure.Figure object.
- ax can be either a single Axes object or an array of Axes objects if more than one subplot was created.
30.3.7 display Chinese:
$ fc-list :lang=zh
/usr/share/fonts/truetype/wqy/wqy-microhei.ttc: 文泉驿微米黑,文泉驛微米黑,WenQuanYi Micro Hei:style=Regular
/usr/share/fonts/truetype/droid/DroidSansFallbackFull.ttf: Droid Sans Fallback:style=Regular
/usr/share/fonts/truetype/wqy/wqy-microhei.ttc: 文泉驿等宽微米黑,文泉驛等寬微米黑,WenQuanYi Micro Hei Mono:style=Regular
import matplotlib.font_manager as mfm
import matplotlib.pyplot as plt

font_path = "/usr/share/fonts/truetype/wqy/wqy-microhei.ttc"
prop = mfm.FontProperties(fname=font_path)
plt.text(0.5, 0.5, s=u'测试', fontproperties=prop)  # '测试' = 'test'
plt.show()
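Alternatively, register the font family globally via rcParams (a sketch; the family name is taken from the fc-list output above):
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['WenQuanYi Micro Hei']
plt.rcParams['axes.unicode_minus'] = False  # keep the minus sign rendering correctly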
30.3.8 stacked barplot, portfolio change
pivot_df = df_insurance_accident.pivot(
    index='year', columns='cat_age_sex', values='basic_insurance_fee')
pivot_df.plot.bar(title='医疗保险-疾病纯保费',  # "medical insurance: disease pure premium"
                  stacked=True, figsize=(10, 7))
30.4 scipy
- combinations: choose k from n:
\[ \binom{n}{k} = \frac{n(n-1)\dotsb(n-k+1)}{k(k-1)\dotsb 1}, \]
which can be written using factorials as
\[ \binom{n}{k} = \frac{n!}{k!(n-k)!}. \]
>>> from scipy.special import comb
>>> k = np.array([3, 4])
>>> n = np.array([10, 10])
>>> comb(n, k, exact=False)
array([ 120.,  210.])
>>> comb(10, 3, exact=True)
120L
>>> comb(10, 3, exact=True, repetition=True)
220L
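A quick pure-Python cross-check of the factorial form above:
from math import factorial

def n_choose_k(n, k):
    # n! / (k! * (n - k)!)
    return factorial(n) // (factorial(k) * factorial(n - k))

print(n_choose_k(10, 3))  # 120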
[https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.comb.html]
30.5 networkx:
- list the neighbors of a node:
G.neighbors('node')
- count in/out edges of a node (DiGraph):
G.in_degree('node')
G.out_degree('node')
- create an edge:
G.add_edges_from([(a, b)])
G.add_weighted_edges_from([(a, b, weight)])
- create a graph:
G = nx.Graph()
# directed graph
G = nx.DiGraph()
- dump a graph:
nx.write_gexf(G, 'file/path.gexf')
- draw a graph:
nx.draw(G, with_labels=True)
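A minimal end-to-end sketch tying the calls above together (node names are made up):
import networkx as nx
import matplotlib.pyplot as plt

G = nx.DiGraph()
G.add_weighted_edges_from([('a', 'b', 1.0), ('a', 'c', 2.0)])
print(list(G.neighbors('a')))  # ['b', 'c']
print(G.in_degree('b'))        # 1
nx.draw(G, with_labels=True)
plt.show()
nx.write_gexf(G, '/tmp/demo.gexf')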
31 Machine learning:
31.1 data processing:
- label encoding: map a list of labels to integer codes:
from sklearn.preprocessing import LabelEncoder
class_label = LabelEncoder()
data["label"] = class_label.fit_transform(data["label"].values)
# or build the mapping by hand
label_mapping = {label: idx for idx, label in enumerate(np.unique(data["label"]))}
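To apply the hand-built mapping, or to invert a fitted encoder (a sketch against the same hypothetical DataFrame data):
data["label"] = data["label"].map(label_mapping)                 # apply the manual mapping
original = class_label.inverse_transform(data["label"].values)   # recover the original labels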
- one-hot encoding:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

X = data[["color", "price"]].values
# label-encode the color column into integers
color_label = LabelEncoder()
X[:, 0] = color_label.fit_transform(X[:, 0])
# one-hot encode the color column
one_hot = OneHotEncoder(categorical_features=[0])  # categorical_features was removed in newer sklearn
print(one_hot.fit_transform(X).toarray())
# or
pd.get_dummies(data[["color", "price"]])
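Since categorical_features is gone from newer scikit-learn, a sketch of the ColumnTransformer equivalent:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer([("onehot", OneHotEncoder(), [0])], remainder="passthrough")
encoded = ct.fit_transform(X)  # one-hot for column 0, price passed through unchanged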
32 Deep Learning
32.1 Tensorflow
- convert string to number index:
with tf.Session() as sess:
    mapping_strings = tf.constant(["emerson", "lake", "palmer"])
    feats = tf.constant(["emerson", "lake", "and", "palmer"])
    ids = tf.contrib.lookup.string_to_index(
        feats, mapping=mapping_strings, default_value=-1)
    tf.compat.v1.tables_initializer().run()
    idx = ids.eval()  # ==> [0, 1, -1, 2]
- disable tensorflow warnings:
os.environ["TF_CPP_MIN_LOG_LEVEL"]="2"
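The variable has to be set before tensorflow is imported, e.g.:
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"  # must run before the import to take effect
import tensorflow as tf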
- install tensorflow:
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --set show_channel_urls yes
conda install tensorflow-gpu==1.3
- test drive:
python -m tensorflow.models.image.mnist.convolutional
32.1.1 GPU test:
import tensorflow as tf
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print(sess.run(c))
# pin the ops to the first GPU
import tensorflow as tf
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print(sess.run(c))
# pin the ops to the second GPU
with tf.device('/device:GPU:1'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print(sess.run(c))
with tf.device('/device:GPU:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
# allow_soft_placement falls back to another device if the requested one is unavailable
sess = tf.Session(config=tf.ConfigProto(
    allow_soft_placement=True, log_device_placement=True))
print(sess.run(c))
# list the devices TensorFlow can see
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
# time a small matmul on the GPU
import numpy as np
import tensorflow as tf
from datetime import datetime

device_name = "/GPU:0"
with tf.device(device_name):
    random_matrix = tf.random_uniform(shape=(1, 1), minval=0, maxval=1)
    dot_operation = tf.matmul(random_matrix, tf.transpose(random_matrix))
    sum_operation = tf.reduce_sum(dot_operation)

startTime = datetime.now()
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as session:
    result = session.run(sum_operation)
    print(result)

print("Time taken:", datetime.now() - startTime)
32.2 Computer Vision
32.2.1 style transfer
32.2.2 basic operation
- convert an image into grey:
from PIL import Image

image = Image.open('/tmp/capcha.png')
image = image.convert('L')  # greyscale
data = image.load()
w, h = image.size
for i in range(w):
    for j in range(h):
        if data[i, j] > 125:
            data[i, j] = 255  # pure white
        else:
            data[i, j] = 0    # pure black
image.save('clean_captcha.png')

# or with OpenCV
import cv2
import numpy as np
import matplotlib.pyplot as plt

img = cv2.imread(r'D:/UNI/Y3/DIA/2K18/lab.jpg')
RGB_img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
gray = cv2.cvtColor(RGB_img, cv2.COLOR_RGB2GRAY)
equ = cv2.equalizeHist(gray)
res = np.hstack((gray, equ))  # stack greyscale and equalized side by side (shapes must match)
plt.imshow(gray, cmap='gray', vmin=0, vmax=255)
- show image in jupyter notebook:
from IPython.display import Image
Image(filename=content_seg_path)
- read image into numpy arrays:
import cv2
import scipy.misc  # scipy.misc.imread was removed in scipy >= 1.2; imageio.imread replaces it

init_img = cv2.imread(initImg)               # shape [y, x, 3], BGR order
content = scipy.misc.imread(contentImg, mode='RGB')
- convert numpy array to image:
import scipy.misc
rgb = scipy.misc.toimage(np_array)  # deprecated in newer scipy; PIL.Image.fromarray is the replacement
# or
cv2.imwrite('color_img.jpg', np_array)
- read image into torch tensor:
import numpy as np
from PIL import Image
from torchvision import transforms
from torchvision.utils import save_image

def open_image(image_path, image_size=None):
    """Open an image, centre-crop to a multiple of 16, return a 4-D float tensor."""
    image = Image.open(image_path)
    _transforms = []
    if image_size is not None:
        image = transforms.Resize(image_size)(image)
        # _transforms.append(transforms.Resize(image_size))
    w, h = image.size
    _transforms.append(transforms.CenterCrop((h // 16 * 16, w // 16 * 16)))
    _transforms.append(transforms.ToTensor())
    transform = transforms.Compose(_transforms)
    result = transform(image)[:3].unsqueeze(0)  # keep RGB channels, add a batch dimension
    return result
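Hypothetical usage, writing the tensor straight back out:
tensor = open_image('/tmp/input.jpg', image_size=512)
save_image(tensor, '/tmp/out.png')  # save_image is imported above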
- concatenate images into one:
import cv2
import numpy as np

def cat_images(fname, image_names):
    images = []
    max_width = 0     # max width over all the images
    total_height = 0  # total height (images are stacked vertically)
    for name in image_names:
        # open every image and record its size
        images.append(cv2.imread(name))
        if images[-1].shape[1] > max_width:
            max_width = images[-1].shape[1]
        total_height += images[-1].shape[0]

    def cat_arrays(total_height, max_width, arrays):
        # allocate an array large enough to contain all the images
        final_array = np.zeros((total_height, max_width, 3), dtype=np.uint8)
        current_y = 0  # where the previous image ended in the y direction
        for array in arrays:
            # paste the image and advance the y coordinate
            final_array[current_y:array.shape[0] + current_y, :array.shape[1], :] = array
            current_y += array.shape[0]
        return final_array

    final_image = cat_arrays(total_height, max_width, images)
    cv2.imwrite(fname, final_image)

cat_images(fname, image_names)
- overlay visualize:
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
from PIL import Image

def vis_overlay(original_im, seg_map):
    """Visualizes input image, segmentation map and overlay view."""
    original_im = Image.open(original_im)
    seg_map = Image.open(seg_map)
    plt.figure(figsize=(15, 5))
    grid_spec = gridspec.GridSpec(1, 4, width_ratios=[6, 6, 6, 1])

    plt.subplot(grid_spec[0])
    plt.imshow(original_im)
    plt.axis('off')
    plt.title('input image')

    plt.subplot(grid_spec[1])
    # seg_image = label_to_color_image(seg_map).astype(np.uint8)
    plt.imshow(seg_map)
    plt.axis('off')
    plt.title('result image')

    plt.subplot(grid_spec[2])
    plt.imshow(original_im)
    plt.imshow(seg_map, alpha=0.7)  # overlay the segmentation at 70% opacity
    plt.axis('off')
    plt.title('overlay image')
    plt.show()
33 NLP
33.1 Keywords
- corpus, tf-idf and LDA topics (gensim):
input_text_translations = """
The Chinese government’s top management obviously also hopes to avoid the deterioration of the Sino-US conflict. The Sino-US trade war has started. After the Sino-US trade war began, it emphasized that China has "five advantages" in the trade war. He stressed: "We must especially prevent Sino-US cooperation. Trade conflict spreads to the ideological field
"""
from gensim import corpora, models, similarities
from nltk.tokenize import word_tokenize, sent_tokenize
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

test_model = word_tokenize(input_text_translations.lower())
wordstest_model = sent_tokenize(input_text_translations)
test_model = [word_tokenize(_d.lower()) for _d in docs]  # docs: a list of document strings
dictionary = corpora.Dictionary(test_model, prune_at=2000000)
# for key in dictionary.iterkeys():
#     print key, dictionary.get(key), dictionary.dfs[key]
corpus_model = [dictionary.doc2bow(test) for test in test_model]
tfidf_model = models.TfidfModel(corpus_model)  # build a tf-idf model over the corpus
corpus_tfidf = tfidf_model[corpus_model]
d = {dictionary.get(id): value for doc in corpus_tfidf for id, value in doc}

# get the topic-word distribution
corpus = [dictionary.doc2bow(text) for text in tokenized_data]  # tokenized_data: a list of token lists
dictionary1 = corpora.Dictionary(tokenized_data)
dictionary1.filter_n_most_frequent(10)
dictionary1.filter_extremes(no_above=0.9)
filtered_words = [dictionary[y] for y in [x for x in dictionary.keys() if x not in dictionary1.keys()]]
NUM_TOPICS = 5
# build the LDA model
# note: corpus ids come from `dictionary`; id2word should normally be the same Dictionary object
lda_model = models.LdaModel(corpus=corpus, num_topics=NUM_TOPICS,
                            id2word=dictionary1, per_word_topics=True,
                            alpha='asymmetric', minimum_probability=0.0)
topic_distribution = lda_model.show_topics(num_words=50)
df_topic_word_dis = pd.DataFrame([x[1].split(' + ') for x in topic_distribution]).T

# document frequency of a word as a fraction of all documents
dictionary.dfs[dictionary.token2id["血清α羟基丁酸脱氢酶测定"]] / dictionary.num_docs