Tuesday, December 19, 2017

The TLA Problem -- Over-Engineering Three-Letter Acronyms

Here's something we can gleefully over-engineer. Because anything worth doing is worth over-engineering until it morphs into a different kind of problem.

I'm unclear on the backstory, so try not to ask "why are we doing this?" I think it has something to do with code camp and there are teenagers involved.

What we want to do is generate a pool of 200 TLA's. Seems simple, right?

The first pass may not be obvious, but this works well.

import string
import random
from typing import Iterator

DOMAIN = string.ascii_lowercase
def tlagen() -> Iterator[str]:
    while True:
        yield "".join(random.choice(DOMAIN) for _ in range(3))

tla_iter = tlagen()
for i in range(200):
    print(next(tla_iter))

This makes 200 TLA's. It's configurable, if that's important. We could change tlagen() to take an argument in case we wanted four-letter acronyms or something else.
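One way to make the length configurable is sketched here. The `size` parameter name and its default of 3 are assumptions for illustration, not part of the code above.

```python
import random
import string
from typing import Iterator

DOMAIN = string.ascii_lowercase

def tlagen(size: int = 3) -> Iterator[str]:
    """Yield an endless stream of random acronyms, each `size` letters long."""
    while True:
        yield "".join(random.choice(DOMAIN) for _ in range(size))

# Four-letter acronyms instead of three.
four_iter = tlagen(4)
sample = [next(four_iter) for _ in range(5)]
```

The default keeps the three-letter behavior, so existing callers of tlagen() would not need to change.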

However. This will generate words like "cum" and "ass". We need a forbidden word list and want to filter out some words. Also, there's no uniqueness guarantee.

The Forbidden Word Filter

Here's a simple forbidden word filter. Configure FORBIDDEN with a set of words you'd like to exclude. Maybe exclude "sex" and "die", too. Depends on what age the teenagers are.

def acceptable(word: str) -> bool:
    return word not in FORBIDDEN

Here's another handy thing. Rather than repeat the "loop until" logic, we can encapsulate it into a function.

from typing import TypeVar, Iterator
T_ = TypeVar("T_")
def until(limit: int, source: Iterator[T_]) -> Iterator[T_]:
    source_iter = iter(source)
    for _ in range(limit):
        item = next(source_iter)
        yield item

Okay. That's workable. This lets us use the following:

list(until(200, filter(acceptable, tlagen())))

Read this from inside to outside. First, generate a sequence of TLA's. Apply a filter to that generator. Apply the "until" counter to that filtered sequence of TLA's. Create a list with 200 items. Nice.
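For comparison, the standard library's itertools.islice() can play the same role as until(). This is only a sketch; the contents of FORBIDDEN here are placeholder assumptions.

```python
import random
import string
from itertools import islice
from typing import Iterator

DOMAIN = string.ascii_lowercase
FORBIDDEN = {"sex", "die"}  # Assumed placeholder list; configure to taste.

def tlagen() -> Iterator[str]:
    while True:
        yield "".join(random.choice(DOMAIN) for _ in range(3))

def acceptable(word: str) -> bool:
    return word not in FORBIDDEN

# islice() stops the infinite, filtered stream after 200 items.
pool = list(islice(filter(acceptable, tlagen()), 200))
```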

The TypeVar gives us a flexible binding. The input is an iterator over things and the output will be an iterator over the same kinds of things. This formalizes a common understanding of how iterators work. The mypy tool can confirm that the code meets the claims in the type hint.

There are two sketchy parts about this. First the remote possibility of duplicates. Nothing precludes duplicates. And we're doing a bunch of string hash computations. To avoid duplicates, we need a growing cache of already-provided words. Or perhaps we need to build a set until it's the right size. Rather than compute hashes of strings, can we work with the numeric representation directly?
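The "build a set until it's the right size" idea might look something like this. The build_pool() name and the FORBIDDEN contents are assumptions made for the sketch.

```python
import random
import string
from typing import Iterator, Set

DOMAIN = string.ascii_lowercase
FORBIDDEN = {"sex", "die"}  # Assumed placeholder list.

def tlagen() -> Iterator[str]:
    while True:
        yield "".join(random.choice(DOMAIN) for _ in range(3))

def build_pool(size: int = 200) -> Set[str]:
    """Collect acceptable TLA's until the set holds `size` distinct words."""
    pool: Set[str] = set()
    source = filter(lambda w: w not in FORBIDDEN, tlagen())
    while len(pool) < size:
        pool.add(next(source))  # Adding a duplicate leaves the set unchanged.
    return pool

pool = build_pool(200)
```

The set does the duplicate check for us, at the cost of still hashing every string.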

Numeric TLA's

There are only a few TLA's.
$$26^{3} = 17576 $$
Of these, perhaps four are forbidden. We can easily convert a number back to a word and work with a finite domain of integers. Here's a function to turn an integer into a TLA.

def intword(number: int) -> str:
    def digit_iter(number: int) -> Iterator[int]:
        for i in range(3):
            number, digit = divmod(number, 26)
            yield digit
    return "".join(DOMAIN[d] for d in digit_iter(number))

This will iterate over the three digits using a simple divide-and-remainder process to extract the digits from an integer. The digits are turned into letters and we can build the TLA from the number.

What are the numeric identities of the forbidden words?

from typing import Iterable

def polynomial(base: int, coefficients: Iterable[int]) -> int:
    return sum(c*base**i for i, c in enumerate(coefficients))

def charnum(char: str) -> int:
    return DOMAIN.index(char)
def wordint(word: str) -> int:
    return polynomial(26, map(charnum, word))

We convert a word to a number by mapping individual characters to numbers, then computing a polynomial in base 26. And yes, the implementation of polynomial() is inefficient because it uses the ** operator instead of folding in a multiply-and-add operation among the terms of the polynomial.
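The multiply-and-add alternative (Horner's method) can be sketched as a fold. The polynomial_horner() name is made up for this example, and it assumes the coefficients arrive as a sequence with the lowest-order term first, because reversed() can't work on a map object.

```python
from functools import reduce
from typing import Sequence

def polynomial_horner(base: int, coefficients: Sequence[int]) -> int:
    """Evaluate sum(c * base**i), lowest-order coefficient first."""
    # Fold from the highest-order term down: acc = acc*base + c.
    return reduce(lambda acc, c: acc * base + c, reversed(coefficients), 0)

# Digits for "abc" with a=0, b=1, c=2: 0 + 1*26 + 2*26**2.
value = polynomial_horner(26, [0, 1, 2])
```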

Here's another way to handle the creation of TLA's.

FORBIDDEN_I = set(map(wordint, FORBIDDEN))

def tla_sample():
    subset = list(set(range(0, 26**3)) - FORBIDDEN_I)
    random.shuffle(subset)
    return list(map(intword, subset[:200]))

This is cool. We create a set of numeric codes for all TLA's, then remove the few forbidden numbers from that set. What's left is the entire domain of permissible TLA's. All of them. Shuffle and pick the first 200.

It guarantees no duplicates. This has a lot of advantages because it's simple code.

This, however, takes a surprisingly long time: almost 17 milliseconds on my laptop.

Numeric Filtering

Let's combine the numeric approach with the original idea of generating as few items as we can get away with, but also checking for duplicates.

First, we need to generate the TLA numbers instead of strings. Here's a sequence of random numbers that is confined to the TLA domain.

def tlaigen() -> Iterator[int]:
    while True:
        yield random.randrange(26**3)

Now, we need to pass unique items, and reject duplicate items. This requires a cache that grows. We can use a simple set. Although, a bit-mask with 17,576 bits might be more useful.

def unique(source: Iterator[T_]) -> Iterator[T_]:
    cache = set()
    for item in source:
        if item in cache:
            continue  # Seen before: reject the duplicate.
        cache.add(item)
        yield item

This uses an ever-growing cache to locate unique items. This will tend to slow slightly based on memory management for the set. My vague understanding is the implementation will double the size when hash collisions start occurring, leading to a kind of log2 slowdown factor as the set grows.
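The bit-mask alternative mentioned earlier could look something like this. It's a sketch: the unique_bits() name is invented here, and it assumes every item is an integer in range(size).

```python
from typing import Iterable, Iterator

def unique_bits(source: Iterable[int], size: int = 26**3) -> Iterator[int]:
    """Pass each integer in range(size) once, tracking seen values in a bit-mask."""
    seen = bytearray((size + 7) // 8)  # One bit per possible value.
    for item in source:
        byte, bit = divmod(item, 8)
        mask = 1 << bit
        if seen[byte] & mask:
            continue  # Already yielded this value.
        seen[byte] |= mask
        yield item

result = list(unique_bits([5, 17575, 5, 0, 17575, 42]))
```

The mask is a fixed 2,197 bytes for the TLA domain, so there's no growth or rehashing to think about.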

The final generator looks like this:

list(until(200, unique(filter(lambda w: w not in FORBIDDEN_I, tlaigen()))))

Reading from inner to outer, we have a generator which will produce numbers in the TLA range. The few forbidden numbers are excluded. The cache is checked for uniqueness. Finally, the generator stops after yielding 200 items.


Of course, we're using timeit to determine the overall impact of all of this engineering. We're only doing 1,000 iterations, not the default of 1,000,000 iterations.

The original version: 0.94 seconds.

The improved number-based version: 0.38 seconds.
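A measurement along these lines can be sketched with timeit. The sample_200() function is a hypothetical stand-in for whichever pipeline is being measured, and the numbers it produces will differ from the ones reported here.

```python
import random
import timeit
from typing import Iterator, List

def tlaigen() -> Iterator[int]:
    while True:
        yield random.randrange(26**3)

def sample_200() -> List[int]:
    """Hypothetical stand-in for the pipeline being measured."""
    source = tlaigen()
    return [next(source) for _ in range(200)]

# number=1000 overrides timeit's default of 1,000,000 executions.
elapsed = timeit.timeit(sample_200, number=1000)
print("{:.2f} seconds for 1,000 runs".format(elapsed))
```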

So there.  Want to generate values from a limited domain of strings? Encode things as numbers and work with the numeric representation. Much faster.

Saturday, December 16, 2017

Ordered Keys in Dictionaries

Raymond Hettinger (@raymondh)
#python news: 😀 @gvanrossum just pronounced that dicts are now guaranteed to retain insertion order. This is the end of a long journey.

Tuesday, December 12, 2017

The Business of Book Promotions

It is hard (for me) to promote my books. It seems like empty vanity. I realize that it's not -- promotion is essential -- but it's difficult.

Packt just sent a raft of detailed information for authors. Some things they suggest I do.

  • ✅Referrals. Want a "free" Python book? The "free" is in scare quotes because you'll have to write a review. Otherwise, the book is yours. DM me @s_lott and I'll put you on the list.
  • ✅Amazon Author Profile: https://www.amazon.com/Steven-Lott/e/B00HNRSLEK Check.
  • Packt Blog Posts. Interesting. I'll have to look into that. I think they can follow the RSS feed from here. If they can, check, this is done.
  • ➡️Other on-line PR. Cool. I give them the blogs I follow. Nice. I can do that.
  • ✅Social Media. I (sort-of) link these blog posts to my Twitter feed through https://dlvrit.com. But I'm half-hearted about it. Packt suggests including @PacktAuthors so that there's a proper Twitter tie-in. That seems like there's no work to that. Also, there's a Packt Experts Author Community on Facebook.
  • ✅Blog. Got it. You're reading this. And, also, here: https://medium.com/@s_lott.
  • ➡️Author Newsletter. This seems like next-level blogging with more serious editorial planning. Interesting.
  • 🆒Packt Promotions. I like this: they do the marketing. I just have to cooperate by sharing the information.
  • 🆒Virtual Book Release Party. Wait. This could be cool. Functional Python Programming 2e is coming next summer. Hmmm. This could be fun. Banners and raffles for freebies or discounts.
  • 🚻Author Street Team. Mutual Support. Advanced Copies. I would need to keep it organized, right? That's (potentially) a chunk of work. I'll need to contemplate that.
  • ✅Conferences. This is fun. For one of my PyCon trips, I got a promo code from Packt for free content during the conference. That was handy to give out. I went to Vistaprint and had a box of cards printed with contact info and the promotion code. 
  • ➡️Packt Live on-line conferences. I've done a few webinars. They're difficult. Writing is easier because it unfolds more slowly. I'll have to look into doing a few more of these.
  • ✅LinkedIn Profile. https://www.linkedin.com/in/steven-lott-029835/ 

I've hit a few. A few more to do.

This is a pretty comprehensive list. It's good to see this kind of author support.

Also. They've done some serious re-engineering on their Author toolchain. MS-Word documents seem to be a thing of the past.

Tuesday, December 5, 2017

Functional Requirements and Use Cases -- even for "simple" things

In the mailbag I found this nonsense, doomed to inevitable failure:
"As I get more serious about this data science stuff, it has become obvious that a windows machine is not the way to go. ...
Q1: What other things should I think about and consider while shopping for a new computer?
Q2: Are there issues w/ running VmWare and Windows 7 w/in VmWare on Ubuntu?"
I've omitted many, many words (400 or so.)

Here are all of the functional requirements I could discern:
  • I would like to have 1 machine. I don't want a desktop and a laptop
  • Install VmWare 
  • Install Windows 7 using VmWare
This was all of the functional requirements. The other 400 words involved specifications. Nothing that approaches a use case other than singular, VMWare, and Windows under VMWare. The form factor of laptop, which seems to go without saying, might be a user story, but that's pushing it.

The "a windows machine is not the way to go" and "Install Windows 7" indicate a fairly serious problem. It is not the way to go and it's required. Both. This is doomed to inevitable failure. 

This is not the way to make a decision.

Q1. What other things should I think about? 

Just about every other thing. Start with use cases and functional requirements. Skip over specifications. (In general, never start with specifications because that's where you end: a list of useless numbers that don't bracket what you actually want to do.)

Use Cases Matter. Specifications Don't Matter.

Write down all the Mbs and Tbs you want. Without a use case, they're irrelevant noisy details. Throw the numbers away until you have a list of verbs. Things you will DO. 

With so few actual functional requirements, almost *any* computer (possibly including a Raspberry Pi 3) would pass the suite of acceptance test cases.

✅ One Machine.
✅ VMWare.
✅ Windows.

After a lot more back-and-forth, I discerned one (or maybe two) additional functional requirement(s).
  • I have leo w/ java to gen html.
I know what Leo is in this context. I'm guessing the "java to gen html" is JRst. The lack of clarity is, of course, part of the problem here.

This requirement surfaced in the context of explaining to me why Windows was so important. Really. Windows was required to run two open-source apps. And. "a windows machine is not the way to go." Doomed. To. Inevitable. Failure.

Here's the only relevant functional requirement(s): run Leo and Java. And even then, there's a huge hole in this. Leo is Python-based. Docutils RST2HTML is Python-based. Why not simply use Leo and Python?  What does Java have to do with anything?

Buy this: a Pi-top: https://www.sparkfun.com/products/13896

Q2. Are there issues w/ running...? 

Yes. Always. For everything you can possibly enumerate there are "issues". 

There. Are. Always. Issues.

Use Cases Matter.

Since you don't have any functional requirements or use cases, it's impossible to filter the issues and see if any of the known issues impact what you think you're going to do.

From what I was told, a Pi-top covers everything that's required. It's hard to be sure, of course, when the functional requirements are so vague. But there's no evidence that the Pi-top can't work to fill all of the stated functional requirements.

What To Do Next

It seems obvious, but the next step is to create a test plan. Actually, that was the first step. Since it wasn't done first, now it's the next step.

Write down the things you want to do. Make a list. Ideally a long list of things you will DO. Active voice. Verbs. Actions. Tasks. Activities. It's hard to emphasize this enough.

Then, when considering a computer, see if it can actually do those things. Test it against the requirements to see if it does what it's supposed to do. Among all the machines that pass the tests, you can then sort by price. (Or availability, or reputation, or cool stickers, whatever non-functional requirements seem relevant.)

The questions of Tb and Mb and processor clock speed mean nothing. Nothing. Find the cheapest (smallest) machine that does what you want. Don't find the machine with xMb and yTb of whatever.

Then there's this, "As I get more serious about this data science stuff" which seems little more than context. But it's really important. Indeed, it's essential.

If you're going to do machine learning, you don't really want to buy the necessary computer. You want to rent it for the hour or so each day you actually need it. It will be idle 23/24 hours each day (96% of the time.) Why buy that much horsepower which you are never going to use?

If you're going to log in to a server you purchased from a cloud computing vendor (Amazon AWS, Microsoft Azure, etc.), then you can probably get by with a tablet that runs SSH and a browser. A tablet with a cool keyboard and a little display rack can be very nice. https://panic.com/prompt/ and https://www.termius.com seem to be all that's required.

Without Use Cases, however, it's impossible to select a computer. Don't spend money without test cases.

Tuesday, November 14, 2017

CI/CD DevOps and Python

See https://www.slideshare.net/ITRevolution/does-sfo-2016-topo-pal-devops-at-capital-one for the 16 gates that separate a good idea from secure, productive use of software. While a lot of DevOps folks like the idea, when it comes to implementing it for Python apps, they get confused.

The confusion seems to stem from Python's lack of a proper "build" step in the CI/CD pipeline. I've had the "everything involves a build" argument and the "well setup.py is analogous to a build" arguments. I have to acquiesce to these positions as part of making progress. In this case, reasoning by analogy can be misleading.

I want to focus on the two gates that are part of the code itself, separate from the rest of the pipeline.

  • Static Analysis 
  • >80% Code Coverage (which implies Unit Tests)

Unit Testing

My preference is to run the unit test suite first and get that out of the way. If the unit test suite fails, or fails to cover 80% of the code, any other considerations are moot.

I like Git triggers based on Pull Requests (PR's) and Merge to Master for checking these two conditions. I like the idea that a PR can't be discussed until unit tests pass. They can also be part of whatever other pipeline is going on, but I like them to be done early and often.

(I worked on a sprint team where the PR unit test wasn't trusted by one of the devs: he'd carefully check out the branch and rerun the unit tests. His comments were good, so the extra effort paid off. I guess.)

After flirting with a lot of frameworks, I'm happiest with py.test. I like the pytest-cov plug-in and the pytest-bdd plug-in.

Yes. We have acceptance tests for our features written in Gherkin. And we have pytest fixtures that are used by pytest-bdd to process the scenarios in the Gherkin feature files. It actually works out nicely because we have a cucumber.json file that makes everyone happy that we've run an acceptance test suite along with our unit test suite.

What's important is the coverage report is painless and automatic.

And it's compatible with the Ruby-based cucumber tool without involving any actual ruby.

For integration testing, we use Behave. This is a bit more cumbersome than pytest-bdd, but it's appropriate for the bigger-picture testing where we have a docker cluster and have to see a number of "Then" steps to confirm operations spread across a suite of microservices.

The goofy question that often leads to endless confusion is the relationship between unit testing and "build." The setup.py setup definition includes a `tests_require` parameter. This *should be* all that's needed to do `python setup.py test`, which *should be* all that's involved in testing.

Is it a "build"? No. But. You can tell the DevOps folks it's a build if it makes them happy.

Static Analysis

There are several kinds of static analysis. Folks who work in Java are used to having Sonar analysis performed. This is above and beyond the static analysis already performed by the compiler. It seems excessive to me, but folks deploying Java seem to like it.

For Python, there are two important static analysis tools: pylint and mypy. And this is another source of profound confusion for DevOps folks new to Python.

I like to extract the last line of the pylint output and use that numeric score as the "bottom line" on static analysis part 1. While the default setting is 9.5, that can be a challenge, and we prefer 9.0 as well as some local pylintrc modifications to modify some checkers (e.g., set line length to 120.)
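Pulling the score out of the report can be sketched with a regular expression. The pylint_score() name is invented for this example, and the sample summary line is an assumption about the output format, which may vary between pylint versions.

```python
import re
from typing import Optional

# Matches the numeric score in a summary like "... rated at 9.13/10 ...".
SCORE_PAT = re.compile(r"rated at (-?\d+(?:\.\d+)?)/10")

def pylint_score(report: str) -> Optional[float]:
    """Return the score from pylint's summary line, or None if absent."""
    match = SCORE_PAT.search(report)
    return float(match.group(1)) if match else None

# Assumed sample of the summary text.
sample = "Your code has been rated at 9.13/10 (previous run: 9.00/10, +0.13)"
passed = (pylint_score(sample) or 0.0) >= 9.0
```

A missing score is treated as a failure, which seems like the safer default for a gate.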

For mypy, it's a little bit more complex. We're still fumbling around here.

Ideally, the type hints are all clean and mypy has nothing to say. We can, of course, fix any errors by claiming everything uses Any and returns Any and every assignment statement sets an Any value. But that's so wrong.

There are (still) modules which require typeshed stub definitions. Ideally, we'd provide these. This would be better than using Any as a hack-around. While good, it's a lot of work.

For now, I think it's sensible to have two "pass" rules for mypy: clean or typeshed error. If mypy is silent, that's perfect. If mypy can't find stubs in typeshed, we can let this go for now and log an issue from the CI/CD pipeline to note the presence of technical debt.

In the best of all worlds, we'd fork the package, fix the type hints, and put in a PR. That's a lot better than using typeshed to work around the lack of hints. 

And, of course, there's the "build" question. For mypy to work, the dependencies (or their typeshed stubs) must be present. We wind up doing a `python setup.py install` to build out the requirements. Is this a "build"? Maybe. You can tell the DevOps folks it's a build if it makes them happy.

If you want idempotent server (or container) builds, you'll need to be sure that you pin specific versions. It can help to break this into two parts:
  • a requirements.txt with specific versions 
  • a generic version-free high-level list in setup.py
The reason for this separation is to make it easy to do a `pip install` or `conda create` from the detailed requirements. Once that's out of the way, the `python setup.py` will run very quickly. If you're working with Docker containers, the `pip install` (or `conda create`) can be part of the Dockerfile, and then tests or static analysis can be run separately, after the initial wave of installations.

Tuesday, November 7, 2017

Python Type Hinting -- generally easy until you find your design flaws

Adding type hints is easy and fun. Seriously. It's not a lot of work.


Until you find a piece of code that does more than what you sort-of thought it kind-of did.

def null_aware_func(x):
    if x is None:
        return x
    return 2.2*x**1.05

This is a stab at a none-aware computation.

Let's add type hints, shall we?

def null_aware_func(x: float) -> float:
    if x is None:
        return None
    return 2.2*x**1.05

This won't fool mypy. Sigh. It passes unit tests, but it's flagged as a problem.

We have a variety of ways to define this function. And that means we need to think carefully about our None-aware design.

Is this really an @overload?

from typing import overload
@overload
def null_aware_func(x: None) -> None: ...
@overload
def null_aware_func(x: float) -> float: ...
def null_aware_func(x):
    if x is None:
        return None
    return 2.2*x**1.05

And yes, the ... is legit Python syntax. (It's a rarely used token that forms the body of the function.)

Or is this a more advanced type?

from typing import Optional
OptFloat = Optional[float]

def null_aware_func(x: OptFloat) -> OptFloat:
    if x is None:
        return None
    return 2.2*x**1.05

I'd argue that OptFloat is a more sensible definition. However, if this is the only function that's none-aware, perhaps it's an overload.

The deeper question is one of underlying meaning. Why are we doing this? What does it mean?

And. Bonus. Will this be working in a SQLAlchemy environment, where they have their own wrappers for database objects, meaning that `is None` doesn't work and `== None` is required?

What's important is that adding type hints forced us to think about what we were doing. Unlike Java we did this without stopping progress for an extended period of "wrestling with the compiler". We can use Any temporarily because the unit tests all pass. Then, we can pay down the technical debt by fixing the type declaration.

Total. Victory.

Tuesday, October 31, 2017

Some Reading

Higher-Order Functions. A really cool idea. Javascript isn't my favorite language.


This, on the other hand, is huge: trunk-based development.


I'm really tired of having a dev branch with periodic commits to master so we can deploy from master. It's so much nicer to tag a release and deploy that.

Here's the top 10 Python list for September 2017.


Learning Python: Zero to Hero: https://medium.freecodecamp.org/learning-python-from-zero-to-hero-120ea540b567

Why we switched from Python to Go: https://codeburst.io/why-we-switched-from-python-to-go-60c8fd2cb9a9. We distribute a CLI for our API in Go because it's slightly simpler to provide executables in Go. However, our user community has Python already... We need to provide similar functionality in Python.

Technical Debt: https://medium.freecodecamp.org/what-is-technical-debt-and-why-do-most-startups-have-it-9a54458daabf

Software Engineering v. Programming. https://medium.com/@samerbuna/software-engineering-is-different-from-programming-b108c135af26

Annotations in Java? Ugh. The level of complexity seems to have gotten out of control. https://blog.softwaremill.com/the-case-against-annotations-4b2fb170ed67

Programming vs. the text of the code itself. https://medium.com/@karolismasiulis/programming-is-not-about-text-c205ba6aa3ba I don't buy the reductionist 1-dimensional view of text. Yes -- in a narrow, technical sense, it's one-dimensional. If you want to be really reductionist, it's a stream of bytes, which are really just a base-256 number. I don't think the reductionist argument is helpful. However, I am sure that the declarative/imperative distinction is worthy of a lot of thought, and this is a nice comparison. (In spite of the javascript.)

Tuesday, October 24, 2017

Programming by Search, Copy, and Paste Leads to Epic Fail

In a way, this is about an epic fail attempting copy-and-paste coding. But really, this is about thinking outside the box. The issue -- to me -- comes from failing to see the box. Here's the body of the email, edited slightly.
"...how determine when a file has completed downloading. It would be helpful if code snippets in a unix shell and Python. 
"I did Google but none seemed to address the fundamental race conditions. They all involve a variant of try, sleep and try again. This is problematic for my particular case because the file sizes very significantly."
I'll ignore the grammar problems and focus on the intent of the "I did Google..." part. Based on some personal knowledge, I doubt there was more than a single search string tried. And I doubt that more than a single page of the response was looked at. Those are not important concerns.

The important concern is the shocking vagueness of the problem statement. These words are almost entirely meaningless:
"a file has completed downloading"
Imagine the variety of possible file transfer protocols that could be involved, and how many of them can be properly scripted. Take all the time you want. It can help to make a list of all the protocols that make this a non-problem.

No protocol was named. Therefore, a protocol was assumed. And the presence of this kind of tacit assumption forms an implicit box restricting what they're doing. The restriction is so unyielding to them that they don't even need to mention it. It's as essential to them as air. They need it, but cannot see it, and refuse to acknowledge it.

At this point, all we can do is make random guesses.

("Why didn't you ask them for clarification?" you ask.  Good point. It's a personal failure in this case. The back-and-forth would take days. Eventually, they would send me useless explanations of deep ineptitude or a need to engage in corporate politics. Or both. I'll admit that I'm a jerk about requiring folks to take a first step and make a stab at code. Without code, I find it largely impossible to determine what they're really talking about. The above question is a prime example of a disconnection from reality that's too exasperating to deal with except superficially.)

Identifying the Box

Guess #1. This may be about FTP (or SFTP) file transfers. Further, it may involve FTP file uploads to a server, where the client doesn't disclose a size. Yes, the word "downloading" seems to preclude this guess, but almost all other choices aren't even possible.

If it really was a client side download, this is trivially automated using any of the available FTP client programs, including wget, curl, sftp, etc. The Python ftplib seems to be a fully automated client for FTP. The documentation is packed with examples. It seems unlikely that the question is actually client-side.

It's also possible that a single search failed to reveal all these automatable FTP clients.

Guess #2. "determine when"? Who actually cares when the upload finishes? An upload matters to the next client doing a download, or -- perhaps -- to a process that's supposed to consume the uploaded file. Is that what this is about?

Is the real question "how to trigger processing of an uploaded file when using FTP?"

In this case, we're left with stacks of follow-up questions. Primarily: "Why are you using FTP?"

If they replace their silly FTP (or SFTP) server with a RESTful API, they won't have these problems. It takes a few days to write a secure file-upload Flask container. With a swagger spec. And unit tests. And Gherkin feature definitions, and a behave test suite to be sure it *really* works.  It doesn't need very many routes. On completion of upload, it can fork off subprocesses to process the uploaded files. This is not hard. Really. Flask + Celery will do this.

Understanding the problem seems to require stepping outside of some box. It appears this is a struggle because of a poorly-defined box: a box assumed without being stated.

Working With the Box

At this point, we can only pretend the problem is about triggering processing after an upload. Let's further pretend the FTP is a legal requirement. Or we can pretend that SFTP is imposed by an inept IT department who also loves living inside some poorly-defined box. We're stuck with FTP for inexplicable reasons.

What can we do to game an FTP server to trigger processing of files of unknown sizes?

  • Write our own FTP server. This isn't very hard. It is, however, far simpler to write a RESTful Flask service that handles the file upload as a POST request via curl or wget. Writing an FTP server's a pain in the ass because the FTP protocol is surprisingly complex. Even writing an FTP subset that serves very specific client needs can be painful.
  • Poll the upload directory. This implies a race condition. Polling (and the race condition) have no practical consequences. If you want "real-time", write a RESTful API and don't use FTP. Since you're insisting on FTP, a delay is going to be part of the solution.
I'm more than a little shocked that search was considered as a viable design strategy for solving this problem. It doesn't seem like searching for solutions is required at all. I'm probably overstating this, but it seems sort of trivial and obvious that either a second file is required or a better file protocol is required. This seems to be simple "thinking" not "googling."

There are bunches of ways to approach this. Here are a few ways to use a second file and some kind of naming convention to show that two files are part of one transfer.
  • Send a file with the size of the target file *before* the target file. When the target file matches the stated size, initiate processing.
  • Send a file with the size and MD5 checksum of the target file. etc.
  • Send a file *after* the target file with the size and checksum. When this file shows up, simply confirm that the first file is all there.
Yes, polling is required. However, there's no race condition: there are two separate conditions which must both be met. The files are provided serially, the conditions are met serially.
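The second-file idea can be sketched as a check that a polling loop would call. Everything named here is hypothetical: the transfer_complete() function, the ".manifest" naming convention, and the one-line "size md5" manifest format are assumptions, not an established protocol.

```python
import hashlib
import tempfile
from pathlib import Path

def transfer_complete(target: Path, manifest: Path) -> bool:
    """True when target matches the size and MD5 recorded in manifest.

    Assumed manifest format: a single line, "<size> <md5-hex>".
    """
    if not (target.exists() and manifest.exists()):
        return False
    fields = manifest.read_text().split()
    if len(fields) != 2:
        return False  # The manifest itself may still be arriving.
    size_text, expected_md5 = fields
    if not size_text.isdigit() or target.stat().st_size != int(size_text):
        return False  # Target is still arriving (or is corrupt).
    return hashlib.md5(target.read_bytes()).hexdigest() == expected_md5

# Demonstration with a partial and then a finished "upload".
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data.csv"
    manifest = Path(tmp) / "data.csv.manifest"
    payload = b"a,b\n1,2\n"
    manifest.write_text(
        "{} {}".format(len(payload), hashlib.md5(payload).hexdigest()))
    target.write_bytes(payload[:4])   # Only part of the file has arrived.
    partial = transfer_complete(target, manifest)
    target.write_bytes(payload)       # The upload has finished.
    complete = transfer_complete(target, manifest)
```

The size comparison is cheap, so the checksum is only computed once the size already matches.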

Here are two approaches that use a file format that properly handles completeness.
  • Gzip the file. The file receipt polling loop repeatedly tries to unzip it. If the unzip fails, the file is incomplete. 
    • Don't want to spend too much CPU time? Wait until the size has been stable for two polling intervals and then try to unzip then.
  • Tar the file. Yes. A tar archive of a single file can be checked for integrity. When the archive can be checked and shown to be valid, the target element can be extracted and processed.
    • Don't want to spend CPU time validating? Again. Wait for a stable size for a few polling intervals.
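The gzip variant might be sketched like this. The gzip_complete() name is made up, and the polling loop around it (including any stable-size wait) is left out.

```python
import gzip
import tempfile
import zlib
from pathlib import Path

def gzip_complete(path: Path) -> bool:
    """True when the whole gzip stream decompresses cleanly."""
    if path.stat().st_size == 0:
        return False  # Nothing has arrived yet.
    try:
        with gzip.open(path, "rb") as stream:
            while stream.read(64 * 1024):
                pass  # Discard the data; we only care that reading succeeds.
        return True
    except (EOFError, OSError, zlib.error):
        return False  # Truncated or corrupt: treat it as still arriving.

# Demonstration with half of an archive, then all of it.
with tempfile.TemporaryDirectory() as tmp:
    whole = gzip.compress(b"speed,heading\n" * 1000)
    path = Path(tmp) / "upload.csv.gz"
    path.write_bytes(whole[: len(whole) // 2])
    partial_ok = gzip_complete(path)
    path.write_bytes(whole)
    whole_ok = gzip_complete(path)
```

Reading in chunks keeps memory flat even when the file sizes vary significantly.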
And, of course, it's possible to invent an entirely home-brewed file-wrapping protocol. Here's an approach.
  • Wrap the content in MIME-style headers. These can provide a size or a terminator string to help identify the end of the transfer.
The point here is that googling for code isn't part of solving this problem. Indeed, it can't solve this problem. Merely thinking about the nature of the problem ("triggering processing", "knowing the size") seemed necessary and sufficient to frame a solution.

What's Essential

Here's what didn't happen:
  • State the actual problem. 
  • Identify the boxes. Write them down. In words. There may be more than one.
  • Locate code to work with the boxes. Find the libraries or packages. Install them. Write a hello world example to be sure that the code is understood.
Then -- and only then -- can we start to imagine solutions and ask questions about the boxes or the code that might manage the boxes.

It's impossible to state this strongly enough: We can't think outside the box if we refuse to acknowledge the box.

Tuesday, October 17, 2017

Why I like Functional Composition

After spending years developing a level of mastery over Object Oriented Design Patterns, I'm having a lot of fun understanding Functional Design Patterns.

The OO Design Patterns are helpful because they're concrete expressions of the S. O. L. I. D. design principles. Much of the "Gang of Four" book demonstrates the Interface Segregation, Dependency Inversion, and Liskov Substitution Principles nicely. They point the way for implementing the Open/Closed and the Single Responsibility Principles.

For Functional Programming, there are some equivalent ideas, with distinct implementation techniques. The basic I, D, L, and S principles apply, but have a different look in a functional programming context. The Open/Closed principle takes on a radically different look, because it turns into an exercise in Functional Composition.

I'm building an Arduino device that collects GPS data. (The context for this device is the subject of many posts coming in the future.)

GPS devices generally follow the NMEA 0183 protocol, and transmit their data as sentences with various kinds of formats. In particular, the GPRMC and GPVTG sentences contain speed over ground (SOG) data.

I've been collecting data in my apartment. And it's odd-looking. I've also collected data on my boat, and it doesn't seem to look quite so odd. Here's the analysis I used to make a more concrete conclusion.

import csv
import statistics
from pathlib import Path

def sog_study(source_path = Path("gps_data_gsa.csv")):
    with source_path.open() as source_file:
        rdr = csv.DictReader(source_file)
        sog_seq = list(map(float, filter(None, (row['SOG'] for row in rdr))))
        print("max {}\tMean {}\tStdev {}".format(
            max(sog_seq), statistics.mean(sog_seq), statistics.stdev(sog_seq)))

This is a small example of functional composition to build a sequence of SOG reports for analysis.

This code opens a CSV file with data extracted from the Arduino. There was some reformatting and normalizing done in a separate process: this resulted in a file in a format suitable for the processing shown above.

The compositional part of this is the list(map(float, filter(None, generator))) processing.

The (row['SOG'] for row in rdr) generator can iterate over all values from the SOG column. The filter(None, generator) will drop all falsy values from the results -- None objects and empty strings alike -- assuring that irrelevant sentences are ignored.

Given an iterable that can produce SOG values, the map(float, iterable) will convert the input strings into useful numbers. The surrounding list() creates a concrete list object to support summary statistics computations.

I'm really delighted with this kind of short, focused functional programming.

"But wait," you say. "How is that anything like the SOLID OO design?"

Remember to drop the OO notions. This is functional composition, not object composition.

ISP: The built-in functions all have well-segregated interfaces. Each one does a small, isolated job.

LSP: The concept of an iterable supports the Liskov Substitution Principle: it's easy to insert additional or different processing as long as we define functions that accept iterables as an argument and yield their values or return an iterable result.

For example.

def sog_gen(csv_reader):
    for row in csv_reader:
        yield row['SOG']

We've expanded the generator expression, (row['SOG'] for row in rdr), into a function. We can now use sog_gen(rdr) instead of the generator expression. The interfaces are the same, and the two expressions enjoy Liskov Substitution.

To be really precise, annotation with type hints can clarify this. Something like sog_gen(rdr: Iterable[Dict[str, str]]) -> Iterable[str] makes the shared interface explicit.
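
Here's a minimal sketch of what the annotated version might look like. The sample data is made up for illustration; it only stands in for the real extract.

```python
import csv
import io
from typing import Dict, Iterable, Iterator


def sog_gen(csv_reader: Iterable[Dict[str, str]]) -> Iterator[str]:
    """Yield the raw SOG column value from each row."""
    for row in csv_reader:
        yield row['SOG']


# Hypothetical sample rows; the middle sentence has no SOG value.
sample = io.StringIO("sentence,SOG\nGPRMC,0.5\nGPGSA,\nGPVTG,1.5\n")
rdr = csv.DictReader(sample)

# Same composition as before, with the function replacing the generator expression.
sog_values = list(map(float, filter(None, sog_gen(rdr))))
print(sog_values)
```

Any function that accepts an iterable of dictionaries and yields strings can be dropped into the same position.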

DIP: If we want to break this down into separate assignment statements, we can see how a different function can easily be injected into the processing pipeline. We could define a higher-order function that accepted functions like sog_gen, float, statistics.mean, etc., and then created the composite expression.
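
One way this higher-order function might look is sketched below. The names pipeline, non_empty, and to_float are mine, invented for this illustration; they aren't part of any library.

```python
import statistics
from functools import reduce
from typing import Any, Callable


def pipeline(*steps: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Compose steps left to right: pipeline(f, g)(x) == g(f(x))."""
    return lambda source: reduce(lambda value, step: step(value), steps, source)


def sog_gen(rows):
    for row in rows:
        yield row['SOG']


def non_empty(values):
    # Drop falsy items such as None or empty strings.
    return filter(None, values)


def to_float(values):
    return map(float, values)


# Each step is injected; swapping statistics.mean for max changes the summary.
mean_sog = pipeline(sog_gen, non_empty, to_float, statistics.mean)
max_sog = pipeline(sog_gen, non_empty, to_float, max)

rows = [{'SOG': '0.5'}, {'SOG': ''}, {'SOG': '1.5'}]
print(mean_sog(rows), max_sog(rows))
```

The composite doesn't depend on any specific step, only on the iterable-in, iterable-out (or iterable-in, value-out) interface.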

OCP: Each of the component functions is closed to modification but open to extension. We might want to do something like this: map_float = lambda source: map(float, source). The map_float() function extends map() to include a float operation. We might even want to write something like this.  map_float = lambda xform, source: map(xform, map(float, source)). This would look more like map(), with a float operation provided automatically.

SRP: Each of the built-in functions does one thing. The overall composition builds a complex operation from simple pieces.

The composite operation has two features which are likely to change: the column name and the transformation function. Perhaps we might rename the column from 'SOG' to 'sog'; perhaps we might use decimal.Decimal() instead of float(). There are a number of less-likely changes. There might be a more complex filter rule, or perhaps a more complex transformation before computing the statistical summary. These changes would lead to a different composition of the similar underlying pieces.
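
As a sketch of how those two likely changes could be exposed, here's a version with the column name and the conversion as parameters. The function name column_values is an assumption for this example.

```python
from decimal import Decimal
from typing import Any, Callable, Dict, Iterable, List


def column_values(
    rows: Iterable[Dict[str, str]],
    column: str = 'SOG',
    convert: Callable[[str], Any] = float,
) -> List[Any]:
    """Extract one column, drop empty values, and convert what remains."""
    return list(map(convert, filter(None, (row[column] for row in rows))))


# Hypothetical rows using the renamed, lowercase column.
rows = [{'sog': '0.5'}, {'sog': ''}, {'sog': '1.25'}]
as_float = column_values(rows, column='sog')
as_decimal = column_values(rows, column='sog', convert=Decimal)
print(as_float, as_decimal)
```

The underlying pieces (the generator, filter(), map(), list()) are untouched; only the composition changed.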

Tuesday, October 10, 2017

Python Exercises


This seems very cool. These look like some pretty cool problems. They include debugging and unit testing, so there are a lot of core skills covered by these exercises.

Thursday, September 28, 2017

Learning to Code

I know folks who struggle with the core concepts of writing software.

Some of them are IT professionals. With jobs. They can't really code. It seems like they don't understand it.

Maybe a gentler introduction to programming will help?

I have my doubts. The folks who seem to struggle the hardest are really fighting against their own assumptions. They seem to make stuff up and then seek confirmation in everything they do. The idea of a falsifiable experiment seems to be utterly unknown to them. Also, because they're driven by their assumptions, the idea of exhaustively enumerating alternatives isn't something they do well, either.

For example, if you try to explain Python's use of " or ' for string literals -- a syntax not used by a language like SQL -- they will argue that Python is "wrong" based on their knowledge of SQL. Somehow they wind up with a laser-like focus on mapping Python to SQL. They'll argue that apostrophes are standard, and they'll always use those. Problem solved, right?

Or is it problem ignored? Or problem refused?

And. Why the laser-like focus on mapping among programming languages? It seems that they're missing the core concept of abstract semantics mapped to specific syntax.

Tuesday, September 26, 2017

Learning About Data Science.

I work with data scientists. I am not a scientist.

This kind of thing on scikit learn is helpful for understanding what they're trying to do and how I can help.

Tuesday, September 19, 2017

Three Unsolvable Problems in Computing

The three unsolvable problems in computing:

  • Naming
  • Distributed Cache Coherence
  • Off-By-One Errors

Let's talk about naming.

The project team decided to call the server component "FlaskAPI".


It serves information about two kinds of resources: images and running instances of images. (Yes, it's a kind of kubernetes/dockyard lite that gives us a lot of control over servers with multiple containers.)

The feature set is growing rapidly. The legacy name needs to change. As we move forward, we'll be adding more microservices. Unless they have a name that reflects the resource(s) being managed, this is rapidly going to become utterly untenable.

Indeed, the name chosen may already be untenable: the name doesn't reflect the resource, it reflects an implementation choice that is true of all the microservices. (It's a wonder they didn't call it "PythonFlaskAPI".)

See https://blogs.mulesoft.com/dev/api-dev/best-practices-for-building-apis/ for some general guidelines on API design.

These guidelines don't seem to address naming in any depth. There are a few blog posts on this, but there seem to be two extremes.

  • Details/details/details. Long paths: class-of-service/service/version-of-service/resources/resource-id kind of paths. Yes. I get it. The initial portion of the path can then route the request for us. But it requires a front-end request broker or orchestration layer to farm out the work. I'm not enamored of the version information in the path because the path isn't an ontology of the entities; it becomes something more and reveals implementation details. The orchestration is pushed down to the client. Yuck.
  • Resources/resource. I kind of like this. The versioning information can be in the Content-Type header: application/vnd.yournamehere.vx+json. I like this because the paths don't change. Only the vx in the header. But how does the client select the latest version of the service if it doesn't go in the path? Ugh. Problem not solved.
I'm not a fan of an orchestration layer. But there's this: https://medium.com/capital-one-developers/microservices-when-to-react-vs-orchestrate-c6b18308a14c  tl;dr: Orchestration is essentially unavoidable.

There are articles on choreography. See https://specify.io/concepts/microservices. The idea is that an event queue is used to choreograph among microservices. This flips orchestration around a little bit by having a more peer-to-peer relationship among services. It replaces complex orchestration with a message queue, reducing the complexity of the code.

On the one hand, orchestration is simple. The orchestrator uses the resource class and content-type version information to find the right server. It's not a lot of code.
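
To show how little code this might be, here's a rough sketch of the lookup. Everything in it is hypothetical: the registry contents, the backend URLs, and the vendor media type layout (application/vnd.yournamehere.vN+json) are assumptions for illustration, not our actual service.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical registry: (resource class, version) -> backend base URL.
ROUTES: Dict[Tuple[str, str], str] = {
    ("images", "v1"): "http://images-v1.internal",
    ("images", "v2"): "http://images-v2.internal",
    ("instances", "v1"): "http://instances-v1.internal",
}

# Assumed media type layout; the vendor name is a placeholder.
VERSION_PAT = re.compile(
    r"application/vnd\.yournamehere\.(?P<version>v\d+)\+json"
)


def route(path: str, content_type: str) -> Optional[str]:
    """Pick a backend from the first path segment and the header version."""
    match = VERSION_PAT.fullmatch(content_type.strip())
    if match is None:
        return None
    resource = path.strip("/").split("/")[0]
    return ROUTES.get((resource, match.group("version")))


print(route("/images/42", "application/vnd.yournamehere.v2+json"))
```

A real orchestrator would also forward the request and handle failures; this only covers the "find the right server" step.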

On the other hand, orchestration is overhead. Each request passes through two services to get something done. The pace of change is slow. HATEOAS suggests that a "configuration" or "service discovery" service (with etags to support caching and warning of out-of-date cache) might be a better choice. Clients can make a configuration request, and if the cache is still valid, they can then make the real working request.

The client-side overhead is a burden that is -- perhaps -- a bad idea. It has the potential to make the clients very complex. It can work if we're going to provide a sophisticated client library. It can't work if we're expecting developers to make RESTful API requests to get useful results. Who wants to make the extra meta-request all the time?

Tuesday, September 12, 2017

The No Code Approach to Software and Why It Might Be Bad

Start here: https://www.forbes.com/sites/jasonbloomberg/2017/07/20/the-low-codeno-code-movement-more-disruptive-than-you-realize/#98cfc4a722a3

I'm not impressed. I have been not impressed for 40 years and many previous incarnations of this idea of replacing code with UX.

Of course, I'm biased. I create code. Tools that remove the need to create code reflect a threat.

Not really, but my comments can be seen that way.

Here's why no code is bad.

Software Captures Knowledge

If we're going to represent knowledge in the form of software, then, we need to have some transparency so that we can see the entire stack of abstractions. Yes, it's turtles all the way down, but some of those abstractions are important, and other abstractions can be taken as "well known" and "details don't matter."

The C libraries that support the CPython implementation, for example, are where the turtles cease to matter (for many people.) Many of us have built a degree of trust and don't need to know how the libraries are implemented or how the hardware works, or what a transistor is, or what electricity is, or why electrons even have a mass or how mass is imparted by the Higgs boson.

A clever UI that removes (or reduces) code makes the abstractions opaque. We can't see past the UI. The software is no longer capturing useful knowledge. Instead, the software is some kind of interpreter, working on a data structure that represents the state of the UI buttons.

Instead of software describing the problem and the problem's state changes, the software is describing a user experience and those state changes.

I need the data structure, the current values as selected by the user, and the software to understand the captured knowledge. 

Perhaps the depiction of the UI will help. 

Perhaps it won't. 

In general, a picture of the UI is useless. It can't answer the question "Why click that?" We can't (and aren't expected to) provide essay answers on a UI. We're expected to click and move on.

If we are forced to provide essay answers, then the UI could come closer to capturing knowledge. Imagine having a "Reason:" text box next to every clickable button.

We all know what the essay answers will look like. They'll look like bad comments in code. And bad commit comments in Git. And bad documentation.

Some Option: ☑️ Reason: Required
Other Option: ☐ Reason: Not sure if its needed

The problem with fancy UI's and low-code/no-code software is low-information/no-information software. Maintenance becomes difficult, perhaps impossible, because it's difficult to understand what's going on.

Tuesday, September 5, 2017

Seven Performance Tips

Packt (@PacktPub)
Want to improve your #Python performance? We've got 7 great tips for you: bit.ly/28YiGeE via @ggzes #CodingTips pic.twitter.com/cGhoGyTSS9

I have one thing to add: Learn to use the profiler and timeit. They will eliminate any hand-wringing over what might be better or worse. The policy is this: Code, Measure, and Choose.
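
Here's a small sketch of the Measure step using timeit from the standard library. The two candidate functions are made up for this example; the point is to let the numbers decide rather than guess.

```python
import timeit


def with_loop(n: int = 1000) -> list:
    """Candidate 1: build the list with an explicit loop."""
    result = []
    for i in range(n):
        result.append(i * 2)
    return result


def with_comprehension(n: int = 1000) -> list:
    """Candidate 2: build the same list with a comprehension."""
    return [i * 2 for i in range(n)]


# Run each candidate 200 times and report total elapsed seconds.
loop_time = timeit.timeit(with_loop, number=200)
comp_time = timeit.timeit(with_comprehension, number=200)
print(f"loop {loop_time:.4f}s  comprehension {comp_time:.4f}s")
```

For a whole script, python -m cProfile -s cumulative your_script.py (the script name is a placeholder) shows where the time actually goes.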

Tuesday, August 29, 2017

The Pipeline Question when Bashing the Bash

Background: https://medium.com/capital-one-developers/bashing-the-bash-replacing-shell-scripts-with-python-d8d201bc0989

And this
The answer to this is interesting because there are two kinds of parallelism. I like to call them architectural and incidental (or casual).  I'll look at architectural parallelism first, because it's what we often think about. Then the incidental parallelism, which I'm convinced is a blight.

Architectural Parallelism

The OS provides big-picture, architectural parallelism. This isn't -- necessarily -- a thing we want to push down into Python applications. There are some tradeoffs here.

One example of big architectural parallelism is a big map-reduce process where the mapping and reducing can (and should) proceed in parallel. There are some constraints around this, and we'll touch on them below.

Another common example is a cluster of microservices that are deployed on the same server. In many cases, each microservice decomposes into a cluster of processes that work in parallel and have a very, very long life. We might have an NGINX front-end for static content and a Python-based Flask back-end for dynamic content.  We might want the OS init process to start these, and we define them in init.d. In other cases, we allocate them to web-based servers where load-balancing handles the details of restarting.

In the map-reduce example, the shell's pipe makes sense. We can define it with a shell script like this: source | map | reduce.  It's hard to beat this for succinct clarity.

In the NGINX + Flask case, they may talk using a named pipe that outlives the two processes. Conceptually, they work as nginx | flask run.

In some cases, we have log analysis and alerting that are part of microservices management. We can pile this into the processing stream with a conceptual pipeline of nginx | flask run | log reduce | alert. The log reduce filters and reduces the log to find those events that require an alert. If any data makes it into the alert process, it sends the text for human intervention.

There are some distinguishing features.
  • They tend to be resource hogs. Either it's a big map-reduce processing request that uses a lot of CPU and memory resources. Or it's a long-running server.
  • The data being transported is bytes with a very inexpensive (almost free) serialization. When we think of map-reduce, these processes often work with text as input and output. There may be more complex data structures involved in the reduce, but the cost of serialization is an important concern. When we think of web requests, the request, response, and log pipeline is bytes more-or-less by definition. 
  • The parallelism is at the process level because each element does a lot of work and the isolation is beneficial.
  • They compute high-value results for actual users.
The OS does this. The complexity is that each OS does this differently. The Python subprocess module (and related projects outside the standard library) provide an elegant mapping into Python. 
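
As a sketch of that mapping, here's a source | map | reduce pipeline built with subprocess.Popen. The three one-line programs are hypothetical stand-ins I made up so the example is self-contained; real stages would be separate applications.

```python
import subprocess
import sys

# Hypothetical stand-ins for the source, map, and reduce programs.
SOURCE = "for i in range(5): print(i)"
MAP = "import sys\nfor line in sys.stdin: print(int(line) ** 2)"
REDUCE = "import sys\nprint(sum(int(line) for line in sys.stdin))"


def run_pipeline() -> str:
    """Run source | map | reduce as three OS processes joined by pipes."""
    source = subprocess.Popen(
        [sys.executable, "-c", SOURCE], stdout=subprocess.PIPE)
    mapper = subprocess.Popen(
        [sys.executable, "-c", MAP],
        stdin=source.stdout, stdout=subprocess.PIPE)
    source.stdout.close()  # Let source see a broken pipe if mapper exits early.
    reducer = subprocess.Popen(
        [sys.executable, "-c", REDUCE],
        stdin=mapper.stdout, stdout=subprocess.PIPE, text=True)
    mapper.stdout.close()
    output, _ = reducer.communicate()
    source.wait()
    mapper.wait()
    return output.strip()


total = run_pipeline()
print(total)
```

It's noticeably wordier than the shell's | operator, which is part of why the shell is hard to beat for the architectural case.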

It's not built into the language. I think that it's because details vary so widely by OS. I think trying to build this into the language leads to a bulky feature that's not widely-enough used.

Incidental Parallelism

This is -- to me -- a blight. Here's a typical kind of thing we see in the middle of a longer, more complex shell script.

data=`grep pattern file | cut args | sort | head`
# the interesting processing on $data

Computing a value that's assigned to data is a high-cost, low-value step. It creates an intermediate result that's only part of the shell script, and not really the final result. The parallelism feature of the shell's | operator isn't of any profound value since only a tiny bit of data is passed from step to step.

This can be rewritten into Python, but the resulting code won't be a one-liner. It will be longer. It will also be much, much faster. However, the speed difference is rarely relevant if this kind of processing step is inside a larger, iterative process.

A trivial rewrite of just one line of code misses the point. The goal is to refactor the script so that this line of code becomes a simple part of the processing and uses first-class Python data structures. The reason for doing cut and sort operations is generally because the data structure wasn't optimized for the job. A priority queue might have been a better choice, and would have amortized sorting properly and eliminated the need for separate cut and head operations.

This kind of computation can (and should) be done in a single process. The shell pipeline legacy implementation is little more than a short-hand for passing arguments and results among (simple) functions.

We can rewrite this as nested functions.

with Path(file).open() as source:
    data = head(sorted(cut_mapping(args, grep_filter(pattern, source))))

This will do the same thing. The gigantic benefit of this kind of rewrite is eliminating two kinds of overheads.
  • The fork/exec to spawn subprocesses. A single process will be faster.
  • The serialization and deserialization of intermediate results. Avoiding serialization will be faster.
When we rewrite bash to Python, we are able to leverage Python's data structures to write processing that is expressive, succinct, and efficient.

This kind of rewriting will also lead to refactoring the adjacent lines of the script -- the interesting processing -- into Python code also. This refactoring can lead to further simplifications and speedups.
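
Here's one way the nested functions might be filled in. The names grep_filter, cut_mapping, and head come from the snippet above, but these bodies, the comma delimiter, and the sample lines are my assumptions for illustration. The heapq.nsmallest() variant shows the priority-queue idea replacing sort | head.

```python
import heapq
import re
from itertools import islice
from typing import Iterable, Iterator, List


def grep_filter(pattern: str, lines: Iterable[str]) -> Iterator[str]:
    """Pass only the lines that match the pattern, like grep."""
    regex = re.compile(pattern)
    return (line for line in lines if regex.search(line))


def cut_mapping(field: int, lines: Iterable[str], delimiter: str = ",") -> Iterator[str]:
    """Pick one delimited field from each line, like cut."""
    return (line.rstrip("\n").split(delimiter)[field] for line in lines)


def head(items: Iterable[str], n: int = 10) -> List[str]:
    """Keep the first n items, like head."""
    return list(islice(items, n))


# Hypothetical input lines standing in for the file.
LINES = ["error,zeta\n", "info,alpha\n", "error,beta\n", "error,gamma\n"]

data_sorted = head(sorted(cut_mapping(1, grep_filter(r"^error", LINES))), 2)

# A heap keeps only the n smallest items; no full sort, no separate head.
data_heap = heapq.nsmallest(2, cut_mapping(1, grep_filter(r"^error", LINES)))
print(data_sorted, data_heap)
```

No subprocesses are spawned and nothing is serialized between steps; the generators hand Python objects directly to each other.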

The Two Cases

There seem to be two cases of parallelism:
  • Big and Architectural. There are many Python packages that provide these features. Look at plumbum, pipes, and joblib for examples. Since the OS implementation details vary so much, it's hard to imagine making this part of the language.
  • Small and Incidental.  The incidental parallelism is clever, but inefficient. In many cases, it doesn't seem to create significant value. It seems to be a kind of handy little workaround. It has costs that I find to outstrip the value. 
When replacing the bash with Python, some of the parallelism is architectural, and needs to be preserved. Careful engineering choices will be required. The rest is incidental and needs to be discarded.

Tuesday, August 8, 2017

Refusing to Code. Or. How to help the incurious?

The emphasis on code is important. Code defines the behavior of systems -- for the most part. Once upon a time, we used clever mechanical designs, or discrete electronic components. The Internet of Things idea exists because high-powered general-purpose CPUs are ubiquitous.

A DevOps mantra is "infrastructure as code". The entire deployment is automated, from the allocation of processors and storage down to pinging the health-check endpoint to be sure it's live. Blue-Green deployments, traffic switching, etc., and etc. These all require lots of code and as little manual intervention as possible.

The gold standard is to use tools to visualize state, make a decision, and use tools to take action. Lots of code.

When I meet the anti-code people, it's confusing.

Outside my narrow realm of tech, anti-code is fine. I have a sailboat; I meet lots of non-tech people who can't code, won't code, and aren't sure what code is.

But when I meet people who claim they want to be data science folks but refuse to code, I'm baffled.

Step 1 was to "learn more" about data science or something like that. I suggested some of the ML tutorials available for Python. Why? It appears that Scikit Learn is the gold standard for ML applications. http://scikit-learn.org/stable/tutorial/index.html

Because they didn't want to code, they insisted on doing things in Excel. Really.

Step 2 was to figure out some simulated annealing process -- in Excel. They had one of the central textbooks on ML algorithms. And they had a spreadsheet. They had some question that can only arise from avoiding open-source code. I suggested they use the open source code available to everyone. Or perhaps find a more modern tutorial like this: http://katrinaeg.com/simulated-annealing.html

Because they don't want to code, they used the fact that scipy.optimize.anneal() was deprecated to indict Python. I almost wish I'd saved all the emails over why basin hopping was unacceptable. The reasoning involved having an old textbook that covered annealing in depth, and not wanting to actually read the code for basin hopping. Or something. 

Step 3 was to grab a Kaggle problem and start working on it. This is too large for a spreadsheet. Indeed, the data sets push the envelope on what can be done on a Windows laptop because the dataframes tend to be quite large. It requires installing Scikit learn, which means installing Anaconda from Continuum. There's no reasonable alternative.

The Kaggle exercise may also involve buying a new laptop or renting time on a cloud-based server that's big enough to handle the data set. ML processing takes time, and GPU acceleration can be a huge help. All of this, however, presumes that there's code to run.

Because they don't want to code, this bled into an amazing number of unproductive directions.  There's some kind of classic "do everything except what you need to do" behavior. I'm sure it has a name. It's more than "work avoidance." It's a kind of active negation of the goals. It was impossible to discern what was actually going on or how I was supposed to help.

I suggested a Trello board. 

The Trello board devolved into dozens of individual lists, each list had one card. Seriously. The card/list thing became a way of avoiding progress. There were cards for considering the implications of installing Anaconda. The cards turned into hand-wringing discussions and weird status updates and memo-to-self notes, instead of actual actions.

Bottom line? 

No code. 

In the middle of the Kaggle something-or-other board, a card appeared asking for comments on some code. :yay2: Something I can actually help with.

The code was bad. And precious. I blogged about this phenomenon earlier. The code can't be changed because it was so hard to create. It was really bad, and riddled with bizarre things that make it look like they'd never seen code before.

Use pylint? This got a grudging kind of reluctant cleanup. But huge_variable_names_with_lots_of_useless_clauses aren't flagged by Pylint. They're still bad, and reading other code would show how atypical these names are. Unless, of course, you hate code; then reading code is not going to happen.

My new model for their behavior? They hate code. So when they do it, they do it badly. Intentionally badly. And because it was so painful, it's precious. (I'm probably wrong, and there's probably a lot more to this, but it seems to fit the observed behavior.)

It gets worse (or better, depending on your attitude.)

Another Trello card appears wondering what [a, b] * 2 or some such Pythonic thing might mean. Um. What?

It appears that they can't find the Standard Library description of the built-in data types and their operators. As if chapter four was deleted from their copy, or something.

The "can't find" seems unlikely. It's pretty prominent. I would think that anyone aspiring to learn Python would see the "keep this under your pillow" admonition on the standard library docs and perhaps glance through the first five sections to see what the fuss was about. Unless they hate code.

I'm left with "won't find." Perhaps they're refusing to use the documentation? Are they also refusing to use Python's internal help? It's not great, but you can try a bunch of things and get steered around from topic to topic; eventually, you have to find something useful.

Apply my new model: they hate code and Python help() is code.

Do they really hate code that much? I now think they do. I think they truly and deeply hate losing manual, personal, hands-on control over things. If it's not a spreadsheet -- where they typed each cell personally -- it's reviled. (Or feared? Let's not go too far here.)

Test the hypothesis. Ask if they used help().

Answer: Yes. They had tried three things (exactly three) and none of those three had a satisfactory explanation. The help() function did not work. Indeed, two of the things they tried had the same result, and the third reported a syntax error. So they stopped.

They tried three things and stopped.

Okay, then. They hate code. And -- Bonus! -- They refuse to explore. Somehow they're also able to insist they must learn to code. Will the self-beatings continue until the attitude improves?

It's difficult to offer meaningful help under these circumstances. I don't see the value in being someone's personal Google, since that only reinforces the two core refusals to use code or explore by typing code to see what happened.

I like to think that coding is a core life skill. Like cooking. You don't have to become a chef, but you have to know how to handle food. You don't have to create elaborate, scalable meshes of microservices. But you have to be able to find the data types and operators on your own.

And I don't know how to coach someone who is so incurious that three attempts with help() is the limit. Done at three. Count it as a failure and stop trying. "Try something different" seems vague, but it's all I've got. Anything more feels isomorphic to "Here's the link, attached is an audio file of me reading the words out loud for you." 

Other Entries Other Blogs


Plus, of course, lots of other stuff from lots of other folks. Enjoy.

Tuesday, August 1, 2017

JSON vs. XML: The battle for format supremacy may be wasted energy - SD Times


This article seems silly. Perhaps I missed something important.

I'm not sure who's still litigating JSON vs. XML, but it seems like it's more-or-less done.

XHTML/XML for HTML things.

JSON for everything else.

Maybe there are people still wringing their hands over this. AFAIK, the last folks using SOAP/XML services are commercial and governmental agencies where change tends to happen very slowly.

I remember when Sun Microsystems was a company and had the Java Composite Applications Suite. Very XML. That was -- perhaps -- ten years ago. Since then, I think the problem has been solved. I'm not sure who's battling for supremacy or why.

Tuesday, July 25, 2017

The "My Code Is Precious To Me" Conundrum

I suspect some people sweat so hard over each line of code that it becomes precious. Valuable. An investment wrung from their very soul. Or something.

When they ask for comments, it becomes difficult.

The Pull Request context can be challenging. There the code is, beaten into submission after Herculean toils, and -- well -- it's not really very good. The review isn't a pleasant validation with some suggested rewrites of the docstrings to remove dangling participles (up with which I will not put.) Perhaps the code makes a fundamentally flawed assumption and proceeds from there to create larger and larger problems until it's really too awful to save.

How do you break the news?

I get non-PR requests for code reviews once in a while. The sincere effort at self-improvement is worthy of praise. It's outside any formal PR process; outside formal project efforts. It's good to ask for help like that.

The code, on the other hand, has to go.

I'm lucky that the people I work with daily can create -- and discard -- a half-dozen working examples in the space of an hour.

I'm unlucky that people who ask for code review advice can't even think rationally about starting again with different assumptions. They'd rather argue about their assumptions than simply create new code based on different (often fewer) assumptions.

I've seen some simple unit conversion problems turned into horrible messes. The first such terrifying thing was a data query filter based on year-month with a rolling 13-month window. Somehow, this turned into dozens and dozens of lines of ineffective code, filled with wrong edge cases.

Similar things happen with hour-minute windows. Lots of wrong code. Muddled confusion. Herculean efforts doing the wrong thing. Herculean.

Both year-month and hour-minute problems are units conversion. Year-month is months in base 12. Hour-minute is minutes in base 60. Technically, they're mixed bases, simple polynomials in two terms. It's a multiply and an add. 12y+m, where 0 ≤ m < 12. Maybe an extra subtract 1 is involved.

The entire algorithm is a multiply and an add. There shouldn't be very many lines of code involved. In some cases, there's an additional conversion from integer minutes to float hours. Which is a multiply by a constant (1/60.) Or integer months to float years after an epochal year (another add with a negative number and multiply by 1/12.)
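
A minimal sketch of the idea, assuming months are numbered 1 through 12 (hence the extra subtract 1). The function names and the window convention (the last 13 months, ending month included) are my choices for this example.

```python
from typing import Tuple


def month_ordinal(year: int, month: int) -> int:
    """Convert year-month to a single count of months: 12y + (m - 1)."""
    return 12 * year + (month - 1)


def from_ordinal(ordinal: int) -> Tuple[int, int]:
    """Convert a month count back to (year, month)."""
    year, month_index = divmod(ordinal, 12)
    return year, month_index + 1


def in_rolling_window(year: int, month: int,
                      end_year: int, end_month: int, months: int = 13) -> bool:
    """True if year-month is within the `months` months ending at end_year-end_month."""
    end = month_ordinal(end_year, end_month)
    return end - months < month_ordinal(year, month) <= end


def minute_ordinal(hour: int, minute: int) -> int:
    """Convert hour-minute to a single count of minutes: 60h + m."""
    return 60 * hour + minute


print(in_rolling_window(2016, 7, 2017, 7))   # first month inside the window
print(in_rolling_window(2016, 6, 2017, 7))   # one month too early
print(minute_ordinal(1, 30) / 60)            # float hours
```

Once everything is a plain integer, the window test is an ordinary comparison, and the year-boundary edge cases disappear.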

I think it's common that ineffective code needs to be replaced. Maybe it's sad that it has to get replaced *after* being written? I don't think so. All code gets rewritten. Some just gets rewritten sooner.

I think that some people may need some life-coaching as well as code reviews.

Perhaps they should be encouraged to participate in a design walk-through before sweating their precious life's blood into code that doesn't solve the problem at hand.

Tuesday, July 18, 2017

Yet Another Python Problem List

This was a cool thing to see in my Twitter feed:

Dan Bader (@dbader_org)
"Why Python Is Not My Favorite Language" zenhack.net/2016/12/25/why…

More Problems with Python. Here's the short list.

1. Encapsulation (Inheritance, really.)
2. With Statement
3. Decorators
4. Duck Typing (and Documentation)
5. Types

I like these kinds of posts because they surface problems that are way, way out at the fringes of Python. What's important to me is that most of the language is fine, but the syntaxes for a few things are sometimes irksome. Also important to me is that it's almost never the deeper semantics; it seems to be entirely a matter of syntax.

The really big problem is people who take the presence of a list like this as a reason to dismiss Python in its entirety because they found a few blog posts identifying specific enhancements. That "Python must be bad because people are proposing improvements" attitude is maddening. And dismayingly common.

Even in a Python-heavy workplace, there are Java and Node.js people who have opinions shaped by little lists like these. The "semantic whitespace" argument coming from JavaScript people is ludicrous, but there they are: JavaScript has a murky relationship with semi-colons and they're complaining about whitespace. Minifying isn't a virtue. It's a hack. Really.

My point in general is not to say this list is wrong. It's to say that these points are minor. In many cases, I don't disagree that these can be seen as problems. But I don't think they're toweringly important.

1. The body of the first point seems to be more about inheritance and accidentally overriding something that shouldn't have been overridden. Java (and C++) folks like to use private for this. Python lets you read the source. I vote for reading the source.

2. Yep. There are other ways to do this. Clever approach. I still prefer with statements.

3. I'm not sold on the syntax change being super helpful.

4. People write bad documentation about their duck types. Good point. People need to be more clear.

5. Agree. A lot of projects need to implement type hints to make it more useful.

Tuesday, July 11, 2017

Extracting Data Subsets and Design By Composition

The request was murky. It evolved over time to this:
Create a function file_record_selection(train.csv, 2, 100, train_2_100.csv)
First parameter: input file name (train.csv)
Second parameter: first record to include (2)
Third parameter: last record to include (100)
Fourth parameter: output file name (train_2_100.csv)
Fundamentally, this is a bad way to think about things. I want to cover some superficial problems first, though.

First superficial dig. It evolved to this. In fairness to people without a technical background, getting to tight, implementable requirements is difficult. Sadly the first hand-waving garbage was from a DBA. It evolved to this. The early drafts made no sense.

Second superficial whining. The specification -- as written -- is extraordinarily shabby. This seems to be written by someone who's never read a function definition in the Python documentation before. Something I know is not the case. How can someone who is marginally able to code also be unable to write a description of a function? In this case, the "marginally able to code" may be a hint that some folks struggle with abstraction: the world is a lot of unique details; patterns don't emerge from related details.

Third. Starting from record 2 seems to show that they don't get the idea that indexes start with zero. They've seen Python. They've written code. They've posted code to the web for comments. And they are still baffled by the start value of indices.

Let's move on to the more interesting topic, functional composition. 

Functional Composition

The actual data file is a .GZ archive. So there's a tiny problem with looking at .CSV extracts from the gzip. Specifically, we're exploding a file all over the hard drive for no real benefit. It's often faster to read the zipped file: it may involve fewer physical I/O operations. The .GZ is small; the computation overhead to decompress may be less than the time waiting for I/O.

To get to functional composition we have to start by decomposing the problem. Then we can build the solution from the pieces. To do this, we'll borrow the interface segregation (ISP) design principle from OO programming.

Here's an application of ISP: Avoid Persistence. It's easier to add persistence than to remove it. This leads to peeling off three further tiers of file processing: Physical Format, Logical Layout, and Essential Entities.

We shouldn't write a .CSV file unless it's somehow required. For example, if there are multiple clients for a subset. In this case, the problem domain is exploratory data analysis (EDA) and saving .CSV subsets is unlikely to be helpful. The principle still applies: don't start with persistence in mind. What are the Essential Entities?

This leads away from trying to work with filenames, also. It's better to work with files. And we shouldn't work with file names as strings, we should use pathlib.Path. All consequences of peeling off layers from the interfaces.

Replacing names with files means the overall function is really this. A composition. 

file_record_selection = (lambda source, start, stop, target: 
    file_write(target, file_read_selection(source, start, stop)))

We applied the ISP again, to avoid opening a named .CSV file. We can work with open file-like objects instead of file names. This doesn't change the overall form of the functions, but it changes the types. Here are the two functions that are part of the composition:

import csv
import typing
from typing import *
Record = Any
def file_write(target: typing.TextIO, records: Iterable[Record]) -> None: ...
def file_read_selection(source: csv.DictReader, start: int, stop: int) -> Iterable[Record]: ...

We've left the record type unspecified, mostly because we don't know what it is just yet. The definition of Record reflects the Essential Entities, and we'll defer that decision until later. CSV readers can produce either dictionaries or lists, so it's not a complex decision; but we can defer it.

The .GZ processing defines the physical format. The content which was zipped was a .CSV file, which defines the logical layout.

Separating physical format, logical layout, and essential entity, gets us code like the following:

import csv
import gzip
with gzip.open('file.gz', 'rt') as source:  # 'rt': csv needs text, not bytes
    reader = csv.DictReader(source)  # Iterator[Record]
    for line in file_read_selection(reader, start, stop):
        print(line)

We've opened the .GZ for reading. Wrapped a CSV parser around that. Wrapped our selection filter around that. We didn't write the CSV output because -- actually -- that's not required. The core requirement was to examine the input.

We can, if we want, provide two variations of the file_write() function and use a composition like the file_record_selection() function with the write-to-a-file and print-to-the-console variants. Pragmatically, the print-to-the-console is all we really need.
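
Here's a minimal sketch of what those two variants might look like. The names file_write_console() and file_write_csv() are hypothetical, and the sketch assumes the records are the dictionaries produced by csv.DictReader. In-memory files stand in for real ones.

```python
import csv
import io
from typing import Dict, Iterable, TextIO

Record = Dict[str, str]  # assumption: rows come from csv.DictReader


def file_write_console(target: TextIO, records: Iterable[Record]) -> None:
    """Print each record, one per line, to the target stream."""
    for record in records:
        print(record, file=target)


def file_write_csv(target: TextIO, records: Iterable[Record]) -> None:
    """Write records as CSV, taking the header from the first record."""
    writer = None
    for record in records:
        if writer is None:
            writer = csv.DictWriter(target, fieldnames=list(record))
            writer.writeheader()
        writer.writerow(record)


# Usage: in-memory files stand in for the real source and target.
source = io.StringIO("id,value\n1,a\n2,b\n")
target = io.StringIO()
file_write_csv(target, csv.DictReader(source))
csv_text = target.getvalue()
```

Both variants have the same signature, so either one can be dropped into the composition without touching the reading side.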

The above example uses csv.DictReader, so the Record type can be formalized as Dict[Text, Text]. If we want to use csv.reader instead, then the Record type becomes List[Text].

Further Decomposition

There's a further level of decomposition: the essential design pattern is Pagination. In Python parlance, it's a slice operation. We could use itertools to replace the entirety of file_read_selection() with itertools.takewhile() and itertools.dropwhile(). The problem with these methods is they don't short-circuit: they read the entire file.

In this instance, it's helpful to have something like this for paginating an iterable with a start and stop value.

def file_read_selection(reader, start, stop):
    for n, r in enumerate(reader):
        if n < start: continue
        if n == stop: break
        yield r

This covers the bases with a short-circuit design that saves a little bit of time when looking at the first few records of a file. It's not great for looking at the last few records, however. Currently, the "tail" use case doesn't seem to be relevant. If it was, we might want to create an index of the line offsets to allow arbitrary access. Or use a simple buffer of the required size.
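
The "simple buffer" idea can be sketched with collections.deque, which discards the oldest item once it reaches maxlen. The tail() name is an assumption for illustration. It still has to read the whole iterable, but it only ever holds the last few items in memory.

```python
from collections import deque
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def tail(iterable: Iterable[T], size: int) -> List[T]:
    """Return the last `size` items; reads everything, keeps only `size`."""
    return list(deque(iterable, maxlen=size))


last_three = tail(range(10), 3)
```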

If we were really ambitious, we'd use the built-in slice class to make it easy to specify start, stop, and step values. This would allow us to pick every 8th item from the file without too much trouble.
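
One way this might look is to hand a slice object to itertools.islice(), which stops consuming the source once it reaches the stop value. This is a sketch, not the approach developed below; the select() name is hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def select(iterable: Iterable[T], rows: slice) -> Iterator[T]:
    """Apply a slice's start, stop, and step to any iterable."""
    return islice(iterable, rows.start, rows.stop, rows.step)


# Every 8th item among the first 40.
every_eighth = list(select(range(100), slice(0, 40, 8)))
```

Note that islice() counts the step from the start value. A test like n%step == 0 counts from the beginning of the file, so the two agree only when start is a multiple of step.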

The slice class doesn't, however, support selection of a randomized subset. What we really want is a paginator like this:

def paginator(iterable, start: int, stop: int, selection: Callable[[int], bool]):
    for n, r in enumerate(iterable):
        if n < start: continue
        if n == stop: break
        if selection(n): yield r

file_read_selection = lambda source, start, stop: paginator(source, start, stop, lambda n: True)

file_read_slice = lambda source, start, stop, step: paginator(source, start, stop, lambda n: n%step == 0)

The required file_read_selection() is built from smaller pieces. This function, in turn, is used to build file_record_selection() via functional composition. We can use this for randomized selection, also.
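
Here's a sketch of the randomized case. The file_read_sample() name, the fraction parameter, and the explicit random.Random instance are all assumptions added for illustration; paginator() is repeated so the example stands alone.

```python
import random
from typing import Callable


def paginator(iterable, start: int, stop: int, selection: Callable[[int], bool]):
    for n, r in enumerate(iterable):
        if n < start: continue
        if n == stop: break
        if selection(n): yield r


def file_read_sample(source, start: int, stop: int, fraction: float, rng: random.Random):
    """Keep each record in [start, stop) with probability `fraction`."""
    return paginator(source, start, stop, lambda n: rng.random() < fraction)


rng = random.Random(42)  # seeded so a run can be repeated
sample = list(file_read_sample(range(1000), 0, 1000, 0.1, rng))
```

Passing the generator in, rather than calling the module-level random.random(), keeps the function easy to test with a fixed seed.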

Here are functions with type hints instead of lambdas.

def file_read_selection(source: csv.DictReader, start: int, stop: int) -> Iterable[Record]:
    return paginator(source, start, stop, lambda n: True)

def file_read_slice(source: csv.DictReader, start: int, stop: int, step: int)  -> Iterable[Record]:
    return paginator(source, start, stop, lambda n: n%step == 0)

Specifying type for a generic iterable and the matching result iterable seems to require a type variable like this:

T = TypeVar('T')
def paginator(iterable: Iterable[T], ...) -> Iterable[T]:

This type hint suggests we can make wide reuse of this function. That's a pleasant side-effect of functional composition. Reuse can stem from stripping away the various interface details to decompose the problem to essential elements.
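
To make the reuse concrete, here's a sketch with the elided parameters filled in from the earlier definition. Nothing in it knows about files; the word list and the range are made-up inputs.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def paginator(
    iterable: Iterable[T],
    start: int,
    stop: int,
    selection: Callable[[int], bool],
) -> Iterable[T]:
    for n, r in enumerate(iterable):
        if n < start: continue
        if n == stop: break
        if selection(n): yield r


# The same function pages through strings or integers.
words: List[str] = ["alpha", "bravo", "charlie", "delta", "echo"]
page_of_words = list(paginator(words, 1, 4, lambda n: True))
every_other = list(paginator(range(10, 20), 0, 10, lambda n: n % 2 == 0))
```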


What's essential here is Design By Composition. And decomposition to make that possible.

We got there by stepping away from file names to file objects. We segregated Physical Format and Logical Layout, also. Each application of the Interface Segregation Principle leads to further decomposition. We unbundled the pagination from the file I/O. We have a number of smaller functions. The original feature is built from a composition of functions.

Each function can be comfortably tested as a separate unit. Each function can be reused.

Changing the features is a matter of changing the combination of functions. This can mean adding new functions and creating new combinations. 

Tuesday, July 4, 2017

Python and Performance

Real Question:

One of the standard problems that keeps coming up over and over is the parsing of url's. A sub-problem is the parsing of domain and sub-domains and getting a count.

For example

It would be nice to parse the received file and get counts like

.com had 15,323 count
.google.com had 62 count
.theatlantic.com had 33 count

The first code snippet would be in Python and the other code snippet would be in C/C++ to optimize for performance.


Yes. They did not even try to look in the standard library for urllib.parse. The general problem has already been solved; it can be exploited in a single line of code.

The line can be long-ish, so it can help to use a lambda to make it a little easier to read. The code is below.

The C/C++ point about "optimize for performance" bothers me to no end. Python isn't very slow. Optimization isn't required.

I made 16,000 URL's. These were not utterly random strings, they were random URL's using a pool of 100 distinct names. This provides some lumpiness to the data. Not real lumpiness where there's a long tail of 1-time-only names. But enough to exercise collections.Counter and urllib.parse.urlparse().
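
The generator for the test data isn't shown here. A sketch of one way to build that kind of data follows; make_urls(), the name pool, the list of top-level domains, and the shape of each URL are all assumptions.

```python
import random
from typing import List


def make_urls(count: int, pool_size: int = 100, seed: int = 42) -> List[str]:
    """Build `count` URLs whose hosts draw on a pool of `pool_size` names."""
    rng = random.Random(seed)
    names = [f"name{n}" for n in range(pool_size)]  # hypothetical name pool
    tlds = ["com", "org", "net"]  # hypothetical top-level domains

    def one_url() -> str:
        host = ".".join(["www", rng.choice(names), rng.choice(tlds)])
        return f"http://{host}/path/{rng.randint(1, 999)}"

    return [one_url() for _ in range(count)]


random_urls_16k = make_urls(16_000)
```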

Here's what I found. Time to parse 16,000 URLs and pluck out the last two levels of the name?

CPU times: user 154 ms, sys: 2.18 ms, total: 156 ms
Wall time: 157 ms


CPU times: user 295 ms, sys: 6.87 ms, total: 302 ms
Wall time: 318 ms

At that pace, why use C?

I suppose one could demand more speed just to demand more speed.

Here's some code that can be further optimized.

import urllib.parse
from collections import Counter
top = lambda netloc: '.'.join(netloc.split('.')[-2:])
random_counts = Counter(top(urllib.parse.urlparse(x).netloc) for x in random_urls_32k)

The slow part of this is the top() function. Using rsplit('.', maxsplit=2) might be better than split('.'). A smarter approach might be to find all the "." and slice the substring from the next-to-last one. Something like this, netloc[findall('.', netloc)[-2]:], assuming a findall() function that returns the locations of all '.' in a string.
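
Here's a sketch of the three variants side by side. The findall() helper is the assumed function described above, not something from the standard library. The slicing version adds one to the offset so its result has no leading "." and matches the other two, and it falls back to the whole name when there are fewer than two dots. None of these has been timed here.

```python
from typing import List


def top_split(netloc: str) -> str:
    """Baseline: split on every '.' and keep the last two pieces."""
    return ".".join(netloc.split(".")[-2:])


def top_rsplit(netloc: str) -> str:
    """Split from the right, at most twice."""
    return ".".join(netloc.rsplit(".", maxsplit=2)[-2:])


def findall(char: str, text: str) -> List[int]:
    """Hypothetical helper: offsets of every occurrence of char in text."""
    return [i for i, c in enumerate(text) if c == char]


def top_slice(netloc: str) -> str:
    """Slice from just after the next-to-last '.'."""
    dots = findall(".", netloc)
    if len(dots) < 2:
        return netloc
    return netloc[dots[-2] + 1:]
```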

Of course, if there is a problem, using a numpy structure might speed things up. Or use dask to farm the work out to multiple threads.

Tuesday, June 27, 2017

OOP and FP -- Objects vs. Functional -- Avoiding reductionist thinking

Real Quote (lightly edited to remove tangential nonsense.)
Recently, I watched a video and it stated that OO is about nouns and Functional programming is about the verbs. Also, ... Aspect-Oriented Programming with the e Verification Language  by David Robinson 
It would be nice to have a blog post which summarized the various mindset associated w/ the various paradigms.

I find the word "mindset" to be challenging.

Yes. All Turing Complete programming languages do have a kind of fundamental equivalence at the level of computing stuff represented as numbers. This, however, seems reductionist.

["All languages were one language to him. All languages were 'woddly'." Paraphrased from James Thurber's "The Great Quillow", a must-read.]

So. Beyond the core Turing Completeness features of a language, the rest is reduced to a difference in "mindset"? The only difference is how we pronounce "Woddly?"

"Mindset" feels reductionist. It replaces a useful summary of language features with a dismissive "mindset" categorization of languages. In a way, this seems to result from bracketing technology choices as "religious wars," where the passion for a particular choice outweighs the actual relevance; i.e., "All languages have problems, so use Java."

In my day job, I work with three kinds of Python problems:
  • Data Science
  • API Services
  • DevOps/TechOps Automation
In many cases, one person can have all three problems. These aren't groups of people. These are problem domains.

I think the first step is to replace "mindset" with "problem domain". It's a start, but I'm not sure it's that simple.

When someone has a data science problem, they often solve it with purely functional features of Python. Generally they do this via numpy, but I've been providing examples of generator expressions and comprehensions in my Code Dojo webinars. Generator expressions are an elegant, functional approach to working with stateless data objects.

In Python 3, the following kind of code doesn't create gigantic intermediate data structures. The functional transformations are applied to each item generated by the "source".

x = map(transform, source)
y = filter(selector_rule, x)
z = Counter(y)
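
To make that concrete, here's a small runnable version. The particular transform(), selector_rule(), and source data are made up for illustration.

```python
from collections import Counter

# An iterator, so each item can be consumed only once.
source = iter(["  Apple", "banana ", "", "apple", "Cherry", "banana", "  "])


def transform(text: str) -> str:
    """Normalize whitespace and case."""
    return text.strip().lower()


def selector_rule(text: str) -> bool:
    """Keep only non-empty strings."""
    return bool(text)


x = map(transform, source)    # lazy: nothing has been read yet
y = filter(selector_rule, x)  # still lazy
z = Counter(y)                # pulls items through the pipeline one at a time
```

Neither x nor y is a list. The only structure that grows is the Counter at the end.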

I prefer to suggest that a fair amount of data analysis involves little or no mutation of state. Functional features of a language seem to work well with immutable data.

There is state change, but it's at a macro level. For example, the persistence of capturing data is a large-scale state change that's often implemented at the OS level, not the language level.

When someone's building an API, on the other hand, they're often working with objects that have a mutating state. Some elements of an API will involve state change, and objects model state change elegantly. RESTful API's can deserialize objects from storage, make changes, and serialize the modified object again.

[This summary of RESTful services is also reductionist, and therefore, possibly unhelpful.]

When there's mutability, then objects might be more appropriate than functions.

I'm reluctant to call this "mindset." It may not be "problem domain." It seems to be a model that involves mutable or immutable state.

When someone's automating their processing, they're wrestling with OS features, and OS management of state change. They might be installing stuff, or building Docker images, or gluing together items in a CI/CD pipeline, setting the SSL keys, or figuring out how to capture Behave output as part of Gherkin acceptance testing. Lots of interesting stuff that isn't the essential problem at hand, but is part of building a useful, automated solution to the problem.

The state in these problems is maintained by the OS. Application code may -- or may not -- try to model that state.

When doing Blue/Green deployments, for example, the blueness and greenness isn't part of the server farm, it's part of an internal model of how the servers are being used. This seems to be stateful; object-oriented programming might be helpful. When the information can be gleaned from asset management tools, then perhaps a functional processing stream is more important for gathering, deciding, and taking action.

I'm leaning toward the second view-point, and suggesting that some of the OO DevOps programming might be better looked at as functional map-filter-reduce processing. Something like

action_to_take = some_decision_pipeline(current_state, new_deployment)

This reflects the questions of state change. It may not be the right abstraction though, because carrying out the action is, itself, a difficult problem that involves determining the state of the server farm, and then applying some change to one or more servers.
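
As a very rough sketch of the functional view, here's what such a pipeline might look like. Everything in it is hypothetical: the Server tuple, the "deploy" and "repair" actions, and the rule that only green servers get touched. It decides what to do; it doesn't do it.

```python
from typing import Dict, NamedTuple, Tuple


class Server(NamedTuple):
    """Hypothetical snapshot of one server, e.g. from an asset tool."""
    name: str
    color: str     # "blue" or "green"
    version: str
    healthy: bool


def some_decision_pipeline(
    current_state: Tuple[Server, ...], new_deployment: str
) -> Dict[str, str]:
    """Map each out-of-date green server to an action; mutates nothing."""
    green = filter(lambda s: s.color == "green", current_state)
    stale = filter(lambda s: s.version != new_deployment, green)
    return {s.name: ("deploy" if s.healthy else "repair") for s in stale}


current_state = (
    Server("a", "blue", "1.0", True),
    Server("b", "green", "1.0", True),
    Server("c", "green", "1.0", False),
    Server("d", "green", "2.0", True),
)
action_to_take = some_decision_pipeline(current_state, "2.0")
```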

We often think of server state change as imperative in nature. It feels like object-oriented programming. There are steps, the object models those steps. I'm not sure that's right. I think there's a repeated "determine next action" cycle here. Sometimes it involves waiting for an action to finish. Yes, it sounds like polling the server farm. I'm not sure that's wrong. How else do you know a server has crashed except by polling it?

I think we've moved a long way from "mindset."

I think it's about fitting language features to a problem in a way that creates the right abstraction to capture (and eventually solve) the problem.

I haven't mentioned Aspect-Oriented Programming because it seems to cut across the functional/object state management boundary. It's a distinctive approach to organizing reusable functionality. I don't mean to dismiss it as uninteresting. I mean to set it aside as orthogonal to the "mutable state" consideration that seems to be one of the central differences between OOP and FP.

In response to the request: "No. I won't map mindset to paradigm."