Read Text File In C# For Mac

In C, fgets() stops reading at a newline or at end of file, so checking for either lets you read a file one whole line at a time. When you are done, close the file with fclose().
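Since the rest of this page deals with C#, here is a minimal C# sketch of the same read-line-by-line pattern. It uses `StreamReader.ReadLine()`, which returns `null` at end of file, in place of fgets(), and a `using` block in place of fclose(). The class and method names (`LineReader`, `ReadAllLines`) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LineReader
{
    // Reads a text file one line at a time and returns the lines in order.
    public static List<string> ReadAllLines(string path)
    {
        var lines = new List<string>();

        // The using block closes the file when the reader is disposed.
        using (var reader = new StreamReader(path))
        {
            string line;
            // ReadLine() strips the line terminator and returns null at end of file.
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }

        return lines;
    }
}
```

To print the lines to the screen, loop over the returned list and pass each entry to `Console.WriteLine`.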


Pretty simply, I need to rip text out of multiple PDFs (quite a lot actually) in order to analyse the contents before sticking it in an SQL database.

I've found some pretty sketchy free C# libraries that sort of work (the best one uses iTextSharp), but there are umpteen formatting errors, some characters are scrambled, and a lot of the time there are spaces (' ') EVERYWHERE: inside words, between every letter, huge blocks of them taking up several lines. It all seems a bit random.

Is there any easy way of doing this that I'm completely overlooking (quite likely!) or is it a bit of an arduous task that involves converting the extracted byte values into letters reliably?

Bjarki Heiðar
Duncan Tait

Closed as too broad by Bhargav Rao, Apr 23 at 11:21

6 Answers

There may be some difficulty in doing this reliably. The problem is that PDF is a presentation format which attaches importance to good typography. Suppose you just wanted to output a single word: Tap.

A PDF rendering engine might output this as 2 separate calls, as shown in this pseudo-code:
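Something along these lines, where `moveTo` and `output` are illustrative names rather than actual PDF operators, and `x1`, `x2` and `y` are placeholder coordinates:

```
moveTo(x1, y)    // position the pen for the first fragment
output("T")
moveTo(x2, y)    // x2 is nudged slightly to tighten or loosen the spacing
output("ap")
```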

This would be done because the default kerning (inter-letter spacing) between the letters T and a might not be acceptable to the rendering engine, or it might be adding or removing some micro space between characters to get a fully justified line. What this finally results in is that the actual text fragments found in PDF are very often not full words, but pieces of them.
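One way an extractor might try to undo this is to look at the horizontal gap between neighbouring fragments and only insert a space when the gap is wide enough. The sketch below is not how any particular library does it; `TextFragment`, `FragmentJoiner` and the `spaceThreshold` parameter are all hypothetical, and it assumes the fragments of a single line are already available with a position and width.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical shape of a positioned text fragment; real PDF libraries
// expose their own types for this.
public sealed class TextFragment
{
    public TextFragment(double x, double width, string text)
    {
        X = x;
        Width = width;
        Text = text;
    }

    public double X { get; }      // left edge of the fragment
    public double Width { get; }  // rendered width of the fragment
    public string Text { get; }
}

public static class FragmentJoiner
{
    // Joins the fragments of one line from left to right, inserting a space
    // only when the gap between neighbours exceeds spaceThreshold.
    public static string JoinLine(IEnumerable<TextFragment> fragments, double spaceThreshold)
    {
        var result = new StringBuilder();
        TextFragment previous = null;

        foreach (var fragment in fragments.OrderBy(f => f.X))
        {
            if (previous != null)
            {
                // Distance from the right edge of the previous fragment
                // to the left edge of this one (negative when kerned together).
                double gap = fragment.X - (previous.X + previous.Width);
                if (gap > spaceThreshold)
                {
                    result.Append(' ');
                }
            }

            result.Append(fragment.Text);
            previous = fragment;
        }

        return result.ToString();
    }
}
```

For example, with fragments "T" at x = 0 (width 6), "ap" at x = 5.5 (width 11) and "now" at x = 20 (width 12), a threshold of 1.0 joins the first two into "Tap" and puts a space before "now". Picking a good threshold is the hard part: it depends on the font size, and a value that is too small reproduces the stray spaces described in the question.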

Tarydon

Take a look at Tika on DotNet, available through NuGet: https://www.nuget.org/packages/TikaOnDotnet.TextExtractor/

This is a wrapper around the extremely good Tika Java library, using IKVM. It is very easy to use and handles a wide variety of file types other than PDF, including old and new Office formats. It auto-selects the parser based on the file extension, so extraction takes very little code.

Update: One caution with this solution is that development on IKVM has ended. I'm not sure what this will mean in the long run. http://weblog.ikvm.net/2017/04/21/TheEndOfIKVMNET.aspx

David Hammond

You can try Toxy, a text/data extraction framework for .NET. PDF will be supported in Toxy 1.0. For details, please visit http://toxy.codeplex.com

Tony Qu

If you are processing PDF files in order to import data into a database, I suggest considering ByteScout PDF Extractor SDK. Some of the useful functions it includes are:

  • table detection;
  • text extraction as CSV, XML or formatted text (with the optional layout restoration);
  • text search with support for regular expressions;
  • low-level API to access text objects

DISCLAIMER: I'm affiliated with ByteScout

Eugene

You can try the Docotic.Pdf library (disclaimer: I work for Bit Miracle) to extract text from PDF files. The library uses heuristics to extract clean text without unwanted spaces between the letters of a word.


Please take a look at a sample that shows how to extract text from PDF.

Bobrovsky

If you're looking for a 'free' alternative, check out PDF Clown. I have personally used an IFilter-based approach, which seems to work fine and also makes it easy to support other file types. Sample code here.

Jussi Palo
